Apify Discord Mirror

Updated 3 weeks ago

Website Content Crawler Actor - Get access to failed urls

Hi,
I am using the Actor Website Content Crawler (apify/website-content-crawler) to scrape a few thousand URLs. This is a predetermined list of URLs, so the depth is set to 0. I saw a few of them fail. Is there any way to get access to these failed URLs from either the Apify site UI or the integration? The Dataset under Storage only contains the successful URLs.
4 comments
Hello! Unfortunately, I can't find any option for that. I see that an issue has already been opened for the Actor, which is what I was going to suggest. In the meantime, if you have a fixed list of URLs, you could compare it to the Actor's output, but whether that is acceptable depends on your use case.
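The comparison mentioned above can be sketched roughly as follows. This is a minimal sketch, not the Actor's own functionality: it assumes each dataset item carries the crawled address in a `url` field, and that the items have already been exported or downloaded as a list of dicts. The helper names `normalize` and `find_failed_urls` are hypothetical. Redirected pages may appear in the output under a different URL than the one requested, in which case this simple matching would report them as failed.

```python
from urllib.parse import urldefrag


def normalize(url: str) -> str:
    """Strip whitespace, the #fragment, and a trailing slash so trivially different URLs match."""
    without_fragment, _fragment = urldefrag(url.strip())
    return without_fragment.rstrip("/")


def find_failed_urls(input_urls, dataset_items):
    """Return the input URLs that have no matching item in the dataset.

    Assumption: every successful item is a dict with a "url" key.
    """
    succeeded = {normalize(item["url"]) for item in dataset_items if item.get("url")}
    return [url for url in input_urls if normalize(url) not in succeeded]


if __name__ == "__main__":
    # Illustrative data only; replace with the real input list and exported dataset items.
    requested = [
        "https://example.com/a",
        "https://example.com/b/",
        "https://example.com/c",
    ]
    items = [
        {"url": "https://example.com/a/"},
        {"url": "https://example.com/b#top"},
    ]
    for url in find_failed_urls(requested, items):
        print(url)
```

The resulting list can be fed back in as the start URLs of a follow-up run.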
Hey @Marco, yeah, we've written a local script to do the comparison and extract the failed URLs. I was hoping that if that data were released in the Dataset, we could integrate it easily with Google Sheets, so we could just copy those and rerun the failed ones. This would allow us to run multiple instances of the Actor without having to track and run the failed-URL script for each set with its respective input URL list.
If this were being done for crawler-based scraping instead of a fixed list, it would be a much bigger challenge.
I got an email suggesting another Actor to retry failed URLs, but that's not the only way we intend to use it. It would be a minor but impactful change to have it as part of the Actor itself.
There's also another issue of the actor starting off with less RAM usage but maxing out (16 GB) after a few hours of runtime running up our bills.
Hi @Vipul ,
You can describe your use case in the Actor's Issues tab and ask for such a feature to be implemented (possibly as an extra option in the Input).

"There's also another issue of the actor starting off with less RAM usage but maxing out (16 GB) after a few hours of runtime running up our bills."
If you start the Actor on the Apify Platform with a certain amount of RAM, the consumption for the run should always be fixed, no matter how much RAM is actually in use at any given moment.
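As a rough illustration of why the consumption does not depend on actual usage: assuming the platform's compute-unit definition of 1 CU = 1 GB of allocated memory for 1 hour (please verify against the current pricing docs), the cost is driven by the memory the run was started with. The function name `compute_units` is hypothetical.

```python
def compute_units(allocated_memory_gb: float, runtime_hours: float) -> float:
    """Estimate compute units, assuming 1 CU = 1 GB of allocated memory for 1 hour."""
    return allocated_memory_gb * runtime_hours


if __name__ == "__main__":
    # A run started with 16 GB that lasts 3 hours, regardless of how much RAM it actually uses.
    print(compute_units(16, 3))
```

Under that assumption, lowering the memory setting in the run options is what reduces consumption, not how much RAM the process actually uses.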

If you have a specific Run-related problem (for example, the memory usage keeps increasing and the Run fails with OutOfMemory), please raise an issue in the Actor's Issues tab.