Thanks! We will need some state management then. How is it possible that this rarely happened before, and now, for the past week, it happens multiple times in one task/run? Restarting Docker/the scraper a couple of times while processing 100 requests is more expensive, no?
The longer the run, the higher the chance of a migration.
Crawlers already have state management implemented via the request queue.
It should automatically continue when it is migrated? If so, I need to check some things because this doesn't work on our side.
It depends on your code, but if you have a common use case, for example a CheerioCrawler with the default request queue, then yes. After the migration it should continue with only the unhandled requests.
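For reference, a minimal sketch of that default setup (assuming a CheerioCrawler with the default request queue; the start URL is hypothetical). Crawlee persists the default queue automatically, so after a migration the run resumes with only the requests that were not handled yet:

import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

const crawler = new CheerioCrawler({
  async requestHandler({ request, $, enqueueLinks }) {
    // Save the result and enqueue follow-up links into the default, persisted queue.
    await Actor.pushData({ url: request.url, title: $('title').text() });
    await enqueueLinks();
  },
});

await crawler.run(['https://example.com']);

await Actor.exit();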
Thanks! I recently changed from using the request queue to context.crawler.addRequests(). Is it possible this has anything to do with it? This started happening around the same time.
This is the log after the migration happened, and it just keeps outputting the same lines:
2023-05-11T09:18:43.720Z INFO PuppeteerCrawler: Status message: Crawled 64/undefined pages, 0 errors.
2023-05-11T09:18:43.811Z INFO PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":1,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.019},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0.181},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-11T09:19:43.717Z INFO Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":106202,"requestsFinishedPerMinute":1,"requestsFailedPerMinute":0,"requestTotalDurationMillis":7115528,"requestsTotal":67,"crawlerRuntimeMillis":7394128,"retryHistogram
I will try to use the normal request queue again and see if there is some improvement.
This should be the same functionality; crawler.addRequests() adds requests to the default request queue of the run.
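A rough sketch of that equivalence (the URLs are hypothetical); both calls end up in the same default request queue, which is persisted across migrations:

import { Actor } from 'apify';

// Explicit default queue:
const requestQueue = await Actor.openRequestQueue();
await requestQueue.addRequest({ url: 'https://example.com' });

// Inside a request handler, this goes to the same default queue:
// await context.crawler.addRequests(['https://example.com/next']);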
Your code has no impact on the migrations; it was probably just a performance issue on the Apify platform that made them happen so often. The goal is of course to reduce these occurrences, but they will always happen once in a while as Actor jobs are moved across servers.
I'm getting this a lot these days and it kills my Scrapy actors. In the middle of a long and resource-demanding run, I get
...
2024-11-21T09:39:16.476Z [scrapy.extensions.logstats] INFO Crawled 1015 pages (at 10 pages/min), scraped 645 items (at 4 items/min) ({"spider": "<Spider 'xyz' at 0x7f940ecefb60>"})
2024-11-21T09:39:25.684Z ACTOR: Notifying Actor process about imminent migration to another host.
2024-11-21T09:39:40.956Z ACTOR: Run was migrated to a new host.
...
and then it just starts over! That's a bit... infuriating. All is lost, the scraper starts from scratch, drains more resources, and then, obviously, times out.
I'm not sure what I can do about this. Is it an issue with the Scrapy integration? Although I have some custom code, I don't touch anything near the request queues; that's all handled by the official Apify/Scrapy integration. @Vlada Dusek
I don't have "Restart on error" checked.
If you use one of the Crawlee crawlers, then migration problems should already be solved in the library.
If not, you need to take care of that in your own code and persist the state so the crawler can resume from the point before the restart.
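As a rough illustration of that second case, a minimal sketch of persisting custom state with the Apify SDK (the 'STATE' key and the counter are hypothetical); the 'persistState' event fires periodically and before a migration, and the value is read back on the next start:

import { Actor } from 'apify';

await Actor.init();

// Restore previous state after a restart, or start fresh.
const state = (await Actor.getValue('STATE')) ?? { handledUrls: [] };

// Persist the state periodically and right before a migration.
Actor.on('persistState', async () => {
  await Actor.setValue('STATE', state);
});

// ...crawl and update `state.handledUrls` as requests are finished...

await Actor.exit();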
But I'm using the official Scrapy-Apify integration. Handling this feels like something that should happen below the level of my application code. In Scrapy I don't get to touch the request queue at all.
This is very platform-specific, and as I understand the integration, its purpose is that I write Scrapy code, business as usual, and the integration takes care of the platform specifics. E.g. it automatically applies the proxy settings, implements the request queue, etc. It makes no sense if I'm supposed to solve such low-level stuff myself.
Hi there,
the migration process is working fine.
But after the restart the crawler reboots and the request queue is filled with the same URLs again, since it adds the URLs to the request queue on every start.
Right now I am doing the following to clear the request queue:
Actor.on("migrating", async () => {
await requestQueue.drop();
});
This prevents me from getting duplicated data in the storage, as I clear and reopen the storage before every crawl. So after the migration I also start fresh on every crawl.
However, it doesn't feel like the best solution to me. The best would be for the request queue and the storage to keep their state after the migration. What approach can I take here?
@Honza Javorek I am not using Python, but it looks like the official Scrapy-Apify integration just lets you run the Scrapy project on the platform and nothing more, so no state persistence. In that case you need to take care of that on your own.
@Pombaer The RequestQueue is automatically deduplicating by default; you cannot add the same URL twice, and it of course keeps its state during the migration. So the already handled requests are not processed again.
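A small sketch of that behavior (the URL is hypothetical): re-adding the same request after a restart is a no-op, and the returned operation info tells you whether it was already present or already handled:

import { Actor } from 'apify';

await Actor.init();

const requestQueue = await Actor.openRequestQueue();
const info = await requestQueue.addRequest({ url: 'https://example.com' });
console.log(info.wasAlreadyPresent, info.wasAlreadyHandled); // both true after a migration

await Actor.exit();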