Thanks! We will need some state management then. How is it possible that this rarely happened before, and now, for the past week, it happens multiple times in one task/run? Restarting Docker/the scraper a couple of times while processing 100 requests is more expensive, no?
The longer the run, the higher the chance of a migration.
Crawlers already have state management implemented via the request queue.
It should automatically continue when it is migrated? If so, I need to check some things because this doesn't work on our side.
It depends on your code, but if you have a common use case, for example a CheerioCrawler with the default request queue, then yes. After the migration it should continue with only the unhandled requests.
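For reference, a minimal sketch of that default setup (assuming a CheerioCrawler with the default request queue; the start URL is hypothetical). Crawlee persists the default queue automatically, so after a migration the run resumes with only the requests that were not handled yet:

import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

const crawler = new CheerioCrawler({
  async requestHandler({ request, $, enqueueLinks }) {
    // Save the result and enqueue follow-up links into the default, persisted queue.
    await Actor.pushData({ url: request.url, title: $('title').text() });
    await enqueueLinks();
  },
});

await crawler.run(['https://example.com']);

await Actor.exit();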
Thanks! I recently changed from using the request queue to context.crawler.addRequests(). Is it possible this has anything to do with it? This started happening around the same time.
This is the log after the migration happened, and it just keeps outputting the same lines:
2023-05-11T09:18:43.720Z INFO PuppeteerCrawler: Status message: Crawled 64/undefined pages, 0 errors.
2023-05-11T09:18:43.811Z INFO PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":1,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.019},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0.181},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-11T09:19:43.717Z INFO Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":106202,"requestsFinishedPerMinute":1,"requestsFailedPerMinute":0,"requestTotalDurationMillis":7115528,"requestsTotal":67,"crawlerRuntimeMillis":7394128,"retryHistogram
I will try to use the normal request queue again and see if there is some improvement.
This should be the same functionality; crawler.addRequests() adds requests to the default request queue of the run.
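A rough sketch of that equivalence (the URLs are hypothetical); both calls end up in the same default request queue, which is persisted across migrations:

import { Actor } from 'apify';

// Explicit default queue:
const requestQueue = await Actor.openRequestQueue();
await requestQueue.addRequest({ url: 'https://example.com' });

// Inside a request handler, this goes to the same default queue:
// await context.crawler.addRequests(['https://example.com/next']);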
Your code has no impact on the migrations; it was probably just a performance issue on the Apify platform that made them happen so often. The goal is of course to reduce these occurrences, but they will always happen once in a while as Actor jobs are moved across servers.
I'm getting this a lot these days and it kills my Scrapy actors. In the middle of a long and resource-demanding run, I get
...
2024-11-21T09:39:16.476Z [scrapy.extensions.logstats] INFO Crawled 1015 pages (at 10 pages/min), scraped 645 items (at 4 items/min) ({"spider": "<Spider 'xyz' at 0x7f940ecefb60>"})
2024-11-21T09:39:25.684Z ACTOR: Notifying Actor process about imminent migration to another host.
2024-11-21T09:39:40.956Z ACTOR: Run was migrated to a new host.
...
and then it just starts over! That's a bit... infuriating. All is lost, the scraper starts from scratch, drains more resources, and then, obviously, times out.
I'm not sure what I can do about this. Is it an issue with the Scrapy integration? Although I have some custom code, I don't touch anything near the request queues; that's all handled by the official Apify/Scrapy integration. @Vlada Dusek
I don't have "Restart on error" checked.
If you use one of the Crawlee crawlers, then migration problems should already be solved in the library.
If not, you need to take care of that in your own code and persist the state so the crawler can resume from the point before the restart.
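As a rough illustration of that second case, a minimal sketch of persisting custom state with the Apify SDK (the 'STATE' key and the counter are hypothetical); the 'persistState' event fires periodically and before a migration, and the value is read back on the next start:

import { Actor } from 'apify';

await Actor.init();

// Restore previous state after a restart, or start fresh.
const state = (await Actor.getValue('STATE')) ?? { handledUrls: [] };

// Persist the state periodically and right before a migration.
Actor.on('persistState', async () => {
  await Actor.setValue('STATE', state);
});

// ...crawl and update `state.handledUrls` as requests are finished...

await Actor.exit();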
But I'm using the official Scrapy-Apify integration. Handling this feels like something that should happen below the level of my application code. In Scrapy I don't get to touch the request queue at all.
This is very platform-specific, and as I understand the integration, its purpose is that I write Scrapy code, business as usual, and the integration takes care of the platform specifics. E.g. it automatically applies the proxy settings, implements the request queue, etc. It makes no sense if I'm supposed to solve such low-level stuff myself.
Hi there,
the migration process is working fine.
But after the restart the crawler reboots and the request queue is filled with the same URLs again, since it adds the URLs to the request queue on every start.
Right now I am doing the following to clear the request queue:
Actor.on("migrating", async () => {
await requestQueue.drop();
});
This prevents me from getting duplicated data in the storage, as I clear and reopen the storage before every crawl. So after the migration I also start fresh on every crawl.
However, it doesn't feel like the best solution to me. The best would be for the request queue and the storage to keep their state after the migration. What approach can I take here?
@Honza Javorek I am not using Python, but it looks like the official Scrapy-Apify integration just lets you run the Scrapy project on the platform and nothing more, so no state persistence. In that case you need to take care of that on your own.
@Pombaer The RequestQueue is automatically deduplicating by default; you cannot add the same URL twice, and it of course keeps its state during the migration. So the already handled requests are not processed again.
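A small sketch of that behavior (the URL is hypothetical): re-adding the same request after a restart is a no-op, and the returned operation info tells you whether it was already present or already handled:

import { Actor } from 'apify';

await Actor.init();

const requestQueue = await Actor.openRequestQueue();
const info = await requestQueue.addRequest({ url: 'https://example.com' });
console.log(info.wasAlreadyPresent, info.wasAlreadyHandled); // both true after a migration

await Actor.exit();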