Apify and Crawlee Official Forum

strajk
Joined August 30, 2024
The following snippet works well for me, but it smells... does somebody have a cleaner approach?

JavaScript
// Every 3s, check the balance of finished (=success) and failed requests and stop the process if it's too bad
setInterval(() => {
  const { requestsFinished, requestsFailed } = crawler.stats.state
  if (requestsFailed > requestsFinished + 10) { // when failures exceed successes by more than 10, stop trying bro
    console.warn(`πŸ’£ Too many failed requests, stopping! (${requestsFailed} failed, ${requestsFinished} finished)`)
    process.exit(1)
  }
}, 3000)
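
One possibly cleaner direction, as a non-authoritative sketch: instead of polling with setInterval and hard-killing the process, count failures in Crawlee's failedRequestHandler (which runs after a request has exhausted its retries) and stop gracefully via crawler.autoscaledPool.abort(), so in-flight requests and state get persisted. The threshold and wiring below are illustrative, not an official pattern:

JavaScript
import { CheerioCrawler } from 'crawlee'

const crawler = new CheerioCrawler({
  async requestHandler(ctx) { /* ... */ },
  // Runs once a request has exhausted all of its retries
  async failedRequestHandler() {
    const { requestsFinished, requestsFailed } = crawler.stats.state
    if (requestsFailed > requestsFinished + 10) {
      console.warn(`πŸ’£ Too many failed requests, stopping! (${requestsFailed} failed, ${requestsFinished} finished)`)
      // Graceful stop, unlike process.exit(1)
      await crawler.autoscaledPool?.abort()
    }
  },
})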
3 comments
My actor runs typically use Cheerio, take <20 min, and have around 1k requests.
For this scenario, the costs for RequestQueue writes/reads are often higher than the compute units. I wanted to experiment with using in-memory storage to optimize costs (I think I understand the associated risks and I'm OK with them).
I've tried setting storage: new MemoryStorage() in Actor.main's second argument as noted in the docs & TS definitions, but actor runs on the platform still seem to use the "platform RQ", not the in-memory one. Any pointers?

https://console.apify.com/actors/64sLcqgxq4IB5hZrI/runs/QIkqpBDa846Ftn5xK#storage
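
For reference, a minimal sketch of the setup being described, assuming the storage option on Actor.main's second argument (as the post says the docs & TS definitions note) and the MemoryStorage class from @crawlee/memory-storage:

JavaScript
import { Actor } from 'apify'
import { MemoryStorage } from '@crawlee/memory-storage'

await Actor.main(async () => {
  // ... crawler code, ~1k requests through a CheerioCrawler ...
}, {
  // Intent: keep the RequestQueue in memory instead of the platform RQ,
  // trading durability for lower storage read/write costs
  storage: new MemoryStorage(),
})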
3 comments
Use case: When using the Cheerio, JSDOM, or LinkeDOM crawlers and their routers, I often wanna have requests automatically fetched + parsed for all the route handlers except one.
ATM I have to remember to specify skipNavigation at every point of adding the request to the request queue (IIUC).

Just food for thought, not urgent πŸ™‚
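
Until something like a per-route default exists, one possible workaround sketch: tag skipNavigation at a single enqueue chokepoint instead of remembering it everywhere. The helper name and labels here are hypothetical:

JavaScript
// Hypothetical: only the ANALYTICS route skips the fetch + parse step
const SKIPPED_LABEL = 'ANALYTICS'

const tagSkipNavigation = (requests) =>
  requests.map((r) => ({ ...r, skipNavigation: r.label === SKIPPED_LABEL }))

await crawler.addRequests(tagSkipNavigation([
  { url: 'https://example.com/page/1', label: 'PAGE' }, // fetched + parsed
  { url: 'https://example.com/track/1', label: SKIPPED_LABEL }, // handler runs without a request
]))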
3 comments
Use case: I'm debugging a crawler. The majority of request handlers succeed, only a few fail. I wanna fix/adjust the request handler logic: get the failed URLs from the logs, open some SQLite editor, find those requests in the request queue table, and somehow mark them as unprocessed. Then rerun the crawler with CRAWLEE_PURGE_ON_START=false so it only runs the previously problematic URLs. Iterate a few times to catch all the bugs, and then run the whole crawler with purged storage.

After a lot of debugging/investigating Crawlee & @apify/storage-local I've managed to figure out a working workflow (sketched as a script below), but it's kinda laborious:
  • set the row's orderNo to some future date in ms from epoch [1]
  • edit the row's json and remove the handledAt property [2]
  • run the crawler, which will re-add the handledAt property
  • delete the row's orderNo (not sure why that is not done automatically)
That's kinda tedious, do you know of some better way? Or is there some out-of-the-box approach for my use case without hacking SQLite? I found this approach recommended by the one-and-only @Lukas Krivka here πŸ™‚ https://github.com/apify/crawlee/discussions/1232#discussioncomment-1625019

[1]
https://github.com/apify/apify-storage-local-js/blob/8dd40e88932097d2260f68f28412cc29ff894e0f/src/emulators/request_queue_emulator.ts#L341
[2]
https://github.com/apify/crawlee/blob/52b98e3e997680e352da5763b394750b19110953/packages/core/src/storages/request_queue.ts#L164
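
A rough script version of the first two steps above, so the SQLite editor isn't needed every iteration. The database path and the request_queue_requests table/column names are assumptions about @apify/storage-local's SQLite layout, so verify them against your own db.sqlite first:

JavaScript
import Database from 'better-sqlite3'

// Assumed location & schema -- check your local storage before running
const db = new Database('./apify_storage/request_queues/default/db.sqlite')
const failedIds = ['<id-from-logs>'] // collected from the crawler logs

for (const id of failedIds) {
  const row = db.prepare('SELECT json FROM request_queue_requests WHERE id = ?').get(id)
  const request = JSON.parse(row.json)
  delete request.handledAt // step 2: without handledAt, the request counts as unprocessed
  db.prepare('UPDATE request_queue_requests SET json = ?, orderNo = ? WHERE id = ?')
    // step 1: orderNo set to a future date in ms from epoch
    .run(JSON.stringify(request), Date.now() + 24 * 60 * 60 * 1000, id)
}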
12 comments