Apify and Crawlee Official Forum

strajk
Joined August 30, 2024
The following snippet works well for me, but it smells... does somebody have a cleaner approach?

JavaScript
// Every 3s, check the balance of finished (=success) and failed requests and stop the process if it's too bad
setInterval(() => {
  const { requestsFinished, requestsFailed } = crawler.stats.state
  if (requestsFailed > requestsFinished + 10) { // when failures exceed successes by more than 10, stop trying bro
    console.warn(`πŸ’£ Too many failed requests, stopping! (${requestsFailed} failed, ${requestsFinished} finished)`)
    process.exit(1)
  }
}, 3000)
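
One possibly cleaner direction, as a non-authoritative sketch: instead of polling with setInterval and hard-killing the process, count failures in Crawlee's failedRequestHandler (which runs after a request has exhausted its retries) and stop gracefully via crawler.autoscaledPool.abort(), so in-flight requests and state get persisted. The threshold and wiring below are illustrative, not an official pattern:

JavaScript
import { CheerioCrawler } from 'crawlee'

const crawler = new CheerioCrawler({
  async requestHandler(ctx) { /* ... */ },
  // Runs once a request has exhausted all of its retries
  async failedRequestHandler() {
    const { requestsFinished, requestsFailed } = crawler.stats.state
    if (requestsFailed > requestsFinished + 10) {
      console.warn(`πŸ’£ Too many failed requests, stopping! (${requestsFailed} failed, ${requestsFinished} finished)`)
      // Graceful stop, unlike process.exit(1)
      await crawler.autoscaledPool?.abort()
    }
  },
})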
3 comments
My actor runs typically use Cheerio, take <20 min, and have around 1k requests.
For this scenario, the costs for RequestQueue writes/reads are often higher than the compute units. I wanted to experiment with using in-memory storage to optimize costs (I think I understand the associated risks and I'm OK with them).
I've tried setting storage: new MemoryStorage() in Actor.main's second argument as noted in the docs & TS definitions, but actor runs on the platform still seem to use the "platform RQ", not the in-memory one. Any pointers?

https://console.apify.com/actors/64sLcqgxq4IB5hZrI/runs/QIkqpBDa846Ftn5xK#storage
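
For reference, a minimal sketch of the setup being described, assuming the storage option on Actor.main's second argument (as the post says the docs & TS definitions note) and the MemoryStorage class from @crawlee/memory-storage:

JavaScript
import { Actor } from 'apify'
import { MemoryStorage } from '@crawlee/memory-storage'

await Actor.main(async () => {
  // ... crawler code, ~1k requests through a CheerioCrawler ...
}, {
  // Intent: keep the RequestQueue in memory instead of the platform RQ,
  // trading durability for lower storage read/write costs
  storage: new MemoryStorage(),
})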
3 comments
Use case: When using the Cheerio, JSDOM, or LinkeDOM crawlers and their routers, I often wanna have requests automatically fetched + parsed for all the route handlers except one.
ATM I have to remember to specify skipNavigation at every point of adding the request to the request queue (IIUC).

Just food for thought, not urgent πŸ™‚
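
Until something like a per-route default exists, one possible workaround sketch: tag skipNavigation at a single enqueue chokepoint instead of remembering it everywhere. The helper name and labels here are hypothetical:

JavaScript
// Hypothetical: only the ANALYTICS route skips the fetch + parse step
const SKIPPED_LABEL = 'ANALYTICS'

const tagSkipNavigation = (requests) =>
  requests.map((r) => ({ ...r, skipNavigation: r.label === SKIPPED_LABEL }))

await crawler.addRequests(tagSkipNavigation([
  { url: 'https://example.com/page/1', label: 'PAGE' }, // fetched + parsed
  { url: 'https://example.com/track/1', label: SKIPPED_LABEL }, // handler runs without a request
]))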
3 comments
Use case: I'm debugging a crawler. The majority of request handlers succeed, only a few fail. I wanna fix/adjust the request handler logic: get the failed URLs from the logs, open some SQLite editor, find those requests in the request queue table, and somehow mark them as unprocessed. Then rerun the crawler with CRAWLEE_PURGE_ON_START=false so it only runs the previously problematic URLs. Iterate a few times to catch all the bugs, and then run the whole crawler with purged storage.

After a lot of debugging/investigating Crawlee & @apify/storage-local I've managed to figure out a working workflow (sketched as a script below), but it's kinda laborious:
  • set the row's orderNo to some future date in ms from epoch [1]
  • edit the row's json and remove the handledAt property [2]
  • run the crawler, which will re-add the handledAt property
  • delete the row's orderNo (not sure why that is not done automatically)
That's kinda tedious, do you know of some better way? Or is there some out-of-the-box approach for my use case without hacking SQLite? I found this approach recommended by the one-and-only @Lukas Krivka here πŸ™‚ https://github.com/apify/crawlee/discussions/1232#discussioncomment-1625019

[1]
https://github.com/apify/apify-storage-local-js/blob/8dd40e88932097d2260f68f28412cc29ff894e0f/src/emulators/request_queue_emulator.ts#L341
[2]
https://github.com/apify/crawlee/blob/52b98e3e997680e352da5763b394750b19110953/packages/core/src/storages/request_queue.ts#L164
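
A rough script version of the first two steps above, so the SQLite editor isn't needed every iteration. The database path and the request_queue_requests table/column names are assumptions about @apify/storage-local's SQLite layout, so verify them against your own db.sqlite first:

JavaScript
import Database from 'better-sqlite3'

// Assumed location & schema -- check your local storage before running
const db = new Database('./apify_storage/request_queues/default/db.sqlite')
const failedIds = ['<id-from-logs>'] // collected from the crawler logs

for (const id of failedIds) {
  const row = db.prepare('SELECT json FROM request_queue_requests WHERE id = ?').get(id)
  const request = JSON.parse(row.json)
  delete request.handledAt // step 2: without handledAt, the request counts as unprocessed
  db.prepare('UPDATE request_queue_requests SET json = ?, orderNo = ? WHERE id = ?')
    // step 1: orderNo set to a future date in ms from epoch
    .run(JSON.stringify(request), Date.now() + 24 * 60 * 60 * 1000, id)
}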
12 comments