Apify and Crawlee Official Forum

Updated 11 months ago

Workflow for manually reprocessing requests when using @apify/storage-local for SQLite Request Queue

Use case: I'm debugging a crawler. The majority of request handlers succeed; only a few fail. I want to fix/adjust the request handler logic, get the failed URLs from the logs, open some SQLite editor, find those requests in the request queue table, and somehow mark them as unprocessed. Then rerun the crawler with CRAWLEE_PURGE_ON_START=false so it only runs the previously problematic URLs. Iterate a few times to catch all the bugs, and then run the whole crawler with purged storage.

After a lot of debugging/investigating Crawlee & @apify/storage-local I've managed to figure out a working workflow, but it's kind of laborious:
  • set the row's orderNo to some future date in ms from epoch
  • edit the row's JSON and remove the handledAt property [2]
  • run the crawler, which will re-add the handledAt property
  • delete the row's orderNo (not sure why that is not done automatically)
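
For convenience, here is the same sequence scripted against the queue's db.sqlite with better-sqlite3 (which @apify/storage-local uses internally). This is a sketch only: the database path, the table name, and matching by url are assumptions based on my local schema, so check yours first (e.g. with `.schema` in the sqlite3 CLI):

JavaScript
// Sketch only: table/column names (request_queues_requests, orderNo, json, url)
// are assumptions from my local db.sqlite, verify before running.
import Database from 'better-sqlite3'

const db = new Database(`./storage/request_queues/db.sqlite`)
const failedUrls = [`https://example.com/failed-page`] // collected from the logs

const select = db.prepare(`SELECT id, json FROM request_queues_requests WHERE url = ?`)
const update = db.prepare(`UPDATE request_queues_requests SET orderNo = ?, json = ? WHERE id = ?`)

for (const url of failedUrls) {
  const row = select.get(url)
  if (!row) continue
  const request = JSON.parse(row.json)
  delete request.handledAt // a request without handledAt counts as unprocessed
  // a future orderNo moves the row back into the pending set
  update.run(Date.now() + 7 * 24 * 60 * 60 * 1000, JSON.stringify(request), row.id)
}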
That's kinda tedious, do you know of some better way? Or is there some out-of-the-box approach for my use case without hacking SQLite? I've found this approach recommended by the one-and-only @Lukas Krivka here πŸ™‚ https://github.com/apify/crawlee/discussions/1232#discussioncomment-1625019

[1]
https://github.com/apify/apify-storage-local-js/blob/8dd40e88932097d2260f68f28412cc29ff894e0f/src/emulators/request_queue_emulator.ts#L341
[2]
https://github.com/apify/crawlee/blob/52b98e3e997680e352da5763b394750b19110953/packages/core/src/storages/request_queue.ts#L164
Attachment: Screen_2024-01-03_at_21.06.08.png
12 comments
daringly tagging also as a SQLite specialist :)) (really nice work with the local storage adapter πŸ™)
Can't you just update the request with handledAt: null and forefront: true/false?
πŸ‘€πŸ‘€
IIUC nope, because of this:
Attachment: image.png
It's not a huge issue, but it's kinda awkward, so I was wondering if I'm approaching it with the wrong mental model πŸ™‚
Can't you just put those failed requests into another queue, and then on start have some logic that uses that queue if it exists and is not empty?
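
For illustration, that side-queue idea could look roughly like the sketch below. The queue name, crawler class, and handler bodies are placeholders; it leans on the fact that named storages are not purged on start, so the side queue survives between runs:

JavaScript
import { CheerioCrawler, RequestQueue } from 'crawlee'

// named queue; unlike the default one, it survives purge-on-start
const failedQueue = await RequestQueue.open(`failed-requests`)
const debugRun = !(await failedQueue.isEmpty())

const crawler = new CheerioCrawler({
  // on a debug run, consume only the previously failed requests
  requestQueue: debugRun ? failedQueue : await RequestQueue.open(),
  requestHandler: async ({ request }) => {
    // ... handler logic under test ...
  },
  failedRequestHandler: async ({ request }) => {
    // stash failures for the next debug iteration
    await failedQueue.addRequest({ url: request.url, uniqueKey: request.uniqueKey })
  },
})

await crawler.run()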
Oof yeah that's... wack. Although updating shouldn't even hit that statement
Yeah no, updating should let you set orderNo too
So calling storage.updateRequest should let you set orderNo and handledAt: null
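
For reference, the intended usage would look roughly like this sketch against the @apify/storage-local client (the queue and request ids are placeholders; per the screenshot above, this path may still trip over the emulator, so treat it as the intent rather than a verified fix):

JavaScript
import { ApifyStorageLocal } from '@apify/storage-local'

const storage = new ApifyStorageLocal()
const queue = storage.requestQueue(`default`)

const request = await queue.getRequest(`<request-id>`)
if (request) {
  await queue.updateRequest(
    { ...request, handledAt: undefined }, // clear handledAt so it counts as pending
    { forefront: true }, // re-enqueue at the front
  )
}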
For now, I've solved it in a very hacky but convenient way :))
Thx for the idea

Crawler options
Plain Text
// needs `import os from 'node:os'` and `import fs from 'node:fs'` in scope;
// debugFilePath should be unique per request (e.g. derived from request.uniqueKey),
// otherwise each failure overwrites the previous one
failedRequestHandler: async (context, error) => {
  // guard on macOS so this only runs on my dev machine, never on the platform
  if (os.platform() === `darwin`) {
    const { request } = context
    fs.writeFileSync(debugFilePath, JSON.stringify(request, null, 2))
    console.log(`Stored failed request to ${debugFilePath}; if not deleted, it will be used for the request list the next time you run the actor.`)
  }
}


During my custom init
Plain Text
// needs `import path from 'node:path'` in scope as well
if (os.platform() === `darwin`) {
  const files = fs.readdirSync(debugDir).filter(file => file.endsWith(`.json`))
  // re-run only the previously failed requests dumped into debugDir
  const requests = files.map(file => JSON.parse(fs.readFileSync(path.join(debugDir, file), `utf8`)))
  await crawler.run(requests)
}