Workflow for manually reprocessing requests when using @apify/storage-local for SQLite Request Queue

strajk

Use case: I'm debugging a crawler. The majority of request handlers succeed and only a few fail. I want to fix/adjust the request handler logic, get the failed URLs from the logs, open some SQLite editor, find those requests in the request queue table, and somehow mark them as unprocessed. Then rerun the crawler with CRAWLEE_PURGE_ON_START=false so it only runs the previously problematic URLs. Iterate a few times to catch all bugs, and then run the whole crawler with purged storage.

After a lot of debugging/investigating Crawlee & @apify/storage-local I've managed to figure out a working workflow, but it's kinda laborious:

1. Set the row's orderNo to some future date in ms from epoch
2. Edit the row's json and remove the handledAt property [2]
3. Run the crawler, which will re-add the handledAt property
4. Delete the row's orderNo (not sure why that is not done automatically)
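
The two manual edits before rerunning the crawler could be scripted instead of done by hand in an SQLite editor. Below is a minimal sketch of that transformation as a pure function. It is not the library's own logic: the `QueueRow` shape and the `markRowUnhandled` name are hypothetical, only the `orderNo` and `json` columns mentioned above are modeled, and writing the result back to the database is left out.

```typescript
// Hypothetical model of one request-queue row; only the two columns
// discussed above are included.
interface QueueRow {
  orderNo: number | null;
  json: string;
}

// Returns a copy of the row that looks unprocessed: handledAt is removed
// from the stored JSON and orderNo is set to the given epoch-ms timestamp.
function markRowUnhandled(row: QueueRow, orderNo: number): QueueRow {
  const request = JSON.parse(row.json) as Record<string, unknown>;
  delete request.handledAt;
  return { orderNo, json: JSON.stringify(request) };
}

// Example: schedule a previously handled row one minute into the future.
const handledRow: QueueRow = {
  orderNo: null,
  json: JSON.stringify({
    url: "https://example.com",
    handledAt: "2023-03-28T00:00:00.000Z",
  }),
};
const resetRow = markRowUnhandled(handledRow, Date.now() + 60_000);
console.log(resetRow.json);
```

Keeping the function pure means it can be checked against a copy of a row before anything touches the real database file.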

That's kinda tedious; do you know of a better way? Or is there some out-of-the-box approach for my use case without hacking SQLite? I found this approach recommended by the one-and-only @Lukas Krivka here 🙂 https://github.com/apify/crawlee/discussions/1232#discussioncomment-1625019

[1] https://github.com/apify/apify-storage-local-js/blob/8dd40e88932097d2260f68f28412cc29ff894e0f/src/emulators/request_queue_emulator.ts#L341
[2] https://github.com/apify/crawlee/blob/52b98e3e997680e352da5763b394750b19110953/packages/core/src/storages/request_queue.ts#L164

strajk

Daringly also tagging the SQLite specialist :)) (really nice work with the local storage adapter 🙏)

vladdy

Can't you just update the request with handledAt: null and forefront: true/false?

vladdy

👀👀

strajk

IIUC nope, because of this:

Attachment

strajk

It's not a huge issue, but it's kinda awkward, so I was wondering if I'm approaching it with the wrong mental model 🙂

HonzaS

Can't you just put those failed requests into another queue, and on startup have some logic that uses that queue if it exists and is not empty?
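
A rough sketch of that startup logic, under stated assumptions: `QueueLike` and `pickQueue` are hypothetical names, and the only thing relied on is an async `isEmpty()` method, which Crawlee's RequestQueue provides. How the failed requests get into the second queue is not shown.

```typescript
// Hypothetical minimal view of a queue: anything with an async isEmpty()
// method fits, including Crawlee's RequestQueue.
interface QueueLike {
  isEmpty(): Promise<boolean>;
}

// Prefer the retry queue when it was provided and still has pending
// requests; otherwise fall back to the main queue.
async function pickQueue<Q extends QueueLike>(mainQueue: Q, retryQueue?: Q): Promise<Q> {
  if (retryQueue && !(await retryQueue.isEmpty())) {
    return retryQueue;
  }
  return mainQueue;
}
```

The returned queue would then be the one handed to the crawler, so a normal run is unaffected whenever the retry queue is missing or empty.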

vladdy

Oof yeah that's... wack. Although updating shouldn't even hit that statement

vladdy

Yeah no, updating should let you set orderNo too

vladdy

So calling storage.updateRequest should let you set orderNo and handledAt: null

strajk

For now, solved it in a very hacky, but convenient way :))
Thx for the idea

Crawler options

// Assumes `fs` and `os` are imported from Node.js and debugFilePath is defined elsewhere
failedRequestHandler: async (context, error) => {
  // Only store failed requests on the local dev machine (macOS)
  if (os.platform() === `darwin`) {
    const {request} = context
    fs.writeFileSync(debugFilePath, JSON.stringify(request, null, 2))
    console.log(`Stored failed request to ${debugFilePath}; if not deleted, it will be used as the request list the next time you run the actor.`)
  }
}

During my custom init

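// Assumes `fs`, `os` and `path` are imported from Node.js and debugDir is defined elsewhere
if (os.platform() === `darwin`) {
  // Load every request stored by failedRequestHandler on a previous run
  const files = fs.readdirSync(debugDir).filter(file => file.endsWith(`.json`))
  const requests = files.map(file => JSON.parse(fs.readFileSync(path.join(debugDir, file), `utf8`)))
  await crawler.run(requests)
}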

strajk

hah 😄 https://github.com/apify/crawlee/issues/1363#issuecomment-1485877274

Apify and Crawlee Official Forum

Workflow for manually reprocessing requests when using @apify/storage-local for SQLite Request Queue