Apify and Crawlee Official Forum

Updated 2 months ago

Clear URL queue at end of run?

I'm a data reporter at CBS News using crawlee to archive web pages. Currently, when one crawl finishes, the next crawl picks up and continues crawling pages that were enqueued during the previous run.

Is there an easy fix to this? I've looked at the docs, specifically the persist_storage and purge_on_start parameters, but it's unclear from the documentation what exactly those do.

Happy to provide a code sample if helpful.
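If the goal is simply to start each run with an empty queue, one option (a minimal sketch, assuming the RequestQueue storage class and its drop() method from crawlee's Python storages API; check your version's docs) is to drop the default request queue explicitly once the crawl finishes:

Plain Text
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.storages import RequestQueue


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Archived {context.request.url}')

    await crawler.run(['https://example.com'])

    # Drop the default request queue so URLs still enqueued from this run
    # are not picked up by the next one.
    request_queue = await RequestQueue.open()
    await request_queue.drop()


if __name__ == '__main__':
    asyncio.run(main())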
If anyone comes across this post: I think I understand what's happening now. If crawlee hits the limit defined in max_requests_per_crawl, it stops making requests but doesn't clear the request queue, so if you're calling enqueue_links you'll end up with leftover pages in the queue that carry over to the next run.
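For illustration, a hedged sketch of how that limit is typically set (the option name max_requests_per_crawl comes from the comment above; it caps processing only, so anything already enqueued stays in the queue):

Plain Text
from crawlee.crawlers import BeautifulSoupCrawler

# Stop after roughly 50 processed requests. This caps processing only:
# links already enqueued beyond the limit remain in the request queue
# and will be picked up by the next run unless the queue is purged.
crawler = BeautifulSoupCrawler(max_requests_per_crawl=50)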
Hi, are you by any chance using Jupyter Notebook when working with crawlee?

The behavior you describe corresponds to purge_on_start=False:
the crawler reaches the max_requests_per_crawl limit and stops, and on the next start it continues where it left off, since the queue is not cleared.

If you are working in a Jupyter Notebook, however, the queue and cache stored in memory are not cleared until the session is terminated.
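As a rough sketch of what those two settings control (assuming crawlee reads its Configuration fields from CRAWLEE_* environment variables, which may differ by version): purge_on_start wipes the default storages when the crawler starts, and persist_storage decides whether storage is written to disk at all.

Plain Text
import os

# Assumption: these environment variables map onto crawlee's Configuration
# fields. Set them before any crawler or storage is opened.
os.environ['CRAWLEE_PURGE_ON_START'] = 'true'    # clear default storages on startup
os.environ['CRAWLEE_PERSIST_STORAGE'] = 'false'  # keep storages in memory only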
Nope, this is running inside a Django app I'm building, not a notebook.
I managed to get it to crawl the correct pages by not setting a request limit and limiting the crawl depth instead.
However, I'm now seeing an issue where the crawler refuses to crawl a page that it has previously crawled, and it's not clear why.
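A minimal sketch of the depth-limiting approach mentioned above, assuming the crawler exposes a max_crawl_depth option (the exact name may vary between crawlee releases):

Plain Text
from crawlee.crawlers import BeautifulSoupCrawler

# Limit how far enqueue_links follows links from the start URLs,
# instead of capping the total number of requests.
crawler = BeautifulSoupCrawler(max_crawl_depth=2)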
If you want the crawler to crawl the same page again, you must pass a unique_key.

example:

Plain Text
import asyncio

from crawlee import Request
# Import paths may differ in older crawlee versions.
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    # The same URL three times: distinct unique_key values stop the request
    # queue from deduplicating them, so each request is processed.
    request_1 = Request.from_url("https://httpbin.org/get", unique_key="1")
    request_2 = Request.from_url("https://httpbin.org/get", unique_key="2")
    request_3 = Request.from_url("https://httpbin.org/get", unique_key="3")

    await crawler.run(
        [
            request_1,
            request_2,
            request_3,
        ]
    )


if __name__ == '__main__':
    asyncio.run(main())
Boom, that worked for me. Thanks so much for the help!
For posterity, if anyone comes across this thread: I had to provide a unique_key (I used a uuid because I want the pages to be crawled every time) both on the Request object AND in the user_data argument to enqueue_links.
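A minimal sketch of the seed-request side of that approach, using only the Request.from_url signature shown earlier; the uuid-based key is illustrative, and the user_data argument to enqueue_links is as described above and not shown here:

Plain Text
import uuid

from crawlee import Request


def fresh_request(url: str) -> Request:
    # A random unique_key per run means the request queue never
    # deduplicates this URL against a previously crawled request.
    return Request.from_url(url, unique_key=str(uuid.uuid4()))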