Apify

Apify and Crawlee Official Forum

How can I pass data extracted in the first part of the scraper to items that will be extracted later?

Hi. I'm extracting prices of products. In the process, I have the main page where I can extract all the information I need except for the fees. If I go through every product individually, I can get the price and fees, but sometimes I lose the fee information because I get blocked on some products. I want to handle this situation. If I extract the fees, I want to add them to my product_item, but if I get blocked, I want to pass this data as empty. I'm using the "Router" class as the Crawlee team explains here: https://crawlee.dev/python/docs/introduction/refactoring. When I add my URL extracted from the first page as shown below, I cannot pass data extracted before:

await context.enqueue_links(url='product_url', label='PRODUCT_WITH_FEES')

I want something like this:

await context.enqueue_links(url='product_url', label='PRODUCT_WITH_FEES', data=product_item # type: dict)

But I cannot do the above. How can I do it?

So, my final data should look like this:

If I handle the data correctly, I want something like this:
product_item = {"product_id": 1234, "price": "50$", "fees": "3$"}

If I get blocked, I want something like this:
product_item = {"product_id": 1234, "price": "50$", "fees": ""}
Hi @frankman

You can use this approach

from crawlee import Request

# Attach the already-extracted item to the new request via user_data.
await context.add_requests([
    Request.from_url(
        url='product_url',
        label='PRODUCT_WITH_FEES',
        user_data={"product_item": product_item},
    )
])


enqueue_links also supports the user_data parameter, but add_requests seems like the better fit for your case.
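On the receiving side, the dict attached through user_data can be read back from context.request.user_data inside the route handler. Below is a minimal sketch of such a handler. It assumes a BeautifulSoupCrawler (so context.soup is available) and a hypothetical ".fees" CSS selector; make_fake_context is only a stand-in so the handler can be tried without running a crawl. In a real project the function would be registered with @router.handler('PRODUCT_WITH_FEES').

```python
import asyncio
from types import SimpleNamespace


async def product_with_fees_handler(context) -> None:
    """Sketch of the handler for requests labelled PRODUCT_WITH_FEES."""
    # Copy the dict that was attached via user_data when the request was added.
    product_item = dict(context.request.user_data["product_item"])

    # Hypothetical selector (assumption); adjust it to the real product page.
    fees_node = context.soup.select_one(".fees")
    product_item["fees"] = fees_node.get_text(strip=True) if fees_node else ""

    await context.push_data(product_item)


def make_fake_context(product_item, fees_text):
    """Stand-in for a crawling context, for trying the handler offline."""
    pushed = []

    async def push_data(data):
        pushed.append(data)

    node = None
    if fees_text is not None:
        node = SimpleNamespace(get_text=lambda strip=False: fees_text)

    context = SimpleNamespace(
        request=SimpleNamespace(
            label="PRODUCT_WITH_FEES",
            user_data={"product_item": product_item},
        ),
        soup=SimpleNamespace(select_one=lambda selector: node),
        push_data=push_data,
    )
    return context, pushed
```

If the fees element is missing from the page, the item is still pushed, with fees set to an empty string.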
Thank you Mantisus, that works for me. Now I know how to pass data between requests. How can I handle the data upload depending on whether the request failed or succeeded?


If I handle the data correctly, I want something like this:
product_item = {"product_id": 1234, "price": "50$", "fees": "3$"}

If I get blocked, I want something like this:
product_item = {"product_id": 1234, "price": "50$", "fees": ""}

In my final handler, the one with the label PRODUCT_WITH_FEES, I'm using Apify.push(product_item) (same as crawlee.push()).

Do I have to do it the following way?

try:
    ...
    await context.add_requests([
        Request.from_url(
            url='product_url',
            label='PRODUCT_WITH_FEES',
            user_data={"product_item": product_item},
        )
    ])
except Exception as e:
    Apify.push(product_item)  # product_item without fees.

??
I can't be certain, as I don't know exactly what behavior you are observing, but it's more likely to be something like this:

@crawler.failed_request_handler
async def blocked_item_handle(context, error) -> None:
    if context.request.label == "PRODUCT_WITH_FEES":
        # The item travels with the request inside user_data.
        Apify.push(context.request.user_data["product_item"])

https://crawlee.dev/python/api/class/BasicCrawler#failed_request_handler

Alternatively, use a try ... except inside the route handler for PRODUCT_WITH_FEES.
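A minimal sketch of the failed-request fallback: once Crawlee has exhausted the retries for a request, the failed request handler is called with the context and the error, and the item stored in user_data can be pushed with fees left empty. The sketch uses context.push_data rather than the Apify.push shorthand from the thread, and make_failed_context is only a hypothetical stand-in for trying the function without a crawl. In a real crawler the function would be registered with @crawler.failed_request_handler.

```python
import asyncio
from types import SimpleNamespace


async def push_item_without_fees(context, error: Exception) -> None:
    """Fallback for PRODUCT_WITH_FEES requests that failed (e.g. were blocked)."""
    if context.request.label != "PRODUCT_WITH_FEES":
        return  # other labels are not handled here

    # Copy the item that was attached when the request was added.
    product_item = dict(context.request.user_data["product_item"])
    product_item["fees"] = ""  # fees could not be extracted
    await context.push_data(product_item)


def make_failed_context(label, product_item):
    """Stand-in context (assumption) exposing only what the handler uses."""
    pushed = []

    async def push_data(data):
        pushed.append(data)

    context = SimpleNamespace(
        request=SimpleNamespace(
            label=label,
            user_data={"product_item": product_item},
        ),
        push_data=push_data,
    )
    return context, pushed
```

With this in place, the PRODUCT_WITH_FEES route only has to handle the success case, and blocked products still end up in the dataset with empty fees.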
Thank you, that works fine!