Apify and Crawlee Official Forum

This is so that Playwright can fill in and submit a website search page that uses dynamic JavaScript. When the results are shown, I want to use the BeautifulSoup crawler to open each product page and parse the information. If I use Playwright to open each product page, it takes a very long time. I cannot seem to run both crawlers at the same time.
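One pattern that might help (a sketch, not a definitive answer; the search URL, search term, and selectors below are placeholders): run the PlaywrightCrawler first to submit the form and collect product URLs, then feed those URLs into a BeautifulSoupCrawler.
Plain Text
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    product_urls: list[str] = []

    # Stage 1: Playwright fills in and submits the dynamic search form,
    # then collects the product URLs from the rendered result page.
    browser_crawler = PlaywrightCrawler()

    @browser_crawler.router.default_handler
    async def search_handler(context: PlaywrightCrawlingContext) -> None:
        # 'input[name=q]' and 'a.product-link' are hypothetical selectors.
        await context.page.fill('input[name=q]', 'my search term')
        await context.page.keyboard.press('Enter')
        await context.page.wait_for_load_state('networkidle')
        hrefs = await context.page.eval_on_selector_all(
            'a.product-link', 'els => els.map(e => e.href)'
        )
        product_urls.extend(hrefs)

    # Stage 2: the cheaper HTTP-based crawler parses each product page.
    soup_crawler = BeautifulSoupCrawler()

    @soup_crawler.router.default_handler
    async def product_handler(context: BeautifulSoupCrawlingContext) -> None:
        title = context.soup.select_one('h1')
        await context.push_data({
            'url': context.request.url,
            'title': title.get_text(strip=True) if title else None,
        })

    await browser_crawler.run(['https://example.com/search'])
    await soup_crawler.run(product_urls)


if __name__ == '__main__':
    asyncio.run(main())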
7 comments
How do I use proxies with Playwright? And what are the best proxy service providers? Note that I'm new to web scraping and I'm using Crawlee for Python.
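For the first part, a minimal sketch (Crawlee is provider-agnostic; the proxy URLs below are placeholders for whichever provider you choose):
Plain Text
from crawlee.playwright_crawler import PlaywrightCrawler
from crawlee.proxy_configuration import ProxyConfiguration

# Credentials and hosts are placeholders; Crawlee rotates through the listed proxies.
proxy_configuration = ProxyConfiguration(
    proxy_urls=[
        'http://user:password@proxy-1.example.com:8000',
        'http://user:password@proxy-2.example.com:8000',
    ],
)
crawler = PlaywrightCrawler(proxy_configuration=proxy_configuration)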
1 comment
Does anyone know when Parallel Scraping and Request Locking are coming to the Python version?
1 comment
I'm building a request queue of URLs. Most run fine, but I receive the following exception and I'm not sure how to proceed.

pydantic_core._pydantic_core.ValidationError: 1 validation error for Request
user_data.__crawlee.state
Input should be 0, 1, 2, 3, 4, 5, 6 or 7 [type=enum, input_value='RequestState.REQUEST_HANDLER', input_type=str]
For further information visit https://errors.pydantic.dev/2.9/v/enum
2 comments
I'm conducting reverse engineering and have discovered a link that retrieves all the data I need using the POST method. I've copied the request as cURL to analyze the parameters required for making the correct request.

I've modified the parameters to make the request using the POST method. I've successfully tested this using httpx, but now I want to implement it using the Crawlee framework.

How can I change the method used by the HTTP client to retrieve the data, and how can I pass the modified parameters I've prepared?

Additionally, if anyone has experience, I'd appreciate any insights on handling POST requests within this framework.

Thanks
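A sketch of one way this can look, assuming the HttpCrawler and that Request.from_url accepts method/payload/headers in your Crawlee version (the endpoint and payload are placeholders taken from nothing but the question):
Plain Text
import asyncio

from crawlee import Request  # import path may vary by Crawlee version
from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler()

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        # The response body is exposed on http_response (method name may differ by version).
        body = context.http_response.read()
        context.log.info(f'Received {len(body)} bytes from {context.request.url}')

    # Build a POST request with the parameters prepared from the cURL analysis.
    request = Request.from_url(
        'https://example.com/api/search',  # placeholder endpoint
        method='POST',
        payload=b'{"query": "example", "page": 1}',  # placeholder body
        headers={'Content-Type': 'application/json'},
    )
    await crawler.run([request])


if __name__ == '__main__':
    asyncio.run(main())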
1 comment
Hey guys, I've been trying to figure out how to integrate the following actors: 1. compass/crawler-google-places, 2. vdrmota/contact-info-scraper, 3. lukaskrivka/dedup-datasets.

What I'm looking to do:
  1. The Google Maps Scraper runs and the results from its dataset go to the Contact Scraper to enrich the data (done, since they integrate using the Apify integration).
  2. After step 1, I have two datasets (one from the Google Maps Scraper and the other from the Contact Scraper). I want to reference these datasets in the third actor (Dedup-Datasets) so it merges/matches the data into a clean output.
The issue is that the Dedup-Datasets actor's input requires the actual dataset IDs (it doesn't seem to accept variables for the datasets). This means I need to manually take the two dataset IDs after each run and enter them into the third actor. I would like the dataset IDs from the two actors to pass seamlessly to the third actor, but I can't really find a good workaround.

Help would be appreciated!
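One workaround sketch, assuming you can drive the chain from a small orchestrator script with the Apify API client (the dedup actor's input field name below is an assumption; check its input schema):
Plain Text
from apify_client import ApifyClient

client = ApifyClient('<YOUR_APIFY_TOKEN>')  # placeholder token

# Run the two upstream actors and capture their default dataset IDs.
maps_run = client.actor('compass/crawler-google-places').call(run_input={})      # real input goes here
contacts_run = client.actor('vdrmota/contact-info-scraper').call(run_input={})   # real input goes here

# Pass both dataset IDs to the dedup actor programmatically instead of typing them in.
client.actor('lukaskrivka/dedup-datasets').call(run_input={
    'datasetIds': [maps_run['defaultDatasetId'], contacts_run['defaultDatasetId']],  # field name is an assumption
})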
Hi,
I'm making a simple app that gets updated information from a website.
It lives inside a FastAPI app and uses AsyncIOScheduler to run the script every day.
The issue is that since the crawler has already visited the main page, it will not revisit it on the next call.
I've done a lot of research but couldn't find a solution; other scrapers have something like a force= parameter to force the scrape.
How can we force the request back to the UNPROCESSED state?
Here is the code:
Plain Text
from crawlee import Request
from crawlee._request import RequestState  # internal module; import path may differ by version
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


class Scraper:
    async def run_scraper(self):
        proxy_urls = process_proxy_file('proxy_list.txt')  # user-defined helper
        proxy_configuration = ProxyConfiguration(proxy_urls=proxy_urls)
        crawler = PlaywrightCrawler(
            proxy_configuration=proxy_configuration,
            headless=False,
            browser_type='chromium',
        )

        @crawler.router.default_handler
        async def request_handler(context: PlaywrightCrawlingContext) -> None:
            print('Handling request...')
            # state is an attribute, not a callable, so assign to it instead of calling it.
            context.request.state = RequestState.UNPROCESSED

            # Scrape logic here
            # Return scraped data if needed

        request = Request.from_url('https://crawlee.dev')
        await crawler.run([request])
        return "Example Scraped Data"
1 comment
I have built an Instagram profile scraper in Python, but I want to limit the scraping results to 25 for free-plan users (not paid-plan users).
Can anybody help me out?
2 comments
Hey, I created this simple script:
Plain Text
import asyncio

# Instead of BeautifulSoupCrawler let's use Playwright to be able to render JavaScript.
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
import urllib.parse


async def main(terms: str) -> None:
    crawler = PlaywrightCrawler(headless=False)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Wait for the collection cards to render on the page. This ensures that
        # the elements we want to interact with are present in the DOM.
        await context.page.wait_for_load_state("networkidle")
        await context.infinite_scroll()

    url = f"https://www.youtube.com/results?search_query={urllib.parse.quote(terms)}&sp=EgIwAQ%253D%253D"
    await crawler.run([url])


if __name__ == "__main__":
    asyncio.run(main("music"))

But I want to get the content of the page while infinite_scroll is scrolling it, so that I can see the new content and act on it. However, await context.infinite_scroll() never stops, so I can't put an action after it to run what I want. How can I manage that? (I want to get the links of new YouTube videos.)
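One approach that might work (a sketch; the a#video-title selector is an assumption about YouTube's current markup): run infinite_scroll as a background task and poll the page for new links while it runs, then call this helper from the request handler instead of awaiting infinite_scroll directly.
Plain Text
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawlingContext


async def scroll_and_collect(context: PlaywrightCrawlingContext) -> set[str]:
    seen: set[str] = set()
    # Let infinite_scroll run in the background instead of awaiting it directly.
    scroll_task = asyncio.create_task(context.infinite_scroll())
    while not scroll_task.done():
        hrefs = await context.page.eval_on_selector_all(
            'a#video-title', 'els => els.map(e => e.href)'
        )
        for href in hrefs:
            if href and href not in seen:
                seen.add(href)
                print('New video:', href)
        await asyncio.sleep(1)
    await scroll_task  # propagate any error from the scrolling task
    return seen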
2 comments
Hi guys, I plan to deploy my crawler (a ParselCrawler) to AWS Lambda. I'm loosely following this guide, which is for JavaScript though. I'd like to disable persisting the storage. I change the configuration like this:
Plain Text
config = Configuration.get_global_configuration()
config.persist_storage = False

I also tried to supply the configuration to the ParselCrawler constructor, but neither of these works. The storage directory still gets created. Am I doing something wrong here?
Thanks
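One thing that may be worth trying (an assumption about how the configuration is picked up, since Configuration reads CRAWLEE_-prefixed environment variables): set the variables before Crawlee creates its storage client.
Plain Text
import os

# Assumed environment variable names; set them before the crawler/storages are created.
os.environ['CRAWLEE_PERSIST_STORAGE'] = 'false'
os.environ['CRAWLEE_PURGE_ON_START'] = 'false'

from crawlee.parsel_crawler import ParselCrawler  # noqa: E402

crawler = ParselCrawler()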
1 comment
Hi. I'm extracting prices of products. On the main page I can extract all the information I need except for the fees. If I go through every product individually, I can get the price and fees, but sometimes I lose the fee information because I get blocked on some products. I want to handle this situation: if I extract the fees, I want to add them to my product_item, but if I get blocked, I want to pass this field as empty. I'm using the "Router" class as the Crawlee team explains here: https://crawlee.dev/python/docs/introduction/refactoring. When I add the URL extracted from the first page as shown below, I cannot pass along the data extracted before:

await context.enqueue_links(url='product_url', label='PRODUCT_WITH_FEES')

I want something like this:

await context.enqueue_links(url='product_url', label='PRODUCT_WITH_FEES', data=product_item # type: dict)

But I cannot do the above. How can I do it?

So, my final data would look like this:

If I handle the data correctly, I want something like this:
product_item = {product_id: 1234, price: 50$, fees: 3$}

If I get blocked, I have something like this:
product_item = {product_id: 1234, price: 50$, fees: ''}
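A sketch of one way to carry the partial item along (assuming the BeautifulSoup-based setup from the linked tutorial; the URLs, selectors, and values are placeholders): put it into user_data on an explicitly built Request instead of relying on enqueue_links.
Plain Text
from crawlee import Request
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawlingContext
from crawlee.router import Router

# Pass this router to the crawler, e.g. BeautifulSoupCrawler(request_handler=router).
router = Router[BeautifulSoupCrawlingContext]()


@router.handler('LISTING')
async def listing_handler(context: BeautifulSoupCrawlingContext) -> None:
    # Build the partial item from the main page, then carry it in user_data.
    product_item = {'product_id': 1234, 'price': '50$', 'fees': ''}
    product_url = 'https://example.com/product/1234'  # placeholder URL
    await context.add_requests([
        Request.from_url(
            product_url,
            label='PRODUCT_WITH_FEES',
            user_data={'product_item': product_item},
        )
    ])


@router.handler('PRODUCT_WITH_FEES')
async def product_handler(context: BeautifulSoupCrawlingContext) -> None:
    # Recover the partial item; if the fee cannot be extracted, keep it empty.
    product_item = dict(context.request.user_data['product_item'])
    fee_tag = context.soup.select_one('.fee')  # placeholder selector
    product_item['fees'] = fee_tag.get_text(strip=True) if fee_tag else ''
    await context.push_data(product_item)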
4 comments
Hi. I only want to retrieve tweets where users directly mention me in the tweet, not tweets that mention me because they are replies to tweets that mentioned me. Can you do this? For more details on this issue, please refer to: https://devcommunity.x.com/t/how-to-differentiate-direct-reply-and-mentions/149262.

Thanks.
3 comments
Hi, I'm trying to make a scraper and I don't know how to use a proxy hosted by Apify in my script. I'm sharing some code so you can see what I'm trying to do.
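A minimal sketch, assuming you have your Apify Proxy password from the Apify Console: the Apify Proxy is exposed as a single HTTP endpoint, and the credentials go straight into the proxy URL.
Plain Text
from crawlee.playwright_crawler import PlaywrightCrawler
from crawlee.proxy_configuration import ProxyConfiguration

# '<APIFY_PROXY_PASSWORD>' is a placeholder for the password shown in the Apify Console.
proxy_configuration = ProxyConfiguration(
    proxy_urls=['http://auto:<APIFY_PROXY_PASSWORD>@proxy.apify.com:8000'],
)
crawler = PlaywrightCrawler(proxy_configuration=proxy_configuration)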
2 comments
Hi and good day. I'm creating a POST API that accepts the following JSON body:
{
    "url": "https://crawlee.dev/python/",
    "targets": ["html", "pdf"]
}

The targets list contains the file extensions that my code downloads when it discovers them.

I'm already at my wits' end since I don't understand the error I'm getting, which is:
[crawlee.memory_storage_client._request_queue_client] WARN Error adding request to the queue: Request ID does not match its unique_key.

Has anyone encountered this problem?

The following is my whole code:
9 comments
The Crawlee code works when I put the URL directly inside crawler.run([url]), but when I put the code inside an endpoint and call it from Postman, a NotImplementedError is raised.

How do I use Crawlee with FastAPI?
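The NotImplementedError often points at how the event loop is set up (for example, calling asyncio.run() inside the server's already-running loop, or the Windows event loop policy when a browser subprocess is spawned). A sketch of one way to wire Crawlee into FastAPI, awaiting the crawler directly on the server's loop (ParselCrawler is used here only as an example):
Plain Text
from fastapi import FastAPI
from pydantic import BaseModel

from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext

app = FastAPI()


class ScrapeRequest(BaseModel):
    url: str


@app.post('/scrape')
async def scrape(body: ScrapeRequest) -> dict:
    results: list[dict] = []
    crawler = ParselCrawler()

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        results.append({
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        })

    # Await the crawler on the server's own event loop; no asyncio.run() here.
    await crawler.run([body.url])
    return {'items': results}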
I am interested in writing a blog post about using Apify and Qdrant. I would like to know if this would create any costs, as I am new to Apify. The idea is to combine Apify with Qdrant and AWS services, create an app, and showcase it on Medium and on any blog where Apify is interested. I would love to get in contact with anyone from Apify to discuss this opportunity.

Mainly I am interested in scraping FAQ pages from AWS, such as:

https://aws.amazon.com/sagemaker/faqs/
2 comments
Can someone tell me how to send mouse events to an inactive (unfocused) window on Windows?
I'm looking for a solution for scraping a website that fills in product details via JavaScript on scroll.

Will Crawlee "scroll" and scrape after this content loads, or will this require other means?

Thank you!
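PlaywrightCrawler renders the page in a real browser, so it can scroll and then scrape the loaded DOM; a sketch (the selector is a placeholder):
Plain Text
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler()


@crawler.router.default_handler
async def handler(context: PlaywrightCrawlingContext) -> None:
    # Keep scrolling until no new content appears, then read the rendered DOM.
    await context.infinite_scroll()
    titles = await context.page.eval_on_selector_all(
        '.product-title',  # placeholder selector
        'els => els.map(e => e.textContent)',
    )
    await context.push_data({'url': context.request.url, 'titles': titles})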
[crawlee._autoscaling.snapshotter] WARN Memory is critically overloaded. Using 2.54 GB of 1.94 GB (131%). Consider increasing available memory.

How do you increase the available memory? I have 8 GB, but it's only using about 2 GB.
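A sketch of raising the limit (the variable and field names are based on Crawlee's Configuration and are worth double-checking for your version):
Plain Text
import os

# Option A: environment variable, set before the crawler starts.
os.environ['CRAWLEE_MEMORY_MBYTES'] = '6144'  # e.g. let Crawlee use ~6 GB of the 8 GB

# Option B: the Configuration object, if your version exposes memory_mbytes there.
from crawlee.configuration import Configuration

config = Configuration(memory_mbytes=6144)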
When I add my proxy configuration to the Playwright crawler using the chromium browser type, it throws an error, whereas it doesn't throw an error when I specify the firefox browser.

Plain Text
proxy_configuration = ProxyConfiguration(
    # proxy settings omitted in the original post
)

crawler = PlaywrightCrawler(
    # Limit the crawl to max requests. Remove or increase it for crawling all links.
    max_requests_per_crawl=10,
    # Headless mode, set to False to see the browser in action.
    headless=False,
    # Browser types supported by Playwright.
    browser_type='chromium',
    proxy_configuration=proxy_configuration,
    use_session_pool=True,
)
6 comments
How do I set the username and password of my proxy when using Crawlee for Python?
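A minimal sketch: the credentials usually go straight into the proxy URL (host, port, username, and password below are placeholders).
Plain Text
from crawlee.proxy_configuration import ProxyConfiguration

proxy_configuration = ProxyConfiguration(
    proxy_urls=['http://my_username:my_password@proxy.example.com:8000'],
)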
4 comments

Crawlee Proxy

How do I use proxy servers with Crawlee if I don't have access to third-party proxies?
2 comments