Apify and Crawlee Official Forum

This is so that Playwright can fill in and submit a website search page that uses dynamic JavaScript. When the results are shown, I want to use the BeautifulSoup crawler to open each product page and parse the information. If I use Playwright to open each product page, it takes a very long time. I cannot seem to run both crawlers at the same time.
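One pattern that might help (a sketch, not a definitive answer; the search URL, search term, and selectors below are placeholders): run the PlaywrightCrawler first to submit the form and collect product URLs, then feed those URLs into a BeautifulSoupCrawler.
Plain Text
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    product_urls: list[str] = []

    # Stage 1: Playwright fills in and submits the dynamic search form,
    # then collects the product URLs from the rendered result page.
    browser_crawler = PlaywrightCrawler()

    @browser_crawler.router.default_handler
    async def search_handler(context: PlaywrightCrawlingContext) -> None:
        # 'input[name=q]' and 'a.product-link' are hypothetical selectors.
        await context.page.fill('input[name=q]', 'my search term')
        await context.page.keyboard.press('Enter')
        await context.page.wait_for_load_state('networkidle')
        hrefs = await context.page.eval_on_selector_all(
            'a.product-link', 'els => els.map(e => e.href)'
        )
        product_urls.extend(hrefs)

    # Stage 2: the cheaper HTTP-based crawler parses each product page.
    soup_crawler = BeautifulSoupCrawler()

    @soup_crawler.router.default_handler
    async def product_handler(context: BeautifulSoupCrawlingContext) -> None:
        title = context.soup.select_one('h1')
        await context.push_data({
            'url': context.request.url,
            'title': title.get_text(strip=True) if title else None,
        })

    await browser_crawler.run(['https://example.com/search'])
    await soup_crawler.run(product_urls)


if __name__ == '__main__':
    asyncio.run(main())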
7 comments
How do I use proxies with Playwright? And what are the best proxy service providers? Note that I'm new to web scraping and I'm using Crawlee for Python.
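For the first part, a minimal sketch (Crawlee is provider-agnostic; the proxy URLs below are placeholders for whichever provider you choose):
Plain Text
from crawlee.playwright_crawler import PlaywrightCrawler
from crawlee.proxy_configuration import ProxyConfiguration

# Credentials and hosts are placeholders; Crawlee rotates through the listed proxies.
proxy_configuration = ProxyConfiguration(
    proxy_urls=[
        'http://user:password@proxy-1.example.com:8000',
        'http://user:password@proxy-2.example.com:8000',
    ],
)
crawler = PlaywrightCrawler(proxy_configuration=proxy_configuration)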
1 comment
Does anyone know when Parallel Scraping and Request Locking are coming to the Python version?
1 comment
I'm building a request queue of URLs. Most run fine, but I receive the following exception and I'm not sure how to proceed.

pydantic_core._pydantic_core.ValidationError: 1 validation error for Request
user_data.__crawlee.state
Input should be 0, 1, 2, 3, 4, 5, 6 or 7 [type=enum, input_value='RequestState.REQUEST_HANDLER', input_type=str]
For further information visit https://errors.pydantic.dev/2.9/v/enum
2 comments
I'm conducting reverse engineering and have discovered a link that retrieves all the data I need using the POST method. I've copied the request as cURL to analyze the parameters required for making the correct request.

I've modified the parameters to make the request using the POST method. I've successfully tested this using httpx, but now I want to implement it using the Crawlee framework.

How can I change the method used by the HTTP client to retrieve the data, and how can I pass the modified parameters I've prepared?

Additionally, if anyone has experience, I'd appreciate any insights on handling POST requests within this framework.

Thanks
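A sketch of one way this can look, assuming the HttpCrawler and that Request.from_url accepts method/payload/headers in your Crawlee version (the endpoint and payload are placeholders taken from nothing but the question):
Plain Text
import asyncio

from crawlee import Request  # import path may vary by Crawlee version
from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler()

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        # The response body is exposed on http_response (method name may differ by version).
        body = context.http_response.read()
        context.log.info(f'Received {len(body)} bytes from {context.request.url}')

    # Build a POST request with the parameters prepared from the cURL analysis.
    request = Request.from_url(
        'https://example.com/api/search',  # placeholder endpoint
        method='POST',
        payload=b'{"query": "example", "page": 1}',  # placeholder body
        headers={'Content-Type': 'application/json'},
    )
    await crawler.run([request])


if __name__ == '__main__':
    asyncio.run(main())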
1 comment
Hey guys, I've been trying to figure out how to integrate the following actors: 1. compass/crawler-google-places, 2. vdrmota/contact-info-scraper, 3. lukaskrivka/dedup-datasets.

What I'm looking to do:
  1. The Google Maps Scraper runs and the results from its dataset go to the Contact Scraper to enrich the data (done, since they integrate using the Apify integration).
  2. After step 1, I have two datasets (one from the Google Maps Scraper and the other from the Contact Scraper). I want to reference these datasets in the third actor (Dedup-Datasets) so it merges/matches the data into a clean output.
The issue is that the Dedup-Datasets actor's input requires the actual dataset IDs (it doesn't seem to accept variables for the datasets). This means I need to manually take the two dataset IDs after each run and enter them into the third actor. I would like the dataset IDs from the two actors to pass seamlessly to the third actor, but I can't really find a good workaround.

Help would be appreciated!
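One workaround sketch, assuming you can drive the chain from a small orchestrator script with the Apify API client (the dedup actor's input field name below is an assumption; check its input schema):
Plain Text
from apify_client import ApifyClient

client = ApifyClient('<YOUR_APIFY_TOKEN>')  # placeholder token

# Run the two upstream actors and capture their default dataset IDs.
maps_run = client.actor('compass/crawler-google-places').call(run_input={})      # real input goes here
contacts_run = client.actor('vdrmota/contact-info-scraper').call(run_input={})   # real input goes here

# Pass both dataset IDs to the dedup actor programmatically instead of typing them in.
client.actor('lukaskrivka/dedup-datasets').call(run_input={
    'datasetIds': [maps_run['defaultDatasetId'], contacts_run['defaultDatasetId']],  # field name is an assumption
})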
Hi,
I'm making a simple app that gets updated information from a website.
It lives inside a FastAPI app and uses AsyncIOScheduler to run the script every day.
The issue is that since the crawler has already visited the main page, it will not revisit it on the next call.
I've done a lot of research but couldn't find a solution; other scrapers have something like a force= parameter to force the scrape.
How can we force the request back to the UNPROCESSED state?
Here is the code:
Plain Text
from crawlee import Request
from crawlee._request import RequestState  # internal module; import path may differ by version
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


class Scraper:
    async def run_scraper(self):
        proxy_urls = process_proxy_file('proxy_list.txt')  # user-defined helper
        proxy_configuration = ProxyConfiguration(proxy_urls=proxy_urls)
        crawler = PlaywrightCrawler(
            proxy_configuration=proxy_configuration,
            headless=False,
            browser_type='chromium',
        )

        @crawler.router.default_handler
        async def request_handler(context: PlaywrightCrawlingContext) -> None:
            print('Handling request...')
            # state is an attribute, not a callable, so assign to it instead of calling it.
            context.request.state = RequestState.UNPROCESSED

            # Scrape logic here
            # Return scraped data if needed

        request = Request.from_url('https://crawlee.dev')
        await crawler.run([request])
        return "Example Scraped Data"
1 comment
I have built an Instagram profile scraper in Python, but I want to limit the scraping results to 25 for free-plan users (not paid-plan users).
Can anybody help me out?
2 comments
Hey, I created this simple script:
Plain Text
import asyncio

# Instead of BeautifulSoupCrawler let's use Playwright to be able to render JavaScript.
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
import urllib.parse


async def main(terms: str) -> None:
    crawler = PlaywrightCrawler(headless=False)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Wait for the collection cards to render on the page. This ensures that
        # the elements we want to interact with are present in the DOM.
        await context.page.wait_for_load_state("networkidle")
        await context.infinite_scroll()

    url = f"https://www.youtube.com/results?search_query={urllib.parse.quote(terms)}&sp=EgIwAQ%253D%253D"
    await crawler.run([url])


if __name__ == "__main__":
    asyncio.run(main("music"))

But I want to get the content of the page while infinite_scroll is scrolling it, so that I can see the new content and act on it. However, await context.infinite_scroll() never stops, so I can't put an action after it to run what I want. How can I manage that? (I want to get the links of new YouTube videos.)
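One approach that might work (a sketch; the a#video-title selector is an assumption about YouTube's current markup): run infinite_scroll as a background task and poll the page for new links while it runs, then call this helper from the request handler instead of awaiting infinite_scroll directly.
Plain Text
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawlingContext


async def scroll_and_collect(context: PlaywrightCrawlingContext) -> set[str]:
    seen: set[str] = set()
    # Let infinite_scroll run in the background instead of awaiting it directly.
    scroll_task = asyncio.create_task(context.infinite_scroll())
    while not scroll_task.done():
        hrefs = await context.page.eval_on_selector_all(
            'a#video-title', 'els => els.map(e => e.href)'
        )
        for href in hrefs:
            if href and href not in seen:
                seen.add(href)
                print('New video:', href)
        await asyncio.sleep(1)
    await scroll_task  # propagate any error from the scrolling task
    return seen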
2 comments
Hi guys, I plan to deploy my crawler (a ParselCrawler) to AWS Lambda. I'm loosely following this guide, which is for JavaScript though. I'd like to disable persisting the storage. I change the configuration like this:
Plain Text
config = Configuration.get_global_configuration()
config.persist_storage = False

I also tried to supply the configuration to the ParselCrawler constructor, but neither of these works. The storage directory still gets created. Am I doing something wrong here?
Thanks
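One thing that may be worth trying (an assumption about how the configuration is picked up, since Configuration reads CRAWLEE_-prefixed environment variables): set the variables before Crawlee creates its storage client.
Plain Text
import os

# Assumed environment variable names; set them before the crawler/storages are created.
os.environ['CRAWLEE_PERSIST_STORAGE'] = 'false'
os.environ['CRAWLEE_PURGE_ON_START'] = 'false'

from crawlee.parsel_crawler import ParselCrawler  # noqa: E402

crawler = ParselCrawler()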
1 comment
Hi. I'm extracting prices of products. On the main page I can extract all the information I need except for the fees. If I go through every product individually, I can get the price and fees, but sometimes I lose the fee information because I get blocked on some products. I want to handle this situation: if I extract the fees, I want to add them to my product_item, but if I get blocked, I want to pass this field as empty. I'm using the "Router" class as the Crawlee team explains here: https://crawlee.dev/python/docs/introduction/refactoring. When I add the URL extracted from the first page as shown below, I cannot pass along the data extracted before:

await context.enqueue_links(url='product_url', label='PRODUCT_WITH_FEES')

I want something like this:

await context.enqueue_links(url='product_url', label='PRODUCT_WITH_FEES', data=product_item # type: dict)

But I cannot do the above. How can I do it?

So, my final data would look like this:

If I handle the data correctly, I want something like this:
product_item = {product_id: 1234, price: 50$, fees: 3$}

If I get blocked, I have something like this:
product_item = {product_id: 1234, price: 50$, fees: ''}
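A sketch of one way to carry the partial item along (assuming the BeautifulSoup-based setup from the linked tutorial; the URLs, selectors, and values are placeholders): put it into user_data on an explicitly built Request instead of relying on enqueue_links.
Plain Text
from crawlee import Request
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawlingContext
from crawlee.router import Router

# Pass this router to the crawler, e.g. BeautifulSoupCrawler(request_handler=router).
router = Router[BeautifulSoupCrawlingContext]()


@router.handler('LISTING')
async def listing_handler(context: BeautifulSoupCrawlingContext) -> None:
    # Build the partial item from the main page, then carry it in user_data.
    product_item = {'product_id': 1234, 'price': '50$', 'fees': ''}
    product_url = 'https://example.com/product/1234'  # placeholder URL
    await context.add_requests([
        Request.from_url(
            product_url,
            label='PRODUCT_WITH_FEES',
            user_data={'product_item': product_item},
        )
    ])


@router.handler('PRODUCT_WITH_FEES')
async def product_handler(context: BeautifulSoupCrawlingContext) -> None:
    # Recover the partial item; if the fee cannot be extracted, keep it empty.
    product_item = dict(context.request.user_data['product_item'])
    fee_tag = context.soup.select_one('.fee')  # placeholder selector
    product_item['fees'] = fee_tag.get_text(strip=True) if fee_tag else ''
    await context.push_data(product_item)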
4 comments
Hi. I only want to retrieve tweets where users directly mention me in the tweet, not tweets that mention me because they are replies to tweets that mentioned me. Can you do this? For more details on this issue, please refer to: https://devcommunity.x.com/t/how-to-differentiate-direct-reply-and-mentions/149262.

Thanks.
3 comments
Hi, I'm trying to make a scraper and I don't know how to use a proxy hosted by Apify in my script. I'm sharing some code so you can see what I'm trying to do.
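A minimal sketch, assuming you have your Apify Proxy password from the Apify Console: the Apify Proxy is exposed as a single HTTP endpoint, and the credentials go straight into the proxy URL.
Plain Text
from crawlee.playwright_crawler import PlaywrightCrawler
from crawlee.proxy_configuration import ProxyConfiguration

# '<APIFY_PROXY_PASSWORD>' is a placeholder for the password shown in the Apify Console.
proxy_configuration = ProxyConfiguration(
    proxy_urls=['http://auto:<APIFY_PROXY_PASSWORD>@proxy.apify.com:8000'],
)
crawler = PlaywrightCrawler(proxy_configuration=proxy_configuration)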
2 comments
Hi and good day. I'm creating a POST API that accepts the following JSON body:
{
    "url": "https://crawlee.dev/python/",
    "targets": ["html", "pdf"]
}

The targets list contains the file extensions that my code downloads when it discovers them.

I'm already at my wits' end since I don't understand the error I'm getting, which is:
[crawlee.memory_storage_client._request_queue_client] WARN Error adding request to the queue: Request ID does not match its unique_key.

Has anyone encountered this problem?

The following is my whole code:
9 comments
The Crawlee code works when I put the URL directly inside crawler.run([url]), but when I put the code inside an endpoint and call it from Postman, a NotImplementedError is raised.

How do I use Crawlee with FastAPI?
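The NotImplementedError often points at how the event loop is set up (for example, calling asyncio.run() inside the server's already-running loop, or the Windows event loop policy when a browser subprocess is spawned). A sketch of one way to wire Crawlee into FastAPI, awaiting the crawler directly on the server's loop (ParselCrawler is used here only as an example):
Plain Text
from fastapi import FastAPI
from pydantic import BaseModel

from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext

app = FastAPI()


class ScrapeRequest(BaseModel):
    url: str


@app.post('/scrape')
async def scrape(body: ScrapeRequest) -> dict:
    results: list[dict] = []
    crawler = ParselCrawler()

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        results.append({
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        })

    # Await the crawler on the server's own event loop; no asyncio.run() here.
    await crawler.run([body.url])
    return {'items': results}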
I am interested in writing a blog post about using Apify and Qdrant. I would like to know if this would create any costs, as I am new to Apify. The idea is to combine Apify with Qdrant and AWS services, create an app, and showcase it on Medium and on any blog where Apify is interested. I would love to get in contact with anyone from Apify to discuss this opportunity.

Mainly I am interested in scraping FAQ pages from AWS, such as:

https://aws.amazon.com/sagemaker/faqs/
2 comments
Can someone tell me how to send mouse events to an inactive (unfocused) window on Windows?
I'm looking for a solution for scraping a website that fills in product details via JavaScript on scroll.

Will Crawlee "scroll" and scrape after this content loads, or will this require other means?

Thank you!
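PlaywrightCrawler renders the page in a real browser, so it can scroll and then scrape the loaded DOM; a sketch (the selector is a placeholder):
Plain Text
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler()


@crawler.router.default_handler
async def handler(context: PlaywrightCrawlingContext) -> None:
    # Keep scrolling until no new content appears, then read the rendered DOM.
    await context.infinite_scroll()
    titles = await context.page.eval_on_selector_all(
        '.product-title',  # placeholder selector
        'els => els.map(e => e.textContent)',
    )
    await context.push_data({'url': context.request.url, 'titles': titles})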
[crawlee._autoscaling.snapshotter] WARN Memory is critically overloaded. Using 2.54 GB of 1.94 GB (131%). Consider increasing available memory.

How do you increase the available memory? I have 8 GB, but it's only using about 2 GB.
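A sketch of raising the limit (the variable and field names are based on Crawlee's Configuration and are worth double-checking for your version):
Plain Text
import os

# Option A: environment variable, set before the crawler starts.
os.environ['CRAWLEE_MEMORY_MBYTES'] = '6144'  # e.g. let Crawlee use ~6 GB of the 8 GB

# Option B: the Configuration object, if your version exposes memory_mbytes there.
from crawlee.configuration import Configuration

config = Configuration(memory_mbytes=6144)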
When I add my proxy configuration to the Playwright crawler using the chromium browser type, it throws an error, whereas it doesn't throw an error when I specify the firefox browser.

Plain Text
proxy_configuration = ProxyConfiguration(
    # proxy settings omitted in the original post
)

crawler = PlaywrightCrawler(
    # Limit the crawl to max requests. Remove or increase it for crawling all links.
    max_requests_per_crawl=10,
    # Headless mode, set to False to see the browser in action.
    headless=False,
    # Browser types supported by Playwright.
    browser_type='chromium',
    proxy_configuration=proxy_configuration,
    use_session_pool=True,
)
6 comments
How do I set the username and password of my proxy when using Crawlee for Python?
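A minimal sketch: the credentials usually go straight into the proxy URL (host, port, username, and password below are placeholders).
Plain Text
from crawlee.proxy_configuration import ProxyConfiguration

proxy_configuration = ProxyConfiguration(
    proxy_urls=['http://my_username:my_password@proxy.example.com:8000'],
)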
4 comments

Crawlee Proxy

How do I use proxy servers with Crawlee if I don't have access to third-party proxies?
2 comments