Apify and Crawlee Official Forum

Does the Python version of Crawlee allow multiple crawlers to be run using one router?
Plain Text
router = Router[BeautifulSoupCrawlingContext]()

Just asking because a colleague asked me whether it would be possible: curl requests are a lot faster than Playwright, so if we could use curl for half the requests and only load the browser for the portion where it's needed, it could significantly speed up some processes.
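For reference, a Router is typed to a single crawling context, so one pattern is to give each crawler its own router and let the fast HTTP crawler hand off only the JavaScript-heavy URLs to a Playwright crawler. A rough sketch, assuming crawlee for Python (the marker selector, hand-off list, and start URL are hypothetical, and import paths can differ slightly between versions):

Plain Text
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.router import Router

# One router per crawler, since each router is typed over its crawling context.
bs_router = Router[BeautifulSoupCrawlingContext]()
pw_router = Router[PlaywrightCrawlingContext]()

# URLs that turn out to need a real browser are collected here (hypothetical hand-off).
needs_browser: list[str] = []


@bs_router.default_handler
async def bs_handler(context: BeautifulSoupCrawlingContext) -> None:
    context.log.info(f'HTTP pass: {context.request.url}')
    if context.soup.select_one('div#js-app'):  # hypothetical marker for JS-rendered pages
        needs_browser.append(context.request.url)


@pw_router.default_handler
async def pw_handler(context: PlaywrightCrawlingContext) -> None:
    context.log.info(f'Browser pass: {context.request.url}')


async def main() -> None:
    # Fast HTTP-based crawl first...
    bs_crawler = BeautifulSoupCrawler(request_handler=bs_router)
    await bs_crawler.run(['https://crawlee.dev'])

    # ...then load the browser only for the pages that need it.
    if needs_browser:
        pw_crawler = PlaywrightCrawler(request_handler=pw_router)
        await pw_crawler.run(needs_browser)


if __name__ == '__main__':
    asyncio.run(main())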
1 comment
I’m working on a project using PlaywrightCrawler to scrape links from a dynamic JavaScript-rendered website. The challenge is that the <a> tags don’t have href attributes, so I need to click on them and capture the resulting URLs.

  • Delayed Link Rendering: Links are dynamically rendered with JavaScript, often taking time due to a loader. How can I ensure all links are loaded before clicking?
  • Navigation Issues: Some links don’t navigate as expected or fail when trying to open in a new context.
  • Memory Overload: I get the warning "Memory is critically overloaded" during crawls.
I've attached images of my code (it was too long so I couldn't paste it)

How can I handle these issues more efficiently, especially for dynamic and JavaScript-heavy sites?
I would appreciate any help
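One way to approach all three issues at once, sketched with crawlee for Python (the link selector, timeout, start URL, and concurrency limit are assumptions, not taken from the attached code): wait for the loader to finish before touching the links, click each link and capture the URL it lands on, and keep browser concurrency low so memory stays under control.

Plain Text
import asyncio

from crawlee import ConcurrencySettings
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

LINK_SELECTOR = 'a.result-item'  # hypothetical selector for the JS-rendered links


async def main() -> None:
    crawler = PlaywrightCrawler(
        # Fewer concurrent browser pages -> less risk of "Memory is critically overloaded".
        concurrency_settings=ConcurrencySettings(max_concurrency=2),
    )

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page
        # Wait until the loader has finished and the links are actually rendered.
        await page.wait_for_selector(LINK_SELECTOR, state='visible', timeout=30_000)

        count = await page.locator(LINK_SELECTOR).count()
        for i in range(count):
            link = page.locator(LINK_SELECTOR).nth(i)
            # The <a> tags have no href, so click and capture the URL we end up on.
            async with page.expect_navigation():
                await link.click()
            await context.push_data({'source': context.request.url, 'target': page.url})
            # Return to the listing and wait for the links to render again.
            await page.go_back()
            await page.wait_for_selector(LINK_SELECTOR, state='visible')

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())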
2 comments
Hi,
I need help finding an actor, or configuring the Website Content Crawler, to extract all the URLs from a site without their content. I want to filter the URLs by keywords to find the one I'm looking for, but I don't need the content of those pages.

Thanks for your help
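If an existing actor doesn't fit, a small crawler that records only matching URLs and never stores page content is one option. A minimal sketch with crawlee for Python (the keywords and start URL are placeholders):

Plain Text
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

KEYWORDS = ('pricing', 'contact')  # placeholder keywords to filter URLs by


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        # Save only the URL itself when it matches a keyword; no page content is stored.
        if any(keyword in context.request.url.lower() for keyword in KEYWORDS):
            await context.push_data({'url': context.request.url})
        # Keep following links so the whole site gets discovered.
        await context.enqueue_links()

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())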
When I build the actor and run it, I get the following error:
2025-01-10T18:47:43.475Z Traceback (most recent call last):
2025-01-10T18:47:43.476Z File "<frozen runpy>", line 198, in _run_module_as_main
2025-01-10T18:47:43.477Z File "<frozen runpy>", line 88, in _run_code
2025-01-10T18:47:43.478Z File "/usr/src/app/src/__main__.py", line 3, in <module>
2025-01-10T18:47:43.479Z from .main import main
2025-01-10T18:47:43.479Z File "/usr/src/app/src/main.py", line 9, in <module>
2025-01-10T18:47:43.480Z from apify import Actor
2025-01-10T18:47:43.481Z File "/usr/local/lib/python3.12/site-packages/apify/__init__.py", line 7, in <module>
2025-01-10T18:47:43.482Z from apify._actor import Actor
2025-01-10T18:47:43.483Z File "/usr/local/lib/python3.12/site-packages/apify/_actor.py", line 16, in <module>
2025-01-10T18:47:43.483Z from crawlee import service_container
2025-01-10T18:47:43.484Z ImportError: cannot import name 'service_container' from 'crawlee' (/usr/local/lib/python3.12/site-packages/crawlee/__init__.py)

I did not change anything in my Dockerfile:
FROM apify/actor-python:3.12
COPY requirements.txt ./
...


In requirements.txt I install the following module:
apify ~= 2.0.0

Anyone else facing the same issue?
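One possible explanation (an assumption, not confirmed here): the traceback shows apify 2.0.x importing service_container from crawlee, and an unpinned crawlee dependency resolved at build time may no longer ship that module. Pinning a crawlee version that still provides it, or moving to an apify release that no longer imports it, should make the build reproducible. Illustrative requirements.txt (the version bounds are an assumption):

Plain Text
apify ~= 2.0.0
# Assumption: pin crawlee to a release that still exposes 'service_container'.
crawlee < 0.5.0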
3 comments
I want to use the AdaptivePlaywrightCrawler, but it seems like it wants to crawl the entire web.
Here is my code.

const crawler = new AdaptivePlaywrightCrawler({
    renderingTypeDetectionRatio: 0.1,
    maxRequestsPerCrawl: 50,
    async requestHandler({ request, enqueueLinks, parseWithCheerio, querySelector, log, urls }) {
        console.log(request.url, request.uniqueKey);
        await enqueueLinks();
    },
});

crawler.run(['https://crawlee.dev']);
3 comments
Hi everyone! I am creating a crawler using crawlee for Python. I noticed the Parsel crawler makes requests at a much higher frequency than the BeautifulSoup crawler. Is there a way to make the Parsel crawler slower, so we can better avoid getting blocked? Thanks!
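Request rate is usually controlled through the crawler's concurrency settings; a minimal sketch for slowing down a ParselCrawler (the exact limits are placeholders):

Plain Text
from crawlee import ConcurrencySettings
from crawlee.parsel_crawler import ParselCrawler

# Lower parallelism and cap the request rate so the crawl is gentler on the target site.
crawler = ParselCrawler(
    concurrency_settings=ConcurrencySettings(
        max_concurrency=5,        # placeholder: at most 5 requests in flight
        max_tasks_per_minute=60,  # placeholder: roughly one request per second
    ),
)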
5 comments
Are there actually any resources on building a scraper with crawlee other than the ones in the docs?
Where do I set all the browser context options, for example?

Plain Text
const launchPlaywright = async () => {
  const browser = await playwright["chromium"].launch({
    headless: true,
    args: ["--disable-blink-features=AutomationControlled"],
  });

  const context = await browser.newContext({
    viewport: { width: 1280, height: 720 },
    userAgent:
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
    geolocation: { longitude: 7.8421, latitude: 47.9978 },
    permissions: ["geolocation"],
    locale: "en-US",
    storageState: "playwright/auth/user.json",
  });
  return await context.newPage();
};
2 comments
Hello, I'm seeing (https://playwright.dev/python/docs/library#incompatible-with-selectoreventloop-of-asyncio-on-windows) that there is an incompatibility between Playwright and the asyncio SelectorEventLoop on Windows, which Crawlee seems to require. Can you confirm whether it is possible to use a PlaywrightCrawlingContext in a Windows environment? I'm running into an asyncio NotImplementedError when trying to run the crawler, which suggests to me that there might be an issue. Thanks for the help.
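A common workaround sketch, assuming the NotImplementedError comes from the Selector event loop being active: switch to the Proactor loop before starting the crawler, since Playwright launches the browser through subprocesses, which the Selector loop does not support.

Plain Text
import asyncio
import sys

# Assumption: the crawler is started from our own entry point, so the
# event loop policy can be set before anything creates a loop.
if sys.platform == 'win32':
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())

# ...then run the PlaywrightCrawler as usual, e.g. asyncio.run(main())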
1 comment