Apify

Apify and Crawlee Official Forum

Hey folks, we are trying to integrate a proxy into our crawlers. The issue is that the proxy needs a certificate to be present before it'll allow us to authenticate, and I couldn't find any option for this in the documentation.

Is there a way I can add those certs in Crawlee/Playwright? Or, if Crawlee exposes agentOptions from Playwright anywhere (couldn't find it in the docs), that would also work, as per https://github.com/microsoft/playwright/issues/1799#issuecomment-959011162
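
For illustration only, a minimal sketch of one direction, assuming the certificate is a CA bundle that Node's TLS needs to trust (which covers sendRequest/got-scraping) and that the proxy itself can be passed through Playwright's standard launch options; the paths, proxy address, and credentials are placeholders, and this is not a documented Crawlee-specific option:
TypeScript
// Run the process with the proxy's CA bundle trusted by Node, e.g.:
//   NODE_EXTRA_CA_CERTS=/path/to/proxy-ca.pem node dist/main.js
// (NODE_EXTRA_CA_CERTS is read at process startup, so it must be set in the shell.)
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            // Playwright's standard proxy launch option; all values are placeholders.
            proxy: {
                server: 'http://proxy.example.com:8000',
                username: 'proxy-user',
                password: 'proxy-pass',
            },
        },
    },
    requestHandler: async ({ page, log }) => {
        log.info(`Loaded ${page.url()}`);
    },
});

await crawler.run(['https://example.com']);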
26 comments
Hey folks, I have a bit of a weird use case here. I need to scrape multiple sites, and I know the recommended approach is to use a single scraper and pass the URLs into the starting queue.

The issue with that is that the logging gets messed up, with all the different sites' logs getting mixed together. To fix that I created multiple loggers, but the problem I am facing is:

when we import log from crawlee, it looks like it's shared between all the places that import it, and I have a custom logging implementation which is different for each site (mainly in the format of the logs), and I need Crawlee to respect that.

So to fix this I looked at the docs: the PlaywrightCrawler constructor has a log option where you can pass in your own logger, and while Crawlee's internal logging then uses the correct format, it does nothing to the logs which I explicitly write in my route handler.

So how do I either get separate log instances for each crawler, or make my route handler respect the logger I passed when I created the crawler?
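
For reference, a minimal sketch of the setup described above, assuming the context-scoped log handed to the route handler inherits from the logger passed in the crawler options; the SiteA prefix and the router layout are illustrative:
TypeScript
import { PlaywrightCrawler, createPlaywrightRouter, log } from 'crawlee';

// One child logger per site; child() and its prefix option come from @apify/log.
const siteALog = log.child({ prefix: 'SiteA' });

const router = createPlaywrightRouter();
router.addDefaultHandler(async ({ log, request }) => {
    // Use the context-scoped `log` here rather than the module-level import,
    // so the handler picks up whatever logger the crawler was created with.
    log.info(`Handling ${request.url}`);
});

const crawler = new PlaywrightCrawler({
    requestHandler: router,
    log: siteALog,
});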
6 comments
Hey folks, basically the title: I want to stop scraping in the middle of a route handler if some condition is met, because the full function is a little expensive computationally.

I am using return; to exit the route handler when the condition is met, but I am facing issues with the Node process hanging indefinitely after scraping is done. I had a previous thread about crawler.teardown() and how removing the return statement stopped the hanging issue.

But even in handlers where I am not calling crawler.teardown(), I am facing hanging issues.

Is there a better way to accomplish this?
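
For context, a sketch of the early-return pattern being described, with isInDatabase standing in as a hypothetical placeholder for the expensive check:
TypeScript
import { createPlaywrightRouter } from 'crawlee';

// Hypothetical placeholder for the poster's actual duplicate/DB check.
const isInDatabase = async (url: string): Promise<boolean> => false;

const router = createPlaywrightRouter();

router.addHandler('DETAIL', async ({ request, log }) => {
    if (await isInDatabase(request.url)) {
        log.debug(`Skipping ${request.url}, already in the database`);
        return; // skip the expensive work; the crawler keeps draining the queue
    }
    // ...expensive extraction continues here...
});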
10 comments
I am running a Playwright crawler and I call await crawler.teardown() after some condition is met.

The issue with this approach is that the Crawlee process hangs indefinitely and doesn't exit even after a while; I have to force-stop it with CTRL + C. From searching GitHub, it looks like it's an issue with some timeouts, but I haven't set one anywhere.

Here's how I am initializing my crawler:
Plain Text
import { PlaywrightCrawler } from 'crawlee';

export const crawler = new PlaywrightCrawler({
    requestHandler: router, // router is defined elsewhere in the project
    maxRequestsPerMinute: 10,
});

and I have a Playwright config where I have set the default timeout:
Plain Text
import { defineConfig } from '@playwright/test';

export default defineConfig({
    timeout: 3000,
});
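
One thing worth noting: a playwright.config.ts built with defineConfig configures the Playwright test runner, not the browsers that Crawlee launches itself, so it won't cap anything in this crawler. If the goal is to bound timeouts on the crawler side, the crawler-level options would look roughly like this (the values and the router import path are illustrative):
TypeScript
import { PlaywrightCrawler } from 'crawlee';
import { router } from './routes.js'; // hypothetical path to the router module

export const crawler = new PlaywrightCrawler({
    requestHandler: router,
    maxRequestsPerMinute: 10,
    navigationTimeoutSecs: 30,     // per-navigation budget (illustrative value)
    requestHandlerTimeoutSecs: 60, // whole-handler budget (illustrative value)
});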
3 comments
Hey folks,

I have a site which gives out internal links that then redirect to the actual page, and I want to scrape the redirected link. I tried using the sendRequest API, but the response's url only had the internal URL and not the actual link, so now I am doing it via Playwright like this:
Plain Text
        const context = page.context();
        const newPagePromise = context.waitForEvent('page');
        await applyNowBtn.click();
        const newPage = await newPagePromise;
        await newPage.waitForTimeout(3000);
        log.info(`url is ${newPage.url()}`);
        await newPage.close();

The issue with this approach is (aside from having to use a timeout) that Crawlee logs a bunch of "can't set cookie" errors, which adds a lot of unnecessary noise to my logs. I just wanted to know if there was a better approach to handle this?
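
One possible variation, sketched on the assumption that the new tab is opened as a popup of the current page: wait for the popup event and a load state instead of a fixed timeout (applyNowBtn is the locator from the snippet above; a client-side redirect after load may still need an extra page.waitForURL()):
TypeScript
const [newPage] = await Promise.all([
    page.waitForEvent('popup'), // resolves with the newly opened tab
    applyNowBtn.click(),
]);
await newPage.waitForLoadState('domcontentloaded'); // wait for the target instead of a fixed delay
log.info(`url is ${newPage.url()}`);
await newPage.close();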
4 comments
In Selenium we can check for element visibility either by using a WebDriverWait or by manually checking for the element. In Playwright this is done through expect assertions, and while we can wait for an element using waitForSelector with an appropriate timeout, I couldn't find a way to use expect locator assertions inside the crawler.

Is there a way I can use them?

ref: https://playwright.dev/docs/api/class-page#page-get-by-title
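
For reference, a sketch of visibility waiting with plain Playwright locator calls inside a handler, which works without the test runner; whether the expect assertions from @playwright/test can be used outside it is left to the thread:
TypeScript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, log }) => {
        const heading = page.locator('h1'); // illustrative selector
        // waitFor() polls until the element is visible or the timeout elapses,
        // much like a WebDriverWait in Selenium.
        await heading.waitFor({ state: 'visible', timeout: 5000 });
        log.info(`Heading text: ${await heading.textContent()}`);
    },
});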
3 comments
Hey folks, I want to stop the scraper/crawler if I hit some arbitrary condition. Is there a way I can do so from inside the requestHandler? The closest function I found is crawler.teardown(), but it can't be executed inside a handler.
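
A sketch of a purely user-land workaround, not an official stop API: flip a module-level flag when the condition is hit and have the handler return early for everything still queued (someArbitraryCondition is a placeholder):
TypeScript
import { createPlaywrightRouter } from 'crawlee';

// Placeholders for the poster's actual stop condition.
const someArbitraryCondition = () => false;
let stopRequested = false;

const router = createPlaywrightRouter();

router.addDefaultHandler(async ({ request, log }) => {
    if (stopRequested) return; // drain the rest of the queue cheaply

    // ...scraping logic...

    if (someArbitraryCondition()) {
        log.info(`Stop condition hit at ${request.url}`);
        stopRequested = true;
    }
});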
12 comments
Is there a way we can pass in some variables and have them accessible inside the crawler? The main use case is to have some internal variables which we can check/modify and use to execute some conditional logic.

For example, let's say we have a duplicate_count variable (I know we already have this functionality, but this is just an example): I'd update it if the data is already there in my DB and stop the crawler if the count exceeds some threshold. Is there a way I could implement this?
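
A sketch of one way to keep such counters with plain module-level state; existsInDb, the threshold, and the DETAIL label are all illustrative. Per-request values can also travel on request.userData, which every handler receives.
TypeScript
import { createPlaywrightRouter } from 'crawlee';

// Illustrative module-level state shared by every handler of this crawler.
const state = { duplicateCount: 0 };
const DUPLICATE_THRESHOLD = 25;                  // illustrative value
const existsInDb = async (url: string) => false; // hypothetical DB check

const router = createPlaywrightRouter();

router.addHandler('DETAIL', async ({ request, log }) => {
    if (await existsInDb(request.url)) state.duplicateCount += 1;

    if (state.duplicateCount >= DUPLICATE_THRESHOLD) {
        log.info('Duplicate threshold reached, skipping the rest of the queue');
        return;
    }
    // ...normal extraction...
});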
13 comments
Hey folks, I have a list of URLs like ["https//google.com"] etc., and when I call enqueueLinks({urls, label:'DETAIL'}), none of the links are enqueued and the crawler stops right there. But if I do
Plain Text
await crawler.addRequests(filteredLinks.map((link) => ({ url: link, label: 'DETAIL' })));

the links are added as expected and the crawler works fine. I just wanted to know what the difference between the two is, and why enqueueLinks was not working here? Crawlee version is 3.7.0.
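
A hedged guess rather than a confirmed diagnosis: enqueueLinks applies an enqueueing strategy (same-hostname by default), which can silently drop URLs pointing at other domains, while addRequests does no such filtering. If that is what's happening here, relaxing the strategy would look like this (filteredLinks is the array from the snippet above, and the call runs inside a request handler):
TypeScript
await enqueueLinks({
    urls: filteredLinks,
    label: 'DETAIL',
    strategy: 'all', // do not restrict enqueued URLs to the current hostname
});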
6 comments
Hey folks,

I am explicitly creating request queues for my crawlers to make sure that crawler-specific run options such as maxRequestsPerCrawl can be set on a per-crawler basis. The issue with this approach is that the request queues are not getting purged after every crawl, resulting in the crawlers resuming the previous session. These are the approaches I tried:

  1. Setting the option purgeRequestQueue to true explicitly in the crawler.run() call, which results in this error: "Did not expect property `purgeRequestQueue` to exist, got `true` in object options"
  2. Setting it as a global option in crawlee.json (it looks like Crawlee is not picking up my crawlee.json file at all, because I tried to set logging levels in it and Crawlee didn't pick them up).
  3. Using `await purgeDefaultStorages()` in my entry file.

None of these options are working. Is there some other way to purge these queues? I know purging is enabled by default, but it's not working for my named queues.

Also, is using named queues the best way to isolate crawler-specific options for each crawler? When I used the default queue and restricted crawls to some numeric value in one crawler, all the other crawlers would also shut down once it reached that value, logging that max requests per crawl had been reached, despite me not having specified this option when I initialized them.
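
A sketch of one way around this, assuming the named queue is the per-site isolation mechanism: named storages are kept across runs, unlike the default ones, so the queue can be dropped explicitly once the run finishes. The queue name, limit, and handler below are illustrative.
TypeScript
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open('site-a'); // hypothetical queue name

const crawler = new PlaywrightCrawler({
    requestQueue,
    maxRequestsPerCrawl: 100, // per-crawler limit, illustrative value
    requestHandler: async ({ request, log }) => {
        log.info(`Processing ${request.url}`);
    },
});

await crawler.run(['https://example.com']);
await requestQueue.drop(); // the next run starts with a fresh 'site-a' queue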
9 comments
Hey folks,
I had to write a custom logger implementation because I needed to store the logs in a file and rotate them as well. I have extended LoggerText and implemented its methods, and while it's working, it's skipping a lot of the debug messages that we get with the built-in logger, and it can't handle objects being passed to it. I have looked at the source code and have stuck to it as much as I could.

I would really appreciate it if someone could point out what's wrong with it.

code: https://pastebin.com/tpGxyM0P

actual output: https://pastebin.com/ABtC8QSa
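
The pastebins aren't reproduced here, but one assumption worth ruling out for the missing debug lines: the log level defaults to INFO, so debug records are filtered out before they ever reach the Logger implementation. Raising it is a one-liner:
TypeScript
import { log, LogLevel } from 'crawlee';

// Let debug records through to whichever Logger implementation is attached.
log.setLevel(LogLevel.DEBUG);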
10 comments
I have been trying to port our scrapers from Selenium/Python to Crawlee, mainly because of the anti-bot protections already built into it. The issue is that I am having a hard time translating our functions 1-to-1 from Selenium to Crawlee, because a lot of them depend on the Selenium driver, or in Playwright's case the browser instance. For example:

I need to click on an element to get a link because there's a redirect in between, and I need to wait for it before grabbing it. I can't use enqueueLinksByClickingElements because I need the link in the same request for my dataset to be complete.

There are other such issues I am having trouble with. I know we have Page exposed, but that's just a single tab in a browser's context, and I need more control than that for my use case.

Is this something that's possible with Crawlee, or are there any workarounds I can use to get the same functionality?
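
On the "more control than a single Page" point, a small sketch of what is reachable from the handler context (what to do with it is left to the thread): the page's BrowserContext and the underlying Browser are both accessible, so driver-style operations aren't limited to the one tab.
TypeScript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, log }) => {
        const context = page.context();           // the BrowserContext this tab belongs to
        const browser = context.browser();        // the underlying Browser (may be null for persistent contexts)
        const extraTab = await context.newPage(); // e.g. a second tab in the same session
        log.info(`Browser ${browser?.version()}, ${context.pages().length} open pages`);
        await extraTab.close();
    },
});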
14 comments
Hey everyone,

What is the recommended way to structure your code to scrape multiple sites? I looked at a few questions here, and it seems the recommended way is to use a single crawler with multiple routers to handle this case. The issue I'm facing with that is that when you enqueue links, you'll add site-1 and then site-2 initially before starting the crawler, and then the crawler will dynamically add links as needed. But since we are using a FIFO queue, it'll crawl the first link, add its extracted links to the queue, then crawl the second link and add its links, and so on, constantly switching context between the two sites, which makes the logs a mess. Also, routers don't seem to have a URL parameter, just a label and then the request, so we'd basically have to define handlers for every site in a single router, right? Which will just bloat up a single file.

Is there a better way I can structure this? The use case is to set up crawlers for 10+ sites and crawl them sequentially or in parallel while keeping the logging sane.
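
One possible shape for this, sketched with illustrative names rather than taken from the docs: one crawler per site, each with its own named request queue, its own router, and a prefixed child logger, run sequentially from the entry point.
TypeScript
import { PlaywrightCrawler, RequestQueue, createPlaywrightRouter, log } from 'crawlee';

// Illustrative per-site definitions; a real project would likely keep each
// site's router and handlers in their own module.
const sites = [
    { name: 'site-1', startUrl: 'https://example.com' },
    { name: 'site-2', startUrl: 'https://example.org' },
];

for (const site of sites) {
    const router = createPlaywrightRouter();
    router.addDefaultHandler(async ({ log, request }) => {
        log.info(`Crawling ${request.url}`);
        // ...site-specific handling...
    });

    const crawler = new PlaywrightCrawler({
        requestQueue: await RequestQueue.open(site.name), // isolates per-site queue state
        requestHandler: router,
        log: log.child({ prefix: site.name }),            // keeps each site's log lines distinguishable
    });

    await crawler.run([site.startUrl]); // sequential: the loop waits for each site to finish
}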
21 comments