Apify

Apify and Crawlee Official Forum

Hey folks, we are trying to integrate a proxy into our crawlers. The issue is that the proxy needs a certificate to be present before it'll allow us to authenticate, and I couldn't find any option for this in the documentation.

Is there a way I can add those certs in Crawlee/Playwright? Or, if Crawlee exposes agentOptions from Playwright anywhere (couldn't find it in the docs), that would also work, as per https://github.com/microsoft/playwright/issues/1799#issuecomment-959011162
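
For illustration only, a minimal sketch of one direction, assuming the certificate is a CA bundle that Node's TLS needs to trust (which covers sendRequest/got-scraping) and that the proxy itself can be passed through Playwright's standard launch options; the paths, proxy address, and credentials are placeholders, and this is not a documented Crawlee-specific option:
TypeScript
// Run the process with the proxy's CA bundle trusted by Node, e.g.:
//   NODE_EXTRA_CA_CERTS=/path/to/proxy-ca.pem node dist/main.js
// (NODE_EXTRA_CA_CERTS is read at process startup, so it must be set in the shell.)
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            // Playwright's standard proxy launch option; all values are placeholders.
            proxy: {
                server: 'http://proxy.example.com:8000',
                username: 'proxy-user',
                password: 'proxy-pass',
            },
        },
    },
    requestHandler: async ({ page, log }) => {
        log.info(`Loaded ${page.url()}`);
    },
});

await crawler.run(['https://example.com']);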
26 comments
Hey folks, I have a bit of a weird use case here. I need to scrape multiple sites, and I know the recommended approach is to use a single scraper and pass the URLs into the starting queue.

The issue with that is that the logging gets messed up, with all the different sites' logs getting mixed together. To fix that I created multiple loggers, but the problem I am facing is:

when we import log from crawlee, it looks like it's shared between all the places that import it, and I have a custom logging implementation which is different for each site (mainly in the format of the logs), and I need Crawlee to respect that.

So to fix this I looked at the docs: the PlaywrightCrawler constructor has a log option where you can pass in your own logger, and while Crawlee's internal logging then uses the correct format, it does nothing to the logs which I explicitly write in my route handler.

So how do I either get separate log instances for each crawler, or make my route handler respect the logger I passed when I created the crawler?
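
For reference, a minimal sketch of the setup described above, assuming the context-scoped log handed to the route handler inherits from the logger passed in the crawler options; the SiteA prefix and the router layout are illustrative:
TypeScript
import { PlaywrightCrawler, createPlaywrightRouter, log } from 'crawlee';

// One child logger per site; child() and its prefix option come from @apify/log.
const siteALog = log.child({ prefix: 'SiteA' });

const router = createPlaywrightRouter();
router.addDefaultHandler(async ({ log, request }) => {
    // Use the context-scoped `log` here rather than the module-level import,
    // so the handler picks up whatever logger the crawler was created with.
    log.info(`Handling ${request.url}`);
});

const crawler = new PlaywrightCrawler({
    requestHandler: router,
    log: siteALog,
});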
6 comments
Hey folks, basically the title: I want to stop scraping in the middle of a route handler if some condition is met, because the full function is a little expensive computationally.

I am using return; to exit the route handler when the condition is met, but I am facing issues with the Node process hanging indefinitely after scraping is done. I had a previous thread about crawler.teardown() and how removing the return statement stopped the hanging issue.

But even in handlers where I am not calling crawler.teardown(), I am facing hanging issues.

Is there a better way to accomplish this?
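
For context, a sketch of the early-return pattern being described, with isInDatabase standing in as a hypothetical placeholder for the expensive check:
TypeScript
import { createPlaywrightRouter } from 'crawlee';

// Hypothetical placeholder for the poster's actual duplicate/DB check.
const isInDatabase = async (url: string): Promise<boolean> => false;

const router = createPlaywrightRouter();

router.addHandler('DETAIL', async ({ request, log }) => {
    if (await isInDatabase(request.url)) {
        log.debug(`Skipping ${request.url}, already in the database`);
        return; // skip the expensive work; the crawler keeps draining the queue
    }
    // ...expensive extraction continues here...
});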
10 comments
I am running a Playwright crawler and I call await crawler.teardown() after some condition is met.

The issue with this approach is that the Crawlee process hangs indefinitely and doesn't exit even after a while; I have to force-stop it with CTRL + C. From searching GitHub, it looks like it's an issue with some timeouts, but I haven't set one anywhere.

Here's how I am initializing my crawler:
Plain Text
import { PlaywrightCrawler } from 'crawlee';

export const crawler = new PlaywrightCrawler({
    requestHandler: router, // router is defined elsewhere in the project
    maxRequestsPerMinute: 10,
});

and I have a Playwright config where I have set the default timeout:
Plain Text
import { defineConfig } from '@playwright/test';

export default defineConfig({
    timeout: 3000,
});
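
One thing worth noting: a playwright.config.ts built with defineConfig configures the Playwright test runner, not the browsers that Crawlee launches itself, so it won't cap anything in this crawler. If the goal is to bound timeouts on the crawler side, the crawler-level options would look roughly like this (the values and the router import path are illustrative):
TypeScript
import { PlaywrightCrawler } from 'crawlee';
import { router } from './routes.js'; // hypothetical path to the router module

export const crawler = new PlaywrightCrawler({
    requestHandler: router,
    maxRequestsPerMinute: 10,
    navigationTimeoutSecs: 30,     // per-navigation budget (illustrative value)
    requestHandlerTimeoutSecs: 60, // whole-handler budget (illustrative value)
});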
3 comments
Hey folks,

I have a site which gives out internal links that then redirect to the actual page, and I want to scrape the redirected link. I tried using the sendRequest API, but the response's url only had the internal URL and not the actual link, so now I am doing it via Playwright like this:
Plain Text
        const context = page.context();
        const newPagePromise = context.waitForEvent('page');
        await applyNowBtn.click();
        const newPage = await newPagePromise;
        await newPage.waitForTimeout(3000);
        log.info(`url is ${newPage.url()}`);
        await newPage.close();

The issue with this approach is (aside from having to use a timeout) that Crawlee logs a bunch of "can't set cookie" errors, which adds a lot of unnecessary noise to my logs. I just wanted to know if there was a better approach to handle this?
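
One possible variation, sketched on the assumption that the new tab is opened as a popup of the current page: wait for the popup event and a load state instead of a fixed timeout (applyNowBtn is the locator from the snippet above; a client-side redirect after load may still need an extra page.waitForURL()):
TypeScript
const [newPage] = await Promise.all([
    page.waitForEvent('popup'), // resolves with the newly opened tab
    applyNowBtn.click(),
]);
await newPage.waitForLoadState('domcontentloaded'); // wait for the target instead of a fixed delay
log.info(`url is ${newPage.url()}`);
await newPage.close();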
4 comments
In Selenium we can check for element visibility either by using a WebDriverWait or by manually checking for the element. In Playwright this is done through expect assertions, and while we can wait for an element using waitForSelector with an appropriate timeout, I couldn't find a way to use expect locator assertions inside the crawler.

Is there a way I can use them?

ref: https://playwright.dev/docs/api/class-page#page-get-by-title
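
For reference, a sketch of visibility waiting with plain Playwright locator calls inside a handler, which works without the test runner; whether the expect assertions from @playwright/test can be used outside it is left to the thread:
TypeScript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, log }) => {
        const heading = page.locator('h1'); // illustrative selector
        // waitFor() polls until the element is visible or the timeout elapses,
        // much like a WebDriverWait in Selenium.
        await heading.waitFor({ state: 'visible', timeout: 5000 });
        log.info(`Heading text: ${await heading.textContent()}`);
    },
});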
3 comments
Hey folks, I want to stop the scraper/crawler if I hit some arbitrary condition. Is there a way I can do so from inside the requestHandler? The closest function I found is crawler.teardown(), but it can't be executed inside a handler.
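
A sketch of a purely user-land workaround, not an official stop API: flip a module-level flag when the condition is hit and have the handler return early for everything still queued (someArbitraryCondition is a placeholder):
TypeScript
import { createPlaywrightRouter } from 'crawlee';

// Placeholders for the poster's actual stop condition.
const someArbitraryCondition = () => false;
let stopRequested = false;

const router = createPlaywrightRouter();

router.addDefaultHandler(async ({ request, log }) => {
    if (stopRequested) return; // drain the rest of the queue cheaply

    // ...scraping logic...

    if (someArbitraryCondition()) {
        log.info(`Stop condition hit at ${request.url}`);
        stopRequested = true;
    }
});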
12 comments
Is there a way we can pass in some variables and have them accessible inside the crawler? The main use case is to have some internal variables which we can check/modify and use to execute some conditional logic.

For example, let's say we have a duplicate_count variable (I know we already have this functionality, but this is just an example): I'd update it if the data is already there in my DB and stop the crawler if the count exceeds some threshold. Is there a way I could implement this?
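
A sketch of one way to keep such counters with plain module-level state; existsInDb, the threshold, and the DETAIL label are all illustrative. Per-request values can also travel on request.userData, which every handler receives.
TypeScript
import { createPlaywrightRouter } from 'crawlee';

// Illustrative module-level state shared by every handler of this crawler.
const state = { duplicateCount: 0 };
const DUPLICATE_THRESHOLD = 25;                  // illustrative value
const existsInDb = async (url: string) => false; // hypothetical DB check

const router = createPlaywrightRouter();

router.addHandler('DETAIL', async ({ request, log }) => {
    if (await existsInDb(request.url)) state.duplicateCount += 1;

    if (state.duplicateCount >= DUPLICATE_THRESHOLD) {
        log.info('Duplicate threshold reached, skipping the rest of the queue');
        return;
    }
    // ...normal extraction...
});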
13 comments
Hey folks, I have a list of URLs like ["https//google.com"] etc., and when I call enqueueLinks({urls, label:'DETAIL'}), none of the links are enqueued and the crawler stops right there. But if I do
Plain Text
await crawler.addRequests(filteredLinks.map((link) => ({ url: link, label: 'DETAIL' })));

the links are added as expected and the crawler works fine. I just wanted to know what the difference between the two is, and why enqueueLinks was not working here? Crawlee version is 3.7.0.
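
A hedged guess rather than a confirmed diagnosis: enqueueLinks applies an enqueueing strategy (same-hostname by default), which can silently drop URLs pointing at other domains, while addRequests does no such filtering. If that is what's happening here, relaxing the strategy would look like this (filteredLinks is the array from the snippet above, and the call runs inside a request handler):
TypeScript
await enqueueLinks({
    urls: filteredLinks,
    label: 'DETAIL',
    strategy: 'all', // do not restrict enqueued URLs to the current hostname
});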
6 comments
Hey folks,

I am explicitly creating request queues for my crawlers to make sure that crawler-specific run options such as maxRequestsPerCrawl can be set on a per-crawler basis. The issue with this approach is that the request queues are not getting purged after every crawl, resulting in the crawlers resuming the previous session. These are the approaches I tried:

  1. Setting the option purgeRequestQueue to true explicitly in the crawler.run() call, which results in this error: "Did not expect property `purgeRequestQueue` to exist, got `true` in object options"
  2. Setting it as a global option in crawlee.json (it looks like Crawlee is not picking up my crawlee.json file at all, because I tried to set logging levels in it and Crawlee didn't pick them up).
  3. Using `await purgeDefaultStorages()` in my entry file.

None of these options are working. Is there some other way to purge these queues? I know purging is enabled by default, but it's not working for my named queues.

Also, is using named queues the best way to isolate crawler-specific options for each crawler? When I used the default queue and restricted crawls to some numeric value in one crawler, all the other crawlers would also shut down once it reached that value, logging that max requests per crawl had been reached, despite me not having specified this option when I initialized them.
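
A sketch of one way around this, assuming the named queue is the per-site isolation mechanism: named storages are kept across runs, unlike the default ones, so the queue can be dropped explicitly once the run finishes. The queue name, limit, and handler below are illustrative.
TypeScript
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open('site-a'); // hypothetical queue name

const crawler = new PlaywrightCrawler({
    requestQueue,
    maxRequestsPerCrawl: 100, // per-crawler limit, illustrative value
    requestHandler: async ({ request, log }) => {
        log.info(`Processing ${request.url}`);
    },
});

await crawler.run(['https://example.com']);
await requestQueue.drop(); // the next run starts with a fresh 'site-a' queue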
9 comments
Hey folks,
I had to write a custom logger implementation because I needed to store the logs in a file and rotate them as well. I have extended LoggerText and implemented its methods, and while it's working, it's skipping a lot of the debug messages that we get with the built-in logger, and it can't handle objects being passed to it. I have looked at the source code and have stuck to it as much as I could.

I would really appreciate it if someone could point out what's wrong with it.

code: https://pastebin.com/tpGxyM0P

actual output: https://pastebin.com/ABtC8QSa
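
The pastebins aren't reproduced here, but one assumption worth ruling out for the missing debug lines: the log level defaults to INFO, so debug records are filtered out before they ever reach the Logger implementation. Raising it is a one-liner:
TypeScript
import { log, LogLevel } from 'crawlee';

// Let debug records through to whichever Logger implementation is attached.
log.setLevel(LogLevel.DEBUG);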
10 comments
I have been trying to port our scrapers from Selenium/Python to Crawlee, mainly because of the anti-bot protections already built into it. The issue is that I am having a hard time translating our functions 1-to-1 from Selenium to Crawlee, because a lot of them depend on the Selenium driver, or in Playwright's case the browser instance. For example:

I need to click on an element to get a link because there's a redirect in between, and I need to wait for it before grabbing it. I can't use enqueueLinksByClickingElements because I need the link in the same request for my dataset to be complete.

There are other such issues I am having trouble with. I know we have Page exposed, but that's just a single tab in a browser's context, and I need more control than that for my use case.

Is this something that's possible with Crawlee, or are there any workarounds I can use to get the same functionality?
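
On the "more control than a single Page" point, a small sketch of what is reachable from the handler context (what to do with it is left to the thread): the page's BrowserContext and the underlying Browser are both accessible, so driver-style operations aren't limited to the one tab.
TypeScript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, log }) => {
        const context = page.context();           // the BrowserContext this tab belongs to
        const browser = context.browser();        // the underlying Browser (may be null for persistent contexts)
        const extraTab = await context.newPage(); // e.g. a second tab in the same session
        log.info(`Browser ${browser?.version()}, ${context.pages().length} open pages`);
        await extraTab.close();
    },
});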
14 comments
Hey everyone,

What is the recommended way to structure your code to scrape multiple sites? I looked at a few questions here, and it seems the recommended way is to use a single crawler with multiple routers to handle this case. The issue I'm facing with that is that when you enqueue links, you'll add site-1 and then site-2 initially before starting the crawler, and then the crawler will dynamically add links as needed. But since we are using a FIFO queue, it'll crawl the first link, add its extracted links to the queue, then crawl the second link and add its links, and so on, constantly switching context between the two sites, which makes the logs a mess. Also, routers don't seem to have a URL parameter, just a label and then the request, so we'd basically have to define handlers for every site in a single router, right? Which will just bloat up a single file.

Is there a better way I can structure this? The use case is to set up crawlers for 10+ sites and crawl them sequentially or in parallel while keeping the logging sane.
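
One possible shape for this, sketched with illustrative names rather than taken from the docs: one crawler per site, each with its own named request queue, its own router, and a prefixed child logger, run sequentially from the entry point.
TypeScript
import { PlaywrightCrawler, RequestQueue, createPlaywrightRouter, log } from 'crawlee';

// Illustrative per-site definitions; a real project would likely keep each
// site's router and handlers in their own module.
const sites = [
    { name: 'site-1', startUrl: 'https://example.com' },
    { name: 'site-2', startUrl: 'https://example.org' },
];

for (const site of sites) {
    const router = createPlaywrightRouter();
    router.addDefaultHandler(async ({ log, request }) => {
        log.info(`Crawling ${request.url}`);
        // ...site-specific handling...
    });

    const crawler = new PlaywrightCrawler({
        requestQueue: await RequestQueue.open(site.name), // isolates per-site queue state
        requestHandler: router,
        log: log.child({ prefix: site.name }),            // keeps each site's log lines distinguishable
    });

    await crawler.run([site.startUrl]); // sequential: the loop waits for each site to finish
}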
21 comments