`agentOptions` from Playwright anywhere (I couldn't find it in the docs); that'll also work as per https://github.com/microsoft/playwright/issues/1799#issuecomment-959011162
When I import `log` from crawlee, it looks like it's shared between all the places that have imported `log`, and I have a custom logging implementation which is different for each site (mainly in the format of the logs), and I need crawlee to respect that. `createPlaywrightCrawler` has an option for `log` where you can pass in your own logger, and while it uses the correct format for crawlee's internal logging, it does nothing to the logs which I explicitly write in my route handler.
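Right now my route handler imports the global `log`; is the context-scoped `log` the intended way to get per-crawler formatting instead? A rough sketch of what I mean, assuming the crawler's `log` option and the handler context's `log` are wired together like this (the prefix and level are made-up examples):

```ts
import { PlaywrightCrawler, Log, LogLevel } from 'crawlee';

// Hypothetical per-site logger; the prefix and level are just examples.
const siteLog = new Log({ prefix: 'site-1', level: LogLevel.DEBUG });

const crawler = new PlaywrightCrawler({
    log: siteLog,
    requestHandler: async ({ request, log }) => {
        // `log` here should be the crawler-scoped logger rather than the global one
        log.info(`Processing ${request.url}`);
    },
});
```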
I want to call `await crawler.teardown()` after some condition is met. My crawler is set up like this:

```ts
export const crawler = new PlaywrightCrawler({
    requestHandler: router,
    maxRequestsPerMinute: 10,
});
```
I also have this in my Playwright config:

```ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
    timeout: 3000,
});
```
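As far as I can tell, the `@playwright/test` config only applies to the Playwright test runner, not to the browsers crawlee launches, so I'm wondering if the crawler-level options are what I should be using instead. A sketch of what I mean, assuming `navigationTimeoutSecs` and `requestHandlerTimeoutSecs` are the relevant options:

```ts
import { PlaywrightCrawler } from 'crawlee';
// `router` comes from my existing routes file
export const crawler = new PlaywrightCrawler({
    requestHandler: router,
    maxRequestsPerMinute: 10,
    navigationTimeoutSecs: 30,      // how long a page navigation may take
    requestHandlerTimeoutSecs: 60,  // how long a single request handler may run
});
```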
I tried the `sendRequest` API, but the response's URL only had the internal URL and not the actual link, so now I am doing it via Playwright like this:

```ts
const context = page.context();
const newPagePromise = context.waitForEvent('page'); // wait for the new tab to open
await applyNowBtn.click();
const newPage = await newPagePromise;
await newPage.waitForTimeout(3000);
log.info(`url is ${newPage.url()}`);
await newPage.close();
```
This works, but it produces `cant set cookie` errors which add a lot of unnecessary noise to my logs; I just wanted to know if there was a better approach to handle this?
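Would something along these lines be any cleaner? It's just a sketch of what I'm considering; I'm not sure it actually avoids the cookie errors, and `applyNowBtn` is the same locator as above:

```ts
const newPagePromise = page.context().waitForEvent('page');
await applyNowBtn.click();
const newPage = await newPagePromise;
try {
    // wait for the popup to actually finish navigating instead of sleeping for 3s
    await newPage.waitForLoadState('domcontentloaded');
    log.info(`url is ${newPage.url()}`);
} finally {
    await newPage.close();
}
```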
I know about `crawler.teardown()`, but it can't be executed inside a handler, so I'm using `return;` to exit the route handler when some condition is met, but I am facing issues with the node process hanging indefinitely after scraping is done. I had a previous thread about using `crawler.teardown()` and how removing the return statement stopped the hanging issue; with `crawler.teardown()` I am facing hanging issues.
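For illustration, this is roughly the shape of my handler (a sketch; `shouldStop` is just a made-up placeholder for my actual condition):

```ts
router.addDefaultHandler(async ({ request, log }) => {
    // `shouldStop` is a placeholder for my real stop condition
    if (shouldStop(request)) {
        log.info('Condition met, I want the whole crawler to stop here');
        // await crawler.teardown(); // can't be awaited from inside the handler
        return;
    }
    // ... normal scraping logic
});
```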
In Selenium this would be `WebDriverWait` or just manually checking for the element, and in Playwright this is done through `expect` assertions; while we can wait for an element using `waitForSelector` and an appropriate timeout, I couldn't find a way to use `expect` locator assertions inside the crawler.
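What I'm doing right now instead looks roughly like this (a sketch; the selector and timeout are made up):

```ts
// inside the request handler; wait up to 5s for the element to show up
await page.waitForSelector('.job-card', { timeout: 5_000 });

// or, with a locator:
await page.locator('.job-card').waitFor({ state: 'visible', timeout: 5_000 });
```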
I'd like to keep a `duplicate_count` variable (I know we already have this functionality, but this is just an example): I'll update it if the data is already there in my db and stop the crawler if the count exceeds some threshold. Is there a way I could implement this?
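Roughly what I have in mind (a sketch; `isAlreadyInDb` and the threshold are placeholders for my own DB check and limit, and I'm not sure aborting the autoscaled pool is the recommended way to stop):

```ts
let duplicateCount = 0;
const DUPLICATE_THRESHOLD = 25; // made-up number

router.addHandler('DETAIL', async ({ request, log }) => {
    // `isAlreadyInDb` is a placeholder for my actual DB lookup
    if (await isAlreadyInDb(request.url)) {
        duplicateCount += 1;
        log.info(`Duplicate #${duplicateCount}: ${request.url}`);
        if (duplicateCount >= DUPLICATE_THRESHOLD) {
            // stop scheduling new requests once we keep hitting known data
            await crawler.autoscaledPool?.abort();
        }
        return;
    }
    // ... scrape and save new data
});
```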
I have an array of URLs like `["https//google.com"]` etc., and when I call `enqueueLinks({ urls, label: 'DETAIL' })`, none of the links are enqueued and the crawler stops right there, but if I do `crawler.addRequests(filteredLinks.map(link => ({ url: link, label: DETAIL })))` the requests go through fine.
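Could it be the default same-hostname enqueue strategy filtering the URLs out (or the malformed `https//` scheme)? This is what I was going to try next, assuming `strategy: 'all'` is how you turn off the hostname filtering:

```ts
await enqueueLinks({
    urls,
    label: 'DETAIL',
    strategy: 'all', // don't restrict to the current hostname (my assumption)
});
```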
I'm on crawlee 3.7.0. `maxRequestsPerCrawl` can be set on a crawler-to-crawler basis, but the issue with this approach is that the request queues are not getting purged after every crawl, resulting in the crawlers resuming the session from before. These are the approaches I tried:
1. Setting `purgeRequestQueue` to `true` explicitly in the `crawler.run()` func, but it results in this error: `Did not expect property purgeRequestQueue to exist, got true in object options`
2. Setting it as a global variable in `crawlee.json` (it looks like crawlee is not picking up my `crawlee.json` file at all, because I tried to set logging levels in it and crawlee didn't pick it up).
3. Tried using `await purgeDefaultStorages()` in my entry file.
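For now the workaround I'm considering is just dropping the default request queue manually before each run (a sketch; I'm assuming `RequestQueue.open()` with no name returns the default queue):

```ts
import { RequestQueue } from 'crawlee';

// drop the default queue so the next crawler.run() starts from a clean slate
const queue = await RequestQueue.open();
await queue.drop();
```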
I have extended `LoggerText` and implemented its methods as well, and while it's working, it's skipping a lot of debug messages that we get with the inbuilt logger, as well as not being able to handle objects being passed to it. I have looked at the source code for it and have stuck to it as much as I could.
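One thing I want to rule out: could the missing debug messages just be the default log level? A sketch of what I'm going to check (the global default level is `INFO`, so `debug()` calls would be dropped before they ever reach a custom logger):

```ts
import { log, LogLevel } from 'crawlee';

// raise the level so debug() calls aren't filtered out before reaching the logger
log.setLevel(LogLevel.DEBUG);
```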
Is there a way to get the `driver`, or in Playwright's case the `browser` instance, for e.g. `enqueueLinksByClickingElements`? I need it in the same request for my dataset to be complete. There is `Page` exposed, but that's just a single tab in a browser's context, and I need more control over it for my use case.
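For now the closest thing I've found is going through the page (a sketch; I'm assuming `page.context().browser()` returns the same browser the crawler launched):

```ts
// inside the request handler
const browserContext = page.context();      // the BrowserContext this tab belongs to
const browser = browserContext.browser();   // may be null depending on how the browser was connected
```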
Say we enqueue the start URLs for `site-1` and then `site-2` initially before starting the crawler, and then the crawler will dynamically add in the links as needed, but this will mess up the logs: since we are using a queue and it's FIFO, first it'll crawl the first link and add the extracted links to the queue, then crawl the second link and add its links to the queue, and like this it'll keep switching contexts between the two sites, which will make the logs a mess. Also, routers don't seem to have a URL parameter, it's just a category and then the request, so we will basically have to define handlers for each site in a single router, right? Which will just bloat up a single file.

`logging.Formatter("%(asctime)s %(name)s:%(levelname)s [%(funcName)s:%(lineno)d] %(message)s")`
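Coming back to the two-site setup above, what I'm leaning towards instead is one crawler per site, each with its own named request queue and its own router, so the handlers live in separate files and the logs don't interleave (a sketch; the queue name and file layout are my own assumptions):

```ts
import { PlaywrightCrawler, RequestQueue, createPlaywrightRouter } from 'crawlee';

// router for site-1; its handlers would live in their own module
const siteOneRouter = createPlaywrightRouter();
siteOneRouter.addDefaultHandler(async ({ request, log }) => {
    log.info(`site-1: ${request.url}`);
});

const siteOneQueue = await RequestQueue.open('site-1'); // named queue just for this site
const siteOneCrawler = new PlaywrightCrawler({
    requestQueue: siteOneQueue,
    requestHandler: siteOneRouter,
});

// a second PlaywrightCrawler for site-2 would mirror this with its own queue and router
```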