I'm writing my first Actor, using Crawlee and its Playwright crawler to scrape the website
https://sreality.cz.
I wrote the crawler reusing as much as possible from the examples in the documentation. It works like this:
1. Start on the first page of search results, for example this one.
2. Skip the ad dialog, if it shows.
3. Find all links to next pages and add them to the queue with enqueueLinks().
4. Find all links to individual items (apartments, houses, whatever) and add them to the queue with enqueueLinks().
5. If the next page to process is an item page, scrape the data and save it with pushData(). Otherwise, if it's another search page, repeat from 3.
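The flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual Actor code. To keep it self-contained, the handler is written against small hand-rolled interfaces (PageContext, LocatorLike) that only mirror the context members mentioned in the post. The selectors and the 'LIST' / 'DETAIL' labels are placeholders assumed for the sketch, not values taken from sreality.cz.

```typescript
// Minimal stand-ins for the parts of the crawling context this sketch uses.
// They are assumptions made so the example is self-contained; in a real Actor
// the context object comes from Crawlee's PlaywrightCrawler.
interface LocatorLike {
    count(): Promise<number>;
    click(): Promise<void>;
}

interface PageContext {
    request: { url: string; label?: string };
    page: {
        title(): Promise<string>;
        locator(selector: string): LocatorLike;
    };
    enqueueLinks(options: { selector: string; label: string }): Promise<unknown>;
    pushData(item: Record<string, unknown>): Promise<unknown>;
}

// Placeholder selectors: hypothetical, not the real ones for sreality.cz.
const AD_DIALOG_CLOSE = 'button.close-ad';
const NEXT_PAGE_LINKS = 'a.next-page';
const ITEM_LINKS = 'a.item';

async function handlePage(ctx: PageContext): Promise<void> {
    const { request, page } = ctx;

    // Dismiss the ad dialog, but only if it is actually present.
    const closeButton = page.locator(AD_DIALOG_CLOSE);
    if ((await closeButton.count()) > 0) {
        await closeButton.click();
    }

    // Item pages are scraped and saved; nothing more is enqueued from them.
    if (request.label === 'DETAIL') {
        await ctx.pushData({ url: request.url, title: await page.title() });
        return;
    }

    // Search pages: queue further search pages and the individual items.
    await ctx.enqueueLinks({ selector: NEXT_PAGE_LINKS, label: 'LIST' });
    await ctx.enqueueLinks({ selector: ITEM_LINKS, label: 'DETAIL' });
}
```

In a real Actor, a function like this would be supplied as the requestHandler option of PlaywrightCrawler, and the start URL would be passed to crawler.run(). The start request carries no label, so it falls through to the search-page branch.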
In theory, this is all I need to scrape the entire search result list. However, what I actually see is that it enqueues all the links (around 185) but processes only around 30 of them before finishing. Very strange.
I tried setting maxRequestsPerCrawl: 1000, but it didn't help.
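For reference, a minimal sketch of where that setting sits, assuming the options are collected in a plain object (the name crawlerOptions is hypothetical) that is then passed to the PlaywrightCrawler constructor together with the request handler:

```typescript
// Hypothetical options object; the name crawlerOptions exists only in this sketch.
const crawlerOptions = {
    // Upper bound on how many requests the crawler will handle in one run.
    maxRequestsPerCrawl: 1000,
};
```

Since this option is only an upper limit, a value of 1000 should be well above the roughly 185 links that get enqueued.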
Maybe I'm missing something, but I don't see why it would just stop after around 30 pages. Is there another config option somewhere that controls this?
Even stranger, it then logs the final statistics, which say something like "requestsFinished":119. That number doesn't make sense at all: it is less than the number of links actually enqueued, but a lot more than the number of pages actually processed.