I have a web page with pagination, and those paginated pages contain links that I need to crawl. New items are added to these pages periodically, with newer items appearing on top.
I am thinking of pseudocode to crawl only what's needed, something like this:
- For page 1 to n
  - Collect item links
  - For each link
    - If the link is visited, exit/shut down the scraper completely
    - Else put it into the `DETAIL` queue
Later on, in the `DETAIL` handler:
- Scrape the item
- Mark the link as visited
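To make the idea concrete, here is a minimal sketch of how I imagine this in Crawlee. The selectors and the `isVisited`/`markVisited` helpers are made-up placeholders; what should actually back those helpers is exactly my question below:

```ts
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

// Stand-ins for real persistence (the open question below) --
// an in-memory Set does not survive restarts, a database would.
const visited = new Set<string>();
const isVisited = async (url: string) => visited.has(url);
const markVisited = async (url: string) => { visited.add(url); };

const router = createCheerioRouter();

// LIST pages: walk the pagination and enqueue unseen item links.
router.addDefaultHandler(async ({ $, enqueueLinks, crawler, log }) => {
    const links = $('a.item-link') // hypothetical selector
        .map((_, el) => $(el).attr('href'))
        .get();

    for (const link of links) {
        if (await isVisited(link)) {
            // Newer items are on top, so the first known link means
            // everything after it was already crawled -- stop the run.
            log.info(`Known link ${link} reached, stopping.`);
            await crawler.stop(); // available in recent Crawlee versions
            return;
        }
        await crawler.addRequests([{ url: link, label: 'DETAIL' }]);
    }

    // No known link on this page; continue to the next one.
    await enqueueLinks({ selector: 'a.next-page' });
});

// DETAIL pages: scrape, then mark the link as visited.
router.addHandler('DETAIL', async ({ request }) => {
    // ... extract fields here ...
    await markVisited(request.url);
});

const crawler = new CheerioCrawler({ requestHandler: router });
await crawler.run(['https://example.com/page/1']);
```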
Now I am thinking about how to actually mark a link as visited. My idea is to have the crawling script connect to a database where the link is the primary key, and simply check whether the link is already in the database or not.
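For illustration, the two helpers from the sketch above could look like this against Postgres with the `pg` client, assuming a table like `CREATE TABLE visited_links (url TEXT PRIMARY KEY);` (table and connection details are made up):

```ts
import pg from 'pg';

const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });

async function isVisited(url: string): Promise<boolean> {
    const res = await pool.query(
        'SELECT 1 FROM visited_links WHERE url = $1',
        [url],
    );
    return (res.rowCount ?? 0) > 0;
}

async function markVisited(url: string): Promise<void> {
    // ON CONFLICT makes re-marking a no-op instead of a PK violation.
    await pool.query(
        'INSERT INTO visited_links (url) VALUES ($1) ON CONFLICT (url) DO NOTHING',
        [url],
    );
}
```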
I am assuming that Crawlee running on the Apify platform would be able to open a connection to an external database. Please correct me if I am wrong.
Am I overcomplicating things, or is there a better idea?