Incremental Web scraping using Crawlee

Question

Hey everyone.
I'm currently scraping a website where new pages are added frequently (a blog, for example). When I run my scraper, it scrapes all pages successfully, but when I run it again, say tomorrow after new pages have been added, it starts scraping everything from scratch.

I would be thankful for any advice, ideas, solutions, or examples of how to re-scrape efficiently without crawling the entire site again.

Thank you in advance. 🙏🏻

memo23 · Answer

@titavilanova2 dm me

azzouz · Answer

You can save the URLs you've already scraped somewhere (a simple file, or a named key-value store if you're using Crawlee). On later runs, collect all URLs, filter out the ones you've already seen, and scrape only the delta.
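
A minimal sketch of the filtering step, in plain TypeScript with no dependencies. The helper names `normalizeUrl` and `selectNewUrls` and the example.com URLs are made up for illustration, not part of Crawlee. It assumes the seen set was stored in the same normalized form, and that it gets loaded and saved elsewhere (for example via a named key-value store opened with `KeyValueStore.open()`).

```typescript
// Hypothetical helpers for illustration; not part of Crawlee.

/** Normalize a URL so trivially different spellings compare equal. */
function normalizeUrl(raw: string): string {
    // Drop the fragment: "#comments" points at the same page.
    const withoutHash = raw.split('#')[0];
    // Drop a single trailing slash.
    return withoutHash.endsWith('/') ? withoutHash.slice(0, -1) : withoutHash;
}

/** Return the discovered URLs that are not in `seen`, deduplicated, in discovery order. */
function selectNewUrls(discovered: string[], seen: ReadonlySet<string>): string[] {
    const fresh = new Set<string>();
    for (const raw of discovered) {
        const url = normalizeUrl(raw);
        if (!seen.has(url)) fresh.add(url);
    }
    return Array.from(fresh);
}

// URLs scraped on previous runs (would be loaded from a file or key-value store).
const seenUrls = new Set<string>([
    'https://example.com/blog/post-1',
    'https://example.com/blog/post-2',
]);

// URLs collected on today's run, e.g. from the blog's listing pages.
const discoveredUrls = [
    'https://example.com/blog/post-1/',
    'https://example.com/blog/post-2#comments',
    'https://example.com/blog/post-3',
];

// Only post-3 is new, so only post-3 needs to be scraped.
const delta = selectNewUrls(discoveredUrls, seenUrls);

// After the delta has been scraped successfully, merge it into the seen set and persist it.
const updatedSeenUrls = Array.from(seenUrls).concat(delta);
```

Updating the seen set only after a page was scraped successfully means a failed page gets retried on the next run instead of being silently skipped.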

azzouz · Answer

Or maybe check whether the site has a sitemap file.
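
A rough sketch of how a sitemap could drive the re-scrape, assuming the site publishes a standard sitemaps.org-style XML file with `<loc>` and (optionally) `<lastmod>` per entry. `parseSitemap`, `changedSince`, and the sample XML are made up for illustration. A regex is used to keep the example dependency-free; a real XML parser would be more robust (this one does not decode entities such as `&amp;`).

```typescript
// Hypothetical helpers for illustration; not part of Crawlee.

interface SitemapEntry {
    loc: string;
    lastmod?: Date; // undefined when the tag is missing or unparseable
}

/** Extract <url> entries from sitemap XML. */
function parseSitemap(xml: string): SitemapEntry[] {
    const entries: SitemapEntry[] = [];
    const urlBlock = /<url>([\s\S]*?)<\/url>/g;
    let match: RegExpExecArray | null;
    while ((match = urlBlock.exec(xml)) !== null) {
        const block = match[1];
        const loc = /<loc>\s*([^<]+?)\s*<\/loc>/.exec(block);
        if (!loc) continue; // skip malformed entries without a location
        const lastmodTag = /<lastmod>\s*([^<]+?)\s*<\/lastmod>/.exec(block);
        const parsed = lastmodTag ? new Date(lastmodTag[1]) : undefined;
        const valid = parsed !== undefined && !isNaN(parsed.getTime());
        entries.push({ loc: loc[1], lastmod: valid ? parsed : undefined });
    }
    return entries;
}

/** Keep URLs modified after `lastRun`; entries without a usable lastmod are kept to be safe. */
function changedSince(entries: SitemapEntry[], lastRun: Date): string[] {
    return entries
        .filter((e) => e.lastmod === undefined || e.lastmod.getTime() > lastRun.getTime())
        .map((e) => e.loc);
}

// Sample sitemap; in practice this would be the body fetched from the site's sitemap URL.
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/old-post</loc><lastmod>2024-01-10</lastmod></url>
  <url><loc>https://example.com/blog/new-post</loc><lastmod>2024-03-05</lastmod></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>`;

// Timestamp of the previous run (would be persisted between runs).
const lastRun = new Date('2024-02-01T00:00:00Z');

// old-post is skipped; new-post (newer) and about (no lastmod) are kept.
const toScrape = changedSince(parseSitemap(sampleSitemap), lastRun);
```

Not every site keeps `lastmod` accurate, so combining this with the seen-URL filter is probably the safer option.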

Apify and Crawlee Official Forum
