Hello folks! Is there any way I can force the Cheerio/Playwright crawlers to stop using their own internal request queue and instead "enqueue links" to another queue service such as Redis? I'd like to run multiple crawlers on a single website, and they would need to share the same queue so they don't process duplicate links. Thanks in advance!
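In case it helps to see what I mean, here is a rough sketch of the kind of setup I'm imagining: instead of calling enqueueLinks(), the request handler extracts links itself and pushes them into a shared Redis set, which would give de-duplication across crawler instances for free. The Redis key name and connection URL are just placeholders I made up.

```ts
import { CheerioCrawler } from 'crawlee';
import { createClient } from 'redis';

// Shared Redis connection; the URL and key name below are only illustrative.
const redis = createClient({ url: 'redis://localhost:6379' });
await redis.connect();

const crawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        // Collect absolute links from the page instead of calling enqueueLinks().
        const links = $('a[href]')
            .map((_, el) => new URL($(el).attr('href')!, request.loadedUrl).href)
            .get();

        // SADD only stores members that aren't already in the set, so every
        // crawler instance writing to this key gets de-duplication for free.
        if (links.length > 0) {
            await redis.sAdd('shared:discovered-urls', links);
        }
    },
});

// Each crawler instance would then pull its own work out of Redis (e.g. from a
// list or stream) and feed it in via crawler.addRequests() rather than relying
// on the default internal RequestQueue.
await crawler.run(['https://example.com']);
```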
I have a CheerioCrawler running in a Docker container that listens for incoming messages (each message is a URL) on a single RabbitMQ queue. The crawler runs and finishes the first crawling job successfully. However, when it receives a second message with the same URL as the first one, it just prints the output in the screenshot I've attached; it doesn't actually add the second message/URL to the crawler's queue. What would be a solution for this? I have thought about restarting the crawler or emptying the key-value storage, but I can't seem to get it working. In my crawler config I'm already setting 'purgeOnStart' and 'persistStorage' to false.
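For context, my consumer looks roughly like the sketch below (queue name and RabbitMQ URL are placeholders). My understanding is that Crawlee's request queue keys requests by uniqueKey (the normalized URL by default), so a URL that was already handled gets silently skipped; one workaround I've been experimenting with is giving each message its own uniqueKey, plus keepAlive so the crawler doesn't finish when the queue drains. Please correct me if that's the wrong approach.

```ts
import { CheerioCrawler } from 'crawlee';
import amqp from 'amqplib';

const crawler = new CheerioCrawler({
    // keepAlive keeps the crawler waiting for new requests instead of
    // finishing once the queue is drained (recent Crawlee versions).
    keepAlive: true,
    async requestHandler({ request, $ }) {
        console.log(`Crawled ${request.url}: ${$('title').text()}`);
    },
});

// Start the crawler in the background; it stays up thanks to keepAlive.
const crawlerPromise = crawler.run();

const connection = await amqp.connect('amqp://localhost');
const channel = await connection.createChannel();
await channel.assertQueue('crawl-jobs');

channel.consume('crawl-jobs', async (msg) => {
    if (!msg) return;
    const url = msg.content.toString();

    // A fresh uniqueKey per message means the request is never treated as a
    // duplicate of an earlier, already-handled one with the same URL.
    await crawler.addRequests([{ url, uniqueKey: `${url}#${Date.now()}` }]);
    channel.ack(msg);
});
```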
I have a Cheerio crawler that works and basically "gets the sitemap" of a website. My issue is that enqueueLinks() fires requests so quickly that several sites block me, believing I'm trying to spam or attack them (it basically gets my IP banned). Can I delay the requests between links a bit so it doesn't look like I'm an attacker, given that the number of requests the crawler makes to a website in a short amount of time is really high?
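This is what I've tried so far using the crawler's rate-limiting options (the numbers are just what I picked for testing, not recommendations), but I'm not sure it's the right way to space requests out:

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Only one request in flight at a time...
    maxConcurrency: 1,
    // ...and at most ~20 requests per minute, i.e. roughly a 3-second gap
    // between requests, which looks a lot less like an attack.
    maxRequestsPerMinute: 20,
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Visiting ${request.url}`);
        // enqueueLinks() still discovers the whole site; the options above
        // only slow down how fast the queued requests are actually fetched.
        await enqueueLinks();
    },
});

await crawler.run(['https://example.com']);
```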