robots.txt Compatibility

Muhammet · 2024-09-19T07:05:57.855Z

Hi guys 👋, my Apify actor can pull data from the website even though the robots.txt setting is “TRUE”. When I test it on my own server, it complies with the robots.txt rules. Doesn't Apify follow robots.txt rules automatically? If not, can we set that up manually? I haven't found any documentation on this.

Apify does not enforce robots.txt rules by default. The platform is built to be flexible for web scraping and automation, and some use cases require bypassing these rules (within the bounds of legality and ethics). So even if the robots.txt setting is "TRUE," the rules may not be applied unless your code handles them explicitly.

You can enforce robots.txt rules manually by adding the logic to your actor. For example, in Node.js you can use a library such as robots-txt-guard to parse the file and check its restrictions before pulling data from a website.

Here's a basic approach:

1. Parse the robots.txt file from the target site.
2. Check whether your actor is allowed to scrape specific endpoints.
3. Proceed based on the result of the check.
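
To make those three steps concrete, here is a rough sketch in plain Node.js. It is a hand-written, simplified checker, not the robots-txt-guard API and not anything built into Apify. The function names `parseRobotsTxt` and `isAllowed` and the bot name `MyActorBot` are made up for illustration. It only handles User-agent, Allow and Disallow with prefix matching (longest match wins), so wildcards and Crawl-delay are ignored.

```javascript
// Step 1: parse robots.txt text into groups of rules per user agent.
// Simplified on purpose: no wildcard (*, $) support, no Crawl-delay, no Sitemap.
function parseRobotsTxt(text) {
  const groups = []; // [{ agents: [...], rules: [{ allow, path }] }]
  let current = null;
  let lastWasAgent = false;

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim(); // drop comments
    const idx = line.indexOf(':');
    if (idx === -1) continue; // blank or malformed line

    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();

    if (field === 'user-agent') {
      // Consecutive User-agent lines belong to the same group.
      if (!lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if ((field === 'allow' || field === 'disallow') && current) {
      // An empty Disallow means "nothing is disallowed", so it adds no rule.
      if (value !== '') {
        current.rules.push({ allow: field === 'allow', path: value });
      }
      lastWasAgent = false;
    } else {
      lastWasAgent = false;
    }
  }
  return groups;
}

// Step 2: decide whether a path may be fetched by the given user agent.
function isAllowed(groups, userAgent, path) {
  const ua = userAgent.toLowerCase();

  // Prefer groups that name this agent; otherwise fall back to "*".
  let matching = groups.filter((g) =>
    g.agents.some((a) => a !== '*' && a !== '' && ua.includes(a))
  );
  if (matching.length === 0) {
    matching = groups.filter((g) => g.agents.includes('*'));
  }

  // Longest matching prefix wins; on a tie, Allow wins.
  let best = null;
  for (const rule of matching.flatMap((g) => g.rules)) {
    if (!path.startsWith(rule.path)) continue;
    const longer = best === null || rule.path.length > best.path.length;
    const tie = best !== null && rule.path.length === best.path.length && rule.allow;
    if (longer || tie) best = rule;
  }
  return best === null ? true : best.allow;
}

// Step 3: proceed (or skip) based on the result.
const robots = `
User-agent: *
Disallow: /private/
Allow: /private/public-page

User-agent: MyActorBot
Disallow: /
`;

const groups = parseRobotsTxt(robots);
console.log(isAllowed(groups, 'SomeBot', '/products'));            // true
console.log(isAllowed(groups, 'SomeBot', '/private/data'));        // false
console.log(isAllowed(groups, 'SomeBot', '/private/public-page')); // true
console.log(isAllowed(groups, 'MyActorBot/1.0', '/products'));     // false
```

In a real actor you would first download `/robots.txt` from the target site and run this check before each request. For production use, a maintained parser is a safer choice than this sketch because real robots.txt files often rely on wildcards.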

Apify and Crawlee Official Forum
