Apify and Crawlee Official Forum

robots.txt Compatibility

Hi guys 👋, my Apify actor can pull data from the website even though the robots.txt setting is “TRUE”. When I test it on my own server, it complies with robots.txt rules. Doesn't Apify automatically follow robots.txt rules? Can't we set it manually? I haven't found any documentation on this.
1 comment
Apify does not enforce robots.txt rules by default. The platform is built to give you flexibility for web scraping and automation, and some use cases may require bypassing these rules (within the bounds of legality and ethics). So even if the robots.txt setting is "TRUE", it might not be enforced unless you handle it explicitly in your code.

You can enforce robots.txt rules yourself by adding logic to your actor. For example, in Node.js you can use a library such as robots-txt-guard to parse robots.txt and check its restrictions before pulling data from a website.

Here's a basic approach:

  • Fetch and parse the robots.txt file from the target site.
  • Check whether your actor is allowed to scrape specific endpoints.
  • Proceed based on the result of the check.
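
To make the steps above concrete, here is a rough sketch in plain Node.js with no dependencies. It is not the robots-txt-guard API and not anything built into Apify; the function names (parseRobotsTxt, isAllowed), the 'my-actor' user agent and the sample rules are all made up for illustration. It only does simple prefix matching (no * or $ wildcards) and assumes you have already downloaded the robots.txt text yourself.

```javascript
// Simplified robots.txt check: prefix matching only, no wildcard (*) or $ support.
// parseRobotsTxt, isAllowed and 'my-actor' are hypothetical names for this sketch.

// Step 1: parse the robots.txt text into groups of { agents, rules }.
function parseRobotsTxt(text) {
  const groups = [];
  let current = null;
  let lastWasAgent = false;

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim(); // drop comments
    if (!line) continue;
    const idx = line.indexOf(':');
    if (idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();

    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group.
      if (!lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      if (value) current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else {
      lastWasAgent = false;
      const isRule = field === 'allow' || field === 'disallow';
      // An empty Disallow value means nothing is disallowed, so it is skipped.
      if (isRule && current && value) {
        current.rules.push({ allow: field === 'allow', path: value });
      }
    }
  }
  return groups;
}

// Step 2: check whether a user agent may fetch a path.
function isAllowed(groups, userAgent, path) {
  const ua = userAgent.toLowerCase();
  // Prefer groups that name this agent; otherwise fall back to the "*" group.
  const specific = groups.filter((g) => g.agents.some((a) => a !== '*' && ua.includes(a)));
  const chosen = specific.length ? specific : groups.filter((g) => g.agents.includes('*'));

  // The longest matching rule wins; on a tie, Allow wins.
  let best = null;
  for (const group of chosen) {
    for (const rule of group.rules) {
      if (!path.startsWith(rule.path)) continue;
      const longer = !best || rule.path.length > best.path.length;
      const tieAllow = best && rule.path.length === best.path.length && rule.allow;
      if (longer || tieAllow) best = rule;
    }
  }
  return best ? best.allow : true; // no matching rule means allowed
}

// Step 3: proceed based on the result of the check.
const sample = [
  'User-agent: *',
  'Disallow: /private/',
  'Allow: /private/public-page',
].join('\n');

const groups = parseRobotsTxt(sample);
console.log(isAllowed(groups, 'my-actor', '/private/data'));        // false
console.log(isAllowed(groups, 'my-actor', '/private/public-page')); // true
console.log(isAllowed(groups, 'my-actor', '/blog'));                // true
```

In an actor you would call something like isAllowed before each request and skip the URL when it returns false. For production use, a maintained parser is probably a better choice than this sketch, since real robots.txt files often rely on wildcards.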