Hi, based on the error I would expect that you are getting timed out by TripAdvisor. Are you using any anti-bot-detection approach, such as proxies, fingerprints, etc.?
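For example, a minimal sketch of what I mean with requests (the proxy URL and headers here are just placeholders, swap in your own provider's credentials):
import requests

# Hypothetical proxy URL, replace with your own proxy provider's credentials
proxies = {
    "http": "http://user:pass@proxy.example.com:8000",
    "https": "http://user:pass@proxy.example.com:8000",
}

# Browser-like headers make the request look less like a default script
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
}

response = requests.get("https://www.tripadvisor.com/", proxies=proxies, headers=headers, timeout=30)
print(response.status_code)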
No, it's simple code for making a request.
It works for other URLs but not for TripAdvisor.
Can you share a reproducible example of the code?
I don't think this is related to Apify directly; it might be some config in the requests library or similar. We need to see the code, of course.
Here is a picture of the code. Please check.
It's the Apify boilerplate.
Hi Lukas, hope you're doing well. Can you please take a look?
I will try to reproduce it
Thanks, waiting for your reply.
Hi, have you checked? Please let me know. Thank you.
I can reproduce it, checking with the team
Thank you, really appreciate it.
What is the error? Were you able to check it?
Hi Lukas, please reply. I need this working so I can run my crawler, please help me.
Sorry, it might take a while before the Python team figures this out. cc
Thanks for the update. Really appreciate it.
Maybe using the requests library or something else would fix it, or just finding an option to ignore SSL.
With a little help from ChatGPT:
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests

# Define a function to scrape the given URL up to the specified maximum depth
def scrape(url, depth, max_depth):
    if depth > max_depth:
        return
    print(f'Scraping {url} at depth {depth}...')
    # Try to send a GET request to the URL
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        # If we haven't reached the max depth, look for nested links and enqueue their targets
        if depth < max_depth:
            for link in soup.find_all('a'):
                link_href = link.get('href')
                if link_href and link_href.startswith(('http://', 'https://')):
                    link_url = urljoin(url, link_href)
                    print(f'Found link: {link_url}')
                    scrape(link_url, depth + 1, max_depth)
        # Extract and print the title of the page
        title = soup.title.string if soup.title else "No Title"
        print(f'Title: {title}')
    except requests.exceptions.RequestException as e:
        print(f'An error occurred: {e}')

# Main function to start scraping
def main():
    start_urls = [{'url': 'https://www.tripadvisor.com/'}]  # Example start URL
    max_depth = 1  # Example max depth
    # Start scraping from the first URL
    for start_url in start_urls:
        url = start_url.get('url')
        print(f'Starting scrape for: {url}')
        scrape(url, 0, max_depth)

if __name__ == "__main__":
    main()
The site is secured by SSL, so we cannot ignore it, I think.
JS libraries allow you to ignore SSL errors, let me check.
The link is for the Python httpx library.
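As a rough sketch of that option (assuming httpx is installed; verify=False skips certificate verification, so it should only be used for debugging):
import httpx

# Sketch only: verify=False disables SSL certificate verification (debugging use only)
response = httpx.get("https://www.tripadvisor.com/", verify=False, timeout=30.0)
print(response.status_code)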