Apify and Crawlee Official Forum

grackle
Offline, last seen 4 months ago
Joined August 30, 2024
We're getting crashes in our self-developed Actor (Python + Playwright) caused by an Apify websocket connection. We don't use websockets ourselves, so we're not sure what's causing this.
Plain Text
ERROR Error in websocket connection
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 1301, in close_connection
    await self.transfer_data_task
  File "/usr/local/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 974, in transfer_data
    await asyncio.shield(self._put_message_waiter)
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/apify/event_manager.py", line 222, in _process_platform_messages
    async for message in websocket:
  File "/usr/local/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 498, in __aiter__
    yield await self.recv()
          ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 568, in recv
    await self.ensure_open()
  File "/usr/local/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 939, in ensure_open
    raise self.connection_closed_exc()
websockets.exceptions.ConnectionClosedError: sent 1011 (internal error) keepalive ping timeout; no close frame received
1 comment
I have a Python requests Actor that is somehow restarting itself after many hours, with the same input, and appending the results to the existing output. That's very annoying! If there was a crash, it should just abort, and I should be able to restart it on my own if needed. Restarting with the same input over again just wastes time and money. Does anyone know why this is happening?

Here's what the log has -- there's no other info about any errors. The crawler was running just fine right up until 11:08:34, was about 21% done, and then I get:
Plain Text
2023-09-08T11:08:34.534Z ACTOR: Sending Docker container SIGTERM signal.
2023-09-08T11:08:48.281Z ACTOR: Pulling Docker image from repository.
2023-09-08T11:08:49.924Z ACTOR: Creating Docker container.
2023-09-08T11:08:50.307Z ACTOR: Starting Docker container.
2023-09-08T11:08:51.769Z INFO  Initializing actor...
2023-09-08T11:08:51.771Z INFO  System info ({"apify_sdk_version": "1.1.4", "apify_client_version": "1.4.1", "python_version": "3.11.5", "os": "linux"})


And then the crawler starts from scratch using the old input data -- it doesn't "restore" or keep any old state, which means I've wasted a lot of time / Actor $ re-crawling the same input.

The timeout was set to about 10 days, and the crawler had been running for 22 hours before I noticed. Another strange thing is that the full log only goes back 5 hours from when I stopped it -- 06:02 UTC to 11:08 UTC. It should go back 22 hours.

How can I protect against this?
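
A minimal sketch of one possible safeguard, assuming the restart is a platform-initiated container migration (the SIGTERM followed by a fresh container suggests that): persist progress via the Apify Python SDK's event listeners and skip already-finished work on startup. The CRAWL_STATE key and the urls input field are made-up names, and async listeners are assumed to be supported by Actor.on:
Python
from apify import Actor
from apify_shared.consts import ActorEventTypes

async def main() -> None:
    async with Actor:
        actor_input = await Actor.get_input() or {}
        urls = actor_input.get('urls', [])  # hypothetical input field

        # Restore progress left behind by a previous container, if any.
        state = await Actor.get_value('CRAWL_STATE') or {'done': []}

        async def save_state(event_data=None):
            # Called periodically, and right before a migration/restart.
            await Actor.set_value('CRAWL_STATE', state)

        Actor.on(ActorEventTypes.PERSIST_STATE, save_state)
        Actor.on(ActorEventTypes.MIGRATING, save_state)

        for url in urls:
            if url in state['done']:
                continue  # already processed before the restart
            ...  # fetch the URL and push results here
            state['done'].append(url)

        await save_state()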
1 comment
Hi, I have a custom Python + requests Actor that works great. It's pretty simple: it works through a list of starting URLs and pulls out one piece of information per URL.

My question is: if (for example) one run of 1,000 input URLs takes an hour to complete, I would like to parallelize it 4 ways so that I can run 4,000 URLs in an hour.

What's the best way to do this? I could kick off 4 copies of the run with segmented data, but this seems like something Apify could support natively.

I saw that if I were using Crawlee (and therefore JS) I could use autoscaling: https://docs.apify.com/platform/actors/running/usage-and-resources. But is there a way to build a single Python-based Actor that uses more threads/CPU cores when needed?
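
A minimal sketch of the single-Actor option: run the blocking requests calls through a small thread pool so the network waits overlap. The startUrls input field name is an assumption, and this presumes the work is I/O-bound:
Python
import asyncio
from concurrent.futures import ThreadPoolExecutor

import requests
from apify import Actor

def fetch_one(url: str) -> dict:
    # Blocking request; the thread pool lets several of these overlap.
    resp = requests.get(url, timeout=30)
    return {'url': url, 'status': resp.status_code}

async def main() -> None:
    async with Actor:
        actor_input = await Actor.get_input() or {}
        urls = actor_input.get('startUrls', [])  # hypothetical input field

        loop = asyncio.get_running_loop()
        # 4 worker threads ~= 4x throughput for I/O-bound scraping,
        # without blocking the SDK's own event loop.
        with ThreadPoolExecutor(max_workers=4) as pool:
            results = await asyncio.gather(
                *(loop.run_in_executor(pool, fetch_one, url) for url in urls)
            )
        await Actor.push_data(results)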
4 comments
My example use case: I want to track 1,000 TikTok profiles over time. Let's say once a week I want to scrape 1,000 profile pages and each profile's most recent video. I set up Apify, and it seems to work well. But there are three different options, and I'm not certain how the cost works out for each. For my example, I took 100 random profiles and gave them to each of these Actors. Each returns 89 results when asking for one video per profile. (The other 11 I assume were broken / private / bad links; that's fine.)

(1) TikTok Scraper Actor: https://console.apify.com/actors/GdWCkxBtKWOsKjdch/information/latest/readme - $49/mo. Scraping my 100 example profiles with 89 results cost me $0.439, ~0.5 cents per result, charged on top of the $49/month(?). So crawling 4,000 profiles a month would cost me $49 + $20 = $69 a month.

(2) TikTok Profile Scraper Actor: https://console.apify.com/actors/0FXVyOXXEmdGcV88a/information/latest/readme - $5 / 1,000 videos (0.5 cents per video). This does end up charging me 0.5 cents/result, which would cost me $20 a month.

(3) Free TikTok Scraper Actor: https://console.apify.com/actors/OtzYfK1ndEGdwWFKQ/information/latest/readme - free(?), yet it cost me $0.379, ~0.43 cents per result. This would cost me $17.20 a month.

My question is: why would I use (1), when it'll cost me $49/mo plus 0.5 cents per result, when I could use (2) or (3)? And what makes (3) "free"? I'm still being charged a fee per result. What am I missing?
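
For anyone checking the math, a quick back-of-the-envelope script reproducing the monthly totals above (per-result rates derived from the 89-result test runs; small rounding differences expected):
Python
# Monthly cost for 1,000 profiles/week (~4,000 results/month).
results_per_month = 1000 * 4

options = {
    # (monthly rental $, per-result $) -- rates from the runs above
    'TikTok Scraper':         (49.0, 0.439 / 89),  # ~0.49 cents/result
    'TikTok Profile Scraper': (0.0,  5.0 / 1000),  # advertised $5/1,000 videos
    'Free TikTok Scraper':    (0.0,  0.379 / 89),  # ~0.43 cents/result
}

for name, (rental, per_result) in options.items():
    total = rental + per_result * results_per_month
    print(f'{name}: ${total:.2f}/month')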
6 comments
In the middle of a Python Playwright run, we are getting this error:
Plain Text
ERROR Error in websocket connection
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 1301, in close_connection
    await self.transfer_data_task
  File "/usr/local/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 974, in transfer_data
    await asyncio.shield(self._put_message_waiter)
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/apify/event_manager.py", line 222, in _process_platform_messages
    async for message in websocket:
  File "/usr/local/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 498, in __aiter__
    yield await self.recv()
          ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 568, in recv
    await self.ensure_open()
  File "/usr/local/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 939, in ensure_open
    raise self.connection_closed_exc()
websockets.exceptions.ConnectionClosedError: sent 1011 (internal error) keepalive ping timeout; no close frame received


We do not have any websocket code in our Actor. The traceback does not have any reference to our code.
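
For context: the websocket in the traceback is opened by the SDK's own event manager, not user code. One common cause of a keepalive ping timeout in async Python Actors is the event loop being blocked by long synchronous work, which stops that websocket from answering the platform's pings in time. A minimal sketch of one mitigation, assuming that's the cause; blocking_parse and pages_to_process are hypothetical stand-ins:
Python
import asyncio

from apify import Actor

def blocking_parse(html: str) -> dict:
    # Hypothetical stand-in for CPU-heavy synchronous work
    # (big parses, sync I/O) that can starve the event loop.
    return {'length': len(html)}

async def main() -> None:
    async with Actor:
        pages_to_process = ['<html>example</html>']  # hypothetical work items
        for html in pages_to_process:
            # Run the blocking step in a worker thread so the loop
            # stays free to answer the websocket keepalive pings.
            result = await asyncio.to_thread(blocking_parse, html)
            await Actor.push_data(result)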
7 comments