Apify and Crawlee Official Forum

grackle
Offline, last seen 4 months ago
Joined August 30, 2024
We're getting crashes in our self-developed Actor (Python + Playwright) caused by an Apify websocket connection. We don't use websockets ourselves, so we're not sure what's causing this.
Plain Text
ERROR Error in websocket connection
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 1301, in close_connection
    await self.transfer_data_task
  File "/usr/local/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 974, in transfer_data
    await asyncio.shield(self._put_message_waiter)
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/apify/event_manager.py", line 222, in _process_platform_messages
    async for message in websocket:
  File "/usr/local/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 498, in __aiter__
    yield await self.recv()
          ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 568, in recv
    await self.ensure_open()
  File "/usr/local/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 939, in ensure_open
    raise self.connection_closed_exc()
websockets.exceptions.ConnectionClosedError: sent 1011 (internal error) keepalive ping timeout; no close frame received
1 comment
I have a Python requests Actor that is somehow restarting itself after many hours, with the same input, and appending the results to the existing output. That's very annoying! If there was a crash, it should just abort, and I should be able to restart it on my own if needed. Restarting with the same input over again just wastes time and money. Does anyone know why this is happening?

Here's what the log has -- there's no other info about any errors. The crawler was running just fine right up until 11:08:34, was about 21% done, and then I get:
Plain Text
2023-09-08T11:08:34.534Z ACTOR: Sending Docker container SIGTERM signal.
2023-09-08T11:08:48.281Z ACTOR: Pulling Docker image from repository.
2023-09-08T11:08:49.924Z ACTOR: Creating Docker container.
2023-09-08T11:08:50.307Z ACTOR: Starting Docker container.
2023-09-08T11:08:51.769Z INFO  Initializing actor...
2023-09-08T11:08:51.771Z INFO  System info ({"apify_sdk_version": "1.1.4", "apify_client_version": "1.4.1", "python_version": "3.11.5", "os": "linux"})


And then the crawler starts from scratch using the old input data -- it doesn't "restore" or keep any old state, which means I've wasted a lot of time / Actor $ re-crawling the same input.

The timeout was set to about 10 days, and the crawler had been running for 22 hours before I noticed. Another strange thing is that the full log only goes back 5 hours from when I stopped it -- 06:02 UTC to 11:08 UTC. It should go back 22 hours.

How can I protect against this?
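
A minimal sketch of one possible safeguard, assuming the restart is a platform-initiated container migration (the SIGTERM followed by a fresh container suggests that): persist progress via the Apify Python SDK's event listeners and skip already-finished work on startup. The CRAWL_STATE key and the urls input field are made-up names, and async listeners are assumed to be supported by Actor.on:
Python
from apify import Actor
from apify_shared.consts import ActorEventTypes

async def main() -> None:
    async with Actor:
        actor_input = await Actor.get_input() or {}
        urls = actor_input.get('urls', [])  # hypothetical input field

        # Restore progress left behind by a previous container, if any.
        state = await Actor.get_value('CRAWL_STATE') or {'done': []}

        async def save_state(event_data=None):
            # Called periodically, and right before a migration/restart.
            await Actor.set_value('CRAWL_STATE', state)

        Actor.on(ActorEventTypes.PERSIST_STATE, save_state)
        Actor.on(ActorEventTypes.MIGRATING, save_state)

        for url in urls:
            if url in state['done']:
                continue  # already processed before the restart
            ...  # fetch the URL and push results here
            state['done'].append(url)

        await save_state()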
1 comment
Hi, I have a custom Python + requests Actor that works great. It's pretty simple: it works through a list of starting URLs and pulls out one piece of information per URL.

My question is: if (for example) one run of 1,000 input URLs takes an hour to complete, I would like to parallelize it 4 ways so that I can run 4,000 URLs in an hour.

What's the best way to do this? I could kick off 4 copies of the run with segmented data, but this seems like something Apify could support natively.

I saw that if I were using Crawlee (and therefore JS) I could use autoscaling: https://docs.apify.com/platform/actors/running/usage-and-resources. But is there a way to build a single Python-based Actor that uses more threads/CPU cores when needed?
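
A minimal sketch of the single-Actor option: run the blocking requests calls through a small thread pool so the network waits overlap. The startUrls input field name is an assumption, and this presumes the work is I/O-bound:
Python
import asyncio
from concurrent.futures import ThreadPoolExecutor

import requests
from apify import Actor

def fetch_one(url: str) -> dict:
    # Blocking request; the thread pool lets several of these overlap.
    resp = requests.get(url, timeout=30)
    return {'url': url, 'status': resp.status_code}

async def main() -> None:
    async with Actor:
        actor_input = await Actor.get_input() or {}
        urls = actor_input.get('startUrls', [])  # hypothetical input field

        loop = asyncio.get_running_loop()
        # 4 worker threads ~= 4x throughput for I/O-bound scraping,
        # without blocking the SDK's own event loop.
        with ThreadPoolExecutor(max_workers=4) as pool:
            results = await asyncio.gather(
                *(loop.run_in_executor(pool, fetch_one, url) for url in urls)
            )
        await Actor.push_data(results)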
4 comments
My example use case: I want to track 1,000 TikTok profiles over time. Let's say once a week I want to scrape 1,000 profile pages and each profile's most recent video. I set up Apify, and it seems to work well. But there are three different options, and I'm not certain how the cost works out for each. For my example, I took 100 random profiles and gave them to each of these Actors. Each returns 89 results when asking for one video per profile. (The other 11 I assume were broken / private / bad links; that's fine.)

(1) TikTok Scraper Actor: https://console.apify.com/actors/GdWCkxBtKWOsKjdch/information/latest/readme - $49/mo. Scraping my 100 example profiles with 89 results cost me $0.439, ~0.5 cents per result, charged on top of the $49/month(?). So crawling 4,000 profiles a month would cost me $49 + $20 = $69 a month.

(2) TikTok Profile Scraper Actor: https://console.apify.com/actors/0FXVyOXXEmdGcV88a/information/latest/readme - $5 / 1,000 videos (0.5 cents per video). This does end up charging me 0.5 cents/result, which would cost me $20 a month.

(3) Free TikTok Scraper Actor: https://console.apify.com/actors/OtzYfK1ndEGdwWFKQ/information/latest/readme - free(?), yet it cost me $0.379, ~0.43 cents per result. This would cost me $17.20 a month.

My question is: why would I use (1), when it'll cost me $49/mo plus 0.5 cents per result, when I could use (2) or (3)? And what makes (3) "free"? I'm still being charged a fee per result. What am I missing?
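
For anyone checking the math, a quick back-of-the-envelope script reproducing the monthly totals above (per-result rates derived from the 89-result test runs; small rounding differences expected):
Python
# Monthly cost for 1,000 profiles/week (~4,000 results/month).
results_per_month = 1000 * 4

options = {
    # (monthly rental $, per-result $) -- rates from the runs above
    'TikTok Scraper':         (49.0, 0.439 / 89),  # ~0.49 cents/result
    'TikTok Profile Scraper': (0.0,  5.0 / 1000),  # advertised $5/1,000 videos
    'Free TikTok Scraper':    (0.0,  0.379 / 89),  # ~0.43 cents/result
}

for name, (rental, per_result) in options.items():
    total = rental + per_result * results_per_month
    print(f'{name}: ${total:.2f}/month')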
6 comments
In the middle of a Python Playwright run, we are getting this error:
Plain Text
ERROR Error in websocket connection
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 1301, in close_connection
    await self.transfer_data_task
  File "/usr/local/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 974, in transfer_data
    await asyncio.shield(self._put_message_waiter)
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/apify/event_manager.py", line 222, in _process_platform_messages
    async for message in websocket:
  File "/usr/local/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 498, in __aiter__
    yield await self.recv()
          ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 568, in recv
    await self.ensure_open()
  File "/usr/local/lib/python3.11/site-packages/websockets/legacy/protocol.py", line 939, in ensure_open
    raise self.connection_closed_exc()
websockets.exceptions.ConnectionClosedError: sent 1011 (internal error) keepalive ping timeout; no close frame received


We do not have any websocket code in our Actor. The traceback does not have any reference to our code.
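
For context: the websocket in the traceback is opened by the SDK's own event manager, not user code. One common cause of a keepalive ping timeout in async Python Actors is the event loop being blocked by long synchronous work, which stops that websocket from answering the platform's pings in time. A minimal sketch of one mitigation, assuming that's the cause; blocking_parse and pages_to_process are hypothetical stand-ins:
Python
import asyncio

from apify import Actor

def blocking_parse(html: str) -> dict:
    # Hypothetical stand-in for CPU-heavy synchronous work
    # (big parses, sync I/O) that can starve the event loop.
    return {'length': len(html)}

async def main() -> None:
    async with Actor:
        pages_to_process = ['<html>example</html>']  # hypothetical work items
        for html in pages_to_process:
            # Run the blocking step in a worker thread so the loop
            # stays free to answer the websocket keepalive pings.
            result = await asyncio.to_thread(blocking_parse, html)
            await Actor.push_data(result)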
7 comments