Apify and Crawlee Official Forum

HonzaS
Offline, last seen 2 weeks ago
Joined August 30, 2024
Hi there,
was there a change in how much it costs to write to the request queue?
The running cost of the actor is now determined predominantly by writes to the request queue rather than by how many compute units (CUs) it uses.
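Not an answer to the pricing itself, but if queue writes dominate and the start URLs are known up front, Crawlee's RequestList (persisted to the key-value store rather than the request queue) avoids per-request queue writes. A minimal sketch, assuming Crawlee v3; the list name and URLs are placeholders:
Plain Text
import { CheerioCrawler, RequestList } from 'crawlee';

// A static URL list is persisted once to the key-value store,
// so processing it does not generate per-request queue writes.
const requestList = await RequestList.open('start-urls', [
    'https://example.com/page-1',
    'https://example.com/page-2',
]);

const crawler = new CheerioCrawler({
    requestList,
    async requestHandler({ request }) {
        console.log(`Processed ${request.url}`);
    },
});

await crawler.run();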
9 comments
Hi there, I have a problem running CheerioCrawler with the Apify Czech proxies on the platform because I get this error. A crawler with the same proxies works locally; what could be the reason?
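For reference, a minimal sketch of how the proxy is usually wired up with the Apify SDK, to rule out a configuration difference between the local and platform runs (the group name 'CZECH' is a placeholder; use whatever group your plan exposes):
Plain Text
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

// Build the proxy configuration from an Apify proxy group;
// 'CZECH' is a placeholder for the actual group name.
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['CZECH'],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    async requestHandler({ request }) {
        console.log(`OK: ${request.url}`);
    },
});

await crawler.run(['https://example.com']);
await Actor.exit();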
8 comments
Hi, I need to upload a CSV file to Google Drive. I see there is a new integrations tab on Apify now, so I have some questions. Sadly, there are a lot of settings but no documentation.
Is it possible to upload a file from the key-value store via this integration?
I have managed to convert the CSV file to JSON, push it to the dataset, and then push it via the integration to Drive, but there are some issues:
  1. I do not know how to preserve the filename.
  2. Converting CSV to JSON changes the data slightly.
I know there is an actor for uploading to Google Drive, but I like that the integration has a Google sign-in button, so I do not need to worry about permissions.
Thanks for any suggestions
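I can't speak to what the integration tab supports, but if the goal is to keep the filename and the exact CSV bytes, the key-value store holds both verbatim, avoiding the CSV-to-JSON round trip. A minimal sketch, assuming the Apify SDK v3 ('report.csv' is a placeholder filename):
Plain Text
import { Actor } from 'apify';

await Actor.init();

// Storing the raw CSV under its original filename preserves both
// the name and the exact bytes.
const csvString = 'id,name\n1,Alice\n2,Bob';
await Actor.setValue('report.csv', csvString, { contentType: 'text/csv' });

await Actor.exit();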
6 comments
Plain Text
Failed to launch the browser process! undefined
2023-09-01T21:35:32.962Z [141:164:0901/213532.948704:ERROR:bus.cc(399)] Failed to connect to the bus: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2023-09-01T21:35:32.964Z [141:141:0901/213532.952831:ERROR:ozone_platform_x11.cc(240)] Missing X server or $DISPLAY
2023-09-01T21:35:32.966Z [141:141:0901/213532.952846:ERROR:env.cc(255)] The platform failed to initialize.  Exiting.


I got this error on the platform when trying to run this code:

Plain Text
const browser = await launchPuppeteer({
    useChrome: true,
    // Native Puppeteer options
    launchOptions: {
        headless: false,
    },
});
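For what it's worth, the log above points at the cause: the actor container has no X server ($DISPLAY is missing), so a headful launch cannot initialize a display. A sketch that avoids the crash on the platform, assuming the default actor image without Xvfb, is simply to keep the browser headless:
Plain Text
const browser = await launchPuppeteer({
    useChrome: true,
    launchOptions: {
        // The default actor container has no display server,
        // so the browser must run headless there.
        headless: true,
    },
});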
Hi, I have tried the OAuth process with this actor https://apify.com/lukaskrivka/google-sheets. I open the live view, then I click the authorize button as instructed here https://help.apify.com/en/articles/2424053-google-integration, but then I get this "App is blocked" message.
Does anybody know what to do about this? Is it a problem with the app or with some privacy setting on my Google account? I know this definitely worked some time ago.
12 comments
Hi, I have a CheerioCrawler run that has finished, but the queue is still showing 7 requests pending; is this normal?
1 comment
Hi, I have tried this feature https://docs.apify.com/platform/tutorials/crawl-urls-from-a-google-sheet
It looks like there is a bug where it does not parse out the whole URL when there is a comma inside it.
I tried it on this sheet https://docs.google.com/spreadsheets/d/14eS_kezUiZ13U1zEaDrb4s7xnmerJuHwG7wiRIPwBIM/edit#gid=0 and I even tried to wrap the URL in quotes ("), but it did not help.
Here is the result; you can see that the URLs requested are not the same as in the sheet:
https://api.apify.com/v2/datasets/vlTmoYRiFWawRdJsZ/items?clean=true&format=json
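As a workaround sketch while the parsing bug stands: percent-encoding the commas before the URLs go into the sheet keeps them intact, since %2C is equivalent to a literal comma for the target server.
Plain Text
// Replace literal commas in the URL's path/query with their
// percent-encoded form so a naive comma split cannot truncate it.
const encodeCommas = (url) => url.replace(/,/g, '%2C');

console.log(encodeCommas('https://example.com/search?tags=a,b,c'));
// -> https://example.com/search?tags=a%2Cb%2Cc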
14 comments
I want the actor to fill the dataset on each run, but I do not want it to keep accumulating items; I want only the items from that run. So before inserting, I drop the dataset, create a new named dataset with the same name, and insert the items. The problem is that the URL is based on the ID of the dataset, which is different each time. So is there some way to have a constant URL?
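One option, assuming the dataset keeps the same name across runs: the Apify API also accepts username~dataset-name in place of the dataset ID, and that URL stays constant even when the underlying dataset is dropped and recreated. A sketch (the names are placeholders):
Plain Text
// The ~-separated form resolves by name, not by ID, so it survives
// dropping and recreating the dataset under the same name.
const username = 'my-username';      // placeholder
const datasetName = 'daily-output';  // placeholder
const url = `https://api.apify.com/v2/datasets/${username}~${datasetName}/items?clean=true&format=json`;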
2 comments
I have run a Cheerio crawler on the platform and it logs lines like this:
2022-09-30T08:02:44.883Z WARN CheerioCrawler: Reclaiming failed request back to the list or queue. Cannot read properties of null (reading 'match') {"id":"PxlxlTCgnI7zPOi","url":"https://www........","retryCount":1}
I have two questions:
  1. Why is it logged as WARN instead of ERROR? I would prefer ERROR, in red; I believe it was always like that, was it changed?
  2. Why can't I see the file and line where the error occurred?
What should I change to solve this?
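Regarding the missing file and line: the reclaim message only prints the error message, not the stack. A sketch of getting the full stack out of Crawlee, assuming v3, via errorHandler (called on each failed attempt) or failedRequestHandler (called after retries are exhausted):
Plain Text
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $ }) {
        // ... scraping logic that may throw ...
    },
    // Called on every failed attempt, before the request is retried.
    errorHandler({ request }, error) {
        console.error(`Attempt failed for ${request.url}\n${error.stack}`);
    },
    // Called once the request has exhausted all retries.
    failedRequestHandler({ request }, error) {
        console.error(`Gave up on ${request.url}\n${error.stack}`);
    },
});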
2 comments
I have a weird case with a URL like this: https://www.firmy.cz/detail/13470923-veronika-vankova-mseno.html
With got-scraping it returns data, but with CheerioCrawler, or using sendRequest in BasicCrawler, I get just a small HTML page with this text (translated from Czech): 'You are probably using extensions or blockers that may affect the loading of this page. For the page to work correctly, please disable all such extensions and try loading it again.'
Can anybody clever tell me what the difference is? I thought CheerioCrawler and sendRequest both use got-scraping inside.
Thanks
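One difference worth ruling out: CheerioCrawler layers a session pool and cookie persistence on top of got-scraping, while a bare got-scraping call has neither. A sketch that switches those layers off to approximate the standalone behaviour (these are real CheerioCrawler options; whether firmy.cz keys on them is a guess):
Plain Text
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Disable the layers that a standalone got-scraping call lacks.
    useSessionPool: false,
    persistCookiesPerSession: false,
    async requestHandler({ request, body }) {
        console.log(request.url, body.length);
    },
});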
4 comments
Whole error: WARN CheerioCrawler: Reclaiming failed request back to the list or queue. Detected a session error, rotating session...
What does this error mean? It shows up when there is no webpage, for example here: http://www.cool-rent.eu/
Aside from being a really weird error message, the request is then retried even though I have maxRequestRetries: 0.
Can anything be done about it?
I have tried useSessionPool: false but it did not help.
Thanks
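A session rotation is tracked separately from a retry, which would explain why maxRequestRetries: 0 does not stop it. Newer Crawlee versions expose a separate cap for rotations; a sketch, assuming a version that has maxSessionRotations:
Plain Text
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestRetries: 0,
    // Session errors trigger rotations, not retries; cap them separately.
    maxSessionRotations: 0,
    async requestHandler({ request }) {
        console.log(`Loaded ${request.url}`);
    },
});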
10 comments
Doing it like this:
Plain Text
preNavigationHooks: [async (crawlingContext, gotOptions) => {
    const { request } = crawlingContext;
    request.payload = `.......`;
}],


does not work; error: ReferenceError: page is not defined
Also, when I want to set headers, should I use gotOptions.headers = or request.headers =?
What is the difference?
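For context, a sketch of both places a hook can touch, assuming CheerioCrawler: gotOptions is the per-call options object handed to got-scraping for this navigation only, while request is the Crawlee Request instance that is persisted in the queue and reused on retries.
Plain Text
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    preNavigationHooks: [
        async (crawlingContext, gotOptions) => {
            const { request } = crawlingContext;
            // Mutating the Request changes what is stored in the
            // queue and carried over to retries...
            request.headers = { ...request.headers, 'x-example': '1' };
            // ...while gotOptions only affects the single upcoming
            // got-scraping call made for this navigation.
            gotOptions.headers = { ...gotOptions.headers, 'x-example': '1' };
        },
    ],
    async requestHandler({ request }) {
        console.log(`Fetched ${request.url}`);
    },
});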
4 comments
Does anybody know how to pass the Cloudflare browser check with the Crawlee PlaywrightCrawler?
The site I have a problem with: https://www.g2.com/
I have tried residential proxies, no proxies, the Chrome and Firefox browsers, headful and headless, but nothing works.
My own Chrome browser passes the check both without proxies and with residential proxies, so I guess the proxy is not the problem. The problem is that Cloudflare somehow knows it is an automated browser.
In the Apify Store there is a working scraper for g2; it is written in Python, but at least I know it is possible.
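No guarantee against Cloudflare, but for completeness, a sketch of the Firefox-plus-fingerprints configuration that is commonly tried first with PlaywrightCrawler (these are real Crawlee/Playwright options; whether g2.com accepts the result is another matter):
Plain Text
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    launchContext: {
        // Firefox tends to trip fewer automation checks than Chromium.
        launcher: firefox,
        launchOptions: { headless: false },
    },
    browserPoolOptions: {
        // Inject generated browser fingerprints into each page.
        useFingerprints: true,
    },
    async requestHandler({ page }) {
        console.log(await page.title());
    },
});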
19 comments
I am trying to use a proxy with the Crawlee PlaywrightCrawler to connect to a page on a non-standard port (444) and I am getting this proxy error: PlaywrightCrawler: Reclaiming failed request back to the list or queue. page.goto: net::ERR_TUNNEL_CONNECTION_FAILED
Any suggestions?
Without the proxy it works fine locally. On the platform I get a timeout, which could be because of a banned AWS IP range.
6 comments
When I try to access a page through a proxy with Playwright, I get a captcha. Without the proxy it works with no problem. What is weird is that if I use the same proxy in a regular browser via the SwitchyOmega extension, the page also loads without a problem. So I think the page somehow detects that an automated browser is using the proxy.
Did anyone encounter this problem?
2 comments