Apify and Crawlee Official Forum

HonzaS
Offline, last seen 2 weeks ago
Joined August 30, 2024
Hi there,
was there a change in how much it costs to write to the request queue?
The running cost of the actor is now determined predominantly by writes to the request queue rather than by how many compute units (CUs) it uses.
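Not an answer to the pricing itself, but if queue writes dominate and the start URLs are known up front, Crawlee's RequestList (persisted to the key-value store rather than the request queue) avoids per-request queue writes. A minimal sketch, assuming Crawlee v3; the list name and URLs are placeholders:
Plain Text
import { CheerioCrawler, RequestList } from 'crawlee';

// A static URL list is persisted once to the key-value store,
// so processing it does not generate per-request queue writes.
const requestList = await RequestList.open('start-urls', [
    'https://example.com/page-1',
    'https://example.com/page-2',
]);

const crawler = new CheerioCrawler({
    requestList,
    async requestHandler({ request }) {
        console.log(`Processed ${request.url}`);
    },
});

await crawler.run();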
9 comments
Hi there, I have a problem running CheerioCrawler with the Apify Czech proxies on the platform because I get this error. A crawler with the same proxies works locally; what could be the reason?
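For reference, a minimal sketch of how the proxy is usually wired up with the Apify SDK, to rule out a configuration difference between the local and platform runs (the group name 'CZECH' is a placeholder; use whatever group your plan exposes):
Plain Text
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

// Build the proxy configuration from an Apify proxy group;
// 'CZECH' is a placeholder for the actual group name.
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['CZECH'],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    async requestHandler({ request }) {
        console.log(`OK: ${request.url}`);
    },
});

await crawler.run(['https://example.com']);
await Actor.exit();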
8 comments
Hi, I need to upload a CSV file to Google Drive. I see there is a new integrations tab on Apify now, so I have some questions. Sadly, there are a lot of settings but no documentation.
Is it possible to upload a file from the key-value store via this integration?
I have managed to convert the CSV file to JSON, push it to the dataset, and then push it via the integration to Drive, but there are some issues:
  1. I do not know how to preserve the filename.
  2. Converting CSV to JSON changes the data slightly.
I know there is an actor for uploading to Google Drive, but I like that the integration has a Google sign-in button, so I do not need to worry about permissions.
Thanks for any suggestions
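I can't speak to what the integration tab supports, but if the goal is to keep the filename and the exact CSV bytes, the key-value store holds both verbatim, avoiding the CSV-to-JSON round trip. A minimal sketch, assuming the Apify SDK v3 ('report.csv' is a placeholder filename):
Plain Text
import { Actor } from 'apify';

await Actor.init();

// Storing the raw CSV under its original filename preserves both
// the name and the exact bytes.
const csvString = 'id,name\n1,Alice\n2,Bob';
await Actor.setValue('report.csv', csvString, { contentType: 'text/csv' });

await Actor.exit();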
6 comments
Plain Text
Failed to launch the browser process! undefined
2023-09-01T21:35:32.962Z [141:164:0901/213532.948704:ERROR:bus.cc(399)] Failed to connect to the bus: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2023-09-01T21:35:32.964Z [141:141:0901/213532.952831:ERROR:ozone_platform_x11.cc(240)] Missing X server or $DISPLAY
2023-09-01T21:35:32.966Z [141:141:0901/213532.952846:ERROR:env.cc(255)] The platform failed to initialize.  Exiting.


I got this error on the platform when trying to run this code:

Plain Text
const browser = await launchPuppeteer({
    useChrome: true,
    // Native Puppeteer options
    launchOptions: {
        headless: false,
    },
});
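For what it's worth, the log above points at the cause: the actor container has no X server ($DISPLAY is missing), so a headful launch cannot initialize a display. A sketch that avoids the crash on the platform, assuming the default actor image without Xvfb, is simply to keep the browser headless:
Plain Text
const browser = await launchPuppeteer({
    useChrome: true,
    launchOptions: {
        // The default actor container has no display server,
        // so the browser must run headless there.
        headless: true,
    },
});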
Hi, I have tried the OAuth process with this actor https://apify.com/lukaskrivka/google-sheets. I open the live view, then I click the authorize button as instructed here https://help.apify.com/en/articles/2424053-google-integration, but then I get this "App is blocked" message.
Does anybody know what to do about this? Is it a problem with the app or with some privacy setting on my Google account? I know this definitely worked some time ago.
12 comments
Hi, I have a CheerioCrawler run that has finished, but the queue is still showing 7 requests pending; is this normal?
1 comment
Hi, I have tried this feature https://docs.apify.com/platform/tutorials/crawl-urls-from-a-google-sheet
It looks like there is a bug where it does not parse out the whole URL when there is a comma inside it.
I tried it on this sheet https://docs.google.com/spreadsheets/d/14eS_kezUiZ13U1zEaDrb4s7xnmerJuHwG7wiRIPwBIM/edit#gid=0 and I even tried to wrap the URL in quotes ("), but it did not help.
Here is the result; you can see that the URLs requested are not the same as in the sheet:
https://api.apify.com/v2/datasets/vlTmoYRiFWawRdJsZ/items?clean=true&format=json
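As a workaround sketch while the parsing bug stands: percent-encoding the commas before the URLs go into the sheet keeps them intact, since %2C is equivalent to a literal comma for the target server.
Plain Text
// Replace literal commas in the URL's path/query with their
// percent-encoded form so a naive comma split cannot truncate it.
const encodeCommas = (url) => url.replace(/,/g, '%2C');

console.log(encodeCommas('https://example.com/search?tags=a,b,c'));
// -> https://example.com/search?tags=a%2Cb%2Cc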
14 comments
I want the actor to fill the dataset on each run, but I do not want it to keep accumulating items; I want only the items from that run. So before inserting, I drop the dataset, create a new named dataset with the same name, and insert the items. The problem is that the URL is based on the ID of the dataset, which is different each time. So is there some way to have a constant URL?
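One option, assuming the dataset keeps the same name across runs: the Apify API also accepts username~dataset-name in place of the dataset ID, and that URL stays constant even when the underlying dataset is dropped and recreated. A sketch (the names are placeholders):
Plain Text
// The ~-separated form resolves by name, not by ID, so it survives
// dropping and recreating the dataset under the same name.
const username = 'my-username';      // placeholder
const datasetName = 'daily-output';  // placeholder
const url = `https://api.apify.com/v2/datasets/${username}~${datasetName}/items?clean=true&format=json`;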
2 comments
I have run a Cheerio crawler on the platform and it logs lines like this:
2022-09-30T08:02:44.883Z WARN CheerioCrawler: Reclaiming failed request back to the list or queue. Cannot read properties of null (reading 'match') {"id":"PxlxlTCgnI7zPOi","url":"https://www........","retryCount":1}
I have two questions:
  1. Why is it logged as WARN instead of ERROR? I would prefer ERROR, in red; I believe it was always like that, was it changed?
  2. Why can't I see the file and line where the error occurred?
What should I change to solve this?
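Regarding the missing file and line: the reclaim message only prints the error message, not the stack. A sketch of getting the full stack out of Crawlee, assuming v3, via errorHandler (called on each failed attempt) or failedRequestHandler (called after retries are exhausted):
Plain Text
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $ }) {
        // ... scraping logic that may throw ...
    },
    // Called on every failed attempt, before the request is retried.
    errorHandler({ request }, error) {
        console.error(`Attempt failed for ${request.url}\n${error.stack}`);
    },
    // Called once the request has exhausted all retries.
    failedRequestHandler({ request }, error) {
        console.error(`Gave up on ${request.url}\n${error.stack}`);
    },
});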
2 comments
I have a weird case with a URL like this: https://www.firmy.cz/detail/13470923-veronika-vankova-mseno.html
With got-scraping it returns data, but with CheerioCrawler, or using sendRequest in BasicCrawler, I get just a small HTML page with this text (translated from Czech): 'You are probably using extensions or blockers that may affect the loading of this page. For the page to work correctly, please disable all such extensions and try loading it again.'
Can anybody clever tell me what the difference is? I thought CheerioCrawler and sendRequest both use got-scraping inside.
Thanks
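One difference worth ruling out: CheerioCrawler layers a session pool and cookie persistence on top of got-scraping, while a bare got-scraping call has neither. A sketch that switches those layers off to approximate the standalone behaviour (these are real CheerioCrawler options; whether firmy.cz keys on them is a guess):
Plain Text
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Disable the layers that a standalone got-scraping call lacks.
    useSessionPool: false,
    persistCookiesPerSession: false,
    async requestHandler({ request, body }) {
        console.log(request.url, body.length);
    },
});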
4 comments
Whole error: WARN CheerioCrawler: Reclaiming failed request back to the list or queue. Detected a session error, rotating session...
What does this error mean? It shows up when there is no webpage, for example here: http://www.cool-rent.eu/
Aside from being a really weird error message, the request is then retried even though I have maxRequestRetries: 0.
Can anything be done about it?
I have tried useSessionPool: false but it did not help.
Thanks
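A session rotation is tracked separately from a retry, which would explain why maxRequestRetries: 0 does not stop it. Newer Crawlee versions expose a separate cap for rotations; a sketch, assuming a version that has maxSessionRotations:
Plain Text
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestRetries: 0,
    // Session errors trigger rotations, not retries; cap them separately.
    maxSessionRotations: 0,
    async requestHandler({ request }) {
        console.log(`Loaded ${request.url}`);
    },
});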
10 comments
Doing it like this:
Plain Text
preNavigationHooks: [async (crawlingContext, gotOptions) => {
    const { request } = crawlingContext;
    request.payload = `.......`;
}],


does not work; error: ReferenceError: page is not defined
Also, when I want to set headers, should I use gotOptions.headers = or request.headers =?
What is the difference?
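For context, a sketch of both places a hook can touch, assuming CheerioCrawler: gotOptions is the per-call options object handed to got-scraping for this navigation only, while request is the Crawlee Request instance that is persisted in the queue and reused on retries.
Plain Text
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    preNavigationHooks: [
        async (crawlingContext, gotOptions) => {
            const { request } = crawlingContext;
            // Mutating the Request changes what is stored in the
            // queue and carried over to retries...
            request.headers = { ...request.headers, 'x-example': '1' };
            // ...while gotOptions only affects the single upcoming
            // got-scraping call made for this navigation.
            gotOptions.headers = { ...gotOptions.headers, 'x-example': '1' };
        },
    ],
    async requestHandler({ request }) {
        console.log(`Fetched ${request.url}`);
    },
});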
4 comments
Does anybody know how to pass the Cloudflare browser check with the Crawlee PlaywrightCrawler?
The site I have a problem with: https://www.g2.com/
I have tried residential proxies, no proxies, the Chrome and Firefox browsers, headful and headless, but nothing works.
My own Chrome browser passes the check both without proxies and with residential proxies, so I guess the proxy is not the problem. The problem is that Cloudflare somehow knows it is an automated browser.
In the Apify Store there is a working scraper for g2; it is written in Python, but at least I know it is possible.
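No guarantee against Cloudflare, but for completeness, a sketch of the Firefox-plus-fingerprints configuration that is commonly tried first with PlaywrightCrawler (these are real Crawlee/Playwright options; whether g2.com accepts the result is another matter):
Plain Text
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    launchContext: {
        // Firefox tends to trip fewer automation checks than Chromium.
        launcher: firefox,
        launchOptions: { headless: false },
    },
    browserPoolOptions: {
        // Inject generated browser fingerprints into each page.
        useFingerprints: true,
    },
    async requestHandler({ page }) {
        console.log(await page.title());
    },
});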
19 comments
I am trying to use a proxy with the Crawlee PlaywrightCrawler to connect to a page on a non-standard port (444) and I am getting this proxy error: PlaywrightCrawler: Reclaiming failed request back to the list or queue. page.goto: net::ERR_TUNNEL_CONNECTION_FAILED
Any suggestions?
Without the proxy it works fine locally. On the platform I get a timeout, which could be because of a banned AWS IP range.
6 comments
When I try to access a page through a proxy with Playwright, I get a captcha. Without the proxy it works with no problem. What is weird is that if I use the same proxy in a regular browser via the SwitchyOmega extension, the page also loads without a problem. So I think the page somehow detects that an automated browser is using the proxy.
Did anyone encounter this problem?
2 comments