Apify and Crawlee Official Forum

curioussoul
Offline, last seen last month
Joined August 30, 2024
I have a custom function which opens a browser to get cookies. The problem is that my machine is very small, and when multiple sessions are created it tries to open many browsers at the same time. Can I somehow make session creation sequential? Even though I need thousands of sessions, at any point in time only one session should be created, with no sessions created in parallel, so that only one browser instance is running at a time.

createSessionFunction: async (sessionPool, options) => {
    var new_session = new Session({ sessionPool });
    var proxyurl = await proxyConfigurationDE.newUrl(new_session.id);
    console.log(proxyurl, new_session.id);
    var cookies_g = await getCookiesDE(proxyurl, 'https://www.example.com');
    console.log("cookies from playwright..", cookies_g);
    new_session.setCookies(cookies_g, 'https://www.example.com');
    console.log("Checking cookies set..", new_session.getCookieString("example.com"));
    return new_session;
}
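One approach that should work, sketched under the assumption that getCookiesDE and proxyConfigurationDE are the helpers from the snippet above: serialize the createSessionFunction calls with a simple promise-chain mutex, so the next session (and the next browser) is only created after the previous one has finished.

// A minimal sketch: only one session creation runs at a time.
// sessionCreationLock is a hypothetical module-level variable.
let sessionCreationLock = Promise.resolve();

const createSessionFunction = async (sessionPool, options) => {
    // Chain onto the previous creation; the next caller waits until it settles.
    const run = sessionCreationLock.then(async () => {
        const newSession = new Session({ sessionPool });
        const proxyUrl = await proxyConfigurationDE.newUrl(newSession.id);
        const cookies = await getCookiesDE(proxyUrl, 'https://www.example.com');
        newSession.setCookies(cookies, 'https://www.example.com');
        return newSession;
    });
    // Keep the chain alive even if one creation fails.
    sessionCreationLock = run.catch(() => {});
    return run;
};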
2 comments
Dear all, I am trying to test out SessionPool and prepare each session with some default cookies. I create a session pool like this:

const sessionPool_de = await SessionPool.open({
    maxPoolSize: 25,
    sessionOptions: {
        maxAgeSecs: 10,
        maxUsageCount: 150, // for example when you know that the site blocks after 150 requests
    },
    persistStateKeyValueStoreId: 'main_session',
    persistStateKey: 'location-specific-session-pool',
});

This is how I set cookies and then try to check whether they are properly set for the session. But it is always empty, and I don't see any cookies in the key-value store either.

var session1 = await sessionPool_de.getSession();
var proxyurl = await proxyConfigurationEU.newUrl(session1.id);
console.log(proxyurl);
var cookies_g = [
    { name: 'cookie1', value: 'my-cookie' },
    { name: 'cookie2', value: 'your-cookie' },
];
console.log("original cookies ", cookies_g, "Session_main", session1.id);
session1.setCookies(cookies_g, 'https://www.example.com');
console.log("getting cookies", session1.getCookies("www.example.com"));

My session1.getCookies() call always returns an empty array. Is there a solution for this? How can I debug it further?
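A minimal debugging sketch, under two assumptions: setCookies() and getCookies() are given full URLs (including the protocol) on a matching domain, and the session has not already expired (maxAgeSecs: 10 retires it ten seconds after creation).

const session = await sessionPool_de.getSession();

const cookies = [
    { name: 'cookie1', value: 'my-cookie' },
    { name: 'cookie2', value: 'your-cookie' },
];

session.setCookies(cookies, 'https://www.example.com');

// Read them back with the same full URL that was used to set them.
console.log(session.getCookies('https://www.example.com'));
console.log(session.getCookieString('https://www.example.com'));

// Persisting the pool state writes the cookie jars to the key-value store.
await sessionPool_de.persistState();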
2 comments
Dear all, I am trying to use createSessionFunction to create a session and set some basic cookies from a response. The problem is: how can I make a request to an endpoint to get cookies inside createSessionFunction?

My basic code is below, and I am wondering what the best way is to get cookies without breaking the flow of the crawler.

createSessionFunction: async (sessionPool, options) => {
    const session1 = await sessionPool.getSession();
    const proxyurl = await proxyConfigurationEU.newUrl(session1.id);
    // Get cookies
    session1.setCookiesFromResponse(response);
    return session1;
}
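One way to do this, as a rough sketch: make the request with gotScraping (from the got-scraping package) through the session's proxy and feed the response into setCookiesFromResponse(). The endpoint URL is a placeholder, and creating the Session directly avoids calling sessionPool.getSession() inside createSessionFunction, which would recurse into session creation.

import { Session } from 'crawlee';
import { gotScraping } from 'got-scraping';

const createSessionFunction = async (sessionPool, options) => {
    const session = new Session({ sessionPool });
    const proxyUrl = await proxyConfigurationEU.newUrl(session.id);

    // Placeholder for the endpoint that hands out the cookies.
    const response = await gotScraping({
        url: 'https://www.example.com/cookie-endpoint',
        proxyUrl,
    });

    // Copies the Set-Cookie headers from the response into the session's cookie jar.
    session.setCookiesFromResponse(response);
    return session;
};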
2 comments
I have a weird situation. Whenever I try to access a website via Crawlee with a proxy, the request is blocked, but with the same proxy I can access the website without any problem on my own machine, in several other browsers, and in incognito mode. It's really puzzling me.
Any help would be highly appreciated. Thank you.
2 comments
Hello all, I have a special situation: the website's response depends on the location of the IP address, but there is a way to change that location. It works by calling an endpoint which returns cookies. I want to scrape the URLs once I have those cookies. How can I do that with Crawlee, and how will those cookies be managed with sessions? It's a bit complicated to explain, but I hope you get the idea of what I want. Thank you for reading this long post.
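A rough sketch of one way to wire this up with CheerioCrawler, assuming the cookie endpoint can be called once per session; fetchLocationCookies below is a hypothetical helper that calls that endpoint and returns playwright/puppeteer-style cookie objects.

import { CheerioCrawler, Session } from 'crawlee';

const crawler = new CheerioCrawler({
    proxyConfiguration,
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: {
        maxPoolSize: 10,
        createSessionFunction: async (sessionPool) => {
            const session = new Session({ sessionPool });
            // Call the location endpoint once and store its cookies on this session.
            const cookies = await fetchLocationCookies();
            session.setCookies(cookies, 'https://www.example.com');
            return session;
        },
    },
    requestHandler: async ({ request, session, $ }) => {
        // The session's cookies are attached to outgoing requests because
        // persistCookiesPerSession is enabled, so each request keeps its "location".
    },
});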
6 comments
Dear all, I am experimenting with CheerioCrawler to scrape Amazon. I followed the tutorial online and it works for Germany, but the same crawler gets detected as a bot for the US. For Germany I am using a German datacenter proxy and it works, but a US datacenter proxy does not work for the USA. Below is the configuration. I am building an Amazon scraper for multiple marketplaces, and this inconsistency makes it challenging.


const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestQueue: queue,
    useSessionPool: true,
    persistCookiesPerSession: true,
    maxRequestRetries: 20,
    maxRequestsPerMinute: 250,
    autoscaledPoolOptions: {
        maxConcurrency: 100,
        minConcurrency: 5,
        isFinishedFunction: async () => {
            // Tell the pool whether it should finish
            // or wait for more tasks to become available.
            // Return true or false
            return false;
        },
    },
    failedRequestHandler: async (context) => rebirth_requests({ ...context }),
    requestHandler: async (context) => router({ ...context, dbPool }),
    // sessionPoolOptions: { blockedStatusCodes: [] },
});
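One thing worth trying, sketched under the assumption that you run on the Apify platform: create a separate proxy configuration per marketplace, and switch the US one to residential IPs if its datacenter ranges are being flagged. countryCode pins the exit country.

import { Actor } from 'apify';

const proxyConfigurationUS = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'], // residential exit IPs for the marketplace that blocks datacenter ranges
    countryCode: 'US',
});

const proxyConfigurationDE = await Actor.createProxyConfiguration({
    countryCode: 'DE', // the German datacenter setup that already works
});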
9 comments
Dear all, I am building a simple API that, on each call, adds URLs via the crawler.addRequests() method. On the first call it's quite fast, but on the second and subsequent calls it's extremely slow. I thought this delay might come from me not using the request queue properly. This is what I found in the docs:


Note that RequestList can be used together with RequestQueue by the same crawler. In such cases, each request from RequestList is enqueued into RequestQueue first and then consumed from the latter. This is necessary to avoid the same URL being processed more than once (from the list first and then possibly from the queue). In practical terms, such a combination can be useful when there is a large number of initial URLs, but more URLs would be added dynamically by the crawler.

Can someone please share some sample code?
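A minimal sketch of the RequestList + RequestQueue combination the quoted docs describe; the URLs are placeholders.

import { CheerioCrawler, RequestList, RequestQueue } from 'crawlee';

// The list holds the large batch of initial URLs.
const requestList = await RequestList.open('start-urls', [
    'https://www.example.com/page-1',
    'https://www.example.com/page-2',
]);

// The queue receives URLs added later, e.g. from an API handler.
const requestQueue = await RequestQueue.open();

const crawler = new CheerioCrawler({
    requestList,
    requestQueue,
    requestHandler: async ({ request, $ }) => {
        // process the page...
    },
});

// URLs added later go straight into the queue.
await requestQueue.addRequests([{ url: 'https://www.example.com/page-3' }]);

await crawler.run();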
5 comments
Dear all, how can I use SOCKS5 proxies with Crawlee? Also, in general, if a proxy is password protected, how do I put the credentials into the proxyUrl?
I didn't find any example of using password-protected proxies, and SOCKS5 proxies are not supported by default.

Is there any way to get around this?

Best Regards
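A sketch of the usual way to pass password-protected proxies, with the credentials embedded in the proxy URL. The SOCKS5 line is an assumption that only holds on newer Crawlee versions where socks:// URLs are accepted, so check the version you are on first.

import { ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        // user and password go straight into the URL
        'http://my-user:my-password@proxy.example.com:8000',
        // 'socks5://my-user:my-password@proxy.example.com:1080', // only on Crawlee versions with SOCKS support
    ],
});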
5 comments
Dear all, after trying out browser-based data extraction, I need to enhance the crawler and make it lightweight. But CheerioCrawler doesn't work for sites that have Cloudflare security, because it just grabs the very first HTML (which comes from Cloudflare) and that's it.

Is there any way to wait for the challenge to complete? Also, any suggestions for finding the actual endpoint that loads the data? I am using DevTools, but it looks like the JS and the data are quite strongly coupled.

Any help would be highly appreciated.

Thanks for the great work. 🙂
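If you do manage to locate the data endpoint in DevTools (Network tab, filtered to Fetch/XHR), a sketch of replaying it with CheerioCrawler might look like the following; the URL and headers are placeholders. If the data only appears after the Cloudflare challenge has run in a browser, falling back to PlaywrightCrawler for those pages is the more reliable option.

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    proxyConfiguration,
    // Allow JSON responses; by default CheerioCrawler only accepts HTML.
    additionalMimeTypes: ['application/json'],
    requestHandler: async ({ request, json, body }) => {
        // For JSON endpoints the parsed body is exposed as `json`.
        console.log(request.url, json ?? body.toString().slice(0, 200));
    },
});

await crawler.run([{
    url: 'https://www.example.com/api/products?page=1', // hypothetical endpoint found in DevTools
    headers: {
        accept: 'application/json',
        // copy any required headers (referer, tokens, cookies) from DevTools
    },
}]);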
2 comments
Dear all, I am trying to set cookies on a session, but the cookies are only set if the cookie objects contain just the name and value keys. If any other key is present, the cookie is not set. Can you please guide me on how to debug this further?

The cookies come from Playwright crawlers as a list of objects. I then pass that list of objects to Session.setCookies(), but it doesn't work.

Best Regards
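A debugging sketch, assuming the culprit is one of the extra Playwright fields (for example the expires: -1 that Playwright uses for session cookies): normalize the objects down to the basic fields before calling setCookies(), then add keys back one at a time to find the one that breaks. The session and page objects here are the ones from your crawler.

// Cookies straight from the Playwright context.
const playwrightCookies = await page.context().cookies();

const normalized = playwrightCookies.map(({ name, value, domain, path, expires, secure, httpOnly }) => ({
    name,
    value,
    domain,
    path,
    secure,
    httpOnly,
    // Drop the -1 that Playwright uses for session cookies.
    ...(expires && expires > 0 ? { expires } : {}),
}));

session.setCookies(normalized, 'https://www.example.com');
console.log(session.getCookieString('https://www.example.com'));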
Dear all, I am trying to scrape data from a public API. For some reason CheerioCrawler is not getting the data back, but in Postman I can easily get the data. The proxy IP is whitelisted, because I am using the same IP for Postman and for Cheerio.

Postman does add some default headers, but when I look at my request object the headers are empty. Does anyone know at which point Cheerio sets the headers and generates the fingerprint, and how can I see them?

Request {
    id: 'OBTRQI5zvA4aIJ9',
    url: 'https://someapi.com',
    loadedUrl: 'https://someapi.com',
    uniqueKey: '22586062-3f0d-40be-b499-f1a00261b5d3',
    method: 'GET',
    payload: undefined,
    noRetry: false,
    retryCount: 0,
    errorMessages: [],
    headers: {},
    userData: [Getter/Setter],
    handledAt: undefined
}


Any help would be highly appreciated. Thanks.
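The empty request.headers is expected: the browser-like headers are generated by got-scraping at request time, not stored on the Request object. A sketch for inspecting or overriding what actually goes out, using preNavigationHooks (the header values below are just examples, not recommendations):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    proxyConfiguration,
    preNavigationHooks: [
        async (crawlingContext, gotOptions) => {
            // Force the headers you know the API accepts, e.g. what Postman sends.
            gotOptions.headers = {
                ...gotOptions.headers,
                accept: 'application/json',
            };
            console.log('Outgoing header overrides:', gotOptions.headers);
        },
    ],
    additionalMimeTypes: ['application/json'],
    requestHandler: async ({ request, json, body }) => {
        console.log(request.url, json ?? body.toString().slice(0, 200));
    },
});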
9 comments
Dear all, I have a bunch of proxies. Cloudflare and many other anti-bot protections check the IP address and the timezone, and I can see that because of this discrepancy my crawler is being detected as a bot. How can I bind a launch context with the correct timezone and locale to a specific proxy + browser?

I have seen the proxy configuration, but how can I tell the browser being launched to use a specific context?

Thanks in advance.
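A sketch of one possible wiring, assuming PlaywrightCrawler with incognito pages: with useIncognitoPages enabled, the pageOptions passed to browser-pool's prePageCreateHooks become the Playwright context options, so timezoneId and locale can be set there based on the proxy attached to that browser. timezoneForProxy is a hypothetical helper mapping a proxy URL (or its country) to a timezone and locale.

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    launchContext: {
        useIncognitoPages: true, // one context per page, so the options below apply per page
    },
    browserPoolOptions: {
        prePageCreateHooks: [
            (pageId, browserController, pageOptions) => {
                // The proxy this browser was launched with.
                const proxyUrl = browserController.launchContext.proxyUrl;
                // e.g. { timezoneId: 'Europe/Berlin', locale: 'de-DE' } for a German proxy
                const { timezoneId, locale } = timezoneForProxy(proxyUrl);
                Object.assign(pageOptions, { timezoneId, locale });
            },
        ],
    },
    requestHandler: async ({ page, request }) => {
        // ...
    },
});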
6 comments
Dear all, I am currently experimenting with bypassing Cloudflare security. I am using Playwright but getting detected. My IP is whitelisted. Can anyone please guide me on how to create a fingerprint that lets me bypass Cloudflare?

Thank you in advance.
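For what it's worth, a sketch of the fingerprint knobs Crawlee exposes for browser crawlers: narrowing the generated fingerprints to a consistent browser, OS and locale (ideally matching your proxy's country) sometimes helps, but it is not a guaranteed Cloudflare bypass.

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    browserPoolOptions: {
        useFingerprints: true, // enabled by default in recent Crawlee versions
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['chrome'],
                operatingSystems: ['windows'],
                devices: ['desktop'],
                locales: ['en-US'],
            },
        },
    },
    requestHandler: async ({ page }) => {
        // ...
    },
});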
10 comments