Apify and Crawlee Official Forum

curioussoul
Offline, last seen last month
Joined August 30, 2024
I have a custom function which opens a browser to get cookies. The problem is that my machine is very small, and when multiple sessions are created it tries to open many browsers at the same time. Can I somehow make session creation sequential? Even though I need thousands of sessions, at any point in time only one session should be created, with no sessions created in parallel, so that only one browser instance is running at a time.

createSessionFunction: async (sessionPool, options) => {
    var new_session = new Session({ sessionPool });
    var proxyurl = await proxyConfigurationDE.newUrl(new_session.id);
    console.log(proxyurl, new_session.id);
    var cookies_g = await getCookiesDE(proxyurl, 'https://www.example.com');
    console.log("cookies from playwright..", cookies_g);
    new_session.setCookies(cookies_g, 'https://www.example.com');
    console.log("Checking cookies set..", new_session.getCookieString("example.com"));
    return new_session;
}
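One approach that should work, sketched under the assumption that getCookiesDE and proxyConfigurationDE are the helpers from the snippet above: serialize the createSessionFunction calls with a simple promise-chain mutex, so the next session (and the next browser) is only created after the previous one has finished.

// A minimal sketch: only one session creation runs at a time.
// sessionCreationLock is a hypothetical module-level variable.
let sessionCreationLock = Promise.resolve();

const createSessionFunction = async (sessionPool, options) => {
    // Chain onto the previous creation; the next caller waits until it settles.
    const run = sessionCreationLock.then(async () => {
        const newSession = new Session({ sessionPool });
        const proxyUrl = await proxyConfigurationDE.newUrl(newSession.id);
        const cookies = await getCookiesDE(proxyUrl, 'https://www.example.com');
        newSession.setCookies(cookies, 'https://www.example.com');
        return newSession;
    });
    // Keep the chain alive even if one creation fails.
    sessionCreationLock = run.catch(() => {});
    return run;
};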
2 comments
Dear all, I am trying to test out SessionPool and prepare each session with some default cookies. I create a session pool like this:

const sessionPool_de = await SessionPool.open({
    maxPoolSize: 25,
    sessionOptions: {
        maxAgeSecs: 10,
        maxUsageCount: 150, // for example when you know that the site blocks after 150 requests
    },
    persistStateKeyValueStoreId: 'main_session',
    persistStateKey: 'location-specific-session-pool',
});

This is how I set cookies and then try to check whether they are properly set for the session. But it is always empty, and I don't see any cookies in the key-value store either.

var session1 = await sessionPool_de.getSession();
var proxyurl = await proxyConfigurationEU.newUrl(session1.id);
console.log(proxyurl);
var cookies_g = [
    { name: 'cookie1', value: 'my-cookie' },
    { name: 'cookie2', value: 'your-cookie' },
];
console.log("original cookies ", cookies_g, "Session_main", session1.id);
session1.setCookies(cookies_g, 'https://www.example.com');
console.log("getting cookies", session1.getCookies("www.example.com"));

My session1.getCookies() call always returns an empty array. Is there a solution for this? How can I debug it further?
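A minimal debugging sketch, under two assumptions: setCookies() and getCookies() are given full URLs (including the protocol) on a matching domain, and the session has not already expired (maxAgeSecs: 10 retires it ten seconds after creation).

const session = await sessionPool_de.getSession();

const cookies = [
    { name: 'cookie1', value: 'my-cookie' },
    { name: 'cookie2', value: 'your-cookie' },
];

session.setCookies(cookies, 'https://www.example.com');

// Read them back with the same full URL that was used to set them.
console.log(session.getCookies('https://www.example.com'));
console.log(session.getCookieString('https://www.example.com'));

// Persisting the pool state writes the cookie jars to the key-value store.
await sessionPool_de.persistState();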
2 comments
Dear all, I am trying to use createSessionFunction to create a session and set some basic cookies from a response. The problem is: how can I make a request to an endpoint to get cookies inside createSessionFunction?

My basic code is below, and I am wondering what the best way is to get cookies without breaking the flow of the crawler.

createSessionFunction: async (sessionPool, options) => {
    const session1 = await sessionPool.getSession();
    const proxyurl = await proxyConfigurationEU.newUrl(session1.id);
    // Get cookies
    session1.setCookiesFromResponse(response);
    return session1;
}
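One way to do this, as a rough sketch: make the request with gotScraping (from the got-scraping package) through the session's proxy and feed the response into setCookiesFromResponse(). The endpoint URL is a placeholder, and creating the Session directly avoids calling sessionPool.getSession() inside createSessionFunction, which would recurse into session creation.

import { Session } from 'crawlee';
import { gotScraping } from 'got-scraping';

const createSessionFunction = async (sessionPool, options) => {
    const session = new Session({ sessionPool });
    const proxyUrl = await proxyConfigurationEU.newUrl(session.id);

    // Placeholder for the endpoint that hands out the cookies.
    const response = await gotScraping({
        url: 'https://www.example.com/cookie-endpoint',
        proxyUrl,
    });

    // Copies the Set-Cookie headers from the response into the session's cookie jar.
    session.setCookiesFromResponse(response);
    return session;
};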
2 comments
I have a weird situation. Whenever I try to access a website via Crawlee with a proxy, the request is blocked, but with the same proxy I can access the website without any problem on my own machine, in several other browsers, and in incognito mode. It's really puzzling me.
Any help would be highly appreciated. Thank you.
2 comments
Hello all, I have a special situation: the website's response depends on the location of the IP address, but there is a way to change that location. It works by calling an endpoint which returns cookies. I want to scrape the URLs once I have those cookies. How can I do that with Crawlee, and how will those cookies be managed with sessions? It's a bit complicated to explain, but I hope you get the idea of what I want. Thank you for reading this long post.
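A rough sketch of one way to wire this up with CheerioCrawler, assuming the cookie endpoint can be called once per session; fetchLocationCookies below is a hypothetical helper that calls that endpoint and returns playwright/puppeteer-style cookie objects.

import { CheerioCrawler, Session } from 'crawlee';

const crawler = new CheerioCrawler({
    proxyConfiguration,
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: {
        maxPoolSize: 10,
        createSessionFunction: async (sessionPool) => {
            const session = new Session({ sessionPool });
            // Call the location endpoint once and store its cookies on this session.
            const cookies = await fetchLocationCookies();
            session.setCookies(cookies, 'https://www.example.com');
            return session;
        },
    },
    requestHandler: async ({ request, session, $ }) => {
        // The session's cookies are attached to outgoing requests because
        // persistCookiesPerSession is enabled, so each request keeps its "location".
    },
});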
6 comments
Dear all, I am experimenting with CheerioCrawler to scrape Amazon. I followed the tutorial online and it works for Germany, but the same crawler gets detected as a bot for the US. For Germany I am using a German datacenter proxy and it works, but a US datacenter proxy does not work for the USA. Below is the configuration. I am building an Amazon scraper for multiple marketplaces, and this inconsistency makes it challenging.


const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestQueue: queue,
    useSessionPool: true,
    persistCookiesPerSession: true,
    maxRequestRetries: 20,
    maxRequestsPerMinute: 250,
    autoscaledPoolOptions: {
        maxConcurrency: 100,
        minConcurrency: 5,
        isFinishedFunction: async () => {
            // Tell the pool whether it should finish
            // or wait for more tasks to become available.
            // Return true or false
            return false;
        },
    },
    failedRequestHandler: async (context) => rebirth_requests({ ...context }),
    requestHandler: async (context) => router({ ...context, dbPool }),
    // sessionPoolOptions: { blockedStatusCodes: [] },
});
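One thing worth trying, sketched under the assumption that you run on the Apify platform: create a separate proxy configuration per marketplace, and switch the US one to residential IPs if its datacenter ranges are being flagged. countryCode pins the exit country.

import { Actor } from 'apify';

const proxyConfigurationUS = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'], // residential exit IPs for the marketplace that blocks datacenter ranges
    countryCode: 'US',
});

const proxyConfigurationDE = await Actor.createProxyConfiguration({
    countryCode: 'DE', // the German datacenter setup that already works
});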
9 comments
Dear all, I am building a simple API that, on each call, adds URLs via the crawler.addRequests() method. On the first call it's quite fast, but on the second and subsequent calls it's extremely slow. I thought this delay might come from me not using the request queue properly. This is what I found in the docs:


Note that RequestList can be used together with RequestQueue by the same crawler. In such cases, each request from RequestList is enqueued into RequestQueue first and then consumed from the latter. This is necessary to avoid the same URL being processed more than once (from the list first and then possibly from the queue). In practical terms, such a combination can be useful when there is a large number of initial URLs, but more URLs would be added dynamically by the crawler.

Can someone please share some sample code?
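A minimal sketch of the RequestList + RequestQueue combination the quoted docs describe; the URLs are placeholders.

import { CheerioCrawler, RequestList, RequestQueue } from 'crawlee';

// The list holds the large batch of initial URLs.
const requestList = await RequestList.open('start-urls', [
    'https://www.example.com/page-1',
    'https://www.example.com/page-2',
]);

// The queue receives URLs added later, e.g. from an API handler.
const requestQueue = await RequestQueue.open();

const crawler = new CheerioCrawler({
    requestList,
    requestQueue,
    requestHandler: async ({ request, $ }) => {
        // process the page...
    },
});

// URLs added later go straight into the queue.
await requestQueue.addRequests([{ url: 'https://www.example.com/page-3' }]);

await crawler.run();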
5 comments
Dear all, how can I use SOCKS5 proxies with Crawlee? Also, in general, if a proxy is password protected, how do I put the credentials into the proxyUrl?
I didn't find any example of using password-protected proxies, and SOCKS5 proxies are not supported by default.

Is there any way to get around this?

Best Regards
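A sketch of the usual way to pass password-protected proxies, with the credentials embedded in the proxy URL. The SOCKS5 line is an assumption that only holds on newer Crawlee versions where socks:// URLs are accepted, so check the version you are on first.

import { ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        // user and password go straight into the URL
        'http://my-user:my-password@proxy.example.com:8000',
        // 'socks5://my-user:my-password@proxy.example.com:1080', // only on Crawlee versions with SOCKS support
    ],
});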
5 comments
Dear all, after trying out browser-based data extraction, I need to enhance the crawler and make it lightweight. But CheerioCrawler doesn't work for sites that have Cloudflare security, because it just grabs the very first HTML (which comes from Cloudflare) and that's it.

Is there any way to wait for the challenge to complete? Also, any suggestions for finding the actual endpoint that loads the data? I am using DevTools, but it looks like the JS and the data are quite strongly coupled.

Any help would be highly appreciated.

Thanks for the great work. 🙂
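If you do manage to locate the data endpoint in DevTools (Network tab, filtered to Fetch/XHR), a sketch of replaying it with CheerioCrawler might look like the following; the URL and headers are placeholders. If the data only appears after the Cloudflare challenge has run in a browser, falling back to PlaywrightCrawler for those pages is the more reliable option.

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    proxyConfiguration,
    // Allow JSON responses; by default CheerioCrawler only accepts HTML.
    additionalMimeTypes: ['application/json'],
    requestHandler: async ({ request, json, body }) => {
        // For JSON endpoints the parsed body is exposed as `json`.
        console.log(request.url, json ?? body.toString().slice(0, 200));
    },
});

await crawler.run([{
    url: 'https://www.example.com/api/products?page=1', // hypothetical endpoint found in DevTools
    headers: {
        accept: 'application/json',
        // copy any required headers (referer, tokens, cookies) from DevTools
    },
}]);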
2 comments
Dear all, I am trying to set cookies on a session, but the cookies are only set if the cookie objects contain just the name and value keys. If any other key is present, the cookie is not set. Can you please guide me on how to debug this further?

The cookies come from Playwright crawlers as a list of objects. I then pass that list of objects to Session.setCookies(), but it doesn't work.

Best Regards
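A debugging sketch, assuming the culprit is one of the extra Playwright fields (for example the expires: -1 that Playwright uses for session cookies): normalize the objects down to the basic fields before calling setCookies(), then add keys back one at a time to find the one that breaks. The session and page objects here are the ones from your crawler.

// Cookies straight from the Playwright context.
const playwrightCookies = await page.context().cookies();

const normalized = playwrightCookies.map(({ name, value, domain, path, expires, secure, httpOnly }) => ({
    name,
    value,
    domain,
    path,
    secure,
    httpOnly,
    // Drop the -1 that Playwright uses for session cookies.
    ...(expires && expires > 0 ? { expires } : {}),
}));

session.setCookies(normalized, 'https://www.example.com');
console.log(session.getCookieString('https://www.example.com'));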
Dear all, I am trying to scrape data from a public API. For some reason CheerioCrawler is not getting the data back, but in Postman I can easily get the data. The proxy IP is whitelisted, because I am using the same IP for Postman and for Cheerio.

Postman does add some default headers, but when I look at my request object the headers are empty. Does anyone know at which point Cheerio sets the headers and generates the fingerprint, and how can I see them?

Request {
    id: 'OBTRQI5zvA4aIJ9',
    url: 'https://someapi.com',
    loadedUrl: 'https://someapi.com',
    uniqueKey: '22586062-3f0d-40be-b499-f1a00261b5d3',
    method: 'GET',
    payload: undefined,
    noRetry: false,
    retryCount: 0,
    errorMessages: [],
    headers: {},
    userData: [Getter/Setter],
    handledAt: undefined
}


Any help would be highly appreciated. Thanks.
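The empty request.headers is expected: the browser-like headers are generated by got-scraping at request time, not stored on the Request object. A sketch for inspecting or overriding what actually goes out, using preNavigationHooks (the header values below are just examples, not recommendations):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    proxyConfiguration,
    preNavigationHooks: [
        async (crawlingContext, gotOptions) => {
            // Force the headers you know the API accepts, e.g. what Postman sends.
            gotOptions.headers = {
                ...gotOptions.headers,
                accept: 'application/json',
            };
            console.log('Outgoing header overrides:', gotOptions.headers);
        },
    ],
    additionalMimeTypes: ['application/json'],
    requestHandler: async ({ request, json, body }) => {
        console.log(request.url, json ?? body.toString().slice(0, 200));
    },
});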
9 comments
Dear all, I have a bunch of proxies. Cloudflare and many other anti-bot protections check the IP address and the timezone, and I can see that because of this discrepancy my crawler is being detected as a bot. How can I bind a launch context with the correct timezone and locale to a specific proxy + browser?

I have seen the proxy configuration, but how can I tell the browser being launched to use a specific context?

Thanks in advance.
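A sketch of one possible wiring, assuming PlaywrightCrawler with incognito pages: with useIncognitoPages enabled, the pageOptions passed to browser-pool's prePageCreateHooks become the Playwright context options, so timezoneId and locale can be set there based on the proxy attached to that browser. timezoneForProxy is a hypothetical helper mapping a proxy URL (or its country) to a timezone and locale.

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    launchContext: {
        useIncognitoPages: true, // one context per page, so the options below apply per page
    },
    browserPoolOptions: {
        prePageCreateHooks: [
            (pageId, browserController, pageOptions) => {
                // The proxy this browser was launched with.
                const proxyUrl = browserController.launchContext.proxyUrl;
                // e.g. { timezoneId: 'Europe/Berlin', locale: 'de-DE' } for a German proxy
                const { timezoneId, locale } = timezoneForProxy(proxyUrl);
                Object.assign(pageOptions, { timezoneId, locale });
            },
        ],
    },
    requestHandler: async ({ page, request }) => {
        // ...
    },
});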
6 comments
Dear all, I am currently experimenting with bypassing Cloudflare security. I am using Playwright but getting detected. My IP is whitelisted. Can anyone please guide me on how to create a fingerprint that lets me bypass Cloudflare?

Thank you in advance.
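For what it's worth, a sketch of the fingerprint knobs Crawlee exposes for browser crawlers: narrowing the generated fingerprints to a consistent browser, OS and locale (ideally matching your proxy's country) sometimes helps, but it is not a guaranteed Cloudflare bypass.

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    browserPoolOptions: {
        useFingerprints: true, // enabled by default in recent Crawlee versions
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['chrome'],
                operatingSystems: ['windows'],
                devices: ['desktop'],
                locales: ['en-US'],
            },
        },
    },
    requestHandler: async ({ page }) => {
        // ...
    },
});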
10 comments