Note that RequestList can be used together with RequestQueue by the same crawler. In such cases, each request from RequestList is enqueued into RequestQueue first and then consumed from the latter. This is necessary to avoid the same URL being processed more than once (from the list first and then possibly from the queue). In practical terms, such a combination can be useful when there is a large number of initial URLs, but more URLs would be added dynamically by the crawler.
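For illustration, a minimal sketch of that combination, assuming Crawlee's JavaScript API; the list name, URLs, and crawler class here are just placeholders:

```js
import { CheerioCrawler, RequestList, RequestQueue } from 'crawlee';

// Static start URLs come from the list; dynamically discovered URLs go to the queue.
const requestList = await RequestList.open('start-urls', [
    'https://example.com/category-1',
    'https://example.com/category-2',
]);
const requestQueue = await RequestQueue.open();

const crawler = new CheerioCrawler({
    requestList,
    requestQueue,
    async requestHandler({ request, enqueueLinks }) {
        console.log(`Processing ${request.url}`);
        // Links found at runtime are added to the queue and consumed from there.
        await enqueueLinks();
    },
});

await crawler.run();
```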
Is `addRequest` the only thing that your express.js endpoint does? How many requests are you adding at once this way?

```js
app.post('/scrape', async (req, res) => {
    try {
        var startUrls = [];
        const AsinData = req.body.AsinList;
        if (typeof AsinData === 'undefined') {
            return res.status(400).json({ error: 'AsinList is undefined' });
        }
        if (AsinData.length === 0) {
            return res.status(400).json({ error: 'AsinList is empty' });
        } else {
            const regex = /^[A-Z0-9]{10}$/;
            const isValid = AsinData.every((item) => regex.test(item));
            if (isValid) {
                console.log("All items in the list meet the criteria.");
            } else {
                return res.status(400).json({ error: 'All ASINS should match the patterns e.g. B0BM4ZPNV1' });
            }
        }
        const queue = await RequestQueue.open("test");
        console.log("Asin data", AsinData);
        await AsinData.forEach((ASIN) => {
            var url_per_asin = {
                url: `${BASE_URL}/gp/ajax/asin=${ASIN}`,
                userData: { label: 'test', keyword: ASIN },
                uniqueKey: uuidv4()
            };
            queue.addRequests(url_per_asin);
            startUrls.push(url_per_asin);
        });
        return res.send("Fetch started..");
    } catch (error) {
        // Handle any errors that occur
        console.error(error);
        res.status(500).send('Internal server error');
    }
});
```
You call `queue.addRequests` inside `await AsinData.forEach`. I would just accumulate the requests into a list and then do a single `queue.addRequests` call after the `.forEach`. Not sure why `startUrls` is there and what its purpose is. You may also use `console.time` and `console.timeEnd` (with unique labels for each request) to investigate what is causing the long times: https://developer.mozilla.org/en-US/docs/Web/API/console/timeEnd
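A rough sketch of that suggestion applied to the endpoint above, assuming the same `BASE_URL`, `uuidv4`, `AsinData`, and `queue` from the original snippet:

```js
// Build all requests first, then enqueue them in one batch call.
const requests = AsinData.map((ASIN) => ({
    url: `${BASE_URL}/gp/ajax/asin=${ASIN}`,
    userData: { label: 'test', keyword: ASIN },
    uniqueKey: uuidv4(),
}));

console.time('addRequests');        // start a timer with a unique label
await queue.addRequests(requests);  // one awaited call instead of one per ASIN
console.timeEnd('addRequests');     // log how long the batch insert took
```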