Crawlee does not work with cron job

I'm running a cron job on a Node server, but the crawler doesn't execute after the first run.

6 comments

Plain Text

// npm i croner
import { Cron } from "croner";
// This runs every three hours, on the hour (Europe/Amsterdam time)
Cron('0 */3 * * *', { timezone: 'Europe/Amsterdam' }, async () => {
  // Your code here
});

This works for me; I use CheerioCrawler.

Mahmudul Hasan Sagar

I'm using JSDOMCrawler, but my issue is that after the first run it doesn't scrape anymore.

Mahmudul Hasan Sagar

Plain Text

 const crawler = new JSDOMCrawler({
        proxyConfiguration,
        requestList,
        async requestHandler({ request, window }) {
            // await page.goto(request.url);
            console.log("request", request.userData.url);
            // const title = page.locator('article .entry .entry-title a');
            // const count = await title.count();
            const links = window.document.querySelectorAll('article .entry .entry-title a');
            let position = {}
            links.forEach((link, index) => {
                // console.log("link", request.userData.url);
                if (link.getAttribute("href") === request.userData.url) {
                    position = {
                        keyword: request.userData.keyword,
                        position: index + 1,
                    }
                }
            })

            result[request.userData.plugin] = {
                ...result[request.userData.plugin],
                url: request.userData.url,
                pluginName: request.userData.plugin,
                date: moment().format('ll'),
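                // Fall back to an empty array so the spread does not throw on the first entry for a plugin
                keywordsData: [...(result[request.userData.plugin]?.keywordsData ?? []), position]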
            };
            // console.log("result", result);
        }
    })
    await crawler.run();

Mahmudul Hasan Sagar

This is my crawler code. result doesn't print anything after the first run.

LeMoussel

To test, can you give the URL of the site?

vin5dev18

Hey, how did you code the cron job?
I am using Playwright with the same code as you, but it always shows all requests as completed after the first run.

It looks like the completed URLs are cached somewhere, but I can't find any related config or function for handling that cache in the documentation.

I tried crawler.teardown() with no luck; its description is just "Function for cleaning up after all requests are processed."

Plain Text

    Cron('*/10 * * * * *', async () => {
      const crawler = new PlaywrightCrawler({
        requestHandler: odHandler,
      })
      await crawler.run([process.env.OD_URL])
    })

Plain Text

INFO  PlaywrightCrawler: Initializing the crawler.
INFO  PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO  PlaywrightCrawler: Final request statistics: {"requestsFinished":0,"requestsFailed":0,"retryHistogram":[],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":192}
INFO  PlaywrightCrawler: Finished! Total 0 requests: 0 succeeded, 0 failed. {"terminal":true}
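
A possible explanation for the "Total 0 requests" log above, not confirmed in this thread: Crawlee deduplicates requests by their uniqueKey, which by default is derived from the URL, so a URL already marked as handled in the request queue may be skipped on the next scheduled run. A minimal sketch of one workaround, assuming that is the cause, is to give each run's requests a fresh uniqueKey. The helper name buildRunRequests and the "#run-" suffix are hypothetical and not part of Crawlee; only the url and uniqueKey request fields come from the library.

```javascript
// Hypothetical helper (not a Crawlee API): tags every URL with a per-run
// uniqueKey so a request handled in an earlier run is not treated as a
// duplicate in the next one.
function buildRunRequests(urls, runId = Date.now()) {
  return urls.map((url) => ({
    url,
    // Assumption: uniqueKey is what the request queue deduplicates on.
    uniqueKey: `${url}#run-${runId}`,
  }));
}

// Assumed usage inside the cron callback from the post above:
//   const crawler = new PlaywrightCrawler({ requestHandler: odHandler });
//   await crawler.run(buildRunRequests([process.env.OD_URL]));

// Two runs over the same URL produce different keys.
const firstRun = buildRunRequests(['https://example.com'], 1);
const secondRun = buildRunRequests(['https://example.com'], 2);
console.log(firstRun[0].uniqueKey !== secondRun[0].uniqueKey); // true
```

The trade-off is that handled entries from earlier runs stay in the queue; if that matters, clearing or replacing the queue between runs may be the better route.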

Apify and Crawlee Official Forum