kennysmithnanic

600 Results but 867 Successful Crawls?

I have a discrepancy between my results and my successful crawls in a daily running actor. This is the first time I’ve run into this issue and would like help understanding the cause. Even the Actor’s stats say that I had 700+ successful crawls, but my results dataset only has 600 items (typically it was a little over 800).

2 comments

kkennysmithnanic

New Error on Long-Running Actor

Starting this afternoon, the Logs of one of my daily actors (which has run successfully every day for the past 30 days), are being flooded with this message:

Plain Text

2024-06-13T04:28:54.490Z WARN  ApifyClient: API request failed 4 times. Max attempts: 9.
2024-06-13T04:28:54.492Z Cause:ApifyApiError: You have exceeded the rate limit of 30 requests per second
2024-06-13T04:28:54.495Z   clientMethod: RequestQueueClient.get
2024-06-13T04:28:54.497Z   statusCode: 429
2024-06-13T04:28:54.498Z   type: rate-limit-exceeded
2024-06-13T04:28:54.500Z   attempt: 4
2024-06-13T04:28:54.504Z   httpMethod: get
2024-06-13T04:28:54.505Z   path: /v2/request-queues/tMiFOqCqVhLTL7QSK
2024-06-13T04:28:54.507Z   stack:
2024-06-13T04:28:54.509Z     at makeRequest (/home/myuser/node_modules/apify-client/dist/http_client.js:184:30)
2024-06-13T04:28:54.510Z     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
2024-06-13T04:28:54.512Z     at async RequestQueueClient._get (/home/myuser/node_modules/apify-client/dist/base/resource_client.js:25:30)
2024-06-13T04:28:54.514Z     at async RequestQueue.open (/home/myuser/node_modules/@crawlee/core/storages/request_provider.js:614:34)
2024-06-13T04:28:54.516Z     at async PuppeteerCrawler.getRequestQueue (/home/myuser/node_modules/@crawlee/basic/internals/basic-crawler.js:595:51)
2024-06-13T04:28:54.518Z     at async PuppeteerCrawler.addRequests (/home/myuser/node_modules/@crawlee/basic/internals/basic-crawler.js:612:30)

The Actor still seems to be working properly: results are being saved to the dataset, but first I got an email alert that my scheduled task failed to run due to a misconfiguration. I haven't changed the configuration in a while, so I'm not sure how that's possible.

Anybody know how I can debug this?

2 comments

kkennysmithnanic

How to check status of Actor initiated via API?

I have a use-case where users of my app are able to initiate a new crawl from my website's front-end. I'd like to be able to pass a "crawl status" back to the user so they don't feel like their waiting in the dark for the crawl to complete.

Is there a way I can create a websock to a running Actor to provide my users with real-time feedback on the status of the Actor run? All I need is something like "Actor is running" | "Actor completed: 1 succeeded, 1 failed" | "Actor failed"

Thanks in advance!

1 comment

kkennysmithnanic

Injecting local script tag onto page

I'm trying to inject a few local .js files onto pages so they can do some tidying before I save the page HTML. Locally, it works well like this:

Plain Text

const __dirname = path.dirname(fileURLToPath(import.meta.url));
const SINGLE_FILES = [
    'single-file-bootstrap.js',
    'single-file-frames.js',
    'single-file-hooks-frames.js',
    'single-file.js'
];

export const SINGLE_FILE_PATHS = SINGLE_FILES.map(file => path.join(__dirname, 'single-file', file));

export async function addScriptsToPage(page: Page) {
    try {
        for (const scriptPath of SINGLE_FILE_PATHS) {
            await page.addScriptTag({ path: scriptPath });
        }
    } catch(e) {
        console.error('Error adding scripts to page', e);
        return { success: false, error: e };
    }
    return { success: true };
}

But when i try to run this on the apify platform, I get "No Such File Or Directory" errors. How can i reference files from my package on the apify platform so I can inject them into a page?

1 comment

kkennysmithnanic

PlaywrightCrawler actor not finishing requestQueue

I have a playwright Actor that will has 10 URLs added to its queue before i kick it off with .run(). But the actor doesn't finish all 10 URLs. It will process between 4 and 7, then the Log for the run will just show statistics message repeated every second.

Note that this happens in my local runs of this Actor as well. The total number of URLs scraped (out of 10) varies from run to run, minimum 1 URL and max 7 (of 10 total).

This is the message it shows on repeat, on my local and on Apify platform:

Plain Text

2024-05-22T22:34:24.274Z INFO  Statistics: PlaywrightCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":35781,"requestsFinishedPerMinute":2,"requestsFailedPerMinute":0,"requestTotalDurationMillis":143124,"requestsTotal":4,"crawlerRuntimeMillis":120866,"retryHistogram":[4]}
2024-05-22T22:34:24.301Z INFO  PlaywrightCrawler:AutoscaledPool: state {"currentConcurrency":6,"desiredConcurrency":11,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}

Why would it stop pulling from the requestsQueue? There are no errors in the Actor prior to this.

8 comments

kkennysmithnanic

Resurrect Timed Out Actor via Javascript API SDK?

I have an integration configured to send a webhook event to my server when actor runs finish or time out or error out.

In the case of a time out, how do I use the Javascript API SDK to resurrect the run?

1 comment

kkennysmithnanic

Javascript API SDK: How to call multiple actors simultaneously?

I have this code which works fine to call my custom actor a single time:
const run = await client.actor("xyz").call(input);

But I want to run multiple Actors simultaneously, like this:

Plain Text

for (let input of inputs) {
        const run = await client.actor("xyz").call(input);
    }

But obviously this doesn't work because each actor has to finish before the code initiates another actor. How do i do a synchronous call like this so I can trigger multiple actors? My data is all collected via webhook when the actor finishes, so I don't need wait for the run to finish in this codebase.

3 comments

kkennysmithnanic

Change viewport from within PlaywrightCrawler router method?

I have a custom Actor that takes screenshots of webpages if the webpage meets certain criteria. I currently set viewport as a pre-navigation hook, like this:

Plain Text

preNavigationHooks: [
        async ({ page }) => {
            // await page.setViewportSize({ width, height: 1080 });
            await blocker.enableBlockingInPage(page);
            await page.setViewportSize(iPhone14ProMax.viewport);
        },
    ],

But when I find a page that passes my criteria, I would like to take a screenshot of the page using this viewport size AND a desktop viewport size.

How can I change the viewport from within the playwrightRouter function?

2 comments

kkennysmithnanic

save HTML as a SingleFile with all assets?

Has anybody had success running this SingleFile package (https://github.com/gildas-lormeau/SingleFile) within PlaywrightCrawler?

I’m trying to save HTML with all style and image assets as data strings, but it’s clunky. This package looks like it would work well if only we could use it within the crawler’s context.

Looking for ideas to integrate this or replicate its features in Playwright. Thanks

4 comments

kkennysmithnanic

Blocking network requests with crawlee PuppeteerCrawler

I'm trying to block network requests from specific domains within PuppeteerCrawler but can't get it to work.

I'd like to run something like this:

Plain Text

page.on('request', (req) => {
                // If the URL doesn't include our keyword, ignore it
                if (req.url().includes('bouncex')) {
                    req.abort();
                    return;
                };
                req.continue();
            });

But it has to be initiated before page.goto.

I tried adding it to preNavigationHooks like so:

Plain Text

preNavigationHooks: [
        async ({ page }, goToOptions) => {
            goToOptions!.waitUntil = "networkidle2";
            goToOptions!.timeout = 3600000;
            await blocker.enableBlockingInPage(page);
            page.on('request', (req) => {
                // If the URL doesn't include our keyword, ignore it
                if (req.url().includes('bouncex')) {
                    req.abort();
                    return;
                };
                req.continue();
            });
            await page.setViewport(viewportConfig);
        },
    ],

But this returns Error: Request is already handled!

Is there a way to do this with PuppeteerCrawler?

3 comments

Apify Discord Mirror

600 Results but 867 Successful Crawls?

New Error on Long-Running Actor

How to check status of Actor initiated via API?

Injecting local script tag onto page

PlaywrightCrawler actor not finishing requestQueue

Resurrect Timed Out Actor via Javascript API SDK?

Javascript API SDK: How to call multiple actors simultaneously?

Change viewport from within PlaywrightCrawler router method?

save HTML as a SingleFile with all assets?

Blocking network requests with crawlee PuppeteerCrawler