Apify Discord Mirror

Jack
Joined December 9, 2024
Hi, I seem to be running into this issue with the lock file already being held. I don't need to persist state, as I'm returning it in memory.

Plain Text
                return callback(Object.assign(new Error('Lock file is already being held'), { code: 'ELOCKED', file }));
                                              ^

Error: Lock file is already being held
    at /Users/jwarder/Workspace/bypigeon-address/node_modules/proper-lockfile/lib/lockfile.js:68:47
    at callback (/Users/jwarder/Workspace/bypigeon-address/node_modules/graceful-fs/polyfills.js:306:20)
    at FSReqCallback.oncomplete (node:fs:202:5)
    at FSReqCallback.callbackTrampoline (node:internal/async_hooks:130:17) {
  code: 'ELOCKED',
  file: '/Users/jwarder/Workspace/bypigeon-address/storage/key_value_stores/default/SDK_SESSION_POOL_STATE.json'
}
If I call `const crawler = new PlaywrightCrawler({})` more than once, is any state shared between the instances?
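
For context on the question above: the file named in the error lives under `storage/key_value_stores/default`, which is the default on-disk storage directory, so crawler instances that rely on the defaults would plausibly end up writing to the same file. A minimal sketch of one way to avoid that, assuming Crawlee's `Configuration` class with its `persistStorage` option and the optional second constructor argument of `PlaywrightCrawler`, is shown below. The helper name `createIsolatedCrawler` is hypothetical, and whether this fully isolates session pool state may depend on the Crawlee version in use.

```typescript
import { Configuration, PlaywrightCrawler } from "crawlee";

// Hypothetical helper (not part of Crawlee): builds a crawler that uses its
// own Configuration instead of the shared global one.
export function createIsolatedCrawler(): PlaywrightCrawler {
  // persistStorage: false asks the storage layer to keep data in memory
  // rather than writing JSON files under ./storage.
  const config = new Configuration({ persistStorage: false });

  return new PlaywrightCrawler(
    {
      maxRequestsPerCrawl: 500,
      requestHandler: async ({ request, log }) => {
        log.info(`Visited ${request.url}`);
      },
    },
    config, // second argument: per-crawler configuration
  );
}
```

If this assumption holds, no `SDK_SESSION_POOL_STATE.json` file is written to disk, so there should be nothing for two instances to lock.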
Hi, I'm looking to introduce website crawling into an existing workflow that doesn't suit batch processing, i.e. I want to scrape each website, get the result, and do some further processing downstream. I have this working with the code attached; however, I imagine there's a better way to achieve this, given that I'll be processing up to 500 websites concurrently, and my concern is memory allocation.

Plain Text
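import { PlaywrightCrawler } from "crawlee";
// AddressSet is the poster's own collection class; its definition is not shown here.

export async function crawlWebsiteForAddresses(url: string) {
  // Matches the first UK-style postcode on the page, e.g. "SW1A 1AA" (the space is optional).
  const ukPostcodeRegex = /\b([A-Z]{1,2}[0-9][A-Z0-9]?)\s?([0-9][A-Z]{2})\b/;
  const addressSet = new AddressSet();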

  const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, page, enqueueLinks, log }) => {
      const content = await page.content();

      const postcodeMatch = content.match(ukPostcodeRegex);
      if (postcodeMatch) {
        const postcode = postcodeMatch[0];
        log.info(`UK postcode found on ${request.url}: ${postcode}`);
        const addressElement = page.locator(`text=${postcode}`).first();

        // A Locator object is always truthy, so check that it actually matches an element.
        if ((await addressElement.count()) > 0) {
          const parentTextContent = await addressElement.evaluate((el) => el.parentElement?.textContent ?? "");
          log.info(`Address found for postcode ${postcode}: ${parentTextContent}`);
          addressSet.add({ postcode, addressText: parentTextContent });
        }
      }

      await enqueueLinks();
    },
    maxRequestsPerCrawl: 500,
  });

  await crawler.run([url]);
  await crawler.teardown();
  return addressSet;
}
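
On the memory concern: each call to `crawlWebsiteForAddresses` starts its own crawler and browser, so launching all 500 at once is likely to be the expensive part. One possible approach, sketched here under the assumption that the downstream workflow can tolerate some queuing, is to cap how many crawls run at the same time. The helper `mapWithConcurrency` is a hypothetical name and is not part of Crawlee.

```typescript
// Hypothetical helper: runs `worker` over `items` with at most `limit`
// calls in flight at any time, and returns results in input order.
export async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }

  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each runner repeatedly claims the next index until none are left.
  // JavaScript is single-threaded, so `next++` cannot race.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index]);
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

With this in place, a call such as `await mapWithConcurrency(urls, 5, crawlWebsiteForAddresses)` would keep at most five crawlers alive at once; the limit of 5 is an arbitrary illustration, not a recommendation. Note that if any worker rejects, the returned promise rejects too, while crawls that have already started keep running.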