I am using context.log, but I want to save or change the logger that is used, because I want to save the log output. I am using Crawlee without the Apify CLI.
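Crawlee has no built-in file logger that I know of, but the logger behind context.log can be swapped out. A minimal sketch, assuming LoggerText and its _log hook behave as in @apify/log (which Crawlee uses internally); FileLogger and the crawler.log path are made up for the example:

```typescript
import { appendFileSync } from 'node:fs';
// LoggerText comes from @apify/log; import it from there if your Crawlee
// version does not re-export it.
import { log, LoggerText } from 'crawlee';

// FileLogger (hypothetical): keep Crawlee's normal console formatting,
// but also append every formatted line to a local file.
class FileLogger extends LoggerText {
    override _log(...args: Parameters<LoggerText['_log']>) {
        const line = super._log(...args);
        appendFileSync('crawler.log', `${line}\n`);
        return line;
    }
}

// context.log derives from the global log, so swapping the logger here also
// affects the log calls made inside request handlers.
log.setOptions({ logger: new FileLogger() });
```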
With --purge you can delete the default dataset, but this does not affect the other datasets. Is there a way to purge all datasets using the CLI?
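As noted above, --purge only touches the default storages. One possible workaround is to drop the datasets programmatically instead. A sketch assuming the default local storage layout (./storage/datasets/<name>); purgeAllDatasets is a made-up helper name:

```typescript
import { readdir } from 'node:fs/promises';
import { Dataset } from 'crawlee';

// purgeAllDatasets (hypothetical): open every dataset directory found in the
// local storage folder and drop it, default and named ones alike.
async function purgeAllDatasets(storageDir = './storage/datasets'): Promise<void> {
    const entries = await readdir(storageDir, { withFileTypes: true });
    for (const entry of entries) {
        if (!entry.isDirectory()) continue;
        const dataset = await Dataset.open(entry.name);
        await dataset.drop(); // deletes the dataset and its files
    }
}

await purgeAllDatasets();
```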
pnpm run start:dev, pnpm run start:prod, and apify run all work as expected. apify push is also successful:

```text
2025-02-10T02:15:51.348Z ACTOR: Pulling Docker image of build ulyHJWVbZ9m9RZ8Ss from repository.
2025-02-10T02:16:04.019Z ACTOR: Creating Docker container.
2025-02-10T02:16:04.753Z ACTOR: Starting Docker container.
```

This is the Dockerfile:
```dockerfile
# Base image with Playwright
FROM apify/actor-node-playwright-chrome:20 AS builder

# Install pnpm
RUN wget -qO- https://get.pnpm.io/install.sh | ENV="$HOME/.bashrc" SHELL="$(which bash)" bash -

# Use the shell form of RUN to source the .bashrc file before running pnpm
SHELL ["/bin/bash", "-c", "source /home/myuser/.bashrc"]

# Check preinstalled packages
RUN pnpm ls crawlee apify puppeteer playwright

# Copy package files first to optimize caching
COPY package*.json ./

# Install dependencies
RUN pnpm install --frozen-lockfile --audit=false

# Copy source code
COPY . ./

# Ensure correct permissions
RUN chown -R myuser:myuser .

# Build the project
RUN pnpm run build

# Final runtime image
FROM apify/actor-node-playwright-chrome:20 AS runner

# Install pnpm
RUN wget -qO- https://get.pnpm.io/install.sh | ENV="/home/myuser/.bashrc" SHELL="$(which bash)" bash -

# Use the shell form of RUN to source the .bashrc file before running pnpm
SHELL ["/bin/bash", "-c", "source /home/myuser/.bashrc"]

# Copy built application from builder
COPY --from=builder /home/myuser /home/myuser

# Set up user and working directory
USER myuser
WORKDIR /home/myuser

# Install dependencies
RUN pnpm install --frozen-lockfile --audit=false

# Run the image. If you know you won't need headful browsers,
# you can remove the XVFB start script for a micro perf gain.
CMD ./start_xvfb_and_run_cmd.sh && pnpm run start:prod --silent
```
On the platform, the run then fails with:

```text
2025-02-07T18:07:53.704Z browserType.launchPersistentContext: Target page, context or browser has been closed
2025-02-07T18:07:53.705Z Browser logs:
<launching> /home/myuser/pw-browsers/chrome --disable-field-trial-config ...
2025-02-07T18:07:53.708Z <launched> pid=36
2025-02-07T18:07:53.709Z [pid=36][err] Old Headless mode has been removed from the Chrome binary.
```
For comparison, here is the npm-based Dockerfile:

```dockerfile
FROM apify/actor-node-playwright-chrome:20 AS builder

RUN npm ls crawlee apify puppeteer playwright

COPY --chown=myuser package*.json ./

RUN npm install --include=dev --audit=false

COPY --chown=myuser . ./

RUN npm run build

FROM apify/actor-node-playwright-chrome:20

RUN npm ls crawlee apify puppeteer playwright

COPY --chown=myuser package*.json ./

RUN npm --quiet set progress=false \
    && npm install --omit=dev --omit=optional \
    && echo "Installed NPM packages:" \
    && (npm list --omit=dev --all || true) \
    && echo "Node.js version:" \
    && node --version \
    && echo "NPM version:" \
    && npm --version \
    && rm -r ~/.npm

COPY --from=builder --chown=myuser /home/myuser/dist ./dist

COPY --chown=myuser . ./

CMD ./start_xvfb_and_run_cmd.sh && npm run start:prod --silent
```
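The "Old Headless mode has been removed from the Chrome binary" line often points to a version mismatch: the install step in the image pulls a different playwright version than the one the preinstalled browsers were built for. That is an assumption, not a confirmed diagnosis; a small startup check can at least show which browser actually launches:

```typescript
import { chromium } from 'playwright';

// Diagnostic sketch: launch the browser once at container startup and log the
// version in use, to compare against what the base image preinstalls.
const browser = await chromium.launch();
console.log(`Chromium version: ${browser.version()}`);
await browser.close();
```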
```typescript
import { existsSync, readFileSync } from 'node:fs';
import { log } from 'crawlee'; // assumption: log here is Crawlee's logger

function getMaxMemoryMB(): number | null {
    const cgroupPath = '/sys/fs/cgroup/memory.max';
    if (!existsSync(cgroupPath)) {
        log.warning('Cgroup v2 memory limit file not found.');
        return null;
    }
    try {
        const data = readFileSync(cgroupPath, 'utf-8').trim();
        if (data === 'max') {
            log.warning('No memory limit set (cgroup reports "max").');
            return null;
        }
        const maxMemoryBytes = parseInt(data, 10);
        return maxMemoryBytes / (1024 * 1024); // Convert bytes to MB
    } catch (error) {
        log.exception(error as Error, 'Error reading cgroup memory limit:');
        return null;
    }
}
```
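One way to use the value from getMaxMemoryMB is to feed it to Crawlee's autoscaling. A sketch, assuming the memoryMbytes option (the programmatic counterpart of the CRAWLEE_MEMORY_MBYTES environment variable) is the limit you want to cap:

```typescript
import { Configuration } from 'crawlee';

// Cap Crawlee's memory-based autoscaling at the container's cgroup limit.
// getMaxMemoryMB is the function defined above.
const maxMemoryMB = getMaxMemoryMB();
if (maxMemoryMB !== null) {
    Configuration.getGlobalConfig().set('memoryMbytes', Math.floor(maxMemoryMB));
}
```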
apify push. This action is being performed in GitHub Actions.
When using EnqueueStrategy.SAME_HOSTNAME, I noticed it does not work properly on non-www URLs. The URL's origin is passed to the _check_enqueue_strategy, but it uses the context.request.loaded_url if available. I also tried the www prefix and got the same behaviour.
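If the goal is just to stay on one site regardless of the www prefix, one possible workaround is the same-domain strategy, which compares registered domains instead of exact hostnames (the Python API has a corresponding SAME_DOMAIN value). A sketch with the JavaScript API, not a fix for the loaded_url behaviour described above:

```typescript
import { PlaywrightCrawler, EnqueueStrategy } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ enqueueLinks }) {
        // SameDomain matches example.com and www.example.com alike,
        // unlike SameHostname, which requires an exact hostname match.
        await enqueueLinks({ strategy: EnqueueStrategy.SameDomain });
    },
});

await crawler.run(['https://example.com']);
```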
```typescript
// For more information, see https://crawlee.dev/
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

// PlaywrightCrawler crawls the web using a headless
// browser controlled by the Playwright library.
const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox,
    },
    maxRequestRetries: 1,
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({ request, page, enqueueLinks, log, pushData }) {
        await page.waitForTimeout(5000);
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save results as JSON to ./storage/datasets/default
        await pushData({ title, url: request.loadedUrl });

        // Extract links from the current page
        // and add them to the crawling queue.
        // await enqueueLinks();
    },
    // Comment this option to scrape the full website.
    maxRequestsPerCrawl: 1,
    // Uncomment this option to see the browser window.
    headless: false,
});

// Add first URL to the queue and start the crawl.
await crawler.run(['https://www.etsy.com/search?q=wooden%20box']);
// await crawler.run(['https://www.etsy.com']); // works
// await crawler.run(['https://www.amazon.com']); // works
```
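If the Etsy search URL is the only one that fails while etsy.com and amazon.com load fine, it may simply be blocked. A small check (a sketch; response is the Playwright response that Crawlee exposes on the crawling context) makes that visible:

```typescript
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

// Minimal check: log the HTTP status of the search page to see whether it is
// being blocked (e.g. 403 or a captcha page) while the homepage loads normally.
const checkCrawler = new PlaywrightCrawler({
    launchContext: { launcher: firefox },
    maxRequestsPerCrawl: 1,
    async requestHandler({ request, response, page, log }) {
        log.info(`${request.url} responded with HTTP ${response?.status()}`);
        log.info(`Title: ${await page.title()}`);
    },
});

await checkCrawler.run(['https://www.etsy.com/search?q=wooden%20box']);
```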
```typescript
import { PlaywrightCrawler } from 'crawlee';

export async function runExample() {
    const testPage1 =
        'https://inspections.healthunit.com/HedgehogPortal/#/18fbee00-f0a3-49e3-b323-9153b6c4924c/disclosure/facility/3448568d-737b-4b41-ab63-1f2d7a2252b5';
    const testPage2 =
        'https://inspections.healthunit.com/HedgehogPortal/#/18fbee00-f0a3-49e3-b323-9153b6c4924c/disclosure/facility/3448568d-737b-4b41-ab63-1f2d7a2252b5/inspection/ac3196c5-13e6-486c-8b9c-b85dd019fc05';

    const crawler1 = new PlaywrightCrawler({
        requestHandler: async ({ request, page, log }) => {
            const title = await page.title();
            log.info(`URL: ${request.url}\nTITLE: ${title}`);
        },
        launchContext: {
            launchOptions: {
                args: ['--ignore-certificate-errors'],
            },
        },
    });

    const crawler2 = new PlaywrightCrawler({
        requestHandler: async ({ request, page, log }) => {
            const title = await page.title();
            log.info(`URL: ${request.url}\nTITLE: ${title}`);
        },
        launchContext: {
            launchOptions: {
                args: ['--ignore-certificate-errors'],
            },
        },
    });

    await crawler1.run([testPage1]);
    await crawler2.run([testPage2]);
}

runExample();
```
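A guess about why two separate crawlers are used above: with hash-routed URLs like these, Crawlee drops the #/... fragment when computing a request's uniqueKey, so both pages would collapse into a single request inside one crawler. If that is the reason, keepUrlFragment may let a single crawler handle both URLs; a sketch reusing the same test pages:

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, page, log }) => {
        log.info(`URL: ${request.url}\nTITLE: ${await page.title()}`);
    },
    launchContext: {
        launchOptions: { args: ['--ignore-certificate-errors'] },
    },
});

// keepUrlFragment makes the hash part count towards request deduplication,
// so both hash-routed pages are crawled instead of being merged into one.
await crawler.run([
    {
        url: 'https://inspections.healthunit.com/HedgehogPortal/#/18fbee00-f0a3-49e3-b323-9153b6c4924c/disclosure/facility/3448568d-737b-4b41-ab63-1f2d7a2252b5',
        keepUrlFragment: true,
    },
    {
        url: 'https://inspections.healthunit.com/HedgehogPortal/#/18fbee00-f0a3-49e3-b323-9153b6c4924c/disclosure/facility/3448568d-737b-4b41-ab63-1f2d7a2252b5/inspection/ac3196c5-13e6-486c-8b9c-b85dd019fc05',
        keepUrlFragment: true,
    },
]);
```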