Apify Discord Mirror

Is there a way to change the TLS version when scraping with CheerioCrawler?
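One possible direction, sketched here as an assumption rather than a verified recipe: CheerioCrawler's preNavigationHooks receive the got-scraping options for each request, and got's https options (minVersion / maxVersion) are forwarded to Node's TLS layer, so the TLS version could be constrained there.

Plain Text
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    preNavigationHooks: [
        async (_crawlingContext, gotOptions) => {
            // got forwards these https options to Node's tls.connect()
            gotOptions.https = {
                ...gotOptions.https,
                minVersion: 'TLSv1.2',
                maxVersion: 'TLSv1.3',
            };
        },
    ],
    async requestHandler({ request, $, log }) {
        log.info(`${request.url}: ${$('title').text()}`);
    },
});

await crawler.run(['https://example.com']);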
Is it recommended to use Crawlee without the Apify CLI? I'm using the library because of how practical it is for creating crawlers, and I'd like to hear from other devs who use it the same way I do.
4 comments
The handler context provides a context.log, but I want to save/replace the logger that is used, because I want to persist the logs. I am using Crawlee without the Apify CLI.
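A minimal sketch of one way this might be done, assuming the crawler options accept a log instance (the context.log passed to handlers should then derive from it); Log, LogLevel and LoggerJson come from @apify/log and are re-exported by crawlee:

Plain Text
import { CheerioCrawler, Log, LogLevel, LoggerJson } from 'crawlee';

// A dedicated Log instance; LoggerJson emits structured records that are
// easier to redirect into a file or a log shipper than plain text.
const myLog = new Log({
    prefix: 'MyCrawler',
    level: LogLevel.DEBUG,
    logger: new LoggerJson(),
});

const crawler = new CheerioCrawler({
    log: myLog, // assumption: crawler-level option that replaces the default logger
    async requestHandler({ request, log }) {
        log.info(`Visited ${request.url}`); // should go through myLog
    },
});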
7 comments
Hi everyone, I’m trying to use Apify to scrape videos from a specific subreddit. I only want to retrieve video posts, not images or text posts. I’ve tried using query prefixes like “type:video” and various filters, but I’m still getting mixed content. Has anyone successfully configured the scraper to return only video-based posts? What parameters or techniques did you use to achieve that? Any help would be appreciated. Thanks in advance!
3 comments
Hello!

I kept reading the docs but couldn't find clear information about this. When we use Puppeteer or Playwright, we can tweak the fingerprintGenerator in browserPool. For Cheerio we have the headerGenerator from got; how can we adjust it inside the CheerioCrawler?
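A short sketch of one way this might work, under the assumption that the gotOptions exposed to preNavigationHooks accept got-scraping's headerGeneratorOptions:

Plain Text
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    preNavigationHooks: [
        async (_crawlingContext, gotOptions) => {
            // Tune got-scraping's header generation per request.
            gotOptions.headerGeneratorOptions = {
                browsers: [{ name: 'firefox', minVersion: 115 }],
                devices: ['desktop'],
                locales: ['en-US'],
                operatingSystems: ['linux'],
            };
        },
    ],
    async requestHandler({ request, $, log }) {
        log.info(`${request.url}: ${$('title').text()}`);
    },
});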
2 comments
with --purge you can delete the default dataset. This does not affect the other datasets. Is there a way to purge all datasets using the CLI?
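As a workaround, assuming local storage lives in the default ./storage directory (or wherever CRAWLEE_STORAGE_DIR points), all datasets can be wiped by deleting the datasets folder directly:

Plain Text
import { rm } from 'node:fs/promises';

// Removes the default dataset and every named dataset in local storage.
const storageDir = process.env.CRAWLEE_STORAGE_DIR ?? './storage';
await rm(`${storageDir}/datasets`, { recursive: true, force: true });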
This helps protect against unexpected platform overuse. You'll be notified if your monthly usage approaches the limit. While we strive to stay within the limit, small overages may occasionally occur. If exceeded, Apify platform services will be paused to prevent additional charges.

Total monthly platform usage

$49.00 of $49.00 (Edit limit)
Platform usage
Increase your platform usage limit to continue using Apify
1 comment
I have a use case where I want to have a crawler running permanently. This crawler has a tieredProxyList set up that it will iterate over in case some of them don't work. For scraping some pages I don't want to use proxies, to reduce the amount of money I am spending on them (when I scrape my own page I don't want a proxy, but I want to use the same logic / handlers). Is it possible to specify the proxy that should be used for specific requests? Or maybe even the proxy tier?

Basic Setup:

const proxyConfiguration = new ProxyConfiguration({ tieredProxyUrls: [['proxyTier1'], ['proxyTier2']] });

const crawler = new PlaywrightCrawler({
    keepAlive: true,
    proxyConfiguration: proxyConfiguration,
    // ...
});

// ...

crawler.addRequests(requestsWhereWeWantProxies);
crawler.addRequests(requestsWhereWeDontWantProxies);

It would be nice to be able to do something like:

crawler.addRequests(requestsWhereWeWantProxies);
crawler.addRequests(requestsWhereWeDontWantProxies.map((request) => ({ ...request, proxy: null })));

or

const proxyConfiguration = new ProxyConfiguration({ tieredProxyUrls: [['proxyTier1'], ['proxyTier2'], [null]] });

// ...

crawler.addRequests(requestsWhereWeWantProxies);
crawler.addRequests(requestsWhereWeDontWantProxies.map((request) => ({ ...request, proxyTier: 2 })));
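One direction that might already be possible, sketched under the assumption that ProxyConfiguration's newUrlFunction receives the request being processed in its second argument (newer Crawlee versions) and that returning null means "no proxy": tag the direct-access requests via userData and decide per request.

Plain Text
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    newUrlFunction: (sessionId, { request } = {}) => {
        // Assumption: returning null makes the crawler skip the proxy for this request.
        if (request?.userData?.noProxy) return null;
        return 'http://my-proxy.example.com:8000'; // hypothetical proxy URL
    },
});

const crawler = new PlaywrightCrawler({
    keepAlive: true,
    proxyConfiguration,
    requestHandler: async ({ request, log }) => {
        log.info(`Handled ${request.url}`);
    },
});

await crawler.addRequests([
    { url: 'https://some-target.example.com' },                              // hypothetical URL, goes through the proxy
    { url: 'https://my-own-site.example.com', userData: { noProxy: true } }, // hypothetical URL, goes direct
]);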
3 comments
Hello!

I'm trying to run the actor using pnpm instead of npm.
Locally, running pnpm run start:dev, pnpm run start:prod, and apify run works as expected.
apify push is also successful.

But, when running the actor in the platform, my main command is not executed, and these are the only logs.
Plain Text
2025-02-10T02:15:51.348Z ACTOR: Pulling Docker image of build ulyHJWVbZ9m9RZ8Ss from repository.
2025-02-10T02:16:04.019Z ACTOR: Creating Docker container.
2025-02-10T02:16:04.753Z ACTOR: Starting Docker container.



Here is my dockerfile.
Plain Text
# Base image with Playwright
FROM apify/actor-node-playwright-chrome:20 AS builder

# Install pnpm
RUN wget -qO- https://get.pnpm.io/install.sh | ENV="$HOME/.bashrc" SHELL="$(which bash)" bash -

# Use the shell form of RUN to source the .bashrc file before running pnpm
SHELL ["/bin/bash", "-c", "source /home/myuser/.bashrc"]

# Check preinstalled packages
RUN pnpm ls crawlee apify puppeteer playwright

# Copy package files first to optimize caching
COPY package*.json ./

# Install dependencies
RUN pnpm install --frozen-lockfile --audit=false

# Copy source code
COPY . ./

# Ensure correct permissions
RUN chown -R myuser:myuser .

# Build the project
RUN pnpm run build

# Final runtime image
FROM apify/actor-node-playwright-chrome:20 AS runner

# Install pnpm
RUN wget -qO- https://get.pnpm.io/install.sh | ENV="/home/myuser/.bashrc" SHELL="$(which bash)" bash -

# Use the shell form of RUN to source the .bashrc file before running pnpm
SHELL ["/bin/bash", "-c", "source /home/myuser/.bashrc"]

# Copy built application from builder
COPY --from=builder /home/myuser /home/myuser

# Set up user and working directory
USER myuser
WORKDIR /home/myuser

# Install dependencies
RUN pnpm install --frozen-lockfile --audit=false 

# Run the image. If you know you won't need headful browsers,
# you can remove the XVFB start script for a micro perf gain.
CMD ./start_xvfb_and_run_cmd.sh && pnpm run start:prod --silent
5 comments
Hi There.

I am using a proxy to crawl some sites and encounter an ERR_TUNNEL_CONNECTION_FAILED error.

I am using BrightData as my proxy service. If I curl my proxy endpoint directly, I get a meaningful error. For example:

HTTP/1.1 502 Could not resolve host demo-site.reckoniq.coms
x-brd-err-code: target_40001
x-brd-err-msg: Could not resolve host demo-site.reckoniq.coms
X-Luminati-Error: Could not resolve host demo-site.reckoniq.coms
x-brd-error: Could not resolve host demo-site.reckoniq.coms
Proxy-Connection: close

or

HTTP/1.1 403 Forbidden
x-brd-err-code: policy_20050
x-brd-err-msg: Forbidden: target site requires special permission. You are trying to access a target site which is not permitted by our compliance policy. In order to gain access you may need to undergo a KYC process, you can do so by filling in the form: https://brightdata.com/cp/kyc If you have already completed the KYC approval, please contact your account manager for further details.
X-Luminati-Error: Forbidden: target site requires special permission. Contact BrightData for assistance
x-brd-error: Forbidden: target site requires special permission. Contact BrightData for assistance
Proxy-Connection: close

I am wondering how I can surface those errors in Crawlee rather than just getting:

ERROR PlaywrightCrawler: Request failed and reached maximum retries. Error: Detected a session error, rotating session...
page.goto: net::ERR_TUNNEL_CONNECTION_FAILED at https://demo-site.reckoniq.coms/

There is no response in the error handler, so things are not making it that far.

Thanks for any help
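A diagnostic sketch that might help surface the proxy's own error, as an assumption-level workaround rather than a built-in Crawlee feature: in the crawler's errorHandler, replay the failed URL with a plain got-scraping request through the same proxy (crawlingContext.proxyInfo.url) and log the status code plus the x-brd-* headers.

Plain Text
import { PlaywrightCrawler } from 'crawlee';
import { gotScraping } from 'got-scraping';

const crawler = new PlaywrightCrawler({
    // ... proxyConfiguration, requestHandler, etc.
    errorHandler: async ({ request, proxyInfo, log }, error) => {
        if (!proxyInfo) return;
        try {
            // Replay the URL through the same proxy without the browser, so the
            // proxy's HTTP error (502/403 + x-brd-* headers) becomes visible.
            const res = await gotScraping({
                url: request.url,
                proxyUrl: proxyInfo.url,
                throwHttpErrors: false,
            });
            log.warning(`Proxy said: ${res.statusCode} ${res.headers['x-brd-err-msg'] ?? ''} (original error: ${error.message})`);
        } catch (probeError) {
            log.warning(`Proxy probe failed: ${(probeError as Error).message}`);
        }
    },
});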
2 comments
I have a scraper using Playwright, which still works perfectly locally. It also used to work on Apify, but since today it no longer does.

Has anything changed about how Playwright is run on Apify? The error talks about the old Chrome headless mode being removed.

See attachment for the full logs.

Plain Text
2025-02-07T18:07:53.704Z browserType.launchPersistentContext: Target page, context or browser has been closed
2025-02-07T18:07:53.705Z Browser logs: <launching> /home/myuser/pw-browsers/chrome --disable-field-trial-config ...
2025-02-07T18:07:53.708Z <launched> pid=36
2025-02-07T18:07:53.709Z [pid=36][err] Old Headless mode has been removed from the Chrome binary.


Haven't changed anything about the default Dockerfile, here it is:
Plain Text
FROM apify/actor-node-playwright-chrome:20 AS builder
RUN npm ls crawlee apify puppeteer playwright
COPY --chown=myuser package*.json ./
RUN npm install --include=dev --audit=false
COPY --chown=myuser . ./
RUN npm run build
FROM apify/actor-node-playwright-chrome:20
RUN npm ls crawlee apify puppeteer playwright
COPY --chown=myuser package*.json ./
RUN npm --quiet set progress=false \
    && npm install --omit=dev --omit=optional \
    && echo "Installed NPM packages:" \
    && (npm list --omit=dev --all || true) \
    && echo "Node.js version:" \
    && node --version \
    && echo "NPM version:" \
    && npm --version \
    && rm -r ~/.npm
COPY --from=builder --chown=myuser /home/myuser/dist ./dist
COPY --chown=myuser . ./
CMD ./start_xvfb_and_run_cmd.sh && npm run start:prod --silent
5 comments
Is there any way I can verify what the user will be charged when doing a run of a PPE (Pay Per Event) Actor? How can I verify that the charging is set up correctly on my end?
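A minimal sketch of the charging side, assuming the JS Actor SDK's Actor.charge() and a hypothetical 'result-item' event defined in the Actor's pay-per-event pricing; after a test run, the charged events should be visible on the run detail, which is one way to check the setup.

Plain Text
import { Actor } from 'apify';

await Actor.init();

// ... produce one result ...
await Actor.pushData({ foo: 'bar' });

// Charge the user for one occurrence of a PPE event. 'result-item' is a
// hypothetical event name and must match the Actor's pricing configuration.
await Actor.charge({ eventName: 'result-item', count: 1 });

await Actor.exit();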
1 comment
In the past I sometimes used RESIDENTIAL5 proxies, which I believed to be even better than the regular RESIDENTIAL proxies. However, as of late they have stopped working. Has anything changed in that regard? My scraper no longer works, and with regular residential proxies it keeps getting blocked.
Hello, what are some ways to prevent users from taking advantage of the free trial to get all the data they need and then never using the Actor again? I recently changed the trial duration from 3 days to 2 hours. I'd like to lower the trial period again, but I can't until some time later because I changed it recently.
16 comments
Crawlee doesn't seem to respect resource limits imposed by cgroups. This poses problems in containerised environments, where Crawlee either gets OOM-killed or silently slows to a crawl because it thinks it has much more resource available than it actually does. Reading and setting the maximum RAM is pretty easy:

Plain Text
import { existsSync, readFileSync } from 'node:fs';
import { log } from 'crawlee';

function getMaxMemoryMB(): number | null {
  const cgroupPath = '/sys/fs/cgroup/memory.max';

  if (!existsSync(cgroupPath)) {
    log.warning('Cgroup v2 memory limit file not found.');
    return null;
  }

  try {
    const data = readFileSync(cgroupPath, 'utf-8').trim();

    if (data === 'max') {
      log.warning('No memory limit set (cgroup reports "max").');
      return null;
    }

    const maxMemoryBytes = parseInt(data, 10);
    return maxMemoryBytes / (1024 * 1024); // Convert to MB
  } catch (error) {
    log.exception(error as Error, 'Error reading cgroup memory limit:');
    return null;
  }
}

This can then be used to set a reasonable RAM limit for Crawlee; however, the CPU limits are proving more difficult. Has anyone found a fix yet?
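For the RAM side, one way to feed the detected limit into Crawlee, assuming the memoryMbytes configuration option (CRAWLEE_MEMORY_MBYTES) is what the autoscaling consults, could look like this:

Plain Text
import { Configuration } from 'crawlee';

const maxMemoryMB = getMaxMemoryMB();
if (maxMemoryMB !== null) {
    // Tell Crawlee's autoscaling how much memory the container actually has.
    Configuration.getGlobalConfig().set('memoryMbytes', Math.floor(maxMemoryMB));
}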
4 comments
It's already working but I'm fairly new to scraping and just want to learn the best possible practices.
The script is 300-400 lines (Typescript) total and contains a login routine + session retention, network listeners as well as DOM querying and is running on a Fastify backend.
Dm me if you are down ♥️
2 comments
Encountered the issue below while deploying to the Apify platform using apify push. This action is being performed in GitHub Actions.

-----
Warning: Detected unsettled top-level await at file:///opt/hostedtoolcache/node/23.7.0/x64/lib/node_modules/apify-cli/bin/run.js:17
await execute({ development: false, dir: import.meta.url });
----

After rerunning the failed job, deployment was successful.
Any idea about the root cause and how to avoid this?
Hi, I'm using Crawlee to fetch some data, but I don't know how to add my own cookies to my crawler. I'm using Playwright to fetch cookies, and after that I want to pass them (in a session, if possible) to my BeautifulSoupCrawler.
3 comments
Is there a way to redirect the output of multiple runs of the same scraper to the same existing dataset, appending the new records? The order doesn't matter. Due to the limitations of the scraper I am using, I need to perform thousands of runs that each produce a very small amount of output, which I would like to add to an existing dataset (obviously with the same format or schema). I skimmed through the Apify API documentation and did not find anything about it.
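One approach that should cover this, sketched under the assumption that each run can open a dataset by name (named datasets persist across runs, and pushData() appends to them):

Plain Text
import { Actor } from 'apify';

await Actor.init();

// 'my-accumulated-results' is a hypothetical dataset name; every run reuses
// the same named dataset and pushData() appends new records to it.
const dataset = await Actor.openDataset('my-accumulated-results');
await dataset.pushData({ item: 'result from this run' });

await Actor.exit();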
22 comments
When using the EnqueueStrategy.SAME_HOSTNAME I noticed it does not work properly on non-www URLs.

In the debugger I noticed it passes the origin to _check_enqueue_strategy, but it uses context.request.loaded_url if available. So every URL that is checked will mismatch because of the difference in hostname.

I tested this with multiple URLs with and without the www prefix and got the same behaviour.
1 comment
Is there a way to fetch data from a centralized source and update it when needed in public actor runs? What is the best way to manage this within Apify mechanisms without using external services (AWS DynamoDB, Firebase)?

Since runs executed by each user in public actors occur in an isolated environment, the default KV store obtained using Actor.get_value() and Actor.set_value() is unique to each user. The share feature for a KV store created with a specific name is only applicable for specific cases where a username, etc., is provided.

Is there a way to make this available to all public actor users?
Hi Apify,

Thank you for this fine auto-scraping tool Crawlee! I wanted to try it out along with the tutorial, but with a different URL, e.g. https://www.etsy.com/search?q=wooden%20box, and it failed with PlaywrightCrawler.
Plain Text
// For more information, see https://crawlee.dev/
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';


// PlaywrightCrawler crawls the web using a headless
// browser controlled by the Playwright library.
const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox,
    },
    maxRequestRetries: 1,
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({ request, page, enqueueLinks, log, pushData }) {
        await page.waitForTimeout(5000);
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save results as JSON to ./storage/datasets/default
        await pushData({ title, url: request.loadedUrl });

        // Extract links from the current page
        // and add them to the crawling queue.
        // await enqueueLinks();
    },
    // Comment this option to scrape the full website.
    maxRequestsPerCrawl: 1,
    // Uncomment this option to see the browser window.
    headless: false,
});

// Add first URL to the queue and start the crawl.
await crawler.run(['https://www.etsy.com/search?q=wooden%20box']);
//await crawler.run(['https://www.etsy.com']); //works
//await crawler.run(['https://www.amazon.com']); //works


It seems to fail at the "Checking device" step. I thought it injected a TLS fingerprint and browser fingerprint, but it seems Etsy still blocks it with a 403!

Thank you!
3 comments
When running the example below, only the first crawler (crawler1) runs, and the second crawler (crawler2) does not work as intended. Running either crawler individually works fine, and changing the URL to something completely different also works fine. Here is an example.

Plain Text
import { PlaywrightCrawler } from 'crawlee';

export async function runExample() {
  const testPage1 =
    'https://inspections.healthunit.com/HedgehogPortal/#/18fbee00-f0a3-49e3-b323-9153b6c4924c/disclosure/facility/3448568d-737b-4b41-ab63-1f2d7a2252b5';

  const testPage2 =
    'https://inspections.healthunit.com/HedgehogPortal/#/18fbee00-f0a3-49e3-b323-9153b6c4924c/disclosure/facility/3448568d-737b-4b41-ab63-1f2d7a2252b5/inspection/ac3196c5-13e6-486c-8b9c-b85dd019fc05';

  const crawler1 = new PlaywrightCrawler({
    requestHandler: async ({ request, page, log }) => {
      const title = await page.title();
      log.info(`URL: ${request.url}\nTITLE: ${title}`);
    },
    launchContext: {
      launchOptions: {
        args: ['--ignore-certificate-errors'],
      },
    },
  });

  const crawler2 = new PlaywrightCrawler({
    requestHandler: async ({ request, page, log }) => {
      const title = await page.title();
      log.info(`URL: ${request.url}\nTITLE: ${title}`);
    },
    launchContext: {
      launchOptions: {
        args: ['--ignore-certificate-errors'],
      },
    },
  });

  await crawler1.run([testPage1]);
  await crawler2.run([testPage2]);
}

runExample();
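One thing that could be worth checking, stated as an assumption: the two URLs differ only in the part after #, and the fragment is normally dropped when a Request's uniqueKey is computed, so with the shared default request queue the second URL may be deduplicated as already handled. Keeping the fragment (or using separate named queues) might change the behaviour:

Plain Text
// Replace the run calls in the example above: keepUrlFragment makes the
// #/... part count towards the uniqueKey, so the two URLs are distinct requests.
await crawler1.run([{ url: testPage1, keepUrlFragment: true }]);
await crawler2.run([{ url: testPage2, keepUrlFragment: true }]);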
9 comments
Hey guys, not the most advanced Apify user, so I need help with scraping leads. The issue is that I scrape the max 5k leads, then try to restart the scraper, and it rescrapes the same 5k leads. How can I get it to scrape the next 5k leads?
1 comment