Apify Discord Mirror

Casper
Joined August 30, 2024
I am getting the error:
Plain Text
 Cannot find module '/home/myuser/dist/main.js' 

when trying to run a successful build of my Actor on Apify, even though it runs locally without errors.
I have disabled a lot of linting rules and warnings, as they serve no real purpose for me other than stopping my TypeScript compilation from succeeding.
It is a TypeScript Crawlee project.

Do you have any idea of what might be the cause?
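A common cause, as a guess: the TypeScript build step doesn't actually emit compiled output to dist/ (for example, noEmit left on, or a different outDir), so the main.js the image's start command expects never exists. A tsconfig.json fragment to check against (the values are assumptions about a typical setup):

```json
{
  "compilerOptions": {
    "outDir": "dist",
    "noEmit": false
  },
  "include": ["src/**/*"]
}
```

It is also worth checking that the Dockerfile's build step (e.g. npm run build) runs tsc before the start command looks for dist/main.js.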
9 comments
I have been having issues for quite a while with ts-node suddenly not understanding TypeScript files and therefore not compiling them to JavaScript that Node can run. I managed to fix it locally by using tsx instead of ts-node, but since tsx is not supported on apify.com, the run fails immediately.

In addition, when I look at the latest build log, I cannot tell whether the latest commit on my main branch on GitHub was used. I have cleaned up every build and version so that only the latest build is present. The Crawlee and Apify docs have not given me any direction for investigating this.

This has been a major headache for a while now so I hope I can get some guidance 🙏

Does anyone have similar bugs? 🤔
8 comments
"storages": {
  "companyInformationDataset": "./dataset_company_information_schema.json",
  "reviewsDataset": "./dataset_reviews_schema.json"
}

I have these storage schemas in my actor.json file, but the Apify UI does not pick them up to display the scraped data saved in them. As a result, users do not know whether scraped data was saved unless they check each named dataset individually. I do not save data to the unnamed (default) dataset.
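As far as I can tell from the Actor specification, the UI only reads a dataset schema for the default dataset, via the dataset key under storages; arbitrary keys such as companyInformationDataset are ignored. A minimal .actor/actor.json sketch (the name, version, and schema file path are placeholders):

```json
{
  "actorSpecification": 1,
  "name": "my-actor",
  "version": "0.1",
  "storages": {
    "dataset": "./dataset_schema.json"
  }
}
```

With views defined in that schema, the default dataset tab in the UI renders them; named datasets would still need to be checked individually.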
3 comments
I would like to add an input to my Actor like this:

  • "website1.com"
  • "website2.com"
How can I best do this in the Apify input schema, and how can I extract these URL strings in TypeScript code?

The approach below unfortunately does not work:
Plain Text
interface InputSchema {
  companyWebsites: string[];
  sortBy: string;
  filterByStarRating: string;
  filterBylanguage: string;
  filterByVerified: string;
  startFromPageNumber: number;
  endAtPageNumber: number;
}
const input = await Actor.getInput<InputSchema>();
let companyWebsites: string[] | undefined = input?.companyWebsites;
companyWebsites?.forEach(function (companyWebsite) {
  console.log(companyWebsite);
});
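For the schema side, a list of strings is usually modelled as an array property with the stringList editor in the input schema file; Actor.getInput() then returns it as string[], which matches the interface above. A hedged sketch (titles, descriptions, and prefill values are placeholders):

```json
{
  "title": "Input",
  "type": "object",
  "schemaVersion": 1,
  "properties": {
    "companyWebsites": {
      "title": "Company websites",
      "type": "array",
      "editor": "stringList",
      "description": "Websites to scrape, one per line.",
      "prefill": ["website1.com", "website2.com"]
    }
  },
  "required": ["companyWebsites"]
}
```

If the schema was missing or the property had a different type, the TypeScript code above would receive undefined or a differently shaped value, which would explain the failure.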
11 comments
I can get the input from Apify in my Crawlee Playwright code and console.log() the start URLs, but I am not sure how to access them because they are typed as any instead of an array of strings. Can you provide some example code for extracting them so I can use them as start URLs in my code?
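One common gotcha, offered as a hedged sketch: the Apify UI's Start URLs editor typically stores an array of objects shaped like { url: string }, not plain strings, which is why the value shows up as any. A small type guard normalizes it (the names here are illustrative):

```typescript
// Hypothetical helper: normalize the Start URLs input (an array of
// { url: string } objects in the Apify UI) into plain string URLs.
interface StartUrlEntry {
  url: string;
}

function toUrlStrings(startUrls: unknown): string[] {
  if (!Array.isArray(startUrls)) return [];
  return startUrls
    .filter(
      (e): e is StartUrlEntry =>
        typeof e === "object" && e !== null && typeof (e as StartUrlEntry).url === "string",
    )
    .map((e) => e.url);
}
```

The resulting string[] can then be passed to the crawler as start URLs.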
28 comments
I have implemented pagination that can start from, e.g., page 2 and end at page 5 inclusive, scraping all the data from each page. It works correctly on my local machine, and I have pushed the newest working code (newest commit ID) to GitHub and then to Apify via a webhook. However, when I run the Actor on Apify.com, it starts at the first page instead of page 2 and does not finish at page 5 inclusive. Any suggestions on what might be wrong?
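One frequent cause is that the platform run uses the input stored with the Actor or task (or its defaults) rather than the local input, so the start page silently falls back to 1; logging the received input at startup usually reveals this. For the range itself, a hedged sketch that derives the page URLs up front so the same bounds apply locally and on the platform:

```typescript
// Illustrative sketch: build the URLs for an inclusive page range.
// The ?page= query parameter is an assumption — adapt it to the
// target site's pagination scheme.
function buildPageUrls(baseUrl: string, startPage: number, endPage: number): string[] {
  const urls: string[] = [];
  for (let page = startPage; page <= endPage; page++) {
    urls.push(`${baseUrl}?page=${page}`);
  }
  return urls;
}
```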
17 comments
How can I force the INPUT_SCHEMA.json file to update the input schema on Apify.com? These input fields should appear on the Actor's Apify Store page.
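If the project uses the .actor folder layout, the schema is, as far as I know, picked up from the path referenced in .actor/actor.json, and the Store page only refreshes after a new build of the Actor. A hedged fragment (the schema file name is a placeholder):

```json
{
  "actorSpecification": 1,
  "input": "./input_schema.json"
}
```

If the file sits in the repository but is not referenced here (or lives at an unexpected path), the platform keeps showing the old schema.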
5 comments
I have a paid actor and I have set up a scheduled task to daily check if it works for my customers. Will I automatically get an email if the run fails?
8 comments
I would really like the ability to see which customers use my paid Actor, so I can contact them and make sure the solution fits their needs perfectly.

Even better would be a way to make it more user-friendly for my customers, as some of them do not have much technical knowledge, so documentation can be hard to write.

In addition, it would make it possible for me to notify my customers of any breaking or non-breaking changes I have made to improve the Actor.

I would also like to see how much my customers are using the product (number of runs, whether they use API calls, amount of data retrieved, and so on) so that I can determine what needs to be improved; a customer might not always contact me, and I then lose them because the solution is not solving their issues.

Lastly, it would be really nice to get an email or a webhook from Apify when I have gained or lost a customer, so I don't have to log in to the platform manually to check.

These challenges make it difficult for me to provide adequate customer care and scale the number of customers using my paid Actors, so this would really help both me and the Apify platform.
7 comments
I would like to get this information to be able to create named datasets based on this run id.
3 comments
I just want to segment the code into small functions. The code below does not work because the type is not recognized.
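Without the original snippet it is hard to be precise, but the usual fix when extracting helpers is to annotate parameter and return types explicitly instead of letting them collapse to any (Crawlee also exports handler context types such as PlaywrightCrawlingContext for this). A self-contained sketch with hypothetical names:

```typescript
// Hypothetical example: an extracted helper with explicit parameter
// and return types, instead of untyped arguments inferred as `any`.
interface Review {
  author: string;
  rating: number;
}

function toReview(raw: { author?: unknown; rating?: unknown }): Review | null {
  if (typeof raw.author !== "string" || typeof raw.rating !== "number") return null;
  return { author: raw.author, rating: raw.rating };
}
```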
2 comments
I would like to get some feedback on what to log to users and customers of my actor.
At the moment I log the progression through each page and any data retrieved, but I think this is unnecessary.
I just want to inform the user of the progression and whether the run is working properly, but I am unsure what would provide the most value.

What do you do for your paid actors?
3 comments
I am trying to develop a Crawlee scraper locally, and in that regard I need to easily purge all data from the default and named datasets, as well as the request queues, to test my changes.
However, the storage is not being purged.
I intend to use the crawlee code in an Apify Actor.
Do you have any suggestions of what might be the issue?
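As far as I know, Crawlee only auto-purges the default storages on start; named datasets and queues are left alone, which may be why data lingers. A blunt local reset, assuming Crawlee's default ./storage directory layout:

```typescript
// Sketch: remove the local storage folders entirely. Assumes the
// default on-disk layout under ./storage; force: true makes missing
// directories a no-op instead of an error.
import { rmSync } from "node:fs";

function purgeLocalStorage(root = "./storage"): void {
  for (const dir of ["datasets", "key_value_stores", "request_queues"]) {
    rmSync(`${root}/${dir}`, { recursive: true, force: true });
  }
}
```

Running this before each test run gives a clean slate for both default and named storages.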
2 comments
I have managed to get some parts of it working, such that I have a Node.js API that starts my crawler.
I have yet to manage the request queue so it can handle additional and concurrent API calls, so I would like to know if someone has had any luck implementing such a solution.
My particular use case requires running this API in my own cloud instead of on Apify.
23 comments
How can I execute document.execCommand("insertText", false, "25810") from Playwright?
7 comments
I want to test different scenarios for how the scraper's input can be used, so I can run these tests to make sure users do not experience problems.
7 comments
Sometimes a dialog box might pop up on a site and I am not interested in the dialog and would just like it to be dismissed.
2 comments
I am trying to scrape a site that generates different CSS classes for my target elements each time the page renders. There are no other attributes to select on and no suitable parent elements to traverse, and I would prefer not to use XPath. Is it possible to decode this HTML to its original form to scrape it more easily?

Also, is there any technique that would make it possible to detect changes or additions of pages?
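The generated class names generally cannot be "decoded" back to stable ones, so the usual workaround is to anchor on something stable, such as visible label text, rather than classes (in Playwright, page.getByText() is the robust way to do this). As a rough, hypothetical illustration of the idea over raw HTML:

```typescript
// Illustrative only: find the value element that follows a stable text
// label, ignoring the randomized class names entirely. Assumes the
// label contains no regex metacharacters.
function extractLabeledValue(html: string, label: string): string | null {
  const re = new RegExp(`>${label}</[^>]+>\\s*<[^>]+>([^<]+)<`);
  const match = html.match(re);
  return match ? match[1].trim() : null;
}
```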
2 comments
Does someone know of a simple npm library to download files from a URL in Javascript/TypeScript?
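On Node 18+ you may not need a library at all: the global fetch plus node:fs/promises covers simple downloads. A hedged sketch with minimal error handling:

```typescript
// Minimal download sketch using Node 18+ built-ins: fetch the URL,
// buffer the body, and write it to disk. Returns bytes written.
import { writeFile } from "node:fs/promises";

async function downloadFile(url: string, dest: string): Promise<number> {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Download failed with status ${res.status}`);
  const buf = Buffer.from(await res.arrayBuffer());
  await writeFile(dest, buf);
  return buf.length;
}
```

For streaming large files, the got package or pipeline from node:stream/promises are common alternatives.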
8 comments
Is it possible to set a debug breakpoint in VS Code when writing TypeScript code for Crawlee, to inspect the values of variables and the flow of code execution?
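Yes. One approach is a launch.json that runs the entry point through a TypeScript loader such as tsx; breakpoints set in .ts files then bind via source maps. A sketch (the entry path and the use of tsx are assumptions about the project setup):

```json
{
  "type": "node",
  "request": "launch",
  "name": "Debug crawler",
  "runtimeExecutable": "npx",
  "runtimeArgs": ["tsx"],
  "program": "${workspaceFolder}/src/main.ts",
  "skipFiles": ["<node_internals>/**"]
}
```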
6 comments
I can see a Python SDK for Apify has been released. Is a Python SDK also planned for Crawlee with the same functionality as Javascript/Typescript with Cheerio and Playwright?
12 comments
How can I implement pagination starting from, e.g., page 10 to page 20, or to the last page?
Do I need to implement my own code for this, or does Crawlee provide something?
I am able to see the last page number on the first page of the website I am scraping.
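Crawlee does not paginate for you; you enqueue the page requests yourself. Since the last page number is visible on the first page, one sketch is to read it there and clamp the requested range before enqueueing (names are illustrative):

```typescript
// Illustrative sketch: clamp a requested inclusive page range to the
// last page discovered on the site, then return the pages to enqueue.
function pageRange(start: number, end: number, lastKnownPage: number): number[] {
  const stop = Math.min(end, lastKnownPage);
  const pages: number[] = [];
  for (let p = Math.max(1, start); p <= stop; p++) pages.push(p);
  return pages;
}
```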
2 comments
Is it possible to stop a crawler and resume it from the previous run's request queues?

I have a crawler that has run for a couple of hours locally, and I would like to add proxies to speed up processing because I am getting throttled on a single IP, but without starting from scratch, since that would be unnecessary and a waste of time. I want to reuse my existing request queues. Is this possible?

Also is this possible on Apify?
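Locally, Crawlee persists the request queue under ./storage, and as far as I know it only disappears because default storages are purged on start; disabling that should let a rerun (with the new proxy configuration) pick up where it left off. One way is via an environment variable:

```
CRAWLEE_PURGE_ON_START=false
```

On Apify, the closest equivalent I am aware of is resurrecting a finished or aborted run from its detail page, which reuses that run's request queue.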
4 comments
I have managed to find the elements for some dropdowns, but I can't find the element for one specific dropdown.
Is there any HTML attribute I can look at to determine which elements can be clicked? There does not seem to be an onClick() handler or anything similar.
12 comments
Currently I have an image embed, ![Sample reviews](...), in my README; it does not render locally or on Apify, but it does render when viewing the README on GitHub.
14 comments