Apify Discord Mirror

Updated 5 months ago

start urls input

At a glance

The community member is having trouble accessing the start URLs in their Crawlee Playwright code, as the URLs are of type "any" instead of an array of strings. The community members discuss various approaches to resolve this issue, including using an interface to define the input schema, converting the input to an array of strings, and using the map() function to set the label for each URL. While there is no explicitly marked answer, the community members seem to have found a solution that involves using the map() function to set the label for each URL and running the crawler with the start URLs.

I can get the input from Apify in my Crawlee Playwright code and console.log() the start urls, but I am not sure how to access them because it says the start urls are of type any instead of an array of strings. Can you provide some example code for this to be extracted so I can use them as start urls in my code?
28 comments
Hi,
I am not sure I understand. You should be able to set the type of the input like this:
Plain Text
interface InputSchema {
    startUrls: string[],
}

// ...

const input = await Actor.getInput<InputSchema>();
// input.startUrls is type of string[]
thanks, how can I add the start urls to this code?

Plain Text
  await crawler.run([
    {
      url: startUrls,
      label: "companyInfo",
    },
  ]);
I read input like this at the moment
Plain Text
 const input = (await Actor.getInput()) as Record<string, any>;
Then convert to Array of strings:
Plain Text
var companyWebsites = input.companyWebsites as Array<string>;
You mean something like this? I would not recommend using the as operator when it is not necessary.
Plain Text
interface InputSchema {
    startUrls: string[],
}

// ...

const input = await Actor.getInput<InputSchema>();
// input.startUrls is type of string[]

await crawler.run(input.startUrls.map((startUrl) => ({
      url: startUrl,
      label: "companyInfo",
})));
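A minimal sketch of the same idea with a runtime check added, since the type parameter on Actor.getInput only tells the compiler what to expect and does not validate the data (the call can also return null). The names LabeledRequest, isInputSchema and toLabeledRequests are hypothetical helpers, and rawInput stands in for whatever Actor.getInput() returns; the resulting array has the same shape that is passed to crawler.run above.

```typescript
// Input shape assumed in this thread.
interface InputSchema {
  startUrls: string[];
}

// Hypothetical minimal request shape: a URL plus a label for the router.
interface LabeledRequest {
  url: string;
  label: string;
}

// Type guard: checks at runtime that `startUrls` really is an array of strings.
function isInputSchema(value: unknown): value is InputSchema {
  if (typeof value !== "object" || value === null) return false;
  const startUrls = (value as { startUrls?: unknown }).startUrls;
  return Array.isArray(startUrls) && startUrls.every((u) => typeof u === "string");
}

// Turns validated input into one labeled request per start URL.
function toLabeledRequests(input: unknown, label: string): LabeledRequest[] {
  if (!isInputSchema(input)) {
    throw new Error("Input must contain `startUrls` as an array of strings");
  }
  return input.startUrls.map((url) => ({ url, label }));
}

// Stand-in for the value returned by Actor.getInput() (sample data, not from the thread).
const rawInput: unknown = {
  startUrls: ["https://example.com", "https://example.org"],
};

const labeledRequests = toLabeledRequests(rawInput, "companyInfo");
console.log(labeledRequests);
```

With this in place, the crawler call would become await crawler.run(labeledRequests), and a malformed input fails early with a readable message.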
yes, something like this:
Plain Text
await Actor.main(async () => {
  const crawler = new PlaywrightCrawler({
    requestHandler: router,
  });
  await crawler.run(
    companyWebsites.map((startUrl) => ({
      url: startUrl,
      label: "companyInfo",
    }))
  );
});
I will convert to the schema interface as well later
I have converted to input schema now but it does not work
ERROR Received one or more errors
Error: Received one or more errors
at ArrayValidator.handle (C:\Development\ApifyWebscrapers\crawlee-trustpilot-review-actor\node_modules@sapphire\shapeshift\src\validators\ArrayValidator.ts:21:14)
what do you have in INPUT.json and input_schema.json?
INPUT.json
Plain Text
{
  "runMode": "PRODUCTION",
  "companyWebsites": [
    "shopwagandtail.com",
    "trustpilot.com"
  ],
  "sortBy": "recency",
  "filterByStarRating": "5",
  "filterBylanguage": "en",
  "filterByVerified": "yes",
  "startFromPageNumber": "2",
  "endAtPageNumber": "3"
}

Plain Text
Error: Input schema is not a valid JSON (SyntaxError: Unexpected token } in JSON at position 458)
So maybe you need:

Plain Text
await crawler.run(
  companyWebsites.map((startUrl) => ({
    url: startUrl.url,
    label: "companyInfo",
  }))
);
apify vis
NOT SUPPORTED: option cache. Map is used as cache, schema object as key.
Error: Input schema is not a valid JSON (SyntaxError: Unexpected token } in JSON at position 458)
And companyWebsites should be an array of objects
in INPUT.json:
Plain Text
"companyWebsites": [{ "url": "shopwagandtail.com" }, { "url": "trustpilot.com" }],
arh, can I change it to just array of strings?
I fixed schema now
but still not array
The attribute in input_schema.json requires the format with objects:
Plain Text
      "editor": "requestListSources",

If you want to use an array of strings, you have to use a different editor, such as:
Plain Text
    "editor": "json"
what would be the easiest for me and my customer?
I am not sure if I follow. Using
Plain Text
 "editor": "requestListSources",
is totally fine, but it requires that specific input format. You mentioned you need a different input, so I suggested using the plain JSON editor for it. I don't know your customer; there are plenty of options, which you can check at https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1
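If the code should work with either input format, one option is to normalize both into plain strings before building requests. This is only a sketch: UrlSource and toUrlStrings are hypothetical names, and it assumes that entries are either bare strings (as with the JSON editor) or objects with a url property (as with requestListSources).

```typescript
// An entry is either a bare string or an object carrying a `url` property.
type UrlSource = string | { url: string };

// Returns the URL string for every entry, regardless of which shape it has.
function toUrlStrings(sources: UrlSource[]): string[] {
  return sources.map((source) =>
    typeof source === "string" ? source : source.url
  );
}

// Sample data mixing both shapes (domains taken from the INPUT.json above).
const mixedSources: UrlSource[] = [
  "shopwagandtail.com",
  { url: "trustpilot.com" },
];

const urlStrings = toUrlStrings(mixedSources);
console.log(urlStrings);
```

The result can then be mapped to labeled requests in the same way as before, so the rest of the crawler code does not need to know which editor produced the input.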
thanks it works now, but how can I use label with startUrl()?
await crawler.run(startUrls);
Hmm... you can use .map to hardcode a single label, as I already mentioned.
map just skips the next url in the array
What do you mean by that?
Plain Text
await crawler.run(
  companyWebsites.map((startUrl) => ({
    url: startUrl.url,
    label: "companyInfo", // sets the label
  }))
);
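On the worry that map skips entries: Array.prototype.map produces exactly one output element per input element, in the same order. A small standalone check, using sample data in the object format discussed above (sampleWebsites and mappedRequests are names made up for this illustration):

```typescript
// Sample input in the object format ({ url: ... }) from the thread.
const sampleWebsites = [{ url: "shopwagandtail.com" }, { url: "trustpilot.com" }];

// Same mapping as in the snippet above: one labeled request per entry.
const mappedRequests = sampleWebsites.map((startUrl) => ({
  url: startUrl.url,
  label: "companyInfo",
}));

// The output has as many entries as the input, so nothing is skipped by map itself.
console.log(mappedRequests.length === sampleWebsites.length); // prints true
```

If a URL still appears to be missing at run time, the cause is probably somewhere other than the map call itself.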
hmm it seems to work, I will check tomorrow, thanks 🙂