Apify Discord Mirror

Updated 5 months ago

How to scrape different things per page

At a glance

The community member is trying to architect a use case where they have an array of different domains and want to perform various tasks on specific pages within those domains, such as finding links on the home page, classifying pricing models, searching for keywords, processing job postings, and counting blog articles. They are looking for best practices on structuring "context aware" tasks on different pages.

In the comments, the community member mentions that they have since stumbled across more advanced methods of routes/labels, and they are leaving the post in case someone finds it helpful and can confirm if that is the right path.

I'm wrapping my head around how to architect my use case. Essentially I have an array of different domains:
[ 'acme.com', 'foo.com', 'bar.com', 'helloworld.org' ]


I want to look for different things as I guide my crawler through the domain. For example:

  1. On the home page/root, I want to find any links that look similar to: /pricing, /security, /careers, and /blog.
  2. I then want to perform a different task on each of these potential pages. For example:
    a. On the pricing page, pass the innerHTML to ChatGPT to classify their pricing model
    b. On the security page, search for the word "SOC2"
    c. On the careers page, queue up to 100 links, and further process the individual job postings
    d. On the blog page, count the number of articles
I'm not looking for someone to help specifically with a-d, but rather to help me understand best practices for structuring "context aware" tasks on different pages.
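
Step 1 above (spotting the interesting links on the home page) could be sketched roughly like this. It is a dependency-free illustration, not Crawlee's API: the label names, the `LABEL_PATTERNS` table, and the helpers `toUrl` and `labelForLink` are all hypothetical, and the extra path variants (`/plans`, `/trust`, `/jobs`) are guesses at what "looks similar" might mean. It also assumes the home page URL is absolute (for example `https://acme.com` rather than the bare `acme.com`).

```typescript
// Hypothetical label names and path patterns; adjust to the sites you crawl.
type PageLabel = "PRICING" | "SECURITY" | "CAREERS" | "BLOG";

const LABEL_PATTERNS: Array<[PageLabel, RegExp]> = [
  ["PRICING", /^\/(pricing|plans)(\/|$)/i],
  ["SECURITY", /^\/(security|trust)(\/|$)/i],
  ["CAREERS", /^\/(careers|jobs)(\/|$)/i],
  ["BLOG", /^\/blog(\/|$)/i],
];

// Parses a URL (optionally relative to a base); returns null if it is malformed.
function toUrl(href: string, base?: string): URL | null {
  try {
    return new URL(href, base);
  } catch {
    return null;
  }
}

// Returns the label for a link found on the home page, or null when the
// link is malformed, off-domain, or matches none of the patterns.
function labelForLink(href: string, homeUrl: string): PageLabel | null {
  const home = toUrl(homeUrl);
  if (home === null) return null;
  const link = toUrl(href, home.href);
  if (link === null || link.hostname !== home.hostname) return null;
  for (const [label, pattern] of LABEL_PATTERNS) {
    if (pattern.test(link.pathname)) return label;
  }
  return null;
}
```

With something like this, `labelForLink("/careers/engineering", "https://acme.com")` yields `"CAREERS"`, while an off-domain link or `/about` yields `null`. The label is then what you attach to the request when you enqueue it, so the crawler knows later which task applies.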
1 comment
As is typical, soon after posting I stumbled across the more advanced methods of routes/labels. I'll leave this post here in case someone finds it helpful. (And perhaps for someone to confirm that's the right path.)
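
For anyone landing here later, the routes/labels idea boils down to: tag each request with a label when you enqueue it, then dispatch to a handler registered for that label. In Crawlee the corresponding pieces are a router (for example `createCheerioRouter()`), `router.addHandler(label, handler)`, `router.addDefaultHandler(handler)`, and the `label` option of `enqueueLinks`. The sketch below deliberately does not import Crawlee so that it runs on its own; `Page`, `Handler`, `routes`, and `dispatch` are hypothetical names, the handlers are synchronous for brevity (real ones would be async), and the pricing handler is left out because it depends on an external API.

```typescript
// Dependency-free sketch of label-based routing; not Crawlee's actual API.
interface Page {
  url: string;
  label?: string; // attached when the link was enqueued
  html: string; // page body, assumed to be already fetched
}

type Handler = (page: Page) => Record<string, unknown>;

const routes = new Map<string, Handler>();

// Security page: check whether "SOC2" (or "SOC 2") appears anywhere.
routes.set("SECURITY", (page) => ({
  url: page.url,
  mentionsSoc2: /SOC\s?2/i.test(page.html),
}));

// Blog page: count <article> tags as a rough proxy for the number of posts.
routes.set("BLOG", (page) => ({
  url: page.url,
  articleCount: (page.html.match(/<article\b/gi) ?? []).length,
}));

// Careers page: collect at most 100 hrefs for further processing.
routes.set("CAREERS", (page) => {
  const attrs = page.html.match(/href="[^"]+"/g) ?? [];
  // Strip the leading href=" and the trailing quote from each match.
  const links = attrs.map((attr) => attr.slice(6, -1));
  return { url: page.url, jobLinks: links.slice(0, 100) };
});

// Fallback for unlabeled pages such as the home page.
const defaultHandler: Handler = (page) => ({ url: page.url, skipped: true });

// Picks the handler registered for the page's label, else the default.
function dispatch(page: Page): Record<string, unknown> {
  const handler = routes.get(page.label ?? "") ?? defaultHandler;
  return handler(page);
}
```

Each handler only has to know about its own kind of page, which is what makes the tasks "context aware": the context travels with the request as its label rather than being re-derived from the URL inside one big handler.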