Apify Discord Mirror

Updated 5 months ago

How to scrape sites that generate elements with dynamic attributes?

At a glance

The community member is trying to scrape a website that generates different CSS classes for the target elements each time the page is rendered, making it difficult to select the elements. They are looking for a way to decode the HTML to its original form to make scraping easier, and also want to know if there are techniques to detect changes or additions to the pages being scraped.

Another community member suggests an approach that involves iterating over all text nodes on the page, checking their computed CSS styles and positions, and selecting the relevant ones based on certain criteria, without relying on the HTML structure. They mention that this is still in the idea stage, and suggest the original poster provide a proof-of-concept example to help investigate further.

The original poster then provides a specific website as an example: https://www.boligportal.dk/lejligheder/odense/82m2-3-vaer-id-5276909.

I am trying to scrape a site that generates different CSS classes for my target elements each time the page is rendered. There are no other attributes to select on and no suitable parent elements to traverse from, and I would prefer not to use XPath. Is it possible to decode this HTML to its original form so that it is easier to scrape?

Also, is there any technique that would make it possible to detect changed or newly added pages?
2 comments
Hello. It would be nice to have an example of such a website so we can investigate further.

I was thinking about creating a solution that would iterate over all text nodes on the page (using XPath), since text is in most cases what you want to scrape, and check the computed CSS styles and computed position of each element on the page.

This way it would be possible to obtain data based on some business input, such as: select all text nodes with color #333 and font-size 10px ±10%, located under the navigation and to the right of the left menu, without caring about the HTML structure at all.

But currently it is only at the idea stage. 😦 Maybe you could put together a PoC that would be just enough for your use case.
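
For anyone who wants to try the idea, here is a rough sketch of the selection step only. It assumes the text nodes have already been collected in a browser (for example by reading each node's computed style and bounding box) and turned into simple records. The `TextNode` record, the `select_nodes` helper, the sample data, and the layout boundaries are all hypothetical and only there for illustration. Colors are assumed to be normalized to six-digit hex beforehand, since browsers usually report computed colors in `rgb(...)` form.

```python
from dataclasses import dataclass


@dataclass
class TextNode:
    """One rendered text node (hypothetical record, collected beforehand)."""
    text: str
    color: str        # computed color, assumed normalized to hex like "#333333"
    font_size: float  # computed font size in px
    x: float          # left edge of the bounding box in px
    y: float          # top edge of the bounding box in px


def select_nodes(nodes, color, font_size, tolerance, min_x, min_y):
    """Keep nodes with the given color, a font size within +-tolerance
    of font_size, positioned at or right of min_x and at or below min_y."""
    low = font_size * (1 - tolerance)
    high = font_size * (1 + tolerance)
    return [
        n for n in nodes
        if n.color.lower() == color.lower()
        and low <= n.font_size <= high
        and n.x >= min_x
        and n.y >= min_y
    ]


# Made-up sample data standing in for nodes collected from a real page.
nodes = [
    TextNode("Home", "#333333", 10.0, 20.0, 10.0),         # navigation bar
    TextNode("Filters", "#333333", 10.0, 10.0, 200.0),     # left menu
    TextNode("82 m2", "#333333", 10.5, 300.0, 220.0),      # main content
    TextNode("Sponsored", "#999999", 10.0, 320.0, 400.0),  # wrong color
    TextNode("Heading", "#333333", 24.0, 300.0, 150.0),    # wrong size
]

# Assumed layout: the navigation ends at y=100, the left menu ends at x=250.
selected = select_nodes(nodes, "#333333", 10.0, 0.10, min_x=250.0, min_y=100.0)
print([n.text for n in selected])
```

Because the filter only looks at rendered style and position, it does not depend on class names at all. It would break if the site changed its visual design, so the thresholds would need to be tuned per site.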