Hi friends!
I've been hacking around with Apify and Crawlee for a few days now and it's a lot of fun.
I'm getting stuck on how to architect my crawler for my use-case and could really use some input:
- I'm planning to collect inputs from an internal webpage (a list of company names + cities)
- I would then submit this array of objects to my Actor via the Apify API (first problem: my understanding is that I'd have to JSON-stringify the inputs, since Apify doesn't support arrays of objects as input? See my first sketch below.)
- For each entry, I would then open the landing page of the site I'm scraping, fill in its search field, and check whether I get a match; if so, I extract the data and save it with pushData({id, name, location, employees, ...}) (roughly the second sketch below)
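To make the first question concrete, here is the call I'm imagining with apify-client. As I read the docs, the run input is a single JSON object, so I'd nest my array under a field; the Actor name and the `companies` field are placeholders I made up:

```ts
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// My understanding: the input is one JSON object, so the array of
// { name, city } pairs gets wrapped in a field rather than sent bare.
const input = {
    companies: [
        { name: 'Acme Inc', city: 'Stockholm' },
        { name: 'Globex', city: 'Gothenburg' },
    ],
};

// call() waits for the run to finish and returns the run object.
const run = await client.actor('my-username/company-search').call(input);
console.log('Run finished with status:', run.status);
```

Do I actually need JSON.stringify anywhere here, or does the client handle that for me?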
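And this is roughly the per-entry handler I have in mind, as a PlaywrightCrawler sketch. All the selectors (#search, .result-row, .employees) are made up for illustration:

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request, pushData, log }) {
        const { name, city } = request.userData;

        // Fill the site's search field with the company + city and submit.
        await page.fill('#search', `${name} ${city}`);
        await page.press('#search', 'Enter');
        await page.waitForLoadState('networkidle');

        // If the search produced no match, skip this entry.
        const results = page.locator('.result-row');
        if (await results.count() === 0) {
            log.info(`No match for ${name} (${city})`);
            return;
        }

        // Extract from the first matching row and save to the dataset.
        const row = results.first();
        await pushData({
            name,
            location: city,
            employees: await row.locator('.employees').textContent(),
        });
    },
});
```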
Here comes what I can't wrap my head around:
- Should I be invoking an Actor once per item, or can I batch everything into one Actor run as I was planning? My thinking was to avoid extra overhead, but I also can't quite wrap my head around how proxies, multiple sessions, UA fingerprinting, etc. work (it seems auto-magic?). It would probably be smart to rotate the fingerprint for each "new" search so it's not obvious I'm hitting the site 30 times as the "same" browser/user. I've sketched my guess at the config after this list.
- How can I queue URLs together with userData? It seems like enqueueLinks and crawler.addRequests only support passing URLs? (The last sketch below shows what I'd want to write.)
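For the rotation question, this is my guess at the knobs involved (very possibly wrong, corrections welcome). My understanding is that Crawlee's browser pool already generates fingerprints automatically, and that limiting session/browser reuse is how you'd force a "fresh" identity per search; the proxy URL is a placeholder:

```ts
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder proxy list; I assume each session gets tied to a proxy.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://my-proxy.example.com:8000'],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    useSessionPool: true,
    sessionPoolOptions: {
        maxPoolSize: 30,
        // Retire each session after a single use, so every search
        // starts with fresh cookies (if I understand sessions right).
        sessionOptions: { maxUsageCount: 1 },
    },
    browserPoolOptions: {
        useFingerprints: true, // I believe this is already the default
        // My guess at forcing a new browser (and fingerprint) per page:
        retireBrowserAfterPageCount: 1,
    },
    async requestHandler({ page }) {
        /* ...search + pushData as above... */
    },
});
```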
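And here's what I'd want to write for the queueing part, if requests can carry userData (the URL and selector are placeholders):

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, enqueueLinks }) {
        // Is the userData from addRequests available here?
        console.log(request.userData); // hoping for { name, city }

        // And can enqueueLinks forward userData onto the new requests?
        await enqueueLinks({
            selector: 'a.detail-link',
            userData: request.userData,
        });
    },
});

// Each entry from my internal webpage becomes one request + its userData.
await crawler.addRequests([
    { url: 'https://example.com/search', userData: { name: 'Acme Inc', city: 'Stockholm' } },
]);
await crawler.run();
```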
Maybe I'm just thinking about this framework all wrong. If you have a better approach, or can help with the questions above, I'd be very grateful!