Apify and Crawlee Official Forum

How to architect my actor and scraper

Hi friends!

I've been hacking around with Apify and Crawlee for a few days now and it's a lot of fun.

I'm getting stuck on how to architect my crawler for my use-case and could really use some input:

  1. I'm planning to collect inputs on my internal webpage (a list of company names + cities).
  2. I would then submit this array of objects to my Actor using the Apify API (first problem: my understanding is that I would have to JSON-stringify the inputs, as Apify doesn't support arrays of objects as input?).
  3. For each entry, I would then open the landing page of the website I'm scraping, enter the company into the search field, check whether I get a match, and if so extract the data and save it with pushData({id, name, location, employees, ...}).
Here is what I can't wrap my head around:
  1. Should I be invoking an Actor once per item, or can I batch everything into one run as I was thinking? My idea was to avoid extra overhead, but I also can't quite wrap my head around how proxies, multiple sessions, UA fingerprinting, etc. work (it seems auto-magic?). It would probably be smart to rotate the fingerprint for each "new" search so it's not obvious I'm hitting the site 30 times as the "same" browser/user.
  2. How can I queue URLs + userData? It seems like enqueueLinks and crawler.addRequests only support passing URLs?
Maybe I'm just thinking all wrong about how I should be using the framework. If you have a better approach or can help me with the above questions, I would be very grateful!
Hi,

Regarding the input question:
The input can be any JSON, but at the top level it has to be an object. You can get around this by passing an object with a single field that holds your array of objects.
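For illustration, a minimal sketch of that shape, assuming the Actor uses the Apify SDK v3 and an input field called "searches" (the field name is arbitrary):
JavaScript
// Run input sent via the API (or the console) would look like:
// { "searches": [{ "name": "Acme Inc", "city": "Berlin" }, ...] }
import { Actor } from 'apify';

await Actor.init();

// The top level has to be an object; the array simply lives inside it.
const { searches = [] } = await Actor.getInput();
console.log(`Received ${searches.length} companies to look up.`);

await Actor.exit();

The run endpoint accepts this whole object as the JSON body of the request, so you should not need to stringify the array yourself.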

  1. Batching multiple items is usually the way to go, as you can process multiple requests in parallel, which makes it more cost-effective.
The default configuration of the crawlers in Crawlee should be sensible enough for most use cases. Generally there is no single right setup; it is usually a trial-and-error process. I would suggest not spending too much time on optimizations unless you are actually getting blocked. In that case you can, for example, lower the max error score for each session (how many times it can be marked as blocked before it is rotated out); there is a sketch of the relevant options at the end of this reply.

  2. You can also pass a RequestOptions object like this:
JavaScript
await crawler.addRequests([
    {
        url: 'https://apify.com',
        userData: { foo: 'bar' },
    },
    {
        url: 'https://crawlee.dev',
        userData: { foo: 'bar' },
    },
]);
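
In the request handler, that userData then comes back on the request object. A minimal sketch (using CheerioCrawler as an example; the same applies to the other crawler classes):
JavaScript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request }) {
        // Whatever you attached when enqueueing travels with the request.
        const { foo } = request.userData;
        console.log(`Handling ${request.url} with foo=${foo}`);
    },
});

await crawler.addRequests([
    { url: 'https://crawlee.dev', userData: { foo: 'bar' } },
]);
await crawler.run();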


With enqueueLinks, you can use either the userData field or the transformRequestFunction function:
JavaScript
await enqueueLinks({
    selector: 'a',
    userData: { foo: 'bar' },
});

await enqueueLinks({
    selector: 'a',
    transformRequestFunction: (request) => {
        request.userData = { ...request.userData, foo: 'bar' };
        return request;
    },
});
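
To the blocking point above, a rough sketch of the session-related options you could tune if you do start getting blocked (the values are arbitrary examples; the defaults are usually fine):
JavaScript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Sessions (cookies + proxy IP + browser fingerprint) are rotated
    // automatically; these options only tune how aggressively.
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: {
        maxPoolSize: 30,
        sessionOptions: {
            // Retire a session after it has been marked as blocked this many times.
            maxErrorScore: 1,
        },
    },
    // Browser crawlers generate a fresh fingerprint per session by default;
    // this just makes it explicit.
    browserPoolOptions: {
        useFingerprints: true,
    },
    async requestHandler({ page, request }) {
        // ... search for the company and extract the data ...
    },
});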
Thanks! I must have been doing something wrong, because whenever I add to the queue like in your first example, it only runs the first request before the crawler ends. I'm using the apify run --purge command locally. Is there anything specific I need to signal inside the defaultHandler for it to continue with the other requests I add?
I'm going to start a fresh project with just the queueing of links to see if I can figure out what's not letting it run further.
Are you enqueuing the same URL? If so, only the first request will be handled. You can get around this by specifying a different uniqueKey for each of these requests.
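For illustration (the URL and keys below are made up), something like this reuses the same search URL but gives each request its own identity:
JavaScript
await crawler.addRequests([
    {
        url: 'https://example.com/search',
        uniqueKey: 'acme-inc-berlin',
        userData: { name: 'Acme Inc', city: 'Berlin' },
    },
    {
        url: 'https://example.com/search',
        uniqueKey: 'globex-munich',
        userData: { name: 'Globex', city: 'Munich' },
    },
]);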
Wow! That was it. Thank you! I'm now passing a uniqueKey with each input and it works as expected.
Glad it worked! πŸ˜ƒ
Thanks for the thorough reply. This was super helpful and cleared up all of the issues I was facing πŸ™‚