Difference between the scraped amount in browser vs on Python

Ruuubear

I am using the Twitter scraper from Python. In the browser console I get all 300 tweets that I request, but a counter in my Python script that increments for each item in client.dataset(run['defaultDatasetId']).iterate_items() ends up at around 90, so it seems I am only getting about a third of the tweets I scrape. Does anyone know why, or can you recommend what to do?

Pepa J

Hello, could you provide us with a better example of the code, and maybe send the runId (if you are running it on the platform) in a PM, so we can try to reproduce/investigate it on our side?

Also beware that scraping through a proxy from a different region, or with a logged-out account, may give you different results than you see in the browser.

Ruuubear

Hi, the run ID is: taVGBlEaesj8eUwii.

This is my run input:

Ruuubear

run_input = {
    "profilesDesired": 1,
    "handle": [f"{user_profile}"],
    "searchMode": "user",
    "tweetsDesired": maxtweets,
    # "mode": "replies",
    "proxyConfig": {"useApifyProxy": True},
    "extendOutputFunction": """async ({ data, item, page, request, customData, Apify }) => { return item;}""",
    "extendScraperFunction": """async ({ page, request, addSearch, addProfile, , addThread, addEvent, customData, Apify, signal, label }) => {

}""",
    "customData": {},
    "handlePageTimeoutSecs": 5000,
    "maxRequestRetries": 6,
    "maxIdleTimeoutSecs": 60,
    "initialCookies": [],
}

Ruuubear

maxtweets was set to 300

Pepa J

I just tested it and it worked well.
My implementation:

from apify import Actor
from apify_client import ApifyClient


async def main():
    async with Actor:
        # Get the value of the actor input
        actor_input = await Actor.get_input() or {}

        apify_client = ApifyClient('apify_api_************************')

        dataset = apify_client.dataset('Sh*************zz')

        dataset_items = dataset.list_items().items

        i = 1
        for item in dataset_items:
            print(i)
            i += 1

Be sure you provide the right datasetId (and not the actorId).

Ruuubear

Thanks for your reply. In your example are you pulling the dataset already scraped? I'm trying to get the data as it is scraped live.

Pepa J

Yes, I wait for the run to finish. Otherwise you would have to do some active waiting: check whether the actor is still running, track the offset parameter for listing items based on the items you have already downloaded, and so on.

Ruuubear

Is it possible to do it live? It is quite essential for what I am building.

Pepa J

I am afraid you would need to solve this yourself; I don't know of any dataset streaming mechanism available on the platform.
But you may solve this by polling the data every few seconds and checking the run status (I suggest waiting another 5 seconds after the run ends, because some items could still be being stored to the dataset).
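
A minimal sketch of that polling approach, not an official recipe. It assumes `client` behaves like apify_client's ApifyClient (dataset(...).list_items(offset=..., limit=...) returning a page with .items, and run(...).get() returning a dict with a "status" key). The function name iter_items_live, the parameter names, and the set of terminal statuses are my own assumptions for illustration.

```python
import time
from typing import Any, Iterator

# Assumed set of run statuses after which no more items are written.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}


def iter_items_live(client: Any, run_id: str, dataset_id: str,
                    poll_secs: float = 5.0, grace_secs: float = 5.0,
                    page_size: int = 1000) -> Iterator[dict]:
    """Yield dataset items while the run is still in progress (hypothetical helper)."""
    dataset = client.dataset(dataset_id)
    offset = 0          # number of items already yielded
    finished = False    # set once the run reports a terminal status

    while True:
        # Drain everything stored beyond the current offset.
        while True:
            page = dataset.list_items(offset=offset, limit=page_size)
            if not page.items:
                break
            yield from page.items
            offset += len(page.items)

        if finished:
            return  # the final drain after the grace period is done

        run = client.run(run_id).get()
        if run is None or run["status"] in TERMINAL_STATUSES:
            # Wait a little, then loop once more to pick up late items.
            finished = True
            time.sleep(grace_secs)
        else:
            time.sleep(poll_secs)
```

You would start the actor without waiting for it to finish (for example with client.actor(...).start(run_input=run_input)), then loop over iter_items_live(client, run["id"], run["defaultDatasetId"]) and count or process each item as it arrives.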

Ruuubear

Ok, thanks for your help

Apify and Crawlee Official Forum
