Apify and Crawlee Official Forum

Updated 4 months ago

Difference between the scraped amount in browser vs on Python

I am using the twitter scraper in Python and finding that on the browser console I am getting all 300 tweets that I request. I have a counter in my Python script that increments with each item in the client.dataset(run['defaultDatasetId']).iterate_items(). This ends up being around 90, so it seems I am only getting 1/3rd of the tweets I scrape. Anyone know why or recommend what to do?
11 comments
Hello, could you provide a more complete example of your code, and maybe send the run ID (if you are running it on the platform) in a PM, so we can try to reproduce and investigate it on our side?

Also be aware that scraping through a proxy from a different region, or with a logged-out account, may give you different results than what you see in the browser.
Hi, the run ID is: taVGBlEaesj8eUwii.

This is my run input:
run_input = {
    "profilesDesired": 1,
    "handle": [f"{user_profile}"],
    "searchMode": "user",
    "tweetsDesired": maxtweets,
    # "mode": "replies",
    "proxyConfig": {"useApifyProxy": True},
    "extendOutputFunction": """async ({ data, item, page, request, customData, Apify }) => { return item;}""",
    "extendScraperFunction": """async ({ page, request, addSearch, addProfile, , addThread, addEvent, customData, Apify, signal, label }) => {

}""",
    "customData": {},
    "handlePageTimeoutSecs": 5000,
    "maxRequestRetries": 6,
    "maxIdleTimeoutSecs": 60,
    "initialCookies": [],
}
maxtweets was set to 300


I just tested it and it worked well.
My implementation:

from apify import Actor
from apify_client import ApifyClient


async def main():
    async with Actor:
        # Get the value of the actor input
        actor_input = await Actor.get_input() or {}

        apify_client = ApifyClient('apify_api_************************')

        dataset = apify_client.dataset('Sh*************zz')

        dataset_items = dataset.list_items().items

        # Print a running count (1, 2, 3, ...) for every item in the dataset
        for i, _item in enumerate(dataset_items, start=1):
            print(i)

Be sure to provide the correct datasetId (and not the actorId).
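One way to avoid mixing up the IDs is to read the dataset ID from the run object instead of typing it in. The following is only a rough sketch: the helper name `count_items_after_run` is made up for illustration, `client` is assumed to be an `ApifyClient` instance, and `actor_id` is whatever Actor you are running.

```python
from typing import Any, Dict


def count_items_after_run(client: Any, actor_id: str, run_input: Dict[str, Any]) -> int:
    """Run an Actor to completion and count the items in its default dataset.

    `client` is assumed to be an apify_client.ApifyClient instance.
    """
    # .call() starts the Actor and waits until the run has finished.
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("The Actor run did not return a run object")

    # Take the dataset ID from the run itself, not from the Actor.
    dataset_id = run["defaultDatasetId"]
    return sum(1 for _ in client.dataset(dataset_id).iterate_items())
```

If the number returned here matches what you asked for, but a count taken while the run is still in progress is lower, the difference is most likely just items that had not been stored yet.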
Thanks for your reply. In your example are you pulling the dataset already scraped? I'm trying to get the data as it is scraped live.
Yes, I wait for the run to finish. Otherwise you would have to do some active waiting: check whether the actor is still running, work out the offset parameter for listing items based on the items already downloaded, and so on.
Is it possible to do it live? It is quite essential for what I am building.
I am afraid you would need to solve this yourself; I don't know of any dataset streaming mechanism available on the platform.
But you could solve this by polling the data every few seconds and checking the run status (I suggest waiting another 5 seconds after the run ends, because some items could still be in the process of being stored to the dataset).
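A minimal sketch of that polling approach, in case it helps. The function names `stream_items` and `stream_run_items`, the 5 second intervals, and the page size are all illustrative choices, not something the platform prescribes. The wiring in `stream_run_items` assumes an `ApifyClient` instance and its `start`, `run(...).get()` and `list_items(offset=..., limit=...)` calls.

```python
import time
from typing import Any, Callable, Dict, Iterator, List

# Run statuses after which no new items are expected.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}


def stream_items(
    fetch_items: Callable[[int], List[Any]],
    get_status: Callable[[], str],
    poll_secs: float = 5.0,
    grace_secs: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Any]:
    """Yield items as they appear, tracking the offset of what was already read."""
    offset = 0
    while True:
        # Read the status first, so the fetch below always happens after it.
        finished = get_status() in TERMINAL_STATUSES
        if finished:
            # Give late writes a moment to land in the dataset.
            sleep(grace_secs)

        # Drain everything stored since the last poll, page by page.
        while True:
            items = fetch_items(offset)
            if not items:
                break
            yield from items
            offset += len(items)

        if finished:
            return
        sleep(poll_secs)


def stream_run_items(
    client: Any, actor_id: str, run_input: Dict[str, Any], page_size: int = 100
) -> Iterator[Any]:
    """Start an Actor run and stream its dataset items (assumes an ApifyClient)."""
    # .start() returns right away, without waiting for the run to finish.
    run = client.actor(actor_id).start(run_input=run_input)
    dataset = client.dataset(run["defaultDatasetId"])
    return stream_items(
        fetch_items=lambda offset: dataset.list_items(offset=offset, limit=page_size).items,
        get_status=lambda: client.run(run["id"]).get()["status"],
    )
```

You would then loop over `stream_run_items(client, actor_id, run_input)` and handle each tweet as it comes in. Because the offset only moves forward by the number of items already received, no item should be yielded twice.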
Ok, thanks for your help