Apify and Crawlee Official Forum

Updated 3 months ago

Twitter scraping by both keyword and profile

It is too computationally intensive and slow for me to make the API call with one of the filters and then post-process the results with the second filter. I am wondering whether a single API call can scrape with both a keyword filter and a profile filter. Is this possible, or can I only do one or the other? Thanks!
11 comments
I see this question is similar to the Facebook scraper post. Is it the same case, that you are unable to filter by both simultaneously in one API call?
Hello, Twitter has its own advanced search. Could you fill in the advanced search form ( https://twitter.com/search-advanced?lang=en ) and then copy and paste the resulting query into the Actor's input? If that does not help, what combination of keywords and profiles are you trying to scrape?
For some reason, when I use advanced search with both a user and a keyword on Apify, it only searches by the keyword. Is that supposed to happen?
Which specific Actor do you use? I just tried Twitter Scraper, and 90% of the results are from the user I set in the input, with the right keywords.
I use the same one. I'm asking whether it's possible to set both a keyword and a user and have the results match both, that is, only tweets from that user containing that keyword.
Can you give us a more specific example and a step-by-step description of what you are trying to achieve?
Sure. Say I want to scrape all tweets by https://twitter.com/JoeBiden containing the word "president". I am currently using this code:

import json
import requests

# api_token and api_endpoint are assumed to be defined earlier (not shown in this post).
actor_input = {
    "addTweetViewCount": True,
    "addUserInfo": False,
    "browserFallback": False,
    "debugLog": False,
    "extendOutputFunction": "async ({ data, item, page, request, customData, Apify }) => {\n return item;\n}",
    "extendScraperFunction": "async ({ page, request, addSearch, addProfile, addThread, addEvent, customData, Apify, signal, label }) => {\n \n}",
    "fromDate": "2021-11-02",
    "handle": [
        "https://twitter.com/JoeBiden"
    ],
    "handlePageTimeoutSecs": 5000,
    "maxIdleTimeoutSecs": 60,
    "maxRequestRetries": 6,
    "mode": "own",
    "profilesDesired": 10,
    "proxyConfig": {
        "useApifyProxy": True
    },
    "searchTerms": [
        "president"
    ],
    "tweetsDesired": 10000,
    "useAdvancedSearch": True,
    "useCheerio": True
}

headers = {
    'Content-Type': 'application/json; charset=utf-8',
    'Authorization': f'Bearer {api_token}'
}
# Serialize the Actor input to a JSON request body.
data = json.dumps(actor_input)

response = requests.post(api_endpoint, headers=headers, data=data)
However, it looks like the Actor is retrieving tweets containing the search term 'president' from any user. I am only interested in tweets from "https://twitter.com/JoeBiden" containing the term 'president'. Thanks!
Yes, with this general input I am also receiving a lot of irrelevant results.

That's why I suggested generating the query expression with the advanced search form (on the Twitter website) and using it for the searchTerms attribute. The input then looks like this:
{
  ...
  "searchTerms": [
    "\"president\" (from:JoeBiden) -filter:links -filter:replies"
  ],
  ...
}

Now all the results belong to the specified Twitter account.
Ahh okay, I was wrongly under the impression that the API would have done this for me. Thank you so much!
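If you need to build that expression in code rather than with the web form, here is a minimal Python sketch. It assumes only the operators shown in the query above (a quoted keyword, from:, -filter:links, -filter:replies). The helper name build_search_query and its parameters are hypothetical and are not part of the Actor or of any Apify API.

```python
def build_search_query(keyword, handle, exclude_links=True, exclude_replies=True):
    """Combine a keyword and an author into one Twitter advanced-search query.

    Hypothetical helper for illustration; it only concatenates search operators.
    """
    # Accept a bare handle, an @handle, or a full profile URL.
    username = handle.rstrip("/").rsplit("/", 1)[-1].lstrip("@")
    parts = [f'"{keyword}"', f"(from:{username})"]
    if exclude_links:
        parts.append("-filter:links")
    if exclude_replies:
        parts.append("-filter:replies")
    return " ".join(parts)


query = build_search_query("president", "https://twitter.com/JoeBiden")

# Only the searchTerms entry changes; the rest of the Actor input stays as before.
search_terms_patch = {"searchTerms": [query]}
print(query)
```

The resulting string goes into searchTerms in place of the plain keyword, so the filtering by both keyword and profile happens in the search itself instead of in post-processing.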