Python package to scrape Twitter's front-end easily with Selenium
Twitter scraper selenium
Python package to scrape Twitter's front-end easily with Selenium.
Table of Contents
- Getting Started
- Usage
- Privacy
- License
Prerequisites
Installation
Installing from the source
Download the source code or clone it with:
git clone https://github.com/shaikhsajid1111/twitter-scraper-selenium
Open a terminal inside the downloaded folder and run:
python3 setup.py install
Installing with PyPI
pip3 install twitter-scraper-selenium
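A quick import serves as a sanity check that the package installed correctly (a minimal sketch; it only touches functions documented below):

```python
# Verify the installation by importing the package's documented entry points.
from twitter_scraper_selenium import scrape_profile, get_profile_details

print("twitter-scraper-selenium imported successfully")
```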
Usage
Available Functions in this Package - Summary
Function Name | Function Description | Scraping Method | Scraping Speed |
---|---|---|---|
scrape_profile() | Scrapes a Twitter user's profile tweets. | Browser Automation | Slow |
scrape_keyword() | Scrapes tweets matching the provided keyword. | Browser Automation | Slow |
scrape_topic() | Scrapes tweets by topic URL. It expects the URL of the topic. | Browser Automation | Slow |
scrape_keyword_with_api() | Scrapes tweets by query/keywords. For an advanced search, the query can be built from here. | HTTP Request | Fast |
get_profile_details() | Scrapes a Twitter user's profile details. | HTTP Request | Fast |
scrape_topic_with_api() | Scrapes tweets by topic URL. It expects the URL of the topic. | Browser Automation & HTTP Request | Fast |
scrape_profile_with_api() | Scrapes tweets by Twitter profile username. It expects the username of the profile. | Browser Automation & HTTP Request | Fast |
Note: The HTTP Request method sends requests directly to Twitter's API to collect data, while Browser Automation visits the page in a browser and scrolls while collecting the data.
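For example, the same profile can be scraped either way; a minimal sketch based on the function signatures shown below (tweet counts are illustrative):

```python
from twitter_scraper_selenium import scrape_profile, scrape_profile_with_api

# Browser automation: opens the profile page and scrolls while collecting tweets (slower).
tweets_json = scrape_profile(twitter_username="TwitterAPI", browser="firefox", tweets_count=10)

# HTTP request method: queries Twitter's API directly and writes the result to a file (faster).
scrape_profile_with_api("TwitterAPI", output_filename="twitterapi_tweets", tweets_count=10)
```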
To scrape Twitter profile details:
from twitter_scraper_selenium import get_profile_details
twitter_username = "TwitterAPI"
filename = "twitter_api_data"
get_profile_details(twitter_username=twitter_username, filename=filename)
Output:
{
"id": 6253282,
"id_str": "6253282",
"name": "Twitter API",
"screen_name": "TwitterAPI",
"location": "San Francisco, CA",
"profile_location": null,
"description": "The Real Twitter API. Tweets about API changes, service issues and our Developer Platform. Don't get an answer? It's on my website.",
"url": "https:\/\/t.co\/8IkCzCDr19",
"entities": {
"url": {
"urls": [{
"url": "https:\/\/t.co\/8IkCzCDr19",
"expanded_url": "https:\/\/developer.twitter.com",
"display_url": "developer.twitter.com",
"indices": [
0,
23
]
}]
},
"description": {
"urls": []
}
},
"protected": false,
"followers_count": 6133636,
"friends_count": 12,
"listed_count": 12936,
"created_at": "Wed May 23 06:01:13 +0000 2007",
"favourites_count": 31,
"utc_offset": null,
"time_zone": null,
"geo_enabled": null,
"verified": true,
"statuses_count": 3656,
"lang": null,
"contributors_enabled": null,
"is_translator": null,
"is_translation_enabled": null,
"profile_background_color": null,
"profile_background_image_url": null,
"profile_background_image_url_https": null,
"profile_background_tile": null,
"profile_image_url": null,
"profile_image_url_https": "https:\/\/pbs.twimg.com\/profile_images\/942858479592554497\/BbazLO9L_normal.jpg",
"profile_banner_url": null,
"profile_link_color": null,
"profile_sidebar_border_color": null,
"profile_sidebar_fill_color": null,
"profile_text_color": null,
"profile_use_background_image": null,
"has_extended_profile": null,
"default_profile": false,
"default_profile_image": false,
"following": null,
"follow_request_sent": null,
"notifications": null,
"translator_type": null
}
get_profile_details()
arguments:
Argument | Argument Type | Description |
---|---|---|
twitter_username | String | Twitter username. |
output_filename | String | Filename where the output is stored. |
output_dir | String | Directory where the output file should be saved. |
proxy | String | Optional parameter, if the user wants to use a proxy for scraping. If the proxy requires authentication, the format is username:password@host:port. |
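A sketch combining the optional parameters above, using the parameter names from this table; the proxy address and directory are placeholders:

```python
from twitter_scraper_selenium import get_profile_details

# Save the profile JSON under /tmp, routing the request through an
# authenticated proxy (placeholder credentials).
get_profile_details(
    twitter_username="TwitterAPI",
    output_filename="twitter_api_data",
    output_dir="/tmp",
    proxy="user:pass@127.0.0.1:8080",
)
```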
Keys of the output:
Details of each key can be found here.
To scrape a profile's tweets:
In JSON format:
from twitter_scraper_selenium import scrape_profile
microsoft = scrape_profile(twitter_username="microsoft", output_format="json", browser="firefox", tweets_count=10)
print(microsoft)
Output:
{
"1430938749840629773": {
"tweet_id": "1430938749840629773",
"username": "Microsoft",
"name": "Microsoft",
"profile_picture": "https://twitter.com/Microsoft/photo",
"replies": 29,
"retweets": 58,
"likes": 453,
"is_retweet": false,
"retweet_link": "",
"posted_time": "2021-08-26T17:02:38+00:00",
"content": "Easy to use and efficient for all \u2013 Windows 11 is committed to an accessible future.\n\nHere's how it empowers everyone to create, connect, and achieve more: https://msft.it/6009X6tbW ",
"hashtags": [],
"mentions": [],
"images": [],
"videos": [],
"tweet_url": "https://twitter.com/Microsoft/status/1430938749840629773",
"link": "https://blogs.windows.com/windowsexperience/2021/07/01/whats-coming-in-windows-11-accessibility/?ocid=FY22_soc_omc_br_tw_Windows_AC"
},...
}
In CSV format:
from twitter_scraper_selenium import scrape_profile
scrape_profile(twitter_username="microsoft", output_format="csv", browser="firefox", tweets_count=10, filename="microsoft", directory="/home/user/Downloads")
Output:
tweet_id | username | name | profile_picture | replies | retweets | likes | is_retweet | retweet_link | posted_time | content | hashtags | mentions | images | videos | tweet_url | link |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1430938749840629773 | Microsoft | Microsoft | https://twitter.com/Microsoft/photo | 64 | 75 | 521 | False | | 2021-08-26T17:02:38+00:00 | Easy to use and efficient for all – Windows 11 is committed to an accessible future. Here's how it empowers everyone to create, connect, and achieve more: https://msft.it/6009X6tbW | [] | [] | [] | [] | https://twitter.com/Microsoft/status/1430938749840629773 | https://blogs.windows.com/windowsexperience/2021/07/01/whats-coming-in-windows-11-accessibility/?ocid=FY22_soc_omc_br_tw_Windows_AC |
...
scrape_profile()
arguments:
Argument | Argument Type | Description |
---|---|---|
twitter_username | String | Twitter username of the account. |
browser | String | Which browser to use for scraping. Only 2 are supported: Chrome and Firefox. Default is set to Firefox. |
proxy | String | Optional parameter, if the user wants to use a proxy for scraping. If the proxy requires authentication, the format is username:password@host:port. |
tweets_count | Integer | Number of posts to scrape. Default is 10. |
output_format | String | The output format, whether JSON or CSV. Default is JSON. |
filename | String | If output_format is set to CSV, the filename parameter should be passed. If not passed, the filename will be the same as the username passed. |
directory | String | If output_format is set to CSV, the directory parameter may be passed. If not passed, the CSV file will be saved in the current working directory. |
headless | Boolean | Whether to run the crawler headlessly. Default is True. |
browser_profile | String | Path to the browser profile where cookies are stored, which can be used for scraping data in an authenticated way. |
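For instance, to watch the browser while it scrapes or to reuse stored cookies, the headless and browser_profile parameters above can be combined; a sketch with a placeholder profile path:

```python
from twitter_scraper_selenium import scrape_profile

# Run a visible (non-headless) Firefox window with an existing browser
# profile so its cookies are reused (the profile path is a placeholder).
data = scrape_profile(
    twitter_username="microsoft",
    browser="firefox",
    tweets_count=5,
    headless=False,
    browser_profile="/home/user/.mozilla/firefox/abcd1234.default-release",
)
```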
Keys of the output
Key | Type | Description |
---|---|---|
tweet_id | String | Post identifier (integer cast inside a string) |
username | String | Username of the profile |
name | String | Name of the profile |
profile_picture | String | Profile picture link |
replies | Integer | Number of replies to the tweet |
retweets | Integer | Number of retweets of the tweet |
likes | Integer | Number of likes of the tweet |
is_retweet | Boolean | Is the tweet a retweet? |
retweet_link | String | If it is a retweet, the retweet link; else an empty string |
posted_time | String | Time when the tweet was posted, in ISO 8601 format |
content | String | Content of the tweet as text |
hashtags | Array | Hashtags present in the tweet, if any |
mentions | Array | Mentions present in the tweet, if any |
images | Array | Image links, if present in the tweet |
videos | Array | Video links, if present in the tweet |
tweet_url | String | URL of the tweet |
link | String | Any external website link present inside the tweet |
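Since the JSON output is keyed by tweet_id, it can be post-processed like any dictionary; a sketch assuming the function returns the JSON document as a string, as the print example above suggests:

```python
import json

from twitter_scraper_selenium import scrape_profile

raw = scrape_profile(twitter_username="microsoft", output_format="json",
                     browser="firefox", tweets_count=10)
tweets = json.loads(raw)  # assumption: the JSON output is returned as a string

# Print a few documented fields for each scraped tweet.
for tweet_id, tweet in tweets.items():
    print(tweet_id, tweet["posted_time"], tweet["likes"], tweet["content"][:60])
```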
To scrape tweets using keywords with the API:
from twitter_scraper_selenium import scrape_keyword_with_api
query = "#gaming"
tweets_count = 10
output_filename = "gaming_hashtag_data"
scrape_keyword_with_api(query=query, tweets_count=tweets_count, output_filename=output_filename)
Output:
{
"1583821467732480001": {
"tweet_url" : "https://twitter.com/yakubblackbeard/status/1583821467732480001",
"tweet_details":{
...
},
"user_details":{
...
}
}, ...
}
scrape_keyword_with_api()
arguments:
Argument | Argument Type | Description |
---|---|---|
query | String | Query to search. For an advanced search, the query can be built from here. |
tweets_count | Integer | Number of tweets to scrape. |
output_filename | String | Filename where the output is stored. |
output_dir | String | Directory where the output file should be saved. |
proxy | String | Optional parameter, if the user wants to use a proxy for scraping. If the proxy requires authentication, the format is username:password@host:port. |
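Because the query is passed through to Twitter's search, advanced-search operators can be embedded directly in the string; a sketch (the operators are standard Twitter search syntax, not package-specific):

```python
from twitter_scraper_selenium import scrape_keyword_with_api

# Search for English #gaming tweets with at least 100 likes, using
# Twitter advanced-search operators inside the query string.
scrape_keyword_with_api(
    query="#gaming min_faves:100 lang:en",
    tweets_count=20,
    output_filename="gaming_popular",
)
```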
Keys of the output:
Key | Type | Description |
---|---|---|
tweet_url | String | URL of the tweet. |
tweet_details | Dictionary | A dictionary containing data about the tweet. All available fields can be checked here. |
user_details | Dictionary | A dictionary containing data about the tweet owner. All available fields can be checked here. |
To scrape tweets using keywords with browser automation:
In JSON format:
from twitter_scraper_selenium import scrape_keyword
# scrape 10 posts by searching the keyword "india" from 30th August to 31st August
india = scrape_keyword(keyword="india", browser="firefox",
                       tweets_count=10, output_format="json", until="2021-08-31", since="2021-08-30")
print(india)
Output:
{
"1432493306152243200": {
"tweet_id": "1432493306152243200",
"username": "TOICitiesNews",
"name": "TOI Cities",
"profile_picture": "https://twitter.com/TOICitiesNews/photo",
"replies": 0,
"retweets": 0,
"likes": 0,
"is_retweet": false,
"posted_time": "2021-08-30T23:59:53+00:00",
"content": "Paralympians rake in medals, India Inc showers them with rewards",
"hashtags": [],
"mentions": [],
"images": [],
"videos": [],
"tweet_url": "https://twitter.com/TOICitiesNews/status/1432493306152243200",
"link": "https://t.co/odmappLovL?amp=1"
},...
}
In CSV format:
from twitter_scraper_selenium import scrape_keyword
scrape_keyword(keyword="india", browser="firefox",
               tweets_count=10, until="2021-08-31", since="2021-08-30", output_format="csv", filename="india")
Output:
tweet_id | username | name | profile_picture | replies | retweets | likes | is_retweet | posted_time | content | hashtags | mentions | images | videos | tweet_url | link |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1432493306152243200 | TOICitiesNews | TOI Cities | https://twitter.com/TOICitiesNews/photo | 0 | 0 | 0 | False | 2021-08-30T23:59:53+00:00 | Paralympians rake in medals, India Inc showers them with rewards | [] | [] | [] | [] | https://twitter.com/TOICitiesNews/status/1432493306152243200 | https://t.co/odmappLovL?amp=1 |
...
scrape_keyword()
arguments:
Argument | Argument Type | Description |
---|---|---|
keyword | String | Keyword to search on Twitter. |
browser | String | Which browser to use for scraping. Only 2 are supported: Chrome and Firefox. Default is set to Firefox. |
until | String | Optional parameter. Until date for scraping, an end date where the search ends. Date format is YYYY-MM-DD. |
since | String | Optional parameter. Since date for scraping, a past date from where to search. Date format is YYYY-MM-DD. |
proxy | String | Optional parameter, if the user wants to use a proxy for scraping. If the proxy requires authentication, the format is username:password@host:port. |
tweets_count | Integer | Number of posts to scrape. Default is 10. |
output_format | String | The output format, whether JSON or CSV. Default is JSON. |
filename | String | If output_format is set to CSV, the filename parameter should be passed. If not passed, the filename will be the same as the keyword passed. |
directory | String | If output_format is set to CSV, the directory parameter may be passed. If not passed, the CSV file will be saved in the current working directory. |
since_id | Integer | After (NOT inclusive) a specified Snowflake ID. Example here. |
max_id | Integer | At or before (inclusive) a specified Snowflake ID. Example here. |
within_time | String | Search within the last number of days, hours, minutes, or seconds. Example: 2d, 3h, 5m, 30s. |
headless | Boolean | Whether to run the crawler headlessly. Default is True. |
browser_profile | String | Path to the browser profile where cookies are stored, which can be used for scraping data in an authenticated way. |
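As a sketch of the relative time window, within_time can replace the fixed since/until dates (the values are illustrative):

```python
from twitter_scraper_selenium import scrape_keyword

# Scrape tweets containing "india" posted within the last 2 days.
recent = scrape_keyword(keyword="india", browser="firefox",
                        tweets_count=10, within_time="2d")
print(recent)
```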
Keys of the output
Key | Type | Description |
---|---|---|
tweet_id | String | Post identifier (integer cast inside a string) |
username | String | Username of the profile |
name | String | Name of the profile |
profile_picture | String | Profile picture link |
replies | Integer | Number of replies to the tweet |
retweets | Integer | Number of retweets of the tweet |
likes | Integer | Number of likes of the tweet |
is_retweet | Boolean | Is the tweet a retweet? |
posted_time | String | Time when the tweet was posted, in ISO 8601 format |
content | String | Content of the tweet as text |
hashtags | Array | Hashtags present in the tweet, if any |
mentions | Array | Mentions present in the tweet, if any |
images | Array | Image links, if present in the tweet |
videos | Array | Video links, if present in the tweet |
tweet_url | String | URL of the tweet |
link | String | Any external website link present inside the tweet |
To scrape topic tweets by URL using the API:
from twitter_scraper_selenium import scrape_topic_with_api
topic_url = 'https://twitter.com/i/topics/1468157909318045697'
scrape_topic_with_api(URL=topic_url, output_filename='solana_cryptocurrency', tweets_count=50)
Output:
{
"1584979408338632705": {
"tweet_url" : "https://twitter.com/AptosBullCNFT/status/1584979408338632705",
"tweet_details":{
...
},
"user_details":{
...
}
}, ...
}
scrape_topic_with_api()
arguments:
Argument | Argument Type | Description |
---|---|---|
URL | String | Twitter topic URL. |
tweets_count | Integer | Number of tweets to scrape. |
output_filename | String | Filename where the output is stored. |
output_dir | String | Directory where the output file should be saved. |
proxy | String | Optional parameter, if the user wants to use a proxy for scraping. If the proxy requires authentication, the format is username:password@host:port. |
browser | String | Which browser to use for extracting the GraphQL key. Default is Firefox. |
headless | Boolean | Whether to run the browser in headless mode. |
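A sketch combining the optional parameters above (the proxy address is a placeholder):

```python
from twitter_scraper_selenium import scrape_topic_with_api

# Use Chrome in headless mode to extract the GraphQL key, and route
# requests through an authenticated proxy (placeholder credentials).
scrape_topic_with_api(
    URL="https://twitter.com/i/topics/1468157909318045697",
    output_filename="solana_cryptocurrency",
    tweets_count=30,
    browser="chrome",
    headless=True,
    proxy="user:pass@127.0.0.1:8080",
)
```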
Keys of the output:
Same as scrape_keyword_with_api
To scrape topic tweets with URL using browser automation:
from twitter_scraper_selenium import scrape_topic
# scrape 10 tweets from the Steam Deck topic on Twitter
data = scrape_topic(filename="steamdeck", url='https://twitter.com/i/topics/1415728297065861123',
browser="firefox", tweets_count=10)
Keys of the output:
Same as scrape_profile
scrape_topic()
arguments:
Argument | Argument Type | Description |
---|---|---|
filename | String | Filename to write the output to. |
URL | String | Topic URL. |
browser | String | Which browser to use for scraping. Only 2 are supported: Chrome and Firefox. Default is Firefox. |
proxy | String | Optional parameter, if the user wants to use a proxy for scraping. If the proxy requires authentication, the format is username:password@host:port. |
tweets_count | Integer | Number of posts to scrape. Default is 10. |
output_format | String | The output format, whether JSON or CSV. Default is JSON. |
directory | String | Directory to save the output file. Default is the current working directory. |
browser_profile | String | Path to the browser profile where cookies are stored, which can be used for scraping data in an authenticated way. |
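For example, the topic tweets can be written to CSV instead of JSON with the output_format and directory parameters above (the directory path is a placeholder, and the topic link is passed as url, as in the example above):

```python
from twitter_scraper_selenium import scrape_topic

# Save 10 tweets from the Steam Deck topic as a CSV file in Downloads.
scrape_topic(filename="steamdeck", url="https://twitter.com/i/topics/1415728297065861123",
             browser="firefox", tweets_count=10,
             output_format="csv", directory="/home/user/Downloads")
```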
To scrape a profile's tweets with the API:
from twitter_scraper_selenium import scrape_profile_with_api
scrape_profile_with_api('elonmusk', output_filename='musk', tweets_count=100)
scrape_profile_with_api()
Arguments:
Argument | Argument Type | Description |
---|---|---|
username | String | Twitter profile username. |
tweets_count | Integer | Number of tweets to scrape. |
output_filename | String | Filename where the output is stored. |
output_dir | String | Directory where the output file should be saved. |
proxy | String | Optional parameter, if the user wants to use a proxy for scraping. If the proxy requires authentication, the format is username:password@host:port. |
browser | String | Which browser to use for extracting the GraphQL key. Default is Firefox. |
headless | Boolean | Whether to run the browser in headless mode. |
Output:
{
"1608939190548598784": {
"tweet_url" : "https://twitter.com/elonmusk/status/1608939190548598784",
"tweet_details":{
...
},
"user_details":{
...
}
}, ...
}
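The scraped tweets can then be read back from the file the call produced; a sketch assuming the function writes <output_filename>.json to the current working directory (output_dir can point it elsewhere, per the table above):

```python
import json

from twitter_scraper_selenium import scrape_profile_with_api

scrape_profile_with_api('elonmusk', output_filename='musk', tweets_count=100)

# Assumption: the call above writes musk.json to the working directory.
with open("musk.json", encoding="utf-8") as fp:
    tweets = json.load(fp)
print(f"{len(tweets)} tweets scraped")
```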
Using the scraper with a proxy (HTTP proxy)
Just pass the proxy argument to the function.
from twitter_scraper_selenium import scrape_keyword
scrape_keyword(keyword="#india", browser="firefox", tweets_count=10, output_format="csv", filename="india",
               proxy="66.115.38.247:5678")  # in IP:PORT format
Proxy that requires authentication:
from twitter_scraper_selenium import scrape_profile
microsoft_data = scrape_profile(twitter_username="microsoft", browser="chrome", tweets_count=10, output_format="json",
                                proxy="sajid:pass123@66.115.38.247:5678")  # username:password@IP:PORT
print(microsoft_data)
Privacy
This scraper only scrapes public data available to an unauthenticated user and does not have the capability to scrape anything private.
LICENSE
MIT