Twitter scraper selenium
A Python package to scrape Twitter's front end easily with Selenium.
Installation
Installing from the source
Download the source code or clone it with:
git clone https://github.com/shaikhsajid1111/twitter-scraper-selenium
Open a terminal inside the downloaded folder and run:
python3 setup.py install
Installing with PyPI
pip3 install twitter-scraper-selenium
Usage
To scrape a profile's tweets:
In JSON format:
from twitter_scraper_selenium import scrap_profile
microsoft = scrap_profile(twitter_username="microsoft", output_format="json", browser="firefox", tweets_count=10)
print(microsoft)
Output:
{
"1430938749840629773": {
"tweet_id": "1430938749840629773",
"username": "Microsoft",
"name": "Microsoft",
"profile_picture": "https://twitter.com/Microsoft/photo",
"replies": 29,
"retweets": 58,
"likes": 453,
"is_retweet": false,
"retweet_link": "",
"posted_time": "2021-08-26T17:02:38+00:00",
"content": "Easy to use and efficient for all \u2013 Windows 11 is committed to an accessible future.\n\nHere's how it empowers everyone to create, connect, and achieve more: https://msft.it/6009X6tbW ",
"hashtags": [],
"mentions": [],
"images": [],
"videos": [],
"tweet_url": "https://twitter.com/Microsoft/status/1430938749840629773",
"link": "https://blogs.windows.com/windowsexperience/2021/07/01/whats-coming-in-windows-11-accessibility/?ocid=FY22_soc_omc_br_tw_Windows_AC"
},...
}
In CSV format:
from twitter_scraper_selenium import scrap_profile
scrap_profile(twitter_username="microsoft", output_format="csv", browser="firefox", tweets_count=10, filename="microsoft", directory="/home/user/Downloads")
Output:
tweet_id | username | name | profile_picture | replies | retweets | likes | is_retweet | retweet_link | posted_time | content | hashtags | mentions | images | videos | post_url | link |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1430938749840629773 | Microsoft | Microsoft | https://twitter.com/Microsoft/photo | 64 | 75 | 521 | False | | 2021-08-26T17:02:38+00:00 | Easy to use and efficient for all – Windows 11 is committed to an accessible future. Here's how it empowers everyone to create, connect, and achieve more: https://msft.it/6009X6tbW | [] | [] | [] | [] | https://twitter.com/Microsoft/status/1430938749840629773 | https://blogs.windows.com/windowsexperience/2021/07/01/whats-coming-in-windows-11-accessibility/?ocid=FY22_soc_omc_br_tw_Windows_AC |
...
scrap_profile() arguments:
Argument | Argument Type | Description |
---|---|---|
twitter_username | String | Twitter username of the account. |
browser | String | Which browser to use for scraping; only Chrome and Firefox are supported. Default is Firefox. |
proxy | String | Optional. Proxy to use for scraping. For an authenticated proxy, use the format username:password@host:port. |
tweets_count | Integer | Number of tweets to scrape. Default is 10. |
output_format | String | The output format, either JSON or CSV. Default is JSON. |
filename | String | If output_format is set to CSV, the filename for the output file. If not passed, the filename defaults to the username. |
directory | String | If output_format is set to CSV, the directory where the file is saved. If not passed, the file is saved in the current working directory. |
headless | Boolean | Whether to run the browser headless. Default is True. |
browser_profile | String | Path to a browser profile where cookies are stored; can be used to scrape data as an authenticated user. |
Keys of the output
Key | Type | Description |
---|---|---|
tweet_id | String | Tweet identifier (an integer cast to a string). |
username | String | Username of the profile. |
name | String | Name of the profile. |
profile_picture | String | Profile picture link. |
replies | Integer | Number of replies to the tweet. |
retweets | Integer | Number of retweets of the tweet. |
likes | Integer | Number of likes of the tweet. |
is_retweet | Boolean | Whether the tweet is a retweet. |
retweet_link | String | If the tweet is a retweet, the link to the original tweet; otherwise an empty string. |
posted_time | String | Time the tweet was posted, in ISO 8601 format. |
content | String | Content of the tweet as text. |
hashtags | Array | Hashtags present in the tweet, if any. |
mentions | Array | Mentions present in the tweet, if any. |
images | Array | Links to images, if present in the tweet. |
videos | Array | Links to videos, if present in the tweet. |
tweet_url | String | URL of the tweet. |
link | String | External link contained in the tweet, if any. |
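The JSON output maps tweet IDs to tweet objects with the keys above. A minimal sketch of iterating it, using a trimmed sample of the example output (if the function returns a JSON string rather than a dict, `json.loads` converts it first):

```python
import json

# Trimmed sample of scrap_profile() output, taken from the example above.
output = json.loads("""
{
  "1430938749840629773": {
    "tweet_id": "1430938749840629773",
    "username": "Microsoft",
    "likes": 453,
    "retweets": 58,
    "content": "Easy to use and efficient for all - Windows 11 ..."
  }
}
""")

# Each key is a tweet ID; each value holds the fields documented above.
for tweet_id, tweet in output.items():
    print(tweet_id, tweet["likes"], tweet["content"][:40])
```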
To scrape tweets using keywords:
In JSON format:
from twitter_scraper_selenium import scrap_keyword
# scrape 10 tweets matching the keyword "india" posted between 30th and 31st August
india = scrap_keyword(keyword="india", browser="firefox", tweets_count=10,
                      output_format="json", until="2021-08-31", since="2021-08-30")
print(india)
Output:
{
"1432493306152243200": {
"tweet_id": "1432493306152243200",
"username": "TOICitiesNews",
"name": "TOI Cities",
"profile_picture": "https://twitter.com/TOICitiesNews/photo",
"replies": 0,
"retweets": 0,
"likes": 0,
"is_retweet": false,
"posted_time": "2021-08-30T23:59:53+00:00",
"content": "Paralympians rake in medals, India Inc showers them with rewards",
"hashtags": [],
"mentions": [],
"images": [],
"videos": [],
"tweet_url": "https://twitter.com/TOICitiesNews/status/1432493306152243200",
"link": "https://t.co/odmappLovL?amp=1"
},...
}
In CSV format:
from twitter_scraper_selenium import scrap_keyword
scrap_keyword(keyword="india", browser="firefox", tweets_count=10,
              until="2021-08-31", since="2021-08-30", output_format="csv", filename="india")
Output:
tweet_id | username | name | profile_picture | replies | retweets | likes | is_retweet | posted_time | content | hashtags | mentions | images | videos | tweet_url | link |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1432493306152243200 | TOICitiesNews | TOI Cities | https://twitter.com/TOICitiesNews/photo | 0 | 0 | 0 | False | 2021-08-30T23:59:53+00:00 | Paralympians rake in medals, India Inc showers them with rewards | [] | [] | [] | [] | https://twitter.com/TOICitiesNews/status/1432493306152243200 | https://t.co/odmappLovL?amp=1 |
...
scrap_keyword() arguments:
Argument | Argument Type | Description |
---|---|---|
keyword | String | Keyword to search on Twitter. |
browser | String | Which browser to use for scraping; only Chrome and Firefox are supported. Default is Firefox. |
until | String | Optional. End date for the search, in YYYY-MM-DD format. |
since | String | Optional. Start date for the search, in YYYY-MM-DD format. |
proxy | String | Optional. Proxy to use for scraping. For an authenticated proxy, use the format username:password@host:port. |
tweets_count | Integer | Number of tweets to scrape. Default is 10. |
output_format | String | The output format, either JSON or CSV. Default is JSON. |
filename | String | If output_format is set to CSV, the filename for the output file. If not passed, the filename defaults to the keyword. |
directory | String | If output_format is set to CSV, the directory where the file is saved. If not passed, the file is saved in the current working directory. |
since_id | Integer | Return tweets after (not inclusive) the specified Snowflake ID. |
max_id | Integer | Return tweets at or before (inclusive) the specified Snowflake ID. |
within_time | String | Search within the last number of days, hours, minutes, or seconds, e.g. 2d, 3h, 5m, 30s. |
headless | Boolean | Whether to run the browser headless. Default is True. |
browser_profile | String | Path to a browser profile where cookies are stored; can be used to scrape data as an authenticated user. |
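The since and until parameters expect plain YYYY-MM-DD strings. A small sketch of a hypothetical helper (not part of the library) that builds such a date window with the standard library:

```python
from datetime import date, timedelta

def day_window(days_back=1):
    """Build (since, until) strings in the YYYY-MM-DD format that
    scrap_keyword() expects, covering the last `days_back` days."""
    until = date.today()
    since = until - timedelta(days=days_back)
    return since.isoformat(), until.isoformat()

since, until = day_window(1)
print(since, until)  # e.g. "2021-08-30 2021-08-31", depending on today's date
```

The resulting strings can be passed directly as `since=...` and `until=...`.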
Keys of the output
Key | Type | Description |
---|---|---|
tweet_id | String | Tweet identifier (an integer cast to a string). |
username | String | Username of the profile. |
name | String | Name of the profile. |
profile_picture | String | Profile picture link. |
replies | Integer | Number of replies to the tweet. |
retweets | Integer | Number of retweets of the tweet. |
likes | Integer | Number of likes of the tweet. |
is_retweet | Boolean | Whether the tweet is a retweet. |
posted_time | String | Time the tweet was posted, in ISO 8601 format. |
content | String | Content of the tweet as text. |
hashtags | Array | Hashtags present in the tweet, if any. |
mentions | Array | Mentions present in the tweet, if any. |
images | Array | Links to images, if present in the tweet. |
videos | Array | Links to videos, if present in the tweet. |
tweet_url | String | URL of the tweet. |
link | String | External link contained in the tweet, if any. |
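If you want CSV with only a subset of these keys, you can post-process the JSON output yourself instead of using output_format="csv". A sketch with the standard csv module, using a trimmed sample of the example output above:

```python
import csv
import io

# Trimmed sample of scrap_keyword() JSON output (shape from the example above).
tweets = {
    "1432493306152243200": {
        "tweet_id": "1432493306152243200",
        "username": "TOICitiesNews",
        "likes": 0,
        "content": "Paralympians rake in medals, India Inc showers them with rewards",
    }
}

# Pick a subset of columns and write them as CSV.
fields = ["tweet_id", "username", "likes", "content"]
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fields)
writer.writeheader()
for tweet in tweets.values():
    writer.writerow({key: tweet[key] for key in fields})
print(buf.getvalue())
```

Writing to a real file works the same way with `open("tweets.csv", "w", newline="")` in place of the StringIO buffer.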
To scrape a topic's tweets using its URL:
from twitter_scraper_selenium import scrap_topic
# scrape 10 tweets from the Steam Deck topic on Twitter
data = scrap_topic(filename="steamdeck", url='https://twitter.com/i/topics/1415728297065861123',
browser="firefox", tweets_count=10)
Output and keys of the output are the same as for scrap_keyword().
scrap_topic() arguments:
Argument | Argument Type | Description |
---|---|---|
filename | String | Filename to write the output to. |
url | String | Topic URL. |
browser | String | Which browser to use for scraping; only Chrome and Firefox are supported. Default is Firefox. |
proxy | String | Optional. Proxy to use for scraping. For an authenticated proxy, use the format username:password@host:port. |
tweets_count | Integer | Number of tweets to scrape. Default is 10. |
output_format | String | The output format, either JSON or CSV. Default is JSON. |
directory | String | Directory to save the output file in. Defaults to the current working directory. |
browser_profile | String | Path to a browser profile where cookies are stored; can be used to scrape data as an authenticated user. |
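Topic URLs have the form https://twitter.com/i/topics/&lt;id&gt;. A sketch of a hypothetical helper (not part of the library) that extracts the numeric ID, e.g. for use as a filename:

```python
from urllib.parse import urlparse

def topic_id(url):
    """Return the trailing numeric ID of a twitter.com/i/topics/<id> URL."""
    return urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]

print(topic_id("https://twitter.com/i/topics/1415728297065861123"))
# 1415728297065861123
```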
Using the scraper with a proxy (HTTP proxy)
Just pass the proxy argument to the function.
from twitter_scraper_selenium import scrap_keyword
scrap_keyword(keyword="#india", browser="firefox", tweets_count=10, output_format="csv",
              filename="india", proxy="66.115.38.247:5678")  # in IP:PORT format
Proxy that requires authentication:
from twitter_scraper_selenium import scrap_profile
microsoft_data = scrap_profile(twitter_username="microsoft", browser="chrome", tweets_count=10,
                               output_format="json", proxy="sajid:pass123@66.115.38.247:5678")  # username:password@IP:PORT
print(microsoft_data)
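Both proxy string formats above can be composed with a small helper. This is a hypothetical convenience function, not part of the library:

```python
def proxy_string(host, port, username=None, password=None):
    """Build an IP:PORT proxy string, or username:password@IP:PORT
    when credentials are given."""
    auth = f"{username}:{password}@" if username and password else ""
    return f"{auth}{host}:{port}"

print(proxy_string("66.115.38.247", 5678))
# 66.115.38.247:5678
print(proxy_string("66.115.38.247", 5678, "sajid", "pass123"))
# sajid:pass123@66.115.38.247:5678
```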
Privacy
This scraper only scrapes public data available to an unauthenticated user and cannot scrape anything private.
LICENSE
MIT