Python package to scrape Twitter's front end easily with Selenium
Twitter scraper selenium
A Python package to scrape Twitter's front end easily with Selenium.
Table of Contents
Prerequisites
Installation
Installing from the source
Download the source code or clone it with:
git clone https://github.com/shaikhsajid1111/twitter-scraper-selenium
Open a terminal inside the downloaded folder and run:
python3 setup.py install
Installing with PyPI
pip3 install twitter-scraper-selenium
Usage
To scrape a profile's tweets:
In JSON format:
from twitter_scraper_selenium import scrap_profile
microsoft = scrap_profile(twitter_username="microsoft", output_format="json", browser="firefox", tweets_count=10)
print(microsoft)
Output:
{
"1430938749840629773": {
"tweet_id": "1430938749840629773",
"username": "Microsoft",
"name": "Microsoft",
"profile_picture": "https://twitter.com/Microsoft/photo",
"replies": 29,
"retweets": 58,
"likes": 453,
"is_retweet": false,
"retweet_link": "",
"posted_time": "2021-08-26T17:02:38+00:00",
"content": "Easy to use and efficient for all \u2013 Windows 11 is committed to an accessible future.\n\nHere's how it empowers everyone to create, connect, and achieve more: https://msft.it/6009X6tbW ",
"hashtags": [],
"mentions": [],
"images": [],
"videos": [],
"tweet_url": "https://twitter.com/Microsoft/status/1430938749840629773",
"link": "https://blogs.windows.com/windowsexperience/2021/07/01/whats-coming-in-windows-11-accessibility/?ocid=FY22_soc_omc_br_tw_Windows_AC"
},...
}
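The JSON output above is keyed by tweet ID. A minimal sketch of post-processing it follows, assuming the function returns a JSON string shaped like the sample. The `sample_output` literal is a trimmed copy of that sample (not live data), and `summarize` is a hypothetical helper, not part of the package.

```python
import json

# Trimmed copy of the sample output above. In practice this would be the value
# returned by scrap_profile(..., output_format="json"), assumed to be a JSON string.
sample_output = """
{
  "1430938749840629773": {
    "tweet_id": "1430938749840629773",
    "username": "Microsoft",
    "replies": 29,
    "retweets": 58,
    "likes": 453,
    "is_retweet": false,
    "posted_time": "2021-08-26T17:02:38+00:00",
    "tweet_url": "https://twitter.com/Microsoft/status/1430938749840629773"
  }
}
"""


def summarize(raw_json):
    """Return (tweet_url, likes) pairs sorted by likes, highest first."""
    tweets = json.loads(raw_json)
    rows = [(tweet["tweet_url"], tweet["likes"]) for tweet in tweets.values()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


summary = summarize(sample_output)
for url, likes in summary:
    print(f"{likes:>6}  {url}")
```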
In CSV format:
from twitter_scraper_selenium import scrap_profile
scrap_profile(twitter_username="microsoft", output_format="csv", browser="firefox", tweets_count=10, filename="microsoft", directory="/home/user/Downloads")
Output:
tweet_id | username | name | profile_picture | replies | retweets | likes | is_retweet | retweet_link | posted_time | content | hashtags | mentions | images | videos | tweet_url | link |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1430938749840629773 | Microsoft | Microsoft | https://twitter.com/Microsoft/photo | 64 | 75 | 521 | False | | 2021-08-26T17:02:38+00:00 | Easy to use and efficient for all – Windows 11 is committed to an accessible future. Here's how it empowers everyone to create, connect, and achieve more: https://msft.it/6009X6tbW | [] | [] | [] | [] | https://twitter.com/Microsoft/status/1430938749840629773 | https://blogs.windows.com/windowsexperience/2021/07/01/whats-coming-in-windows-11-accessibility/?ocid=FY22_soc_omc_br_tw_Windows_AC |
...
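Once a CSV file has been written, it can be read back with the standard library. The sketch below is an illustration only: `load_tweets_csv` is a hypothetical helper, the demo file is hand-written from the sample row above, and it assumes list-valued cells are stored as literals such as `[]`. The exact path of the file the scraper writes is not shown here, so pass whatever path the file actually has.

```python
import ast
import csv
import os
import tempfile

INT_COLUMNS = ("replies", "retweets", "likes")
LIST_COLUMNS = ("hashtags", "mentions", "images", "videos")


def load_tweets_csv(path):
    """Read a scraped CSV file into a list of dicts with typed values."""
    rows = []
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            for column in INT_COLUMNS:
                row[column] = int(row[column])
            for column in LIST_COLUMNS:
                # Assumption: list cells are written as literals such as "[]".
                row[column] = ast.literal_eval(row[column])
            rows.append(row)
    return rows


# Demo on a hand-written file that mimics a subset of the columns above.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "microsoft.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["tweet_id", "username", "replies", "retweets", "likes",
                         "hashtags", "mentions", "images", "videos"])
        writer.writerow(["1430938749840629773", "Microsoft", 64, 75, 521,
                         "[]", "[]", "[]", "[]"])
    tweets = load_tweets_csv(demo_path)

print(tweets[0]["username"], tweets[0]["likes"])
```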
scrap_profile() arguments:
Argument | Argument Type | Description |
---|---|---|
twitter_username | String | Twitter username of the account. |
browser | String | Browser to use for scraping. Only Chrome and Firefox are supported. Default is Firefox. |
proxy | String | Optional. Proxy to use for scraping. For an authenticated proxy, the format is username:password@host:port. |
tweets_count | Integer | Number of posts to scrape. Default is 10. |
output_format | String | The output format, either JSON or CSV. Default is JSON. |
filename | String | Name of the output file when output_format is set to CSV. If not passed, the username is used as the filename. |
directory | String | Directory in which to save the file when output_format is set to CSV. If not passed, the CSV file is saved in the current working directory. |
headless | Boolean | Whether to run the crawler headlessly. Default is True. |
browser_profile | String | Path to the browser profile where cookies are stored, which can be used to scrape data in an authenticated way. |
Keys of the output
Key | Type | Description |
---|---|---|
tweet_id | String | Post identifier (an integer cast to a string) |
username | String | Username of the profile |
name | String | Name of the profile |
profile_picture | String | Profile picture link |
replies | Integer | Number of replies to the tweet |
retweets | Integer | Number of retweets of the tweet |
likes | Integer | Number of likes of the tweet |
is_retweet | Boolean | Whether the tweet is a retweet |
retweet_link | String | The retweet link if the tweet is a retweet, otherwise an empty string |
posted_time | String | Time when the tweet was posted, in ISO 8601 format |
content | String | Content of the tweet as text |
hashtags | Array | Hashtags present in the tweet, if any |
mentions | Array | Mentions present in the tweet, if any |
images | Array | Image links, if the tweet contains images |
videos | Array | Video links, if the tweet contains videos |
tweet_url | String | URL of the tweet |
link | String | Link to an external website, if the tweet contains one |
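The key table above maps naturally onto a typed record. The following is a sketch only: the `Tweet` dataclass and its `from_dict` method are hypothetical names, not part of the package, and the input dict is copied from the sample output shown earlier.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Tweet:
    """Typed view of one entry of the scraper output (hypothetical helper)."""
    tweet_id: str
    username: str
    name: str
    replies: int
    retweets: int
    likes: int
    is_retweet: bool
    posted_time: datetime
    content: str
    tweet_url: str
    retweet_link: str = ""
    link: str = ""
    hashtags: List[str] = field(default_factory=list)
    mentions: List[str] = field(default_factory=list)
    images: List[str] = field(default_factory=list)
    videos: List[str] = field(default_factory=list)

    @classmethod
    def from_dict(cls, raw):
        """Build a Tweet from one value of the output dict."""
        values = dict(raw)
        values.pop("profile_picture", None)  # not modelled in this sketch
        # posted_time is documented as ISO 8601, e.g. 2021-08-26T17:02:38+00:00
        values["posted_time"] = datetime.fromisoformat(values["posted_time"])
        return cls(**values)


raw_tweet = {
    "tweet_id": "1430938749840629773",
    "username": "Microsoft",
    "name": "Microsoft",
    "profile_picture": "https://twitter.com/Microsoft/photo",
    "replies": 29,
    "retweets": 58,
    "likes": 453,
    "is_retweet": False,
    "retweet_link": "",
    "posted_time": "2021-08-26T17:02:38+00:00",
    "content": "Easy to use and efficient for all",
    "hashtags": [],
    "mentions": [],
    "images": [],
    "videos": [],
    "tweet_url": "https://twitter.com/Microsoft/status/1430938749840629773",
    "link": "",
}

tweet = Tweet.from_dict(raw_tweet)
print(tweet.username, tweet.posted_time.date(), tweet.likes)
```

Because `retweet_link` has a default, the same class also accepts keyword-search results, which omit that key.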
To scrape tweets using keywords:
In JSON format:
from twitter_scraper_selenium import scrap_keyword
# Scrape 10 posts matching the keyword "india", posted from 30 August to 31 August 2021
india = scrap_keyword(keyword="india", browser="firefox",
                      tweets_count=10, output_format="json", until="2021-08-31", since="2021-08-30")
print(india)
Output:
{
"1432493306152243200": {
"tweet_id": "1432493306152243200",
"username": "TOICitiesNews",
"name": "TOI Cities",
"profile_picture": "https://twitter.com/TOICitiesNews/photo",
"replies": 0,
"retweets": 0,
"likes": 0,
"is_retweet": false,
"posted_time": "2021-08-30T23:59:53+00:00",
"content": "Paralympians rake in medals, India Inc showers them with rewards",
"hashtags": [],
"mentions": [],
"images": [],
"videos": [],
"tweet_url": "https://twitter.com/TOICitiesNews/status/1432493306152243200",
"link": "https://t.co/odmappLovL?amp=1"
},...
}
In CSV format:
from twitter_scraper_selenium import scrap_keyword
scrap_keyword(keyword="india", browser="firefox",
              tweets_count=10, until="2021-08-31", since="2021-08-30", output_format="csv", filename="india")
Output:
tweet_id | username | name | profile_picture | replies | retweets | likes | is_retweet | posted_time | content | hashtags | mentions | images | videos | tweet_url | link |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1432493306152243200 | TOICitiesNews | TOI Cities | https://twitter.com/TOICitiesNews/photo | 0 | 0 | 0 | False | 2021-08-30T23:59:53+00:00 | Paralympians rake in medals, India Inc showers them with rewards | [] | [] | [] | [] | https://twitter.com/TOICitiesNews/status/1432493306152243200 | https://t.co/odmappLovL?amp=1 |
...
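The keyword examples above pass `since` and `until` as date strings such as `2021-08-30`. A small sketch for computing such a pair from a `date` object follows; `date_window` is a hypothetical helper name, not something the package provides.

```python
from datetime import date, timedelta


def date_window(end, days):
    """Return (since, until) as YYYY-MM-DD strings covering `days` days up to `end`."""
    if days < 1:
        raise ValueError("days must be at least 1")
    start = end - timedelta(days=days)
    # date.isoformat() always yields YYYY-MM-DD.
    return start.isoformat(), end.isoformat()


since, until = date_window(date(2021, 8, 31), days=1)
print(since, until)  # 2021-08-30 2021-08-31
```

The resulting strings can then be passed as the `since` and `until` arguments of `scrap_keyword`.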
scrap_keyword() arguments:
Argument | Argument Type | Description |
---|---|---|
keyword | String | Keyword to search on Twitter. |
browser | String | Browser to use for scraping. Only Chrome and Firefox are supported. Default is Firefox. |
until | String | Optional. End date of the search. Format is YYYY-MM-DD. |
since | String | Optional. Start date of the search, a past date to search from. Format is YYYY-MM-DD. |
proxy | String | Optional. Proxy to use for scraping. For an authenticated proxy, the format is username:password@host:port. |
tweets_count | Integer | Number of posts to scrape. Default is 10. |
output_format | String | The output format, either JSON or CSV. Default is JSON. |
filename | String | Name of the output file when output_format is set to CSV. If not passed, the keyword is used as the filename. |
directory | String | Directory in which to save the file when output_format is set to CSV. If not passed, the CSV file is saved in the current working directory. |
since_id | Integer | Search after (not inclusive) the specified Snowflake ID. |
max_id | Integer | Search at or before (inclusive) the specified Snowflake ID. |
within_time | String | Search within the last number of days, hours, minutes, or seconds. Examples: 2d, 3h, 5m, 30s. |
headless | Boolean | Whether to run the crawler headlessly. Default is True. |
browser_profile | String | Path to the browser profile where cookies are stored, which can be used to scrape data in an authenticated way. |
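The `since_id` and `max_id` arguments take Snowflake IDs. In Twitter's Snowflake layout, the bits above the lowest 22 hold the number of milliseconds since Twitter's custom epoch (1288834974657 ms after the Unix epoch), so an ID can be converted to a timestamp and back. The helper names below are hypothetical, and how the package forwards these IDs to the search is not shown here.

```python
from datetime import datetime, timedelta, timezone

TWITTER_EPOCH_MS = 1288834974657  # Snowflake epoch, in ms since the Unix epoch
TIMESTAMP_SHIFT = 22              # the low 22 bits hold worker and sequence numbers
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def snowflake_to_datetime(snowflake_id):
    """Return the UTC creation time encoded in a Snowflake ID."""
    ms = (snowflake_id >> TIMESTAMP_SHIFT) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def datetime_to_snowflake(moment):
    """Return the smallest Snowflake ID for `moment` (a timezone-aware datetime)."""
    ms = (moment - UNIX_EPOCH) // timedelta(milliseconds=1)
    return (ms - TWITTER_EPOCH_MS) << TIMESTAMP_SHIFT


# Tweet ID taken from the sample output earlier in this document.
created = snowflake_to_datetime(1430938749840629773)
print(created.replace(microsecond=0).isoformat())  # 2021-08-26T17:02:38+00:00

# Lower-bound ID for everything posted from 30 August 2021 (UTC) onwards.
lower_bound = datetime_to_snowflake(datetime(2021, 8, 30, tzinfo=timezone.utc))
print(lower_bound)
```

The decoded time matches the `posted_time` shown for that tweet, which is a quick way to sanity-check an ID before using it as a bound.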
Keys of the output
Key | Type | Description |
---|---|---|
tweet_id | String | Post identifier (an integer cast to a string) |
username | String | Username of the profile |
name | String | Name of the profile |
profile_picture | String | Profile picture link |
replies | Integer | Number of replies to the tweet |
retweets | Integer | Number of retweets of the tweet |
likes | Integer | Number of likes of the tweet |
is_retweet | Boolean | Whether the tweet is a retweet |
posted_time | String | Time when the tweet was posted, in ISO 8601 format |
content | String | Content of the tweet as text |
hashtags | Array | Hashtags present in the tweet, if any |
mentions | Array | Mentions present in the tweet, if any |
images | Array | Image links, if the tweet contains images |
videos | Array | Video links, if the tweet contains videos |
tweet_url | String | URL of the tweet |
link | String | Link to an external website, if the tweet contains one |
To scrape topic tweets with a URL:
from twitter_scraper_selenium import scrap_topic
# Scrape 10 tweets from the Steam Deck topic on Twitter
data = scrap_topic(filename="steamdeck", url="https://twitter.com/i/topics/1415728297065861123",
                   browser="firefox", tweets_count=10)
The output and its keys are the same as for scrap_keyword.
scrap_topic() arguments:
Argument | Argument Type | Description |
---|---|---|
filename | str | Filename to write the output to. |
url | str | Topic URL. |
browser | str | Browser to use for scraping. Only Chrome and Firefox are supported. Default is Firefox. |
proxy | str | Proxy to use for scraping. For an authenticated proxy, the format is username:password@host:port. |
tweets_count | int | Number of posts to scrape. Default is 10. |
output_format | str | The output format, either JSON or CSV. Default is JSON. |
directory | str | Directory in which to save the output file. Default is the current working directory. |
browser_profile | str | Path to the browser profile where cookies are stored, which can be used to scrape data in an authenticated way. |
Using the scraper with a proxy (HTTP proxy)
Just pass the proxy argument to the function.
from twitter_scraper_selenium import scrap_keyword
scrap_keyword(keyword="#india", browser="firefox", tweets_count=10, output_format="csv", filename="india",
              proxy="66.115.38.247:5678")  # In IP:PORT format
Proxy that requires authentication:
from twitter_scraper_selenium import scrap_profile
microsoft_data = scrap_profile(twitter_username="microsoft", browser="chrome", tweets_count=10, output_format="json",
                               proxy="sajid:pass123@66.115.38.247:5678")  # username:password@IP:PORT
print(microsoft_data)
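The two proxy formats shown above (IP:PORT and username:password@IP:PORT) can be assembled from separate values. This is a sketch with a hypothetical `build_proxy` helper; it only formats the string and does not check that the proxy is reachable.

```python
def build_proxy(host, port, username=None, password=None):
    """Format a proxy as host:port or username:password@host:port."""
    if not 0 < port < 65536:
        raise ValueError(f"invalid port: {port}")
    address = f"{host}:{port}"
    if username is None and password is None:
        return address
    if username is None or password is None:
        raise ValueError("username and password must be given together")
    return f"{username}:{password}@{address}"


print(build_proxy("66.115.38.247", 5678))
print(build_proxy("66.115.38.247", 5678, username="sajid", password="pass123"))
```

The returned string is what would be passed as the `proxy` argument.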
Privacy
This scraper only scrapes public data available to unauthenticated users and is not capable of scraping anything private.
LICENSE
MIT