Twitter scraper selenium
A Python package to scrape Twitter's front-end easily with Selenium.
Table of Contents
Prerequisites
Installation
Installing from the source
Download the source code or clone it with:
git clone https://github.com/shaikhsajid1111/twitter-scraper-selenium
Open a terminal inside the downloaded folder and run:
python3 setup.py install
Installing with PyPI
pip3 install twitter-scraper-selenium
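To check that the installation succeeded, the installed version can be queried from Python. The following is a minimal sketch that relies only on the standard library's importlib.metadata (Python 3.8+); the helper name installed_version is hypothetical and is not part of this package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution name as used in the pip3 command above.
    print(installed_version("twitter-scraper-selenium"))
```

A result of None suggests that the package was installed into a different Python environment than the one running the check.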
Usage
Available Functions in This Package - Summary
Function Name | Function Description | Scraping Method | Scraping Speed |
scrape_profile() | Scrapes a Twitter user's profile tweets. | Browser Automation | Slow |
get_profile_details() | Scrapes a Twitter user's details. | HTTP Request | Fast |
scrape_profile_with_api() | Scrapes tweets by Twitter profile username. It expects the username of the profile. | Browser Automation & HTTP Request | Fast |
Note: The HTTP Request method sends requests directly to Twitter's API to scrape data, while Browser Automation visits the page and scrolls while collecting the data.
To scrape twitter profile details:
from twitter_scraper_selenium import get_profile_details
twitter_username = "TwitterAPI"
filename = "twitter_api_data"
browser = "firefox"
headless = True
get_profile_details(twitter_username=twitter_username, filename=filename, browser=browser, headless=headless)
Output:
{
"id": 6253282,
"id_str": "6253282",
"name": "Twitter API",
"screen_name": "TwitterAPI",
"location": "San Francisco, CA",
"profile_location": null,
"description": "The Real Twitter API. Tweets about API changes, service issues and our Developer Platform. Don't get an answer? It's on my website.",
"url": "https:\/\/t.co\/8IkCzCDr19",
"entities": {
"url": {
"urls": [{
"url": "https:\/\/t.co\/8IkCzCDr19",
"expanded_url": "https:\/\/developer.twitter.com",
"display_url": "developer.twitter.com",
"indices": [
0,
23
]
}]
},
"description": {
"urls": []
}
},
"protected": false,
"followers_count": 6133636,
"friends_count": 12,
"listed_count": 12936,
"created_at": "Wed May 23 06:01:13 +0000 2007",
"favourites_count": 31,
"utc_offset": null,
"time_zone": null,
"geo_enabled": null,
"verified": true,
"statuses_count": 3656,
"lang": null,
"contributors_enabled": null,
"is_translator": null,
"is_translation_enabled": null,
"profile_background_color": null,
"profile_background_image_url": null,
"profile_background_image_url_https": null,
"profile_background_tile": null,
"profile_image_url": null,
"profile_image_url_https": "https:\/\/pbs.twimg.com\/profile_images\/942858479592554497\/BbazLO9L_normal.jpg",
"profile_banner_url": null,
"profile_link_color": null,
"profile_sidebar_border_color": null,
"profile_sidebar_fill_color": null,
"profile_text_color": null,
"profile_use_background_image": null,
"has_extended_profile": null,
"default_profile": false,
"default_profile_image": false,
"following": null,
"follow_request_sent": null,
"notifications": null,
"translator_type": null
}
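The output above is plain JSON, so it can be post-processed with the standard library. The sketch below is illustrative only: it uses a trimmed copy of a few keys from the output above rather than a real saved file, and the helper name summarize_profile is hypothetical. It assumes created_at always follows the format shown in the sample, and that English day and month names are in effect for parsing.

```python
import json
from datetime import datetime

# Trimmed sample with a few keys copied from the output above.
sample = """
{
  "screen_name": "TwitterAPI",
  "followers_count": 6133636,
  "friends_count": 12,
  "created_at": "Wed May 23 06:01:13 +0000 2007",
  "verified": true
}
"""


def summarize_profile(raw_json: str) -> dict:
    """Pick a few fields and convert created_at to a timezone-aware datetime."""
    data = json.loads(raw_json)
    created = datetime.strptime(data["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return {
        "screen_name": data["screen_name"],
        "followers": data["followers_count"],
        "following": data["friends_count"],
        "created": created,
        "verified": data["verified"],
    }


summary = summarize_profile(sample)
print(summary["screen_name"], summary["created"].isoformat())
```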
get_profile_details() arguments:
Argument | Argument Type | Description |
twitter_username | String | Twitter username |
output_filename | String | Name of the file where the output is stored. |
output_dir | String | Directory where the output file is saved. |
proxy | String | Optional parameter for scraping through a proxy. For an authenticated proxy, the format is username:password@host:port. |
Keys of the output:
Details of each key can be found here.
To scrape a profile's tweets:
In JSON format:
from twitter_scraper_selenium import scrape_profile
microsoft = scrape_profile(twitter_username="microsoft", output_format="json", browser="firefox", tweets_count=10)
print(microsoft)
Output:
{
"1430938749840629773": {
"tweet_id": "1430938749840629773",
"username": "Microsoft",
"name": "Microsoft",
"profile_picture": "https://twitter.com/Microsoft/photo",
"replies": 29,
"retweets": 58,
"likes": 453,
"is_retweet": false,
"retweet_link": "",
"posted_time": "2021-08-26T17:02:38+00:00",
"content": "Easy to use and efficient for all \u2013 Windows 11 is committed to an accessible future.\n\nHere's how it empowers everyone to create, connect, and achieve more: https://msft.it/6009X6tbW ",
"hashtags": [],
"mentions": [],
"images": [],
"videos": [],
"tweet_url": "https://twitter.com/Microsoft/status/1430938749840629773",
"link": "https://blogs.windows.com/windowsexperience/2021/07/01/whats-coming-in-windows-11-accessibility/?ocid=FY22_soc_omc_br_tw_Windows_AC"
},...
}
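The output above is an object keyed by tweet ID, which makes it easy to rank or filter tweets after scraping. The following sketch assumes the value returned by scrape_profile() with output_format="json" is a JSON string (if it is already a dictionary, the json.loads step can be skipped). The helper name rank_by_engagement is hypothetical, and the second sample entry is invented purely for illustration.

```python
import json
from datetime import datetime

# Sample shaped like the output above; the entry with ID "1" is made up.
raw = json.dumps({
    "1430938749840629773": {
        "tweet_id": "1430938749840629773",
        "replies": 29, "retweets": 58, "likes": 453,
        "posted_time": "2021-08-26T17:02:38+00:00",
    },
    "1": {
        "tweet_id": "1",
        "replies": 1, "retweets": 2, "likes": 3,
        "posted_time": "2021-08-25T09:00:00+00:00",
    },
})


def rank_by_engagement(raw_json: str) -> list:
    """Return tweets sorted by replies + retweets + likes, highest first."""
    tweets = json.loads(raw_json).values()
    return sorted(
        tweets,
        key=lambda t: t["replies"] + t["retweets"] + t["likes"],
        reverse=True,
    )


ranked = rank_by_engagement(raw)
# posted_time is ISO 8601, so datetime.fromisoformat can compare timestamps.
newest = max(ranked, key=lambda t: datetime.fromisoformat(t["posted_time"]))
print(ranked[0]["tweet_id"], newest["tweet_id"])
```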
In CSV format:
from twitter_scraper_selenium import scrape_profile
scrape_profile(twitter_username="microsoft", output_format="csv", browser="firefox", tweets_count=10, filename="microsoft", directory="/home/user/Downloads")
Output:
tweet_id | username | name | profile_picture | replies | retweets | likes | is_retweet | retweet_link | posted_time | content | hashtags | mentions | images | videos | tweet_url | link |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1430938749840629773 | Microsoft | Microsoft | https://twitter.com/Microsoft/photo | 64 | 75 | 521 | False | | 2021-08-26T17:02:38+00:00 | Easy to use and efficient for all – Windows 11 is committed to an accessible future. Here's how it empowers everyone to create, connect, and achieve more: https://msft.it/6009X6tbW | [] | [] | [] | [] | https://twitter.com/Microsoft/status/1430938749840629773 | https://blogs.windows.com/windowsexperience/2021/07/01/whats-coming-in-windows-11-accessibility/?ocid=FY22_soc_omc_br_tw_Windows_AC |
...
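Once saved, the CSV can be read back with the standard csv module. This is a sketch under stated assumptions: the file is taken to be an ordinary comma-delimited CSV with a header row, and list-like columns such as hashtags are taken to be stored as Python-style list text (for example []), as the table above suggests. The helper name load_tweets is hypothetical, and an in-memory sample with a subset of the columns stands in for the real file.

```python
import ast
import csv
import io

# In-memory stand-in for the saved file, with a subset of the columns above.
csv_text = (
    "tweet_id,username,likes,is_retweet,hashtags\n"
    "1430938749840629773,Microsoft,521,False,[]\n"
)


def load_tweets(handle) -> list:
    """Read tweet rows and convert selected text cells to Python values."""
    rows = []
    for row in csv.DictReader(handle):
        row["likes"] = int(row["likes"])
        row["is_retweet"] = row["is_retweet"] == "True"
        # List-like cells such as "[]" are parsed safely with literal_eval.
        row["hashtags"] = ast.literal_eval(row["hashtags"])
        rows.append(row)
    return rows


tweets = load_tweets(io.StringIO(csv_text))
print(tweets[0]["tweet_id"], tweets[0]["likes"], tweets[0]["hashtags"])
```

For a real file, the same function can be given a handle opened with open(path, newline="", encoding="utf-8"), where path points to the CSV written by scrape_profile().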
scrape_profile() arguments:
Argument | Argument Type | Description |
twitter_username | String | Twitter username of the account. |
browser | String | Browser to use for scraping. Only Chrome and Firefox are supported. Default is Firefox. |
proxy | String | Optional parameter for scraping through a proxy. For an authenticated proxy, the format is username:password@host:port. |
tweets_count | Integer | Number of posts to scrape. Default is 10. |
output_format | String | The output format, either JSON or CSV. Default is JSON. |
filename | String | Filename for the output when output_format is set to CSV. If not passed, the filename will be the same as the username passed. |
directory | String | Directory for the output when output_format is set to CSV. If not passed, the CSV file will be saved in the current working directory. |
headless | Boolean | Whether to run the crawler headlessly. Default is True. |
Keys of the output
Key | Type | Description |
tweet_id | String | Post identifier (integer cast to a string) |
username | String | Username of the profile |
name | String | Name of the profile |
profile_picture | String | Profile picture link |
replies | Integer | Number of replies to the tweet |
retweets | Integer | Number of retweets of the tweet |
likes | Integer | Number of likes of the tweet |
is_retweet | Boolean | Is the tweet a retweet? |
retweet_link | String | The retweet link if the tweet is a retweet, otherwise an empty string |
posted_time | String | Time when the tweet was posted, in ISO 8601 format |
content | String | Content of the tweet as text |
hashtags | Array | Hashtags present in the tweet, if any |
mentions | Array | Mentions present in the tweet, if any |
images | Array | Image links, if any are present in the tweet |
videos | Array | Video links, if any are present in the tweet |
tweet_url | String | URL of the tweet |
link | String | Link to an external website, if one is present in the tweet |
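The key table above can double as a lightweight schema check before tweets are stored or analysed. The sketch below maps each key to the Python type implied by the table (Array is read as list); the names EXPECTED_TYPES and find_problems are hypothetical and not part of the package.

```python
from typing import Any, Dict, List

# Expected Python type per key, following the table above.
EXPECTED_TYPES: Dict[str, type] = {
    "tweet_id": str, "username": str, "name": str, "profile_picture": str,
    "replies": int, "retweets": int, "likes": int, "is_retweet": bool,
    "retweet_link": str, "posted_time": str, "content": str,
    "hashtags": list, "mentions": list, "images": list, "videos": list,
    "tweet_url": str, "link": str,
}


def find_problems(tweet: Dict[str, Any]) -> List[str]:
    """Return a list of missing keys or type mismatches for one tweet."""
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in tweet:
            problems.append(f"missing key: {key}")
        elif not isinstance(tweet[key], expected):
            problems.append(f"{key}: expected {expected.__name__}")
    return problems
```

An empty list means the tweet has every documented key with a matching type. Because bool is a subclass of int in Python, a Boolean in a count field would not be flagged by this simple check.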
To scrape a profile's tweets with API:
from twitter_scraper_selenium import scrape_profile_with_api
scrape_profile_with_api('elonmusk', output_filename='musk', tweets_count=100)
scrape_profile_with_api() arguments:
Argument | Argument Type | Description |
username | String | Username of the Twitter profile. |
tweets_count | Integer | Number of tweets to scrape. |
output_filename | String | Name of the file where the output is stored. |
output_dir | String | Directory where the output file is saved. |
proxy | String | Optional parameter for scraping through a proxy. For an authenticated proxy, the format is username:password@host:port. |
browser | String | Browser to use for extracting the GraphQL key. Default is firefox. |
headless | Boolean | Whether to run the browser in headless mode. |
Output:
{
"1608939190548598784": {
"tweet_url" : "https://twitter.com/elonmusk/status/1608939190548598784",
"tweet_details":{
...
},
"user_details":{
...
}
}, ...
}
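Since the output above is keyed by tweet ID, the tweet URLs can be collected without touching the nested details. This is a minimal sketch: the helper name tweet_urls is hypothetical, and the sample keeps tweet_details and user_details empty because their contents are elided in the output above.

```python
import json

# Same shape as the output above; the nested details are left empty here.
raw = """
{
  "1608939190548598784": {
    "tweet_url": "https://twitter.com/elonmusk/status/1608939190548598784",
    "tweet_details": {},
    "user_details": {}
  }
}
"""


def tweet_urls(raw_json: str) -> dict:
    """Map each tweet ID to its tweet_url."""
    return {tid: entry["tweet_url"] for tid, entry in json.loads(raw_json).items()}


for tweet_id, url in tweet_urls(raw).items():
    print(tweet_id, url)
```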
Using the scraper with a proxy (HTTP proxy)
Pass the proxy argument to the function.
from twitter_scraper_selenium import scrape_profile
scrape_profile("elonmusk", headless=False, proxy="66.115.38.247:5678", output_format="csv", filename="musk")  # proxy in IP:PORT format
Proxy that requires authentication:
from twitter_scraper_selenium import scrape_profile
microsoft_data = scrape_profile(twitter_username="microsoft", browser="chrome", tweets_count=10, output_format="json",
                                proxy="sajid:pass123@66.115.38.247:5678")  # username:password@IP:PORT
print(microsoft_data)
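A malformed proxy string is easy to pass by accident, so it can help to validate it before starting a scrape. The sketch below checks the two formats used above, IP:PORT and username:password@IP:PORT. The names ProxyParts and parse_proxy are hypothetical helpers for illustration; they are not part of the package and say nothing about how the package itself parses the argument.

```python
from typing import NamedTuple, Optional


class ProxyParts(NamedTuple):
    username: Optional[str]
    password: Optional[str]
    host: str
    port: int


def parse_proxy(proxy: str) -> ProxyParts:
    """Split 'host:port' or 'username:password@host:port' into parts."""
    username = password = None
    address = proxy
    if "@" in proxy:
        # rpartition keeps any '@' inside the password on the credentials side.
        credentials, _, address = proxy.rpartition("@")
        username, sep, password = credentials.partition(":")
        if not sep or not username:
            raise ValueError("credentials must look like username:password")
    host, sep, port = address.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError("address must look like host:port")
    return ProxyParts(username, password, host, int(port))


print(parse_proxy("66.115.38.247:5678"))
print(parse_proxy("sajid:pass123@66.115.38.247:5678"))
```

If parse_proxy raises ValueError, the string does not match either documented format and is worth correcting before it is passed as the proxy argument.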
Privacy
This scraper only scrapes public data available to unauthenticated users and cannot scrape anything private.
LICENSE
MIT