Twitter scraper selenium

A Python package to scrape Twitter's front-end easily with Selenium.

Table of Contents

Getting Started
- Prerequisites
- Installation
  - Installing from source
  - Installing with PyPI
Usage
- Available Functions in this Package - Summary
- Scraping profile's details
Privacy
License

Prerequisites

Internet Connection

Python 3.6+

Chrome or Firefox browser installed on your machine

Installation

Installing from the source

Download the source code or clone it with:

git clone https://github.com/shaikhsajid1111/twitter-scraper-selenium

Open a terminal inside the downloaded folder and run:

python3 setup.py install

Installing with PyPI

pip3 install twitter-scraper-selenium

Usage

Available Functions in this Package - Summary

Function Name	Function Description	Scraping Method	Scraping Speed
`scrape_profile()`	Scrapes a Twitter user's profile tweets.	Browser Automation	Slow
`scrape_keyword()`	Scrapes tweets matching the provided keyword.	Browser Automation	Slow
`scrape_topic()`	Scrapes tweets by URL. It expects the URL of the topic.	Browser Automation	Slow
`scrape_keyword_with_api()`	Scrapes tweets by query/keywords. For an advanced search, the query can be built from here.	HTTP Request	Fast
`get_profile_details()`	Scrapes a Twitter user's details.	HTTP Request	Fast
`scrape_topic_with_api()`	Scrapes tweets by URL. It expects the URL of the topic.	Browser Automation & HTTP Request	Fast
`scrape_profile_with_api()`	Scrapes tweets by Twitter profile username. It expects the username of the profile.	Browser Automation & HTTP Request	Fast

Note: The HTTP Request method sends requests directly to Twitter's API, while the Browser Automation method visits the page and scrolls while collecting the data.

To scrape Twitter profile details:

from twitter_scraper_selenium import get_profile_details

twitter_username = "TwitterAPI"
filename = "twitter_api_data"
get_profile_details(twitter_username=twitter_username, filename=filename)

Output:

{
	"id": 6253282,
	"id_str": "6253282",
	"name": "Twitter API",
	"screen_name": "TwitterAPI",
	"location": "San Francisco, CA",
	"profile_location": null,
	"description": "The Real Twitter API. Tweets about API changes, service issues and our Developer Platform. Don't get an answer? It's on my website.",
	"url": "https:\/\/t.co\/8IkCzCDr19",
	"entities": {
		"url": {
			"urls": [{
				"url": "https:\/\/t.co\/8IkCzCDr19",
				"expanded_url": "https:\/\/developer.twitter.com",
				"display_url": "developer.twitter.com",
				"indices": [
					0,
					23
				]
			}]
		},
		"description": {
			"urls": []
		}
	},
	"protected": false,
	"followers_count": 6133636,
	"friends_count": 12,
	"listed_count": 12936,
	"created_at": "Wed May 23 06:01:13 +0000 2007",
	"favourites_count": 31,
	"utc_offset": null,
	"time_zone": null,
	"geo_enabled": null,
	"verified": true,
	"statuses_count": 3656,
	"lang": null,
	"contributors_enabled": null,
	"is_translator": null,
	"is_translation_enabled": null,
	"profile_background_color": null,
	"profile_background_image_url": null,
	"profile_background_image_url_https": null,
	"profile_background_tile": null,
	"profile_image_url": null,
	"profile_image_url_https": "https:\/\/pbs.twimg.com\/profile_images\/942858479592554497\/BbazLO9L_normal.jpg",
	"profile_banner_url": null,
	"profile_link_color": null,
	"profile_sidebar_border_color": null,
	"profile_sidebar_fill_color": null,
	"profile_text_color": null,
	"profile_use_background_image": null,
	"has_extended_profile": null,
	"default_profile": false,
	"default_profile_image": false,
	"following": null,
	"follow_request_sent": null,
	"notifications": null,
	"translator_type": null
}

get_profile_details() arguments:

Argument	Argument Type	Description
twitter_username	String	Twitter username.
output_filename	String	Name of the file in which the output is stored.
output_dir	String	Directory in which the output file is saved.
proxy	String	Optional. Proxy to use for scraping. For an authenticated proxy, the format is username:password@host:port.

Keys of the output:
Details of each key can be found here.
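The profile dictionary shown above can be reduced to a few fields of interest. The following is a minimal sketch, not part of the package: `summarize_profile` is a hypothetical helper, and it assumes you already have the output loaded as a Python dictionary with the keys shown in the sample.

```python
from datetime import datetime


def summarize_profile(profile: dict) -> dict:
    """Pick a few fields from a profile dictionary shaped like the sample output."""
    # created_at uses the format "Wed May 23 06:01:13 +0000 2007"
    created = datetime.strptime(profile["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return {
        "screen_name": profile["screen_name"],
        "followers": profile["followers_count"],
        "following": profile["friends_count"],
        "created": created.date().isoformat(),
    }


# Sample values copied from the output above
sample = {
    "screen_name": "TwitterAPI",
    "followers_count": 6133636,
    "friends_count": 12,
    "created_at": "Wed May 23 06:01:13 +0000 2007",
}
summary = summarize_profile(sample)
print(summary)
```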

To scrape profile's tweets:

In JSON format:

from twitter_scraper_selenium import scrape_profile

microsoft = scrape_profile(twitter_username="microsoft", output_format="json", browser="firefox", tweets_count=10)
print(microsoft)

Output:

{
  "1430938749840629773": {
    "tweet_id": "1430938749840629773",
    "username": "Microsoft",
    "name": "Microsoft",
    "profile_picture": "https://twitter.com/Microsoft/photo",
    "replies": 29,
    "retweets": 58,
    "likes": 453,
    "is_retweet": false,
    "retweet_link": "",
    "posted_time": "2021-08-26T17:02:38+00:00",
    "content": "Easy to use and efficient for all \u2013 Windows 11 is committed to an accessible future.\n\nHere's how it empowers everyone to create, connect, and achieve more: https://msft.it/6009X6tbW ",
    "hashtags": [],
    "mentions": [],
    "images": [],
    "videos": [],
    "tweet_url": "https://twitter.com/Microsoft/status/1430938749840629773",
    "link": "https://blogs.windows.com/windowsexperience/2021/07/01/whats-coming-in-windows-11-accessibility/?ocid=FY22_soc_omc_br_tw_Windows_AC"
  },...
}

In CSV format:

from twitter_scraper_selenium import scrape_profile


scrape_profile(twitter_username="microsoft", output_format="csv", browser="firefox", tweets_count=10, filename="microsoft", directory="/home/user/Downloads")

Output:

tweet_id	username	name	profile_picture	replies	retweets	likes	is_retweet	retweet_link	posted_time	content	hashtags	mentions	images	videos	post_url	link
1430938749840629773	Microsoft	Microsoft	https://twitter.com/Microsoft/photo	64	75	521	False		2021-08-26T17:02:38+00:00	Easy to use and efficient for all – Windows 11 is committed to an accessible future. Here's how it empowers everyone to create, connect, and achieve more: https://msft.it/6009X6tbW	[]	[]	[]	[]	https://twitter.com/Microsoft/status/1430938749840629773	https://blogs.windows.com/windowsexperience/2021/07/01/whats-coming-in-windows-11-accessibility/?ocid=FY22_soc_omc_br_tw_Windows_AC

...
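Once the CSV file is written, it can be read back with the standard library. This is a sketch under assumptions: it assumes the file is comma-delimited, that list columns such as `hashtags` are stored as Python-style literals like `[]`, and that `is_retweet` is written as `True` or `False`, as the sample row suggests. `load_tweets_csv` is a hypothetical helper, and the in-memory sample stands in for the real file.

```python
import ast
import csv
import io

INT_COLUMNS = ("replies", "retweets", "likes")
LIST_COLUMNS = ("hashtags", "mentions", "images", "videos")


def load_tweets_csv(text_stream):
    """Read tweet rows and convert numeric, list and boolean columns."""
    rows = []
    for row in csv.DictReader(text_stream):
        for col in INT_COLUMNS:
            row[col] = int(row[col])
        for col in LIST_COLUMNS:
            # Assumption: lists are serialized as literals such as "[]"
            row[col] = ast.literal_eval(row[col])
        row["is_retweet"] = row["is_retweet"] == "True"
        rows.append(row)
    return rows


# A shortened stand-in for the saved file (only some columns are included)
sample = io.StringIO(
    "tweet_id,replies,retweets,likes,is_retweet,hashtags,mentions,images,videos\n"
    "1430938749840629773,64,75,521,False,[],[],[],[]\n"
)
tweets = load_tweets_csv(sample)
print(tweets[0]["likes"], tweets[0]["is_retweet"], tweets[0]["hashtags"])
```

For a real file, pass `open("microsoft.csv", newline="", encoding="utf-8")` instead of the in-memory sample, adjusting the name to whatever was passed as `filename`.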

scrape_profile() arguments:

Argument	Argument Type	Description
twitter_username	String	Twitter username of the account.
browser	String	Browser to use for scraping. Only Chrome and Firefox are supported. Default is Firefox.
proxy	String	Optional. Proxy to use for scraping. For an authenticated proxy, the format is username:password@host:port.
tweets_count	Integer	Number of posts to scrape. Default is 10.
output_format	String	The output format, either JSON or CSV. Default is JSON.
filename	String	Name of the output file when output_format is set to CSV. If not passed, the filename will be the same as the username passed.
directory	String	Directory for the output file when output_format is set to CSV. If not passed, the CSV file is saved in the current working directory.
headless	Boolean	Whether to run the crawler headlessly. Default is `True`.

Keys of the output

Key	Type	Description
tweet_id	String	Post identifier (integer cast to a string)
username	String	Username of the profile
name	String	Name of the profile
profile_picture	String	Profile picture link
replies	Integer	Number of replies to the tweet
retweets	Integer	Number of retweets of the tweet
likes	Integer	Number of likes of the tweet
is_retweet	Boolean	Whether the tweet is a retweet
retweet_link	String	The retweet link if the tweet is a retweet, otherwise an empty string
posted_time	String	Time when the tweet was posted, in ISO 8601 format
content	String	Content of the tweet as text
hashtags	Array	Hashtags present in the tweet, if any
mentions	Array	Mentions present in the tweet, if any
images	Array	Image links, if any are present in the tweet
videos	Array	Video links, if any are present in the tweet
tweet_url	String	URL of the tweet
link	String	Link to an external website, if one is present in the tweet
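With the keys above, the JSON result can be post-processed in a few lines. The sketch below ranks tweets by likes. It assumes the result maps tweet IDs to tweet dictionaries, as in the sample output, and it accepts either a JSON string or an already-parsed dictionary because this document does not state which one `scrape_profile()` returns. `top_tweets` is a hypothetical helper.

```python
import json


def top_tweets(result, n=3):
    """Return (tweet_url, likes) pairs for the n most-liked tweets."""
    # Accept a JSON string or a dict keyed by tweet ID
    data = json.loads(result) if isinstance(result, str) else result
    ranked = sorted(data.values(), key=lambda tweet: tweet["likes"], reverse=True)
    return [(tweet["tweet_url"], tweet["likes"]) for tweet in ranked[:n]]


# Two made-up entries in the shape of the sample output (the second is illustrative)
sample = json.dumps({
    "1430938749840629773": {
        "tweet_url": "https://twitter.com/Microsoft/status/1430938749840629773",
        "likes": 453,
    },
    "1": {"tweet_url": "https://twitter.com/example/status/1", "likes": 10},
})
best = top_tweets(sample, n=1)
print(best)
```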

To scrape tweets using keywords with API:

from twitter_scraper_selenium import scrape_keyword_with_api

query = "#gaming"
tweets_count = 10
output_filename = "gaming_hashtag_data"
scrape_keyword_with_api(query=query, tweets_count=tweets_count, output_filename=output_filename)

Output:

{
  "1583821467732480001": {
    "tweet_url" : "https://twitter.com/yakubblackbeard/status/1583821467732480001",
    "tweet_details":{
      ...
    },
    "user_details":{
      ...
    }
  }, ...
}

scrape_keyword_with_api() arguments:

Argument	Argument Type	Description
query	String	Query to search. For an advanced search, the query can be built from here.
tweets_count	Integer	Number of tweets to scrape.
output_filename	String	Name of the file in which the output is stored.
output_dir	String	Directory in which the output file is saved.
proxy	String	Optional. Proxy to use for scraping. For an authenticated proxy, the format is username:password@host:port.
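The `query` string can carry Twitter's standard search operators such as `from:`, `since:` and `until:`. The sketch below only assembles such a string. `build_query` is a hypothetical helper, and whether `scrape_keyword_with_api()` honors each operator is an assumption that this document does not confirm.

```python
def build_query(terms, from_user=None, since=None, until=None):
    """Join search terms and optional operators into one query string."""
    parts = list(terms)
    if from_user:
        parts.append(f"from:{from_user}")
    if since:
        parts.append(f"since:{since}")  # expected as YYYY-MM-DD
    if until:
        parts.append(f"until:{until}")  # expected as YYYY-MM-DD
    return " ".join(parts)


query = build_query(["#gaming"], since="2022-10-01", until="2022-10-31")
print(query)
```

The resulting string would then be passed as the `query` argument.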

Keys of the output:

Key	Type	Description
tweet_url	String	URL of the tweet.
tweet_details	Dictionary	A dictionary containing the data about the tweet. The available fields can be checked here.
user_details	Dictionary	A dictionary containing the data about the tweet owner. The available fields can be checked here.

To scrape tweets using keywords with browser automation:

In JSON format:

from twitter_scraper_selenium import scrape_keyword
# Scrape 10 tweets matching the keyword "india", posted between 30 August and 31 August 2021
india = scrape_keyword(keyword="india", browser="firefox",
                       tweets_count=10, output_format="json", until="2021-08-31", since="2021-08-30")
print(india)

Output:

{
  "1432493306152243200": {
    "tweet_id": "1432493306152243200",
    "username": "TOICitiesNews",
    "name": "TOI Cities",
    "profile_picture": "https://twitter.com/TOICitiesNews/photo",
    "replies": 0,
    "retweets": 0,
    "likes": 0,
    "is_retweet": false,
    "posted_time": "2021-08-30T23:59:53+00:00",
    "content": "Paralympians rake in medals, India Inc showers them with rewards",
    "hashtags": [],
    "mentions": [],
    "images": [],
    "videos": [],
    "tweet_url": "https://twitter.com/TOICitiesNews/status/1432493306152243200",
    "link": "https://t.co/odmappLovL?amp=1"
  },...
}

In CSV format:

from twitter_scraper_selenium import scrape_keyword

scrape_keyword(keyword="india", browser="firefox",
               tweets_count=10, until="2021-08-31", since="2021-08-30", output_format="csv", filename="india")

Output:

tweet_id	username	name	profile_picture	replies	retweets	likes	is_retweet	posted_time	content	hashtags	mentions	images	videos	tweet_url	link
1432493306152243200	TOICitiesNews	TOI Cities	https://twitter.com/TOICitiesNews/photo	0	0	0	False	2021-08-30T23:59:53+00:00	Paralympians rake in medals, India Inc showers them with rewards	[]	[]	[]	[]	https://twitter.com/TOICitiesNews/status/1432493306152243200	https://t.co/odmappLovL?amp=1

...

scrape_keyword() arguments:

Argument	Argument Type	Description
keyword	String	Keyword to search on Twitter.
browser	String	Browser to use for scraping. Only Chrome and Firefox are supported. Default is Firefox.
until	String	Optional. End date of the search range. Format is YYYY-MM-DD.
since	String	Optional. Start date of the search range (a past date to search from). Format is YYYY-MM-DD.
proxy	String	Optional. Proxy to use for scraping. For an authenticated proxy, the format is username:password@host:port.
tweets_count	Integer	Number of posts to scrape. Default is 10.
output_format	String	The output format, either JSON or CSV. Default is JSON.
filename	String	Name of the output file when output_format is set to CSV. If not passed, the filename will be the same as the keyword passed.
directory	String	Directory for the output file when output_format is set to CSV. If not passed, the CSV file is saved in the current working directory.
since_id	Integer	After (NOT inclusive) a specified Snowflake ID. Example here
max_id	Integer	At or before (inclusive) a specified Snowflake ID. Example here
within_time	String	Search within the last number of days, hours, minutes, or seconds. Example: `2d`, `3h`, `5m`, `30s`.
headless	Boolean	Whether to run the crawler headlessly. Default is `True`.
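The `since_id` and `max_id` arguments take Snowflake IDs, which embed a timestamp: the bits above the lowest 22 hold milliseconds since Twitter's epoch (1288834974657 in Unix milliseconds). The sketch below converts between IDs and times so that a date boundary can be expressed as an ID. Both function names are hypothetical helpers, not part of the package.

```python
from datetime import datetime, timedelta, timezone

TWITTER_EPOCH_MS = 1288834974657  # Twitter's Snowflake epoch in Unix milliseconds
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def snowflake_to_datetime(snowflake_id: int) -> datetime:
    """Extract the UTC creation time encoded in a Snowflake ID."""
    unix_ms = (snowflake_id >> 22) + TWITTER_EPOCH_MS
    return UNIX_EPOCH + timedelta(milliseconds=unix_ms)


def datetime_to_snowflake(moment: datetime) -> int:
    """Smallest Snowflake ID for a timezone-aware instant (low 22 bits zero)."""
    unix_ms = (moment - UNIX_EPOCH) // timedelta(milliseconds=1)
    return (unix_ms - TWITTER_EPOCH_MS) << 22


# The Microsoft tweet ID from the scrape_profile() sample output
tweet_time = snowflake_to_datetime(1430938749840629773)
print(tweet_time.isoformat(timespec="seconds"))  # 2021-08-26T17:02:38+00:00

# An ID usable as a boundary for tweets from 30 August 2021 onwards
boundary_id = datetime_to_snowflake(datetime(2021, 8, 30, tzinfo=timezone.utc))
print(boundary_id)
```

The decoded time agrees with the `posted_time` reported for that tweet in the sample output.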

Keys of the output

Key	Type	Description
tweet_id	String	Post identifier (integer cast to a string)
username	String	Username of the profile
name	String	Name of the profile
profile_picture	String	Profile picture link
replies	Integer	Number of replies to the tweet
retweets	Integer	Number of retweets of the tweet
likes	Integer	Number of likes of the tweet
is_retweet	Boolean	Whether the tweet is a retweet
posted_time	String	Time when the tweet was posted, in ISO 8601 format
content	String	Content of the tweet as text
hashtags	Array	Hashtags present in the tweet, if any
mentions	Array	Mentions present in the tweet, if any
images	Array	Image links, if any are present in the tweet
videos	Array	Video links, if any are present in the tweet
tweet_url	String	URL of the tweet
link	String	Link to an external website, if one is present in the tweet

To scrape topic tweets with URL using API:

from twitter_scraper_selenium import scrape_topic_with_api

topic_url = 'https://twitter.com/i/topics/1468157909318045697'
scrape_topic_with_api(URL=topic_url, output_filename='solana_cryptocurrency', tweets_count=50)

Output:

{
  "1584979408338632705": {
    "tweet_url" : "https://twitter.com/AptosBullCNFT/status/1584979408338632705",
    "tweet_details":{
      ...
    },
    "user_details":{
      ...
    }
  }, ...
}

scrape_topic_with_api() arguments:

Argument	Argument Type	Description
URL	String	Twitter's Topic URL.
tweets_count	Integer	Number of tweets to scrape.
output_filename	String	Name of the file in which the output is stored.
output_dir	String	Directory in which the output file is saved.
proxy	String	Optional. Proxy to use for scraping. For an authenticated proxy, the format is username:password@host:port.
browser	String	Browser to use for extracting the GraphQL key. Default is Firefox.
headless	Boolean	Whether to run the browser in headless mode.
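Since the function expects a topic URL, it can help to check the URL's shape before starting a browser session. The following is a minimal sketch that assumes topic URLs always follow the `/i/topics/<numeric id>` pattern seen in the examples in this document. `topic_id_from_url` is a hypothetical helper, not part of the package.

```python
from urllib.parse import urlparse


def topic_id_from_url(url: str) -> str:
    """Return the numeric topic ID, or raise ValueError for other URL shapes."""
    parts = urlparse(url).path.strip("/").split("/")
    # Assumed pattern: /i/topics/<digits>
    if len(parts) == 3 and parts[:2] == ["i", "topics"] and parts[2].isdigit():
        return parts[2]
    raise ValueError(f"not a topic URL: {url!r}")


topic_id = topic_id_from_url("https://twitter.com/i/topics/1468157909318045697")
print(topic_id)
```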

Keys of the output:

Same as scrape_keyword_with_api

To scrape topic tweets with URL using browser automation:

from twitter_scraper_selenium import scrape_topic
# Scrape 10 tweets from the Steam Deck topic on Twitter
data = scrape_topic(filename="steamdeck", url='https://twitter.com/i/topics/1415728297065861123',
                    browser="firefox", tweets_count=10)

Keys of the output:

Same as scrape_profile

scrape_topic() arguments:

Argument	Argument Type	Description
filename	str	Name of the file to write the output to.
URL	str	Topic URL.
browser	str	Browser to use for scraping. Only Chrome and Firefox are supported. Default is Firefox.
proxy	str	Optional. Proxy to use for scraping. For an authenticated proxy, the format is username:password@host:port.
tweets_count	int	Number of posts to scrape. Default is 10.
output_format	str	The output format, either JSON or CSV. Default is JSON.
directory	str	Directory to save the output file in. Default is the current working directory.

To scrape a profile's tweets with API:

from twitter_scraper_selenium import scrape_profile_with_api

scrape_profile_with_api('elonmusk', output_filename='musk', tweets_count=100)

scrape_profile_with_api() Arguments:

Argument	Argument Type	Description
username	String	Username of the Twitter profile.
tweets_count	Integer	Number of tweets to scrape.
output_filename	String	Name of the file in which the output is stored.
output_dir	String	Directory in which the output file is saved.
proxy	String	Optional. Proxy to use for scraping. For an authenticated proxy, the format is username:password@host:port.
browser	String	Browser to use for extracting the GraphQL key. Default is Firefox.
headless	Boolean	Whether to run the browser in headless mode.

Output:

{
  "1608939190548598784": {
    "tweet_url" : "https://twitter.com/elonmusk/status/1608939190548598784",
    "tweet_details":{
      ...
    },
    "user_details":{
      ...
    }
  }, ...
}

Using the scraper with a proxy (HTTP proxy)

Just pass the proxy argument to the function.

from twitter_scraper_selenium import scrape_keyword

scrape_keyword(keyword="#india", browser="firefox", tweets_count=10, output_format="csv", filename="india",
               proxy="66.115.38.247:5678")  # In IP:PORT format

Proxy that requires authentication:

from twitter_scraper_selenium import scrape_profile

microsoft_data = scrape_profile(twitter_username="microsoft", browser="chrome", tweets_count=10, output_format="json",
                                proxy="sajid:pass123@66.115.38.247:5678")  # username:password@IP:PORT
print(microsoft_data)
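Both proxy formats above can be produced from separate parts, which avoids hand-editing the string. This is a small sketch: `format_proxy` is a hypothetical helper, not part of the package, and it does not escape special characters in the credentials.

```python
def format_proxy(host, port, username=None, password=None):
    """Build 'host:port' or 'username:password@host:port'."""
    address = f"{host}:{port}"
    if username is None and password is None:
        return address
    if not username or password is None:
        raise ValueError("username and password must be given together")
    return f"{username}:{password}@{address}"


plain = format_proxy("66.115.38.247", 5678)
authenticated = format_proxy("66.115.38.247", 5678, username="sajid", password="pass123")
print(plain)
print(authenticated)
```

Either string can then be passed as the `proxy` argument.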

Privacy

This scraper only scrapes public data available to unauthenticated users and is not capable of scraping anything private.

LICENSE

MIT

Release history

Version	Release date
6.2.2	Sep 7, 2024
6.1.2	Feb 12, 2024
6.1.1	Oct 2, 2023
6.1.0	Oct 1, 2023
6.0.0	Oct 1, 2023
5.0.0 (this version)	Jun 4, 2023
4.1.5	Jun 4, 2023
4.1.4	Jan 14, 2023
4.1.3	Dec 31, 2022
4.1.2	Dec 31, 2022
4.0.2	Dec 3, 2022
4.0.1	Oct 29, 2022
4.0.0	Oct 26, 2022
3.2.4	Oct 26, 2022
3.2.3	Oct 23, 2022
3.1.3	Oct 22, 2022
3.0.3	Oct 9, 2022
3.0.2	Oct 9, 2022
3.0.1	Oct 5, 2022
3.0.0	Oct 2, 2022
2.0.0	Jul 9, 2022
0.1.7	Jun 13, 2022
0.1.6	May 22, 2022
0.1.5	Apr 14, 2022
0.1.4	Apr 13, 2022
0.1.3	Apr 3, 2022
0.1.2	Mar 18, 2022
0.1.1	Nov 1, 2021

Download files

Source Distribution

twitter_scraper_selenium-5.0.0.tar.gz (34.7 kB)

Uploaded Jun 4, 2023 Source

Built Distribution

twitter_scraper_selenium-5.0.0-py3-none-any.whl (33.3 kB)

Uploaded Jun 4, 2023 Python 3

File details

Details for the file twitter_scraper_selenium-5.0.0.tar.gz.

File metadata

Download URL: twitter_scraper_selenium-5.0.0.tar.gz
Upload date: Jun 4, 2023
Size: 34.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.1

File hashes

Hashes for twitter_scraper_selenium-5.0.0.tar.gz
Algorithm	Hash digest
SHA256	`99b69efd8edd956073db33edd5bf2a651e6e89f82c99edada4300d0821d07695`
MD5	`53624f7d762b2187fd417b198e949ce8`
BLAKE2b-256	`ee30f391ccfe52b741e0763c3e87f8e49dced32663ed1f0e5797de640eda1611`

File details

Details for the file twitter_scraper_selenium-5.0.0-py3-none-any.whl.

File metadata

Download URL: twitter_scraper_selenium-5.0.0-py3-none-any.whl
Upload date: Jun 4, 2023
Size: 33.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.1

File hashes

Hashes for twitter_scraper_selenium-5.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7bb2b7de70fd577111fb71428a2daf3cffd6ad85d7e2edf1b1021f710bce0837`
MD5	`787ae2cf387ef0753aed11a6d8b16b88`
BLAKE2b-256	`2ce5a9af9e99588d34bb09e4f10931eb1b0da8d8cbc0d1e8eec363b8fc4fdc04`
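
The published SHA256 digests can be used to check a downloaded file before installing it. The sketch below uses only the standard library. The function names are hypothetical, and the file path in the usage note is an assumption about where the download was saved.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size=65536):
    """Hash a binary stream in chunks and return the hex digest."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def file_matches_sha256(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    with open(path, "rb") as handle:
        return sha256_of_stream(handle) == expected_hex.strip().lower()


# Self-check on in-memory data with a well-known digest
demo_digest = sha256_of_stream(io.BytesIO(b"abc"))
print(demo_digest)
```

For the source distribution, this would be called as `file_matches_sha256("twitter_scraper_selenium-5.0.0.tar.gz", "99b69efd8edd956073db33edd5bf2a651e6e89f82c99edada4300d0821d07695")`.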