stweet
A modern, fast Python library to scrape tweets and users quickly from the unofficial Twitter API.
This tool helps you scrape tweets by search phrase, tweets by ids, and users by usernames. It uses the Twitter API, the same API the Twitter website uses.
Inspiration for the creation of the library
I used twint to scrape tweets, but it has many errors and does not work properly. Its code was not easy to understand. All tasks share one config, so the user has to know the exact parameters. The last important thing is that the API can change at any time: Twitter owns the API, and changes depend on it. It is annoying when something stops working and users must report bugs as issues.
Main advantages of the library
- Simple code: the code is not only mine; every user can contribute to the library
- Domain objects and interfaces: the main parts of the functionality can be replaced (e.g. making web requests); the library provides a basic default implementation, so you can extend it without any problems or forks
- 100% coverage with integration tests: these tests catch API changes; they run every week, and when a task fails we can easily find the source of the change
- Custom tweet and user output: output is part of the interface, so saving tweets and users in a custom format takes only a moment
Installation
pip install -U stweet
Basic usage
To make a simple request, the scrape task must be prepared. Next, the task should be processed by a runner.
import stweet as st

search_tweets_task = st.SearchTweetsTask(
    all_words='#covid19'
)
tweets_collector = st.CollectorTweetOutput()
st.TweetSearchRunner(
    search_tweets_task=search_tweets_task,
    tweet_outputs=[tweets_collector, st.CsvTweetOutput('output_file.csv')]
).run()
tweets = tweets_collector.get_scrapped_tweets()
This simple code snippet fetches all tweets with the hashtag #covid19. The result in the tweets object is the list of scraped tweets.
The example above shows how to scrape tweets by search phrase. Stweet also supports scraping tweets by id:
import stweet as st

tweets_by_ids_task = st.TweetsByIdsTask(['1336002732717727752', '1338916735479496704'])
tweets_collector = st.CollectorTweetOutput()
st.TweetsByIdsRunner(
    tweets_by_ids_task=tweets_by_ids_task,
    tweet_outputs=[tweets_collector, st.CsvTweetOutput('output_file.csv')]
).run()
tweets = tweets_collector.get_scrapped_tweets()
Stweet also allows scraping user information by screen name:
import stweet as st

get_users_task = st.GetUsersTask(['donaldtuskEPP', 'JoeBiden', 'realDonaldTrump'])
users_collector = st.CollectorUserOutput()
st.GetUsersRunner(
    get_user_task=get_users_task,
    user_outputs=[users_collector]
).run()
users = users_collector.get_scrapped_users()
Stweet has a default WebClient implementation based on the requests library, st.RequestsWebClient. All of its properties have default values, but by changing them the user can set a proxy or disable SSL verification. This snippet shows how to use it:
import stweet as st

search_tweets_task = st.SearchTweetsTask(
    all_words='#covid19',
)
tweets_collector = st.CollectorTweetOutput()
proxies_config = st.RequestsWebClientProxyConfig(
    http_proxy="<Your http proxy URL>",
    https_proxy="<Your https proxy URL>"
)
st.TweetSearchRunner(
    search_tweets_task=search_tweets_task,
    tweet_outputs=[tweets_collector, st.CsvTweetOutput('output_file.csv')],
    web_client=st.RequestsWebClient(proxy=proxies_config, verify=False),
).run()
tweets = tweets_collector.get_scrapped_tweets()
All important details and classes of this library are described below.
SearchTweetsTask
This class represents the task to scrap tweets. It contains the following properties:
Property | Type | Description |
---|---|---|
all_words | Optional[str] | Search for tweets containing all of the given words |
exact_words | Optional[str] | Search for tweets containing the given words in unchanged order |
any_word | Optional[str] | Search for tweets containing any of the given words |
from_username | Optional[str] | Search for tweets from the given user |
to_username | Optional[str] | Search for tweets to the given user (tweets that start by mentioning the user) |
since | Optional[Arrow] | Search for tweets created since the given time |
until | Optional[Arrow] | Search for tweets created until the given time |
language | Optional[st.Language] | Search for tweets in the given language |
tweets_count | Optional[int] | Scrape only the first tweets_count tweets |
replies_filter | Optional[st.RepliesFilter] | Filter tweets by reply/original status |
All properties come from Twitter advanced search and default to None.
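For example, a task can combine several of these properties. The following is only a sketch: st.Language.ENGLISH is an assumed enum value, so check st.Language in your stweet version:

import arrow
import stweet as st

# A sketch combining several search properties. st.Language.ENGLISH is an
# assumed enum value; check st.Language in your stweet version.
search_tweets_task = st.SearchTweetsTask(
    all_words='#covid19',
    from_username='WHO',
    since=arrow.get('2020-12-01'),
    until=arrow.get('2020-12-31'),
    language=st.Language.ENGLISH,
    tweets_count=100,
)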
TweetSearchRunner
With the TweetSearchRunner class, the library scrapes the tweets specified in a SearchTweetsTask. The runner has the following properties:
Property | Type | Default value | Description |
---|---|---|---|
search_run_context | st.SearchRunContext | None (a fresh SearchRunContext() is assigned in __init__()) | Search context; holds the state needed to make the next request to Twitter |
search_tweets_task | st.SearchTweetsTask | Obligatory property | Specifies which tweets the runner should download |
tweet_outputs | List[st.TweetOutput] | Obligatory property | List of objects to export downloaded tweets |
web_client | st.WebClient | stweet.http_request.WebClientRequests() | WebClient implementation; can be replaced with a custom one |
tweet_parser | st.TweetParser | stweet.parse.TwintBasedTweetParser() | Parser of tweets from the web API response |
auth_token_provider_factory | st.auth.AuthTokenProviderFactory | st.auth.SimpleAuthTokenProviderFactory() | Factory of AuthTokenProvider to provide auth tokens |
TweetsByIdsTask
This class represents the task to scrape tweets by ids. It has a single property:
Property | Type | Description |
---|---|---|
tweet_ids | List[str] | Ids of the tweets to scrape |
TweetsByIdsRunner
With the TweetsByIdsRunner class, the library scrapes the tweets specified in a TweetsByIdsTask. The runner has the following properties:
Property | Type | Default value | Description |
---|---|---|---|
tweets_by_ids_task | st.TweetsByIdsTask | Obligatory property | Specifies which tweets the runner should download |
tweet_outputs | List[st.TweetOutput] | Obligatory property | List of objects to export downloaded tweets |
search_run_context | st.SearchRunContext | None (a fresh SearchRunContext() is assigned in __init__()) | Search context; holds the state needed to make the next request to Twitter |
web_client | st.WebClient | stweet.http_request.WebClientRequests() | WebClient implementation; can be replaced with a custom one |
tweet_parser | st.TweetParser | stweet.parse.TwintBasedTweetParser() | Parser of tweets from the web API response |
auth_token_provider_factory | st.auth.AuthTokenProviderFactory | st.auth.SimpleAuthTokenProviderFactory() | Factory of AuthTokenProvider to provide auth tokens |
GetUsersTask
This class represents the task to scrape users. It has a single property:
Property | Type | Description |
---|---|---|
usernames | List[str] | Usernames of the users to scrape; on Twitter a username usually appears with the '@' prefix |
GetUsersRunner
With the GetUsersRunner class, the library scrapes the users specified in a GetUsersTask. The runner has the following properties:
Property | Type | Default value | Description |
---|---|---|---|
get_user_task | st.GetUsersTask | Obligatory property | Specifies which users the runner should download |
user_outputs | List[st.UserOutput] | Obligatory property | List of objects to export downloaded users |
get_user_context | st.GetUsersContext | None (a fresh GetUsersContext() is assigned in __init__()) | Search context; holds the state needed to make the next request to Twitter |
web_client | st.WebClient | stweet.http_request.WebClientRequests() | WebClient implementation; can be replaced with a custom one |
auth_token_provider_factory | st.auth.AuthTokenProviderFactory | st.auth.SimpleAuthTokenProviderFactory() | Factory of AuthTokenProvider to provide auth tokens |
TweetOutput
TweetOutput is an interface for exporting scraped tweets. Stweet provides a few implementations, described below:
TweetOutput implementation | Description |
---|---|
CollectorTweetOutput | Saves tweets in memory; the get_scrapped_tweets() method returns the list of scraped tweets |
CsvTweetOutput | Exports tweets to a csv file |
JsonLineFileTweetOutput | Exports tweets as JSON objects, one JSON object with a tweet per line of the file |
PrintEveryNTweetOutput | Prints every Nth tweet on screen; N is set in the constructor |
PrintFirstInRequestTweetOutput | Prints the first tweet of each incoming request |
PrintTweetOutput | Prints all tweets |
Additionally, TweetOutput can be implemented in many other ways.
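As a sketch of such a custom implementation, the output below only counts incoming tweets. It assumes TweetOutput declares a single export_tweets(tweets) hook and that Tweet is exported at the top level; verify both against your stweet version:

from typing import List
import stweet as st

# A minimal sketch of a custom output. It assumes TweetOutput declares a
# single export_tweets(tweets) hook; verify the exact signature against
# the TweetOutput base class in your stweet version.
class CountingTweetOutput(st.TweetOutput):
    def __init__(self):
        self.count = 0

    def export_tweets(self, tweets: List[st.Tweet]):
        # Called with each batch of scraped tweets.
        self.count += len(tweets)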
UserOutput
UserOutput is an interface for exporting scraped users. Stweet provides a few implementations, described below:
UserOutput implementation | Description |
---|---|
CollectorUserOutput | Saves users in memory; the get_scrapped_users() method returns the list of scraped users |
CsvUserOutput | Exports users to a csv file |
JsonLineFileUserOutput | Exports users as JSON objects, one JSON object with a user per line of the file |
PrintEveryNUserOutput | Prints every Nth user on screen; N is set in the constructor |
PrintFirstInRequestUserOutput | Prints the first user of each incoming request |
PrintUserOutput | Prints all users |
Additionally, UserOutput can be implemented in many other ways.
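Outputs can also be combined freely. This sketch collects users in memory and writes them to a JSON-lines file at the same time; the file name is a placeholder, and it assumes JsonLineFileUserOutput takes the target file name in its constructor, like CsvUserOutput:

import stweet as st

# A sketch combining two of the built-in user outputs; 'users.jsonl' is a
# placeholder file name.
users_collector = st.CollectorUserOutput()
st.GetUsersRunner(
    get_user_task=st.GetUsersTask(['JoeBiden']),
    user_outputs=[users_collector, st.JsonLineFileUserOutput('users.jsonl')],
).run()
users = users_collector.get_scrapped_users()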
ProxyClientRequests
ProxyClientRequests is an implementation of st.WebClient that allows using proxies as well as supplying additional options for the requests.request method.
Property | Type | Description |
---|---|---|
proxies | Dict[str, str] | Dictionary mapping protocol to the URL of the proxy. |
options | Dict[str, Any] | Dictionary mapping a requests.request method param to its value. |
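A short usage sketch follows; it assumes the class is exported at the top level as st.ProxyClientRequests and that both properties are constructor arguments. The proxy URLs are placeholders, and timeout is just one example of a requests.request parameter:

import stweet as st

# A sketch of ProxyClientRequests usage; the proxy URLs are placeholders
# and 'timeout' is one example of a requests.request keyword argument.
web_client = st.ProxyClientRequests(
    proxies={'http': '<Your http proxy URL>', 'https': '<Your https proxy URL>'},
    options={'timeout': 30},
)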
Additionally, you can implement your own WebClient.
How to contribute
If you want to improve stweet library then please read the instruction
in first-contributions repo. Remember to create pull
request to develop
branch.
You must have docker and docker-compose installed to run all tests on your computer. These dependencies start the proxy service required by the proxy tests. To run the tests locally, run the tox command:
tox -v
Thank you for every pull request. Together we can make this library better.
Twint inspiration
A small part of the library uses code from twint. Twint was also the main inspiration for creating stweet.