Package to scrape tweets

Project description

stweet

Open Source Love Python package PyPI version MIT Licence

A modern, fast Python library for scraping tweets and users from Twitter's unofficial API.

This tool helps you scrape tweets by search phrase, tweets by ID, and users by username. It uses the same Twitter API that the Twitter website uses.

Inspiration for the creation of the library

I used twint to scrape tweets, but it has many errors and does not work properly. The code was not simple to understand. All tasks share one config, and the user has to know exactly which parameters apply. The last important point is that the API can change: Twitter owns the API, and any changes depend on Twitter. It is annoying when something stops working and users must report bugs as issues.

Main advantages of the library

  • Simple code: the code is not only mine, every user can contribute to the library
  • Domain objects and interfaces: most of the functionality can be replaced (e.g. making web requests). The library ships a basic, simple solution; if you want to extend it, you can do so without any problems and without forking
  • 100% coverage with integration tests: this makes API changes detectable. Tests run every week, and when one fails the source of the change is easy to find (not in version 2.0)
  • Custom tweets and users output: output is part of the interface, so saving tweets and users in a custom format takes only a brief moment

Installation

pip install -U stweet

Donate

If you want to sponsor me, in thanks for the project, please send me some crypto 😁:

Coin Wallet address
Bitcoin 3EajE9DbLvEmBHLRzjDfG86LyZB4jzsZyg
Ethereum 0xE43d8C2c7a9af286bc2fc0568e2812151AF9b1FD

Basic usage

To make a simple request, a scrape task must be prepared. The task is then processed by a runner.

import stweet as st


def try_search():
    search_tweets_task = st.SearchTweetsTask(all_words='#covid19')
    output_jl_tweets = st.JsonLineFileRawOutput('output_raw_search_tweets.jl')
    output_jl_users = st.JsonLineFileRawOutput('output_raw_search_users.jl')
    output_print = st.PrintRawOutput()
    st.TweetSearchRunner(search_tweets_task=search_tweets_task,
                         tweet_raw_data_outputs=[output_print, output_jl_tweets],
                         user_raw_data_outputs=[output_print, output_jl_users]).run()


def try_user_scrap():
    user_task = st.GetUsersTask(['iga_swiatek'])
    output_json = st.JsonLineFileRawOutput('output_raw_user.jl')
    output_print = st.PrintRawOutput()
    st.GetUsersRunner(get_user_task=user_task, raw_data_outputs=[output_print, output_json]).run()


def try_tweet_by_id_scrap():
    id_task = st.TweetsByIdTask('1447348840164564994')
    output_json = st.JsonLineFileRawOutput('output_raw_id.jl')
    output_print = st.PrintRawOutput()
    st.TweetsByIdRunner(tweets_by_id_task=id_task,
                        raw_data_outputs=[output_print, output_json]).run()


if __name__ == '__main__':
    try_search()
    try_user_scrap()
    try_tweet_by_id_scrap()

The example above shows that only a few lines of code are required to scrape tweets.

Export format

Stweet uses the API from the website, so there is no documentation for the responses it receives. Responses are saved raw, so the end user must parse them on their own. A parser may be added in the future.
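Since the raw output is saved as JSON lines, a first parsing step can be done with the standard library alone. The sketch below is a minimal example, not part of stweet: the helper name iter_json_lines and the sample records are hypothetical, and the fields of real responses are undocumented, so inspect them before relying on any key.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterator


def iter_json_lines(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one decoded JSON value per non-empty line of a .jl file."""
    with open(path, encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


if __name__ == '__main__':
    # Hypothetical records; real stweet output has its own undocumented fields.
    sample = [{'example_key': 'first'}, {'example_key': 'second'}]
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, 'sample.jl')
        with open(path, 'w', encoding='utf-8') as file:
            for record in sample:
                file.write(json.dumps(record) + '\n')
        # List the top-level keys of each record as a first look at its shape.
        for record in iter_json_lines(path):
            print(sorted(record.keys()))
```

Reading line by line keeps memory use flat even for large scrape files, because only one record is decoded at a time.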

Scraped data can be exported in different ways by using the RawDataOutput abstract class. A list of these outputs can be passed to every runner, so it is possible to export in two different ways at once.

Currently, stweet implements:

  • CollectorRawOutput – saves data in memory and returns it as a list of objects
  • JsonLineFileRawOutput – exports data as JSON lines
  • PrintEveryNRawOutput – prints every N-th item
  • PrintFirstInBatchRawOutput – prints the first item in each batch
  • PrintRawOutput – prints all items (not recommended for large scrapes)
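The idea behind a custom output can be sketched without installing the library. Everything below is a stand-in written for illustration: RawDataOutputSketch, the method name export_raw_data, and the helper export_to_all are assumptions, not stweet's confirmed API. Check the real RawDataOutput class for the method you need to override.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class RawDataOutputSketch(ABC):
    """Hypothetical stand-in for stweet's RawDataOutput abstract class."""

    @abstractmethod
    def export_raw_data(self, raw_data_list: List[Any]) -> None:
        """Receive one batch of scraped items (method name is an assumption)."""


class CountingOutput(RawDataOutputSketch):
    """Counts exported items instead of storing them."""

    def __init__(self) -> None:
        self.count = 0

    def export_raw_data(self, raw_data_list: List[Any]) -> None:
        self.count += len(raw_data_list)


class ListOutput(RawDataOutputSketch):
    """Keeps every exported item in memory."""

    def __init__(self) -> None:
        self.items: List[Any] = []

    def export_raw_data(self, raw_data_list: List[Any]) -> None:
        self.items.extend(raw_data_list)


def export_to_all(outputs: List[RawDataOutputSketch], batch: List[Any]) -> None:
    """Mimic a runner handing the same batch to every configured output."""
    for output in outputs:
        output.export_raw_data(batch)


if __name__ == '__main__':
    counting, collected = CountingOutput(), ListOutput()
    export_to_all([counting, collected], ['item-1', 'item-2'])
    print(counting.count, collected.items)
```

Because each output only sees batches through one method, several outputs can be combined in one list without knowing about each other.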

Using a Tor proxy

The library is integrated with tor-python-easy. This allows using a Tor proxy with an exposed control port, so the IP can be changed when needed.

If you want to use the Tor proxy client, you need to prepare a custom web client and use it in the runner.

You need to run a Tor proxy: you can run it on your local OS, or you can use this docker-compose.

The code snippet below shows how to use the proxy:

import stweet as st

if __name__ == '__main__':
    web_client = st.DefaultTwitterWebClientProvider.get_web_client_preconfigured_for_tor_proxy(
        socks_proxy_url='socks5://localhost:9050',
        control_host='localhost',
        control_port=9051,
        control_password='test1234'
    )

    search_tweets_task = st.SearchTweetsTask(all_words='#covid19')
    output_jl_tweets = st.JsonLineFileRawOutput('output_raw_search_tweets.jl')
    output_jl_users = st.JsonLineFileRawOutput('output_raw_search_users.jl')
    output_print = st.PrintRawOutput()
    st.TweetSearchRunner(search_tweets_task=search_tweets_task,
                         tweet_raw_data_outputs=[output_print, output_jl_tweets],
                         user_raw_data_outputs=[output_print, output_jl_users],
                         web_client=web_client).run()

Dividing scrape periods is recommended

For guest clients, Twitter blocks repeated pagination. Sometimes a single query allows only 3 pagination calls. To avoid this limitation, divide the scraping period into smaller parts.

As of 2023, Twitter's API blocks time ranges given as timestamps; only the YYYY-MM-DD format is accepted. With Arrow, you can therefore only pass dates, without hours.
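Splitting a period into day-aligned parts can be sketched with the standard library. The helper split_period below is hypothetical and not part of stweet. It uses datetime.date rather than Arrow, and it assumes that adjacent parts may share a boundary date; whether the end date of a search is inclusive is not stated here, so verify that before relying on the boundaries.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def split_period(start: date, end: date, days_per_part: int) -> Iterator[Tuple[str, str]]:
    """Yield (since, until) pairs as YYYY-MM-DD strings covering start to end."""
    if days_per_part < 1:
        raise ValueError('days_per_part must be at least 1')
    step = timedelta(days=days_per_part)
    current = start
    while current < end:
        # The last part is shortened so it never runs past the end date.
        part_end = min(current + step, end)
        yield current.isoformat(), part_end.isoformat()
        current = part_end


if __name__ == '__main__':
    for since, until in split_period(date(2023, 1, 1), date(2023, 1, 8), 3):
        print(since, until)
    # 2023-01-01 2023-01-04
    # 2023-01-04 2023-01-07
    # 2023-01-07 2023-01-08
```

Each pair can then be used to build one search task, so that every query covers a period short enough to stay within the pagination limit.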

Twint inspiration

A small part of the library uses code from twint. Twint was also the main inspiration for creating stweet.
