Script to store tweets from a list of users in a database for NLP processing.
Project description
Tweet Archiveur
This project aims at storing tweets in a database, but you can also use it without one.
- Input: Twitter user IDs in a CSV file
- Output: a database of tweets and hashtags
Our goal is to store the tweets of all members of the French Parliament to get an idea of trending topics, but you can use the project for other purposes with other accounts.
How to install the package
TODO: publish it to PyPI once:
- "nom" is renamed to "name" in users
- Unit tests are reactivated (https://docs.github.com/en/actions/guides/creating-postgresql-service-containers)
- Scrapper is made a class
- We switch to SQLAlchemy
- Flake8
- Documentation
```
pip install tweetarchiveur
```
How to use the package in your project
There are two classes:
- A Scrapper() to use the Twitter API
- A Database() to store tweets and hashtags
```python
from tweet_archiveur.scrapper import Scrapper
from tweet_archiveur.database import Database

# Force some variables when running outside Docker
from os import environ
environ["DATABASE_PORT"] = '8479'
environ["DATABASE_HOST"] = 'localhost'
environ["DATABASE_USER"] = 'tweet_archiveur_user'
environ["DATABASE_PASS"] = '1234leximpact'
environ["DATABASE_NAME"] = 'tweet_archiveur'

# Load the users to follow and extract their Twitter IDs
scrapper = Scrapper()
df_users = scrapper.get_users_accounts('../tests/sample-users.csv')
users_id = df_users.twitter_id.tolist()

# Prepare the database and register the users
database = Database()
database.create_tables_if_not_exist()
database.insert_twitter_users(df_users)

# Fetch and store the tweets of the first two users
scrapper.get_all_tweet_and_store_them(database, users_id[0:2])

del database
del scrapper
```
```
2021-03-22 10:21:59,837 - tweet-archiveur INFO Scrapper ready
2021-03-22 10:21:59,841 - tweet-archiveur INFO Loading database module...
2021-03-22 10:21:59,842 - tweet-archiveur DEBUG DEBUG : connect(user=tweet_archiveur_user, password=XXXX, host=localhost, port=8479, database=tweet_archiveur, url=None)
2021-03-22 10:22:03,915 - tweet-archiveur INFO Done scrapping, we got 400 tweets from 2 tweetos.
```
How we use it
We fetch the tweets of the 577 French Parliament members every 8 hours and store them in a PostgreSQL database.
We then explore them with Apache Superset.
How we deploy it
Prepare the environment:
```
git clone https://github.com/leximpact/tweet-archiveur.git
cd tweet-archiveur
cp docker/docker.env .env
```
Edit the .env to your needs.
Run the application:
```
docker-compose up -d
```
To view what's going on:
```
docker logs tweet-archiveur_tweet_archiveur_1 -f
```
The archiveur.py script uses the package to fetch the parliament accounts from https://github.com/regardscitoyens/twitter-parlementaires
Its parameters are read from a .env file.
It is launched by the entrypoint.sh script every 8 hours.
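The entrypoint loop can be sketched in Python as follows. This is a minimal illustration of "run a job every 8 hours", not the actual entrypoint.sh; the `fetch_and_store` callable and the `max_runs` parameter are hypothetical, added here so the loop can terminate in a test.

```python
import time

INTERVAL_SECONDS = 8 * 60 * 60  # every 8 hours, matching the deployment above

def run_forever(fetch_and_store, interval=INTERVAL_SECONDS, max_runs=None):
    """Call fetch_and_store() every `interval` seconds.

    `max_runs` caps the number of iterations (useful for testing);
    None means loop forever, like the real entrypoint.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        fetch_and_store()  # e.g. scrape all accounts and store the tweets
        runs += 1
        if max_runs is None or runs < max_runs:
            time.sleep(interval)
    return runs
```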
To stop it:
```
docker-compose down
```
The data is kept in a Docker volume; to remove it:
```
docker-compose down -v
```
What to do with it?
- Most used hashtags (per period, per person)
- Most/least active users
- Timeline of
- NLP topic detection
- Word cloud
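As a taste of the first idea, counting the most-used hashtags can be sketched with the standard library alone. The tweet texts below are made up for illustration; in practice they would come from the tweets table in the database.

```python
from collections import Counter
import re

def top_hashtags(tweets, n=3):
    """Return the n most common hashtags (lowercased) across tweet texts."""
    counter = Counter()
    for text in tweets:
        # Hashtags are '#' followed by word characters
        counter.update(tag.lower() for tag in re.findall(r"#(\w+)", text))
    return counter.most_common(n)

tweets = [
    "Debate tonight #AssembleeNationale #budget",
    "Vote on the #budget amendment",
    "#budget session over",
]
print(top_hashtags(tweets, n=2))  # [('budget', 3), ('assembleenationale', 1)]
```

Grouping per period or per person would only require keeping the tweet date or author alongside each text.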
Annexes
Exit codes:
- 1: unknown error when storing tweets
- 2: unknown error when getting tweets
- 3: failed more than 3 consecutive times
- 4: no env
If anything fails, no tweets will be saved.
Status code 429: the 429 'Too Many Requests' error is returned when you exceed the maximum number of API requests allowed.
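A common way to survive 429 responses is exponential backoff before giving up, which matches exit code 3 above ("failed more than 3 consecutive times"). This is a generic sketch, not the package's actual retry logic; `RateLimited` stands in for whatever exception the Twitter client raises on HTTP 429.

```python
import time

class RateLimited(Exception):
    """Hypothetical exception standing in for an HTTP 429 response."""

def call_with_backoff(call, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Invoke call(), retrying with exponential backoff on RateLimited.

    Gives up (re-raises) after max_retries consecutive failures.
    `sleep` is injectable so tests can skip the real waiting.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimited:
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```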