Epidemiology of Cohort Social Media
Overview
Epicosm: Epidemiology of Cohort Social Media. Epicosm is a suite of tools for working with social media data in the context of epidemiological research. It is intended for epidemiologists who wish to gather, analyse and integrate social media data with existing longitudinal and cohort-study research. The tools can:
- Harvest ongoing and retrospective tweets from a list of users.
- Listen to the real-time Twitter stream by geographic location, and collate tweets into a database.
- Apply sentiment analysis to tweets using labMT, VADER and LIWC (a dictionary is required for LIWC).
- [in development] Validate sentiment analysis algorithms against ground truth.
Instructions in a nutshell
1. Download the Epicosm repository
2. Install MongoDB version 4 or higher:
- In a Mac terminal: `brew install mongodb`
- In a Linux terminal: `apt install mongodb`
3. Put these three files into a folder (see the sketch after this list for examples of the two input files):
- `epicosm_mac` or `epicosm_linux`, as downloaded from the repository in step 1,
- a `credentials.txt` file (provided here, but complete it with your own Twitter access keys),
- your `user_list` (supplied by you: a plain text file with one screen name per line).
4. Run Epicosm from your command line, including your run flags. Epicosm will provide some help if it doesn't understand you: just type `./epicosm_linux` or `./epicosm_mac`. See below for more details, but for example a typical harvest can be started with `./epicosm_linux --user_harvest`.
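As a sketch of the two input files: the screen names below are placeholders, and the field names in `credentials.txt` are illustrative only; the template shipped in the repository is the authoritative format.

```
# user_list — one Twitter screen name per line (placeholder names)
some_user
another_user

# credentials.txt — complete the repository template with your own keys;
# the field names shown here are illustrative
CONSUMER_KEY=xxxxxxxxxxxxxxxx
CONSUMER_SECRET=xxxxxxxxxxxxxxxx
ACCESS_TOKEN=xxxxxxxxxxxxxxxx
ACCESS_TOKEN_SECRET=xxxxxxxxxxxxxxxx
```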
•••
More detail
1 What does it do?
2 Running Epicosm from compiled python executable
3 Optional parameters
4 Natural Language Processing (Sentiment analysis)
5 Geoharvester
6 Data and other outputs
7 Running the python script manually
8 Licence
•••
1 What does it do?
Epicosm is a social media harvester, data manager and sentiment analyser. Currently, the platform uses Twitter as the data source, and the sentiment analysis methods available are VADER, labMT and LIWC (you will need an LIWC dictionary for this). You provide a list of users, and it will gather and store all tweets and metadata (going back a maximum of 3240 tweets) for each user. Images, videos and other attachments are stored as URLs. All information is stored in MongoDB. Harvesting can be iterated, for example once a week it can gather new tweets and add them to the database. As well as the full database, output includes a comma-separated values (.csv) file, with the default fields being the user id number, the tweet id number, the time and date, and the tweet content. Epicosm can also harvest the friends of users (i.e. the accounts that the user is following, not the followers of the user).
Epicosm uses MongoDB for data management, and this must be installed before running Epicosm. This can be done by downloading and installing from the MongoDB website, or in a terminal window with `brew install mongodb` on a Mac, or `apt install mongodb` on Linux (Debian-based systems like Ubuntu).
Epicosm can be run in two ways. It can be run using the compiled python executables provided, `epicosm_mac` or `epicosm_linux`. If there are any issues with your input files (your `user_list` and your `credentials.txt`), Epicosm will try to help you. Alternatively, Epicosm can be run with Python version 3+; details are in section 7.
You will need Twitter API credentials, obtained by having a developer account authorised by Twitter. Please see our guide to getting an authorised account; there are further details in the Twitter documentation on how to do this. As of August 2020, Twitter is usually rapid in authorising accounts for academic purposes, although this can of course change. Be aware that many of the guides to getting authorisation that you will find are out of date!
•••
2 Running Epicosm from compiled python executable
This is the usual way of running Epicosm (see section 7 for running using Python).
You must provide 2 further files in the folder with the Epicosm executable:
- a list of user screen names in a file called `user_list`. The user list must be a plain text file, with a single username (Twitter screen name) per line.
- Twitter API credentials, in a file called `credentials.txt`. Please see the file in this repository for a template.
Then you can run the python executable, for example
./epicosm_linux [your run flags]
or
./epicosm_mac [your run flags]
•••
3 Optional parameters
When running the harvester, please specify what you want Epicosm to do:
--user_harvest
Harvest tweets from all users in a file called `user_list`
(provided by you) with a single user per line. The database will be
backed up on every harvest, with a rotating backup of the last three
harvests. These can be restored into another instance of MongoDB
with `mongorestore`; see MongoDB documentation for details.
--get_friends
Create a database of the users that are
being followed by the accounts in your user_list.
(This process can be very slow, especially if
your users follow many accounts.) You will also get
a CSV of users and who they are following, in ./output/csv.
If used with --repeat, friends will only be gathered once.
--repeat
Iterate the user harvest every 3 days. This process will need to
be put into the background to free your terminal prompt,
or left running while logged out.
--refresh
If you have a new user_list, this will tell Epicosm to
use this file as your updated user list.
--csv_snapshots
Make a CSV-formatted snapshot of selected fields on every harvest.
See documentation for the format and fields of this CSV.
Be aware that this may take up disk space; see ./output/csv.
Example of a single harvest:
./epicosm --user_harvest
Example of an iterated harvest in the background, with a renewed user_list and taking CSV snapshots:
nohup ./epicosm --user_harvest --refresh --csv_snapshots --repeat &
4 Natural Language Processing (Sentiment analysis)
Once you have a database with tweets, you can apply sentiment analysis to each document and insert the result into MongoDB. You will need to run `epicosm_nlp.py` (if you have dependency errors, please install them with `pip3 install -r requirements.txt`).
To run, specify one or more of the following flags:
--insert_groundtruth
Provide a file of groundtruth values called 'groundtruth.csv' and insert these into the local database.
--liwc
Apply LIWC (Pennebaker et al 2015) analysis and append values to the local database. You must have a LIWC dictionary in the run folder, named "LIWC.dic". LIWC has around 70 categories (including posemo and negemo), but many of these will return no value because tweets are too short to provide information. Empty categories are not appended to the database. **Note:** the LIWC package is broken and cannot deal with its own dictionary: if it comes across phrasal entries it throws a key error. In LIWC 2015, most of these are variations on the word 'like' ('we like', 'they like', 'not like'), but the words 'like', 'not' and 'we' are already in categories, and the phrasal entries have the same metrics anyway! You will need to clean your dictionary with the script in src called cleanLIWC.sh.
--labmt
Apply labMT (Dodds & Danforth 2011) analysis and append values to the local database. LabMT provides a single positive-to-negative metric, ranging from -1 (negative) through 0 (neutral) to 1 (positive).
--vader
Apply VADER (Hutto & Gilbert 2014) analysis and append values to the local database. VADER returns 4 metrics: positive, neutral, negative and compound. See their documentation for details.
--textblob
Apply TextBlob (github: @sloria) analysis and append values to the local database. TextBlob provides a single positive-to-negative metric, ranging from -1 (negative) through 0 (neutral) to 1 (positive).
The results of these analyses will be appended to each tweet's record, under the field "epicosm", and stored in MongoDB.
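As a minimal sketch of what one of these steps amounts to, assuming the database layout described above (`twitter_db` / `tweets`) and the `vaderSentiment` and `pymongo` packages; the tweet text field name and the exact update logic are assumptions, not Epicosm's actual implementation:

```python
# A minimal sketch, not Epicosm's actual code: score unanalysed tweets
# with VADER and append the result under the "epicosm" field.
from pymongo import MongoClient
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

tweets = MongoClient()["twitter_db"]["tweets"]   # local MongoDB instance
analyzer = SentimentIntensityAnalyzer()

for tweet in tweets.find({"epicosm.vader": {"$exists": False}}):
    text = tweet.get("full_text") or tweet.get("text", "")  # field name assumed
    scores = analyzer.polarity_scores(text)  # {"neg", "neu", "pos", "compound"}
    tweets.update_one({"_id": tweet["_id"]},
                      {"$set": {"epicosm.vader": scores}})
```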
•••
5 Geoharvester
The python script `geoharvester.py` can launch a Twitter stream listener by geographic location, as defined by one or more latitude/longitude boxes. Please see the example `geoboxes.py` for the format of this file. As above, you will need to provide your `credentials.txt` to gain access to the Twitter streaming API. All tweets are stored in MongoDB under the database `geotweets` and the collection `geotweets_collection`. To sentiment analyse these, please see section 4 on NLP above. Few tweets (historically, less than 2%) have geotags, but Twitter will try to assign a rough location based on city or country. As of 2020, Twitter has said it will phase out geotagging, since few people authorise Twitter to geotag their tweets.
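For orientation only, a hypothetical sketch of a bounding-box definition; the repository's `geoboxes.py` is the authoritative format. Twitter's streaming API orders each box as south-west longitude, south-west latitude, north-east longitude, north-east latitude, and multiple boxes are concatenated into one flat list:

```python
# Hypothetical sketch; see the repository's geoboxes.py for the actual
# format Epicosm expects. Coordinates are approximate.
# Each box: [SW longitude, SW latitude, NE longitude, NE latitude]
bristol = [-2.73, 51.38, -2.45, 51.55]   # rough box around Bristol, UK
cardiff = [-3.29, 51.42, -3.10, 51.56]   # rough box around Cardiff, UK

# Multiple boxes are passed to the streaming API as one flat list.
geoboxes = bristol + cardiff
```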
6 Data and other outputs
The processed output is a database of tweets from the users in your `user_list`, and a CSV file, in the folder ./output/csv/, which by default has the fields: [1] the ID of the tweeter, [2] the ID of the tweet, [3] the time and date of the tweet, and [4] the tweet content.
Log files detailing what Epicosm has done are in /epicosm_logs/.
Full tweet content and metadata of all tweets are stored in MongoDB in a format which is closely aligned with JSON. To work with the full raw data, you will need MongoDB installed. The tweet database is named `twitter_db`, with two collections: `tweets`, and `friends`, which contains a list of all accounts that each user in your list is following. The `friends` collection will only be made if you ask for friends lists to be gathered. Currently, gathering friends lists causes the process to be heavily rate limited by Twitter! [solution in progress]
A backup of the entire database is stored in /output/twitter_db/. If you have MongoDB installed, this can be restored with the command
mongorestore [your name given to the database] [the path to the mongodump bson file]
for example:
mongorestore -d twitter_db ./output/twitter_db/tweets.bson
(However, please check MongoDB documentation as commands can change.)
To view and interact with the database using a GUI, you will need MongoDB installed, and a database viewer. Of open source options, we find that Robo 3T works very well.
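To work with the raw data programmatically rather than through a GUI, a minimal `pymongo` sketch along these lines may help; the database and collection names follow the description above, while the tweet field names (`id_str`, `created_at`, `full_text`) are standard Twitter API fields and are assumptions about this database:

```python
# A minimal sketch: export the default CSV fields from the raw database.
# Field names inside each tweet document are assumptions.
import csv
from pymongo import MongoClient

tweets = MongoClient()["twitter_db"]["tweets"]

with open("tweets_export.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["user_id", "tweet_id", "created_at", "text"])
    for t in tweets.find():
        writer.writerow([t["user"]["id_str"], t["id_str"], t["created_at"],
                         t.get("full_text", t.get("text", ""))])
```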
•••
7 Running the python script manually
See the source file in /src
and run it with
python3 epicosm.py [your run flag]
You must provide 2 files:
- a list of user screen names in a file called `user_list`. The user list must be a plain text file, with a single username (Twitter screen name) per line.
- Twitter API credentials, supplied by editing the file `credentials.py` (further instructions inside the file). You will need your own Twitter API credentials by having a developer account authorised by Twitter, and generating the required codes. Please see our guide; there are further details in the Twitter documentation on how to do this.
Please also see these further requirements.
- Put all repository files and your user list into their own folder. The python script must be run from the folder it is in.
- MongoDB version 4 or higher will need to be installed. It does not need to be running: the script will check MongoDB's status, and start it if it is not running. The working database will be stored in the folder where you place your local copy of this repository (not the default location of /data/db). For Linux and MacOS, use your package manager, for example: `apt install mongodb` (or `yum`, `brew`, `yast` or another package manager as appropriate).
- The Python 3 dependencies listed in the `src/requirements.txt` file will need to be installed, which you can do by running `pip3 install -r requirements.txt`.
•••
8 Licence
DynamicGenetics/Epicosm is licensed under the GNU General Public License v3.0. For full details, please see our license file.
Epicosm is written and maintained by Alastair Tanner, University of Bristol, Integrative Epidemiology Unit.
BLAKE2b-256 | 1fc93e351702e1bad17465b6909cc633144fa4d18857291783439b9fa1cd85ff |