
Dozent

Dozent is a powerful downloader for collecting large amounts of Twitter data from the Internet Archive.

It is built on top of PySmartDL and multithreading and works much like traditional download accelerators such as axel, aria2c, and aws s3, so that the biggest bottlenecks are your network and your hardware.
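To illustrate the general idea behind download accelerators, here is a minimal sketch of how a file can be split into byte ranges that separate threads fetch in parallel. This is not Dozent's or PySmartDL's actual code: `split_ranges`, `fetch_range`, and `plan_download` are hypothetical names, and `fetch_range` only builds the `Range` header value instead of performing a real HTTP request.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total_size, parts):
    """Split [0, total_size) into up to `parts` contiguous, inclusive byte ranges."""
    if total_size <= 0 or parts <= 0:
        return []
    parts = min(parts, total_size)
    base, extra = divmod(total_size, parts)
    ranges, start = [], 0
    for index in range(parts):
        # The first `extra` ranges get one additional byte each.
        length = base + (1 if index < extra else 0)
        ranges.append((start, start + length - 1))
        start += length
    return ranges


def fetch_range(byte_range):
    """Hypothetical stand-in for an HTTP request that sends a Range header."""
    start, end = byte_range
    return f"bytes={start}-{end}"


def plan_download(total_size, threads=4):
    """Hand each byte range to a worker thread; results come back in order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(fetch_range, split_ranges(total_size, threads)))
```

In a real accelerator each worker would write its chunk to the correct offset of the output file, which is why the ranges must be contiguous and non-overlapping.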

The data that is downloaded is already heavily compressed to reduce download times and save local storage. When uncompressed, the data can easily add up to several terabytes depending on the timeframe of data being collected. Fortunately, you do not have to decompress the data to analyze it! We are working on a separate big data tool named Murpheus that uses Dask to analyze the data without needing to decompress it.
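As a rough illustration of analyzing compressed data without first unpacking it to disk, the sketch below streams records straight from a compressed file using only the standard library. It is not Murpheus and does not use Dask. It assumes a bz2-compressed file with one JSON object per line, which is an assumption about the file layout rather than something this page states, and `iter_tweets` is a hypothetical helper name.

```python
import bz2
import json


def iter_tweets(path):
    """Yield one parsed JSON object per non-empty line, decompressing lazily.

    Assumes (hypothetically) a bz2 file of newline-delimited JSON.
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Because the function is a generator, only one line is held in memory at a time, so the uncompressed size of the file does not matter.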

If you have any ideas on how to improve Dozent, please open an issue here and tell us how!

Installation

Before installing, ensure that you are using Python 3.6 or later (python>=3.6). We intend to support the latest Python releases as they come out.
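If you want to check the interpreter version from a script before installing, a small sketch like the following works. The helper name `python_is_supported` is hypothetical and not part of Dozent.

```python
import sys

# Minimum version stated in the installation notes above.
MINIMUM_PYTHON = (3, 6)


def python_is_supported(version_info=None):
    """Return True if the given (or running) interpreter is at least 3.6."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MINIMUM_PYTHON
```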

Installing with pip

Installing with pip is as easy as:

pip install dozent

Installing with Docker

In keeping with our goal of making everything we distribute as lightweight as possible, we provide a Docker image so that you can run Dozent without having to worry about Python versions and similar setup details.

While "installing" isn't really something you do with Docker, we felt it best to include some helpful links to help newcomers install Docker.

You can find the link to the installation here. If you choose to go this route, we suggest jumping down to the "Downloading with Dozent after installing Docker" section after installing Docker.

Usage

$ python -m dozent --help

usage: __main__.py [-h] -s START_DATE -e END_DATE [-t TIMEIT]
                 [-o OUTPUT_DIRECTORY] [-q]

A powerful downloader to get tweets from twitter for our compute. The first
step of many

optional arguments:
  -h, --help            show this help message and exit
  -s START_DATE, --start-date START_DATE
                        The date from where we download. The format must be:
                        YYYY-MM-DD
  -e END_DATE, --end-date END_DATE
                        The last day that we download. The format must be:
                        YYYY-MM-DD
  -t TIMEIT, --timeit TIMEIT
                        Show total program runtime
  -o OUTPUT_DIRECTORY, --output-directory OUTPUT_DIRECTORY
                        Output Directory where the file will be stored.
                        Defaults to the data/ directory
  -q, --quiet           Turn off output (except for errors and warnings)
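If you would rather launch Dozent from a Python script, the flags listed above can be assembled into a command list. This is a sketch only: `build_dozent_command` is a hypothetical helper, and it uses just the `-s`, `-e`, `-o`, and `-q` flags shown in the help output.

```python
import sys
from datetime import date


def build_dozent_command(start, end, output_directory=None, quiet=False):
    """Build the argument list for `python -m dozent` from two dates."""
    if end < start:
        raise ValueError("end date must not be earlier than start date")
    command = [
        sys.executable, "-m", "dozent",
        "-s", start.isoformat(),  # YYYY-MM-DD, as the help text requires
        "-e", end.isoformat(),
    ]
    if output_directory is not None:
        command += ["-o", output_directory]
    if quiet:
        command.append("-q")
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)` to start the download.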

Downloading with Dozent after installing with pip

Downloading all tweets from 12th of May 2020 to 15th of May 2020

$ python -m dozent -s 2020-05-12 -e 2020-05-15

Queueing tweets download for 05-2020
Queueing tweets download for 05-2020
Queueing tweets download for 05-2020
Queueing tweets download for 05-2020
https://archive.org/download/archiveteam-twitter-stream-2020-05/twitter_stream_2020_05_13.tar [downloading] 16 Mb / 2498 Mb @ 1.6 MB/s [------------------] [0%, 32 minutes, 31 seconds left]
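The progress line above suggests one archive file per day. The sketch below reproduces that URL shape for a range of dates. The pattern is inferred from this single log line, so it is an assumption: it may not be how Dozent builds its URLs and may not hold for other months. `archive_url` and `archive_urls` are hypothetical names.

```python
from datetime import date, timedelta


def archive_url(day):
    """Build a URL shaped like the one in the sample output (assumed pattern)."""
    return (
        "https://archive.org/download/"
        f"archiveteam-twitter-stream-{day:%Y-%m}/"
        f"twitter_stream_{day:%Y_%m_%d}.tar"
    )


def archive_urls(start, end):
    """Return one URL per day from start to end, both inclusive."""
    days = (end - start).days
    return [archive_url(start + timedelta(days=offset)) for offset in range(days + 1)]
```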

Downloading with Dozent after installing Docker

Pull the latest Dozent image from Docker Hub

$ docker pull socialmediapublicanalysis/dozent:latest

Get help

$ docker run -it socialmediapublicanalysis/dozent:latest

or

$ docker run -it socialmediapublicanalysis/dozent:latest -h

Download all tweets from May 12th, 2020 to May 15th, 2020

$ docker run -it socialmediapublicanalysis/dozent:latest python -m dozent -s 2020-05-12 -e 2020-05-15

About the Data

  • Only collects Tweets in the English language
  • Tweets are stored in JSON format
  • Each day is a compressed file of roughly 2.5 GB, or about 32 GB uncompressed
  • Each tweet has accompanying metadata about the tweet and user
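The per-day sizes above make it easy to estimate how much storage a date range needs before you start downloading. The sketch below treats the 2.5 GB and 32 GB figures as rough averages, and `estimate_storage_gb` is a hypothetical helper name.

```python
from datetime import date

# Approximate per-day sizes taken from the list above.
COMPRESSED_GB_PER_DAY = 2.5
UNCOMPRESSED_GB_PER_DAY = 32


def estimate_storage_gb(start, end):
    """Return (compressed_gb, uncompressed_gb) for start..end, both inclusive."""
    days = (end - start).days + 1
    if days < 1:
        raise ValueError("end date must not be earlier than start date")
    return days * COMPRESSED_GB_PER_DAY, days * UNCOMPRESSED_GB_PER_DAY
```

For example, the four days from 2020-05-12 to 2020-05-15 come to roughly 10 GB compressed and 128 GB uncompressed.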

Sample Data

Interested in seeing what the data that Dozent collects looks like?

Check it out!

https://dozent-tests.s3.amazonaws.com/sample_data.json

Download files

Source Distribution

dozent-1.0.1.tar.gz (15.4 kB)

  • SHA256: ed0301af4d9e7aa146052e867d3b84446e1db58ac0d9b95d02a5c5331a601840
  • MD5: 38d201d2d5f1cba58eaba7a69b817125
  • BLAKE2b-256: b77dbc05d7e8ec542657dfaedf7db0628de7e7b60ed2a6cb866227f4558cb512

Built Distribution

dozent-1.0.1-py3-none-any.whl (25.8 kB, Python 3)

  • SHA256: bc9340477679aba3be08f1c9da55aba062b46c42185858f1e8e1763cd094f416
  • MD5: 34cd86667be02e85c55956b1d2cee219
  • BLAKE2b-256: 5e686e7b625d07caf4779dd7da2aabf6743db78252f0d92506dc0ddea9d92ba7

Both files were uploaded via twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/53.0.0 requests-toolbelt/0.9.1 tqdm/4.56.2 CPython/3.9.1, without Trusted Publishing.
