Dozent

Dozent is a powerful downloader for fetching large volumes of Twitter data from the Internet Archive.

It is built on top of PySmartDL and multithreading, much like traditional download accelerators such as axel, aria2c, and aws s3, so that the biggest bottlenecks are your network and your hardware rather than the downloader itself.
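To illustrate the general idea behind download accelerators, here is a minimal sketch of how a large file can be split into byte ranges that separate threads could fetch in parallel with HTTP `Range` requests. This is not taken from Dozent or PySmartDL; the function names `split_byte_ranges` and `range_header` are hypothetical and exist only for this example.

```python
from typing import Dict, List, Tuple


def split_byte_ranges(total_size: int, num_parts: int) -> List[Tuple[int, int]]:
    """Split a file of total_size bytes into contiguous, inclusive (start, end) ranges."""
    if total_size <= 0 or num_parts <= 0:
        raise ValueError("total_size and num_parts must be positive")
    # Never create more parts than there are bytes, so no range is empty.
    num_parts = min(num_parts, total_size)
    base, extra = divmod(total_size, num_parts)
    ranges = []
    start = 0
    for i in range(num_parts):
        # The first `extra` parts each take one additional byte.
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length - 1))
        start += length
    return ranges


def range_header(start: int, end: int) -> Dict[str, str]:
    """Build the HTTP header asking a server for one inclusive byte range."""
    return {"Range": f"bytes={start}-{end}"}
```

Each worker thread would request one range and write it at the matching offset in the output file, so throughput is limited mainly by the network and the disk.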

If you have any ideas on how to make this even faster, please open an issue and tell us how!

Getting Started

To get started, follow the Getting Started section of our main README.

Usage

Here's the help output from dozent:

$ python -m dozent --help

usage: __main__.py [-h] -s START_DATE -e END_DATE [-t TIMEIT]
                 [-o OUTPUT_DIRECTORY] [-q]

A powerful downloader to get tweets from twitter for our compute. The first
step of many

optional arguments:
  -h, --help            show this help message and exit
  -s START_DATE, --start-date START_DATE
                        The date from where we download. The format must be:
                        YYYY-MM-DD
  -e END_DATE, --end-date END_DATE
                        The last day that we download. The format must be:
                        YYYY-MM-DD
  -t TIMEIT, --timeit TIMEIT
                        Show total program runtime
  -o OUTPUT_DIRECTORY, --output-directory OUTPUT_DIRECTORY
                        Output Directory where the file will be stored.
                        Defaults to the data/ directory
  -q, --quiet           Turn off output (except for errors and warnings)
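As a rough illustration of the interface described by the help text, the sketch below parses the same date, output directory, and quiet options with `argparse`. It is a reconstruction from the help output, not Dozent's actual source: the names `parse_day`, `build_parser`, and `parse_args` are hypothetical, the `-t` option is left out, and the check that the end date is not earlier than the start date is an assumption.

```python
import argparse
from datetime import date, datetime


def parse_day(text: str) -> date:
    """Convert a YYYY-MM-DD string into a date, or raise an argparse error."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected YYYY-MM-DD, got {text!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options listed in the help text (except -t)."""
    parser = argparse.ArgumentParser(prog="dozent-sketch")
    parser.add_argument("-s", "--start-date", type=parse_day, required=True)
    parser.add_argument("-e", "--end-date", type=parse_day, required=True)
    parser.add_argument("-o", "--output-directory", default="data/")
    parser.add_argument("-q", "--quiet", action="store_true")
    return parser


def parse_args(argv):
    """Parse argv and reject ranges that end before they start (an assumption)."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.end_date < args.start_date:
        parser.error("end date must not be earlier than start date")
    return args
```

For example, `parse_args(["-s", "2020-05-12", "-e", "2020-05-15"])` yields two `date` objects and the default output directory `data/`.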

Example

Here's an example of how the project works:

The general workflow we envision is that the user downloads the files for the days they're interested in, preprocesses them for the specifics they're looking for, and then runs more complex algorithms on top of that.

$ python -m dozent -s 2020-05-12 -e 2020-05-15

Queueing tweets download for 05-2020
Queueing tweets download for 05-2020
Queueing tweets download for 05-2020
Queueing tweets download for 05-2020
https://archive.org/download/archiveteam-twitter-stream-2020-05/twitter_stream_2020_05_13.tar [downloading] 16 Mb / 2498 Mb @ 1.6 MB/s [------------------] [0%, 32 minutes, 31 seconds left]
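The sample run above can be approximated with a short sketch that expands a start and end date into individual days and builds one candidate URL per day. The URL pattern is an assumption inferred from the single URL shown in the output; other months on the Internet Archive may be organized differently. The names `URL_TEMPLATE`, `days_between`, and `candidate_url` are hypothetical.

```python
from datetime import date, timedelta
from typing import List

# Assumed pattern, inferred from the one URL in the sample output above.
URL_TEMPLATE = (
    "https://archive.org/download/"
    "archiveteam-twitter-stream-{y:04d}-{m:02d}/"
    "twitter_stream_{y:04d}_{m:02d}_{d:02d}.tar"
)


def days_between(start: date, end: date) -> List[date]:
    """Return every day from start to end, inclusive."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def candidate_url(day: date) -> str:
    """Build the assumed archive URL for a single day."""
    return URL_TEMPLATE.format(y=day.year, m=day.month, d=day.day)


if __name__ == "__main__":
    for day in days_between(date(2020, 5, 12), date(2020, 5, 15)):
        print(candidate_url(day))
```

The range 2020-05-12 to 2020-05-15 expands to four days, which would be consistent with the four "Queueing" lines in the sample output.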
