
NASTY Advanced Search Tweet Yielder

Project description

NASTY is a tool/library for retrieving Tweets via the Twitter Web UI. Instead of using the Twitter Developer API it works by acting like a normal web browser accessing Twitter. That is, it sends AJAX requests and parses Twitter’s JSON responses. This approach makes it substantially different from the other popular crawlers and allows for the following features:

  • Search for Tweets by keyword (and filter by latest/top/photos/videos, date of authorship, and language).

  • Retrieve all direct replies to a Tweet.

  • Retrieve all Tweets threaded under a Tweet.

  • Return fully-hydrated JSON objects of Tweets that exactly match the extended mode of the developer API.

  • Operate in batch mode to execute a large set of requests, abort at any time, and rerun both uncompleted and failed requests.

  • Transform collected Tweets into sets of Tweet-IDs for publishing datasets. Automatically download full Tweet information from sets of Tweet-IDs.

  • Written in tested, linted, and fully type-checked Python code.

Installation

Python 3.6, 3.7, 3.8 and PyPy are currently supported. Install via:

$ pip install nasty

Next, you need to place the configuration file in a location where NASTY searches for it. For example:

$ mkdir -p .config
$ curl -o .config/nasty.toml https://raw.githubusercontent.com/lschmelzeisen/nasty/master/config-example.nasty.toml

The places where NASTY looks for the nasty.toml file are in order:

  • In a .config sub-directory of the current directory and all parent directories: ./.config/nasty.toml, ../.config/nasty.toml, etc.

  • In ${XDG_CONFIG_HOME}/nasty.toml and ${XDG_CONFIG_DIRS}/nasty.toml, if the respective environment variables exist. If they do not, NASTY defaults to ~/.config/nasty.toml and /etc/xdg/nasty.toml.

That’s it. For most operations you won’t need to modify the default settings at all.
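The search order described above can be sketched in a few lines of Python. This is only an illustration and not NASTY's actual implementation: the function name candidate_config_paths and its parameters are hypothetical, and treating XDG_CONFIG_DIRS as a colon-separated list is an assumption taken from the XDG convention.

```python
from pathlib import PurePosixPath
from typing import List, Mapping


def candidate_config_paths(
    cwd: PurePosixPath, env: Mapping[str, str], home: PurePosixPath
) -> List[PurePosixPath]:
    """Hypothetical helper: list nasty.toml locations in lookup order."""
    candidates = []

    # 1. .config sub-directory of the current directory and of each parent.
    for directory in [cwd, *cwd.parents]:
        candidates.append(directory / ".config" / "nasty.toml")

    # 2. XDG_CONFIG_HOME if set, otherwise ~/.config.
    xdg_home = env.get("XDG_CONFIG_HOME")
    base = PurePosixPath(xdg_home) if xdg_home else home / ".config"
    candidates.append(base / "nasty.toml")

    # 3. XDG_CONFIG_DIRS if set (assumed colon-separated), otherwise /etc/xdg.
    xdg_dirs = env.get("XDG_CONFIG_DIRS") or "/etc/xdg"
    for entry in xdg_dirs.split(":"):
        if entry:
            candidates.append(PurePosixPath(entry) / "nasty.toml")

    return candidates


paths = candidate_config_paths(
    PurePosixPath("/a/b"), {}, PurePosixPath("/home/u")
)
```

A real lookup would then return the first of these paths that exists on disk.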

Command Line Interface

To get help for the command line interface use the --help option:

$ nasty --help
usage: nasty [-h] [-v] [search|replies|thread|batch|idify|unidify] ...

NASTY Advanced Search Tweet Yielder.

Commands:
  The following commands (and abbreviations) are available, each supporting
  the help option. For example, try out `nasty search --help`.

  <COMMAND>
    search (s)         Retrieve Tweets using the Twitter advanced search.
    replies (r)        Retrieve all directly replying Tweets to a Tweet.
    thread (t)         Retrieve all Tweets threaded under a Tweet.
    batch (b)          Execute previously created batch of requests.
    idify (i, id)      Reduce Tweet-collection to Tweet-IDs (for publishing).
    unidify (u, unid)  Collect full Tweet information from Tweet-IDs (via
                       official Twitter API).

General Arguments:
  -h, --help           Show this help message and exit.
  -v, --version        Show program's version number and exit.
  --log-level <LEVEL>  Logging level (DEBUG, INFO, WARN, ERROR.)

You can also get help for the individual sub commands. For example, try out nasty search --help.

replies

You can fetch all direct replies to the Tweet with ID 332308211321425920:

$ nasty replies --tweet-id 332308211321425920

thread

You can fetch all Tweets threaded under the Tweet with ID 332308211321425920:

$ nasty thread --tweet-id 332308211321425920

batch

NASTY supports appending requests to a batch file instead of executing them immediately, so that they can be executed in batch mode later. The benefits of this include being able to track the progress of a large set of requests, aborting at any time, and rerunning both uncompleted and failed requests.

To append a request to a batch file, use the --to-batch argument on any of the above requests, for example:

$ nasty search --query "climate change" --to-batch batch.jsonl

To run all requests stored in a batch file and write the output to directory out/:

$ nasty batch --batch-file batch.jsonl --results-dir out/

When this command finishes, a tally of successful, skipped, and failed requests is printed. If any request failed, you may retry execution with the same command. Requests that succeeded will automatically be skipped.
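The skip-on-rerun behaviour can be pictured with a minimal sketch. It does not reflect how NASTY stores its results: the function run_batch, the marker-file naming, and the execute callback are all hypothetical and serve only to show why rerunning the same command is safe.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Sequence


def run_batch(
    requests: Sequence[Any],
    results_dir: Path,
    execute: Callable[[Any], Any],
) -> Dict[str, int]:
    """Hypothetical resumable batch runner that returns a tally."""
    results_dir.mkdir(parents=True, exist_ok=True)
    tally = {"succeeded": 0, "skipped": 0, "failed": 0}

    for index, request in enumerate(requests):
        # A marker file per request records that it already succeeded.
        marker = results_dir / f"{index}.done.json"
        if marker.exists():
            tally["skipped"] += 1
            continue
        try:
            result = execute(request)
        except Exception:
            # Failed requests leave no marker, so a rerun retries them.
            tally["failed"] += 1
            continue
        marker.write_text(json.dumps(result))
        tally["succeeded"] += 1

    return tally
```

Calling run_batch twice with the same arguments executes only those requests that have no marker file yet, which mirrors the retry workflow described above.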

idify / unidify

The Twitter Developer Policy states that for sharing collected Tweets with others, only Tweet-IDs may be (publicly) distributed (see Legal and Moral Considerations for more information).

To transform lines of Tweet-JSON-objects into lines of Tweet-IDs, use nasty idify. For example:

$ nasty search --query "climate change" | nasty idify > climate-change-tweet-ids.txt
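Conceptually, this transformation keeps only the ID of each Tweet. The following is a rough sketch, not the code behind nasty idify. It assumes that each input line is one Tweet JSON object and that the ID is stored in an id_str field, as in Tweet objects of the developer API. The function name tweet_ids is hypothetical.

```python
import json
from typing import Iterable, Iterator


def tweet_ids(lines: Iterable[str]) -> Iterator[str]:
    """Hypothetical helper: yield the Tweet-ID of each Tweet-JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Ignore blank lines between JSON objects.
        # Assumption: the ID lives in the "id_str" field of the object.
        yield json.loads(line)["id_str"]


example_lines = [
    '{"id_str": "332308211321425920", "full_text": "example"}',
    "",
]
ids = list(tweet_ids(example_lines))
```

Reading IDs as strings rather than integers avoids precision problems in tools that parse large numbers as floats.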

To perform the reverse, that is getting full Tweet information from just Tweet-IDs, use nasty unidify:

$ cat climate-change-tweet-ids.txt | nasty unidify

Note that unidify is implemented using the Twitter Developer API, since for this specific case the free API covers all needed functionality and its rate limits are not too limiting. This also means that this functionality is officially supported by Twitter, so the API should be stable over time (making it ideal for reproducing shared datasets of Tweets).

The downside is that you need to apply for API keys from Twitter (see Twitter Developers: Getting Started). After you have obtained your keys, provide them to NASTY in the [twitter_api] section of the nasty.toml configuration file.

Idify/unidify also support operating on batch results (and keep meta information, that is, which Tweets were the results of which requests). To idify batch results in directory out/:

$ nasty idify --in-dir out/ --out-dir out-idified/

To do the reverse:

$ nasty unidify --in-dir out-idified/ --out-dir out/

Python API

To fetch all German-language Tweets about “climate change” written before 14 January 2019:

import nasty
from datetime import datetime

tweet_stream = nasty.Search("climate change",
                            until=datetime(2019, 1, 14),
                            lang="de").request()
for tweet in tweet_stream:
    print(tweet.created_at, tweet.text)

Similar functionality is available in the nasty.Replies and nasty.Thread classes. The returned tweet_stream is an Iterable of nasty.Tweets.
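Because the result is an ordinary Iterable, standard-library tools such as itertools.islice can be used to look at only the first few Tweets. The sketch below uses a hypothetical stand-in generator, fake_tweet_stream, in place of a real request so that it runs without network access. Whether a real request stops fetching early when truncated this way is an assumption that depends on the stream being lazily evaluated.

```python
from itertools import islice
from typing import Iterator


def fake_tweet_stream() -> Iterator[str]:
    """Hypothetical stand-in for the iterable returned by .request()."""
    number = 0
    while True:
        yield f"tweet {number}"
        number += 1


# Take only the first three items instead of exhausting the stream.
first_three = list(islice(fake_tweet_stream(), 3))
```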

The batch functionality is available in the nasty.Batch class. To read the output of a batch execution (for example, from nasty batch) written to directory out/:

import nasty
from pathlib import Path

results = nasty.BatchResults(Path("out/"))
for entry in results:
    print("Tweets that matched query '{}' (completed at {}):"
          .format(entry.request.query, entry.completed_at))
    for tweet in results.tweets(entry):
        print("-", tweet)

Comprehensive Python API documentation is planned. For now, the existing code should be relatively easy to understand.

Contributing

Please feel free to submit bug reports and pull requests!

There are Makefile helpers to run the plethora of auxiliary development tools. See make help for detailed descriptions. The most important commands are:

usage: make <target>

Targets:
  help        Show this help message.
  devinstall  Install the project in editable mode with all test and dev dependencies (in a virtual environment).
  test        Run all tests and report test coverage.
  check       Run linters and perform static type-checking.
  format      Auto format all code.
  publish     Build and check source and binary distributions.
  clean       Remove all created cache/build files, test/coverage reports, and virtual environments.

Acknowledgements

License

Copyright 2019-2020 Lukas Schmelzeisen. Licensed under the Apache License, Version 2.0.
