Twicorder Search

A Twitter crawler for Python 3 based on Twitter’s public API.

Supported endpoints

Twicorder Search currently supports the following endpoints:

/1.1/followers/ids
/1.1/friends/list
/1.1/search/tweets
/1.1/statuses/lookup
/1.1/statuses/user_timeline
/1.1/users/lookup

To add a new endpoint, fork the repository and add a new query type to src/twicorder/queries/request/endpoints. New endpoints should inherit from BaseQuery or one of its derivatives and must implement name, endpoint and result_type.
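As a rough illustration of that contract, the sketch below defines a local stand-in for BaseQuery (the real class lives in the Twicorder source and will differ) and a subclass that supplies the three required members. The class name FriendsIDs, the choice of /1.1/friends/ids and the result_type value are assumptions made for the example, not taken from Twicorder.

```python
from abc import ABC, abstractmethod


class BaseQuery(ABC):
    """Stand-in for Twicorder's BaseQuery; NOT the real class."""

    @property
    @abstractmethod
    def name(self) -> str:
        """Short name used to refer to the query type."""

    @property
    @abstractmethod
    def endpoint(self) -> str:
        """API path the query is sent to."""

    @property
    @abstractmethod
    def result_type(self) -> str:
        """Kind of object the endpoint returns."""


class FriendsIDs(BaseQuery):
    """Hypothetical new endpoint: IDs of the accounts a user follows."""

    name = "friends_ids"
    endpoint = "/1.1/friends/ids"
    result_type = "user_id"  # assumed value; check the source for valid ones


query = FriendsIDs()
print(query.name, query.endpoint, query.result_type)
```

With this stand-in, a subclass that leaves out any of the three members cannot be instantiated, which mirrors the "must implement" requirement.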

Installation

Twicorder Search can be installed using pip:

pip install twicorder-search

For a more comprehensive guide using a virtual environment, see Installation using Python 3 virtual environments.

Running Twicorder

After installing, there will be a new executable available, twicorder. Use this to run the application:

twicorder
Usage: twicorder [OPTIONS] COMMAND [ARGS]...

  Twicorder Search

Options:
  --project-dir TEXT  Root directory for project
  --help              Show this message and exit.

Commands:
  run    Start crawler
  utils  Utility functions

The project directory is where Twicorder stores temporary files and logs. To specify a project directory other than the default, use the --project-dir flag:

twicorder --project-dir /path/to/my_project

If not provided, the current working directory is used.

Configuration

Twicorder can be configured by passing parameters on the command line or by setting environment variables. Each environment variable is named after its CLI counterpart: the prefix TWICORDER, the command, and the option name, upper-cased and joined with underscores.

Specifying a task generator with the CLI:

twicorder run --task-gen user_timeline

Specifying a task generator with an environment variable:

export TWICORDER_RUN_TASK_GEN="user_timeline"
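The mapping between CLI options and environment variables can be sketched as a small helper. The function env_var_name is hypothetical and only encodes the pattern visible in the examples in this document (TWICORDER_RUN_TASK_GEN, TWICORDER_RUN_CONSUMER_KEY); it is not part of Twicorder.

```python
def env_var_name(command: str, option: str) -> str:
    """Build the environment variable name for a CLI option.

    Assumed pattern: TWICORDER_<COMMAND>_<OPTION>, upper-cased, with the
    leading dashes dropped and the remaining dashes turned into underscores.
    """
    option_part = option.lstrip("-").replace("-", "_")
    return f"TWICORDER_{command}_{option_part}".upper()


print(env_var_name("run", "--task-gen"))      # TWICORDER_RUN_TASK_GEN
print(env_var_name("run", "--consumer-key"))  # TWICORDER_RUN_CONSUMER_KEY
```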

Full list of CLI options:

twicorder run --help
Usage: twicorder run [OPTIONS]

  Start crawler

Options:
  --consumer-key TEXT             Twitter consumer key  [required]
  --consumer-secret TEXT          Twitter consumer secret  [required]
  --access-token TEXT             Twitter access token  [required]
  --access-secret TEXT            Twitter access secret  [required]
  --out-dir TEXT                  Custom output dir for crawled data
  --out-extension TEXT            File extension for crawled files (.txt or
                                  .zip)
  --task-file TEXT                Yaml file containing tasks to execute
  --full-user-mentions            For mentions, look up full user data
  --appdata-token TEXT            App data token
  --user-lookup-interval INTEGER  Minutes between lookups of the same user
                                  [default: 15]
  --appdata-timeout FLOAT         Seconds to timeout for internal data store
                                  [default: 5.0]
  --task-gen <TEXT TEXT>...       Task generator(s) to use. Example: "user_id
                                  name_pattern=/tmp/**/*_ids.txt,delimiter=,"
                                  [default: config]
  --remove-duplicates             Ensures duplicated tweets/users are not
                                  recorded. Saves space, but can slow down the
                                  crawler.  [default: True]
  --help                          Show this message and exit.

Task generators

Twicorder can be configured with one or more task generators for creating API requests.

Tasks file

The tasks file is the default task generator for Twicorder and is used when no generator is specified. By default, Twicorder searches the project root for a file called tasks.yaml.

PROJECT_ROOT
 ├── appdata
 │   └── twicorder.sql
 ├── logs
 │   └── twicorder.log
 └── tasks.yaml

It is, however, possible to specify a different file path using the --task-file option of the run command:

twicorder run --task-file /path/to/my_file.yaml

Example tasks.yaml file

Use this file to define the queries you wish to run and where to store their output data, relative to the output directory. Frequency is given in minutes and defines how often a new scan will be triggered for the given query.

# Tasks
#
# Queries are added in the form listed below.
#
# free_search:                  # endpoint name
#   - frequency: 60             # Interval between repeating queries in minutes
#     output: github/mentions   # Output directory, relative to project directory
#     kwargs:                   # Keyword Arguments to feed to endpoint
#       q: "@github"            #   "q" for "query" in the case of free_search
#
# See https://developer.twitter.com/en/docs/tweets/search/guides/standard-operators
# for how to use free search to find mentions, replies, hashtags etc.
#
# See https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets
# for keyword arguments to use with search.
#
# See https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline
# for keyword arguments to use with user timelines.

user_timeline:
  - frequency: 60
    output: "github/timeline"
    kwargs:
      screen_name: "github"
  - frequency: 120
    output: "nasa/timeline"
    kwargs:
      screen_name: "NASA"

free_search:
  - frequency: 60
    output: "github/mentions"
    kwargs:
      q: "@github"
  - frequency: 60
    output: "github/replies"
    kwargs:
      q: "to:github"
  - frequency: 60
    output: "github/hashtags"
    kwargs:
      q: "#github"
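To make the meaning of frequency concrete, here is a minimal sketch of how a scheduler could decide whether a query is due for another scan. It uses a dictionary shaped like a parsed tasks.yaml; the is_due helper and the timestamps are illustrative assumptions, not Twicorder's actual scheduling code.

```python
from datetime import datetime, timedelta

# Same structure as a parsed tasks.yaml: endpoint name -> list of queries.
tasks = {
    "user_timeline": [
        {"frequency": 60, "output": "github/timeline",
         "kwargs": {"screen_name": "github"}},
        {"frequency": 120, "output": "nasa/timeline",
         "kwargs": {"screen_name": "NASA"}},
    ],
}


def is_due(task: dict, last_run: datetime, now: datetime) -> bool:
    """Return True once `frequency` minutes have passed since the last scan."""
    return now - last_run >= timedelta(minutes=task["frequency"])


last_run = datetime(2021, 9, 6, 12, 0)
now = datetime(2021, 9, 6, 13, 30)  # 90 minutes after the last scan
for task in tasks["user_timeline"]:
    print(task["output"], is_due(task, last_run, now))
```

After 90 minutes the 60-minute query is due again, while the 120-minute query is not.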

User Lookup

The User Lookup generator takes one or more files with delimited user IDs or user names as input. It then generates tasks to fetch the user object for each ID or user name.

twicorder run --task-gen user_lookups name_pattern=/taskgen/*.txt,lookup_method=username

Keyword Argument	Type	Description
`name_pattern`	`str`	POSIX style search pattern
`delimiter`	`str`	Default: `"\n"`
`lookup_method`	`str`	`"id"` or `"username"`
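The sketch below shows one plausible reading of name_pattern and delimiter: every file matching the pattern is read and split on the delimiter. The read_entries helper is hypothetical and the generator's real parsing may differ, for example in how it treats whitespace or empty entries.

```python
import glob
import tempfile
from pathlib import Path


def read_entries(name_pattern: str, delimiter: str = "\n") -> list:
    """Collect non-empty entries from every file matching name_pattern."""
    entries = []
    for path in sorted(glob.glob(name_pattern, recursive=True)):
        text = Path(path).read_text(encoding="utf-8")
        entries.extend(e.strip() for e in text.split(delimiter) if e.strip())
    return entries


# Demo with a throwaway directory standing in for /taskgen.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("github\nNASA\n", encoding="utf-8")
    Path(tmp, "b.txt").write_text("python\n", encoding="utf-8")
    print(read_entries(str(Path(tmp, "*.txt"))))
```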

User Timeline

The User Timeline generator takes one or more files with delimited user IDs or user names as input. It then generates tasks to fetch the tweets on each user's timeline.

twicorder run --task-gen user_timeline name_pattern=/taskgen/*.txt,lookup_method=id,max_requests=5

Keyword Argument	Type	Description
`name_pattern`	`str`	POSIX style search pattern
`delimiter`	`str`	Default: `"\n"`
`lookup_method`	`str`	`"id"` or `"username"`
`max_requests`	`int`	Max number of requests before the task is considered done
`max_age`	`int`	Max age in days for a tweet before the query should be considered done
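The two limits max_requests and max_age can be read as stop conditions for paging through a timeline. The following sketch expresses that reading; timeline_done is a hypothetical helper, and treating an unset limit as "no limit" is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def timeline_done(
    requests_made: int,
    oldest_tweet: datetime,
    now: datetime,
    max_requests: Optional[int] = None,
    max_age: Optional[int] = None,
) -> bool:
    """Return True when either configured limit has been reached."""
    if max_requests is not None and requests_made >= max_requests:
        return True
    if max_age is not None and now - oldest_tweet > timedelta(days=max_age):
        return True
    return False


now = datetime(2021, 9, 6, tzinfo=timezone.utc)
recent = now - timedelta(days=2)
old = now - timedelta(days=40)
print(timeline_done(5, recent, now, max_requests=5))  # request limit reached
print(timeline_done(1, old, now, max_age=30))         # oldest tweet too old
print(timeline_done(1, recent, now, max_requests=5, max_age=30))
```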

Create new task generator

Twicorder supports creating custom task generators. To create a generator, create a class that inherits from BaseTaskGenerator and implement the name class attribute and the fetch() method. See twicorder/tasks/generators/user_lookup_generator.py for an example.

Place the custom task generator in a suitable directory and point to said directory with the environment variable TWICORDER_TASKGEN_PATH:

export TWICORDER_TASKGEN_PATH="/path/to/my/generator_dir"

The task generator file name must end in _generator.py:

$TWICORDER_TASKGEN_PATH
 ├── __init__.py
 └── custom_task_generator.py
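Before starting the crawler, it can be handy to check which files in the generator directory satisfy the naming rule. This is a small stand-alone check, not Twicorder's own discovery code; find_generator_files is a hypothetical helper and the temporary directory stands in for $TWICORDER_TASKGEN_PATH.

```python
import tempfile
from pathlib import Path


def find_generator_files(directory: str) -> list:
    """Return the file names in `directory` that end in _generator.py."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and p.name.endswith("_generator.py")
    )


# Demo directory; in practice pass the value of TWICORDER_TASKGEN_PATH.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("__init__.py", "custom_task_generator.py", "helpers.py"):
        Path(tmp, name).touch()
    print(find_generator_files(tmp))
```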

Clearing temporary files or logs

Use the utils command to clean up temporary files and logs:

twicorder utils --help
Usage: twicorder utils [OPTIONS]

  Utility functions

Options:
  --clear-cache  Clear cache and exit
  --purge-logs   Purge logs and exit
  --help         Show this message and exit.

Docker

Docker Compose Examples

Crawl data based on entries in the tasks file.

version: "3"

services:
  twicorder-search:
    build: ./
    image: twicorder-search:dev
    restart: unless-stopped
    container_name: twicorder-search
    network_mode: bridge
    environment:
      - TWICORDER_RUN_CONSUMER_KEY=XXXXXXXXXXXXXXXXXXXXXXXXX
      - TWICORDER_RUN_CONSUMER_SECRET=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
      - TWICORDER_RUN_ACCESS_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
      - TWICORDER_RUN_ACCESS_SECRET=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
      - TWICORDER_RUN_REMOVE_DUPLICATES=0
      - TWICORDER_RUN_APPDATA_TOKEN=search
    volumes:
      - /home/user/project/data:/data
      - /home/user/project/config:/config

Crawl tweets using the user_timeline task generator. The generator reads all *.txt files located in /home/user/project/taskgen on the host system (name_pattern=/taskgen/*.txt) and expects to find one user ID (lookup_method=id) per line. For each user, the number of result pages is limited to 5 (max_requests=5).

version: "3"

services:
  twicorder-timeline:
    build: ./
    image: twicorder-timeline:dev
    restart: on-failure
    container_name: twicorder-timeline
    network_mode: bridge
    environment:
      - TWICORDER_RUN_CONSUMER_KEY=XXXXXXXXXXXXXXXXXXXXXXXXX
      - TWICORDER_RUN_CONSUMER_SECRET=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
      - TWICORDER_RUN_ACCESS_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
      - TWICORDER_RUN_ACCESS_SECRET=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
      - TWICORDER_RUN_FULL_USER_MENTIONS=0
      - TWICORDER_RUN_REMOVE_DUPLICATES=0
      - TWICORDER_RUN_APPDATA_TOKEN=timeline
      - TWICORDER_RUN_TASK_GEN=user_timeline name_pattern=/taskgen/*.txt,lookup_method=id,max_requests=5
    volumes:
      - /home/user/project/data:/data
      - /home/user/project/taskgen:/taskgen
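The TWICORDER_RUN_TASK_GEN value used above is a generator name followed by a space and a comma-separated list of key=value pairs. When generating Compose files from a script, a helper along these lines could assemble it; task_gen_spec is a hypothetical function, and the format is inferred only from the examples in this document.

```python
def task_gen_spec(generator: str, **kwargs) -> str:
    """Join a generator name and its keyword arguments into one string."""
    args = ",".join(f"{key}={value}" for key, value in kwargs.items())
    return f"{generator} {args}" if args else generator


spec = task_gen_spec(
    "user_timeline",
    name_pattern="/taskgen/*.txt",
    lookup_method="id",
    max_requests=5,
)
print(f"TWICORDER_RUN_TASK_GEN={spec}")
```

Because pairs are joined with commas, this sketch does not handle values that themselves contain a comma.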

Release history

0.3.0: Sep 6, 2021 (this version)
0.2.9: May 20, 2019
0.2.7: May 19, 2019
0.2.6: May 14, 2019
0.2.5: May 14, 2019
0.2.4: May 14, 2019
0.2.2: May 12, 2019

