A multithread Pushshift.io API Wrapper for reddit.com comment and submission searches.

PMAW: Pushshift Multithread API Wrapper

Description

PMAW is an ultra-minimalist wrapper for the Pushshift API that uses multithreading to retrieve Reddit comments and submissions. General usage is through the PushshiftAPI class, which provides methods for interacting with the different Pushshift endpoints; see the Pushshift Docs for details on the endpoints and their accepted parameters. Parameters are passed as keyword arguments when calling a method, and some methods have required parameters. When you call a method, PMAW makes all the API calls needed to complete the query before returning an array of values. The exception is search_submission_comment_ids, which returns a dictionary mapping each submission id to an array of comment ids.

The following three methods are currently supported:

  • Search Comments: search_comments
  • Search Submissions: search_submissions
  • Search Submission Comment IDs: search_submission_comment_ids

Getting Started

Installation

PMAW currently supports Python 3.5 or later. To install it via pip, run:

$ pip install pmaw

General Usage

from pmaw import PushshiftAPI
api = PushshiftAPI(num_workers=5)

Why Multithread?

Building large datasets from Reddit submission and comment data can require thousands of calls to the Pushshift API. The time your code takes to pull all this data is limited by both your network latency and the response time of the Pushshift server, which can vary throughout the day.

Existing API libraries such as PRAW and PSAW run requests sequentially, so thousands of API calls can take many hours to complete. Because API requests are I/O-bound, they benefit from being run concurrently across multiple threads. Intelligent rate limiting then minimizes both the number of rejected requests and the total time to completion.
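The benefit for I/O-bound work can be illustrated with a small sketch that does not touch Pushshift at all. It assumes a hypothetical fake_request function that simulates server latency with time.sleep, and compares a sequential loop against the standard library's ThreadPoolExecutor. This is only an illustration of the idea, not PMAW's internal implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor

LATENCY = 0.02  # assumed per-request latency in seconds (illustrative only)

def fake_request(item_id):
    """Hypothetical stand-in for an API call: waits, then echoes the id."""
    time.sleep(LATENCY)
    return item_id

def run_sequential(ids):
    # One request at a time: total time is roughly len(ids) * LATENCY
    return [fake_request(i) for i in ids]

def run_threaded(ids, num_workers=10):
    # Threads wait on I/O in parallel; executor.map keeps results in input order
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        return list(executor.map(fake_request, ids))

def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call to fn."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

ids = list(range(20))
seq_result, seq_time = timed(run_sequential, ids)
thr_result, thr_time = timed(run_threaded, ids)
print(f"sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
```

Both runs return the same results; only the waiting overlaps. Against a real server, the speed-up is additionally bounded by the rate limit.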

Features

Rate Limiting

PMAW intelligently rate limits concurrent requests to the Pushshift server so that they approach the target rate you provide.

Providing a rate_limit value is optional. It defaults to 60 requests per minute, which is the recommended value for interacting with the Pushshift API. Raising it above 60 increases both the number of rejected requests and the burden on the Pushshift server. The maximum recommended value is 100 requests per minute.

Additionally, the rate-limiting behaviour can be constrained with the max_sleep parameter, which sets the maximum period of time to sleep between requests.
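To make the relationship between rate_limit and max_sleep concrete, here is a minimal sketch of how a per-request pause could be derived from a target rate and then capped. The helper sleep_between_requests is hypothetical, and treating max_sleep as a value in seconds is an assumption; PMAW's actual rate-limiting logic may differ.

```python
def sleep_between_requests(rate_limit=60, max_sleep=None):
    """Seconds to pause between requests to average `rate_limit` per minute.

    Hypothetical helper, not part of PMAW. `max_sleep` is assumed to be
    in seconds; None means no cap is applied.
    """
    if rate_limit <= 0:
        raise ValueError("rate_limit must be positive")
    interval = 60.0 / rate_limit  # evenly spaced requests across one minute
    if max_sleep is not None:
        interval = min(interval, max_sleep)  # never sleep longer than the cap
    return interval

default_pause = sleep_between_requests()                        # 60 requests per minute
faster_pause = sleep_between_requests(rate_limit=100)           # recommended maximum
capped_pause = sleep_between_requests(rate_limit=1, max_sleep=10)
```

With the default of 60 requests per minute the pause is one second; at the recommended maximum of 100 it shrinks to 0.6 seconds, and a very low rate is cut off at max_sleep.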

Multithreading

The number of threads is set with the num_workers parameter. It is optional and defaults to 10, but you should provide a value, as the default may not be appropriate for your machine. Increasing the number of threads allows more concurrent requests to Pushshift, but with diminishing returns, because requests are still constrained by the rate limit. The optimal number of threads is between 10 and 20, depending on the current response time of the Pushshift server.

When selecting the number of threads, you can follow one of two methodologies:

  • The number of processors on the machine, multiplied by 5
  • The smaller of 32 and the number of processors plus 4

If you are unsure how many processors you have, use os.cpu_count().
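Both rules of thumb can be computed directly from os.cpu_count(). The sketch below is illustrative; the variable names are made up for this example, and the fallback to 1 covers the case where os.cpu_count() returns None.

```python
import os

# os.cpu_count() can return None when the count is undeterminable
cpu_count = os.cpu_count() or 1

# Rule 1: number of processors multiplied by 5
workers_times_five = cpu_count * 5

# Rule 2: the smaller of 32 and (number of processors + 4)
workers_capped = min(32, cpu_count + 4)

print(f"processors: {cpu_count}")
print(f"rule 1 (x5): {workers_times_five}")
print(f"rule 2 (min(32, n + 4)): {workers_capped}")
```

Either value can then be passed as num_workers when constructing PushshiftAPI. These two rules match the default max_workers values that Python's ThreadPoolExecutor has used in different Python versions.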

Unsupported

  • asc sort is unsupported
  • searching for submissions or comments by anything other than id is currently unsupported
  • aggs are unsupported, as PMAW is intended to be used for collecting large numbers of submissions or comments. Use PSAW for aggregation requests.

Feature Requests

  • For feature requests, please open an issue with the feature request label. This allows features to be better prioritized for future releases.

Examples

Comments

Search Comments by IDs

comment_ids = ['gjacwx5','gjad2l6','gjadatw','gjadc7w','gjadcwh',
  'gjadgd7','gjadlbc','gjadnoc','gjadog1','gjadphb']
comments_arr = api.search_comments(ids=comment_ids)

You can supply a single comment by passing its id to ids, either as a string or as an array of length 1.

Search Comment IDs by Submission ID

post_ids = ['kxi2w8','kxi2g1','kxhzrl','kxhyh6','kxhwh0',
  'kxhv53','kxhm7b','kxhm3s','kxhg37','kxhak9']
comment_id_dict = api.search_submission_comment_ids(ids=post_ids)

You can supply a single submission by passing its id to ids, either as a string or as an array of length 1.
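Because search_submission_comment_ids returns a dictionary mapping each submission id to an array of comment ids, a common next step is to flatten it into a single list. The sketch below uses a hypothetical helper, flatten_comment_ids, and made-up sample data in the documented shape; the ids are placeholders, not real Reddit ids.

```python
def flatten_comment_ids(comment_id_dict):
    """Collect every comment id from a {submission_id: [comment_ids]} mapping.

    Hypothetical helper for illustration; not part of PMAW.
    """
    return [cid for cids in comment_id_dict.values() for cid in cids]

# Made-up sample in the shape described above (placeholder ids)
sample_comment_id_dict = {
    'post_a': ['c1', 'c2'],
    'post_b': ['c3'],
    'post_c': [],  # a submission with no comments
}

all_comment_ids = flatten_comment_ids(sample_comment_id_dict)
print(all_comment_ids)
```

The flattened list could then be passed to api.search_comments(ids=all_comment_ids) to retrieve the comments themselves.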

Submissions

Search Submissions by IDs

post_ids = ['kxi2w8','kxi2g1','kxhzrl','kxhyh6','kxhwh0',
  'kxhv53','kxhm7b','kxhm3s','kxhg37','kxhak9']
posts_arr = api.search_submissions(ids=post_ids)

You can supply a single submission by passing its id to ids, either as a string or as an array of length 1.

License

PMAW is released under the MIT License. See the LICENSE file for more details.
