
PRAW-Powered COmmunity & DomaIn tArgeted Link Scraper

Project description

PRAW-CoDiaLS

A niche CLI tool built using the Python Reddit API Wrapper (PRAW) for Community & Domain-Targeted Link Scraping.

Written for Python 3 (3.6 or later required due to liberal use of f-strings). Third-party modules needed: praw, pyaml, and pandas.

Installation

PRAW-CoDiaLS is available from either this repository or via PyPI:

Recommended (from PyPI):

$ pip install praw-codials

Alternatively, install a downloaded wheel:

$ pip install praw-codials.whl

Usage

Valid Reddit OAuth credentials are required for usage. See Reddit's guide on how to obtain and set these up. In short, you will need to provide a client_id, client_secret, username, password, and user_agent.
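If you prefer the file-based form of `-o`, a key/value YAML file along these lines should work (the keys follow the order listed for `-o`; all values shown are placeholders, not real credentials):

```yaml
# oauth.yaml -- placeholder credentials; substitute your own Reddit app values
client_id: YOUR_CLIENT_ID
client_secret: YOUR_CLIENT_SECRET
password: YOUR_REDDIT_PASSWORD
username: YOUR_REDDIT_USERNAME
user_agent: "praw-codials script by u/YOUR_USERNAME"
```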

usage: praw-codials [-h] -s list,of,subs -d list,of,domains -o client_id,client_secret,password,username,user_agent [-p /path/to/save/output/] [-l LIMIT] [--new] [--hot] [--top TIMEFRAME] [--controversial TIMEFRAME] [--quiet] [--nocomments]

_Python Reddit API Wrapper (PRAW) for Community & Domain-Targeted Link Scraping._

  -h, --help            show this help message and exit
  -s SUBS, --subs SUBS  Subreddit(s) to target. (Comma-separate multiples)
  -d DOMAINS, --domains DOMAINS
                        Domain(s) to collect URLs from. (Comma-separate multiples)

  -o OAUTH, --oauth OAUTH
                        OAuth information, either comma separated values in order (client_id, client_secret, password, username, 
                        user_agent) or a path to a key/value file in YAML format.
 
  -p PATH, --path PATH  Path to save output files (Posts_[DATETIME].csv and Comments_[DATETIME].csv). Default: working directory
  -l LIMIT, --limit LIMIT
                        Maximum threads to check (cannot exceed 1000).
  -t TOP, --top TOP     Search top threads. Specify the timeframe to consider (hour, day, week, month, year, all)
  -c CONTROVERSIAL, --controversial CONTROVERSIAL
                        Search controversial threads. Specify the timeframe to consider (hour, day, week, month, year, all)
  --hot                 Search hot posts.
  -n, --new             Search new posts.
  -q, --quiet           Suppress progress reports until jobs are complete.
  -x, --nocomments      Don't collect links in top-level comments. Reduces performance limitations caused by the Reddit API rate limit.
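For example, a run targeting two subreddits and one domain might look like the following (the subreddit and domain names are illustrative, and `oauth.yaml` is a placeholder credentials file as described above):

```shell
$ praw-codials -s python,learnpython -d github.com \
    -o oauth.yaml -p ./output/ -l 500 --new --top all
```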

Implementation Details

By default, this tool will return URLs collected from both link submissions (the main post for each thread) and the top-level comments of either text or link submissions (self/link posts), but not their children. This can be disabled at the command line with --nocomments. In a future update, I plan to provide an argument for setting a comment recursion depth; however, any such feature will drastically impact performance due to the Reddit API rate limit.

On that train of thought, please note that Reddit enforces rate limits, which means this script will likely check between 80 and 100 pieces of content per minute. To improve performance, the script opens multiple PRAW instances and uses Python's threading facilities for a modest speedup. In my limited testing, this improved throughput by approximately 33%, from ~65 posts/min to ~85 posts/min, when enabling all subreddit search methods (hot/top (all)/new/controversial (all)) with the default post limit (1000) across two subreddits and two domains. That amounts to checking approximately 8K posts and tens of thousands of comments.
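The multi-instance threading approach can be sketched as follows. This is a minimal, self-contained illustration, not the tool's actual code: `fetch_listing` is a hypothetical stand-in for the PRAW calls, and the subreddit names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a PRAW listing fetch. In the real tool, each
# worker would own its own praw.Reddit instance and pull one listing
# (hot/new/top/controversial) for one subreddit.
def fetch_listing(job):
    subreddit, method = job
    return [f"{method}-post-{i}-in-{subreddit}" for i in range(3)]

# One job per (subreddit, listing method) pair.
jobs = [(sub, method)
        for sub in ("python", "learnpython")          # illustrative subreddits
        for method in ("hot", "new", "top", "controversial")]

# I/O-bound API calls overlap while each thread waits on Reddit's responses.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch_listing, jobs))

# Flatten the per-listing results into one post list.
posts = [p for listing in results for p in listing]
print(len(posts))  # 8 jobs x 3 posts each = 24
```

Because the workload is network-bound rather than CPU-bound, threads sidestep the GIL well enough here; the rate limit, not Python, remains the bottleneck.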

To further limit requests, the tool minimizes the number of comments it could access twice (e.g. a thread appearing in both Top and Hot) by storing the submission and comment IDs it has already encountered.
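That deduplication idea reduces to something like the following sketch (the tool's actual data structures aren't documented, so the names here are illustrative):

```python
# IDs already processed across all listing methods.
seen_submissions = set()

def process_once(submission_id, handler):
    """Run handler only the first time an ID is encountered."""
    if submission_id in seen_submissions:
        return False  # already handled via another listing (e.g. Hot vs Top)
    seen_submissions.add(submission_id)
    handler(submission_id)
    return True

collected = []
# "abc123" appears in two listings but is only processed once.
for sid in ["abc123", "def456", "abc123"]:
    process_once(sid, collected.append)

print(collected)  # ['abc123', 'def456']
```

Each skipped duplicate saves not just one request but the whole fan-out of comment fetches for that thread, which is where the rate-limit savings come from.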

Output reports the following statistics as columns of two separate multi-row CSV files (one for submissions and one for comments, if included):

  • Submissions: post author, post ID, title, URL, subreddit, score, upvote ratio (note: Reddit approximates/obfuscates these), and post flair
  • Comments: comment author, comment ID, body (including Markdown), subreddit, score, all of the above attributes as they pertain to the comment's parent submission/thread, and URLs obtained by simple RegEx (multiple entries/rows are generated if multiple links matching the target domain(s) are found in the text body)
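The "simple RegEx" extraction of domain-matching links from a comment body can be approximated like so (a sketch under stated assumptions; the tool's actual pattern is not shown here, and `extract_links` is an illustrative name):

```python
import re

def extract_links(body, domains):
    """Return every URL in a Markdown body whose host matches a target domain."""
    # Grab candidate URLs, stopping at whitespace and common Markdown delimiters.
    urls = re.findall(r"https?://[^\s)\]>\"']+", body)
    keep = []
    for url in urls:
        host = re.match(r"https?://([^/]+)", url).group(1).lower()
        # Keep exact matches and subdomains of each target domain.
        if any(host == d or host.endswith("." + d) for d in domains):
            keep.append(url)
    return keep

body = "See [this](https://github.com/praw-dev/praw) and https://example.com/x"
print(extract_links(body, ["github.com"]))  # ['https://github.com/praw-dev/praw']
```

Matching on the host rather than the raw string avoids false positives such as `evil.com/github.com` while still catching subdomains like `gist.github.com`.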

If you think that I've missed an important attribute, please let me know!

License

PRAW-CoDiaLS is released under the MIT License. See LICENSE for details.

Contact

To report issues or contribute to this project, please contact me on the GitHub repo for this project.

Download files

Download the file for your platform.

Source Distribution

praw-codials-1.0.0.tar.gz (12.2 kB)

Built Distribution


praw_codials-1.0.0-py3-none-any.whl (11.8 kB)

File details

Details for the file praw-codials-1.0.0.tar.gz.

File metadata

  • Download URL: praw-codials-1.0.0.tar.gz
  • Upload date:
  • Size: 12.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.26.0 requests-toolbelt/0.9.1 urllib3/1.26.7 tqdm/4.62.3 importlib-metadata/4.8.1 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.8.12

File hashes

Hashes for praw-codials-1.0.0.tar.gz:

  • SHA256: 7f7d4426cd08c4245fb7e1ad9123ba12b8167a08a84bc6cff707852f8cd1ffbe
  • MD5: 0607694049d352c2ea5e210062730cbc
  • BLAKE2b-256: 29482da244bfb39c8baa6ec96131935fc5858efe66956743ecd21ec3726bc823


File details

Details for the file praw_codials-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: praw_codials-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 11.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.26.0 requests-toolbelt/0.9.1 urllib3/1.26.7 tqdm/4.62.3 importlib-metadata/4.8.1 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.8.12

File hashes

Hashes for praw_codials-1.0.0-py3-none-any.whl:

  • SHA256: cf96fd1b1898b39093b523a90b3987e2e4d2689b8ba6a24272e7c81a850b7157
  • MD5: 3668bf0f972913dd98f6f14c70936a01
  • BLAKE2b-256: fd42b84fc5fd85ca6063cfbbe88445fb948d8e77208d27693362ed023d3969c2

