Extract Reddit threads and comment trees for research and analysis

Reddit Comment Harvester

Small Python utility for pulling Reddit threads (posts + comment trees) into structured Python objects or flat CSV for analysis.

Built for research workflows where you already have thread URLs and want repeatable exports of post metadata (title, subreddit, score) and comment data (authors, bodies, scores).

Quick disclaimer: You're responsible for complying with Reddit's Terms of Service and rate limits. This tool adds optional randomized delays to reduce request bursts.

Why This Exists

Most Reddit data extraction tools, such as PRAW, require API credentials, which limits access. This tool instead fetches the public thread HTML and parses the comment tree directly, giving you:

No API registration required
Full comment threads (titles, authors, scores)
Flat CSV export for immediate analysis
Optional randomized delays to be a good citizen

What it doesn't do: vote, post, access private communities, or handle deleted/removed comments (they're skipped).

Table of Contents

About
Getting Started
Usage
Configuration
Example Output
API Reference
CSV Format
Rate Limiting & Responsible Use
Contributing
License

About

Reddit Comment Harvester is a lightweight Python package for research workflows involving Reddit discussions. It extracts thread and comment data using HTML parsing, without requiring API authentication.

Data captured:

Thread: title, author, score, subreddit, post date, comment count
Comments: author, body text, score, depth in tree, comment date

Limitations: Deleted/removed comments are not captured. Comment nesting depth is preserved but trees are flattened in CSV export.
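
To illustrate what "flattened with depth preserved" can mean in practice, here is a minimal sketch of a depth-first walk over a nested comment tree. The `Node` class and `flatten` function are hypothetical stand-ins written for this example; they are not part of the package's API, and the package's internals may work differently.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical nested comment (not the package's Comment class)."""
    author: str
    body: str
    replies: List["Node"] = field(default_factory=list)


def flatten(nodes: List[Node], depth: int = 0) -> Iterator[Tuple[int, str, str]]:
    """Yield (depth, author, body) depth-first, each parent before its replies."""
    for node in nodes:
        yield depth, node.author, node.body
        yield from flatten(node.replies, depth + 1)


tree = [
    Node("jane_dev", "Great explanation!", [Node("mike_learn", "I disagree with point 2.")]),
    Node("sam_py", "Bookmarked."),
]
rows = list(flatten(tree))
for depth, author, body in rows:
    print(depth, author, body)
```

Each flat row keeps its depth, so the nesting level survives the export even though the parent/child links themselves do not.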

Getting Started

Installation

pip install reddit-comment-harvester

Or from source:

git clone https://github.com/wlyastn/reddit-comment-harvester.git
cd reddit-comment-harvester
pip install -e .

Quick Start

from reddit_url_harvester import RedditScraper

scraper = RedditScraper()
thread = scraper.scrape("https://reddit.com/r/python/comments/abc123/")

print(f"Title: {thread.title}")
print(f"Subreddit: {thread.subreddit}")
print(f"Score: {thread.score}")
print(f"Comments: {len(thread.comments)}")

Usage

Extract a Single Thread

from reddit_url_harvester import RedditScraper

scraper = RedditScraper()
thread = scraper.scrape("https://reddit.com/r/python/comments/abc123/")

Extract Multiple Threads

from reddit_url_harvester import RedditScraper

scraper = RedditScraper()

urls = [
    "https://reddit.com/r/python/comments/abc123/",
    "https://reddit.com/r/python/comments/def456/",
    "https://reddit.com/r/python/comments/ghi789/",
]

threads = scraper.scrape_batch(urls)
print(f"Scraped {len(threads)} threads")

Process URLs from CSV

from reddit_url_harvester import RedditScraper

scraper = RedditScraper()

results = scraper.scrape_csv(
    input_file="urls.csv",
    output_file="results.csv",
    url_column="URL"
)

print(f"Saved {len(results)} results to results.csv")

Example Output

Thread Object

After scraping a thread, you get a Thread object:

thread.title
# "Why Python is the best language for beginners"

thread.author
# "john_coder"

thread.subreddit
# "python"

thread.score
# 2847

thread.num_comments
# 156

thread.comments[0]
# Comment(
#   author='jane_dev',
#   body='Great explanation! Especially liked the...',
#   score=245,
#   depth=0
# )
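
Once the comments are in a list, ordinary Python is enough for basic analysis. The sketch below picks the highest-scoring top-level comments. The `Comment` dataclass here is a stand-in that only mirrors the fields shown in the example above (it is not imported from the package), and `top_level_by_score` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Comment:
    """Stand-in with the fields from the example above; not the package's class."""
    author: str
    body: str
    score: int
    depth: int


def top_level_by_score(comments: List[Comment], limit: int = 3) -> List[Comment]:
    """Return up to `limit` depth-0 comments, highest score first."""
    top_level = [c for c in comments if c.depth == 0]
    return sorted(top_level, key=lambda c: c.score, reverse=True)[:limit]


comments = [
    Comment("jane_dev", "Great explanation!", 245, 0),
    Comment("mike_learn", "I disagree with point 2.", 89, 1),
    Comment("sam_py", "Bookmarked.", 310, 0),
]
best = top_level_by_score(comments)
print([c.author for c in best])
```

With a real scrape, the same function should work on `thread.comments`, assuming those objects expose `score` and `depth` as in the example.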

CSV Export

When exported to CSV, each row represents one comment, and the post's metadata is repeated on every row:

url,title,subreddit,post_id,author,score,comment_author,comment_body,comment_score,comment_depth
https://reddit.com/r/python/comments/abc123/,Why Python is best...,python,abc123,john_coder,2847,jane_dev,"Great explanation! Especially liked...",245,0
https://reddit.com/r/python/comments/abc123/,Why Python is best...,python,abc123,john_coder,2847,mike_learn,"I disagree with point 2 because...",89,1

Configuration

Optional parameters for scraper behavior:

scraper = RedditScraper(
    timeout=60.0,           # Request timeout in seconds (default: 60.0)
    delay=True,             # Add random delays between requests (default: True)
    proxies=None            # Optional proxy config (default: None)
)

timeout: How long to wait for a response (seconds). Increase if you get timeouts on large threads.

delay: Adds 2–6 second random waits between requests. Recommended to keep enabled.

proxies: Use if you need to route requests through a proxy. Format: {"https": "http://proxy:8080"}
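
For readers curious what a randomized delay amounts to, the following is a minimal sketch of a 2 to 6 second random wait using only the standard library. The `random_delay` function and its parameters are hypothetical; this is not the package's implementation, only an illustration of the idea.

```python
import random
import time
from typing import Callable


def random_delay(low: float = 2.0, high: float = 6.0,
                 sleep: Callable[[float], None] = time.sleep) -> float:
    """Sleep for a uniformly random duration in [low, high] seconds and return it."""
    seconds = random.uniform(low, high)
    sleep(seconds)
    return seconds


# Pass a no-op sleep so this demonstration finishes instantly.
waited = random_delay(sleep=lambda _s: None)
print(f"Would have waited {waited:.2f} seconds")
```

Randomizing the wait avoids a fixed request rhythm, which spreads requests out more gently than a constant interval.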

Update configuration on an existing scraper:

scraper.set_timeout(45.0)
scraper.set_delay(True)
scraper.set_proxy({"https": "http://proxy.example.com:8080"})

API Reference

RedditScraper Class

scrape(url: str) -> Thread

Scrape a single Reddit thread or comment.

thread = scraper.scrape("https://reddit.com/r/python/comments/abc123/")

scrape_batch(urls: List[str], skip_errors: bool = True) -> List[Thread]

Scrape multiple URLs and return results.

threads = scraper.scrape_batch(urls, skip_errors=True)

scrape_csv(input_file: str, output_file: Optional[str] = None, url_column: str = "URL", skip_errors: bool = True) -> List[dict]

Scrape URLs from a CSV file and optionally save results.

results = scraper.scrape_csv("urls.csv", output_file="results.csv")

Data Models

Thread

Represents a Reddit thread with the following attributes:

thread.title          # str - Thread title
thread.subreddit      # str - Subreddit name
thread.author         # str - Post author username
thread.url            # str - Full Reddit URL
thread.post_id        # str - Reddit post ID
thread.score          # int - Post upvotes/score
thread.num_comments   # int - Number of comments (as used in Example Output)
thread.comments       # List[Comment] - List of comments

Comment

Represents a comment with the following attributes:

comment.author        # str - Comment author username
comment.body          # str - Comment text/content
comment.score         # int - Comment upvotes/score
comment.depth         # int - Nesting depth in the comment tree (0 = top level)
comment.timestamp     # str - Comment timestamp

CSV Format

Input Format

Pass a CSV file with a URL column:

URL
https://reddit.com/r/python/comments/abc123/
https://reddit.com/r/python/comments/def456/
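
If your URLs live in a Python list, an input file in this shape can be produced with the standard `csv` module. This is a small sketch that writes to an in-memory buffer so it runs anywhere; the variable names are illustrative.

```python
import csv
import io

urls = [
    "https://reddit.com/r/python/comments/abc123/",
    "https://reddit.com/r/python/comments/def456/",
]

# In-memory buffer for the demo; use open("urls.csv", "w", newline="") for a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["URL"], lineterminator="\n")
writer.writeheader()
for url in urls:
    writer.writerow({"URL": url})

csv_text = buffer.getvalue()
print(csv_text)
```

The header must match the `url_column` value passed to `scrape_csv`, which defaults to "URL".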

Output Format

Results are one comment per row (post metadata repeats):

url,title,subreddit,author,score,comment_author,comment_body,comment_score
https://reddit.com/r/python/comments/abc123/,Title,python,poster,250,commenter,"Nice post",85
https://reddit.com/r/python/comments/abc123/,Title,python,poster,250,other_user,"Disagree",12
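
Because post metadata repeats on every row, grouping by `url` recovers the per-thread view. The sketch below reads the sample rows above from an in-memory string with `csv.DictReader`; for a real export you would open the output file instead.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the output format shown above.
sample = '''url,title,subreddit,author,score,comment_author,comment_body,comment_score
https://reddit.com/r/python/comments/abc123/,Title,python,poster,250,commenter,"Nice post",85
https://reddit.com/r/python/comments/abc123/,Title,python,poster,250,other_user,"Disagree",12
'''

comments_by_url = defaultdict(list)
for row in csv.DictReader(io.StringIO(sample)):
    # CSV values are strings, so convert the score before doing arithmetic.
    comments_by_url[row["url"]].append((row["comment_author"], int(row["comment_score"])))

for url, comments in comments_by_url.items():
    total = sum(score for _author, score in comments)
    print(url, len(comments), total)
```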

Rate Limiting & Responsible Use

Important: You must comply with Reddit's Terms of Service and rate limits.

Best practices:

Keep delay=True (default). It adds 2–6 second waits to reduce request bursts.
Don't scrape the same content repeatedly. Cache results.
Stop immediately if you see 429 (Too Many Requests) errors.
Don't use this for spam, manipulation, or violating Reddit's policies.
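
One way to follow the caching advice is a small on-disk JSON cache keyed by URL. The sketch below is an assumption-laden illustration: `cached_fetch` and `fake_fetch` are hypothetical helpers, and the fetched value must be JSON-serializable, so a real thread object would likely need converting to a dict first.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict


def cached_fetch(url: str, fetch: Callable[[str], Dict], cache_file: Path) -> Dict:
    """Return the cached result for `url`, calling `fetch` only on a cache miss."""
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    if url not in cache:
        cache[url] = fetch(url)
        cache_file.write_text(json.dumps(cache))
    return cache[url]


calls = []


def fake_fetch(url: str) -> Dict:
    """Hypothetical stand-in for a real scrape; records each call."""
    calls.append(url)
    return {"url": url, "title": "Title"}


with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "cache.json"
    url = "https://reddit.com/r/python/comments/abc123/"
    first = cached_fetch(url, fake_fetch, cache_path)
    second = cached_fetch(url, fake_fetch, cache_path)

print(len(calls))
```

The second lookup is served from the cache file, so the fetch function runs only once per URL.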

If you get rate-limited:

scraper.set_timeout(90.0)  # Increase timeout
scraper.set_delay(True)     # Ensure delays are on
# Then try again after 10+ minutes

Contributing

Contributions are welcome.

License

MIT License, see LICENSE for details.

Disclaimer & Responsibility

This tool is provided as-is for research and analysis. You are responsible for:

Complying with Reddit's Terms of Service and any legal restrictions in your jurisdiction
Using appropriate rate limits and delays
Respecting Reddit's infrastructure and user privacy
Obtaining consent if needed for your intended use

The maintainers assume no liability for misuse or violations. Use responsibly.
