redditdumps

Lightweight Python utilities for processing Reddit data dumps in ZST format.

These dumps are commonly distributed via Academic Torrents (the Pushshift archives). Each file is newline-delimited JSON (one object per line) compressed with Zstandard.
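To make the file format concrete, here is a minimal sketch of streaming such a file by hand. It is not the package's implementation: `iter_ndjson` and `iter_zst` are hypothetical helper names, and the sketch assumes the third-party `zstandard` package is installed. The enlarged `max_window_size` is an assumption based on the long compression window these archives are typically written with.

```python
import io
import json
from typing import BinaryIO, Iterator


def iter_ndjson(stream: BinaryIO) -> Iterator[dict]:
    """Yield one parsed JSON value per non-empty line of a binary stream."""
    text = io.TextIOWrapper(stream, encoding="utf-8", errors="replace")
    for line in text:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the scan


def iter_zst(path: str) -> Iterator[dict]:
    """Stream records from a Zstandard-compressed NDJSON file (hypothetical helper)."""
    import zstandard  # third-party dependency, imported lazily

    with open(path, "rb") as fh:
        # Assumption: the dump needs a larger window than the library default.
        dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
        with dctx.stream_reader(fh) as reader:
            yield from iter_ndjson(reader)
```

Because both helpers are generators, records are decoded one line at a time, so memory use stays flat even for multi-gigabyte dumps.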

Installation

With uv:

uv add redditdumps

Or with pip:

pip install redditdumps

Usage

Read a ZST file into a DataFrame

import redditdumps as rd

# Read entire file
df = rd.read_zst("RC_2024-01.zst")

# Filter by subreddit
df = rd.read_zst("RC_2024-01.zst", subreddit="science")

# Filter by multiple subreddits
df = rd.read_zst("RC_2024-01.zst", subreddit=["science", "askscience"])

# Select specific columns
df = rd.read_zst("RC_2024-01.zst", columns=["author", "body", "score"])

# Combine filters
df = rd.read_zst(
    "RC_2024-01.zst",
    subreddit="python",
    columns=rd.MINIMAL_COMMENT_COLUMNS,
    max_lines=100000,
)
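As a rough mental model of how these options combine, the sketch below applies the same three ideas to an in-memory iterable of records. `filter_records` is a hypothetical helper, not part of `redditdumps`, and two of its behaviours are assumptions: subreddit matching is case-insensitive, and `max_lines` caps the number of input lines read rather than the number of matching rows returned.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional, Sequence, Union


def filter_records(
    records: Iterable[dict],
    subreddit: Optional[Union[str, Sequence[str]]] = None,
    columns: Optional[Sequence[str]] = None,
    max_lines: Optional[int] = None,
) -> Iterator[dict]:
    """Filter by subreddit, project onto columns, and cap the lines read."""
    if isinstance(subreddit, str):
        wanted = {subreddit.lower()}
    elif subreddit is not None:
        wanted = {name.lower() for name in subreddit}
    else:
        wanted = None  # no subreddit filter

    # Assumption: max_lines limits input lines, not matched records.
    for record in islice(records, max_lines):
        if wanted is not None:
            if str(record.get("subreddit", "")).lower() not in wanted:
                continue
        if columns is not None:
            # Missing fields become None so every row has the same keys.
            record = {name: record.get(name) for name in columns}
        yield record
```

Filtering before projecting means `subreddit` does not have to appear in `columns` for the filter to work.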

Inspect file schema

# Discover columns in a file
schema = rd.inspect_schema("RC_2024-01.zst", sample_size=1000)
for col, info in schema.items():
    print(f"{col}: {info['type']} ({info['count']} records)")
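One way such a schema summary could be computed is sketched below, assuming a sample-based scan that records how often each field appears and which Python type it most often holds. `sketch_schema` is a hypothetical name, and the returned dictionary only mirrors the `type` and `count` keys used in the example above; the package's own inference may differ.

```python
from collections import Counter, defaultdict
from itertools import islice
from typing import Dict, Iterable


def sketch_schema(records: Iterable[dict], sample_size: int = 1000) -> Dict[str, dict]:
    """Summarise field names, dominant value types, and occurrence counts."""
    counts: Counter = Counter()
    types: Dict[str, Counter] = defaultdict(Counter)

    # Only the first `sample_size` records are examined.
    for record in islice(records, sample_size):
        for key, value in record.items():
            counts[key] += 1
            types[key][type(value).__name__] += 1

    return {
        key: {
            "type": types[key].most_common(1)[0][0],  # most frequent type name
            "count": counts[key],
        }
        for key in counts
    }
```

A `count` below the sample size signals a field that is absent from some records, which is worth knowing before selecting it as a column.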

Built-in column schemas

# Common column sets for convenience
rd.COMMENT_COLUMNS             # All standard comment fields
rd.SUBMISSION_COLUMNS          # All standard submission fields
rd.MINIMAL_COMMENT_COLUMNS     # Lightweight subset for comments
rd.MINIMAL_SUBMISSION_COLUMNS  # Lightweight subset for submissions

File naming conventions

  • RC_*.zst - Reddit Comments
  • RS_*.zst - Reddit Submissions
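The naming convention can be checked programmatically before a file is opened. The sketch below uses a hypothetical `classify_dump` helper that is not part of `redditdumps`. It assumes the part after the prefix is a `YYYY-MM` month, as in `RC_2024-01.zst`; names that follow the `RC_*`/`RS_*` pattern with a different suffix would be rejected.

```python
import re
from pathlib import Path
from typing import Tuple

# Assumption: dump names look like RC_YYYY-MM.zst or RS_YYYY-MM.zst.
_DUMP_NAME = re.compile(r"^(RC|RS)_(\d{4})-(\d{2})\.zst$")


def classify_dump(path: str) -> Tuple[str, int, int]:
    """Return (kind, year, month) for a dump file name, or raise ValueError."""
    match = _DUMP_NAME.match(Path(path).name)
    if match is None:
        raise ValueError(f"not a recognised dump file name: {path!r}")
    prefix, year, month = match.groups()
    kind = "comments" if prefix == "RC" else "submissions"
    return kind, int(year), int(month)
```

The returned kind could then be used to pick between the comment and submission column sets.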
