Scrape reddit posts into a single markdown file
Reddit to Markdown Scraper (rd2md)
This Python script uses PRAW (Python Reddit API Wrapper) to scrape interesting posts from a specified subreddit and save them in a formatted Markdown file. It also downloads and saves images associated with the posts.
Features
- Scrapes hot posts from a specified subreddit
- Filters posts based on score and whether they're stickied
- Downloads and saves images from posts
- Formats post content, including comments, into a Markdown file
- Handles both text posts and image posts
- Can be used as a standalone script or imported as a module
Prerequisites
- Python 3.12.3 or higher
- pip (Python package installer)
Installation
You can install rd2md directly from PyPI:
pip install rd-to-md
Setup
To use this script, you need to create a Reddit application to get the necessary credentials:
- Log in to your Reddit account
- Go to https://www.reddit.com/prefs/apps
- Scroll down and click "create another app..."
- Fill out the form:
  - Choose a name for your application
  - Select "script" as the app type
  - For "redirect uri", use http://localhost:8080
  - Add a description (optional)
- Click "create app"
After creating the app, note down the following:
- client_id: The string under "personal use script"
- client_secret: The string next to "secret"
Usage
As a Command-Line Tool
After installation, you can run rd2md from the command line:
rd2md --client_id=YOUR_CLIENT_ID --client_secret=YOUR_CLIENT_SECRET [options]
Options:
- --client_id: Your Reddit API client ID (required if not set as an environment variable)
- --client_secret: Your Reddit API client secret (required if not set as an environment variable)
- --user_agent: User agent for Reddit API (default: "praw_bot")
- --subreddit: Subreddit to scrape (default: "LocalLLaMA")
- --limit: Number of posts to scrape (default: 3)
Example:
rd2md --client_id=YOUR_CLIENT_ID --client_secret=YOUR_CLIENT_SECRET --subreddit=ProgrammingHumor --limit=10
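To make the option rules concrete, here is a minimal sketch of an argparse parser that mirrors the documented flags and defaults. It is an illustration only, not rd2md's actual source: the function name build_parser and its env parameter are hypothetical, and the fallback to REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET is assumed to behave as the option descriptions suggest.

```python
import argparse
import os


def build_parser(env=None):
    """Build a parser mirroring the documented options (illustrative only)."""
    env = os.environ if env is None else env
    client_id = env.get("REDDIT_CLIENT_ID")
    client_secret = env.get("REDDIT_CLIENT_SECRET")

    parser = argparse.ArgumentParser(prog="rd2md")
    # A credential is required on the command line only when its
    # environment variable is missing (assumed behavior).
    parser.add_argument("--client_id", default=client_id,
                        required=client_id is None)
    parser.add_argument("--client_secret", default=client_secret,
                        required=client_secret is None)
    parser.add_argument("--user_agent", default="praw_bot")
    parser.add_argument("--subreddit", default="LocalLLaMA")
    parser.add_argument("--limit", type=int, default=3)
    return parser


# Parse the same arguments as the example command above.
args = build_parser(env={}).parse_args([
    "--client_id=YOUR_CLIENT_ID",
    "--client_secret=YOUR_CLIENT_SECRET",
    "--subreddit=ProgrammingHumor",
    "--limit=10",
])
print(args.subreddit, args.limit)
```

Passing env explicitly keeps the sketch easy to try without touching the real environment; when env is omitted it reads os.environ.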
As an Importable Module
You can also use rd2md as a module in your Python code:
from rd2md import rd2md

# Scrape and save posts
filename, list_contents, list_images = rd2md(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    subreddit_name="ProgrammingHumor",
    limit=10
)
Using Environment Variables
Instead of passing the client ID and secret as arguments, you can set them as environment variables:
export REDDIT_CLIENT_ID=your_client_id
export REDDIT_CLIENT_SECRET=your_client_secret
Then you can run the script without these arguments:
rd2md --subreddit=ProgrammingHumor --limit=10
Output
The script creates a new directory named {subreddit}_posts_{date} in the current working directory. This directory contains:
- A Markdown file named interesting_posts.md with the scraped post content
- An images subdirectory containing any downloaded images
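The layout described above can be sketched with pathlib. This is a rough illustration rather than the script's own code: make_output_dirs is a hypothetical helper, and the ISO date format (YYYY-MM-DD) is an assumption, since the exact format of {date} is not stated here.

```python
import tempfile
from datetime import date
from pathlib import Path


def make_output_dirs(subreddit, base_dir=".", today=None):
    """Create <subreddit>_posts_<date>/images and return the three paths."""
    today = today or date.today()
    # Assumption: ISO date (YYYY-MM-DD); the real script's format may differ.
    out_dir = Path(base_dir) / f"{subreddit}_posts_{today.isoformat()}"
    images_dir = out_dir / "images"
    images_dir.mkdir(parents=True, exist_ok=True)
    # The Markdown file path is returned but not created here.
    markdown_path = out_dir / "interesting_posts.md"
    return out_dir, markdown_path, images_dir


# Try it in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    out_dir, markdown_path, images_dir = make_output_dirs(
        "ProgrammingHumor", base_dir=tmp, today=date(2024, 7, 1)
    )
    print(out_dir.name)
    print(markdown_path.name, images_dir.name)
```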
Customization
You can modify the is_interesting function in the script to change the criteria for which posts are considered interesting:
def is_interesting(post):
    return post.score > 100 and not post.stickied
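As one possible customization, the sketch below builds an is_interesting function from a few thresholds. The make_filter factory and its parameters are hypothetical and not part of rd2md; num_comments and over_18 are attributes that PRAW submissions expose alongside score and stickied. SimpleNamespace objects stand in for real posts so the example runs without Reddit credentials.

```python
from types import SimpleNamespace


def make_filter(min_score=100, min_comments=0, allow_nsfw=True):
    """Return an is_interesting(post) function using the given thresholds."""
    def is_interesting(post):
        # Same base rule as the default: score above the threshold, not stickied.
        if post.stickied or post.score <= min_score:
            return False
        # Extra, optional criteria (illustrative additions).
        if post.num_comments < min_comments:
            return False
        return allow_nsfw or not post.over_18
    return is_interesting


# Stand-in posts; real ones would come from PRAW.
popular = SimpleNamespace(score=250, stickied=False, num_comments=40, over_18=False)
pinned = SimpleNamespace(score=900, stickied=True, num_comments=500, over_18=False)
quiet = SimpleNamespace(score=250, stickied=False, num_comments=2, over_18=False)

is_interesting = make_filter(min_score=100, min_comments=10)
print([is_interesting(p) for p in (popular, pinned, quiet)])
```

With the default arguments, make_filter() applies the same rule as the original function, so it can be swapped in without changing which posts are kept.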
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.