A CAPTCHA-safe Python scraper with Cloudflare bypass that downloads Substack posts and converts them to Markdown. Features automatic CAPTCHA solving, human-like scraping delays, and respectful rate limiting.
pydoll-substack2md
A CAPTCHA-safe Substack scraper with automatic Cloudflare bypass and human-like behavior
pydoll-substack2md is a Python tool for downloading free and premium Substack posts that handles modern web challenges:
- Automatic Cloudflare bypass - No manual intervention needed
- CAPTCHA handling - Built-in solver for common challenges
- Human-like scraping - Random delays and respectful rate limiting
- Premium content support - Login capability for paid subscriptions
- Organized output - Numbered posts by date, Markdown + HTML formats
Built on Pydoll, a powerful browser automation library that handles anti-bot measures automatically.
Key Features
Anti-Bot Protection Handling
- Automatic Cloudflare bypass - No manual solving needed
- CAPTCHA support - Built-in handling for common challenges
- Stealth mode - Mimics real browser behavior
- Smart retries - Automatic retry with backoff strategies
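The "smart retries" idea can be illustrated with a minimal sketch of exponential backoff. This is not the tool's actual implementation: the function name `retry_with_backoff` and its parameters are hypothetical, chosen only to show the general pattern of doubling the wait after each failed attempt.

```python
import asyncio
import random


async def retry_with_backoff(operation, max_attempts=4, base_delay=1.0,
                             max_delay=30.0, sleep=asyncio.sleep):
    """Call an async operation, retrying failures with exponential backoff.

    Hypothetical helper for illustration; not part of substack2md or Pydoll.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller.
            # Double the wait after each failure and cap it at max_delay.
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            # Add up to 10% random jitter so retries do not line up exactly.
            await sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter exists only so the waiting can be swapped out, for example in tests; a real caller would leave it at the default.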
Human-Like Scraping
- Random delays - Configurable delay ranges between requests
- Respectful rate limiting - Default 1-3 second delays
- Browser fingerprinting - Realistic browser profiles
- Session persistence - Maintains cookies and state
Content Management
- Markdown conversion - Clean, readable Markdown files
- Image downloading - Local storage with smart naming
- Post numbering - Chronological ordering (01 is the oldest post, the highest number is the newest)
- Continuous updates - Fetch only new posts on subsequent runs
- Premium content - Login support for paid subscriptions
Performance & Reliability
- Concurrent scraping - Multiple posts at once
- Async architecture - Non-blocking I/O operations
- Resource optimization - Blocks unnecessary assets
- Error resilience - Continues on individual post failures
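Concurrent scraping with a cap on parallel work, combined with continuing past individual failures, is commonly built from an `asyncio.Semaphore` and `asyncio.gather`. The sketch below assumes a caller-supplied coroutine `scrape_one(url)`; both that name and `scrape_all` are hypothetical and do not describe the tool's internals.

```python
import asyncio


async def scrape_all(urls, scrape_one, max_concurrent=5):
    """Scrape URLs concurrently, collecting failures instead of aborting.

    Hypothetical helper; scrape_one is any coroutine function taking a URL.
    """
    urls = list(urls)  # Iterated twice below, so materialize it once.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url):
        async with semaphore:  # At most max_concurrent scrapes run at once.
            return await scrape_one(url)

    # return_exceptions=True keeps one failing post from cancelling the rest.
    outcomes = await asyncio.gather(
        *(guarded(url) for url in urls), return_exceptions=True
    )

    results, failures = {}, {}
    for url, outcome in zip(urls, outcomes):
        if isinstance(outcome, Exception):
            failures[url] = outcome
        else:
            results[url] = outcome
    return results, failures
```

Returning failures separately lets the caller log or retry them without losing the posts that did succeed.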
How It Handles Protected Sites
Cloudflare Protection
When encountering Cloudflare's "Checking your browser" page, pydoll-substack2md:
- Automatically detects the challenge
- Waits for JavaScript execution
- Solves challenges without user intervention
- Proceeds with scraping once verified
CAPTCHA Handling
The tool uses Pydoll's built-in CAPTCHA solving capabilities:
# Automatic handling in the code
async with tab.expect_and_bypass_cloudflare_captcha():
    await tab.go_to(url)
Human-Like Behavior
To avoid detection and respect servers:
- Random delays between 1-3 seconds (configurable)
- Realistic mouse movements and clicks
- Maintains browser session and cookies
- Uses real Chrome/Edge browser (not headless by default)
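The randomized delay between requests could look roughly like the following. This is a sketch under assumptions: `pick_delay` and `polite_pause` are hypothetical names, and the defaults simply mirror the 1-3 second range stated above.

```python
import asyncio
import random


def pick_delay(delay_min=1.0, delay_max=3.0):
    """Return a random pause length in seconds within [delay_min, delay_max]."""
    if delay_min < 0 or delay_max < delay_min:
        raise ValueError("expected 0 <= delay_min <= delay_max")
    return random.uniform(delay_min, delay_max)


async def polite_pause(delay_min=1.0, delay_max=3.0):
    """Sleep for a randomized interval before the next request."""
    delay = pick_delay(delay_min, delay_max)
    await asyncio.sleep(delay)
    return delay  # Returned so callers can log how long they waited.
```

Drawing a fresh value before every request avoids the fixed cadence that makes automated traffic easy to spot and keeps the load on the server low.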
Requirements
- Python 3.10 or higher (3.11 recommended)
- Chrome or Edge browser installed
Installation
Install from PyPI (Recommended)
pip install substack2md
View on PyPI: https://pypi.org/project/substack2md/
Install from Source
Clone the repository:
git clone https://github.com/cognitive-glitch/pydoll-substack2md.git
cd pydoll-substack2md
Usage
After installing with pip install substack2md, you can use the command directly:
# Use the short command
substack2md https://example.substack.com
# Or the full command
substack2markdown https://example.substack.com
# With login for premium content
substack2md https://example.substack.com --login
# Manual login mode (works with any login method)
substack2md https://example.substack.com --manual-login
# Run with custom options
substack2md https://example.substack.com -n 10 --headless
Running from Source with uv
If you cloned the repository and want to run without installing:
# Run directly with uv - it handles all dependencies automatically
uv run substack2md https://example.substack.com
# With login for premium content
uv run substack2md https://example.substack.com --login
# Run with custom options
uv run substack2md https://example.substack.com -n 10 --headless
Configuration
For premium content access, create a .env file:
# Copy the example file
cp .env.example .env
# Edit .env with your credentials
SUBSTACK_EMAIL=your-email@domain.com
SUBSTACK_PASSWORD=your-password
Advanced Options
# Scrape only 10 posts
substack2md https://example.substack.com -n 10
# Run in headless mode (default is non-headless for user intervention)
substack2md https://example.substack.com --headless
# Use concurrent scraping for better performance
substack2md https://example.substack.com --concurrent --max-concurrent 5
# Specify custom directories
substack2md https://example.substack.com -d ./posts --html-directory ./html
# Custom browser path
substack2md https://example.substack.com --browser-path "/path/to/chrome"
# Custom delay between requests (respectful rate limiting)
substack2md https://example.substack.com --delay-min 2 --delay-max 5
# Continuous/incremental mode - only fetch new posts since last run
substack2md https://example.substack.com --continuous
Continuous Fetching & Post Numbering
Automatic Post Numbering
Posts are automatically numbered based on their publication date (oldest first):
01-first-post-title.md
02-second-post-title.md
03-latest-post-title.md
This makes it easy to read posts in chronological order.
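As an illustration of date-based numbering, the sketch below sorts posts oldest first and prefixes each file name with a zero-padded index. The helper names `slugify` and `numbered_filenames` and the exact slug rules are assumptions, not the tool's real code.

```python
import re
from datetime import date


def slugify(title):
    """Lowercase the title and join alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def numbered_filenames(posts, start=1):
    """Map (title, published_date) pairs to names like '01-first-post-title.md'."""
    ordered = sorted(posts, key=lambda post: post[1])  # Oldest first.
    # Pad to at least two digits, wider if the highest number needs it.
    width = max(2, len(str(start + len(ordered) - 1)))
    return [
        f"{number:0{width}d}-{slugify(title)}.md"
        for number, (title, _published) in enumerate(ordered, start=start)
    ]


posts = [
    ("Latest Post Title", date(2024, 3, 1)),
    ("First Post Title", date(2023, 1, 5)),
    ("Second Post Title", date(2023, 6, 9)),
]
print(numbered_filenames(posts))
```

The `start` parameter hints at how numbering might resume from the highest number already used on a later run.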
Continuous/Incremental Mode
Use the --continuous or -c flag to only fetch new posts since your last run:
# First run - fetches all posts
substack2md https://example.substack.com
# Later runs - only fetches new posts
substack2md https://example.substack.com --continuous
The tool maintains a .scraping_state.json file in the output directory to track:
- The latest post date and URL
- The highest number used
- Previously scraped URLs
This allows you to run the scraper periodically to keep your collection up-to-date without re-downloading existing posts.
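A state file of this kind can be sketched as follows. Only the file name `.scraping_state.json` comes from the description above; the JSON keys, the function names, and the use of ISO 8601 date strings (which compare correctly as plain strings) are assumptions made for illustration.

```python
import json
from pathlib import Path

STATE_FILE = ".scraping_state.json"  # File name taken from the text above.


def load_state(output_dir):
    """Read the saved state, or return an empty one on the first run."""
    path = Path(output_dir) / STATE_FILE
    if not path.exists():
        # Key names are hypothetical; the real file layout may differ.
        return {"latest_post_date": None, "latest_post_url": None,
                "highest_number": 0, "scraped_urls": []}
    return json.loads(path.read_text(encoding="utf-8"))


def select_new_posts(posts, state):
    """Keep only posts whose URL has not been scraped yet, oldest first."""
    seen = set(state["scraped_urls"])
    fresh = [post for post in posts if post["url"] not in seen]
    return sorted(fresh, key=lambda post: post["date"])


def record_posts(output_dir, state, new_posts):
    """Update the state with newly scraped posts and write it back to disk."""
    for post in new_posts:  # Expected oldest first, matching the numbering.
        state["highest_number"] += 1
        state["scraped_urls"].append(post["url"])
        latest = state["latest_post_date"]
        if latest is None or post["date"] > latest:
            state["latest_post_date"] = post["date"]
            state["latest_post_url"] = post["url"]
    path = Path(output_dir) / STATE_FILE
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return state
```

On a later run, `select_new_posts` would filter the publication's post list against the stored URLs, and `highest_number` gives the point from which numbering continues.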
Output Structure
After running the tool, you'll find:
├── substack_md_files/              # Markdown versions of posts
│   └── {author_name}/              # Organized by Substack author
│       ├── images/                 # Downloaded images for posts
│       │   ├── image1.jpg
│       │   └── image2.png
│       ├── post1.md
│       ├── post2.md
│       └── ...
├── substack_html_pages/            # HTML versions for browsing
│   └── {author_name}.html          # Single HTML file per author
├── data/                           # JSON metadata for the HTML interface
│   └── {author_name}_data.json
└── assets/                         # CSS/JS for HTML interface
## Development
```bash
# Install development dependencies
uv pip install -e ".[dev]"
# Run tests
uv run pytest
# Format code
uv run black .
# Lint
uv run ruff check . --fix
# Type check
uv run pyright
# Run pre-commit hooks
pre-commit run --all-files
```
Migration to Pydoll
This project has been migrated from Selenium to Pydoll for improved performance and reliability. Key benefits include:
- Faster execution: Direct Chrome DevTools Protocol connection
- Better reliability: Event-driven architecture for dynamic content
- Async support: Concurrent post scraping capabilities
- Cloudflare handling: Built-in bypass for protected sites
- Resource optimization: Block images/fonts for faster loading
Environment Variables
Configure the tool using a .env file (see .env.example for template):
- SUBSTACK_EMAIL: Your Substack account email
- SUBSTACK_PASSWORD: Your Substack account password
- HEADLESS: Set to true for headless browser mode (default: false)
- BROWSER_PATH: Custom path to Chrome/Edge binary (optional)
- USER_AGENT: Custom user agent string (optional)
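For illustration, these variables could be read into a small settings object as sketched below. The `Settings` class and `load_settings` function are hypothetical, and the sketch assumes the variables are already present in the process environment (for example, loaded from `.env` by a library such as python-dotenv); how the tool itself parses them may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed above."""
    email: Optional[str]
    password: Optional[str]
    headless: bool
    browser_path: Optional[str]
    user_agent: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping of environment variables."""
    return Settings(
        # Empty strings are treated the same as unset values.
        email=env.get("SUBSTACK_EMAIL") or None,
        password=env.get("SUBSTACK_PASSWORD") or None,
        # Assumption: only the literal "true" (any case) enables headless mode.
        headless=env.get("HEADLESS", "false").strip().lower() == "true",
        browser_path=env.get("BROWSER_PATH") or None,
        user_agent=env.get("USER_AGENT") or None,
    )
```

Passing the mapping as a parameter keeps the function easy to exercise with a plain dictionary instead of the real environment.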
Viewing Output
The tool generates both Markdown files and an HTML interface for easy viewing. To view the raw Markdown files in your browser, you can install the Markdown Viewer browser extension.
Alternatively, you can use the Substack Reader online tool built by @Firevvork, which allows you to read and export free Substack articles directly in your browser without any installation. Note that premium content export is only available in the local version.
Contributing
Contributions are welcome! Please ensure all tests pass and code is formatted before submitting a PR.
License
MIT License - see LICENSE file for details.
Acknowledgments
- Original project by timf34
- Web version by @Firevvork
- Built with Pydoll and html-to-markdown