docpull

Pull documentation from ANY website and convert to clean, AI-ready markdown.


Why docpull?

Unlike wget or httrack, which dump messy HTML, docpull extracts clean markdown perfect for:

  • Training AI models / RAG systems
  • Building knowledge bases
  • Creating searchable documentation archives
  • Offline documentation reading

Production-ready: Full type safety (mypy), security scanning (Bandit), zero linting issues (Ruff), comprehensive test coverage, and no known vulnerabilities.

Features

  • Universal: Scrape ANY documentation site - not limited to predefined sources
  • Smart extraction: Auto-detects main content, removes navigation/ads
  • Blazing fast: Async/parallel fetching (up to 10x faster than sync)
  • JavaScript support: Handles JS-heavy sites with Playwright
  • Progress bars: Beautiful real-time progress with Rich
  • Sitemap support: Auto-discovers pages via sitemap.xml
  • Link crawling: Optionally follows links to discover all pages
  • Secure: Rate limiting, content validation, timeout controls
  • Clean output: Markdown with YAML frontmatter
  • Configurable: Control depth, page limits, concurrency
  • Resumable: Skip already-fetched files

Quick Start

# Install
pip install docpull

# Scrape ANY documentation site
docpull https://aptos.dev
docpull https://docs.anthropic.com
docpull https://go.dev/doc

# Use optimized profiles for popular sites
docpull stripe
docpull nextjs react

# Control scraping behavior
docpull https://newsite.com/docs --max-pages 100 --max-concurrent 20

Installation

# Basic installation
pip install docpull

# With YAML config support
pip install docpull[yaml]

# With JavaScript rendering (for JS-heavy sites)
pip install docpull[js]
python -m playwright install chromium

# Everything
pip install docpull[all]
python -m playwright install chromium

Usage

Scrape Any URL

The primary way to use docpull is to point it at any documentation URL:

# Single site
docpull https://aptos.dev

# Multiple sites
docpull https://aptos.dev https://docs.soliditylang.org

# Control crawling
docpull https://docs.example.com \
  --max-pages 200 \
  --max-depth 4 \
  --rate-limit 1.0

Use Optimized Profiles

For popular documentation sites, use shortcut names for optimized scraping:

# Single profile
docpull stripe

# Multiple profiles
docpull stripe plaid nextjs

# Mix profiles and URLs
docpull stripe https://newsite.com/docs

JavaScript Rendering

For sites that require JavaScript to render content:

# Enable JS rendering with Playwright
docpull https://js-heavy-site.com --js

# Combine with other options
docpull https://site.com --js --max-pages 50 --max-concurrent 5

Note: JS rendering is slower but handles modern SPAs and dynamically loaded content.

Available Profiles

| Profile   | Site            | Optimizations                           |
|-----------|-----------------|-----------------------------------------|
| stripe    | docs.stripe.com | Filters changelog, focused on API docs  |
| nextjs    | nextjs.org      | Excludes blog/showcase, docs only       |
| react     | react.dev       | Learn & reference sections only         |
| plaid     | plaid.com       | API + guides, excludes marketing        |
| tailwind  | tailwindcss.com | Documentation only                      |
| bun       | bun.sh          | Runtime documentation                   |
| d3        | d3js.org        | Data visualization docs                 |
| turborepo | turbo.build     | Monorepo tooling docs                   |

Python API

from docpull import GenericAsyncFetcher

# Scrape any URL (async/parallel)
fetcher = GenericAsyncFetcher(
    url_or_profile="https://aptos.dev",
    output_dir="./docs",
    max_pages=100,
    max_concurrent=20,
    use_js=False,  # Set to True for JS rendering
)
fetcher.fetch()

# Or use a profile
fetcher = GenericAsyncFetcher(
    url_or_profile="stripe",
    output_dir="./docs",
)
fetcher.fetch()

Advanced Options

# Limit pages and depth
docpull https://docs.example.com --max-pages 50 --max-depth 2

# Control concurrent requests (default: 10)
docpull https://site.com --max-concurrent 20

# Enable JavaScript rendering
docpull https://site.com --js

# Custom output directory
docpull stripe --output-dir ./my-docs

# Adjust rate limiting
docpull https://site.com --rate-limit 2.0

# Re-fetch existing files
docpull stripe --no-skip-existing

# Verbose logging
docpull https://site.com --verbose

# Disable progress bars
docpull https://site.com --no-progress

# Dry run (see what would be fetched)
docpull https://site.com --dry-run

Performance

Async/parallel fetching makes docpull up to roughly 10x faster than traditional synchronous scrapers:

| Pages | Sync (old) | Async (new) | Speedup     |
|-------|------------|-------------|-------------|
| 5     | ~5.0s      | ~1.8s       | 2.8x faster |
| 50    | ~50s       | ~6s         | 8.3x faster |
| 500   | ~500s      | ~45s        | 11x faster  |

With --max-concurrent 20, large sites finish even faster.
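
Conceptually, the speedup comes from issuing many requests at once while a semaphore caps how many are in flight, which is what --max-concurrent bounds. Here is a minimal standard-library sketch of the idea (illustrative only, not docpull's actual code):

# Bounded concurrent fetching: a semaphore limits in-flight requests,
# playing the role of --max-concurrent. Not docpull's internals.
import asyncio
import urllib.request

async def fetch(url: str, sem: asyncio.Semaphore) -> bytes:
    async with sem:  # at most max_concurrent fetches run at once
        return await asyncio.to_thread(
            lambda: urllib.request.urlopen(url, timeout=30).read()
        )

async def main(urls: list[str], max_concurrent: int = 10) -> None:
    sem = asyncio.Semaphore(max_concurrent)
    pages = await asyncio.gather(*(fetch(u, sem) for u in urls))
    print(f"fetched {len(pages)} pages")

asyncio.run(main(["https://example.com"] * 5))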

Output Format

Each page is saved as markdown with YAML frontmatter:

---
url: https://stripe.com/docs/payments
fetched: 2025-11-13
---

# Payment Intents

Your clean documentation content here...

Files are organized by URL structure:

docs/
├── stripe/
│   ├── api/
│   │   ├── charges.md
│   │   └── customers.md
│   └── payments/
│       └── payment-intents.md
└── aptos_dev/
    ├── guides/
    │   └── getting-started.md
    └── reference/
        └── api.md
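
For illustration, the URL-to-file-path mapping shown above can be sketched like this (the helper below is hypothetical; docpull's exact rules may differ):

# Hypothetical sketch of mapping a page URL to an output path.
from pathlib import Path
from urllib.parse import urlparse

def output_path(url: str, root: Path = Path("docs")) -> Path:
    parsed = urlparse(url)
    site = parsed.netloc.replace(".", "_")   # aptos.dev -> aptos_dev
    rel = parsed.path.strip("/") or "index"
    return root / site / f"{rel}.md"

print(output_path("https://aptos.dev/guides/getting-started"))
# docs/aptos_dev/guides/getting-started.md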

How It Works

  1. Discovery: Tries sitemap.xml first, falls back to link crawling (see the sketch after this list)
  2. Filtering: Applies URL patterns to focus on documentation
  3. Extraction: Removes nav/footer/ads, extracts main content
  4. Conversion: Converts HTML to clean markdown
  5. Organization: Saves with structure that mirrors the site
  6. Async Magic: Fetches multiple pages concurrently with rate limiting
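
A minimal sketch of steps 1 and 2 using only the standard library (an illustration of the approach, not docpull's internals; the pattern rules are assumptions):

# Step 1: try sitemap.xml; fall back to crawling from the root.
# Step 2: filter URLs so only documentation pages survive.
import urllib.request
import xml.etree.ElementTree as ET

def discover(base_url: str) -> list[str]:
    try:
        sitemap = base_url.rstrip("/") + "/sitemap.xml"
        with urllib.request.urlopen(sitemap, timeout=30) as resp:
            tree = ET.fromstring(resp.read())
        ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
        return [loc.text for loc in tree.findall(".//sm:loc", ns) if loc.text]
    except Exception:
        return [base_url]  # no sitemap: crawl links from the root instead

def keep(url: str, include: list[str], exclude: list[str]) -> bool:
    if any(pat in url for pat in exclude):
        return False
    return not include or any(pat in url for pat in include)

urls = [u for u in discover("https://docs.example.com")
        if keep(u, include=["/docs/"], exclude=["/blog/"])]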

Configuration File

Create config.yaml for complex setups:

output_dir: ./docs
rate_limit: 0.5
skip_existing: true
log_level: INFO

sources:
  - stripe
  - nextjs
  - react

Run with:

docpull --config config.yaml
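
The same setup can be approximated with the Python API from the earlier section. This sketch uses only the documented url_or_profile and output_dir arguments; the rate_limit and skip_existing keys are omitted because their keyword equivalents aren't documented here:

from docpull import GenericAsyncFetcher

# Fetch each configured source in turn (per-source defaults assumed).
for source in ["stripe", "nextjs", "react"]:
    GenericAsyncFetcher(url_or_profile=source, output_dir="./docs").fetch()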

Creating Custom Profiles

You can create optimized profiles for sites you scrape frequently:

from docpull.profiles.base import SiteProfile

MY_PROFILE = SiteProfile(
    name="mysite",
    domains={"docs.mysite.com"},
    sitemap_url="https://docs.mysite.com/sitemap.xml",
    base_url="https://docs.mysite.com/",
    include_patterns=["/docs/", "/api/"],
    exclude_patterns=["/blog/"],
    output_subdir="mysite",
    rate_limit=0.5,
)

Security

docpull is designed with security in mind:

  • HTTPS-only by default
  • Private IP blocking (no localhost, 192.168.x.x, etc.)
  • Content size limits (50MB max per page)
  • Timeout controls (30s per request)
  • Rate limiting (async-safe, prevents DoS)
  • Concurrent connection limits (prevents overwhelming servers)
  • Content-type validation (only fetches HTML/XML)
  • Playwright sandboxing (when using --js)

See SECURITY.md for detailed security information.
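
As a rough illustration of the first two bullets (HTTPS-only and private IP blocking), a check along these lines can be written with the standard library. This is a sketch of the idea, not docpull's actual implementation:

# Sketch of an HTTPS-only, public-IP-only URL check (illustrative).
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_url(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.scheme != "https":  # HTTPS-only by default
        return False
    try:
        addr = ipaddress.ip_address(socket.gethostbyname(parsed.hostname or ""))
    except (socket.gaierror, ValueError):
        return False
    return addr.is_global  # rejects localhost, 192.168.x.x, etc.

print(is_safe_url("https://docs.stripe.com"))    # True
print(is_safe_url("https://192.168.1.1/admin"))  # False (private IP)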

Comparison with Alternatives

| Tool          | Output         | Works on any site? | Clean extraction? | Speed        | JS support |
|---------------|----------------|--------------------|-------------------|--------------|------------|
| docpull       | Clean markdown | Yes                | Yes               | Fast (async) | Optional   |
| wget          | Raw HTML       | Yes                | No                | Slow (sync)  | No         |
| httrack      | Raw HTML       | Yes                | No                | Slow (sync)  | No         |
| Site-specific | Varies         | No                 | Varies            | Varies       | No         |

Troubleshooting

Site requires JavaScript

# Install Playwright support
pip install docpull[js]
python -m playwright install chromium

# Use --js flag
docpull https://site.com --js

Too slow / rate limited

# Reduce concurrent requests
docpull https://site.com --max-concurrent 5 --rate-limit 2.0

Memory issues on large sites

# Limit pages fetched
docpull https://site.com --max-pages 1000

Contributing

Contributions welcome! Ways to contribute:

  • New site profiles: Create a profile in docpull/profiles/
  • Better extraction: Improve content detection in fetchers/base.py
  • Performance improvements: Optimize async fetching
  • Bug reports: Use the issue tracker

Development Setup

# Clone and install
git clone https://github.com/raintree-technology/docpull
cd docpull
pip install -e ".[dev]"

# Run all quality checks (as per CI)
black --check .           # Code formatting
ruff check .              # Linting
mypy docpull              # Type checking
bandit -r docpull         # Security scanning
pip-audit                 # Dependency vulnerabilities
pytest --cov=docpull -v   # Tests with coverage

All PRs must pass these checks before merging.

License

MIT License - see LICENSE file for details

