docpull

Pull documentation from ANY website and convert to clean, AI-ready markdown.


Why docpull?

Unlike wget or httrack, which dump messy HTML, docpull extracts clean markdown perfect for:

  • Training AI models / RAG systems
  • Building knowledge bases
  • Creating searchable documentation archives
  • Offline documentation reading

Production-ready: Full type safety (mypy), security scanning (Bandit), zero linting issues (Ruff), comprehensive test coverage, and no known vulnerabilities.

Features

  • Universal: Scrape ANY documentation site - not limited to predefined sources
  • Smart extraction: Auto-detects main content, removes navigation/ads
  • Blazing fast: Async/parallel fetching (up to 10x faster than sync)
  • JavaScript support: Handles JS-heavy sites with Playwright
  • Progress bars: Beautiful real-time progress with Rich
  • Sitemap support: Auto-discovers pages via sitemap.xml
  • Link crawling: Optionally follows links to discover all pages
  • Secure: Rate limiting, content validation, timeout controls
  • Clean output: Markdown with YAML frontmatter
  • Configurable: Control depth, page limits, concurrency
  • Resumable: Skip already-fetched files

Quick Start

# Install
pip install docpull

# Scrape ANY documentation site
docpull https://aptos.dev
docpull https://docs.anthropic.com
docpull https://go.dev/doc

# Use optimized profiles for popular sites
docpull stripe
docpull nextjs react

# Control scraping behavior
docpull https://newsite.com/docs --max-pages 100 --max-concurrent 20

Installation

# Basic installation
pip install docpull

# With YAML config support
pip install docpull[yaml]

# With JavaScript rendering (for JS-heavy sites)
pip install docpull[js]
python -m playwright install chromium

# Everything
pip install docpull[all]
python -m playwright install chromium

Usage

Scrape Any URL

The primary way to use docpull is to point it at any documentation URL:

# Single site
docpull https://aptos.dev

# Multiple sites
docpull https://aptos.dev https://docs.soliditylang.org

# Control crawling
docpull https://docs.example.com \
  --max-pages 200 \
  --max-depth 4 \
  --rate-limit 1.0

Use Optimized Profiles

For popular documentation sites, use shortcut names for optimized scraping:

# Single profile
docpull stripe

# Multiple profiles
docpull stripe plaid nextjs

# Mix profiles and URLs
docpull stripe https://newsite.com/docs

JavaScript Rendering

For sites that require JavaScript to render content:

# Enable JS rendering with Playwright
docpull https://js-heavy-site.com --js

# Combine with other options
docpull https://site.com --js --max-pages 50 --max-concurrent 5

Note: JS rendering is slower but handles modern SPAs and dynamically loaded content.

Available Profiles

| Profile   | Site            | Optimizations                           |
|-----------|-----------------|-----------------------------------------|
| stripe    | docs.stripe.com | Filters changelog, focused on API docs  |
| nextjs    | nextjs.org      | Excludes blog/showcase, docs only       |
| react     | react.dev       | Learn & reference sections only         |
| plaid     | plaid.com       | API + guides, excludes marketing        |
| tailwind  | tailwindcss.com | Documentation only                      |
| bun       | bun.sh          | Runtime documentation                   |
| d3        | d3js.org        | Data visualization docs                 |
| turborepo | turbo.build     | Monorepo tooling docs                   |

Python API

from docpull import GenericAsyncFetcher

# Scrape any URL (async/parallel)
fetcher = GenericAsyncFetcher(
    url_or_profile="https://aptos.dev",
    output_dir="./docs",
    max_pages=100,
    max_concurrent=20,
    use_js=False,  # Set to True for JS rendering
)
fetcher.fetch()

# Or use a profile
fetcher = GenericAsyncFetcher(
    url_or_profile="stripe",
    output_dir="./docs",
)
fetcher.fetch()

Advanced Options

# Limit pages and depth
docpull https://docs.example.com --max-pages 50 --max-depth 2

# Control concurrent requests (default: 10)
docpull https://site.com --max-concurrent 20

# Enable JavaScript rendering
docpull https://site.com --js

# Custom output directory
docpull stripe --output-dir ./my-docs

# Adjust rate limiting
docpull https://site.com --rate-limit 2.0

# Re-fetch existing files
docpull stripe --no-skip-existing

# Verbose logging
docpull https://site.com --verbose

# Disable progress bars
docpull https://site.com --no-progress

# Dry run (see what would be fetched)
docpull https://site.com --dry-run

Performance

Async/parallel fetching makes docpull up to roughly 10x faster than traditional synchronous scrapers:

| Pages | Sync (old) | Async (new) | Speedup     |
|-------|------------|-------------|-------------|
| 5     | ~5.0s      | ~1.8s       | 2.8x faster |
| 50    | ~50s       | ~6s         | 8.3x faster |
| 500   | ~500s      | ~45s        | 11x faster  |

With --max-concurrent 20, large sites finish even faster.
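
Conceptually, the speedup comes from issuing many requests at once while a semaphore caps how many are in flight, which is what --max-concurrent bounds. Here is a minimal standard-library sketch of the idea (illustrative only, not docpull's actual code):

# Bounded concurrent fetching: a semaphore limits in-flight requests,
# playing the role of --max-concurrent. Not docpull's internals.
import asyncio
import urllib.request

async def fetch(url: str, sem: asyncio.Semaphore) -> bytes:
    async with sem:  # at most max_concurrent fetches run at once
        return await asyncio.to_thread(
            lambda: urllib.request.urlopen(url, timeout=30).read()
        )

async def main(urls: list[str], max_concurrent: int = 10) -> None:
    sem = asyncio.Semaphore(max_concurrent)
    pages = await asyncio.gather(*(fetch(u, sem) for u in urls))
    print(f"fetched {len(pages)} pages")

asyncio.run(main(["https://example.com"] * 5))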

Output Format

Each page is saved as markdown with YAML frontmatter:

---
url: https://stripe.com/docs/payments
fetched: 2025-11-13
---

# Payment Intents

Your clean documentation content here...

Files are organized by URL structure:

docs/
├── stripe/
│   ├── api/
│   │   ├── charges.md
│   │   └── customers.md
│   └── payments/
│       └── payment-intents.md
└── aptos_dev/
    ├── guides/
    │   └── getting-started.md
    └── reference/
        └── api.md
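
For illustration, the URL-to-file-path mapping shown above can be sketched like this (the helper below is hypothetical; docpull's exact rules may differ):

# Hypothetical sketch of mapping a page URL to an output path.
from pathlib import Path
from urllib.parse import urlparse

def output_path(url: str, root: Path = Path("docs")) -> Path:
    parsed = urlparse(url)
    site = parsed.netloc.replace(".", "_")   # aptos.dev -> aptos_dev
    rel = parsed.path.strip("/") or "index"
    return root / site / f"{rel}.md"

print(output_path("https://aptos.dev/guides/getting-started"))
# docs/aptos_dev/guides/getting-started.md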

How It Works

  1. Discovery: Tries sitemap.xml first, falls back to link crawling (see the sketch after this list)
  2. Filtering: Applies URL patterns to focus on documentation
  3. Extraction: Removes nav/footer/ads, extracts main content
  4. Conversion: Converts HTML to clean markdown
  5. Organization: Saves with structure that mirrors the site
  6. Async Magic: Fetches multiple pages concurrently with rate limiting
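
A minimal sketch of steps 1 and 2 using only the standard library (an illustration of the approach, not docpull's internals; the pattern rules are assumptions):

# Step 1: try sitemap.xml; fall back to crawling from the root.
# Step 2: filter URLs so only documentation pages survive.
import urllib.request
import xml.etree.ElementTree as ET

def discover(base_url: str) -> list[str]:
    try:
        sitemap = base_url.rstrip("/") + "/sitemap.xml"
        with urllib.request.urlopen(sitemap, timeout=30) as resp:
            tree = ET.fromstring(resp.read())
        ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
        return [loc.text for loc in tree.findall(".//sm:loc", ns) if loc.text]
    except Exception:
        return [base_url]  # no sitemap: crawl links from the root instead

def keep(url: str, include: list[str], exclude: list[str]) -> bool:
    if any(pat in url for pat in exclude):
        return False
    return not include or any(pat in url for pat in include)

urls = [u for u in discover("https://docs.example.com")
        if keep(u, include=["/docs/"], exclude=["/blog/"])]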

Configuration File

Create config.yaml for complex setups:

output_dir: ./docs
rate_limit: 0.5
skip_existing: true
log_level: INFO

sources:
  - stripe
  - nextjs
  - react

Run with:

docpull --config config.yaml
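
The same setup can be approximated with the Python API from the earlier section. This sketch uses only the documented url_or_profile and output_dir arguments; the rate_limit and skip_existing keys are omitted because their keyword equivalents aren't documented here:

from docpull import GenericAsyncFetcher

# Fetch each configured source in turn (per-source defaults assumed).
for source in ["stripe", "nextjs", "react"]:
    GenericAsyncFetcher(url_or_profile=source, output_dir="./docs").fetch()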

Creating Custom Profiles

You can create optimized profiles for sites you scrape frequently:

from docpull.profiles.base import SiteProfile

MY_PROFILE = SiteProfile(
    name="mysite",
    domains={"docs.mysite.com"},
    sitemap_url="https://docs.mysite.com/sitemap.xml",
    base_url="https://docs.mysite.com/",
    include_patterns=["/docs/", "/api/"],
    exclude_patterns=["/blog/"],
    output_subdir="mysite",
    rate_limit=0.5,
)

Security

docpull is designed with security in mind:

  • HTTPS-only by default
  • Private IP blocking (no localhost, 192.168.x.x, etc.)
  • Content size limits (50MB max per page)
  • Timeout controls (30s per request)
  • Rate limiting (async-safe, prevents DoS)
  • Concurrent connection limits (prevents overwhelming servers)
  • Content-type validation (only fetches HTML/XML)
  • Playwright sandboxing (when using --js)

See SECURITY.md for detailed security information.
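
As a rough illustration of the first two bullets (HTTPS-only and private IP blocking), a check along these lines can be written with the standard library. This is a sketch of the idea, not docpull's actual implementation:

# Sketch of an HTTPS-only, public-IP-only URL check (illustrative).
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_url(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.scheme != "https":  # HTTPS-only by default
        return False
    try:
        addr = ipaddress.ip_address(socket.gethostbyname(parsed.hostname or ""))
    except (socket.gaierror, ValueError):
        return False
    return addr.is_global  # rejects localhost, 192.168.x.x, etc.

print(is_safe_url("https://docs.stripe.com"))    # True
print(is_safe_url("https://192.168.1.1/admin"))  # False (private IP)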

Comparison with Alternatives

| Tool          | Output         | Works on any site? | Clean extraction? | Speed        | JS support |
|---------------|----------------|--------------------|-------------------|--------------|------------|
| docpull       | Clean markdown | Yes                | Yes               | Fast (async) | Optional   |
| wget          | Raw HTML       | Yes                | No                | Slow (sync)  | No         |
| httrack      | Raw HTML       | Yes                | No                | Slow (sync)  | No         |
| Site-specific | Varies         | No                 | Varies            | Varies       | No         |

Troubleshooting

Site requires JavaScript

# Install Playwright support
pip install docpull[js]
python -m playwright install chromium

# Use --js flag
docpull https://site.com --js

Too slow / rate limited

# Reduce concurrent requests
docpull https://site.com --max-concurrent 5 --rate-limit 2.0

Memory issues on large sites

# Limit pages fetched
docpull https://site.com --max-pages 1000

Contributing

Contributions welcome! Ways to contribute:

  • New site profiles: Create a profile in docpull/profiles/
  • Better extraction: Improve content detection in fetchers/base.py
  • Performance improvements: Optimize async fetching
  • Bug reports: Use the issue tracker

Development Setup

# Clone and install
git clone https://github.com/raintree-technology/docpull
cd docpull
pip install -e ".[dev]"

# Run all quality checks (as per CI)
black --check .           # Code formatting
ruff check .              # Linting
mypy docpull              # Type checking
bandit -r docpull         # Security scanning
pip-audit                 # Dependency vulnerabilities
pytest --cov=docpull -v   # Tests with coverage

All PRs must pass these checks before merging.

License

MIT License - see LICENSE file for details

