Pull documentation from the web and convert to clean markdown

Project description

docpull

Pulls documentation from any website and converts it into clean, AI-ready Markdown. Fast, type-safe, secure, and optimized for building knowledge bases or training datasets.

Python 3.9+ | PyPI | License: MIT | Code style: black | Type checked: mypy | Security: bandit

Why docpull?

Unlike tools like wget or httrack, docpull extracts only the main content, removing ads, navbars, and clutter. Output is clean Markdown with optional YAML frontmatter—ideal for RAG systems, offline docs, or ML pipelines.

Key Features

  • Works on any documentation site
  • Smart extraction of main content
  • Async + parallel fetching (up to 10× faster)
  • Optional JavaScript rendering via Playwright
  • Sitemap + link crawling
  • URL-based filtering (include/exclude)
  • Rate limiting, timeouts, content-type checks
  • Saves docs in structured Markdown with YAML metadata
  • Optimized profiles for popular platforms (Stripe, Next.js, React, Plaid, Tailwind, etc.)

Quick Start

pip install docpull
docpull https://aptos.dev
docpull stripe           # use a built-in profile
docpull https://site.com/docs --max-pages 100 --max-concurrent 20

JavaScript-heavy sites

pip install docpull[js]
python -m playwright install chromium
docpull https://site.com --js

Python API

from docpull import GenericAsyncFetcher

# Crawl up to 100 pages, 20 at a time, writing Markdown into ./docs
fetcher = GenericAsyncFetcher(
    url_or_profile="https://aptos.dev",  # a URL, or a built-in profile name like "stripe"
    output_dir="./docs",
    max_pages=100,
    max_concurrent=20,
)
fetcher.fetch()

Common Options

  • --max-pages N – limit crawl size
  • --max-depth N – restrict link depth
  • --max-concurrent N – control parallel fetches
  • --js – enable Playwright rendering
  • --output-dir DIR – directory to write Markdown files into
  • --rate-limit X – delay between requests, in seconds
  • --no-skip-existing – re-fetch pages that already exist locally
  • --dry-run – show what would be fetched without downloading

Performance

Async fetching drastically reduces runtime:

Pages   Sync    Async   Speedup
50      ~50s    ~6s     8× faster

Higher concurrency yields even better results.
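As a back-of-envelope model (assuming roughly one second per page, which is an illustrative figure, not a docpull benchmark), runtime scales with the number of sequential "waves" of requests:

```python
import math

def estimated_runtime(pages, seconds_per_page, concurrency):
    """Idealized model: pages are fetched in waves of `concurrency` at a time."""
    return math.ceil(pages / concurrency) * seconds_per_page

sync_time = estimated_runtime(50, 1.0, 1)    # sequential: 50 waves -> 50.0s
async_time = estimated_runtime(50, 1.0, 10)  # 10 in flight: 5 waves -> 5.0s
```

The observed ~6s for 50 pages is close to the 5-wave ideal; per-request overhead accounts for the gap.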

Output Format

Each downloaded page becomes a Markdown file:

---
url: https://stripe.com/docs/payments
fetched: 2025-11-13
---
# Payment Intents
...

Directory layout mirrors the target site's structure.
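To consume these files downstream, the frontmatter can be separated from the body. The helper below is a minimal sketch for the simple `key: value` header shown above (a real pipeline would likely use a YAML or frontmatter library); `split_frontmatter` is a hypothetical name, not part of docpull:

```python
def split_frontmatter(markdown_text):
    """Split a docpull-style page into (metadata dict, Markdown body)."""
    if not markdown_text.startswith("---\n"):
        return {}, markdown_text
    # Everything between the two --- fences is flat key: value metadata.
    header, _, body = markdown_text[4:].partition("\n---\n")
    meta = {}
    for line in header.splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body.lstrip("\n")

page = """---
url: https://stripe.com/docs/payments
fetched: 2025-11-13
---
# Payment Intents
...
"""
meta, body = split_frontmatter(page)
```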

Configuration File (Optional)

output_dir: ./docs
rate_limit: 0.5
sources:
  - stripe
  - nextjs

Run with:

docpull --config config.yaml


Custom Profiles

Easily define profiles for frequently scraped sites.

from docpull.profiles.base import SiteProfile

MY_PROFILE = SiteProfile(
    name="mysite",
    domains={"docs.mysite.com"},
    include_patterns=["/docs/", "/api/"],
)
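A profile like this presumably gates the crawl by domain and path. The sketch below is illustrative only, not docpull's actual matching logic: it treats include_patterns as substring filters on the URL path, and `url_allowed` is a hypothetical helper:

```python
from urllib.parse import urlparse

def url_allowed(url, domains, include_patterns):
    """Accept a URL only if its host is in `domains` and its path
    contains at least one of the include patterns."""
    parsed = urlparse(url)
    if parsed.hostname not in domains:
        return False
    return any(pattern in parsed.path for pattern in include_patterns)

allowed = url_allowed("https://docs.mysite.com/docs/intro",
                      {"docs.mysite.com"}, ["/docs/", "/api/"])
blocked = url_allowed("https://docs.mysite.com/blog/post",
                      {"docs.mysite.com"}, ["/docs/", "/api/"])
```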

Security

  • HTTPS-only
  • Blocks private network IPs
  • 50MB page size limit
  • Timeout controls
  • Validates content-type
  • Playwright sandboxing
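The "blocks private network IPs" guard protects against SSRF-style crawls into internal infrastructure. A minimal sketch of how such a check can work with the standard library (not docpull's actual implementation):

```python
import ipaddress
import socket

def is_private_address(hostname):
    """Resolve a hostname and reject private, loopback, or link-local IPs."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return True  # unresolvable: treat as unsafe
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_private or ip.is_loopback or ip.is_link_local:
            return True
    return False
```

For example, `is_private_address("10.0.0.5")` is rejected while a public address passes.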

Troubleshooting

  • Site requires JS: install Playwright + --js
  • Slow or rate limited: lower concurrency or raise --rate-limit
  • Large sites: set --max-pages

License

MIT License - see LICENSE file for details

Download files

Download the file for your platform.

Source Distribution

docpull-1.0.2.tar.gz (32.1 kB)

Built Distribution

docpull-1.0.2-py3-none-any.whl (42.3 kB)

File details

Details for the file docpull-1.0.2.tar.gz.

File metadata

  • Download URL: docpull-1.0.2.tar.gz
  • Upload date:
  • Size: 32.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for docpull-1.0.2.tar.gz
Algorithm Hash digest
SHA256 1fca6c7c4c8c7ca65e5cb467fc0329193723ecccd858407abe34c23c56b12053
MD5 7050211dc1222b7fbe62447a59646260
BLAKE2b-256 32bb3794096b2f05b7f682dc76619b0dbd035085810c193d3e3382bf03482eec

File details

Details for the file docpull-1.0.2-py3-none-any.whl.

File metadata

  • Download URL: docpull-1.0.2-py3-none-any.whl
  • Upload date:
  • Size: 42.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for docpull-1.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 5fabcedefd0e84a35f6c4ebfa276b8bffd336941e2a9f076982d62dea0584465
MD5 ac28619e41b841ebf71c25d8f8b08f17
BLAKE2b-256 1f7ff7f0bb12be1d12f8df38bd8d8ffb35bfbd1e74aa40bf70d0a3992e6f16c6
