Pull documentation from the web and convert to clean markdown

docpull

Pulls documentation from any website and converts it into clean, AI-ready Markdown. Fast, type-safe, secure, and optimized for building knowledge bases or training datasets.

Python 3.9+ | License: MIT | Code style: black | Type checked: mypy | Security: bandit

Why docpull?

Unlike tools like wget or httrack, docpull extracts only the main content, removing ads, navbars, and clutter. Output is clean Markdown with optional YAML frontmatter—ideal for RAG systems, offline docs, or ML pipelines.

Key Features

  • Works on any documentation site
  • Smart extraction of main content
  • Async + parallel fetching (up to 10× faster)
  • Optional JavaScript rendering via Playwright
  • Sitemap + link crawling
  • URL-based filtering (include/exclude)
  • Rate limiting, timeouts, content-type checks
  • Saves docs in structured Markdown with YAML metadata
  • Optimized profiles for popular platforms (Stripe, Next.js, React, Plaid, Tailwind, etc.)

Quick Start

pip install docpull
docpull --doctor         # verify installation
docpull https://aptos.dev
docpull stripe           # use a built-in profile
docpull https://site.com/docs --max-pages 100 --max-concurrent 20

JavaScript-heavy sites

pip install "docpull[js]"   # quotes keep shells such as zsh from expanding the brackets
python -m playwright install chromium
docpull https://site.com --js

Python API

from docpull import GenericAsyncFetcher

fetcher = GenericAsyncFetcher(
    url_or_profile="https://aptos.dev",
    output_dir="./docs",
    max_pages=100,
    max_concurrent=20,
)
fetcher.fetch()

Common Options

  • --doctor – verify installation and dependencies
  • --max-pages N – limit crawl size
  • --max-depth N – restrict link depth
  • --max-concurrent N – control parallel fetches
  • --js – enable Playwright rendering
  • --output-dir DIR
  • --rate-limit X
  • --no-skip-existing
  • --dry-run
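
When docpull is driven from a script, the flags above can be assembled into an argument list before handing it to `subprocess.run`. The following is a minimal sketch that uses only flags listed in this README; the helper name `build_docpull_command` and its parameters are hypothetical and not part of docpull.

```python
import shlex
from typing import List, Optional


def build_docpull_command(
    target: str,
    max_pages: Optional[int] = None,
    max_concurrent: Optional[int] = None,
    output_dir: Optional[str] = None,
    dry_run: bool = False,
) -> List[str]:
    """Assemble an argv list for the docpull CLI (hypothetical helper)."""
    cmd = ["docpull", target]
    if max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]
    if max_concurrent is not None:
        cmd += ["--max-concurrent", str(max_concurrent)]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


cmd = build_docpull_command("https://site.com/docs", max_pages=100, dry_run=True)
# Print a shell-safe version of the command for logging or copy-paste.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would then invoke the CLI without going through a shell, which avoids quoting problems in URLs.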

Performance

Async fetching drastically reduces runtime:

Pages   Sync    Async   Speedup
50      ~50s    ~6s     8× faster

Raising --max-concurrent can shorten runs further, until the target site's rate limits or your bandwidth become the bottleneck.
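
To see why bounded concurrency helps, here is a small self-contained sketch. It does not use docpull or the network: `fake_fetch` is a stand-in that sleeps instead of making a request, and the semaphore plays the role a setting like `--max-concurrent` would. The function names are illustrative assumptions, not docpull internals.

```python
import asyncio
import time


async def fake_fetch(url: str, delay: float = 0.05) -> str:
    """Stand-in for a network request; sleeps instead of doing I/O."""
    await asyncio.sleep(delay)
    return url


async def fetch_all(urls, max_concurrent: int):
    """Fetch every URL with at most max_concurrent requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with semaphore:
            return await fake_fetch(url)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(bounded(u) for u in urls))


def timed_run(urls, max_concurrent: int) -> float:
    """Return the wall-clock seconds needed to fetch all URLs."""
    start = time.perf_counter()
    asyncio.run(fetch_all(urls, max_concurrent))
    return time.perf_counter() - start


urls = [f"https://example.com/docs/page-{i}" for i in range(20)]
sequential = timed_run(urls, max_concurrent=1)
concurrent = timed_run(urls, max_concurrent=10)
print(f"1 at a time: {sequential:.2f}s, 10 at a time: {concurrent:.2f}s")
```

With one request in flight the delays add up; with ten, they overlap. Real crawls gain less than this idealized case once the server or the network becomes the limit.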

Output Format

Each downloaded page becomes a Markdown file:

---
url: https://stripe.com/docs/payments
fetched: 2025-11-13
---
# Payment Intents
...
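
For downstream use (for example, attaching the source URL to chunks in a RAG index), the frontmatter can be split from the body with a few lines of standard-library Python. This is a sketch that assumes the flat `key: value` layout shown above; it is not a full YAML parser, and `split_frontmatter` is a hypothetical helper, not a docpull function.

```python
from typing import Dict, Tuple


def split_frontmatter(text: str) -> Tuple[Dict[str, str], str]:
    """Split a document into (metadata, body).

    Handles only flat `key: value` lines between two `---` markers.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    metadata: Dict[str, str] = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing marker is the Markdown body.
            return metadata, "\n".join(lines[index + 1:])
        # Split on the first colon only, so URL values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    return {}, text  # no closing marker: treat the whole text as body


sample = """---
url: https://stripe.com/docs/payments
fetched: 2025-11-13
---
# Payment Intents
"""
meta, body = split_frontmatter(sample)
print(meta["url"], meta["fetched"])
```

If the frontmatter ever contains nested or multi-line values, a real YAML library would be the safer choice.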

Directory layout mirrors the target site's structure.
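
The README does not spell out the exact naming scheme, so the following is only one plausible way a URL could map onto a mirrored directory layout. The function `url_to_markdown_path`, the use of the host name as the top-level folder, and the `index` fallback are all assumptions made for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_markdown_path(url: str, output_dir: str = "./docs") -> PurePosixPath:
    """Map a page URL to a .md path that mirrors the site's structure."""
    parsed = urlparse(url)
    # Drop empty and dot segments so the result cannot escape output_dir.
    segments = [s for s in parsed.path.split("/") if s and s not in (".", "..")]
    if not segments:
        segments = ["index"]  # assumed name for a site's root page
    segments[-1] += ".md"
    return PurePosixPath(output_dir, parsed.netloc, *segments)


print(url_to_markdown_path("https://stripe.com/docs/payments"))
print(url_to_markdown_path("https://aptos.dev"))
```

Running docpull with `--dry-run` or inspecting the output directory after a small crawl is the reliable way to confirm the actual layout.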

Configuration File (Optional)

output_dir: ./docs
rate_limit: 0.5
sources:
  - stripe
  - nextjs

Run with:

docpull --config config.yaml

Custom Profiles

Easily define profiles for frequently scraped sites.

from docpull.profiles.base import SiteProfile

MY_PROFILE = SiteProfile(
    name="mysite",
    domains={"docs.mysite.com"},
    include_patterns=["/docs/", "/api/"],
)
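
To make the roles of `domains` and `include_patterns` concrete, here is a sketch of how a crawler might decide whether to follow a link. How docpull actually evaluates these fields is not documented here; the substring matching and the `url_allowed` helper below are assumptions for illustration only.

```python
from typing import Iterable, Set
from urllib.parse import urlparse


def url_allowed(
    url: str,
    domains: Set[str],
    include_patterns: Iterable[str],
    exclude_patterns: Iterable[str] = (),
) -> bool:
    """Return True if the URL is on an allowed domain and matches the filters."""
    parsed = urlparse(url)
    if parsed.hostname not in domains:
        return False
    path = parsed.path
    # Exclusions win over inclusions.
    if any(pattern in path for pattern in exclude_patterns):
        return False
    include = list(include_patterns)
    # An empty include list means "accept every path on the domain".
    return not include or any(pattern in path for pattern in include)


domains = {"docs.mysite.com"}
patterns = ["/docs/", "/api/"]
print(url_allowed("https://docs.mysite.com/docs/intro", domains, patterns))
print(url_allowed("https://docs.mysite.com/blog/post", domains, patterns))
```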

Security

  • HTTPS-only
  • Blocks private network IPs
  • 50MB page size limit
  • Timeout controls
  • Validates content-type
  • Playwright sandboxing
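
The first two checks in the list can be illustrated with the standard library. This sketch is not docpull's implementation: `is_safe_url` is a hypothetical helper, and it only inspects IP literals in the URL itself. A production crawler would also need to check the addresses a host name resolves to.

```python
import ipaddress
from urllib.parse import urlparse


def is_safe_url(url: str) -> bool:
    """Reject non-HTTPS URLs and URLs that point at private IP literals."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.hostname:
        return False
    try:
        address = ipaddress.ip_address(parsed.hostname)
    except ValueError:
        # A host name rather than an IP literal; resolved addresses
        # would need their own check, which this sketch omits.
        return True
    return not (address.is_private or address.is_loopback or address.is_link_local)


for candidate in (
    "https://stripe.com/docs",
    "http://stripe.com/docs",
    "https://10.0.0.5/admin",
    "https://[::1]/",
):
    print(candidate, is_safe_url(candidate))
```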

Troubleshooting

  • Installation issues: Run docpull --doctor to diagnose problems
  • Missing dependencies: See TROUBLESHOOTING.md for common fixes
  • Site requires JS: install the Playwright extra (pip install "docpull[js]") and pass --js
  • Slow or rate limited: lower concurrency or raise --rate-limit
  • Large sites: set --max-pages

For detailed troubleshooting, see TROUBLESHOOTING.md.

License

MIT License - see LICENSE file for details

Source Distribution

docpull-1.1.0.tar.gz (34.3 kB)

Uploaded Source

Built Distribution

docpull-1.1.0-py3-none-any.whl (45.2 kB)

Uploaded Python 3
