Pull documentation from the web and convert to clean markdown

docpull

Pull documentation from any website and convert it to clean, AI-ready Markdown.

Requires Python 3.9+. License: MIT.

Install

pip install docpull

Usage

# Basic fetch
docpull https://docs.example.com

# With options
docpull https://aptos.dev --max-pages 100 --output-dir ./docs

# Filter paths
docpull https://docs.example.com --include-paths "/api/*" --exclude-paths "/changelog/*"

# Enable caching for incremental updates
docpull https://docs.example.com --cache

# JavaScript-heavy sites (quote the extra so shells such as zsh do not expand the brackets)
pip install "docpull[js]"
docpull https://spa-site.com --js
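
To make the path filters above concrete, here is a minimal sketch of how patterns such as "/api/*" and "/changelog/*" could be applied to a URL. It assumes shell-style glob matching against the URL path and that excludes take precedence over includes; the function name should_crawl is hypothetical, and this is not docpull's actual matching code.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def should_crawl(url, include=(), exclude=()):
    """Return True if the URL path passes the include/exclude glob patterns.

    Assumption: glob-style matching on the path only, excludes win over includes.
    """
    path = urlsplit(url).path or "/"
    # Any matching exclude pattern rejects the URL outright.
    if any(fnmatchcase(path, pattern) for pattern in exclude):
        return False
    # When include patterns are given, at least one must match.
    if include:
        return any(fnmatchcase(path, pattern) for pattern in include)
    return True
```

Note that in fnmatch, `*` also matches `/`, so "/api/*" covers nested paths such as /api/v1/users. Whether docpull treats nested paths the same way is worth checking with --dry-run.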

Profiles

docpull https://site.com --profile rag      # Optimized for RAG/LLM (default)
docpull https://site.com --profile mirror   # Full site archive with caching
docpull https://site.com --profile quick    # Fast sampling (50 pages, depth 2)

Options

Crawl:
  --max-pages N           Maximum pages to fetch
  --max-depth N           Maximum crawl depth
  --include-paths P       Only crawl matching URL patterns
  --exclude-paths P       Skip matching URL patterns
  --js                    Enable JavaScript rendering

Cache:
  --cache                 Enable caching for incremental updates
  --cache-dir DIR         Cache directory (default: .docpull-cache)
  --cache-ttl DAYS        Days before cache expires (default: 30)

Content:
  --streaming-dedup       Real-time duplicate detection
  --language CODE         Filter by language (e.g., en)

Output:
  --output-dir, -o DIR    Output directory (default: ./docs)
  --dry-run               Show what would be fetched
  --verbose, -v           Verbose output

See docpull --help for all options.
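
When calling the CLI from a script, it can help to assemble the argument list in one place. The sketch below uses only flags listed above; the helper name build_docpull_command and its keyword arguments are hypothetical and not part of docpull.

```python
def build_docpull_command(url, *, max_pages=None, max_depth=None,
                          output_dir=None, cache=False, dry_run=False):
    """Build an argv list for the docpull CLI from the documented flags."""
    cmd = ["docpull", url]
    if max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]
    if max_depth is not None:
        cmd += ["--max-depth", str(max_depth)]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    if cache:
        cmd.append("--cache")
    if dry_run:
        cmd.append("--dry-run")
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True). Starting with dry_run=True is a cheap way to preview what would be fetched.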

Python API

import asyncio
from docpull import Fetcher, DocpullConfig, ProfileName, EventType

async def main():
    config = DocpullConfig(
        url="https://docs.example.com",
        profile=ProfileName.RAG,
        crawl={"max_pages": 100},
        cache={"enabled": True},
    )

    async with Fetcher(config) as fetcher:
        async for event in fetcher.run():
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}: {event.url}")

        print(f"Done: {fetcher.stats.pages_fetched} pages")

asyncio.run(main())

Output

Each page becomes a Markdown file with YAML frontmatter:

---
title: "Getting Started"
source: https://docs.example.com/guide
---

# Getting Started
...
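
For downstream processing, the frontmatter shown above can be separated from the Markdown body with a few lines of standard-library Python. This is a minimal sketch that assumes flat `key: value` frontmatter like the example; the function name split_frontmatter is hypothetical, and a real YAML parser would be the safer choice for nested or multi-line values.

```python
def split_frontmatter(text):
    """Split a document into (metadata dict, Markdown body).

    Assumption: frontmatter is a flat block of `key: value` lines
    delimited by `---`, as in the example output.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the body.
            body = "\n".join(lines[i + 1:])
            return meta, body.lstrip("\n")
        # Split on the first colon only, so URLs in values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    # No closing delimiter found: treat the whole text as body.
    return {}, text
```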

Security

  • HTTPS-only, mandatory robots.txt compliance
  • Blocks private/internal network IPs
  • Path traversal and XXE protection
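
As an illustration of the first two points, the sketch below rejects non-HTTPS URLs and URLs whose host is a literal private, loopback, link-local, or reserved IP address. It is not docpull's implementation; the function name is_url_allowed is hypothetical, and the sketch does not resolve hostnames, so a real check would also have to validate the addresses returned by DNS.

```python
import ipaddress
from urllib.parse import urlsplit


def is_url_allowed(url):
    """Return True for HTTPS URLs that do not point at an internal address.

    Assumption: only literal IP hosts and "localhost" are checked here;
    hostnames are accepted without DNS resolution.
    """
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    host = parts.hostname
    if not host or host == "localhost":
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not a literal IP address; a production check would resolve it.
        return True
    return not (addr.is_private or addr.is_loopback
                or addr.is_link_local or addr.is_reserved)
```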

Troubleshooting

docpull --doctor              # Check installation
docpull URL --verbose         # Verbose output
docpull URL --dry-run         # Test without downloading

License

MIT
