Pull documentation from the web and convert to clean markdown

docpull

Pull documentation from any website and convert it to clean, AI-ready Markdown.

Requires Python 3.9+. License: MIT.

Install

pip install docpull

Usage

# Basic fetch
docpull https://docs.example.com

# With options
docpull https://aptos.dev --max-pages 100 --output-dir ./docs

# Filter paths
docpull https://docs.example.com --include-paths "/api/*" --exclude-paths "/changelog/*"

# Enable caching for incremental updates
docpull https://docs.example.com --cache

# JavaScript-heavy sites
pip install "docpull[js]"
docpull https://spa-site.com --js
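
The `--include-paths` and `--exclude-paths` patterns in the filter example look like shell-style globs. As a rough illustration only, the sketch below shows how such a filter decision could work. It assumes glob matching against the URL path, with exclusions taking priority; this page does not document docpull's actual matching rules. `should_crawl` is a hypothetical helper, not part of docpull.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def should_crawl(url, include_paths=(), exclude_paths=()):
    """Return True if the URL path passes the include/exclude globs.

    Assumption: patterns are shell-style globs matched against the URL path,
    and exclusions win over inclusions. docpull's real rules may differ.
    """
    path = urlparse(url).path or "/"
    # An excluded path is rejected even if it also matches an include pattern.
    if any(fnmatchcase(path, pattern) for pattern in exclude_paths):
        return False
    # With an include list, the path must match at least one pattern.
    if include_paths:
        return any(fnmatchcase(path, pattern) for pattern in include_paths)
    return True  # no include list: everything not excluded is allowed


print(should_crawl("https://docs.example.com/api/users", ["/api/*"], ["/changelog/*"]))
print(should_crawl("https://docs.example.com/changelog/v2", ["/api/*"], ["/changelog/*"]))
```

Under these assumptions the first URL is accepted and the second is rejected. Running `docpull` with `--dry-run` is the reliable way to see what the tool itself would fetch.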

Profiles

docpull https://site.com --profile rag      # Optimized for RAG/LLM (default)
docpull https://site.com --profile mirror   # Full site archive with caching
docpull https://site.com --profile quick    # Fast sampling (50 pages, depth 2)

Options

Crawl:
  --max-pages N           Maximum pages to fetch
  --max-depth N           Maximum crawl depth
  --include-paths P       Only crawl matching URL patterns
  --exclude-paths P       Skip matching URL patterns
  --js                    Enable JavaScript rendering

Cache:
  --cache                 Enable caching for incremental updates
  --cache-dir DIR         Cache directory (default: .docpull-cache)
  --cache-ttl DAYS        Days before cache expires (default: 30)

Content:
  --streaming-dedup       Real-time duplicate detection
  --language CODE         Filter by language (e.g., en)

Output:
  --output-dir, -o DIR    Output directory (default: ./docs)
  --dry-run               Show what would be fetched
  --verbose, -v           Verbose output

See docpull --help for all options.
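
When docpull is driven from a script rather than a shell, it can help to assemble the argument list in code. The sketch below uses only flags from the option list above. `build_docpull_command` is a hypothetical helper for illustration, not something docpull ships.

```python
import shlex


def build_docpull_command(url, max_pages=None, max_depth=None,
                          include_paths=None, exclude_paths=None,
                          cache=False, output_dir=None, dry_run=False):
    """Assemble a docpull argv list from the documented CLI flags."""
    cmd = ["docpull", url]
    if max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]
    if max_depth is not None:
        cmd += ["--max-depth", str(max_depth)]
    if include_paths:
        cmd += ["--include-paths", include_paths]
    if exclude_paths:
        cmd += ["--exclude-paths", exclude_paths]
    if cache:
        cmd.append("--cache")
    if output_dir:
        cmd += ["--output-dir", output_dir]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


cmd = build_docpull_command(
    "https://docs.example.com",
    max_pages=100,
    output_dir="./docs",
    dry_run=True,
)
# shlex.join renders the list as a copy-pasteable shell command.
print(shlex.join(cmd))
# To actually run it (requires docpull to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with glob patterns such as `/api/*`.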

Python API

import asyncio
from docpull import Fetcher, DocpullConfig, ProfileName, EventType

async def main():
    config = DocpullConfig(
        url="https://docs.example.com",
        profile=ProfileName.RAG,
        crawl={"max_pages": 100},
        cache={"enabled": True},
    )

    async with Fetcher(config) as fetcher:
        async for event in fetcher.run():
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}: {event.url}")

        print(f"Done: {fetcher.stats.pages_fetched} pages")

asyncio.run(main())

Output

Each page becomes a Markdown file with YAML frontmatter:

---
title: "Getting Started"
source: https://docs.example.com/guide
---

# Getting Started
...
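
Downstream tools often need the metadata and the body separately. The sketch below reads a file shaped like the example above, assuming the frontmatter is a flat block of `key: value` lines between two `---` lines. It is not a full YAML parser, and `split_frontmatter` is a hypothetical helper, not part of docpull.

```python
def split_frontmatter(text):
    """Split a Markdown document into (metadata dict, body string).

    Assumption: frontmatter is a flat list of `key: value` lines delimited
    by `---`. Nested YAML, lists, and multi-line values are not handled.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter at all
    try:
        end = lines.index("---", 1)
    except ValueError:
        return {}, text  # no closing delimiter: treat everything as body
    metadata = {}
    for line in lines[1:end]:
        # Split on the first colon only, so URLs in values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip().strip('"')
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return metadata, body


sample = '''---
title: "Getting Started"
source: https://docs.example.com/guide
---

# Getting Started
'''
meta, body = split_frontmatter(sample)
print(meta["title"], meta["source"])
```

For anything beyond flat keys, a real YAML library is the safer choice.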

Security

  • HTTPS-only, mandatory robots.txt compliance
  • Blocks private/internal network IPs
  • Path traversal and XXE protection
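
To make the first two bullets concrete, here is a minimal sketch of an HTTPS-only check combined with a private-address check, using Python's standard `ipaddress` module. It illustrates the general technique and is not docpull's implementation. `is_url_allowed` is a hypothetical name, and the sketch only inspects literal IP hosts.

```python
import ipaddress
from urllib.parse import urlparse


def is_url_allowed(url):
    """Allow only HTTPS URLs whose host is not a private or internal IP.

    Assumption: only literal IP hosts are checked. A complete check would
    also resolve hostnames and validate every resulting address.
    """
    parts = urlparse(url)
    if parts.scheme != "https" or not parts.hostname:
        return False
    try:
        address = ipaddress.ip_address(parts.hostname)
    except ValueError:
        return True  # a hostname, not a literal IP; DNS checks are out of scope here
    return not (
        address.is_private
        or address.is_loopback
        or address.is_link_local
        or address.is_reserved
    )


for candidate in ("https://docs.example.com", "http://docs.example.com", "https://10.0.0.5/"):
    print(candidate, is_url_allowed(candidate))
```

Checking only the literal host is not enough against DNS-based tricks, which is why the docstring calls out hostname resolution as a missing step.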

Troubleshooting

docpull --doctor              # Check installation
docpull URL --verbose         # Verbose output
docpull URL --dry-run         # Test without downloading

License

MIT

Download files

Source Distribution

docpull-2.0.0.tar.gz (65.2 kB)

Built Distribution

docpull-2.0.0-py3-none-any.whl (72.7 kB)

File details

Details for the file docpull-2.0.0.tar.gz.

File metadata

  • Download URL: docpull-2.0.0.tar.gz
  • Upload date:
  • Size: 65.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for docpull-2.0.0.tar.gz
Algorithm Hash digest
SHA256 34d2681cb895b3a06b0058f8c5f4b5d12e46548a4637d0c5d48799e7c709249c
MD5 709a26e92a1d445fc62ac6a54e2a9f81
BLAKE2b-256 dd248af3781c02cd3a784f1e5b7f1cd5184e79c3028f7db87f98bfc06c9596b0

Provenance

The following attestation bundles were made for docpull-2.0.0.tar.gz:

Publisher: publish.yml on raintree-technology/docpull

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file docpull-2.0.0-py3-none-any.whl.

File metadata

  • Download URL: docpull-2.0.0-py3-none-any.whl
  • Upload date:
  • Size: 72.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for docpull-2.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 9c26efb36dcbdb36ea185dc39e546111fc6da9ff33ed367d38520a14e4f1a3ed
MD5 13075d0fc66dfe074d28ef17652950a7
BLAKE2b-256 dfdfb5a571322be9d33285c6678f802ac7c931d051340aa468ab85289d2b27c1

Provenance

The following attestation bundles were made for docpull-2.0.0-py3-none-any.whl:

Publisher: publish.yml on raintree-technology/docpull

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
