Pull documentation from the web and convert to clean markdown

docpull

Pull documentation from any website and convert it to clean, AI-ready Markdown.

Install

pip install docpull

Usage

# Basic fetch
docpull https://docs.example.com

# With options
docpull https://aptos.dev --max-pages 100 --output-dir ./docs

# Filter paths
docpull https://docs.example.com --include-paths "/api/*" --exclude-paths "/changelog/*"

# Enable caching for incremental updates
docpull https://docs.example.com --cache

# JavaScript-heavy sites
pip install "docpull[js]"   # quotes keep shells such as zsh from globbing the brackets
docpull https://spa-site.com --js
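
To make the include/exclude filters above concrete, here is a minimal sketch of glob-style path filtering. It is an illustration only: `is_allowed` is a hypothetical helper, and the assumptions that patterns are shell-style globs matched against the URL path, and that excludes take precedence over includes, are not confirmed by docpull's documentation.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def is_allowed(url: str, include: list[str], exclude: list[str]) -> bool:
    """Hypothetical helper: apply include/exclude globs to a URL's path."""
    path = urlparse(url).path or "/"
    # Assumption: an exclude match always wins over an include match.
    if any(fnmatchcase(path, pattern) for pattern in exclude):
        return False
    # Assumption: an empty include list means "no restriction".
    return not include or any(fnmatchcase(path, pattern) for pattern in include)


urls = [
    "https://docs.example.com/api/users",
    "https://docs.example.com/changelog/2025",
    "https://docs.example.com/guide/intro",
]
kept = [u for u in urls if is_allowed(u, ["/api/*"], ["/changelog/*"])]
print(kept)
```

Note that `fnmatchcase` lets `*` match across `/`, so `/api/*` also covers nested paths such as `/api/v2/users`; the real tool may behave differently.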

Profiles

docpull https://site.com --profile rag      # Optimized for RAG/LLM (default)
docpull https://site.com --profile mirror   # Full site archive with caching
docpull https://site.com --profile quick    # Fast sampling (50 pages, depth 2)

Options

Crawl:
  --max-pages N           Maximum pages to fetch
  --max-depth N           Maximum crawl depth
  --include-paths P       Only crawl matching URL patterns
  --exclude-paths P       Skip matching URL patterns
  --js                    Enable JavaScript rendering

Cache:
  --cache                 Enable caching for incremental updates
  --cache-dir DIR         Cache directory (default: .docpull-cache)
  --cache-ttl DAYS        Days before cache expires (default: 30)

Content:
  --streaming-dedup       Real-time duplicate detection
  --language CODE         Filter by language (e.g., en)

Output:
  --output-dir, -o DIR    Output directory (default: ./docs)
  --dry-run               Show what would be fetched
  --verbose, -v           Verbose output

See docpull --help for all options.
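
The `--cache-ttl` option can be pictured as a simple age check on each cached entry. The sketch below is not docpull's implementation: `is_expired` and the idea of storing a per-page fetch timestamp are assumptions made for illustration; only the 30-day default comes from the option list above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_TTL_DAYS = 30  # mirrors the documented --cache-ttl default


def is_expired(fetched_at: datetime, ttl_days: int = DEFAULT_TTL_DAYS,
               now: Optional[datetime] = None) -> bool:
    """Hypothetical helper: True once an entry is older than ttl_days."""
    now = now or datetime.now(timezone.utc)
    return now - fetched_at > timedelta(days=ttl_days)


fetched = datetime(2025, 11, 1, tzinfo=timezone.utc)
# 19 days old: still fresh under a 30-day TTL.
print(is_expired(fetched, now=datetime(2025, 11, 20, tzinfo=timezone.utc)))
# 44 days old: would be refetched.
print(is_expired(fetched, now=datetime(2025, 12, 15, tzinfo=timezone.utc)))
```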

Python API

import asyncio
from docpull import Fetcher, DocpullConfig, ProfileName, EventType

async def main():
    # Describe the crawl: target URL, profile, page limit, and caching.
    config = DocpullConfig(
        url="https://docs.example.com",
        profile=ProfileName.RAG,
        crawl={"max_pages": 100},
        cache={"enabled": True},
    )

    async with Fetcher(config) as fetcher:
        # Events are streamed while the crawl runs; report progress as pages arrive.
        async for event in fetcher.run():
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}: {event.url}")

        # Summary statistics are read once the event stream is exhausted.
        print(f"Done: {fetcher.stats.pages_fetched} pages")

asyncio.run(main())

Output

Each page becomes a Markdown file with YAML frontmatter:

---
title: "Getting Started"
source: https://docs.example.com/guide
---

# Getting Started
...
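
Downstream tools often need the metadata and the body separately. The following is a minimal sketch that splits a page shaped like the sample above using only the standard library. `split_frontmatter` is a hypothetical helper, and it assumes flat `key: value` frontmatter; nested or multi-line YAML would need a real YAML parser.

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Hypothetical helper: return (metadata, body) for a flat-frontmatter page."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter block at the top
    try:
        end = lines.index("---", 1)  # closing delimiter
    except ValueError:
        return {}, text  # unterminated block: treat everything as body
    meta = {}
    for line in lines[1:end]:
        # Split on the first colon only, so URLs in values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return meta, body


page = (
    "---\n"
    'title: "Getting Started"\n'
    "source: https://docs.example.com/guide\n"
    "---\n"
    "\n"
    "# Getting Started\n"
    "...\n"
)
meta, body = split_frontmatter(page)
print(meta["title"], meta["source"])
```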

Security

- HTTPS-only, mandatory robots.txt compliance
- Blocks private/internal network IPs
- Path traversal and XXE protection
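
As an illustration of the first two guarantees, here is a minimal sketch of an HTTPS-only check combined with a private-address filter, built on the standard `ipaddress` module. It is not docpull's code: `is_public_address` and `is_safe_target` are hypothetical names, and the resolved addresses are passed in by the caller so that the example needs no network access.

```python
import ipaddress
from urllib.parse import urlparse


def is_public_address(ip: str) -> bool:
    """Hypothetical helper: reject private, loopback, link-local and similar ranges."""
    addr = ipaddress.ip_address(ip)
    return not (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_reserved
        or addr.is_multicast
        or addr.is_unspecified
    )


def is_safe_target(url: str, resolved_ips: list[str]) -> bool:
    """Hypothetical helper: HTTPS only, and every resolved address must be public."""
    parts = urlparse(url)
    if parts.scheme != "https" or not parts.hostname:
        return False
    # Assumption: the caller has already resolved the hostname to these addresses.
    return bool(resolved_ips) and all(is_public_address(ip) for ip in resolved_ips)


print(is_safe_target("https://docs.example.com", ["8.8.8.8"]))
print(is_safe_target("https://intranet.example", ["10.0.0.5"]))
```

Checking the resolved addresses rather than the hostname string matters because a public-looking name can resolve to an internal address.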

Troubleshooting

docpull --doctor              # Check installation
docpull URL --verbose         # Verbose output
docpull URL --dry-run         # Test without downloading

License

MIT

Project details

Release history

- 3.0.0: Apr 27, 2026
- 2.5.1: Apr 26, 2026
- 2.5.0: Apr 26, 2026
- 2.4.0: Apr 26, 2026
- 2.3.0: Apr 24, 2026
- 2.2.0: Dec 15, 2025
- 2.0.0: Nov 29, 2025 (this version)
- 1.5.0: Nov 28, 2025
- 1.3.0: Nov 20, 2025
- 1.2.1: Nov 17, 2025
- 1.2.0: Nov 16, 2025
- 1.1.0: Nov 14, 2025
- 1.0.2: Nov 14, 2025
- 1.0.1: Nov 14, 2025

Download files

Source Distribution

docpull-2.0.0.tar.gz (65.2 kB)

Uploaded Nov 29, 2025 Source

Built Distribution

docpull-2.0.0-py3-none-any.whl (72.7 kB)

Uploaded Nov 29, 2025 Python 3

File details

Details for the file docpull-2.0.0.tar.gz.

File metadata

Download URL: docpull-2.0.0.tar.gz
Upload date: Nov 29, 2025
Size: 65.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for docpull-2.0.0.tar.gz
Algorithm	Hash digest
SHA256	`34d2681cb895b3a06b0058f8c5f4b5d12e46548a4637d0c5d48799e7c709249c`
MD5	`709a26e92a1d445fc62ac6a54e2a9f81`
BLAKE2b-256	`dd248af3781c02cd3a784f1e5b7f1cd5184e79c3028f7db87f98bfc06c9596b0`
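
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: `sha256_of` and `matches` are hypothetical helper names, and the file path is whatever location the archive was saved to.

```python
import hashlib
import hmac


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Stream the file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected_hex: str) -> bool:
    """Compare the computed digest with the published one."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

For example, `matches("docpull-2.0.0.tar.gz", "34d2681cb895b3a06b0058f8c5f4b5d12e46548a4637d0c5d48799e7c709249c")` should return `True` for an intact copy of the source distribution.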

Provenance

The following attestation bundles were made for docpull-2.0.0.tar.gz:

Publisher: publish.yml on raintree-technology/docpull

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docpull-2.0.0.tar.gz
- Subject digest: 34d2681cb895b3a06b0058f8c5f4b5d12e46548a4637d0c5d48799e7c709249c
- Sigstore transparency entry: 731772525
- Sigstore integration time: Nov 29, 2025
Source repository:
- Permalink: raintree-technology/docpull@a81b33c10b1c37894273d79a19aa6eae276c7f9f
- Branch / Tag: refs/tags/v2.0.0
- Owner: https://github.com/raintree-technology
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@a81b33c10b1c37894273d79a19aa6eae276c7f9f
- Trigger Event: release

File details

Details for the file docpull-2.0.0-py3-none-any.whl.

File metadata

Download URL: docpull-2.0.0-py3-none-any.whl
Upload date: Nov 29, 2025
Size: 72.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for docpull-2.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9c26efb36dcbdb36ea185dc39e546111fc6da9ff33ed367d38520a14e4f1a3ed`
MD5	`13075d0fc66dfe074d28ef17652950a7`
BLAKE2b-256	`dfdfb5a571322be9d33285c6678f802ac7c931d051340aa468ab85289d2b27c1`

Provenance

The following attestation bundles were made for docpull-2.0.0-py3-none-any.whl:

Publisher: publish.yml on raintree-technology/docpull

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docpull-2.0.0-py3-none-any.whl
- Subject digest: 9c26efb36dcbdb36ea185dc39e546111fc6da9ff33ed367d38520a14e4f1a3ed
- Sigstore transparency entry: 731772527
- Sigstore integration time: Nov 29, 2025
Source repository:
- Permalink: raintree-technology/docpull@a81b33c10b1c37894273d79a19aa6eae276c7f9f
- Branch / Tag: refs/tags/v2.0.0
- Owner: https://github.com/raintree-technology
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@a81b33c10b1c37894273d79a19aa6eae276c7f9f
- Trigger Event: release
