Pull documentation from the web and convert to clean markdown
docpull
Pull documentation from any website and convert it to clean, AI-ready Markdown.
Install
pip install docpull
Usage
# Basic fetch
docpull https://docs.example.com
# With options
docpull https://aptos.dev --max-pages 100 --output-dir ./docs
# Filter paths
docpull https://docs.example.com --include-paths "/api/*" --exclude-paths "/changelog/*"
# Enable caching for incremental updates
docpull https://docs.example.com --cache
# JavaScript-heavy sites
pip install "docpull[js]"
docpull https://spa-site.com --js
Profiles
docpull https://site.com --profile rag # Optimized for RAG/LLM (default)
docpull https://site.com --profile mirror # Full site archive with caching
docpull https://site.com --profile quick # Fast sampling (50 pages, depth 2)
Options
Crawl:
  --max-pages N         Maximum pages to fetch
  --max-depth N         Maximum crawl depth
  --include-paths P     Only crawl matching URL patterns
  --exclude-paths P     Skip matching URL patterns
  --js                  Enable JavaScript rendering
Cache:
  --cache               Enable caching for incremental updates
  --cache-dir DIR       Cache directory (default: .docpull-cache)
  --cache-ttl DAYS      Days before cache expires (default: 30)
Content:
  --streaming-dedup     Real-time duplicate detection
  --language CODE       Filter by language (e.g., en)
Output:
  --output-dir, -o DIR  Output directory (default: ./docs)
  --dry-run             Show what would be fetched
  --verbose, -v         Verbose output
See docpull --help for all options.
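The --include-paths and --exclude-paths options take URL path patterns such as "/api/*". As a rough illustration of how glob-style filtering of this kind can work, here is a minimal sketch built on Python's standard fnmatch module. The helper name should_crawl and the rule that exclusions take priority are assumptions made for the example; docpull's own matcher may behave differently.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def should_crawl(url, include=(), exclude=()):
    """Return True if the URL path passes glob-style include/exclude filters.

    Hypothetical helper: illustrates the idea only, not docpull's own code.
    """
    path = urlsplit(url).path or "/"
    # Assumption: an exclude match always wins over an include match.
    if any(fnmatchcase(path, pattern) for pattern in exclude):
        return False
    # With no include patterns, everything that was not excluded is allowed.
    if not include:
        return True
    return any(fnmatchcase(path, pattern) for pattern in include)
```

Use --dry-run to confirm which pages docpull itself would fetch for a given pair of patterns.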
Python API
import asyncio
from docpull import Fetcher, DocpullConfig, ProfileName, EventType

async def main():
    config = DocpullConfig(
        url="https://docs.example.com",
        profile=ProfileName.RAG,
        crawl={"max_pages": 100},
        cache={"enabled": True},
    )

    async with Fetcher(config) as fetcher:
        async for event in fetcher.run():
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}: {event.url}")

    print(f"Done: {fetcher.stats.pages_fetched} pages")

asyncio.run(main())
Output
Each page becomes a Markdown file with YAML frontmatter:
---
title: "Getting Started"
source: https://docs.example.com/guide
---
# Getting Started
...
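Downstream tools often need the title and source fields back out of each file. The following is a minimal sketch of reading that frontmatter with the standard library only. The function name split_frontmatter is hypothetical, and the parser assumes flat key: value lines like the sample above; a real YAML parser is the safer choice if the frontmatter ever contains nested or multi-line values.

```python
def split_frontmatter(text):
    """Split a Markdown document into (metadata dict, body string).

    Handles only flat `key: value` lines, as in the sample output above.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter found: everything after it is the body.
            body = "\n".join(lines[index + 1:])
            return meta, body
        # Split on the first colon only, so URLs in values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    # No closing delimiter: treat the whole text as body.
    return {}, text


sample = '''---
title: "Getting Started"
source: https://docs.example.com/guide
---
# Getting Started
'''
meta, body = split_frontmatter(sample)
```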
Security
- HTTPS-only, mandatory robots.txt compliance
- Blocks private/internal network IPs
- Path traversal and XXE protection
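To make the first two guarantees concrete, here is a minimal sketch of an HTTPS-only check combined with a private-address block, using Python's standard ipaddress module. The helper name is_allowed_target and the exact set of rejected address classes are assumptions for illustration; this is not docpull's implementation.

```python
import ipaddress
from urllib.parse import urlsplit


def is_allowed_target(url, resolved_ip):
    """Accept only HTTPS URLs whose resolved address is not internal.

    Hypothetical helper; `resolved_ip` is the address the hostname resolved to.
    """
    if urlsplit(url).scheme != "https":
        return False
    address = ipaddress.ip_address(resolved_ip)
    # Reject private, loopback, link-local and other non-public ranges.
    return not (
        address.is_private
        or address.is_loopback
        or address.is_link_local
        or address.is_reserved
        or address.is_multicast
        or address.is_unspecified
    )
```

Checking the resolved address rather than the hostname matters here, because a public-looking name can resolve to an internal IP.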
Troubleshooting
docpull --doctor # Check installation
docpull URL --verbose # Verbose output
docpull URL --dry-run # Test without downloading
License
MIT
File details
Details for the file docpull-2.0.0.tar.gz.
File metadata
- Download URL: docpull-2.0.0.tar.gz
- Upload date:
- Size: 65.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 34d2681cb895b3a06b0058f8c5f4b5d12e46548a4637d0c5d48799e7c709249c |
| MD5 | 709a26e92a1d445fc62ac6a54e2a9f81 |
| BLAKE2b-256 | dd248af3781c02cd3a784f1e5b7f1cd5184e79c3028f7db87f98bfc06c9596b0 |
Provenance
The following attestation bundles were made for docpull-2.0.0.tar.gz:
Publisher: publish.yml on raintree-technology/docpull
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: docpull-2.0.0.tar.gz
  - Subject digest: 34d2681cb895b3a06b0058f8c5f4b5d12e46548a4637d0c5d48799e7c709249c
- Sigstore transparency entry: 731772525
- Sigstore integration time:
- Permalink: raintree-technology/docpull@a81b33c10b1c37894273d79a19aa6eae276c7f9f
- Branch / Tag: refs/tags/v2.0.0
- Owner: https://github.com/raintree-technology
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@a81b33c10b1c37894273d79a19aa6eae276c7f9f
- Trigger Event: release
File details
Details for the file docpull-2.0.0-py3-none-any.whl.
File metadata
- Download URL: docpull-2.0.0-py3-none-any.whl
- Upload date:
- Size: 72.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9c26efb36dcbdb36ea185dc39e546111fc6da9ff33ed367d38520a14e4f1a3ed |
| MD5 | 13075d0fc66dfe074d28ef17652950a7 |
| BLAKE2b-256 | dfdfb5a571322be9d33285c6678f802ac7c931d051340aa468ab85289d2b27c1 |
Provenance
The following attestation bundles were made for docpull-2.0.0-py3-none-any.whl:
Publisher: publish.yml on raintree-technology/docpull
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: docpull-2.0.0-py3-none-any.whl
  - Subject digest: 9c26efb36dcbdb36ea185dc39e546111fc6da9ff33ed367d38520a14e4f1a3ed
- Sigstore transparency entry: 731772527
- Sigstore integration time:
- Permalink: raintree-technology/docpull@a81b33c10b1c37894273d79a19aa6eae276c7f9f
- Branch / Tag: refs/tags/v2.0.0
- Owner: https://github.com/raintree-technology
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@a81b33c10b1c37894273d79a19aa6eae276c7f9f
- Trigger Event: release