docscrape
Scrape any documentation site to Markdown in seconds.
docscrape converts any documentation website into clean Markdown files perfect for:
- AI/LLM Context - Feed docs to Claude, GPT, or local models
- Offline Reading - Access docs without internet
- RAG Pipelines - Build searchable knowledge bases
- Development Context - Keep reference docs in your project
Quick Start
# Install (with uv)
uv tool install docscrape
# Or with pip
pip install docscrape
# Scrape any docs - just paste the URL
docscrape https://docs.pipecat.ai
That's it! Output is auto-saved to ./pipecat/ (derived from URL).
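The folder name is derived from the docs hostname. As a rough illustration of how such a heuristic can work (this is a sketch, not docscrape's actual implementation), one can strip common prefixes like `docs.` and `www.` and take the first remaining label:

```python
from urllib.parse import urlparse

def default_output_dir(url: str) -> str:
    """Guess an output folder from a docs URL, e.g.
    https://docs.pipecat.ai -> ./pipecat/ (illustrative heuristic only)."""
    host = urlparse(url).netloc
    labels = [p for p in host.split(".") if p not in ("docs", "www")]
    return f"./{labels[0]}/"
```

Use `-o` whenever the guessed name is not what you want.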
Installation
Using pip
# From PyPI
pip install docscrape
# From GitHub (latest)
pip install git+https://github.com/Abdulrahman-Elsmmany/docscrape
Using uv (recommended)
# Install globally
uv tool install docscrape
# Or from GitHub
uv tool install git+https://github.com/Abdulrahman-Elsmmany/docscrape
# Run without installing
uvx docscrape https://docs.example.com
For Development
git clone https://github.com/Abdulrahman-Elsmmany/docscrape
cd docscrape
# With uv (recommended)
uv venv
uv pip install -e ".[dev]"
# Or with pip
pip install -e ".[dev]"
Usage
Basic Usage
# Scrape docs - output auto-detected from URL
docscrape https://docs.example.com
# Custom output directory
docscrape https://docs.example.com -o ./my-docs
# Limit pages (useful for testing)
docscrape https://docs.example.com -m 50
# Verbose output
docscrape https://docs.example.com -v
Resume Interrupted Scrapes
# Start a scrape
docscrape https://docs.example.com -v
# ... the connection drops, you press Ctrl+C, etc. ...
# Resume from where you left off
docscrape https://docs.example.com -r
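Resume works because every run records what it has already saved (the `_manifest.json` shown under Output Structure). A minimal sketch of the idea, assuming a hypothetical manifest schema of `{"pages": [{"url": ...}, ...]}` (the real schema may differ):

```python
import json
from pathlib import Path

def scraped_urls(output_dir: str) -> set[str]:
    """Collect URLs saved by a previous run so a resumed scrape can skip them.
    The manifest schema here is an assumption for illustration."""
    manifest = Path(output_dir) / "_manifest.json"
    if not manifest.exists():
        return set()
    data = json.loads(manifest.read_text())
    return {page["url"] for page in data.get("pages", [])}
```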
Filter URLs
# Only include certain paths
docscrape https://docs.example.com -i "/guides/"
# Exclude certain paths
docscrape https://docs.example.com -e "/api-reference/"
# Combine filters
docscrape https://docs.example.com -i "/guides/" -e "/deprecated/"
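Both flags take regular expressions. Conceptually, include patterns whitelist and exclude patterns veto; a sketch of how the two could combine (illustrative, not docscrape's exact logic):

```python
import re

def keep_url(url: str, include=(), exclude=()) -> bool:
    """A URL is kept if it matches at least one include pattern (when any
    are given) and matches no exclude pattern."""
    if include and not any(re.search(p, url) for p in include):
        return False
    return not any(re.search(p, url) for p in exclude)
```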
Command Reference
docscrape [URL] [OPTIONS]
Arguments:
URL Documentation URL to scrape
Options:
-o, --output PATH Output directory [default: auto-detected]
-m, --max-pages INT Maximum pages to scrape (0 = unlimited)
-d, --delay FLOAT Delay between requests in seconds [default: 0.5]
-r, --resume Resume from previous scrape
-v, --verbose Show detailed progress
-i, --include PATTERN URL patterns to include (regex)
-e, --exclude PATTERN URL patterns to exclude (regex)
-V, --version Show version
--help Show help
List Optimized Platforms
docscrape platforms
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Platform ┃ Base URL ┃ Discovery ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ livekit │ https://docs.livekit.io │ llms_txt │
│ pipecat │ https://docs.pipecat.ai │ sitemap │
│ retellai │ https://docs.retellai.com │ sitemap │
└──────────┴────────────────────────────┴───────────┘
Note: Any documentation site works! These platforms have optimized adapters.
Output Structure
./pipecat/
├── _index.md # Human-readable index
├── _manifest.json # Machine-readable metadata
├── index.md # Homepage
├── quickstart.md
├── guides/
│ ├── getting-started.md
│ └── advanced.md
└── api/
└── overview.md
Markdown Files
Each file includes YAML frontmatter:
---
title: "Getting Started with Pipecat"
url: https://docs.pipecat.ai/guides/getting-started
scraped_at: 2024-01-15T10:30:00
word_count: 1523
---
# Getting Started with Pipecat
...
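Downstream tools (a RAG ingester, for instance) can split that frontmatter off with a few lines of stdlib Python; a minimal sketch, assuming files shaped exactly like the example above:

```python
def split_frontmatter(text: str) -> tuple[str, str]:
    """Separate the YAML frontmatter block from the Markdown body.
    Expects the layout ---\\n<yaml>\\n---\\n<body>."""
    if text.startswith("---\n"):
        end = text.index("\n---\n", 4)
        return text[4:end], text[end + 5:]
    return "", text
```

Feed the first element to a YAML parser if you need the metadata as a dict.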
Features
| Feature | Description |
|---|---|
| Universal | Works with any documentation site |
| Smart Defaults | Auto-detects output folder from URL |
| Resumable | Continue interrupted scrapes with -r |
| Clean Output | Markdown with YAML frontmatter |
| Rate Limited | Respects servers with configurable delays |
| Optimized Adapters | Better extraction for known platforms |
Discovery Strategies
docscrape uses multiple strategies to find documentation pages:
- llms.txt - Many docs provide an LLM-friendly index
- sitemap.xml - Standard sitemap discovery
- Recursive Crawl - Follow links when no sitemap exists
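The first two strategies amount to probing well-known paths off the site root; a sketch of the attempt order (illustrative only, not docscrape's internals):

```python
def discovery_candidates(base_url: str) -> list[str]:
    """Well-known index files to try, in order of preference.
    If neither exists, the fallback is a recursive link crawl."""
    root = base_url.rstrip("/")
    return [f"{root}/llms.txt", f"{root}/sitemap.xml"]
```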
Architecture
docscrape/
├── cli.py # Command-line interface
├── core/
│ ├── models.py # Data models (ScrapeConfig, DocumentPage, etc.)
│ └── interfaces.py # Abstract base classes
├── adapters/
│ ├── factory.py # Platform auto-detection
│ ├── generic.py # Works with any site
│ ├── livekit.py # LiveKit-specific
│ ├── pipecat.py # Pipecat-specific
│ └── retellai.py # RetellAI-specific
├── discovery/
│ ├── sitemap.py # Sitemap.xml parsing
│ ├── llms_txt.py # llms.txt parsing
│ └── recursive.py # Link crawling
├── engine/
│ └── crawler.py # Async crawl orchestration
└── storage/
└── filesystem.py # Local file storage
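At its core, the engine is an async fetch loop that pauses between requests, which is what the `-d/--delay` option controls. A deliberately simplified sketch of that idea (the real crawler is more elaborate):

```python
import asyncio

async def polite_crawl(urls, fetch, delay: float = 0.5):
    """Fetch pages sequentially with a fixed pause between requests,
    so the target server is never hammered."""
    pages = []
    for url in urls:
        pages.append(await fetch(url))
        await asyncio.sleep(delay)
    return pages
```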
Adding Custom Adapters
Create optimized adapters for specific documentation sites:
from docscrape.adapters.generic import GenericAdapter
from docscrape.adapters.factory import PlatformAdapterFactory

class MyDocsAdapter(GenericAdapter):
    BASE_URL = "https://docs.mysite.com"

    def __init__(self):
        super().__init__(
            base_url=self.BASE_URL,
            content_selectors=["article", "main"],
        )

    @property
    def name(self) -> str:
        return "mysite"

    def should_skip(self, url: str) -> bool:
        return "/changelog/" in url

# Register the adapter
PlatformAdapterFactory.register_platform(
    "mysite",
    MyDocsAdapter,
    url_patterns=["docs.mysite.com"],
)
Development
# Clone the repo
git clone https://github.com/Abdulrahman-Elsmmany/docscrape
cd docscrape
# Setup with uv (recommended)
uv venv
uv pip install -e ".[dev]"
# Or with pip
pip install -e ".[dev]"
# Run tests
pytest
# Run linter
ruff check src/
# Type checking
mypy src/
License
MIT License - see LICENSE for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Made with ❤️ by Abdulrahman Elsmmany
File details
Details for the file docscrape-0.2.0.tar.gz.
File metadata
- Filename: docscrape-0.2.0.tar.gz
- Size: 67.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 40149fdb4b8083b471e51ca6fc8f9ca80d4214a2d715e457c7594de598cb7d12 |
| MD5 | 78f14eaa4409f8e7fc1bb3fe1a983498 |
| BLAKE2b-256 | f47e4b59cf770bb0f46a2c2b9521fedde7ff7fda78e5582c94dd4980f1a5587b |
File details
Details for the file docscrape-0.2.0-py3-none-any.whl.
File metadata
- Filename: docscrape-0.2.0-py3-none-any.whl
- Size: 31.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a7353c5999c8144843cf3cda688d3b96077bb27102424d3bae1cdb8272a94880 |
| MD5 | 725adb73353d1474e4b35651fcaf1c21 |
| BLAKE2b-256 | a5497dd76b2f2cfe6d3a2bebbd52823800a64e6fcf1db18aec5063796e1e2e65 |