docscrape
Scrape any documentation site to Markdown in seconds.
docscrape converts any documentation website into clean Markdown files perfect for:
- AI/LLM Context - Feed docs to Claude, GPT, or local models
- Offline Reading - Access docs without internet
- RAG Pipelines - Build searchable knowledge bases
- Development Context - Keep reference docs in your project
Quick Start
# Install (with uv)
uv tool install docscrape
# Or with pip
pip install docscrape
# Scrape any docs - just paste the URL
docscrape https://docs.pipecat.ai
That's it! Output is auto-saved to ./pipecat/ (derived from URL).
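The folder name is derived from the docs hostname. As a rough illustration of how such a heuristic can work (this is a sketch, not docscrape's actual implementation), one can strip common prefixes like `docs.` and `www.` and take the first remaining label:

```python
from urllib.parse import urlparse

def default_output_dir(url: str) -> str:
    """Guess an output folder from a docs URL, e.g.
    https://docs.pipecat.ai -> ./pipecat/ (illustrative heuristic only)."""
    host = urlparse(url).netloc
    labels = [p for p in host.split(".") if p not in ("docs", "www")]
    return f"./{labels[0]}/"
```

Use `-o` whenever the guessed name is not what you want.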
Installation
Using pip
# From PyPI
pip install docscrape
# From GitHub (latest)
pip install git+https://github.com/Abdulrahman-Elsmmany/docscrape
Using uv (recommended)
# Install globally
uv tool install docscrape
# Or from GitHub
uv tool install git+https://github.com/Abdulrahman-Elsmmany/docscrape
# Run without installing
uvx docscrape https://docs.example.com
For Development
git clone https://github.com/Abdulrahman-Elsmmany/docscrape
cd docscrape
# With uv (recommended)
uv venv
uv pip install -e ".[dev]"
# Or with pip
pip install -e ".[dev]"
Usage
Basic Usage
# Scrape docs - output auto-detected from URL
docscrape https://docs.example.com
# Custom output directory
docscrape https://docs.example.com -o ./my-docs
# Limit pages (useful for testing)
docscrape https://docs.example.com -m 50
# Verbose output
docscrape https://docs.example.com -v
Resume Interrupted Scrapes
# Start a scrape
docscrape https://docs.example.com -v
# ... the connection drops, you press Ctrl+C, etc. ...
# Resume from where you left off
docscrape https://docs.example.com -r
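Resume works because every run records what it has already saved (the `_manifest.json` shown under Output Structure). A minimal sketch of the idea, assuming a hypothetical manifest schema of `{"pages": [{"url": ...}, ...]}` (the real schema may differ):

```python
import json
from pathlib import Path

def scraped_urls(output_dir: str) -> set[str]:
    """Collect URLs saved by a previous run so a resumed scrape can skip them.
    The manifest schema here is an assumption for illustration."""
    manifest = Path(output_dir) / "_manifest.json"
    if not manifest.exists():
        return set()
    data = json.loads(manifest.read_text())
    return {page["url"] for page in data.get("pages", [])}
```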
Filter URLs
# Only include certain paths
docscrape https://docs.example.com -i "/guides/"
# Exclude certain paths
docscrape https://docs.example.com -e "/api-reference/"
# Combine filters
docscrape https://docs.example.com -i "/guides/" -e "/deprecated/"
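Both flags take regular expressions. Conceptually, include patterns whitelist and exclude patterns veto; a sketch of how the two could combine (illustrative, not docscrape's exact logic):

```python
import re

def keep_url(url: str, include=(), exclude=()) -> bool:
    """A URL is kept if it matches at least one include pattern (when any
    are given) and matches no exclude pattern."""
    if include and not any(re.search(p, url) for p in include):
        return False
    return not any(re.search(p, url) for p in exclude)
```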
Command Reference
docscrape [URL] [OPTIONS]
Arguments:
URL Documentation URL to scrape
Options:
-o, --output PATH Output directory [default: auto-detected]
-m, --max-pages INT Maximum pages to scrape (0 = unlimited)
-d, --delay FLOAT Delay between requests in seconds [default: 0.5]
-r, --resume Resume from previous scrape
-v, --verbose Show detailed progress
-i, --include PATTERN URL patterns to include (regex)
-e, --exclude PATTERN URL patterns to exclude (regex)
-V, --version Show version
--help Show help
List Optimized Platforms
docscrape platforms
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Platform ┃ Base URL ┃ Discovery ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ livekit │ https://docs.livekit.io │ llms_txt │
│ pipecat │ https://docs.pipecat.ai │ sitemap │
│ retellai │ https://docs.retellai.com │ sitemap │
└──────────┴────────────────────────────┴───────────┘
Note: Any documentation site works! These platforms have optimized adapters.
Output Structure
./pipecat/
├── _index.md # Human-readable index
├── _manifest.json # Machine-readable metadata
├── index.md # Homepage
├── quickstart.md
├── guides/
│ ├── getting-started.md
│ └── advanced.md
└── api/
└── overview.md
Markdown Files
Each file includes YAML frontmatter:
---
title: "Getting Started with Pipecat"
url: https://docs.pipecat.ai/guides/getting-started
scraped_at: 2024-01-15T10:30:00
word_count: 1523
---
# Getting Started with Pipecat
...
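Downstream tools (a RAG ingester, for instance) can split that frontmatter off with a few lines of stdlib Python; a minimal sketch, assuming files shaped exactly like the example above:

```python
def split_frontmatter(text: str) -> tuple[str, str]:
    """Separate the YAML frontmatter block from the Markdown body.
    Expects the layout ---\\n<yaml>\\n---\\n<body>."""
    if text.startswith("---\n"):
        end = text.index("\n---\n", 4)
        return text[4:end], text[end + 5:]
    return "", text
```

Feed the first element to a YAML parser if you need the metadata as a dict.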
Features
| Feature | Description |
|---|---|
| Universal | Works with any documentation site |
| Smart Defaults | Auto-detects output folder from URL |
| Resumable | Continue interrupted scrapes with -r |
| Clean Output | Markdown with YAML frontmatter |
| Rate Limited | Respects servers with configurable delays |
| Optimized Adapters | Better extraction for known platforms |
Discovery Strategies
docscrape uses multiple strategies to find documentation pages:
- llms.txt - Many docs provide an LLM-friendly index
- sitemap.xml - Standard sitemap discovery
- Recursive Crawl - Follow links when no sitemap exists
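The first two strategies amount to probing well-known paths off the site root; a sketch of the attempt order (illustrative only, not docscrape's internals):

```python
def discovery_candidates(base_url: str) -> list[str]:
    """Well-known index files to try, in order of preference.
    If neither exists, the fallback is a recursive link crawl."""
    root = base_url.rstrip("/")
    return [f"{root}/llms.txt", f"{root}/sitemap.xml"]
```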
Architecture
docscrape/
├── cli.py # Command-line interface
├── core/
│ ├── models.py # Data models (ScrapeConfig, DocumentPage, etc.)
│ └── interfaces.py # Abstract base classes
├── adapters/
│ ├── factory.py # Platform auto-detection
│ ├── generic.py # Works with any site
│ ├── livekit.py # LiveKit-specific
│ ├── pipecat.py # Pipecat-specific
│ └── retellai.py # RetellAI-specific
├── discovery/
│ ├── sitemap.py # Sitemap.xml parsing
│ ├── llms_txt.py # llms.txt parsing
│ └── recursive.py # Link crawling
├── engine/
│ └── crawler.py # Async crawl orchestration
└── storage/
└── filesystem.py # Local file storage
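At its core, the engine is an async fetch loop that pauses between requests, which is what the `-d/--delay` option controls. A deliberately simplified sketch of that idea (the real crawler is more elaborate):

```python
import asyncio

async def polite_crawl(urls, fetch, delay: float = 0.5):
    """Fetch pages sequentially with a fixed pause between requests,
    so the target server is never hammered."""
    pages = []
    for url in urls:
        pages.append(await fetch(url))
        await asyncio.sleep(delay)
    return pages
```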
Adding Custom Adapters
Create optimized adapters for specific documentation sites:
from docscrape.adapters.generic import GenericAdapter
from docscrape.adapters.factory import PlatformAdapterFactory

class MyDocsAdapter(GenericAdapter):
    BASE_URL = "https://docs.mysite.com"

    def __init__(self):
        super().__init__(
            base_url=self.BASE_URL,
            content_selectors=["article", "main"],
        )

    @property
    def name(self) -> str:
        return "mysite"

    def should_skip(self, url: str) -> bool:
        return "/changelog/" in url

# Register the adapter
PlatformAdapterFactory.register_platform(
    "mysite",
    MyDocsAdapter,
    url_patterns=["docs.mysite.com"],
)
Development
# Clone the repo
git clone https://github.com/Abdulrahman-Elsmmany/docscrape
cd docscrape
# Setup with uv (recommended)
uv venv
uv pip install -e ".[dev]"
# Or with pip
pip install -e ".[dev]"
# Run tests
pytest
# Run linter
ruff check src/
# Type checking
mypy src/
License
MIT License - see LICENSE for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Made with ❤️ by Abdulrahman Elsmmany
File details
Details for the file docscrape-0.2.0.tar.gz.
File metadata
- Filename: docscrape-0.2.0.tar.gz
- Size: 67.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 40149fdb4b8083b471e51ca6fc8f9ca80d4214a2d715e457c7594de598cb7d12 |
| MD5 | 78f14eaa4409f8e7fc1bb3fe1a983498 |
| BLAKE2b-256 | f47e4b59cf770bb0f46a2c2b9521fedde7ff7fda78e5582c94dd4980f1a5587b |
File details
Details for the file docscrape-0.2.0-py3-none-any.whl.
File metadata
- Filename: docscrape-0.2.0-py3-none-any.whl
- Size: 31.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a7353c5999c8144843cf3cda688d3b96077bb27102424d3bae1cdb8272a94880 |
| MD5 | 725adb73353d1474e4b35651fcaf1c21 |
| BLAKE2b-256 | a5497dd76b2f2cfe6d3a2bebbd52823800a64e6fcf1db18aec5063796e1e2e65 |