
docscrape

Scrape any documentation site to Markdown in seconds.

Python 3.10+ · License: MIT · Code style: ruff

docscrape converts any documentation website into clean Markdown files perfect for:

  • AI/LLM Context - Feed docs to Claude, GPT, or local models
  • Offline Reading - Access docs without internet
  • RAG Pipelines - Build searchable knowledge bases
  • Development Context - Keep reference docs in your project

Quick Start

# Install (with uv)
uv tool install docscrape

# Or with pip
pip install docscrape

# Scrape any docs - just paste the URL
docscrape https://docs.pipecat.ai

That's it! Output is auto-saved to ./pipecat/ (derived from URL).
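docscrape's actual implementation may differ, but the auto-detected folder name can be sketched roughly like this: take the hostname, drop common prefixes such as docs. and www., and keep the project part (derive_output_dir is a hypothetical helper, not docscrape's API).

```python
from urllib.parse import urlparse

def derive_output_dir(url: str) -> str:
    """Derive a folder name from a docs URL,
    e.g. https://docs.pipecat.ai -> "pipecat"."""
    host = urlparse(url).hostname or ""
    # Drop common docs prefixes; the first remaining label is the project.
    parts = [p for p in host.split(".") if p not in ("docs", "www")]
    return parts[0] if parts else "output"
```

So https://docs.pipecat.ai would map to ./pipecat/, matching the example above.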

Installation

Using pip

# From PyPI
pip install docscrape

# From GitHub (latest)
pip install git+https://github.com/Abdulrahman-Elsmmany/docscrape

Using uv (recommended)

# Install globally
uv tool install docscrape

# Or from GitHub
uv tool install git+https://github.com/Abdulrahman-Elsmmany/docscrape

# Run without installing
uvx docscrape https://docs.example.com

For Development

git clone https://github.com/Abdulrahman-Elsmmany/docscrape
cd docscrape

# With uv (recommended)
uv venv
uv pip install -e ".[dev]"

# Or with pip
pip install -e ".[dev]"

Usage

Basic Usage

# Scrape docs - output auto-detected from URL
docscrape https://docs.example.com

# Custom output directory
docscrape https://docs.example.com -o ./my-docs

# Limit pages (useful for testing)
docscrape https://docs.example.com -m 50

# Verbose output
docscrape https://docs.example.com -v

Resume Interrupted Scrapes

# Start a scrape
docscrape https://docs.example.com -v

# ... connection drops, press Ctrl+C, etc ...

# Resume from where you left off
docscrape https://docs.example.com -r

Filter URLs

# Only include certain paths
docscrape https://docs.example.com -i "/guides/"

# Exclude certain paths
docscrape https://docs.example.com -e "/api-reference/"

# Combine filters
docscrape https://docs.example.com -i "/guides/" -e "/deprecated/"
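Since -i and -e take regex patterns, the filtering logic amounts to two re.search calls. A minimal sketch (url_passes is a hypothetical helper, not part of docscrape's API):

```python
import re
from typing import Optional

def url_passes(url: str,
               include: Optional[str] = None,
               exclude: Optional[str] = None) -> bool:
    """True if `url` matches the include pattern (when given)
    and does not match the exclude pattern (when given)."""
    if include and not re.search(include, url):
        return False
    if exclude and re.search(exclude, url):
        return False
    return True
```

With both filters set, a URL must match the include pattern and avoid the exclude pattern, which is why combining them narrows the crawl.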

Command Reference

docscrape [URL] [OPTIONS]

Arguments:
  URL                    Documentation URL to scrape

Options:
  -o, --output PATH      Output directory [default: auto-detected]
  -m, --max-pages INT    Maximum pages to scrape (0 = unlimited)
  -d, --delay FLOAT      Delay between requests in seconds [default: 0.5]
  -r, --resume           Resume from previous scrape
  -v, --verbose          Show detailed progress
  -i, --include PATTERN  URL patterns to include (regex)
  -e, --exclude PATTERN  URL patterns to exclude (regex)
  -V, --version          Show version
  --help                 Show help

List Optimized Platforms

docscrape platforms
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Platform ┃ Base URL                   ┃ Discovery ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ livekit  │ https://docs.livekit.io    │ llms_txt  │
│ pipecat  │ https://docs.pipecat.ai    │ sitemap   │
│ retellai │ https://docs.retellai.com  │ sitemap   │
└──────────┴────────────────────────────┴───────────┘
Note: Any documentation site works! These platforms have optimized adapters.

Output Structure

./pipecat/
├── _index.md           # Human-readable index
├── _manifest.json      # Machine-readable metadata
├── index.md            # Homepage
├── quickstart.md
├── guides/
│   ├── getting-started.md
│   └── advanced.md
└── api/
    └── overview.md

Markdown Files

Each file includes YAML frontmatter:

---
title: "Getting Started with Pipecat"
url: https://docs.pipecat.ai/guides/getting-started
scraped_at: 2024-01-15T10:30:00
word_count: 1523
---

# Getting Started with Pipecat

...
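The frontmatter above is flat key: value pairs, so it can be split off without a YAML dependency. A minimal sketch (parse_frontmatter is a hypothetical helper; a real pipeline would use a proper YAML parser):

```python
def parse_frontmatter(text: str):
    """Split a scraped Markdown file into (metadata, body).

    Handles only the flat key: value frontmatter shown above.
    """
    meta = {}
    if text.startswith("---\n"):
        header, _, body = text[4:].partition("\n---\n")
        for line in header.splitlines():
            # partition on the first colon so URLs keep their "://"
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip().strip('"')
        return meta, body.lstrip("\n")
    return meta, text
```

Note that all values come back as strings; word_count would need an explicit int() cast.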

Features

  • Universal: Works with any documentation site
  • Smart Defaults: Auto-detects the output folder from the URL
  • Resumable: Continue interrupted scrapes with -r
  • Clean Output: Markdown with YAML frontmatter
  • Rate Limited: Respects servers with configurable delays
  • Optimized Adapters: Better extraction for known platforms

Discovery Strategies

docscrape uses multiple strategies to find documentation pages:

  1. llms.txt - Many documentation sites publish an LLM-friendly index
  2. sitemap.xml - Standard sitemap discovery
  3. Recursive Crawl - Follow links when no sitemap exists
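The fallback order above can be sketched as a simple priority chain; docscrape's internals (in discovery/) surely differ, and the probe callables here are hypothetical stand-ins for the llms.txt and sitemap parsers:

```python
from typing import Callable, Iterable, Optional

# Hypothetical probe: returns a list of page URLs, or None if the
# strategy isn't available for this site.
Strategy = Callable[[str], Optional[list]]

def discover(base_url: str, strategies: Iterable[tuple[str, Strategy]]):
    """Try each discovery strategy in priority order; return the name
    of the first one that yields URLs, plus those URLs."""
    for name, probe in strategies:
        urls = probe(base_url)
        if urls:
            return name, urls
    # Nothing structured found: fall back to crawling from the homepage.
    return "recursive", []
```

This matches the Discovery column in the platforms table: LiveKit resolves via llms_txt, while Pipecat and RetellAI resolve via sitemap.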

Architecture

docscrape/
├── cli.py              # Command-line interface
├── core/
│   ├── models.py       # Data models (ScrapeConfig, DocumentPage, etc.)
│   └── interfaces.py   # Abstract base classes
├── adapters/
│   ├── factory.py      # Platform auto-detection
│   ├── generic.py      # Works with any site
│   ├── livekit.py      # LiveKit-specific
│   ├── pipecat.py      # Pipecat-specific
│   └── retellai.py     # RetellAI-specific
├── discovery/
│   ├── sitemap.py      # Sitemap.xml parsing
│   ├── llms_txt.py     # llms.txt parsing
│   └── recursive.py    # Link crawling
├── engine/
│   └── crawler.py      # Async crawl orchestration
└── storage/
    └── filesystem.py   # Local file storage

Adding Custom Adapters

Create optimized adapters for specific documentation sites:

from docscrape.adapters.generic import GenericAdapter
from docscrape.adapters.factory import PlatformAdapterFactory

class MyDocsAdapter(GenericAdapter):
    BASE_URL = "https://docs.mysite.com"

    def __init__(self):
        super().__init__(
            base_url=self.BASE_URL,
            content_selectors=["article", "main"],
        )

    @property
    def name(self) -> str:
        return "mysite"

    def should_skip(self, url: str) -> bool:
        return "/changelog/" in url

# Register the adapter
PlatformAdapterFactory.register_platform(
    "mysite",
    MyDocsAdapter,
    url_patterns=["docs.mysite.com"],
)

Development

# Clone the repo
git clone https://github.com/Abdulrahman-Elsmmany/docscrape
cd docscrape

# Setup with uv (recommended)
uv venv
uv pip install -e ".[dev]"

# Or with pip
pip install -e ".[dev]"

# Run tests
pytest

# Run linter
ruff check src/

# Type checking
mypy src/

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Made with ❤️ by Abdulrahman Elsmmany
