A CLI tool to scrape and structure GitBook documentation

GitBook Scraper

A command-line tool to scrape and structure GitBook documentation into a single, well-organized markdown file.

Features

  • 📚 Scrapes any GitBook documentation site
  • 🌳 Maintains original document hierarchy and structure
  • 📝 Generates a single, well-formatted markdown file
  • ⚡ Fast and polite scraping with rate limiting
  • 🛠️ Configurable output format and structure
  • 🔄 Automatic retry on failed requests
  • 📋 Table of contents generation
  • 🎯 Selective TOC item extraction

Installation

pip install gitbook-scraper

Quick Start

# Basic usage
gitbook-scraper https://your-gitbook-url.io

# Specify output file
gitbook-scraper https://your-gitbook-url.io -o documentation.md

# With table of contents
gitbook-scraper https://your-gitbook-url.io --toc

# Custom rate limiting
gitbook-scraper https://your-gitbook-url.io --delay 1.0

# Extract specific TOC items
gitbook-scraper https://your-gitbook-url.io -t "Getting Started" -t "Advanced Topics"

Advanced Usage

Command Line Options

Options:
  -o, --output TEXT      Output file path [default: documentation.md]
  --toc                  Generate table of contents [default: False]
  --delay FLOAT          Delay between requests in seconds [default: 0.5]
  --retries INTEGER      Number of retries for failed requests [default: 3]
  --timeout INTEGER      Request timeout in seconds [default: 10]
  --debug                Enable debug logging [default: False]
  --no-cleanup           Keep intermediate files [default: False]
  -t, --toc-items TEXT   Specific TOC items to extract (can be specified multiple times)
  --help                 Show this message and exit
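
When the tool is driven from a script, the documented options can be assembled into an argument list before handing it to a subprocess. The following is a minimal sketch, not part of the package: build_command and run_scraper are hypothetical helper names, and only flags listed above are used.

```python
import subprocess
from typing import List, Optional, Sequence


def build_command(
    url: str,
    output: Optional[str] = None,
    toc: bool = False,
    delay: Optional[float] = None,
    toc_items: Sequence[str] = (),
) -> List[str]:
    """Assemble a gitbook-scraper argument list (hypothetical helper)."""
    cmd = ["gitbook-scraper", url]
    if output is not None:
        cmd += ["-o", output]
    if toc:
        cmd.append("--toc")
    if delay is not None:
        cmd += ["--delay", str(delay)]
    for item in toc_items:
        # -t may be given several times, once per TOC item
        cmd += ["-t", item]
    return cmd


def run_scraper(cmd: List[str]) -> int:
    """Run the command and return its exit code.

    Assumes gitbook-scraper is installed and on PATH.
    """
    return subprocess.run(cmd, check=False).returncode
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with TOC titles that contain spaces.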

Python API

from gitbook_scraper import GitbookScraper

# Basic usage: scrape the whole site into one file
scraper = GitbookScraper(
    base_url="https://your-gitbook-url.io",
    output_file="documentation.md",
    generate_toc=True,
    delay=0.5
)
scraper.scrape()

# Extract specific TOC items only
scraper = GitbookScraper(
    base_url="https://your-gitbook-url.io",
    output_file="documentation.md",
    generate_toc=True,
    toc_items=["Getting Started", "Advanced Topics"]
)
scraper.scrape()

Configuration

The tool can be configured using environment variables:

# Set default output directory
export GITBOOK_SCRAPER_OUTPUT_DIR="./docs"

# Set custom user agent
export GITBOOK_SCRAPER_USER_AGENT="Custom User Agent"

# Set default delay
export GITBOOK_SCRAPER_DELAY=1.0
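
To illustrate how environment-based configuration of this kind is typically consumed, here is a rough sketch that reads the three variables above. It is not the package's actual code: Settings and load_settings are hypothetical names, and the fallback values for the output directory and user agent are assumptions (only the 0.5 second delay matches the documented --delay default).

```python
import os
from typing import Mapping, NamedTuple


class Settings(NamedTuple):
    output_dir: str
    user_agent: str
    delay: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read scraper settings from environment variables (hypothetical helper)."""
    return Settings(
        # "." and "gitbook-scraper" are assumed fallbacks, not documented defaults
        output_dir=env.get("GITBOOK_SCRAPER_OUTPUT_DIR", "."),
        user_agent=env.get("GITBOOK_SCRAPER_USER_AGENT", "gitbook-scraper"),
        # 0.5 mirrors the documented default of --delay
        delay=float(env.get("GITBOOK_SCRAPER_DELAY", "0.5")),
    )
```

Accepting the environment as a parameter keeps the function easy to test with a plain dictionary.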

Error Handling

The scraper implements automatic retries with exponential backoff for failed requests. Common issues and solutions:

  • Rate limiting: Increase the delay between requests
  • Timeout errors: Increase the timeout value
  • Navigation extraction fails: Try different selectors with --selector-file (this option is not listed under Command Line Options; confirm it with gitbook-scraper --help)
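
The retry behaviour described above can be pictured with a generic exponential backoff loop. This is a sketch of the general technique under stated assumptions, not the scraper's own implementation: retry_with_backoff is a hypothetical name, the wait doubles after each failure, and the defaults simply reuse the documented --retries and --delay values.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying up to `retries` times with doubling waits."""
    for attempt in range(retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # wait base_delay, then 2x, then 4x, ... before the next attempt
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable when retries >= 0")
```

The sleep function is injectable so the loop can be exercised without real waiting; in practice operation would wrap a single HTTP request.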

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/feature)
  3. Commit your changes (git commit -m 'Add feature')
  4. Push to the branch (git push origin feature/feature)
  5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Download files

Source Distribution

gitbook_scraper-0.1.1.tar.gz (67.8 kB)

Uploaded Source

Built Distribution

gitbook_scraper-0.1.1-py3-none-any.whl (10.2 kB)

Uploaded Python 3

File details

Details for the file gitbook_scraper-0.1.1.tar.gz.

File metadata

  • Download URL: gitbook_scraper-0.1.1.tar.gz
  • Upload date:
  • Size: 67.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for gitbook_scraper-0.1.1.tar.gz

Algorithm     Hash digest
SHA256        324e6b8c7475dcb6613d31ba9ada55911dbb3bdad6eae68f712e8b1907fd5649
MD5           f8b54bef6ad37e68d0d25e0d5cc67025
BLAKE2b-256   798d3a06dedb3042478b4d9227313ffd94d6eca7238995e6950795f94416f7ff
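
A downloaded archive can be checked against the published SHA256 digest with the standard library. The sketch below is one way to do it; sha256_of_file and matches_expected are hypothetical helper names, and the file path is whatever location the archive was saved to.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with a published hex digest."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())
```

For example, matches_expected("gitbook_scraper-0.1.1.tar.gz", "324e6b8c7475dcb6613d31ba9ada55911dbb3bdad6eae68f712e8b1907fd5649") should return True for an intact copy of the source distribution.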

File details

Details for the file gitbook_scraper-0.1.1-py3-none-any.whl.

File hashes

Hashes for gitbook_scraper-0.1.1-py3-none-any.whl

Algorithm     Hash digest
SHA256        d94c5f4d5c80fe4b6f000b520770e8eab19730db0268669e9c83a9fc72c3a8d3
MD5           5864697add8b9f0d2fb9dd333f10bc78
BLAKE2b-256   e5a0964a054098c0b9096a70c3a21b0ca7e67ca4d82358041baa3ec28bfac212
