A CLI tool to scrape and structure GitBook documentation
GitBook Scraper
A command-line tool to scrape and structure GitBook documentation into a single, well-organized markdown file.
Features
- 📚 Scrapes any GitBook documentation site
- 🌳 Maintains original document hierarchy and structure
- 📝 Generates a single, well-formatted markdown file
- ⚡ Fast and polite scraping with rate limiting
- 🛠️ Configurable output format and structure
- 🔄 Automatic retry on failed requests
- 📋 Table of contents generation
- 🎯 Selective TOC item extraction
Installation
pip install gitbook-scraper
Quick Start
# Basic usage
gitbook-scraper https://your-gitbook-url.io
# Specify output file
gitbook-scraper https://your-gitbook-url.io -o documentation.md
# With table of contents
gitbook-scraper https://your-gitbook-url.io --toc
# Custom rate limiting
gitbook-scraper https://your-gitbook-url.io --delay 1.0
# Extract specific TOC items
gitbook-scraper https://your-gitbook-url.io -t "Getting Started" -t "Advanced Topics"
Advanced Usage
Command Line Options
Options:
  -o, --output TEXT      Output file path [default: documentation.md]
  --toc                  Generate table of contents [default: False]
  --delay FLOAT          Delay between requests in seconds [default: 0.5]
  --retries INTEGER      Number of retries for failed requests [default: 3]
  --timeout INTEGER      Request timeout in seconds [default: 10]
  --debug                Enable debug logging [default: False]
  --no-cleanup           Keep intermediate files [default: False]
  -t, --toc-items TEXT   Specific TOC items to extract (can be specified multiple times)
  --help                 Show this message and exit
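When the tool is driven from a script rather than typed by hand, the options listed here can be assembled into an argument list. The following is a minimal sketch: `build_command` is a hypothetical helper, not part of gitbook-scraper, and it uses only flags documented in this section.

```python
import shlex
from typing import List, Optional, Sequence


def build_command(
    url: str,
    output: Optional[str] = None,
    toc: bool = False,
    delay: Optional[float] = None,
    toc_items: Sequence[str] = (),
) -> List[str]:
    """Assemble a gitbook-scraper argument list (hypothetical helper)."""
    cmd = ["gitbook-scraper", url]
    if output is not None:
        cmd += ["-o", output]
    if toc:
        cmd.append("--toc")
    if delay is not None:
        cmd += ["--delay", str(delay)]
    for item in toc_items:
        # -t may be given several times, once per TOC item
        cmd += ["-t", item]
    return cmd


cmd = build_command(
    "https://your-gitbook-url.io",
    output="documentation.md",
    toc=True,
    delay=1.0,
    toc_items=["Getting Started"],
)
# Show the command as it would be typed in a shell.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it, provided gitbook-scraper is installed and on the PATH.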
Python API
from gitbook_scraper import GitbookScraper

# Basic usage
scraper = GitbookScraper(
    base_url="https://your-gitbook-url.io",
    output_file="documentation.md",
    generate_toc=True,
    delay=0.5,
)
scraper.scrape()

# Extract specific TOC items
scraper = GitbookScraper(
    base_url="https://your-gitbook-url.io",
    output_file="documentation.md",
    generate_toc=True,
    toc_items=["Getting Started", "Advanced Topics"],
)
scraper.scrape()
Configuration
The tool can be configured using environment variables:
# Set default output directory
export GITBOOK_SCRAPER_OUTPUT_DIR="./docs"
# Set custom user agent
export GITBOOK_SCRAPER_USER_AGENT="Custom User Agent"
# Set default delay
export GITBOOK_SCRAPER_DELAY=1.0
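A wrapper script may want to read the same variables itself. The sketch below shows one way to do that; `load_settings` is a hypothetical helper, and the fallback values for the output directory and user agent are assumptions, since only the delay default (0.5) is documented here. How the tool reads these variables internally is not shown in this README.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the GITBOOK_SCRAPER_* variables, with assumed fallbacks."""
    env = os.environ if env is None else env
    return {
        # Fallback "." is an assumption; no default is documented.
        "output_dir": env.get("GITBOOK_SCRAPER_OUTPUT_DIR", "."),
        # None stands for "no custom user agent set" (assumption).
        "user_agent": env.get("GITBOOK_SCRAPER_USER_AGENT"),
        # 0.5 matches the documented --delay default.
        "delay": float(env.get("GITBOOK_SCRAPER_DELAY", "0.5")),
    }


settings = load_settings({"GITBOOK_SCRAPER_DELAY": "1.0"})
print(settings)
```

Accepting the environment as a parameter keeps the helper easy to test without modifying the real process environment.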
Error Handling
The scraper implements automatic retries with exponential backoff for failed requests. Common issues and solutions:
- Rate limiting: Increase the delay between requests
- Timeout errors: Increase the timeout value
- Navigation extraction fails: Try different selectors with --selector-file
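For readers unfamiliar with the retry strategy named above, here is a generic sketch of retries with exponential backoff. It is not the scraper's actual implementation: `retry_with_backoff` and its parameters are hypothetical, with defaults chosen to mirror the documented `--retries` (3) and `--delay` (0.5) values.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying up to `retries` times with doubling delays."""
    if retries < 0:
        raise ValueError("retries must be >= 0")
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
```

The `sleep` parameter is injectable so the waiting behaviour can be checked in tests without real pauses. A production version would usually catch only network-related exceptions rather than every `Exception`.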
Contributing
- Fork the repository
- Create your feature branch (git checkout -b feature/feature)
- Commit your changes (git commit -m 'Add feature')
- Push to the branch (git push origin feature/feature)
- Open a Pull Request
License
Distributed under the MIT License. See LICENSE for more information.
Download files
Source Distribution
gitbook_scraper-0.1.1.tar.gz (67.8 kB)
Built Distribution
gitbook_scraper-0.1.1-py3-none-any.whl (10.2 kB)
File details
Details for the file gitbook_scraper-0.1.1.tar.gz.
File metadata
- Download URL: gitbook_scraper-0.1.1.tar.gz
- Upload date:
- Size: 67.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 324e6b8c7475dcb6613d31ba9ada55911dbb3bdad6eae68f712e8b1907fd5649
MD5 | f8b54bef6ad37e68d0d25e0d5cc67025
BLAKE2b-256 | 798d3a06dedb3042478b4d9227313ffd94d6eca7238995e6950795f94416f7ff
File details
Details for the file gitbook_scraper-0.1.1-py3-none-any.whl.
File metadata
- Download URL: gitbook_scraper-0.1.1-py3-none-any.whl
- Upload date:
- Size: 10.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | d94c5f4d5c80fe4b6f000b520770e8eab19730db0268669e9c83a9fc72c3a8d3
MD5 | 5864697add8b9f0d2fb9dd333f10bc78
BLAKE2b-256 | e5a0964a054098c0b9096a70c3a21b0ca7e67ca4d82358041baa3ec28bfac212