A tool to scrape a website and convert it to markdown.
Markdown Documentation Scraper
This tool is a web scraper built using Scrapy that extracts and consolidates content from a documentation website into a single Markdown file. It is flexible, allowing you to specify a starting URL, restrict crawling to specific domains, and define the output file name.
Features
- Crawls a documentation website starting from a given URL.
- Automatically extracts the main content using Readability while skipping non-relevant elements (headers, sidebars, etc.).
- Consolidates all pages into a single Markdown file with structured headings.
- Allows optional restriction to specific domains.
- Skips non-English pages to keep the output consistent.
Requirements
Python Libraries
- scrapy
- readability-lxml
- lxml[html_clean]
- langdetect
- markdownify
Install the dependencies:
pip install -r requirements.txt
Usage
Command Syntax
scrapy runspider site_to_markdown.py \
-a start_url=<STARTING_URL> \
[-a allowed_domains=<DOMAIN1,DOMAIN2>] \
[-a output_file=<OUTPUT_FILENAME>] \
[-a cookies_file=<PATH_TO_COOKIE_JSON_FILE>]
Arguments
- start_url (required): The starting URL for the crawler. The scraper will begin its crawl from this URL.
  - Example: https://example-docs-site.com
- allowed_domains (optional): A comma-separated list of domains to restrict the crawl. If not provided, the scraper will infer the domain from the start_url.
  - Example: example-docs-site.com,docs.example.com
- output_file (optional): The name of the output Markdown file. Default is documentation.md.
  - Example: output.md
- cookies_file (optional): The path to a JSON file of cookies to use for requests. Default is None.
  - Example: ./cookies.json
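The fallback described for allowed_domains (use the explicit list if given, otherwise infer the domain from start_url) could look roughly like the sketch below. The function name `resolve_allowed_domains` is hypothetical and is not taken from site_to_markdown.py; it only illustrates the documented behavior using the standard library.

```python
from urllib.parse import urlparse


def resolve_allowed_domains(start_url, allowed_domains=None):
    """Return the list of domains the crawl should be restricted to.

    Hypothetical helper: mirrors the documented behavior, not the tool's code.
    """
    if allowed_domains:
        # Split the comma-separated value, trimming whitespace and empty items.
        return [d.strip() for d in allowed_domains.split(",") if d.strip()]
    # No explicit list: fall back to the host part of the starting URL.
    host = urlparse(start_url).hostname
    if host is None:
        raise ValueError(f"start_url has no host: {start_url!r}")
    return [host]


print(resolve_allowed_domains("https://example-docs-site.com"))
print(resolve_allowed_domains(
    "https://example-docs-site.com",
    "example-docs-site.com,docs.example.com",
))
```

Raising an error for a URL without a host is a design choice of this sketch; the source does not say how the tool handles a malformed start_url.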
Example Usage
Basic Crawling
To crawl a single domain:
scrapy runspider site_to_markdown.py \
-a start_url=https://example-docs-site.com
Multiple Domains
To allow crawling across multiple domains:
scrapy runspider site_to_markdown.py \
-a start_url=https://example-docs-site.com \
-a allowed_domains=example-docs-site.com,docs.example.com
Custom Output File
To specify a custom output file:
scrapy runspider site_to_markdown.py \
-a start_url=https://example-docs-site.com \
-a output_file=my_documentation.md
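If you want to launch these commands from a script, one option is to assemble the argument list in Python. This is a sketch only: `build_command` is a hypothetical helper, and the only things taken from this README are the script name and the four `-a` arguments.

```python
import shlex


def build_command(start_url, allowed_domains=None, output_file=None, cookies_file=None):
    """Build the `scrapy runspider` argument list described in Command Syntax."""
    args = ["scrapy", "runspider", "site_to_markdown.py", "-a", f"start_url={start_url}"]
    if allowed_domains:
        # The tool expects a single comma-separated value.
        args += ["-a", "allowed_domains=" + ",".join(allowed_domains)]
    if output_file:
        args += ["-a", f"output_file={output_file}"]
    if cookies_file:
        args += ["-a", f"cookies_file={cookies_file}"]
    return args


cmd = build_command(
    "https://example-docs-site.com",
    allowed_domains=["example-docs-site.com", "docs.example.com"],
    output_file="my_documentation.md",
)
# Show the command as it would be typed in a shell.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues when a URL contains characters such as `&` or `?`.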
Output Format
The scraper generates a single Markdown file with the following structure:
# Documentation
## Page 1 Title
Content for Page 1...
## Page 2 Title
Content for Page 2...
...
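As an illustration of the structure above, the following sketch assembles a list of already-converted pages into one Markdown document. The name `build_markdown` and the `(title, body)` tuple shape are assumptions made for this example; the tool's internal representation is not documented here.

```python
def build_markdown(pages, top_title="Documentation"):
    """Join (title, body) pairs into one Markdown document.

    Hypothetical helper: one top-level heading, then one
    second-level heading per page, in crawl order.
    """
    parts = [f"# {top_title}"]
    for title, body in pages:
        parts.append(f"## {title.strip()}")
        parts.append(body.strip())
    # Blank lines between blocks keep headings and paragraphs separate.
    return "\n\n".join(parts) + "\n"


document = build_markdown([
    ("Page 1 Title", "Content for Page 1..."),
    ("Page 2 Title", "Content for Page 2..."),
])
print(document)
```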
Notes
- Non-Text Content: The scraper skips non-HTML pages (e.g., images, PDFs).
- Non-English Pages: Only English pages are processed.
- URL Validation: Ensures only valid URLs are crawled (ignores javascript:, mailto:, etc.).
- File Overwriting: If the output file already exists, it will be overwritten.
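The URL validation note can be pictured with a small standard-library check like the one below. This is a sketch under assumptions: `is_crawlable` is a hypothetical name, and treating subdomains of an allowed domain as allowed is a choice made for this example, not something this README states.

```python
from urllib.parse import urljoin, urlparse


def is_crawlable(href, base_url, allowed_domains):
    """Return True if a link found on base_url should be followed."""
    # Resolve relative links against the page they were found on.
    url = urljoin(base_url, href)
    parsed = urlparse(url)
    # Reject javascript:, mailto:, tel:, and any other non-web scheme.
    if parsed.scheme not in ("http", "https"):
        return False
    host = parsed.hostname or ""
    # Accept an allowed domain itself or any of its subdomains (assumption).
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


base = "https://example-docs-site.com/guide/"
domains = ["example-docs-site.com"]
for link in ["intro.html", "javascript:void(0)", "mailto:team@example.com", "https://other.example.org/"]:
    print(link, is_crawlable(link, base, domains))
```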
Contributing
Contributions are welcome! Feel free to submit pull requests or report issues on GitHub.
File details
Details for the file site_to_markdown-0.1.0.tar.gz.
File metadata
- Download URL: site_to_markdown-0.1.0.tar.gz
- Upload date:
- Size: 24.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0fef1601743ce2be75001ce8b73bd6175f5d4bdfa9c1575be808624a9e1caa1d |
| MD5 | ad464cc78c105529a93019eb6bc797df |
| BLAKE2b-256 | 5c5ef2a3abcfba0e189c0002f196983805b3f4a288d3d7ef82abb72524e431fa |
File details
Details for the file site_to_markdown-0.1.0-py3-none-any.whl.
File metadata
- Download URL: site_to_markdown-0.1.0-py3-none-any.whl
- Upload date:
- Size: 5.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b90d844209c90a3dae9687ccf1547cce2d1b6922fcbf3fe1d284d7a6079484fb |
| MD5 | f16d197763dc145355750cac1c2efa64 |
| BLAKE2b-256 | 26bb38ea0477db8d7989a5b48d9fa6deb0ea0bb9fc93ccb2d8360cf5af1c4cf3 |
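To compare a downloaded file against the digests listed above, a short standard-library sketch such as the following may help. `file_digests` is a hypothetical helper, and it assumes that "BLAKE2b-256" means BLAKE2b with a 32-byte digest.

```python
import hashlib


def file_digests(path, chunk_size=65536):
    """Compute SHA256, MD5, and BLAKE2b-256 hex digests of a file."""
    hashers = {
        "SHA256": hashlib.sha256(),
        "MD5": hashlib.md5(),
        # Assumption: BLAKE2b-256 is BLAKE2b truncated to a 32-byte digest.
        "BLAKE2b-256": hashlib.blake2b(digest_size=32),
    }
    with open(path, "rb") as fh:
        # Read in chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

For example, `file_digests("site_to_markdown-0.1.0.tar.gz")["SHA256"]` can be compared with the SHA256 value listed for that file; a mismatch suggests a corrupted or altered download.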