A tool to scrape a website and convert it to markdown.
site_to_markdown
This tool is a web scraper built using Scrapy that extracts and consolidates content from a documentation website into a single Markdown file. It is flexible, allowing you to specify a starting URL, restrict crawling to specific domains, and define the output file name.
Features
- Crawls a documentation website starting from a given URL.
- Automatically extracts the main content using Readability while skipping non-relevant elements (headers, sidebars, etc.).
- Consolidates all pages into a single Markdown file with structured headings.
- Allows optional restriction to specific domains.
- Skips non-English pages to keep the output consistent.
Installation
This project is now available as a pip package. Install it using:
pip install site-to-markdown
Usage
The scraper can now be run directly as a command-line tool.
Command Syntax
site-to-markdown \
-u <STARTING_URL> \
[-d <DOMAIN1,DOMAIN2>] \
[-o <OUTPUT_FILENAME>] \
[-c <PATH_TO_COOKIE_JSON_FILE>] \
[-e <EXCLUDED_FILETYPE_1,EXCLUDED_FILETYPE_2>]
Arguments
- -u (required): The starting URL for the crawler. The scraper will begin its crawl from this URL.
  - Example: https://example-docs-site.com
- -d (optional): A comma-separated list of domains to restrict the crawl. If not provided, the scraper will infer the domain from the start_url.
  - Example: example-docs-site.com,docs.example.com
- -o (optional): The name of the output Markdown file. Default is documentation.md.
  - Example: output.md
- -c (optional): The path to a JSON file of cookies to use for requests. Default is None.
  - Example: ./cookies.json
- -e (optional): A comma-separated list of file types to exclude. This filtering is done based on the URL path, not the Content-Type.
  - Example: rst.txt,md
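The fallback described for -d can be pictured with a short sketch. This is not the package's own code: `resolve_allowed_domains` is a hypothetical helper, and the assumption is that "inferring the domain" means taking the host name of the starting URL.

```python
from urllib.parse import urlparse


def resolve_allowed_domains(start_url, domains_arg=None):
    """Return the domains a crawl would be restricted to.

    Hypothetical helper that mirrors the documented behaviour of -d;
    the real tool may resolve domains differently.
    """
    if domains_arg:
        # Split the comma-separated value and drop empty entries.
        return [d.strip() for d in domains_arg.split(",") if d.strip()]
    # No -d given: fall back to the host name of the starting URL.
    host = urlparse(start_url).hostname
    return [host] if host else []


print(resolve_allowed_domains("https://example-docs-site.com"))
print(resolve_allowed_domains(
    "https://example-docs-site.com",
    "example-docs-site.com,docs.example.com",
))
```

With only a starting URL the list holds a single host; with -d it holds exactly the domains that were passed in.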
Example Usage
Basic Crawling
To crawl a single domain:
site-to-markdown -u https://example-docs-site.com
Multiple Domains
To allow crawling across multiple domains:
site-to-markdown -u https://example-docs-site.com -d example-docs-site.com,docs.example.com
Custom Output File
To specify a custom output file:
site-to-markdown -u https://example-docs-site.com -o my_documentation.md
Exclude Filetypes
To exclude particular file types based on the URL path:
site-to-markdown -u https://example-docs-site.com -e rst.txt,md
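Because the exclusion works on the URL path rather than the Content-Type, it can be approximated as a suffix check on the parsed path. The sketch below is an illustration only: `is_excluded` is a hypothetical name, and suffix matching is an assumption about how the tool compares file types.

```python
from urllib.parse import urlparse


def is_excluded(url, excluded_types):
    """Return True if the URL path ends with one of the excluded file types.

    Hypothetical helper; assumes a case-insensitive suffix match on the path.
    """
    # Only the path is inspected, so query strings and fragments are ignored.
    path = urlparse(url).path.lower()
    return any(
        path.endswith("." + ext.lower().lstrip("."))
        for ext in excluded_types
    )


excluded = "rst.txt,md".split(",")
print(is_excluded("https://example-docs-site.com/guide/index.rst.txt", excluded))
print(is_excluded("https://example-docs-site.com/guide/index.html", excluded))
```

A path-based check never has to download the resource to decide, which is presumably why the filter is defined on the URL.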
Output Format
The scraper generates a single Markdown file with the following structure:
# Documentation
## Page 1 Title
Content for Page 1...
## Page 2 Title
Content for Page 2...
...
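As a rough illustration of this layout, the following sketch assembles a list of (title, content) pairs into one document. `build_markdown` and its arguments are hypothetical; the actual scraper produces its output through Scrapy and may differ in spacing and details.

```python
def build_markdown(pages, title="Documentation"):
    """Join (page_title, content) pairs into a single Markdown document.

    Hypothetical helper that reproduces the documented structure:
    one top-level heading, then one second-level heading per page.
    """
    parts = [f"# {title}", ""]
    for page_title, content in pages:
        parts.append(f"## {page_title}")
        parts.append(content.strip())
        parts.append("")  # blank line between pages
    return "\n".join(parts)


pages = [
    ("Page 1 Title", "Content for Page 1..."),
    ("Page 2 Title", "Content for Page 2..."),
]
print(build_markdown(pages))
```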
Notes
- Non-Text Content: The scraper skips non-HTML pages (e.g., images, PDFs).
- Non-English Pages: Only English pages are processed.
- URL Validation: Ensures only valid URLs are crawled (ignores javascript:, mailto:, etc.).
- File Overwriting: If the output file already exists, it will be overwritten.
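The URL validation note can be sketched as a scheme check on each link once it has been resolved against the page it came from. `is_crawlable` is a hypothetical helper, and treating only http and https as valid is an assumption, not a statement about the tool's internals.

```python
from urllib.parse import urljoin, urlparse


def is_crawlable(href, base_url):
    """Return True if a link resolves to an http(s) URL with a host.

    Hypothetical helper; links such as javascript: and mailto: keep their
    own scheme after urljoin and are therefore rejected.
    """
    absolute = urljoin(base_url, href)
    parsed = urlparse(absolute)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


base = "https://example-docs-site.com/docs/"
for link in ("guide/intro.html", "javascript:void(0)", "mailto:team@example.com"):
    print(link, is_crawlable(link, base))
```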
Contributing
Contributions are welcome! Feel free to submit pull requests or report issues on GitHub.
File details
Details for the file site_to_markdown-0.2.0.tar.gz.
File metadata
- Download URL: site_to_markdown-0.2.0.tar.gz
- Upload date:
- Size: 25.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9a64e5733873ebf8e09fb64bae49025214a27303260fb3acd002c2eeaecb9f8a |
| MD5 | c5f07a11f684e3bf1a5d164781a6650a |
| BLAKE2b-256 | 96819c192c22bf0e300310c4801635070a4968d8dc44594f66872cf50437cb6d |
File details
Details for the file site_to_markdown-0.2.0-py3-none-any.whl.
File metadata
- Download URL: site_to_markdown-0.2.0-py3-none-any.whl
- Upload date:
- Size: 6.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ceb49b9d500601da8ccf3f73436372b62f3b48d0b0daaec97bef10b0159ce864 |
| MD5 | a30372392e9a9dd5938ffecf394672c2 |
| BLAKE2b-256 | b6380c0eaeb450732ca9827ab6f420c0a1c4094cbd15c00b39ed1237c6a562e6 |
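A downloaded file can be checked against the SHA256 digests listed for it with Python's standard hashlib module. The sketch below is generic rather than specific to this package: `sha256_hex` is a hypothetical helper, and the file path in the usage comment assumes the archive was saved in the current directory.

```python
import hashlib


def sha256_hex(stream, chunk_size=65536):
    """Return the SHA256 hex digest of a binary file-like object."""
    digest = hashlib.sha256()
    # Read in chunks so large files do not have to fit in memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Usage (assumes the sdist was downloaded to the working directory):
# with open("site_to_markdown-0.2.0.tar.gz", "rb") as f:
#     print(sha256_hex(f) == "<SHA256 digest published for that file>")
```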