Convert web pages to markdown
Webdown
A Python CLI tool for converting web pages to clean, readable Markdown format. Webdown makes it easy to extract content from websites for documentation, notes, content migration, or offline reading.
I made this tool specifically so I could download documentation, convert it to Markdown, and feed it into an LLM coding tool.
Why Webdown?
- Clean Conversion: Produces readable Markdown without formatting artifacts
- Selective Extraction: Target specific page sections with CSS selectors
- Customization Options: Control links, images, text wrapping, and more
- Progress Tracking: Visual download progress for large pages with the -p flag
- Python Integration: Use as a CLI tool or integrate into your Python projects
Use Cases
Documentation for AI Coding Assistants
Webdown is particularly useful for preparing documentation to use with AI-assisted coding tools like Claude Code, GitHub Copilot, or ChatGPT:
- Convert technical documentation into clean Markdown for AI context
- Extract only the relevant parts of large documentation pages using CSS selectors
- Strip out images and formatting that might consume token context
- Generate well-structured tables of contents for better navigation
- Batch process API documentation for library-specific assistance
# Example: Convert API docs and store for AI coding context
webdown https://api.example.com/docs -s "main" -I -c -w 80 -o api_context.md
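For batch processing, one option is to drive the CLI from a short Python script. The sketch below only assembles command lines from flags documented in this README. The helper names `build_webdown_command` and `convert_all` are hypothetical and not part of webdown, and nothing runs until `convert_all` is called with webdown available on your PATH.

```python
import subprocess
from typing import Optional


def build_webdown_command(
    url: str,
    output: str,
    selector: Optional[str] = None,
    strip_images: bool = True,
    compact: bool = True,
    width: int = 80,
) -> list[str]:
    """Assemble a webdown command line (hypothetical helper, not part of webdown)."""
    cmd = ["webdown", url]
    if selector:
        cmd += ["-s", selector]  # extract only the matching section
    if strip_images:
        cmd.append("-I")  # exclude images
    if compact:
        cmd.append("-c")  # remove excessive blank lines
    cmd += ["-w", str(width), "-o", output]
    return cmd


def convert_all(pages: dict[str, str], selector: str = "main") -> None:
    """Run webdown once per (URL, output file) pair; raises if a run fails."""
    for url, output in pages.items():
        subprocess.run(build_webdown_command(url, output, selector), check=True)
```

Calling `convert_all({"https://api.example.com/docs": "api_context.md"})` would then run the same command as the example above, once per entry in the mapping.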
Installation
From PyPI
pip install webdown
Install from Source
# Clone the repository
git clone https://github.com/kelp/webdown.git
cd webdown
# Install with pip
pip install .
# Or install with Poetry
poetry install
Usage
Basic usage:
webdown https://example.com/page.html -o output.md
Output to stdout:
webdown https://example.com/page.html
Options
- -o, --output: Output file (default: stdout)
- -t, --toc: Generate table of contents
- -L, --no-links: Strip hyperlinks
- -I, --no-images: Exclude images
- -s, --css SELECTOR: CSS selector to extract specific content
- -c, --compact: Remove excessive blank lines from the output
- -w, --width N: Set the line width for wrapped text (0 for no wrapping)
- -p, --progress: Show download progress bar
Advanced Options:
- --single-line-break: Use single line breaks instead of two line breaks
- --unicode: Use Unicode characters instead of ASCII equivalents
- --tables-as-html: Keep tables as HTML instead of converting to Markdown
- --emphasis-mark CHAR: Character(s) to use for emphasis (default: '_')
- --strong-mark CHARS: Character(s) to use for strong emphasis (default: '**')
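To make the effect of -c and -w more concrete, here is a rough standard-library illustration of collapsing blank lines and wrapping paragraphs. It is an assumption about the general technique, not webdown's actual implementation, and the function names are hypothetical. The wrapping is deliberately naive: it treats every blank-line-separated chunk as prose, so it would also reflow lists and code.

```python
import re
import textwrap


def collapse_blank_lines(markdown: str) -> str:
    """Replace runs of three or more newlines with a single blank line."""
    return re.sub(r"\n{3,}", "\n\n", markdown)


def wrap_paragraphs(markdown: str, width: int) -> str:
    """Wrap each blank-line-separated paragraph; a width of 0 disables wrapping."""
    if width <= 0:
        return markdown
    paragraphs = markdown.split("\n\n")
    return "\n\n".join(textwrap.fill(p, width=width) for p in paragraphs)
```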
Examples
Generate markdown with a table of contents:
webdown https://example.com -t -o output.md
Extract only main content:
webdown https://example.com -s "main" -o output.md
Strip links and images:
webdown https://example.com -L -I -o output.md
Compact output with progress bar and line wrapping:
webdown https://example.com -c -p -w 80 -o output.md
For complete documentation, use the --help flag:
webdown --help
Documentation
API documentation is available online at tcole.net/webdown.
You can also generate the documentation locally with:
make docs # Generate HTML docs in the docs/ directory
make docs-serve # Start a local documentation server at http://localhost:8080
Development
Prerequisites
- Python 3.10+ (3.13 recommended)
- Poetry for dependency management
Setup
# Clone the repository
git clone https://github.com/kelp/webdown.git
cd webdown
# Install dependencies with Poetry
poetry install
poetry run pre-commit install
# Optional: Start a Poetry shell for interactive development
poetry shell
Development Commands
We use a Makefile to streamline development tasks:
# Install dependencies
make install
# Run tests
make test
# Run tests with coverage
make test-coverage
# Run integration tests
make integration-test
# Run linting
make lint
# Run type checking
make type-check
# Format code
make format
# Run all pre-commit hooks
make pre-commit
# Run all checks (lint, type-check, test)
make all-checks
# Build package
make build
# Start interactive Poetry shell
make shell
# Generate documentation
make docs
# Start documentation server
make docs-serve
# Publishing to PyPI (maintainers only)
# See CONTRIBUTING.md for details on the release process
make build # Build package
make publish-test # Publish to TestPyPI (for testing)
# Show all available commands
make help
Poetry Commands
You can also use Poetry directly:
# Start an interactive shell in the Poetry environment
poetry shell
# Run a command in the Poetry environment
poetry run pytest
# Add a new dependency
poetry add requests
# Add a development dependency
poetry add --group dev black
# Update dependencies
poetry update
# Build package
poetry build
Python API Usage
Webdown can also be used as a Python library in your own projects:
from webdown.converter import convert_url_to_markdown, WebdownConfig

# Method 1: Basic conversion with individual parameters
markdown = convert_url_to_markdown("https://example.com")

# Method 1: With all options as parameters (original style)
markdown = convert_url_to_markdown(
    url="https://example.com",
    include_links=True,
    include_images=True,
    include_toc=True,
    css_selector="main",  # Only extract main content
    compact_output=True,  # Remove excessive blank lines
    body_width=80,        # Wrap text at 80 characters
    show_progress=True    # Show download progress bar
)

# Method 2: Using the Config object (new in 0.3.1)
config = WebdownConfig(
    # Basic options
    url="https://example.com",
    include_toc=True,
    css_selector="main",
    compact_output=True,
    body_width=80,
    show_progress=True,
    # Advanced options (all optional)
    single_line_break=False,
    unicode_snob=True,  # Use Unicode characters
    tables_as_html=False,
    emphasis_mark="_",
    strong_mark="**"
)
markdown = convert_url_to_markdown(config)

# Save to file
with open("output.md", "w") as f:
    f.write(markdown)
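When converting several pages from Python, a small wrapper can keep the file handling in one place. The sketch below is hypothetical glue code: `filename_for` and `save_pages` are not part of webdown, the naming scheme is an arbitrary choice, and the `convert` argument stands in for any callable that takes a URL and returns Markdown, such as `convert_url_to_markdown` from the example above.

```python
from pathlib import Path
from typing import Callable, Iterable
from urllib.parse import urlparse


def filename_for(url: str) -> str:
    """Derive a flat Markdown filename from a URL (hypothetical naming scheme)."""
    parsed = urlparse(url)
    # Join host and path, then replace path separators with underscores.
    stem = (parsed.netloc + parsed.path).strip("/").replace("/", "_")
    return (stem or "index") + ".md"


def save_pages(
    urls: Iterable[str],
    convert: Callable[[str], str],
    out_dir: str = ".",
) -> list[Path]:
    """Convert each URL with `convert` and write the result as UTF-8."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for url in urls:
        path = target / filename_for(url)
        path.write_text(convert(url), encoding="utf-8")
        written.append(path)
    return written
```

With webdown installed, a call such as `save_pages(urls, convert_url_to_markdown, "docs_md")` would write one file per URL. Passing the converter in as an argument also makes the helper easy to test with a stub that does no network access.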
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Run tests to make sure everything works:
# Run standard tests
poetry run pytest
# Run tests with coverage
poetry run pytest --cov=webdown
# Run integration tests
poetry run pytest --integration
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Please make sure your code passes all tests and type checks and follows our coding style (enforced by pre-commit hooks). We aim to maintain high code coverage (currently at 93%). When adding features, please include tests.
For more details, see CONTRIBUTING.md.
Support
If you encounter any problems or have feature requests, please open an issue on GitHub.
License
MIT License - see the LICENSE file for details.