Convert HTML to clean, structured Markdown or plain text
html2cleantext
Convert HTML to clean, structured Markdown or plain text. Perfect for extracting readable content from web pages with robust boilerplate removal and language-aware processing.
Features
- 🧹 Smart Cleaning: Automatically removes navigation, footers, ads, and other boilerplate
- 📝 Flexible Output: Convert to Markdown or plain text
- 🌍 Language-Aware: Special support for Bengali and English with automatic language detection
- 🔗 Link Control: Choose to keep or remove links and images
- 🚀 Multiple Input Sources: Process HTML strings, files, or URLs
- ⚡ CLI & Python API: Use from command line or integrate into your Python projects
- 📦 Minimal Dependencies: Modern, lightweight dependency stack
Installation
pip install html2cleantext
Or install from source:
git clone https://github.com/Shawn-Imran/html2cleantext.git
cd html2cleantext
pip install -e .
Quick Start
Python API
import html2cleantext
# From HTML string
html = "<h1>Hello World</h1><p>This is a test with a <a href='https://example.com'>link</a>.</p>"
markdown = html2cleantext.to_markdown(html) # Output: ... [link](https://example.com) ...
text = html2cleantext.to_text(html, keep_links=True) # Output: ... link [Link:https://example.com] ...
# From file
markdown = html2cleantext.to_markdown("page.html")
# From URL
markdown = html2cleantext.to_markdown("https://example.com")
# With options
clean_text = html2cleantext.to_text(
    html,
    keep_links=True,  # Use [Link:URL] format in plain text
    keep_images=False,
    remove_boilerplate=True,
)
Command Line Interface
# Convert to Markdown (default, links as [text](URL))
html2cleantext input.html
# Convert to plain text (links as [Link:URL])
html2cleantext input.html --mode text --keep-links
# From URL
html2cleantext https://example.com --output clean.md
# Remove links and images
html2cleantext input.html --no-links --no-images
# Keep all content (no boilerplate removal)
html2cleantext input.html --no-remove_boilerplate
API Reference
Core Functions
to_markdown(html_input, **options)
Convert HTML to clean Markdown format.
Parameters:
- html_input (str | Path): HTML string, file path, or URL
- keep_links (bool): Preserve links (default: True)
- keep_images (bool): Preserve images (default: True)
- remove_boilerplate (bool): Remove boilerplate content (default: True)
- normalize_lang (bool): Apply language normalization (default: True)
- language (str, optional): Language code for normalization (auto-detected if None)
Returns: Clean Markdown text (str)
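Because html_input accepts three kinds of value, a caller may want to know in advance how a given string is likely to be interpreted. The sketch below is one plausible way to tell them apart; classify_input is a hypothetical helper written for illustration and is not part of html2cleantext, whose actual detection rules are not documented here.

```python
from pathlib import Path


def classify_input(html_input):
    """Guess whether an input is a 'url', a 'file', or raw 'html'.

    Hypothetical helper for illustration only; the package's own
    detection logic may differ.
    """
    # A Path object can only mean a file on disk.
    if isinstance(html_input, Path):
        return "file"

    text = str(html_input).strip()

    # Strings that start with an http(s) scheme are treated as URLs.
    if text.lower().startswith(("http://", "https://")):
        return "url"

    # Markup almost always contains '<'; file paths almost never do.
    if "<" in text:
        return "html"

    # A short string that names an existing file is treated as a path.
    try:
        if len(text) < 4096 and Path(text).is_file():
            return "file"
    except OSError:
        pass  # e.g. a name too long for the file system

    # Fall back to treating the input as raw HTML or text.
    return "html"
```

Passing a pathlib.Path instead of a bare string removes the ambiguity between a file name and a short HTML fragment.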
to_text(html_input, **options)
Convert HTML to clean plain text format.
Parameters:
- Same as to_markdown(), but with different defaults:
  - keep_links (bool): Default False
  - keep_images (bool): Default False
Returns: Clean plain text (str)
CLI Options
positional arguments:
  input                 HTML input: file path, URL, or raw HTML string

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --mode {markdown,text}, -m {markdown,text}
                        Output format (default: markdown)
  --output OUTPUT, -o OUTPUT
                        Output file path (default: stdout)
  --keep-links          Preserve links in the output
  --no-links            Remove links from the output
  --keep-images         Preserve images in the output
  --no-images           Remove images from the output
  --remove_boilerplate  Remove navigation, footers, and boilerplate content
  --no-remove_boilerplate
                        Keep all content including navigation and footers
  --language LANGUAGE, -l LANGUAGE
                        Language code for normalization
  --no-normalize        Skip language-specific normalization
  --verbose, -v         Enable verbose logging
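The paired flags above (--keep-links / --no-links and so on) can be modelled with argparse by pointing both flags at the same destination. The following is a rough reconstruction from the help text only, not the package's actual parser: build_parser is a hypothetical name, and the choice of None as the "not specified" default is an assumption made so that the mode-specific defaults of to_markdown() and to_text() could still apply.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="html2cleantext")
    parser.add_argument("input", help="HTML input: file path, URL, or raw HTML string")
    parser.add_argument("--mode", "-m", choices=["markdown", "text"],
                        default="markdown", help="Output format (default: markdown)")
    parser.add_argument("--output", "-o", default=None,
                        help="Output file path (default: stdout)")

    # Each on/off pair writes to one destination. None means "not given".
    # Assumption: if both flags of a pair are passed, the last one wins.
    pairs = [
        ("--keep-links", "--no-links", "keep_links"),
        ("--keep-images", "--no-images", "keep_images"),
        ("--remove_boilerplate", "--no-remove_boilerplate", "remove_boilerplate"),
    ]
    for on_flag, off_flag, dest in pairs:
        parser.add_argument(on_flag, dest=dest, action="store_true", default=None)
        parser.add_argument(off_flag, dest=dest, action="store_false", default=None)

    parser.add_argument("--language", "-l", default=None,
                        help="Language code for normalization")
    parser.add_argument("--no-normalize", dest="normalize_lang",
                        action="store_false", help="Skip language-specific normalization")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Enable verbose logging")
    return parser


args = build_parser().parse_args(["input.html", "--mode", "text", "--keep-links"])
print(args.mode, args.keep_links, args.keep_images)
```

Leaving unspecified options as None lets the calling code forward only the options the user actually set.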
Link Output Format
- Markdown output: Links are converted to standard Markdown format, [text](URL), for compatibility with Markdown renderers.
- Plain text output (Python API or CLI with --mode text): Links are converted to [Link:URL] format (e.g., My Link [Link:https://example.com]) for easy parsing and clear distinction from other text.
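Since the [Link:URL] marker is meant to be easy to parse, a short regular expression is enough to pull the URLs back out of plain-text output. This is a minimal sketch: extract_links and strip_link_markers are hypothetical helper names, and the pattern assumes URLs contain no whitespace or closing bracket.

```python
import re

# Matches markers such as [Link:https://example.com].
LINK_MARKER = re.compile(r"\[Link:(?P<url>[^\]\s]+)\]")


def extract_links(text):
    """Return every URL found in [Link:URL] markers, in order of appearance."""
    return [match.group("url") for match in LINK_MARKER.finditer(text)]


def strip_link_markers(text):
    """Remove the markers (and the single space before each) from the text."""
    return re.sub(r" ?\[Link:[^\]\s]+\]", "", text)


sample = "This is the main content with a link [Link:https://example.com]."
print(extract_links(sample))       # ['https://example.com']
print(strip_link_markers(sample))  # This is the main content with a link.
```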
Examples
Basic Usage
import html2cleantext
# Simple HTML to Markdown
html = """
<html>
<head><title>Test Page</title></head>
<body>
<nav>Navigation menu</nav>
<main>
<h1>Main Title</h1>
<p>This is the main content with a <a href=\"https://example.com\">link</a>.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
</main>
<footer>Footer content</footer>
</body>
</html>
"""
result_md = html2cleantext.to_markdown(html)
print(result_md)
# Output:
# Main Title
#
# This is the main content with a [link](https://example.com).
#
# * Item 1
# * Item 2
result_txt = html2cleantext.to_text(html, keep_links=True)
print(result_txt)
# Output:
# Main Title
#
# This is the main content with a link [Link:https://example.com].
#
# Item 1
# Item 2
Command Line Examples
# Basic conversion (Markdown, links as [text](URL))
html2cleantext index.html > clean.md
# Plain text with links as [Link:URL]
html2cleantext index.html --mode text --keep-links > clean.txt
Language Support
html2cleantext provides enhanced support for:
- English: Smart quote normalization, punctuation cleanup
- Bengali: Unicode normalization, punctuation handling
- Auto-detection: Automatically detects language when not specified
Additional languages can be easily added by extending the normalization functions.
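To make the idea of extending the normalization functions concrete, here is a rough sketch of a per-language registry built on the standard library. All names (NORMALIZERS, normalize_english, normalize_bengali, normalize_text) are hypothetical and chosen for illustration; the package's real functions and rules may look different.

```python
import re
import unicodedata

# Typographic quotes mapped to their ASCII equivalents.
SMART_QUOTES = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})


def normalize_english(text):
    """Replace smart quotes and tidy spacing around punctuation."""
    text = text.translate(SMART_QUOTES)
    text = re.sub(r"[ \t]+", " ", text)          # collapse runs of spaces and tabs
    text = re.sub(r" +([,.;:!?])", r"\1", text)  # no space before punctuation
    return text.strip()


def normalize_bengali(text):
    """Compose decomposed characters (Unicode NFC) and collapse spaces."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"[ \t]+", " ", text).strip()


# Supporting another language means registering one more function here.
NORMALIZERS = {"en": normalize_english, "bn": normalize_bengali}


def normalize_text(text, language):
    """Apply the normalizer for `language`, or return the text unchanged."""
    normalizer = NORMALIZERS.get(language)
    return normalizer(text) if normalizer else text


print(normalize_text("\u201cHello\u201d , world !", "en"))  # "Hello", world!
```

Keeping the language-specific rules in separate functions means an unsupported language code simply passes text through untouched.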
Architecture
The package follows a clean pipeline architecture:
- Input Processing: Handles HTML strings, files, or URLs
- HTML Parsing: Uses BeautifulSoup with lxml parser
- Cleaning: Removes scripts, styles, and unwanted attributes
- Boilerplate Removal: Strips navigation, footers, ads using readability-lxml or manual rules
- Language Detection: Auto-detects content language
- Conversion: Converts to Markdown using markdownify or extracts plain text
- Normalization: Applies language-specific text cleanup
- Output: Returns clean text or writes to file
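The parsing, cleaning, boilerplate-removal, and extraction stages can be illustrated with a much smaller stand-in built on the standard library's html.parser. This is not the package's code: the real pipeline relies on BeautifulSoup, readability-lxml, and markdownify as listed above, and the tag sets and the TextExtractor and html_to_text names below are assumptions made for the sketch.

```python
from html.parser import HTMLParser

# Elements whose content is dropped entirely (scripts, styles, boilerplate).
SKIP_TAGS = {"head", "script", "style", "nav", "footer", "aside"}
# Elements that start and end a line in the plain-text output.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6", "br"}


class TextExtractor(HTMLParser):
    """Collect visible text while skipping boilerplate containers."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Parse, clean, and extract text: one non-empty line per block element."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.parts)
    # Collapse internal whitespace and drop empty lines.
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


page = ("<html><head><title>T</title></head><body><nav>Menu</nav>"
        "<main><h1>Main Title</h1><p>Body <b>text</b>.</p></main>"
        "<footer>Foot</footer></body></html>")
print(html_to_text(page))
```

A fixed tag list like this is far cruder than content-scoring approaches such as readability-lxml, which is presumably why the package combines both.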
Dependencies
- beautifulsoup4 - HTML parsing
- lxml - Fast XML/HTML parser
- markdownify - HTML to Markdown conversion
- readability-lxml - Content extraction and boilerplate removal
- langdetect - Language detection
- requests - HTTP requests for URL fetching
Contributing
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.
Development Setup
git clone https://github.com/Shawn-Imran/html2cleantext.git
cd html2cleantext
pip install -e ".[dev]" # Install with development dependencies (quotes keep shells such as zsh from expanding the brackets)
# OR
pip install -e . # Install package only
pip install -r requirements-dev.txt # Install dev dependencies separately
Running Tests
python -m pytest tests/
License
This project is licensed under the MIT License - see the LICENSE file for details.
Changelog
v0.1.0
- Initial release
- Core HTML to Markdown/text conversion
- Boilerplate removal using readability-lxml
- Language-aware normalization for Bengali and English
- Command-line interface
- Support for HTML strings, files, and URLs