html2cleantext

Convert HTML to clean, structured Markdown or plain text. The package is designed for extracting readable content from web pages, with boilerplate removal and language-aware normalization.

Features

  • 🧹 Smart Cleaning: Automatically removes navigation, footers, ads, and other boilerplate
  • 📝 Flexible Output: Convert to Markdown or plain text
  • 🌍 Language-Aware: Special support for Bengali and English with automatic language detection
  • 🔗 Link Control: Choose to keep or remove links and images
  • 🚀 Multiple Input Sources: Process HTML strings, files, or URLs
  • CLI & Python API: Use from command line or integrate into your Python projects
  • 📦 Minimal Dependencies: Modern, lightweight dependency stack

Installation

pip install html2cleantext

Or install from source:

git clone https://github.com/Shawn-Imran/html2cleantext.git
cd html2cleantext
pip install -e .

Quick Start

Python API

import html2cleantext

# From HTML string
html = "<h1>Hello World</h1><p>This is a test with a <a href='https://example.com'>link</a>.</p>"
markdown = html2cleantext.to_markdown(html)  # Output: ... [link](https://example.com) ...
text = html2cleantext.to_text(html, keep_links=True)  # Output: ... link [Link:https://example.com] ...

# From file
markdown = html2cleantext.to_markdown("page.html")

# From URL
markdown = html2cleantext.to_markdown("https://example.com")

# With options
clean_text = html2cleantext.to_text(
    html,
    keep_links=True,  # Use [Link:URL] format in plain text
    keep_images=False,
    remove_boilerplate=True
)

Command Line Interface

# Convert to Markdown (default, links as [text](URL))
html2cleantext input.html

# Convert to plain text (links as [Link:URL])
html2cleantext input.html --mode text --keep-links

# From URL
html2cleantext https://example.com --output clean.md

# Remove links and images
html2cleantext input.html --no-links --no-images

# Keep all content (no boilerplate removal)
html2cleantext input.html --no-remove_boilerplate

API Reference

Core Functions

to_markdown(html_input, **options)

Convert HTML to clean Markdown format.

Parameters:

  • html_input (str|Path): HTML string, file path, or URL
  • keep_links (bool): Preserve links (default: True)
  • keep_images (bool): Preserve images (default: True)
  • remove_boilerplate (bool): Remove boilerplate content (default: True)
  • normalize_lang (bool): Apply language normalization (default: True)
  • language (str, optional): Language code for normalization (auto-detected if None)

Returns: Clean Markdown text (str)
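
The html_input parameter accepts three kinds of values. This README does not describe how the package tells them apart, so the following is only a sketch of one plausible heuristic. The name classify_input and the rules inside it are assumptions for illustration, not part of html2cleantext.

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_input(html_input):
    """Guess whether html_input is a URL, a file path, or raw HTML.

    Hypothetical helper for illustration; not part of html2cleantext.
    """
    text = str(html_input).strip()

    # A URL needs an http(s) scheme and a host.
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"

    # Markup starts with '<' once surrounding whitespace is stripped.
    if text.startswith("<"):
        return "html"

    # Short single-line strings that name an existing file are treated as paths.
    if "\n" not in text and len(text) < 4096:
        try:
            if Path(text).is_file():
                return "file"
        except (OSError, ValueError):
            pass

    # Anything else is treated as HTML (or plain text) content.
    return "html"
```

A heuristic like this is ambiguous for strings that are neither markup nor an existing file, so reading a file yourself and passing its contents is the more explicit option when it matters.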

to_text(html_input, **options)

Convert HTML to clean plain text format.

Parameters:

Same as to_markdown(), except for two defaults:

  • keep_links (bool): Preserve links (default: False)
  • keep_images (bool): Preserve images (default: False)

Returns: Clean plain text (str)

CLI Options

positional arguments:
  input                 HTML input: file path, URL, or raw HTML string

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --mode {markdown,text}, -m {markdown,text}
                        Output format (default: markdown)
  --output OUTPUT, -o OUTPUT
                        Output file path (default: stdout)
  --keep-links          Preserve links in the output
  --no-links            Remove links from the output
  --keep-images         Preserve images in the output
  --no-images           Remove images from the output
  --remove_boilerplate  Remove navigation, footers, and boilerplate content
  --no-remove_boilerplate
                        Keep all content including navigation and footers
  --language LANGUAGE, -l LANGUAGE
                        Language code for normalization
  --no-normalize        Skip language-specific normalization
  --verbose, -v         Enable verbose logging
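
Paired flags such as --keep-links and --no-links can be declared in argparse so that "not specified" is distinguishable from either choice. The sketch below shows one way to do that for a subset of the options listed above. It is not the package's actual CLI code, and it assumes the CLI falls back to the API defaults (links kept for Markdown, dropped for plain text). The names build_parser and resolve_keep_links are hypothetical.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring a subset of the documented options."""
    parser = argparse.ArgumentParser(prog="html2cleantext")
    parser.add_argument("input", help="file path, URL, or raw HTML string")
    parser.add_argument("--mode", "-m", choices=["markdown", "text"],
                        default="markdown")
    # Both flags write to the same destination; None means "not specified".
    links = parser.add_mutually_exclusive_group()
    links.add_argument("--keep-links", dest="keep_links",
                       action="store_true", default=None)
    links.add_argument("--no-links", dest="keep_links",
                       action="store_false", default=None)
    return parser


def resolve_keep_links(args):
    """Use the explicit flag if given, else the assumed per-mode default."""
    if args.keep_links is not None:
        return args.keep_links
    return args.mode == "markdown"  # True for Markdown, False for plain text


args = build_parser().parse_args(["input.html", "--mode", "text", "--keep-links"])
print(resolve_keep_links(args))  # True
```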

Link Output Format

  • Markdown output: Links are converted to standard Markdown format [text](URL) for compatibility with Markdown renderers.
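  • Plain text output (to_text() with keep_links=True, or the CLI with --mode text --keep-links): Links are written in [Link:URL] format after the link text (e.g., My Link [Link:https://example.com]) so they are easy to parse and clearly distinct from the surrounding text.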
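
Because the [Link:URL] marker has a fixed shape, it can be picked out of plain-text output with a regular expression. The helpers below are a minimal sketch for downstream processing and are not provided by html2cleantext. They assume URLs contain no whitespace and no closing bracket.

```python
import re

# Matches the [Link:URL] marker; the URL ends at the first ']' or whitespace.
LINK_MARKER = re.compile(r"\[Link:([^\]\s]+)\]")


def extract_links(text):
    """Return all URLs found in [Link:URL] markers, in order of appearance."""
    return LINK_MARKER.findall(text)


def strip_link_markers(text):
    """Remove the markers and the whitespace before them, keeping the link text."""
    return re.sub(r"\s*\[Link:[^\]\s]+\]", "", text)


sample = "This is the main content with a link [Link:https://example.com]."
print(extract_links(sample))       # ['https://example.com']
print(strip_link_markers(sample))  # This is the main content with a link.
```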

Examples

Basic Usage

import html2cleantext

# Simple HTML to Markdown
html = """
<html>
<head><title>Test Page</title></head>
<body>
    <nav>Navigation menu</nav>
    <main>
        <h1>Main Title</h1>
        <p>This is the main content with a <a href="https://example.com">link</a>.</p>
        <ul>
            <li>Item 1</li>
            <li>Item 2</li>
        </ul>
    </main>
    <footer>Footer content</footer>
</body>
</html>
"""

result_md = html2cleantext.to_markdown(html)
print(result_md)
# Output:
# Main Title
#
# This is the main content with a [link](https://example.com).
#
# * Item 1
# * Item 2

result_txt = html2cleantext.to_text(html, keep_links=True)
print(result_txt)
# Output:
# Main Title
#
# This is the main content with a link [Link:https://example.com].
#
# Item 1
# Item 2
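
The same calls extend naturally to a folder of saved pages. The helper below is a sketch, not part of the package: convert_directory and its parameters are hypothetical names, and it reads each file itself and passes the HTML string to the converter. Any callable that maps an HTML string to text can be supplied, and html2cleantext.to_markdown is used when none is given.

```python
from pathlib import Path


def convert_directory(src_dir, out_dir, convert=None):
    """Convert every *.html file in src_dir and write *.md files to out_dir.

    Hypothetical helper; `convert` is any callable taking an HTML string
    and returning text. It defaults to html2cleantext.to_markdown.
    """
    if convert is None:
        import html2cleantext  # imported lazily so a stub can be used instead
        convert = html2cleantext.to_markdown

    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    written = []
    for html_file in sorted(Path(src_dir).glob("*.html")):
        # Read the file ourselves and hand the HTML string to the converter.
        html = html_file.read_text(encoding="utf-8")
        target = out_path / (html_file.stem + ".md")
        target.write_text(convert(html), encoding="utf-8")
        written.append(target)
    return written
```

For example, convert_directory("pages", "clean") would write one .md file per .html file found in pages.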

Command Line Examples

# Basic conversion (Markdown, links as [text](URL))
html2cleantext index.html > clean.md

# Plain text with links as [Link:URL]
html2cleantext index.html --mode text --keep-links > clean.txt

Language Support

html2cleantext provides enhanced support for:

  • English: Smart quote normalization, punctuation cleanup
  • Bengali: Unicode normalization, punctuation handling
  • Auto-detection: Automatically detects language when not specified

Additional languages can be added by extending the normalization functions.
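
To make the idea of extending the normalization functions concrete, here is a minimal sketch of a per-language registry. It does not reproduce the package's internals: the function names, the NORMALIZERS dictionary, and the specific rules (ASCII quotes for English, Unicode NFC for Bengali) are assumptions chosen to match the feature descriptions above.

```python
import unicodedata


def normalize_english(text):
    """Replace typographic quotes with their ASCII equivalents."""
    table = str.maketrans({"\u2018": "'", "\u2019": "'",
                           "\u201c": '"', "\u201d": '"'})
    return text.translate(table)


def normalize_bengali(text):
    """Apply Unicode NFC so canonically equivalent sequences compare equal."""
    return unicodedata.normalize("NFC", text)


# Hypothetical registry keyed by language code; supporting another
# language means adding one more entry here.
NORMALIZERS = {"en": normalize_english, "bn": normalize_bengali}


def normalize(text, language):
    """Run the normalizer registered for `language`, or return text unchanged."""
    return NORMALIZERS.get(language, lambda t: t)(text)


print(normalize("\u201cHello\u201d, it\u2019s fine", "en"))  # "Hello", it's fine
```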

Architecture

The package follows a clean pipeline architecture:

  1. Input Processing: Handles HTML strings, files, or URLs
  2. HTML Parsing: Uses BeautifulSoup with lxml parser
  3. Cleaning: Removes scripts, styles, and unwanted attributes
  4. Boilerplate Removal: Strips navigation, footers, ads using readability-lxml or manual rules
  5. Language Detection: Auto-detects content language
  6. Conversion: Converts to Markdown using markdownify or extracts plain text
  7. Normalization: Applies language-specific text cleanup
  8. Output: Returns clean text or writes to file
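
The cleaning, boilerplate-removal, and plain-text extraction steps can be illustrated with the standard library alone. The package itself uses BeautifulSoup, lxml, and readability-lxml, so the sketch below is a simplified stand-in for those steps rather than its implementation. The tag lists and the names TextExtractor and html_to_text are assumptions.

```python
from html.parser import HTMLParser

# Elements whose content is dropped entirely (an assumed list).
SKIP_TAGS = {"head", "script", "style", "nav", "footer", "header", "aside"}
# Elements that start or end a line in the output.
BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "div", "br"}


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.chunks.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Strip skipped elements, then return one line per block of text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse whitespace within each line and drop empty lines.
    lines = (" ".join(line.split())
             for line in "".join(parser.chunks).splitlines())
    return "\n".join(line for line in lines if line)


page = ("<nav>Menu</nav><main><h1>Title</h1><p>Body <b>text</b>.</p></main>"
        "<footer>Foot</footer><script>x()</script>")
print(html_to_text(page))
# Title
# Body text.
```

Rule-based tag filtering like this only catches boilerplate that is marked up semantically, which is presumably why the pipeline also relies on readability-lxml for content extraction.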

Dependencies

  • beautifulsoup4 - HTML parsing
  • lxml - Fast XML/HTML parser
  • markdownify - HTML to Markdown conversion
  • readability-lxml - Content extraction and boilerplate removal
  • langdetect - Language detection
  • requests - HTTP requests for URL fetching

Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

Development Setup

git clone https://github.com/Shawn-Imran/html2cleantext.git
cd html2cleantext
pip install -e ".[dev]"  # Install with development dependencies (quoted so shells like zsh do not expand the brackets)
# OR
pip install -e .  # Install package only
pip install -r requirements-dev.txt  # Install dev dependencies separately

Running Tests

python -m pytest tests/

License

This project is licensed under the MIT License - see the LICENSE file for details.

Changelog

v0.1.0

  • Initial release
  • Core HTML to Markdown/text conversion
  • Boilerplate removal using readability-lxml
  • Language-aware normalization for Bengali and English
  • Command-line interface
  • Support for HTML strings, files, and URLs
