Convert HTML to clean, structured Markdown or plain text

html2cleantext

html2cleantext converts HTML to clean, structured Markdown or plain text. It is aimed at extracting readable content from web pages, with boilerplate removal and language-aware processing.

Features

🧹 Smart Cleaning: Automatically removes navigation, footers, ads, and other boilerplate
📝 Flexible Output: Convert to Markdown or plain text
🌍 Language-Aware: Special support for Bengali and English with automatic language detection
🔗 Link Control: Choose to keep or remove links and images
🚀 Multiple Input Sources: Process HTML strings, files, or URLs
⚡ CLI & Python API: Use from command line or integrate into your Python projects
📦 Minimal Dependencies: Modern, lightweight dependency stack

Installation

pip install html2cleantext

Or install from source:

git clone https://github.com/Shawn-Imran/html2cleantext.git
cd html2cleantext
pip install -e .

Quick Start

Python API

import html2cleantext

# From HTML string
html = "<h1>Hello World</h1><p>This is a test with a <a href='https://example.com'>link</a>.</p>"
markdown = html2cleantext.to_markdown(html)  # Output: ... [link](https://example.com) ...
text = html2cleantext.to_text(html, keep_links=True)  # Output: ... link [Link:https://example.com] ...

# From file
markdown = html2cleantext.to_markdown("page.html")

# From URL
markdown = html2cleantext.to_markdown("https://example.com")

# With options
clean_text = html2cleantext.to_text(
    html,
    keep_links=True,  # Use [Link:URL] format in plain text
    keep_images=False,
    remove_boilerplate=True
)
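
The same first argument can be an HTML string, a file path, or a URL, so the input has to be classified before it is processed. The sketch below shows one way such a dispatch could work using only the standard library. `classify_input` is a hypothetical helper written for illustration; it is an assumption about the approach, not the package's actual code.

```python
from pathlib import Path


def classify_input(source: str) -> str:
    """Guess whether `source` is raw HTML, a URL, or a file path (hypothetical helper)."""
    text = source.strip()
    # Raw markup starts with "<"; check this before touching the filesystem.
    if text.startswith("<"):
        return "html"
    if text.lower().startswith(("http://", "https://")):
        return "url"
    try:
        if Path(text).is_file():
            return "file"
    except (OSError, ValueError):
        pass  # Overlong or otherwise invalid names cannot be paths.
    # Anything else is treated as an HTML fragment.
    return "html"


print(classify_input("<h1>Hello World</h1>"))  # html
print(classify_input("https://example.com"))   # url
```

Checking for markup first avoids a filesystem lookup on long HTML strings, which can raise an error on some platforms.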

Command Line Interface

# Convert to Markdown (default, links as [text](URL))
html2cleantext input.html

# Convert to plain text (links as [Link:URL])
html2cleantext input.html --mode text --keep-links

# From URL
html2cleantext https://example.com --output clean.md

# Remove links and images
html2cleantext input.html --no-links --no-images

# Keep all content (no boilerplate removal)
html2cleantext input.html --no-remove_boilerplate

API Reference

Core Functions

`to_markdown(html_input, **options)`

Convert HTML to clean Markdown format.

Parameters:

html_input (str|Path): HTML string, file path, or URL
keep_links (bool): Preserve links (default: True)
keep_images (bool): Preserve images (default: True)
remove_boilerplate (bool): Remove boilerplate content (default: True)
normalize_lang (bool): Apply language normalization (default: True)
language (str, optional): Language code for normalization (auto-detected if None)

Returns: Clean Markdown text (str)

`to_text(html_input, **options)`

Convert HTML to clean plain text format.

Parameters:

Same as to_markdown() but with different defaults:
keep_links (bool): Default False
keep_images (bool): Default False

Returns: Clean plain text (str)

CLI Options

positional arguments:
  input                 HTML input: file path, URL, or raw HTML string

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --mode {markdown,text}, -m {markdown,text}
                        Output format (default: markdown)
  --output OUTPUT, -o OUTPUT
                        Output file path (default: stdout)
  --keep-links          Preserve links in the output
  --no-links            Remove links from the output
  --keep-images         Preserve images in the output
  --no-images           Remove images from the output
  --remove_boilerplate  Remove navigation, footers, and boilerplate content
  --no-remove_boilerplate
                        Keep all content including navigation and footers
  --language LANGUAGE, -l LANGUAGE
                        Language code for normalization
  --no-normalize        Skip language-specific normalization
  --verbose, -v         Enable verbose logging
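
Several options above come in pairs (`--keep-links` / `--no-links`, `--keep-images` / `--no-images`). The sketch below shows how such a pair can be modeled with `argparse` so that an unset flag falls back to a per-mode default. It is an illustration only: the function names are hypothetical, and the fallback rule (links on for Markdown, off for text) is an assumption based on the documented `to_markdown()` and `to_text()` defaults, not the package's actual source.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser covering a subset of the options listed above."""
    parser = argparse.ArgumentParser(prog="html2cleantext")
    parser.add_argument("input", help="HTML input: file path, URL, or raw HTML string")
    parser.add_argument("--mode", "-m", choices=["markdown", "text"], default="markdown")

    # Both flags write to one destination; None means "not specified".
    links = parser.add_mutually_exclusive_group()
    links.add_argument("--keep-links", dest="keep_links", action="store_true", default=None)
    links.add_argument("--no-links", dest="keep_links", action="store_false", default=None)
    return parser


def resolve_keep_links(args: argparse.Namespace) -> bool:
    """Use the explicit flag if given, else the assumed per-mode default."""
    if args.keep_links is not None:
        return args.keep_links
    return args.mode == "markdown"


args = build_parser().parse_args(["input.html", "--mode", "text", "--keep-links"])
print(resolve_keep_links(args))  # True
```

Using a mutually exclusive group makes `argparse` reject contradictory input such as `--keep-links --no-links` with a usage error.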

Link Output Format

Markdown output: Links are converted to standard Markdown format [text](URL) for compatibility with Markdown renderers.
Plain text output (Python API, or CLI with --mode text): When links are kept, they are converted to [Link:URL] format (e.g., My Link [Link:https://example.com]) for easy parsing and clear distinction from other text.
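
Because the plain-text link marker has a fixed shape, it can be picked out with a regular expression. The sketch below uses only the standard library; `extract_links` and `strip_link_markers` are hypothetical helper names, and the pattern assumes URLs contain no whitespace or `]` characters.

```python
import re

# Matches the [Link:URL] marker; the URL runs up to the closing bracket.
LINK_MARKER = re.compile(r"\[Link:([^\]\s]+)\]")


def extract_links(text: str) -> list[str]:
    """Return every URL that appears in a [Link:URL] marker."""
    return LINK_MARKER.findall(text)


def strip_link_markers(text: str) -> str:
    """Remove each marker together with the single space before it, if any."""
    return re.sub(r" ?\[Link:[^\]\s]+\]", "", text)


sample = "This is the main content with a link [Link:https://example.com]."
print(extract_links(sample))       # ['https://example.com']
print(strip_link_markers(sample))  # This is the main content with a link.
```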

Examples

Basic Usage

import html2cleantext

# Simple HTML to Markdown
html = """
<html>
<head><title>Test Page</title></head>
<body>
    <nav>Navigation menu</nav>
    <main>
        <h1>Main Title</h1>
        <p>This is the main content with a <a href="https://example.com">link</a>.</p>
        <ul>
            <li>Item 1</li>
            <li>Item 2</li>
        </ul>
    </main>
    <footer>Footer content</footer>
</body>
</html>
"""

result_md = html2cleantext.to_markdown(html)
print(result_md)
# Output:
# Main Title
#
# This is the main content with a [link](https://example.com).
#
# * Item 1
# * Item 2

result_txt = html2cleantext.to_text(html, keep_links=True)
print(result_txt)
# Output:
# Main Title
#
# This is the main content with a link [Link:https://example.com].
#
# Item 1
# Item 2

Command Line Examples

# Basic conversion (Markdown, links as [text](URL))
html2cleantext index.html > clean.md

# Plain text with links as [Link:URL]
html2cleantext index.html --mode text --keep-links > clean.txt

Language Support

html2cleantext provides enhanced support for:

English: Smart quote normalization, punctuation cleanup
Bengali: Unicode normalization, punctuation handling
Auto-detection: Automatically detects language when not specified

Additional languages can be easily added by extending the normalization functions.
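
To make the idea of per-language normalization concrete, here is a minimal sketch built on the standard library. Every name in it (`normalize_english`, `normalize_bengali`, `NORMALIZERS`, `normalize`) is hypothetical, and the language codes `en` and `bn` are assumptions; the package's own normalization functions may be organized differently.

```python
import unicodedata

# Curly quotes mapped to their ASCII equivalents (English cleanup).
SMART_QUOTES = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})


def normalize_english(text: str) -> str:
    return text.translate(SMART_QUOTES)


def normalize_bengali(text: str) -> str:
    # NFC composes split vowel signs, e.g. U+09C7 + U+09BE becomes U+09CB.
    return unicodedata.normalize("NFC", text)


# Registry keyed by language code; supporting a language means adding an entry.
NORMALIZERS = {"en": normalize_english, "bn": normalize_bengali}


def normalize(text: str, language: str) -> str:
    # Unknown language codes pass through unchanged.
    normalizer = NORMALIZERS.get(language)
    return normalizer(text) if normalizer else text


print(normalize("\u201cHello\u201d", "en"))  # "Hello"
```

A registry like this keeps language-specific rules isolated, so a new language does not touch the existing ones.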

Architecture

The package follows a clean pipeline architecture:

1. Input Processing: Handles HTML strings, files, or URLs
2. HTML Parsing: Uses BeautifulSoup with lxml parser
3. Cleaning: Removes scripts, styles, and unwanted attributes
4. Boilerplate Removal: Strips navigation, footers, ads using readability-lxml or manual rules
5. Language Detection: Auto-detects content language
6. Conversion: Converts to Markdown using markdownify or extracts plain text
7. Normalization: Applies language-specific text cleanup
8. Output: Returns clean text or writes to file
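
The cleaning, boilerplate-removal, and text-extraction stages can be illustrated with a small rule-based sketch. It deliberately uses the standard library's `html.parser` in place of BeautifulSoup and readability-lxml, so it is a simplified stand-in rather than the package's implementation. The tag sets and the names `MainTextExtractor` and `extract_main_text` are assumptions made for this example.

```python
from html.parser import HTMLParser

# Tags whose whole subtree is treated as boilerplate in this sketch.
SKIP_TAGS = {"head", "script", "style", "nav", "footer", "aside"}
# Tags that end a line of output text.
BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "div"}


class MainTextExtractor(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped subtree
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS and self._skip_depth == 0:
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Collapse whitespace within each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = "<nav>Menu</nav><h1>Main Title</h1><p>Body text.</p><footer>Footer</footer>"
print(extract_main_text(page))
```

Fixed tag rules like these miss boilerplate that lives in generic `div` elements, which is the gap a content-extraction library is meant to cover.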

Dependencies

beautifulsoup4 - HTML parsing
lxml - Fast XML/HTML parser
markdownify - HTML to Markdown conversion
readability-lxml - Content extraction and boilerplate removal
langdetect - Language detection
requests - HTTP requests for URL fetching

Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

Development Setup

git clone https://github.com/Shawn-Imran/html2cleantext.git
cd html2cleantext
pip install -e ".[dev]"  # Install with development dependencies (quoted so shells such as zsh do not expand the brackets)
# OR
pip install -e .  # Install package only
pip install -r requirements-dev.txt  # Install dev dependencies separately

Running Tests

python -m pytest tests/

License

This project is licensed under the MIT License - see the LICENSE file for details.

Changelog

v0.1.0

Initial release
Core HTML to Markdown/text conversion
Boilerplate removal using readability-lxml
Language-aware normalization for Bengali and English
Command-line interface
Support for HTML strings, files, and URLs

Release history

0.1.6 (this version): Sep 8, 2025
0.1.5: Sep 7, 2025
0.1.4: Sep 4, 2025
0.1.3: Sep 2, 2025
0.1.2: Sep 2, 2025
0.1.1: Sep 2, 2025

Download files

Source Distribution

html2cleantext-0.1.6.tar.gz (25.5 kB), uploaded Sep 8, 2025, Source

Built Distribution

html2cleantext-0.1.6-py3-none-any.whl (17.7 kB), uploaded Sep 8, 2025, Python 3

File details

Details for the file html2cleantext-0.1.6.tar.gz.

File metadata

Download URL: html2cleantext-0.1.6.tar.gz
Upload date: Sep 8, 2025
Size: 25.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for html2cleantext-0.1.6.tar.gz
Algorithm	Hash digest
SHA256	`42b88f0a001450c5526d2c8ca201ed0be61edef9ac5f35802368cfb13179be7b`
MD5	`6e92075f34544c130758d62ce94c9d5e`
BLAKE2b-256	`18dce9c015127cb188178a8d6d4c32778a1810ef0508a43e25a51a47622d9ed2`

File details

Details for the file html2cleantext-0.1.6-py3-none-any.whl.

File metadata

Download URL: html2cleantext-0.1.6-py3-none-any.whl
Upload date: Sep 8, 2025
Size: 17.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for html2cleantext-0.1.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a6a9bbedc0f2979010eb40a033fdcdd9a257ff072f2e02465cb910ba4bb5f965`
MD5	`c370ea3fdab576f09e45364339c33c5a`
BLAKE2b-256	`b37e3651a465f88fd54e24c15306788f1eb132f8a0d0bf6b2896785db3db5e9e`
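
A downloaded file can be checked against the SHA256 digests listed above with Python's `hashlib`. This is a minimal sketch; `sha256_hex` and `file_matches` are hypothetical helper names, and the file path in the usage comment assumes the archive sits in the current directory.

```python
import hashlib
from pathlib import Path


def sha256_hex(data: bytes) -> str:
    """Return the SHA256 digest of `data` as lowercase hex."""
    return hashlib.sha256(data).hexdigest()


def file_matches(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    # These distributions are small, so reading the whole file at once is fine.
    return sha256_hex(Path(path).read_bytes()) == expected_hex.strip().lower()


# Usage, assuming the source distribution was downloaded to the current directory:
# file_matches(
#     "html2cleantext-0.1.6.tar.gz",
#     "42b88f0a001450c5526d2c8ca201ed0be61edef9ac5f35802368cfb13179be7b",
# )
```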
