A Python library for converting HTML to semantic Markdown
Domscribe: Semantic Markdown Converter
Warning: This is an alpha version of Domscribe. Some tests are still failing, and the API may change in future releases. Use with caution in production environments.
This Python library is a semi-automated port of dom-to-semantic-markdown. It converts HTML to semantic Markdown, preserving the structure and meaning of the original content.
Installation
pip install domscribe
Usage
from domscribe import html_to_markdown
html = "<h1>Hello, World!</h1><p>This is a <strong>test</strong>.</p>"
markdown = html_to_markdown(html)
print(markdown)
For more advanced usage, you can pass options to customize the conversion:
options = {
    'extract_main_content': True,
    'keep_html': ['div', 'span'],
    'refify_urls': True
}
markdown = html_to_markdown(html, options)
Why Domscribe?
Domscribe aims to solve several problems associated with traditional HTML-to-Markdown converters:
- Semantic preservation: Most converters lose important semantic information during conversion. Domscribe maintains the semantic structure of the original HTML.
- Handling complex structures: Traditional converters often struggle with nested lists, tables, and other complex HTML structures. Domscribe is designed to handle these.
- Customizability: Domscribe offers various options to customize the conversion process according to your needs.
- Main content extraction: It can automatically identify and extract the main content of a web page, ignoring navigation, footers, and other peripheral content.
- LLM-friendly output: The generated Markdown is optimized for further processing by large language models (LLMs), including special annotations for table columns.
Customizable Conversion Options
Domscribe offers several options to customize the conversion process:
- extract_main_content: Automatically identify and extract the main content of a web page.
- keep_html: Preserve specified HTML tags in the Markdown output.
- refify_urls: Convert URLs to reference-style links for improved readability.
- include_meta_data: Include metadata from the HTML head in the Markdown output.
- debug: Enable debug logging for troubleshooting.
For example, to extract the main content and preserve the div and span tags, you can use the following options:
options = {
    'extract_main_content': True,
    'keep_html': ['div', 'span']
}
markdown = html_to_markdown(html, options)
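Because options are passed as a plain dictionary, a misspelled key is easy to miss. The sketch below shows one way to catch that before calling the converter. It assumes only the five option names listed above; `build_options` and `KNOWN_OPTIONS` are hypothetical helpers for illustration and are not part of Domscribe.

```python
# Option names as documented in this README; the helper itself is hypothetical.
KNOWN_OPTIONS = {
    "extract_main_content",
    "keep_html",
    "refify_urls",
    "include_meta_data",
    "debug",
}


def build_options(**overrides):
    """Return an options dict, rejecting names this README does not list."""
    unknown = sorted(set(overrides) - KNOWN_OPTIONS)
    if unknown:
        raise ValueError(f"Unknown option(s): {', '.join(unknown)}")
    return dict(overrides)


options = build_options(extract_main_content=True, keep_html=["div", "span"])
print(options)
```

The resulting dictionary can then be passed as the second argument to html_to_markdown, as in the examples above.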
URL Refactoring
The refify_urls option allows you to convert inline URLs to reference-style links, improving the readability of the generated Markdown. This feature is particularly useful for documents with many links or long URLs.
For example:
Check out [this link][1] and [another link][2].
Here's [a repeated link][1].
[1]: https://www.example.com
[2]: https://www.anotherexample.com
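To make the transformation concrete, here is a minimal sketch of how inline links could be rewritten as numbered references, with repeated URLs sharing one number. It is an illustration using only the standard library, not Domscribe's implementation; `refify_links` and `INLINE_LINK` are hypothetical names, and the regular expression deliberately handles only simple links (no titles, no nested brackets).

```python
import re

# Matches inline links such as [text](https://example.com).
# Images (![alt](src)) are skipped by the negative lookbehind.
INLINE_LINK = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")


def refify_links(markdown):
    """Rewrite inline links as reference-style links, reusing numbers for repeated URLs."""
    refs = {}  # URL -> reference number, in order of first appearance

    def replace(match):
        text, url = match.group(1), match.group(2)
        number = refs.setdefault(url, len(refs) + 1)
        return f"[{text}][{number}]"

    body = INLINE_LINK.sub(replace, markdown)
    if not refs:
        return body
    definitions = "\n".join(f"[{number}]: {url}" for url, number in refs.items())
    return f"{body}\n\n{definitions}"


source = (
    "Check out [this link](https://www.example.com) and "
    "[another link](https://www.anotherexample.com).\n"
    "Here's [a repeated link](https://www.example.com)."
)
print(refify_links(source))
```

Run on this input, the sketch produces the same link text and reference definitions as the example above, with a blank line separating the body from the definitions.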
Semantic Content Extraction
Domscribe can automatically detect and extract the main content of a web page. This feature helps in focusing on the most relevant part of the HTML document, ignoring navigation, footers, and other peripheral content. To use this feature, set the extract_main_content option to True:
options = {'extract_main_content': True}
markdown = html_to_markdown(html, options)
The library uses various heuristics to identify the main content, including:
- Checking for <main> tags
- Analyzing element attributes like 'id' and 'class'
- Evaluating the density of text and other content
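The first two heuristics can be sketched with the standard library's html.parser. This is an illustration only, not Domscribe's algorithm: the hint words, the use of raw text length as a stand-in for content density, and the names `CandidateCollector` and `pick_main_candidate` are all assumptions made for this example.

```python
from html.parser import HTMLParser

# Assumed hint words and void-tag list; Domscribe's actual heuristics may differ.
HINTS = ("main", "content", "article")
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class CandidateCollector(HTMLParser):
    """Collect <main> elements and elements whose id/class contains a hint word."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # each entry: [tag, id/class string, text length]
        self.candidates = []     # (is_main, text length, tag, id/class string)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get an end tag
        attr_map = dict(attrs)
        parts = (attr_map.get("id"), attr_map.get("class"))
        names = " ".join(filter(None, parts)).lower()
        self.open_elements.append([tag, names, 0])

    def handle_data(self, data):
        # Text counts toward every element that is currently open.
        length = len(data.strip())
        for element in self.open_elements:
            element[2] += length

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, plus anything left open inside it.
        for index in range(len(self.open_elements) - 1, -1, -1):
            if self.open_elements[index][0] == tag:
                for closed in self.open_elements[index:]:
                    self._record(closed)
                del self.open_elements[index:]
                return

    def _record(self, element):
        tag, names, text_length = element
        is_main = tag == "main"
        if is_main or any(hint in names for hint in HINTS):
            self.candidates.append((is_main, text_length, tag, names))


def pick_main_candidate(html):
    """Return (tag, id/class string) of the best candidate, or None if there is none."""
    collector = CandidateCollector()
    collector.feed(html)
    collector.close()
    if not collector.candidates:
        return None
    # Prefer <main>; among equals, prefer the element holding the most text.
    best = max(collector.candidates, key=lambda c: (c[0], c[1]))
    return best[2], best[3]


page = (
    '<nav class="menu">Home About</nav>'
    '<div id="content"><p>Some long article text here.</p></div>'
    '<footer>Contact</footer>'
)
print(pick_main_candidate(page))
```

Substring matching on hint words is crude (a class such as "main-nav" would also match), which is one reason a real extractor combines several signals rather than relying on attributes alone.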
Preserving Semantic HTML
Domscribe can preserve certain HTML tags that carry semantic meaning, even in Markdown output. This is useful for maintaining the structure and semantics of the original content. To enable this feature, use the keep_html option:
options = {'keep_html': ['div', 'span']}
markdown = html_to_markdown(html, options)
Table Column Identifiers
When converting tables, Domscribe adds special comments to help identify columns:
| Header 1 <!-- colId: 1 --> | Header 2 <!-- colId: 2 --> |
| --- | --- |
| Row 1, Cell 1 <!-- colId: 1 --> | Row 1, Cell 2 <!-- colId: 2 --> |
These <!-- colId: n --> comments are designed to help large language models (LLMs) understand the structure of the table, making it easier to process and manipulate table data programmatically.
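The same comments can also be read back by ordinary code. The sketch below maps each cell of a row to its column identifier, assuming the comment format shown above; `parse_row` and `COL_ID` are hypothetical names for this example, and cells containing escaped pipe characters are not handled.

```python
import re

# Matches the column annotation shown above, e.g. <!-- colId: 2 -->.
COL_ID = re.compile(r"<!--\s*colId:\s*(\d+)\s*-->")


def parse_row(row):
    """Split one Markdown table row into {column id: cell text}."""
    cells = row.strip().strip("|").split("|")
    parsed = {}
    for cell in cells:
        match = COL_ID.search(cell)
        if match is None:
            continue  # separator rows and unannotated cells are skipped
        parsed[int(match.group(1))] = COL_ID.sub("", cell).strip()
    return parsed


table = [
    "| Header 1 <!-- colId: 1 --> | Header 2 <!-- colId: 2 --> |",
    "| --- | --- |",
    "| Row 1, Cell 1 <!-- colId: 1 --> | Row 1, Cell 2 <!-- colId: 2 --> |",
]
for line in table:
    print(parse_row(line))
```

Keying cells by identifier rather than by position means a consumer does not have to count pipes to work out which column a value belongs to.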
License
This project is licensed under the MIT License.
File details
Details for the file domscribe-0.1.3.tar.gz.
File metadata
- Download URL: domscribe-0.1.3.tar.gz
- Upload date:
- Size: 13.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0dbe4791832e7134704762ce1c7ef9f17f696d45ec25fbd1f56c0a1d57a5e6ac
MD5 | 9e1419de477bd50a4ba3ce76c18d71f7
BLAKE2b-256 | 070a87a6c4d7ae07750f99eb11c3ca9bedb56f851c733650758a9d9b5da55f3f
File details
Details for the file domscribe-0.1.3-py3-none-any.whl.
File metadata
- Download URL: domscribe-0.1.3-py3-none-any.whl
- Upload date:
- Size: 12.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | b0ce811e2315bc165a3129cbe874a655b016a79f322a9b5630b409eff1ed42fe
MD5 | 62709b7dae59579c764168e14f022dba
BLAKE2b-256 | 9211afa40cf489654944244615dbda30b3e92df520f7ef11be85d8c2320993e0