
A Python library for converting HTML to semantic Markdown


Domscribe: Semantic Markdown Converter


Warning: This is an alpha version of Domscribe. Some tests are still failing, and the API may change in future releases. Use with caution in production environments.

This Python library is a semi-automated port of dom-to-semantic-markdown. It converts HTML to semantic Markdown, preserving the structure and meaning of the original content.

Installation

pip install domscribe

Usage

from domscribe import html_to_markdown

html = "<h1>Hello, World!</h1><p>This is a <strong>test</strong>.</p>"
markdown = html_to_markdown(html)
print(markdown)

For more advanced usage, you can pass options to customize the conversion:

options = {
    'extract_main_content': True,
    'keep_html': ['div', 'span'],
    'refify_urls': True
}
markdown = html_to_markdown(html, options)

Why Domscribe?

Domscribe aims to solve several problems associated with traditional HTML-to-Markdown converters:

  1. Semantic preservation: Most converters lose important semantic information during conversion. Domscribe maintains the semantic structure of the original HTML.

  2. Handling complex structures: Traditional converters often struggle with nested lists, tables, and other complex HTML structures. Domscribe is designed to convert these without flattening or dropping their structure.

  3. Customizability: Domscribe offers various options to customize the conversion process according to your needs.

  4. Main content extraction: It can automatically identify and extract the main content of a web page, ignoring navigation, footers, and other peripheral content.

  5. LLM-friendly output: The generated Markdown is intended for further processing by large language models (LLMs) and includes special annotations for table columns.

Customizable Conversion Options

Domscribe offers several options to customize the conversion process:

  • extract_main_content: Automatically identify and extract the main content of a web page.
  • keep_html: Preserve specified HTML tags in the Markdown output.
  • refify_urls: Convert URLs to reference-style links for improved readability.
  • include_meta_data: Include metadata from the HTML head in the Markdown output.
  • debug: Enable debug logging for troubleshooting.

For example, to extract the main content and preserve the div and span tags, you can use the following options:

options = {
    'extract_main_content': True,
    'keep_html': ['div', 'span']
}
markdown = html_to_markdown(html, options)
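
Because options are passed as a plain dictionary, a misspelled key such as 'refify_url' is easy to miss. This README does not say whether the library validates option names, so the sketch below uses a hypothetical helper, build_options (not part of Domscribe), that checks keys against the options listed above before the dictionary is passed to html_to_markdown.

```python
# Option names taken from the list above; value types and defaults are not
# documented here, so this helper only checks the key names.
KNOWN_OPTIONS = frozenset({
    "extract_main_content",
    "keep_html",
    "refify_urls",
    "include_meta_data",
    "debug",
})


def build_options(**overrides):
    """Return an options dict, rejecting keys that are not in KNOWN_OPTIONS.

    Hypothetical helper for illustration; it is not part of Domscribe.
    """
    unknown = sorted(set(overrides) - KNOWN_OPTIONS)
    if unknown:
        raise ValueError(f"Unknown option(s): {', '.join(unknown)}")
    return dict(overrides)


options = build_options(extract_main_content=True, keep_html=["div", "span"])
```

The resulting dictionary can then be passed as the second argument, as in the examples above.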

URL Refactoring

The refify_urls option allows you to convert inline URLs to reference-style links, improving the readability of the generated Markdown. This feature is particularly useful for documents with many links or long URLs.

For example:

Check out [this link][1] and [another link][2].

Here's [a repeated link][1].

[1]: https://www.example.com
[2]: https://www.anotherexample.com
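
To make the transformation concrete, here is a minimal standalone sketch that rewrites inline Markdown links into the numbered reference style shown above. It is not Domscribe's implementation: the function name refify_links and the regular expression are assumptions made for illustration, and the pattern only handles simple links whose URLs contain no spaces or parentheses.

```python
import re

# Matches inline links such as [text](https://example.com).
# Assumption: URLs contain no whitespace or closing parentheses.
INLINE_LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def refify_links(markdown: str) -> str:
    """Rewrite inline links as numbered reference-style links."""
    refs: dict[str, int] = {}  # URL -> reference number, in first-seen order

    def replace(match: re.Match) -> str:
        text, url = match.group(1), match.group(2)
        # A repeated URL reuses the number it was given the first time.
        number = refs.setdefault(url, len(refs) + 1)
        return f"[{text}][{number}]"

    body = INLINE_LINK.sub(replace, markdown)
    if not refs:
        return body
    definitions = "\n".join(f"[{n}]: {url}" for url, n in refs.items())
    return f"{body}\n\n{definitions}"
```

Each distinct URL is written out only once, which is where the readability gain comes from in link-heavy documents.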

Semantic Content Extraction

Domscribe can automatically detect and extract the main content of a web page. This keeps the output focused on the most relevant part of the HTML document and drops navigation, footers, and other peripheral content. To use this feature, set the extract_main_content option to True:

options = {'extract_main_content': True}
markdown = html_to_markdown(html, options)

The library uses various heuristics to identify the main content, including:

  • Checking for <main> tags
  • Analyzing element attributes like 'id' and 'class'
  • Evaluating the density of text and other content
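
The heuristics above can be sketched with the standard library alone. The following is not Domscribe's algorithm; it is a rough illustration that checks the same three kinds of signal in priority order. The hint words, the score weights, and the names CandidateScorer and guess_main_element are all assumptions, and "density" is simplified here to the amount of text directly inside an element.

```python
from html.parser import HTMLParser

# Hypothetical hint words and weights; Domscribe's real values are not documented here.
CONTENT_HINTS = ("content", "article", "main", "post")
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class CandidateScorer(HTMLParser):
    """Score every element: <main> bonus, id/class bonus, plus direct text length."""

    def __init__(self) -> None:
        super().__init__()
        self._open = []   # stack of [tag, element id, score] for open elements
        self.scored = []  # (score, tag, element id) for closed elements

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never hold text
        attr_map = dict(attrs)
        element_id = attr_map.get("id") or ""
        label = f"{element_id} {attr_map.get('class') or ''}".lower()
        score = 0
        if tag == "main":
            score += 10_000  # strongest signal: an explicit <main> element
        if any(hint in label for hint in CONTENT_HINTS):
            score += 1_000   # weaker signal: content-like id or class
        self._open.append([tag, element_id, score])

    def handle_data(self, data):
        if self._open:
            # Text counts toward the innermost open element only.
            self._open[-1][2] += len(data.strip())

    def handle_endtag(self, tag):
        if all(entry[0] != tag for entry in self._open):
            return  # stray closing tag with no matching open element
        while self._open:
            open_tag, element_id, score = self._open.pop()
            self.scored.append((score, open_tag, element_id))
            if open_tag == tag:
                break


def guess_main_element(html: str):
    """Return (tag, id) of the highest-scoring element, or None for empty input."""
    scorer = CandidateScorer()
    scorer.feed(html)
    scorer.close()
    # Elements left unclosed at end of input still count as candidates.
    candidates = scorer.scored + [(s, t, i) for t, i, s in scorer._open]
    if not candidates:
        return None
    score, tag, element_id = max(candidates, key=lambda c: c[0])
    return tag, element_id
```

With these weights a <main> element always wins, an element with a content-like id or class comes next, and text length only decides the result when neither of the stronger signals is present.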

Preserving Semantic HTML

Domscribe can preserve certain HTML tags that carry semantic meaning, even in Markdown output. This is useful for maintaining the structure and semantics of the original content. To enable this feature, use the keep_html option:

options = {'keep_html': ['div', 'span']}
markdown = html_to_markdown(html, options)

Table Column Identifiers

When converting tables, Domscribe adds special comments to help identify columns:

| Header 1 <!-- colId: 1 --> | Header 2 <!-- colId: 2 --> |
| --- | --- |
| Row 1, Cell 1 <!-- colId: 1 --> | Row 1, Cell 2 <!-- colId: 2 --> |

These <!-- colId: n --> comments are designed to help large language models (LLMs) understand the structure of the table. Because every cell carries the identifier of its column, table data is also easier to process and manipulate programmatically.
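
As an illustration of the programmatic side, the sketch below reads one table row in the format shown above and maps each column identifier to its cell text. The helper name parse_row is hypothetical and not part of Domscribe, and the code assumes cells contain no escaped pipe characters.

```python
import re

# Matches the column annotation shown above, e.g. <!-- colId: 2 -->.
COL_ID = re.compile(r"<!--\s*colId:\s*(\d+)\s*-->")


def parse_row(row: str) -> dict[int, str]:
    """Map each column id to the cell text of one Markdown table row."""
    cells: dict[int, str] = {}
    for cell in row.strip().strip("|").split("|"):
        match = COL_ID.search(cell)
        if match:
            # Drop the annotation and keep only the visible cell text.
            cells[int(match.group(1))] = COL_ID.sub("", cell).strip()
    return cells


row = "| Row 1, Cell 1 <!-- colId: 1 --> | Row 1, Cell 2 <!-- colId: 2 --> |"
print(parse_row(row))  # {1: 'Row 1, Cell 1', 2: 'Row 1, Cell 2'}
```

Rows without annotations, such as the | --- | --- | separator, produce an empty dictionary and can be skipped.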

License

This project is licensed under the MIT License.
