Convert HTML to Markdown using regex and BeautifulSoup4, and filter out useless content with Jina Embeddings.

Project description

Convert and Format HTML to Markdown

Purpose

This package converts HTML to Markdown and formats a dataset of HTML content into structured Markdown. It can also process text embeddings to identify and remove redundant content.
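To make the conversion step concrete, here is a minimal regex-only sketch that handles a handful of tags. It is not the package's implementation (which is described as also using BeautifulSoup4); the function name `html_to_markdown` and the set of supported tags are assumptions chosen for illustration.

```python
import re
from html import unescape


def html_to_markdown(html: str) -> str:
    """Convert a small subset of HTML to Markdown (hypothetical helper)."""
    text = html
    # Drop <script> and <style> elements together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1>", "", text)
    # <h1>..<h6> become lines prefixed with one to six '#' characters.
    text = re.sub(
        r"(?is)<h([1-6])\b[^>]*>(.*?)</h\1>",
        lambda m: "\n\n" + "#" * int(m.group(1)) + " " + m.group(2).strip() + "\n\n",
        text,
    )
    # <a href="url">label</a> becomes [label](url).
    text = re.sub(r'(?is)<a\b[^>]*?href="([^"]*)"[^>]*>(.*?)</a>', r"[\2](\1)", text)
    # Bold and italic.
    text = re.sub(r"(?is)<(strong|b)\b[^>]*>(.*?)</\1>", r"**\2**", text)
    text = re.sub(r"(?is)<(em|i)\b[^>]*>(.*?)</\1>", r"*\2*", text)
    # List items become '- ' bullets.
    text = re.sub(
        r"(?is)<li\b[^>]*>(.*?)</li>",
        lambda m: "- " + m.group(1).strip() + "\n",
        text,
    )
    # Paragraph ends and <br> become blank lines.
    text = re.sub(r"(?i)</p\s*>|<br\s*/?>", "\n\n", text)
    # Remove any tags that are left, then decode entities such as &amp;.
    text = re.sub(r"<[^>]+>", "", text)
    text = unescape(text)
    # Tidy trailing spaces and collapse runs of blank lines.
    text = re.sub(r"[ \t]+\n", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


if __name__ == "__main__":
    sample = '<h1>Title</h1><p>See <a href="https://example.com">the <b>docs</b></a>.</p>'
    print(html_to_markdown(sample))
```

Regex alone breaks down on nested or malformed markup, which is one reason to pair it with a real parser such as BeautifulSoup4.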

Installation & Setup

To get started, run:

pip install conv_html_to_markdown

No API keys are required.
Run jina_embeddings.py to preemptively download the embeddings model.

Example integration:

For an example of usage, see gpt-crawler. This fork of gpt-crawler integrates the conv_html_to_markdown package into its processing pipeline.

Configuration:

To change the configuration, clone the package repository. You can then adjust the similarity threshold for removing content, the chunk size, the maximum number of threads, the file pattern to match when loading files for conversion, and the output file's name.
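To illustrate what a similarity threshold controls, the sketch below drops any chunk whose embedding is too close to one already kept. It is an assumption-laden illustration, not the package's code: the names `cosine_similarity` and `drop_redundant` are hypothetical, the default of 0.9 is arbitrary, and the embeddings are passed in as plain lists rather than produced by the Jina model.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_redundant(
    chunks: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.9,  # illustrative value, not the package default
) -> List[str]:
    """Keep a chunk only if it is below the threshold against every kept chunk."""
    kept_chunks: List[str] = []
    kept_vectors: List[Sequence[float]] = []
    for chunk, vector in zip(chunks, embeddings):
        if all(cosine_similarity(vector, kept) < threshold for kept in kept_vectors):
            kept_chunks.append(chunk)
            kept_vectors.append(vector)
    return kept_chunks


if __name__ == "__main__":
    # Hand-written toy vectors standing in for real embeddings.
    chunks = ["Home | About | Contact", "Home | About | Contact us", "Install with pip."]
    vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
    print(drop_redundant(chunks, vectors))
```

A higher threshold removes only near-identical chunks; a lower one removes more aggressively and risks discarding distinct content.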

git clone https://github.com/daethyra/conv_html_to_markdown.git

Project details

Release history

0.1.311 yanked

Jan 17, 2024

Reason this release was yanked:

0.1.32 is the same version, but with preemptive corrections such as fixed relative imports

0.1.32

Jan 17, 2024

0.1.31 yanked

Jan 17, 2024

Reason this release was yanked:

leaving 0.1.32 as current

0.1.3 yanked

Jan 17, 2024

Reason this release was yanked:

leaving 0.1.32 as current

This version

0.1.2

Jan 1, 2024

0.1.1

Dec 29, 2023

0.1.0 yanked

Dec 29, 2023

Reason this release was yanked:

Unsure if stable. 0.1.2 is *definitely* stable, and more efficient than the version before it

Download files

Source Distribution

conv_html_to_markdown-0.1.2.tar.gz (7.6 kB)

Uploaded Jan 1, 2024 Source

Built Distribution

conv_html_to_markdown-0.1.2-py3-none-any.whl (7.8 kB)

Uploaded Jan 1, 2024 Python 3

Hashes for conv_html_to_markdown-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`b6dfe443f8471dbc1961d1d3090b8ad37c83b3dc423bd80bf33b51a88d7c2a76`
MD5	`7fccd4cef84fec7ddb6cdeb67c3b547a`
BLAKE2b-256	`6556cf528721cd857fa3da90f6ffd5c197e5a481a2610c6524b0dc9a393c0b52`

Hashes for conv_html_to_markdown-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b7c332f53729d41a625224f269b27464ffe0af8fd39fb8c2b9fae48446e74064`
MD5	`a28f205d4b268629708b898a4687240c`
BLAKE2b-256	`c0498b6a486c177d00cfeb7e97fb111ae176c6c7a8dec2d3255b83609d0844e1`
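
The SHA256 digests listed above can be compared against a downloaded file using Python's standard `hashlib` module. This is a minimal sketch; the helper names `sha256_of` and `matches_published_digest` are hypothetical, and it assumes the file has already been downloaded to a local path.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, block_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_digest(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Assumes the wheel was saved in the current directory under its original name.
    wheel = Path("conv_html_to_markdown-0.1.2-py3-none-any.whl")
    published = "b7c332f53729d41a625224f269b27464ffe0af8fd39fb8c2b9fae48446e74064"
    if wheel.exists():
        print(matches_published_digest(wheel, published))
```

A mismatch means the file differs from the one uploaded to the index and should not be installed.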