Curate scraped HTML so that large language models can interpret it easily, and build more robust generative AI applications. Convert HTML to Markdown using regular expressions and BeautifulSoup4, and filter out useless content with Jina Embeddings.
Reason this release was yanked:
leaving 0.1.32 as current
Project description
Convert and Format HTML to Markdown
Purpose
This package converts HTML to Markdown and formats a dataset of HTML content into structured Markdown. It can also process text embeddings to identify and remove redundant content.
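As a rough illustration of the conversion step, the sketch below turns a small subset of HTML into Markdown using only regular expressions from the Python standard library. It is not the package's implementation: the function name `html_to_markdown` and the set of handled tags are assumptions made for this example, and the real package also relies on BeautifulSoup4.

```python
import html
import re


def html_to_markdown(source: str) -> str:
    """Convert a small subset of HTML to Markdown (illustrative sketch only)."""
    text = source
    # Drop <script> and <style> elements together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1>", "", text)
    # Headings: <h1>..<h6> become lines prefixed with one to six '#'.
    text = re.sub(
        r"(?is)<h([1-6])\b[^>]*>(.*?)</h\1>",
        lambda m: "\n\n" + "#" * int(m.group(1)) + " " + m.group(2).strip() + "\n\n",
        text,
    )
    # Links: <a href="url">label</a> becomes [label](url).
    text = re.sub(r'(?is)<a\b[^>]*?href="([^"]*)"[^>]*>(.*?)</a>', r"[\2](\1)", text)
    # Bold and italic text.
    text = re.sub(r"(?is)<(strong|b)\b[^>]*>(.*?)</\1>", r"**\2**", text)
    text = re.sub(r"(?is)<(em|i)\b[^>]*>(.*?)</\1>", r"*\2*", text)
    # List items become '- ' bullets.
    text = re.sub(
        r"(?is)<li\b[^>]*>(.*?)</li>",
        lambda m: "- " + m.group(1).strip() + "\n",
        text,
    )
    # Paragraph ends and line breaks become blank lines.
    text = re.sub(r"(?i)</p>|<br\s*/?>", "\n\n", text)
    # Remove any remaining tags, then decode entities such as &amp;.
    text = re.sub(r"(?s)<[^>]+>", "", text)
    text = html.unescape(text)
    # Tidy whitespace: strip trailing spaces and collapse runs of blank lines.
    text = re.sub(r"[ \t]+\n", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


if __name__ == "__main__":
    page = (
        "<h1>Title</h1><p>Hello <b>world</b> &amp; "
        '<a href="https://example.com">link</a></p><script>x()</script>'
    )
    print(html_to_markdown(page))
```

Regular expressions are fragile on malformed or deeply nested markup, which is one reason a parser such as BeautifulSoup4 is usually preferable for real pages.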
Installation & Setup
To get started, run:
pip install conv_html_to_markdown
- No API keys required
- Run jina_embeddings.py to preemptively download the embeddings model.
Example integration:
- See gpt-crawler for an example of usage. This fork of gpt-crawler has the conv_html_to_markdown package integrated into its processing pipeline.
Configuration:
- Clone the package repository to configure the similarity threshold for removing content, the chunk size, the maximum number of threads, the file pattern to match when loading files for conversion, and the name of the output file.
git clone https://github.com/daethyra/conv_html_to_markdown.git
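To make the roles of the similarity threshold and the chunk size more concrete, here is a minimal sketch of threshold-based redundancy removal. It is an assumption-laden illustration, not the package's code: the names `SIMILARITY_THRESHOLD`, `CHUNK_SIZE`, `chunk_text`, `embed`, and `remove_redundant` are hypothetical, the default values are arbitrary, and a bag-of-words count vector stands in for the Jina Embeddings model so that the example runs with the standard library alone.

```python
import math
import re
from collections import Counter

SIMILARITY_THRESHOLD = 0.8  # hypothetical default: drop chunks at least this similar
CHUNK_SIZE = 200            # hypothetical default: maximum characters per chunk


def chunk_text(text: str, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split text on blank lines, then cap each piece at chunk_size characters."""
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = paragraph.strip()
        for start in range(0, len(paragraph), chunk_size):
            chunks.append(paragraph[start:start + chunk_size])
    return chunks


def embed(chunk: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", chunk.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def remove_redundant(chunks: list[str], threshold: float = SIMILARITY_THRESHOLD) -> list[str]:
    """Keep a chunk only if it is not too similar to any chunk already kept."""
    kept, kept_vectors = [], []
    for chunk in chunks:
        vector = embed(chunk)
        if any(cosine_similarity(vector, seen) >= threshold for seen in kept_vectors):
            continue  # redundant: a near-duplicate was kept earlier
        kept.append(chunk)
        kept_vectors.append(vector)
    return kept


if __name__ == "__main__":
    markdown = (
        "Subscribe to our newsletter today.\n\n"
        "Install the package with pip.\n\n"
        "Subscribe to our newsletter today!"
    )
    for chunk in remove_redundant(chunk_text(markdown)):
        print(chunk)
```

In this sketch, a lower threshold removes more content (including loosely related passages), while a higher one removes only near-duplicates. A smaller chunk size gives finer-grained filtering at the cost of more comparisons.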
Hashes for conv_html_to_markdown-0.1.3.tar.gz
Algorithm | Hash digest
---|---
SHA256 | f02dea5b1c2b8492ca960b8e285573499388eac6ef875126a3d0da966ed5d39c
MD5 | cbc8cb862e4e347f5360c0cf9e8ca729
BLAKE2b-256 | cfe873ae6ff22e26fb1f65cef81fd707cf49df196cf8f0f7000a2da81b8aad6e
Hashes for conv_html_to_markdown-0.1.3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 6f45c8af7a61cefd4c2b46a22cc30748c5be918f28fb93f2185d00a3b763e53b
MD5 | 4fac57249ec2fb293154486b95e618bf
BLAKE2b-256 | a4c40d42b8518a7978ac80dfd8284bb88097ca0d3d33ff27dd2c4c48b94ca1f9