Curate scraped HTML for easy interpretation by large language models. Build more robust generative AI applications. Convert HTML to Markdown using regex and BeautifulSoup4, and filter out useless content with Jina Embeddings.

Project description

Convert and Format HTML to Markdown

Purpose

This package converts HTML to Markdown and formats a dataset of HTML content into structured Markdown. It can also process text embeddings to identify and remove redundant content.

  • No API keys required: this project runs the open-source Jina Embeddings model locally, so it is free to use.
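
The conversion step described above can be sketched roughly as follows. This is a minimal illustration and not the package's actual code: it uses only the Python standard library (html.parser plus regex) in place of BeautifulSoup4, and the names MarkdownSketch and html_to_markdown are hypothetical.

```python
import re
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Hypothetical, minimal HTML-to-Markdown converter (not the package's code)."""

    SKIP = {"script", "style"}  # tags whose text content is dropped
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []       # Markdown fragments, joined at the end
        self.skip_depth = 0   # > 0 while inside a skipped tag
        self.href = None      # target of the link currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.HEADINGS:
            self.parts.append("\n\n" + self.HEADINGS[tag])
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag in ("ul", "ol"):
            self.parts.append("\n")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a":
            self.parts.append(f"]({self.href or ''})")
            self.href = None

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_markdown(html):
    """Convert an HTML string to rough Markdown, then tidy whitespace with regex."""
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r" *\n *", "\n", text)    # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)  # allow at most one blank line
    return text.strip()


if __name__ == "__main__":
    print(html_to_markdown("<h1>Title</h1><p>Hello <a href='https://example.com'>world</a>.</p>"))
```

Only headings, paragraphs, list items, and links are handled here; a real pipeline would need to cover many more tags.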

Installation & Setup

First, clone the repository: git clone https://github.com/daethyra/conv_html_to_markdown.git

To get started, run:

pip install conv_html_to_markdown

  • Run jina_embeddings.py to preemptively download the embeddings model.

Example integration:

  • For example usage, see gpt-crawler. This fork of gpt-crawler has the conv_html_to_markdown package integrated into its processing pipeline.

Configuration:

  • To configure the similarity threshold for removing content, the chunk size, the maximum number of threads, the file pattern to match when loading files for conversion, and the output file's name, clone the package repository and edit the parameter values passed into the main function in main.py.
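
To show what the similarity threshold controls, here is a rough sketch of threshold-based removal of redundant chunks. Everything in it is an assumption made for illustration: toy_embed is a bag-of-words stand-in and not the Jina Embeddings model, and the function and parameter names (drop_redundant, similarity_threshold) are hypothetical and not taken from main.py.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: word counts. Assumption only; NOT Jina Embeddings."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drop_redundant(chunks, similarity_threshold=0.9, embed=toy_embed):
    """Keep a chunk only if it is less similar than the threshold to every kept chunk."""
    kept, kept_vectors = [], []
    for chunk in chunks:
        vector = embed(chunk)
        if all(cosine(vector, seen) < similarity_threshold for seen in kept_vectors):
            kept.append(chunk)
            kept_vectors.append(vector)
    return kept


if __name__ == "__main__":
    chunks = [
        "Subscribe to our newsletter today",
        "Install the package with pip",
        "Subscribe to our newsletter today!",  # near-duplicate of the first chunk
        "Edit the values in main.py",
    ]
    for chunk in drop_redundant(chunks):
        print(chunk)
```

In this sketch, lowering the threshold removes more chunks, while raising it toward 1.0 removes only near-exact repeats. With a real embedding model the same loop would compare dense vectors in place of word counts.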

Download files

Source Distribution

conv_html_to_markdown-0.1.32.tar.gz (7.6 kB)

Uploaded Source

Built Distribution

conv_html_to_markdown-0.1.32-py3-none-any.whl (7.9 kB)

Uploaded Python 3

File details

Details for the file conv_html_to_markdown-0.1.32.tar.gz.

File hashes

Hashes for conv_html_to_markdown-0.1.32.tar.gz
Algorithm Hash digest
SHA256 fdaef70a79342e433d325d6cce7d476655ca044c4f61a9563be25585c8cec432
MD5 ef2f9758b7ce68fb6f66a7e04686eda9
BLAKE2b-256 44b63c32d789b699a365e67a5b6fabe8d5986936f29ad953f30c1334d6659b5f

File details

Details for the file conv_html_to_markdown-0.1.32-py3-none-any.whl.

File hashes

Hashes for conv_html_to_markdown-0.1.32-py3-none-any.whl
Algorithm Hash digest
SHA256 4cb0ced9b845bbd06aee8cb44f6786845c51181792c4633807df8c1700da1839
MD5 f7a3f05fa6930611b67963cbfa5f9a5a
BLAKE2b-256 3fa4ccd7bf826f3abbe3b14df52bc633313f26de05bc02cff2f73dfe2747a9d8
