Convert HTML to Markdown using Regex and BeautifulSoup4, and filter out repeated content using Jina Embeddings and a similarity threshold.

Convert and Format HTML to Markdown

Description

Extracts HTML content from a JSON file and produces a Markdown file. Uses a similarity threshold to remove redundant content.

Problem to Solve

Building retrieval-augmented generation AI applications can be a lengthy process. While there are web crawlers to collect content, post-processing that content is equally important for accurate and helpful generation.

This library was built specifically to augment the context-curator project by further automating the document creation process.

Quick Start | Getting Started

  1. Installation
  • To have access to the package in your local environment, clone the repository using git: git clone https://github.com/daethyra/context-converter.git

  • To install via pip, run: pip install context-converter

Optional: Run jina_embeddings.py to preemptively download the embeddings model.

  2. Navigate into the context-converter folder: cd context-converter

  3. Place a JSON file of HTML content into the same folder.

  4. Run python3 main.py

Your output file will be created in the same folder.
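
The layout of the input JSON is not documented above. As a minimal sketch of the extract-and-convert step, the example below assumes a JSON array of objects that each carry an `html` string (a hypothetical layout) and uses only the Python standard library instead of the package's own Regex and BeautifulSoup4 code. The names `MarkdownSketch` and `json_html_to_markdown` are illustrative and do not come from the package.

```python
import json
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Tiny HTML-to-Markdown converter covering headings, paragraphs, and list items."""

    # Markdown prefix emitted for each supported block-level tag.
    PREFIXES = {"h1": "# ", "h2": "## ", "h3": "### ", "li": "- ", "p": ""}

    def __init__(self):
        super().__init__()
        self.lines = []      # finished Markdown lines
        self._prefix = None  # prefix of the block currently open, if any
        self._buffer = []    # text fragments collected inside that block

    def handle_starttag(self, tag, attrs):
        if tag in self.PREFIXES:
            self._prefix = self.PREFIXES[tag]
            self._buffer = []

    def handle_data(self, data):
        if self._prefix is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.PREFIXES and self._prefix is not None:
            # Collapse runs of whitespace left over from the HTML source.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.lines.append(self._prefix + text)
            self._prefix = None


def json_html_to_markdown(raw_json):
    """Convert every `html` field in a JSON array into one Markdown string."""
    parser = MarkdownSketch()
    for record in json.loads(raw_json):  # assumed layout: [{"html": "..."}, ...]
        parser.feed(record["html"])
    parser.close()
    return "\n\n".join(parser.lines)


sample = json.dumps([
    {"html": "<h1>Title</h1><p>Some <b>bold</b> text.</p>"},
    {"html": "<ul><li>First</li><li>Second</li></ul>"},
])
markdown = json_html_to_markdown(sample)
print(markdown)
```

Inline tags such as `<b>` are flattened to plain text in this sketch; the real converter may handle them differently.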

Configuration

You can adjust the similarity threshold and other parameters to control what ends up in the output.

i. In main.py, you can set the following parameters to optimize your results:

  • main.py
    • chunk_size: The size of the chunk to be processed. The default value is 256.
    • You can find speed tests here.
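
What a chunk contains is not stated above. One plausible reading, shown here purely as an assumption, is that the input lines are processed `chunk_size` at a time. `iter_chunks` is a hypothetical helper, not a function from main.py.

```python
def iter_chunks(lines, chunk_size=256):
    """Yield consecutive slices of `lines`, each holding at most `chunk_size` items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(lines), chunk_size):
        yield lines[start:start + chunk_size]


# 600 lines split with the default chunk size of 256.
lines = [f"line {i}" for i in range(600)]
chunk_lengths = [len(chunk) for chunk in iter_chunks(lines)]
print(chunk_lengths)  # [256, 256, 88]
```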

ii. In converter.py, you can set the following parameters to optimize your results:

  • converter.py
    • similarity.item(): The similarity threshold. The default value is 0.868899. Content whose similarity value is above the threshold is removed, so a higher threshold removes less content and a lower threshold removes more.
    • batch_size: Process embeddings for the given lines in batches. The default value is 16, which has proved to be faster than higher values, up to 256. Speed test results.
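
To illustrate how the threshold behaves, here is a minimal sketch that drops a line when its cosine similarity to an already kept line is above the threshold. The word-count `toy_embed` function is a stand-in for Jina Embeddings so that the example runs without downloading a model, and `filter_similar` is a hypothetical name rather than the function in converter.py. The batching loop only mirrors the idea behind `batch_size`.

```python
import math
from collections import Counter


def toy_embed(batch):
    """Stand-in embedding: word counts per line (the package uses Jina Embeddings)."""
    return [Counter(line.lower().split()) for line in batch]


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def filter_similar(lines, threshold=0.868899, batch_size=16):
    """Keep a line only if it is not too similar to any line kept before it."""
    # Embed the lines batch by batch, the way a real model would be called.
    vectors = []
    for start in range(0, len(lines), batch_size):
        vectors.extend(toy_embed(lines[start:start + batch_size]))

    kept, kept_vectors = [], []
    for line, vector in zip(lines, vectors):
        # A similarity strictly above the threshold marks the line as redundant.
        if any(cosine(vector, seen) > threshold for seen in kept_vectors):
            continue
        kept.append(line)
        kept_vectors.append(vector)
    return kept


lines = [
    "Home About Contact",
    "Home About Contact",
    "Home About Contact Blog",
    "Install the package with pip",
]
strict = filter_similar(lines, threshold=0.95)  # higher threshold removes less
loose = filter_similar(lines, threshold=0.80)   # lower threshold removes more
print(len(strict), len(loose))  # 3 2
```

With the higher threshold only the exact duplicate is removed and the near-duplicate navigation line survives; with the lower threshold the near-duplicate is dropped as well.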

License

MIT

Download files

Source Distribution

context_converter-1.1.1.tar.gz (9.2 kB)

Uploaded Source

Built Distribution

context_converter-1.1.1-py3-none-any.whl (9.3 kB)

Uploaded Python 3
