Convert HTML to Markdown using Regex, BeautifulSoup4, and filter repeating characters with Jina Embeddings and a similarity threshold.
Convert and Format HTML to Markdown
Description
Extracts HTML content from a JSON file to produce a Markdown file. Leverages a similarity threshold to remove redundant content.
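As a rough illustration of the JSON-to-Markdown step, here is a minimal sketch. It is not the library's implementation: it swaps in the standard library's html.parser for BeautifulSoup4 so that it runs without extra dependencies, and the input shape (a JSON array of objects with an "html" field) and the names MarkdownExtractor and html_to_markdown are assumptions made for the example.

```python
import json
import re
from html.parser import HTMLParser


class MarkdownExtractor(HTMLParser):
    """Collects text from HTML, mapping headings and list items to Markdown."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    BLOCK_TAGS = {"p", "div", "ul", "ol"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.parts.append("\n\n" + self.HEADINGS[tag])
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n\n")
        elif tag == "li":
            self.parts.append("- ")

    def handle_endtag(self, tag):
        if tag in self.HEADINGS or tag in self.BLOCK_TAGS:
            self.parts.append("\n\n")
        elif tag == "li":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_markdown(html: str) -> str:
    """Convert a small HTML fragment to Markdown-style text."""
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)    # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one blank line in a row
    return text.strip()


# Hypothetical input shape: a JSON array of objects with an "html" field.
raw = json.dumps(
    [{"html": "<h1>Title</h1><p>Hello   <b>world</b></p><ul><li>one</li><li>two</li></ul>"}]
)
records = json.loads(raw)
markdown = "\n\n".join(html_to_markdown(record["html"]) for record in records)
print(markdown)
```

Inline tags such as b are simply dropped here; a fuller converter would map them to Markdown emphasis.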
Problem to Solve
Building retrieval-augmented generation (RAG) AI applications can be a lengthy process. While there are web crawlers to collect content, post-processing that content is equally important for accurate and helpful generation.
This library was built specifically to augment the context-curator project by further automating the document creation process.
Quick Start | Getting Started
- Installation
  - To have access to the package in your local environment, clone the repository using git: git clone https://github.com/daethyra/context-converter.git
  - To install via pip, run: pip install context-converter
  - Optional: Run jina_embeddings.py to preemptively download the embeddings model.
- Navigate into the context-converter folder: cd context-converter
- Place a JSON file of HTML content into the same folder.
- Run python3 main.py. Your output file will be created in the same folder.
Configuration
You can tune the similarity threshold and other parameters to control which content is kept.
i. In main.py, you can set the following parameters to optimize your results:
- chunk_size: The size of the chunk to be processed. The default value is 256.
- You can find speed tests here.
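To make the chunk_size parameter concrete, here is a minimal sketch of fixed-size chunking. It assumes that a chunk is counted in lines, which the description above does not confirm, and the helper name chunked is hypothetical rather than taken from main.py.

```python
from typing import Iterator, List


def chunked(lines: List[str], chunk_size: int = 256) -> Iterator[List[str]]:
    """Yield consecutive slices of at most chunk_size lines."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(lines), chunk_size):
        yield lines[start:start + chunk_size]


# 600 placeholder lines split with the default chunk size of 256.
sample_lines = [f"line {i}" for i in range(600)]
chunk_lengths = [len(chunk) for chunk in chunked(sample_lines)]
print(chunk_lengths)
```

A smaller chunk_size means more, shorter chunks; the last chunk holds whatever remains.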
ii. In converter.py, you can set the following parameters to optimize your results:
- similarity.item(): The similarity threshold. The default value is 0.868899. Only content whose similarity exceeds the threshold is removed, so a higher threshold removes less content and a lower threshold removes more.
- batch_size: Processes embeddings for the given lines in batches. The default value is 16, which has proved to be faster than higher values, up to 256. Speed test results.
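The interaction between the threshold and batch_size can be sketched as follows. This is an illustration under stated assumptions, not the code in converter.py: it assumes cosine similarity between line embeddings, uses hand-written toy vectors in place of the Jina embeddings model, and the names embed_in_batches, filter_redundant, and toy_embed_batch are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embed_in_batches(
    lines: List[str],
    embed_batch: Callable[[List[str]], List[Vector]],
    batch_size: int = 16,
) -> List[Vector]:
    """Call embed_batch on slices of at most batch_size lines."""
    vectors: List[Vector] = []
    for start in range(0, len(lines), batch_size):
        vectors.extend(embed_batch(lines[start:start + batch_size]))
    return vectors


def filter_redundant(
    lines: List[str], vectors: List[Vector], threshold: float = 0.868899
) -> List[str]:
    """Drop a line when it is more similar than threshold to a kept line."""
    kept_lines: List[str] = []
    kept_vectors: List[Vector] = []
    for line, vector in zip(lines, vectors):
        if any(cosine_similarity(vector, kept) > threshold for kept in kept_vectors):
            continue  # too close to something already kept
        kept_lines.append(line)
        kept_vectors.append(vector)
    return kept_lines


# Toy stand-in for an embeddings model (assumption: real vectors come from Jina).
TOY_VECTORS = {"Home": [1.0, 0.0], "Home page": [0.99, 0.1], "Pricing": [0.0, 1.0]}


def toy_embed_batch(batch: List[str]) -> List[Vector]:
    return [TOY_VECTORS[text] for text in batch]


lines = list(TOY_VECTORS)
vectors = embed_in_batches(lines, toy_embed_batch, batch_size=2)
default_result = filter_redundant(lines, vectors)
strict_result = filter_redundant(lines, vectors, threshold=0.999)
print(default_result)
print(strict_result)
```

With the default threshold, "Home page" is dropped as a near-duplicate of "Home"; raising the threshold to 0.999 keeps all three lines, which matches the rule that a higher threshold removes less content.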
License
Built Distribution
Hashes for context_converter-1.1.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | d6b2573a0b94871b798cabf4bb112661a34a908b062412a6cae63e553e5e35ed
MD5 | 03594d171abec7a057abdc0b183fdc3e
BLAKE2b-256 | 3387cc44c8eb2ec00ea888559555619ca5970a039c40a361a76b6b726ae973aa