Parsing, chunking, diffing, and diff filtering of text to support LLM applications
chopdiff
chopdiff is a small library of tools I've developed, mainly for use with LLMs, for handling edits to Markdown and text documents.
It aims to have minimal dependencies and to be useful for LLM applications that manipulate text, Markdown, and lightweight (not fully parsed) HTML documents.
It offers support for:
- Parsing documents into sentences and paragraphs (by default using regex heuristics, or using a sentence splitter of your choice, such as spaCy).
- Measuring the size of documents and extracting pieces of them in arbitrary units of paragraphs and sentences, with indexing of documents at the paragraph level.
- Lightweight "chunking" of documents by wrapping paragraphs in named <div>s to indicate chunks.
- Text-based diffing at the word level.
- Filtering of text-based diffs based on specific criteria.
- Transformation of documents via windows, then re-stitching the result.
All of this is done in memory, using only regex or basic Markdown parsing, to keep things simple and dependencies few. This doesn't depend on anything
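To make the regex-heuristic parsing and word-level diffing concrete, here is a minimal standard-library sketch. It is not chopdiff's API: the function names `split_paragraphs`, `split_sentences`, and `word_diff` are hypothetical, and the splitting rules are deliberately naive assumptions (blank lines separate paragraphs; `.`, `!`, or `?` followed by whitespace ends a sentence).

```python
import difflib
import re


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines (hypothetical helper, not chopdiff's API)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph: str) -> list[str]:
    """Naive heuristic: a sentence ends at ., !, or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def word_diff(old: str, new: str) -> list[tuple[str, list[str], list[str]]]:
    """Compare two texts word by word and return only the changed spans."""
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


doc = "Hello world. This is a test.\n\nSecond paragraph here!"
paragraphs = split_paragraphs(doc)
sentences = [split_sentences(p) for p in paragraphs]
changes = word_diff("the quick brown fox", "the slow brown fox jumps")
```

Each entry in `changes` is an operation tag (`replace`, `insert`, or `delete`) with the old and new words, which is the kind of structure a diff filter can then accept or reject.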
Example use cases:
- Walk through a document N paragraphs, N sentences, or N chunks at a time, processing each window with an LLM call and recombining the results.
- Ask an LLM to edit a transcript, only inserting paragraph breaks, while enforcing that the LLM can't do anything except insert whitespace.
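The two use cases above can be combined in one rough sketch: process a document a few paragraphs at a time, and accept each edited window only if nothing but whitespace changed. This uses plain Python rather than chopdiff's own functions. The names `transform_in_windows` and `only_whitespace_changed` are hypothetical, and `add_breaks` and `misbehaving` are stand-ins for an LLM call.

```python
import re
from typing import Callable


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines (hypothetical helper, not chopdiff's API)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def only_whitespace_changed(original: str, edited: str) -> bool:
    """True if both texts contain the same words in the same order."""
    return original.split() == edited.split()


def transform_in_windows(
    text: str, transform: Callable[[str], str], window_size: int = 2
) -> str:
    """Apply `transform` to windows of paragraphs and re-stitch the result.

    An edited window is kept only if it differs from the original in
    whitespace alone; otherwise the original window is used unchanged.
    """
    paragraphs = split_paragraphs(text)
    stitched = []
    for start in range(0, len(paragraphs), window_size):
        window = "\n\n".join(paragraphs[start : start + window_size])
        edited = transform(window)
        stitched.append(edited if only_whitespace_changed(window, edited) else window)
    return "\n\n".join(stitched)


def add_breaks(text: str) -> str:
    """Stand-in for an LLM that only inserts paragraph breaks."""
    return re.sub(r"(?<=[.!?]) +", "\n\n", text)


def misbehaving(text: str) -> str:
    """Stand-in for an LLM that rewrites the words themselves."""
    return text.upper()


transcript = "So we started. Then it rained.\n\nWe went home. The end."
accepted = transform_in_windows(transcript, add_breaks, window_size=1)
rejected = transform_in_windows(transcript, misbehaving, window_size=1)
```

Here `accepted` has a paragraph break after every sentence, while `rejected` is identical to the input because the uppercase rewrite fails the whitespace-only check. Comparing word lists is a coarse filter; a word-level diff would let you accept or reject individual changes instead of whole windows.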
Development
For development workflows, see development.md.
This project was built from simple-modern-poetry.