Parsing, chunking, diffing, and diff filtering of text to support LLM applications

Project description

chopdiff

chopdiff is a small library of tools I've developed, mainly for use with LLMs, for handling edits to Markdown and text documents.

It aims to have minimal dependencies and be useful for various LLM applications where you want to manipulate text, Markdown, and lightweight (not fully parsed) HTML documents.

It offers support for:

  • Parsing of documents into sentences and paragraphs (by default using regex heuristics, or using a sentence splitter of your choice, such as spaCy).

  • Measuring the size of documents and extracting pieces of them, using arbitrary units of paragraphs and sentences, and indexing these documents by paragraph.

  • Support for lightweight "chunking" of documents by wrapping paragraphs in named <div>s to indicate chunks.

  • Text-based diffing at the word level.

  • Filtering of text-based diffs based on specific criteria.

  • Transformation of documents via windows, then re-stitching the result.
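To make the word-level diffing and diff filtering items above concrete, here is a minimal sketch built only on Python's standard `difflib` and `re` modules. It is not chopdiff's API: the function names (`word_tokens`, `word_diff`, `apply_filtered`, `only_whitespace`) are hypothetical and chosen purely for illustration.

```python
import difflib
import re


def word_tokens(text: str) -> list[str]:
    # Split into words and whitespace runs so that joining the tokens
    # reproduces the original text exactly.
    return re.findall(r"\s+|\S+", text)


def word_diff(old: str, new: str) -> list[tuple[str, list[str], list[str]]]:
    """Return (tag, old_tokens, new_tokens) for each span of the diff."""
    a, b = word_tokens(old), word_tokens(new)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


def apply_filtered(diff, accept) -> str:
    """Rebuild the text, keeping only the changes that `accept` approves."""
    out: list[str] = []
    for tag, old_toks, new_toks in diff:
        if tag == "equal" or accept(tag, old_toks, new_toks):
            out.extend(new_toks)
        else:
            out.extend(old_toks)  # Rejected change: keep the original tokens.
    return "".join(out)


def only_whitespace(tag, old_toks, new_toks) -> bool:
    # Example filter: accept a change only if every token involved is whitespace.
    return all(tok.isspace() for tok in old_toks + new_toks)


old = "Hello world. How are you?"
new = "Hello world.\n\nHow are you today?"
filtered = apply_filtered(word_diff(old, new), only_whitespace)
```

In this sketch the filter keeps the inserted paragraph break but rejects the reworded ending, since that change touches non-whitespace tokens.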

All of this is done in memory, using only regex or basic Markdown parsing, to keep things simple and dependencies few. This doesn't depend on anything
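As a rough illustration of what regex-only parsing can look like, the sketch below splits text into paragraphs and sentences and measures size in either unit. The regex and the function names are assumptions made for this example, not the heuristics or API that chopdiff actually uses.

```python
import re

# Heuristic (an assumption for this sketch): a sentence ends at ., !, or ?
# followed by whitespace and then an uppercase letter, digit, or quote.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9\"'])")


def split_paragraphs(text: str) -> list[str]:
    # Paragraphs are separated by one or more blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph: str) -> list[str]:
    return [s for s in SENTENCE_END.split(paragraph.strip()) if s]


def parse(text: str) -> list[list[str]]:
    """Parse text into a list of paragraphs, each a list of sentences."""
    return [split_sentences(p) for p in split_paragraphs(text)]


def size(doc: list[list[str]], unit: str) -> int:
    """Measure a parsed document in 'paragraphs' or 'sentences'."""
    if unit == "paragraphs":
        return len(doc)
    if unit == "sentences":
        return sum(len(paragraph) for paragraph in doc)
    raise ValueError(f"unknown unit: {unit}")


doc = parse("First sentence. Second one!\n\nNew paragraph here.")
```

A heuristic this simple will mis-split abbreviations such as "Dr. Smith", which is one reason to allow swapping in a dedicated sentence splitter.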

Example use cases:

  • Walk through a document N paragraphs, N sentences, or N chunks at a time, processing the results with an LLM call and recombining the result.

  • Ask an LLM to insert paragraph breaks into a transcript, while enforcing that it can't change anything except whitespace.
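The two use cases above can be sketched together: walk a document a few paragraphs at a time, pass each window to a transform (standing in for an LLM call), and accept the result only if nothing but whitespace changed. This is an illustrative sketch with hypothetical names, not chopdiff's implementation. Note that the check here also permits removing whitespace, not only inserting it.

```python
import re
from typing import Callable


def split_paragraphs(text: str) -> list[str]:
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def windows(items: list[str], size: int) -> list[list[str]]:
    # Non-overlapping windows of up to `size` items each.
    return [items[i:i + size] for i in range(0, len(items), size)]


def same_except_whitespace(before: str, after: str) -> bool:
    # Compare the two texts with all whitespace stripped out.
    return re.sub(r"\s+", "", before) == re.sub(r"\s+", "", after)


def transform_by_window(
    text: str, transform: Callable[[str], str], size: int = 2
) -> str:
    """Apply `transform` to each window of paragraphs and re-stitch the results."""
    results: list[str] = []
    for window in windows(split_paragraphs(text), size):
        original = "\n\n".join(window)
        edited = transform(original)
        # Fall back to the original if the edit touched anything but whitespace.
        results.append(edited if same_except_whitespace(original, edited) else original)
    return "\n\n".join(results)


text = "One. Two.\n\nThree.\n\nFour. Five."
# Stand-in for an LLM that only adds paragraph breaks.
well_behaved = transform_by_window(text, lambda s: s.replace(". ", ".\n\n"))
# Stand-in for an LLM that rewrites the words themselves.
misbehaving = transform_by_window(text, str.upper)
```

Rejecting a whole window is the bluntest possible policy; filtering at the level of individual diff spans would let acceptable changes through while dropping only the offending ones.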

Development

For development workflows, see development.md.


This project was built from simple-modern-poetry.

Download files

Source Distribution

chopdiff-0.1.0.tar.gz (26.7 kB)

Uploaded Source

Built Distribution

chopdiff-0.1.0-py3-none-any.whl (33.4 kB)

Uploaded Python 3

File details

Details for the file chopdiff-0.1.0.tar.gz.

File metadata

  • Download URL: chopdiff-0.1.0.tar.gz
  • Upload date:
  • Size: 26.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for chopdiff-0.1.0.tar.gz
Algorithm      Hash digest
SHA256         d145c9b9fe469e04b2edd086a8a2ac9a2c48b45939f22bfd95814fb9b23015fb
MD5            eeec2519d9a7401c718fc1228d7db007
BLAKE2b-256    cd79bbff659b528f2cad27475d93095ca9e94632559f68ff567acaa5ee8e7c26

Provenance

The following attestation bundles were made for chopdiff-0.1.0.tar.gz:

Publisher: publish.yml on jlevy/chopdiff

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file chopdiff-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: chopdiff-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 33.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for chopdiff-0.1.0-py3-none-any.whl
Algorithm      Hash digest
SHA256         b96c9693ed5809d2239e9f928826fd700a18af5127ec3442e5d43c81a00443eb
MD5            8935c4066d37f3f333c4135675595683
BLAKE2b-256    600142509f79af8c6406548ac5440935deb57b7b091f4a13b61c33fb7a45380f

Provenance

The following attestation bundles were made for chopdiff-0.1.0-py3-none-any.whl:

Publisher: publish.yml on jlevy/chopdiff

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
