Split text into semantically coherent, LLM-categorized chunks

chunklabel

A Python library for splitting text into categorized chunks using an LLM.

Overview

chunklabel segments text into semantically coherent spans and assigns a free-form category to each. The LLM names the categories itself; there is no predefined schema. Each chunk's quote is a verbatim excerpt from the source text: after the LLM responds, the quote is aligned back to the original to recover its character offsets.

from chunklabel import ChunkLabeler

labeler = ChunkLabeler()
chunks = labeler.split(
    "The project kicked off in January with a small team. "
    "Budget constraints forced a scope reduction in March. "
    "Despite the setbacks, the product launched successfully in June."
)

# [
#   Chunk(category="initiation", quote="The project kicked off in January with a small team", start=0,   end=51),
#   Chunk(category="obstacle",   quote="Budget constraints forced a scope reduction in March", start=53,  end=105),
#   Chunk(category="outcome",    quote="the product launched successfully in June", start=129, end=170),
# ]
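
Because each quote is a verbatim excerpt, slicing the source text with a chunk's offsets should give back the quote. The following is a minimal check of that property, not library code: it assumes `end` is exclusive (the Python slice convention), uses a stand-in `Chunk` dataclass with the fields shown in the example output, and the helper `locate` is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    # Stand-in mirroring the fields in the example output; not imported from chunklabel.
    category: str
    quote: str
    start: int
    end: int


text = (
    "The project kicked off in January with a small team. "
    "Budget constraints forced a scope reduction in March. "
    "Despite the setbacks, the product launched successfully in June."
)


def locate(category: str, quote: str) -> Chunk:
    """Build a Chunk for an exact (noise-free) quote using str.find."""
    start = text.find(quote)
    if start == -1:
        raise ValueError(f"quote not found in source: {quote!r}")
    return Chunk(category, quote, start, start + len(quote))


chunks = [
    locate("initiation", "The project kicked off in January with a small team"),
    locate("obstacle", "Budget constraints forced a scope reduction in March"),
    locate("outcome", "the product launched successfully in June"),
]

# Every chunk round-trips: slicing the source by (start, end) yields the quote.
for chunk in chunks:
    assert text[chunk.start:chunk.end] == chunk.quote
```

Exact lookup only works when the LLM copies the quote without changes; the fuzzy alignment step in the pipeline exists for the cases where it does not.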

Installation

pip install chunklabel

For in-process inference with llama.cpp:

pip install "chunklabel[llamacpp]"

Data structures

The LLM returns raw chunks without span information. Alignment is performed as a separate step, producing the final Chunk with character-level positions.

from dataclasses import dataclass

# Intermediate: LLM output
@dataclass
class RawChunk:
    category: str   # Free-form category name assigned by the LLM
    quote: str      # Verbatim excerpt (may contain minor transcription noise)

# Final: after alignment
@dataclass
class Chunk:
    category: str   # Same as RawChunk
    quote: str      # Excerpt aligned to source text
    start: int      # Start index in source text
    end: int        # End index in source text

Pipeline

Input text
     │
     ▼
LLM  →  [{category, quote}, ...]   (RawChunk list)
     │
     ▼
rapidfuzz alignment  →  (start, end) resolved per chunk
     │
     ▼
Span post-processing  (lenient mode)
     │  gap-filling / overlap resolution
     ▼
Chunk list
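
The alignment step can be pictured as a fuzzy substring search: find the region of the source that best matches the LLM's quote, and reject the match if its score falls below a threshold. The sketch below illustrates the idea with the standard library's `difflib` as a stand-in for rapidfuzz, so that it runs without extra dependencies. The function `align_quote`, its sliding-window strategy, and its scoring are assumptions for illustration and are not chunklabel's actual implementation.

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple


def align_quote(source: str, quote: str, threshold: float = 80.0) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the window of `source` most similar to `quote`.

    Slides a window of len(quote) over the source and scores each position
    with difflib's ratio, scaled to 0-100. Returns None below `threshold`.
    """
    if not quote or not source:
        return None
    width = min(len(quote), len(source))
    matcher = SequenceMatcher(None, autojunk=False)
    matcher.set_seq2(quote)  # seq2 is cached, so only seq1 changes per window
    best_score, best_start = -1.0, 0
    for start in range(len(source) - width + 1):
        matcher.set_seq1(source[start:start + width])
        score = 100.0 * matcher.ratio()
        if score > best_score:
            best_score, best_start = score, start
    if best_score < threshold:
        return None
    return best_start, best_start + width


source = (
    "The project kicked off in January with a small team. "
    "Budget constraints forced a scope reduction in March."
)
# The LLM's quote carries one transcription error ("reduktion").
noisy_quote = "Budget constraints forced a scope reduktion in March"

span = align_quote(source, noisy_quote)
if span is not None:
    start, end = span
    print(source[start:end])  # the excerpt as it appears in the source
```

A fixed-width window is the simplest choice and copes with substitutions; a quote with dropped or inserted characters would need a variable-width search, which is the kind of problem a dedicated fuzzy-matching library is built for.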

Lenient mode

Gaps: unassigned spans between chunks are filled automatically with chunks of category "uncategorized"
Overlaps: the earlier chunk takes priority; the later chunk's start is pushed forward past the overlap
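
The two rules above can be sketched as a single pass over the chunks in start order. This is an illustrative reading of the rules, not the library's code: `resolve_spans` is a hypothetical name, `Chunk` is a stand-in dataclass, and dropping a chunk that is fully covered by an earlier one is an assumption the rules do not state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    # Stand-in with the same fields as the Chunk described in "Data structures".
    category: str
    quote: str
    start: int
    end: int


def resolve_spans(text: str, chunks: List[Chunk]) -> List[Chunk]:
    """Apply the gap-filling and overlap rules to a list of aligned chunks."""
    result: List[Chunk] = []
    cursor: Optional[int] = None  # end of the last chunk kept so far
    for chunk in sorted(chunks, key=lambda c: (c.start, c.end)):
        # Overlap: the earlier chunk wins, so push this chunk's start forward.
        start = chunk.start if cursor is None else max(chunk.start, cursor)
        if start >= chunk.end:
            continue  # assumption: drop a chunk fully covered by earlier ones
        if cursor is not None and start > cursor:
            # Gap: fill the unassigned span between two chunks.
            result.append(Chunk("uncategorized", text[cursor:start], cursor, start))
        result.append(Chunk(chunk.category, text[start:chunk.end], start, chunk.end))
        cursor = chunk.end
    return result


text = "aaaa bbbb cccc dddd"
aligned = [
    Chunk("x", "aaaa b", 0, 6),
    Chunk("y", "bbbb", 4, 9),    # overlaps the end of "x"
    Chunk("z", "dddd", 15, 19),  # leaves a gap after "y"
]
for chunk in resolve_spans(text, aligned):
    print(chunk.category, chunk.start, chunk.end, repr(chunk.quote))
```

The sketch only fills gaps between chunks, as the rule is worded; text before the first chunk or after the last one is left alone.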

Category normalization (offline)

After processing multiple texts, category names can drift across runs. A dedicated normalization step lets the LLM consolidate them in batch.

from chunklabel import Normalizer

normalizer = Normalizer()
normalizer.build_mapping(all_chunks)
# {"kick-off": "initiation", "project start": "initiation", "blocker": "obstacle", ...}

normalized_chunks = normalizer.apply(all_chunks)

build_mapping stores the mapping on the Normalizer instance, so apply uses it without the mapping being passed explicitly. To reuse the mapping across runs without calling the LLM again:

# Save after building
normalizer.save("mapping.json")

# Restore later
normalizer = Normalizer.load("mapping.json")
normalized_chunks = normalizer.apply(all_chunks)

Normalization runs offline over the full category inventory, so the LLM can make globally consistent decisions rather than local ones.
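
Once a mapping exists, applying and persisting it needs no LLM call. The sketch below shows the idea with plain dictionaries and JSON. The function names are hypothetical, and two details are assumptions rather than documented behavior: that the saved file is a flat JSON object of variant-to-canonical names, and that categories missing from the mapping pass through unchanged.

```python
import json
from pathlib import Path
from typing import Dict, List


def apply_mapping(categories: List[str], mapping: Dict[str, str]) -> List[str]:
    """Replace each category by its canonical name; unknown names pass through."""
    return [mapping.get(name, name) for name in categories]


def save_mapping(mapping: Dict[str, str], path: Path) -> None:
    """Write the mapping as a flat JSON object."""
    path.write_text(json.dumps(mapping, indent=2, sort_keys=True), encoding="utf-8")


def load_mapping(path: Path) -> Dict[str, str]:
    """Read a mapping previously written by save_mapping."""
    return json.loads(path.read_text(encoding="utf-8"))


mapping = {"kick-off": "initiation", "project start": "initiation", "blocker": "obstacle"}
print(apply_mapping(["kick-off", "blocker", "outcome"], mapping))
# ['initiation', 'obstacle', 'outcome']
```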

Configuration

labeler = ChunkLabeler(
    client="gpt-4o",     # model name string, or a BaseLLMClient instance
    fuzzy_threshold=80,  # match threshold for rapidfuzz alignment (0–100)
)

Using local LLMs

llama.cpp (in-process)

from llama_cpp import Llama
from chunklabel import ChunkLabeler
from chunklabel.llm import LlamaCppClient

client = LlamaCppClient(Llama(model_path="path/to/model.gguf", n_ctx=4096))
labeler = ChunkLabeler(client=client)

OpenAI-compatible server (e.g. llama.cpp server, Ollama)

Set OPENAI_BASE_URL before constructing the client:

OPENAI_BASE_URL=http://localhost:8080/v1 python your_script.py

from chunklabel import ChunkLabeler
from chunklabel.llm import OpenAIClient

labeler = ChunkLabeler(client=OpenAIClient(model="llama3", api_key="not-used"))

Note: local models must support JSON-mode structured output.
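
Structured output matters because the pipeline expects a list of {category, quote} items that can be parsed mechanically. A defensive parser for such a payload might look like the sketch below. It is not the library's parser: `parse_raw_chunks` is a hypothetical name, and the assumption that the model returns a top-level JSON array is taken from the pipeline diagram, not from the library's prompt or schema.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class RawChunk:
    # Stand-in with the same fields as the RawChunk in "Data structures".
    category: str
    quote: str


def parse_raw_chunks(payload: str) -> List[RawChunk]:
    """Parse a JSON array of {category, quote} objects, rejecting malformed items."""
    data = json.loads(payload)  # raises json.JSONDecodeError on non-JSON output
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of chunks")
    chunks: List[RawChunk] = []
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"item {i} is not an object")
        category, quote = item.get("category"), item.get("quote")
        if not isinstance(category, str) or not isinstance(quote, str):
            raise ValueError(f"item {i} needs string 'category' and 'quote' fields")
        chunks.append(RawChunk(category.strip(), quote))
    return chunks


payload = '[{"category": "initiation", "quote": "The project kicked off in January"}]'
print(parse_raw_chunks(payload))
```

A model without reliable JSON output tends to fail at the `json.loads` call or the field checks, which is the failure mode the note above warns about.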

Downstream use cases

The Chunk list produced by chunklabel is designed as input for further analysis:

NLI: score the relationship between hypotheses and chunk categories
NER: analyze co-occurrence between entity labels and categories
Relation extraction: map entity-pair relations to chunk categories
Conditional generation: use category as a conditioning signal for language models
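
As one illustration of the NER use case, character offsets make it straightforward to count which entity labels fall inside which chunk categories. The sketch below assumes entities arrive as (label, start, end) tuples from some separate tagger; that format, the `cooccurrence` function, and the stand-in `Chunk` dataclass are all hypothetical and not part of chunklabel.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    # Stand-in with the same fields as the Chunk described in "Data structures".
    category: str
    quote: str
    start: int
    end: int


# Hypothetical NER output: (label, start, end) character spans from any tagger.
Entity = Tuple[str, int, int]


def cooccurrence(chunks: List[Chunk], entities: List[Entity]) -> Counter:
    """Count (entity label, chunk category) pairs where the entity lies inside the chunk."""
    counts: Counter = Counter()
    for label, ent_start, ent_end in entities:
        for chunk in chunks:
            if chunk.start <= ent_start and ent_end <= chunk.end:
                counts[(label, chunk.category)] += 1
    return counts


chunks = [
    Chunk("initiation", "The project kicked off in January with a small team", 0, 51),
    Chunk("obstacle", "Budget constraints forced a scope reduction in March", 53, 105),
]
entities = [("DATE", 26, 33), ("DATE", 100, 105)]  # "January", "March"
print(cooccurrence(chunks, entities))
```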

License

MIT

Release history

0.3.0 (this version): Jun 19, 2026
0.2.0: Jun 8, 2026
0.1.9: Jun 8, 2026
0.1.8: May 19, 2026
0.1.7: May 18, 2026
0.1.6: May 13, 2026
0.1.5: May 13, 2026

