Split text into semantically coherent, LLM-categorized chunks
seam
A Python library for splitting text into categorized chunks using an LLM.
Overview
seam segments text into semantically coherent spans and assigns a free-form category to each. The LLM names the categories itself; there is no predefined schema. Each chunk's quote is a verbatim excerpt from the source text, aligned back to the original after the LLM responds so that its character offsets can be recovered.
from seam import Seam
seam = Seam()
chunks = seam.split(
    "The project kicked off in January with a small team. "
    "Budget constraints forced a scope reduction in March. "
    "Despite the setbacks, the product launched successfully in June."
)
# [
#   Chunk(category="initiation", quote="The project kicked off in January with a small team", start=0, end=51),
#   Chunk(category="obstacle", quote="Budget constraints forced a scope reduction in March", start=53, end=105),
#   Chunk(category="outcome", quote="the product launched successfully in June", start=129, end=170),
# ]
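Because each quote is a verbatim excerpt, slicing the source with a chunk's start and end should reproduce the quote. The check below is a minimal sketch of that invariant, assuming end is exclusive (as the first chunk in the example suggests). The Chunk class here is a local stand-in that mirrors the fields shown above, and spans_are_consistent is a hypothetical helper, not part of seam.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """Local stand-in mirroring the fields shown in the example output."""
    category: str
    quote: str
    start: int
    end: int  # assumed exclusive


def spans_are_consistent(text: str, chunks: List[Chunk]) -> bool:
    """Return True if every chunk's span slices back to its quote."""
    return all(text[c.start:c.end] == c.quote for c in chunks)


text = (
    "The project kicked off in January with a small team. "
    "Budget constraints forced a scope reduction in March. "
    "Despite the setbacks, the product launched successfully in June."
)
quote = "Budget constraints forced a scope reduction in March"
start = text.find(quote)  # offsets are computed here, not hard-coded
chunk = Chunk("obstacle", quote, start, start + len(quote))
print(spans_are_consistent(text, [chunk]))
```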
Installation
pip install seam
Data structures
The LLM returns raw chunks without span information. Alignment is performed as a separate step, producing the final Chunk with character-level positions.
from dataclasses import dataclass

# Intermediate: LLM output
@dataclass
class RawChunk:
    category: str  # Free-form category name assigned by the LLM
    quote: str     # Verbatim excerpt (may contain minor transcription noise)

# Final: after alignment
@dataclass
class Chunk:
    category: str  # Same as RawChunk
    quote: str     # Excerpt aligned to source text
    start: int     # Start index in source text
    end: int       # End index in source text
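To illustrate how a slightly noisy quote can be mapped back to character positions, here is a rough sketch of the alignment step. It is not seam's implementation: seam uses rapidfuzz, whereas this sketch substitutes difflib from the standard library so it runs without extra dependencies (rapidfuzz's fuzz.partial_ratio_alignment would be a plausible counterpart). The function name align_quote and its threshold parameter are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple


def align_quote(source: str, quote: str, threshold: float = 80.0) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the best match for quote in source, or None.

    end is exclusive. threshold is a 0-100 similarity score.
    """
    # Fast path: the quote appears verbatim in the source.
    idx = source.find(quote)
    if idx != -1:
        return idx, idx + len(quote)

    # Fuzzy path: take the span from the first to the last matching block.
    matcher = SequenceMatcher(None, source, quote, autojunk=False)
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    if not blocks:
        return None
    start = blocks[0].a
    end = blocks[-1].a + blocks[-1].size

    # Score the candidate span against the quote and apply the threshold.
    score = 100.0 * SequenceMatcher(None, source[start:end], quote, autojunk=False).ratio()
    return (start, end) if score >= threshold else None


source = (
    "The project kicked off in January with a small team. "
    "Budget constraints forced a scope reduction in March. "
    "Despite the setbacks, the product launched successfully in June."
)
# The LLM dropped the "s" in "constraints"; alignment still recovers the span.
noisy = "Budget constraint forced a scope reduction in March"
span = align_quote(source, noisy)
print(span, source[span[0]:span[1]])
```

The final quote is taken from the source slice rather than from the LLM output, which is what lets the aligned Chunk stay verbatim even when the raw quote is not.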
Pipeline
Input text
│
▼
LLM → [{category, quote}, ...] (RawChunk list)
│
▼
rapidfuzz alignment → (start, end) resolved per chunk
│
▼
Span post-processing (lenient mode)
│ gap-filling / overlap resolution
▼
Chunk list
Lenient mode
- Gaps: unassigned spans between chunks are filled automatically as category="uncategorized"
- Overlaps: the earlier chunk takes priority; the later chunk's start is pushed forward
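The two rules above can be sketched as a single pass over chunks sorted by start position. This is an illustrative sketch rather than seam's code: resolve_spans is a hypothetical name, Chunk is a local stand-in, and two behaviours are assumptions not stated above (a chunk fully covered by an earlier one is dropped, and text before the first or after the last chunk is left alone).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """Local stand-in for the aligned chunk type."""
    category: str
    quote: str
    start: int
    end: int  # assumed exclusive


def resolve_spans(source: str, chunks: List[Chunk]) -> List[Chunk]:
    """Fill gaps between chunks and push overlapping starts forward."""
    resolved: List[Chunk] = []
    for chunk in sorted(chunks, key=lambda c: (c.start, c.end)):
        if not resolved:
            resolved.append(chunk)
            continue
        prev_end = resolved[-1].end
        # Overlap: the earlier chunk wins, so the later start moves forward.
        start = max(chunk.start, prev_end)
        if start >= chunk.end:
            continue  # assumption: fully covered chunks are dropped
        # Gap: label the unassigned span between the two chunks.
        if start > prev_end:
            resolved.append(Chunk("uncategorized", source[prev_end:start], prev_end, start))
        resolved.append(Chunk(chunk.category, source[start:chunk.end], start, chunk.end))
    return resolved


source = "abcdefghij"
chunks = [
    Chunk("x", "abcd", 0, 4),
    Chunk("y", "cdef", 2, 6),  # overlaps "x" on "cd"
    Chunk("z", "ij", 8, 10),   # leaves "gh" unassigned
]
for c in resolve_spans(source, chunks):
    print(c)
```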
Category normalization (offline)
After processing multiple texts, category names can drift across runs. A dedicated normalization step lets the LLM consolidate them in batch.
from seam import Normalizer
normalizer = Normalizer()
mapping = normalizer.build_mapping(all_chunks)
# {"kick-off": "initiation", "project start": "initiation", "blocker": "obstacle", ...}
normalized_chunks = normalizer.apply(all_chunks, mapping)
Normalization runs offline over the full category inventory, so the LLM can make globally consistent decisions rather than local ones.
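For intuition, the inventory and apply steps might look roughly like the sketch below. It leaves out the LLM call that produces the mapping; category_inventory and apply_mapping are hypothetical helpers, and Chunk is a local stand-in, so none of this should be read as the Normalizer's actual internals. Categories missing from the mapping are assumed to pass through unchanged.

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass
class Chunk:
    """Local stand-in for the aligned chunk type."""
    category: str
    quote: str
    start: int
    end: int


def category_inventory(chunks: List[Chunk]) -> List[str]:
    """Collect the distinct category names seen across all chunks."""
    return sorted({c.category for c in chunks})


def apply_mapping(chunks: List[Chunk], mapping: Dict[str, str]) -> List[Chunk]:
    """Return new chunks with categories rewritten; inputs are not mutated."""
    return [replace(c, category=mapping.get(c.category, c.category)) for c in chunks]


all_chunks = [
    Chunk("kick-off", "a", 0, 1),
    Chunk("blocker", "b", 2, 3),
    Chunk("outcome", "c", 4, 5),
]
mapping = {"kick-off": "initiation", "blocker": "obstacle"}
print(category_inventory(all_chunks))
print([c.category for c in apply_mapping(all_chunks, mapping)])
```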
Configuration
seam = Seam(
    model="gpt-4o",      # LLM model to use
    fuzzy_threshold=80,  # Match threshold for rapidfuzz alignment (0–100)
)
Using local LLMs
seam uses LangChain's BaseChatModel interface internally, so any compatible model can be passed via the llm parameter.
Ollama
from langchain_ollama import ChatOllama
from seam import Seam
seam = Seam(llm=ChatOllama(model="llama3"))
llama.cpp (OpenAI-compatible server)
from langchain_openai import ChatOpenAI
from seam import Seam
seam = Seam(llm=ChatOpenAI(
    model="llama3",
    base_url="http://localhost:8080/v1",
    api_key="not-used",
))
Note: local models must support structured output (JSON mode). If with_structured_output is not reliable, wrap the model with a JSON-enforcing layer before passing it in.
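One possible shape for such a JSON-enforcing layer is a strict parser that pulls the JSON array out of the model's raw text and rejects anything that does not match the {category, quote} shape. The sketch below covers only that parsing and validation step; parse_raw_chunks is a hypothetical helper, and how it would be wired into a LangChain model is left open.

```python
import json
from typing import Dict, List


def parse_raw_chunks(text: str) -> List[Dict[str, str]]:
    """Extract and validate a list of {category, quote} objects from raw model text."""
    # Local models often wrap JSON in chatter; keep only the outermost array.
    start = text.find("[")
    end = text.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON array found in model output")

    items = json.loads(text[start:end + 1])  # JSONDecodeError is a ValueError

    chunks: List[Dict[str, str]] = []
    for item in items:
        if not isinstance(item, dict):
            raise ValueError(f"expected an object, got {type(item).__name__}")
        category = item.get("category")
        quote = item.get("quote")
        if not isinstance(category, str) or not isinstance(quote, str):
            raise ValueError("each item needs string 'category' and 'quote' fields")
        chunks.append({"category": category, "quote": quote})
    return chunks


raw = 'Here you go: [{"category": "initiation", "quote": "The project kicked off"}] Hope this helps.'
print(parse_raw_chunks(raw))
```

A caller could retry the model on ValueError instead of passing malformed chunks on to alignment.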
Downstream use cases
The Chunk list produced by seam is designed as input for further analysis:
- NLI: score the relationship between hypotheses and chunk categories
- NER: analyze co-occurrence between entity labels and categories
- Relation extraction: map entity-pair relations to chunk categories
- Conditional generation: use category as a conditioning signal for language models
License
MIT
Download files
File details
Details for the file chunklabel-0.1.5.tar.gz.
File metadata
- Download URL: chunklabel-0.1.5.tar.gz
- Upload date:
- Size: 5.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9a92da0e62fb53cb3877b8f93830a2a9761502893f72e3ab2bd6ae9453685c90 |
| MD5 | 89a3d72d6ccb5c93c265bd6a8f2aedca |
| BLAKE2b-256 | 6062b43e033af8cf0e057d2eb7dc973e51b0f5cf7aeba2584452c7c0b479cac6 |
File details
Details for the file chunklabel-0.1.5-py3-none-any.whl.
File metadata
- Download URL: chunklabel-0.1.5-py3-none-any.whl
- Upload date:
- Size: 8.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8570a20d8785e5bf777d07e59e679698941d6cc3412fc699fd8a9002d7a4a38c |
| MD5 | 808b6c456a7fcbbd6027243501392439 |
| BLAKE2b-256 | 5e7b251ef6a4359ae38c3cb3032dbc434a9bbbeb833f1984dbcebb8eaf52159c |