Split text into semantically coherent, LLM-categorized chunks

seam

A Python library for splitting text into categorized chunks using an LLM.

Overview

seam segments text into semantically coherent spans and assigns a free-form category to each. The LLM names the categories itself; there is no predefined schema. Each chunk's quote is a verbatim excerpt of the source text: after the LLM responds, every quote is aligned back to the source to recover its character offsets.

from seam import Seam

seam = Seam()
chunks = seam.split(
    "The project kicked off in January with a small team. "
    "Budget constraints forced a scope reduction in March. "
    "Despite the setbacks, the product launched successfully in June."
)

# [
#   Chunk(category="initiation", quote="The project kicked off in January with a small team", start=0,   end=51),
#   Chunk(category="obstacle",   quote="Budget constraints forced a scope reduction in March", start=53,  end=105),
#   Chunk(category="outcome",    quote="the product launched successfully in June", start=129, end=170),
# ]

Installation

pip install seam

Data structures

The LLM returns raw chunks without span information. Alignment is performed as a separate step, producing the final Chunk with character-level positions.

from dataclasses import dataclass

# Intermediate: LLM output
@dataclass
class RawChunk:
    category: str   # Free-form category name assigned by the LLM
    quote: str      # Verbatim excerpt (may contain minor transcription noise)

# Final: after alignment
@dataclass
class Chunk:
    category: str   # Same as RawChunk
    quote: str      # Excerpt aligned to source text
    start: int      # Start index in source text
    end: int        # End index in source text

Pipeline

Input text
     │
     ▼
LLM  →  [{category, quote}, ...]   (RawChunk list)
     │
     ▼
rapidfuzz alignment  →  (start, end) resolved per chunk
     │
     ▼
Span post-processing  (lenient mode)
     │  gap-filling / overlap resolution
     ▼
Chunk list
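
The alignment step in the diagram can be sketched roughly as follows. This is not seam's actual implementation: it uses difflib from the standard library as a stand-in for rapidfuzz, and the function name align_quote, the sliding-window strategy, and the half-open (start, end) convention are all assumptions made for illustration. The threshold argument mirrors the 0–100 scale of fuzzy_threshold.

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple


def align_quote(source: str, quote: str, threshold: float = 80.0) -> Optional[Tuple[int, int]]:
    """Return a half-open (start, end) span for `quote` in `source`, or None.

    Hypothetical helper: difflib stands in for rapidfuzz here.
    """
    # Fast path: the LLM reproduced the excerpt exactly.
    idx = source.find(quote)
    if idx != -1:
        return idx, idx + len(quote)

    # Fallback: slide a window of the quote's length over the source
    # and keep the first window with the highest similarity (0-100).
    window = len(quote)
    best_score, best_start = 0.0, -1
    for start in range(0, max(1, len(source) - window + 1)):
        candidate = source[start:start + window]
        score = SequenceMatcher(None, quote, candidate, autojunk=False).ratio() * 100
        if score > best_score:
            best_score, best_start = score, start

    # Reject matches below the threshold instead of guessing a span.
    if best_start == -1 or best_score < threshold:
        return None
    return best_start, min(len(source), best_start + window)
```

With an exact quote the fast path returns the true offsets; with a slightly noisy quote (for example, a dropped letter) the window search returns a span that may be off by a character or so, which is why a real aligner would typically refine the boundaries afterwards.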

Lenient mode

  • Gaps: unassigned spans between chunks are filled automatically as category="uncategorized"
  • Overlaps: the earlier chunk takes priority; the later chunk's start is pushed forward
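
The two rules above can be sketched as a single pass over chunks sorted by start offset. This is a minimal illustration under stated assumptions, not seam's code: the function name postprocess_lenient is hypothetical, offsets are treated as half-open, text before the first chunk and after the last chunk is also treated as a gap, and a chunk that is entirely covered by earlier chunks is dropped.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass
class Chunk:
    category: str
    quote: str
    start: int
    end: int


def postprocess_lenient(source: str, chunks: List[Chunk]) -> List[Chunk]:
    """Fill gaps with 'uncategorized' and resolve overlaps (earlier chunk wins)."""
    result: List[Chunk] = []
    cursor = 0  # first character not yet covered by any emitted chunk
    for chunk in sorted(chunks, key=lambda c: (c.start, c.end)):
        # Overlap: push the later chunk's start forward past the earlier chunk.
        start = max(chunk.start, cursor)
        if start >= chunk.end:
            continue  # assumption: nothing left of this chunk, so drop it
        # Gap: emit the unassigned span as its own chunk.
        if start > cursor:
            result.append(Chunk("uncategorized", source[cursor:start], cursor, start))
        result.append(replace(chunk, quote=source[start:chunk.end], start=start))
        cursor = chunk.end
    # Assumption: trailing text after the last chunk is treated as a gap too.
    if cursor < len(source):
        result.append(Chunk("uncategorized", source[cursor:], cursor, len(source)))
    return result
```

Under these assumptions the output spans tile the source exactly, so concatenating the quotes in order reproduces the input text.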

Category normalization (offline)

After processing multiple texts, category names can drift across runs. A dedicated normalization step lets the LLM consolidate them in batch.

from seam import Normalizer

normalizer = Normalizer()
mapping = normalizer.build_mapping(all_chunks)
# {"kick-off": "initiation", "project start": "initiation", "blocker": "obstacle", ...}

normalized_chunks = normalizer.apply(all_chunks, mapping)

Normalization runs offline over the full category inventory, so the LLM can make globally consistent decisions rather than local ones.
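
To make the two halves of this step concrete, here is a small sketch of what "the full category inventory" and "applying a mapping" could look like. The helpers category_inventory and apply_mapping are hypothetical names used for illustration and are not part of seam's documented API; the LLM call that produces the mapping is left out.

```python
from collections import Counter
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass
class Chunk:
    category: str
    quote: str
    start: int
    end: int


def category_inventory(chunks: List[Chunk]) -> Counter:
    """Count how often each category name occurs across all chunks."""
    return Counter(c.category for c in chunks)


def apply_mapping(chunks: List[Chunk], mapping: Dict[str, str]) -> List[Chunk]:
    """Return new chunks with categories rewritten; unmapped names are kept."""
    return [replace(c, category=mapping.get(c.category, c.category)) for c in chunks]
```

Returning new Chunk objects rather than mutating in place keeps the original run output available, which helps when a mapping needs to be revised and reapplied.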

Configuration

seam = Seam(
    model="gpt-4o",          # LLM model to use
    fuzzy_threshold=80,      # Match threshold for rapidfuzz alignment (0–100)
)

Using local LLMs

seam uses LangChain's BaseChatModel interface internally, so any compatible model can be passed via the llm parameter.

Ollama

from langchain_ollama import ChatOllama
from seam import Seam

seam = Seam(llm=ChatOllama(model="llama3"))

llama.cpp (OpenAI-compatible server)

from langchain_openai import ChatOpenAI
from seam import Seam

seam = Seam(llm=ChatOpenAI(
    model="llama3",
    base_url="http://localhost:8080/v1",
    api_key="not-used",
))

Note: local models must support structured output (JSON mode). If with_structured_output is not reliable, wrap the model with a JSON-enforcing layer before passing it in.
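
One way to approach the validation half of such a JSON-enforcing layer is sketched below. It assumes the model was prompted to return a JSON array of {category, quote} objects, possibly surrounded by extra text or a code fence; parse_raw_chunks is a hypothetical helper, and how it would be attached to a LangChain model is not shown here.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class RawChunk:
    category: str
    quote: str


def parse_raw_chunks(text: str) -> List[RawChunk]:
    """Extract and validate a JSON array of {category, quote} objects."""
    # Tolerate chatter or code fences around the array by slicing
    # from the first '[' to the last ']'.
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in model output")
    try:
        items = json.loads(text[start:end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON in model output: {exc}") from exc

    chunks: List[RawChunk] = []
    for item in items:
        # Every element must be an object with string 'category' and 'quote'.
        if not (
            isinstance(item, dict)
            and isinstance(item.get("category"), str)
            and isinstance(item.get("quote"), str)
        ):
            raise ValueError(f"malformed chunk: {item!r}")
        chunks.append(RawChunk(item["category"], item["quote"]))
    return chunks
```

Raising ValueError on any malformed response gives the caller a single place to retry the request or fall back to another model.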

Downstream use cases

The Chunk list produced by seam is designed as input for further analysis:

  • NLI: score the relationship between hypotheses and chunk categories
  • NER: analyze co-occurrence between entity labels and categories
  • Relation extraction: map entity-pair relations to chunk categories
  • Conditional generation: use category as a conditioning signal for language models
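
As an illustration of the NER use case, the sketch below counts how often each entity label falls inside a chunk of each category. The entity tuples (label, start, end) are assumed to come from any NER tool that reports character offsets against the same source text; the function name cooccurrence and the half-open offset convention are assumptions for this example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Chunk:
    category: str
    quote: str
    start: int
    end: int


# (label, start, end) with character offsets into the source text
Entity = Tuple[str, int, int]


def cooccurrence(chunks: List[Chunk], entities: Iterable[Entity]) -> Counter:
    """Count (entity label, chunk category) pairs for entities inside a chunk."""
    counts: Counter = Counter()
    for label, e_start, e_end in entities:
        for chunk in chunks:
            # Only count entities fully contained in the chunk's span.
            if chunk.start <= e_start and e_end <= chunk.end:
                counts[(label, chunk.category)] += 1
    return counts
```

Entities that straddle a chunk boundary are ignored in this sketch; whether to count them for both chunks or neither is a design choice that depends on the analysis.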

License

MIT
