Semantic document chunking library
Segmenta
Semantic document chunking for retrieval (RAG), search, and downstream analytics.
Segmenta transforms unstructured documents (PDF, Markdown, plain text) into semantically isolated chunks enriched with structured metadata. It is built as a composable pipeline with pluggable embedding + LLM providers and enterprise-friendly observability (including prompt/response traces when supported by the provider).
What You Get
- Retrieval-first chunk boundaries (topic purity over broad grouping)
- PDF-aware segmentation (splits the long, multi-idea paragraphs that PDF extraction tends to produce)
- Metadata enrichment per chunk (title, summary, intent, keywords)
- Deterministic, inspectable pipeline stages with metrics
- LLM trace logs (what was sent to the model and what came back, when supported)
Supported Inputs
- PDF (PyMuPDF)
- Markdown
- Plain text (.txt, .text)
Installation
pip install segmenta
From source:
git clone https://github.com/your-org/segmenta.git
cd segmenta
pip install -e .
Quick Start
from segmenta import Segmenta
segmenta = Segmenta(
    openai_api_key="sk-...",
    model="gpt-4o",
)
result = segmenta.chunk(
    input_file="document.pdf",
    output_dir="./output",
)
print("chunks:", result.chunk_count)
print("output:", result.output_path)
How Chunking Works (Step-by-Step)
flowchart LR
A["Input: PDF / MD / TXT"] --> B["Parse"]
B --> C["Segment: paragraphs"]
C --> P["Plan granularity (LLM)"]
P --> D["Atomize (adaptive): sentence groups"]
D --> E["Embedding similarity"]
E --> F["Propose boundaries"]
F --> G["LLM validates boundaries"]
G --> H["Form chunks"]
H --> I["LLM enriches metadata"]
I --> J["Write output: Markdown + YAML frontmatter"]
I --> K["Write trace: JSONL prompt/response log (provider-supported)"]
sequenceDiagram
participant D as Document
participant P as Parser
participant E as Embeddings
participant L as LLM
participant O as Output
D->>P: parse(input)
P->>P: segment(paragraphs)
P->>L: plan granularity (topics, expected chunks)
L-->>P: plan (JSON)
P->>P: atomize (adaptive)
P->>E: embed(text units)
E-->>P: adjacent similarities
P->>L: validate(boundary candidates)
L-->>P: KEEP/MERGE/ADJUST
P->>P: form chunks
P->>L: enrich(metadata per chunk)
L-->>P: title/summary/keywords/intent
P->>O: write Markdown output
P->>O: write JSONL trace (provider-supported)
Stage Outputs
- Parse
  - Produces a Document with sections, paragraphs, and raw text.
- Segment
  - Extracts ordered Paragraph items in document order.
- Granularity plan (LLM)
  - Sends a compact sample of the document to the LLM to estimate:
    - topics covered
    - expected chunk count for retrieval
    - recommended atomization settings
- Atomize (adaptive)
  - Based on the plan, splits long paragraphs into smaller sentence groups before boundary detection.
  - This is especially important for PDFs, where extraction often collapses multiple ideas into a single "paragraph."
- Boundary detect (embeddings)
  - Computes semantic similarity between adjacent paragraph/atom embeddings.
  - Proposes boundaries where similarity drops below similarity_threshold.
- Boundary validate (LLM)
  - Validates each proposed boundary for retrieval quality:
    - KEEP (split is correct)
    - MERGE (should be continuous)
    - ADJUST (boundary should move nearby)
- Chunk form
  - Groups paragraphs/atoms into chunks using final boundaries.
- Enrich (LLM)
  - Extracts structured metadata per chunk.
- Output
  - Writes chunked Markdown and provider-supported JSONL debug traces.
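The boundary-detect and chunk-form stages above can be illustrated with a small, self-contained sketch. This is not Segmenta's implementation: the function names (cosine, propose_boundaries, form_chunks) and the toy 2-D vectors are assumptions made for illustration, and the LLM validation step is left out. It only shows the idea of cutting where adjacent similarity drops below similarity_threshold.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def propose_boundaries(embeddings, similarity_threshold=0.5):
    """Return indices i where a boundary is proposed between unit i and unit i + 1."""
    return [
        i
        for i in range(len(embeddings) - 1)
        if cosine(embeddings[i], embeddings[i + 1]) < similarity_threshold
    ]

def form_chunks(units, boundaries):
    """Group ordered text units into chunks, cutting after each boundary index."""
    chunks, start = [], 0
    for i in sorted(boundaries):
        chunks.append(units[start : i + 1])
        start = i + 1
    chunks.append(units[start:])
    return chunks

# Toy data (hypothetical): the first two units point one way, the last two another.
units = ["budget overview", "budget approvals", "visa rules", "tax filings"]
vectors = [(1.0, 0.0), (0.9, 0.1), (0.1, 0.9), (0.0, 1.0)]

boundaries = propose_boundaries(vectors, similarity_threshold=0.5)
chunks = form_chunks(units, boundaries)
print(boundaries)  # index of the unit after which a cut is proposed
print(chunks)
```

In the real pipeline the proposed boundaries would still pass through LLM validation (KEEP/MERGE/ADJUST) before chunks are formed.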
Output Format
Each chunk is emitted as Markdown with YAML frontmatter:
---
chunk_id: chunk_001
title: Example Chunk Title
summary: 1-2 sentence summary of the chunk.
intent: explains
questions:
- What question would a user ask to retrieve this chunk?
- Another question phrased differently using key terms?
keywords:
- keyword1
- keyword2
parent_section: Some Section
token_count: 58
---
Chunk content goes here...
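For downstream consumers, a chunk file in this format can be split into its frontmatter and body with the standard library alone. The sketch below is an assumption-laden illustration, not part of Segmenta: split_frontmatter and read_scalar_fields are hypothetical helpers, and the field reader only handles top-level key: value lines (values stay strings, list-valued keys are skipped). A full YAML parser such as PyYAML's yaml.safe_load would be the more robust choice.

```python
def split_frontmatter(text):
    """Split a chunk file into (frontmatter_text, body); ("", text) if no header."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1 :])
    return "", text  # opening delimiter without a closing one

def read_scalar_fields(frontmatter):
    """Collect top-level `key: value` pairs as strings; list-valued keys are skipped."""
    fields = {}
    for line in frontmatter.splitlines():
        if line.startswith((" ", "-")) or ":" not in line:
            continue
        key, _, value = line.partition(":")
        if value.strip():
            fields[key.strip()] = value.strip()
    return fields

sample = """---
chunk_id: chunk_001
title: Example Chunk Title
intent: explains
keywords:
- keyword1
- keyword2
token_count: 58
---
Chunk content goes here..."""

header, body = split_frontmatter(sample)
fields = read_scalar_fields(header)
print(fields["chunk_id"], fields["token_count"])
print(body)
```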
Reference: PDF Example (Distributed 12-Topic Test)
This repository uses a regression-style PDF designed to test topic separation when concepts are related but intentionally distributed.
Input Reality (PDF Parsing)
PDF extraction commonly yields a small number of long paragraphs (even when the source document is conceptually separated). Without additional processing, that limits the maximum number of chunk boundaries.
Segmenta's atomize stage addresses this by splitting long paragraphs into smaller sentence groups before boundary detection.
Distributed_12_Topic_Semantic_Document.pdf
-> extracted paragraphs: 7
-> atomized sentence groups: 21
-> final semantic chunks: 14
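The counts above follow from simple arithmetic, sketched here under the assumption that a chunk boundary can only fall between two text units and never inside one. The helper names are hypothetical.

```python
def max_boundaries(unit_count):
    """n ordered text units have n - 1 gaps where a boundary can fall."""
    return max(unit_count - 1, 0)

def max_chunks(unit_count):
    """Cutting at every gap yields one chunk per unit."""
    return max_boundaries(unit_count) + 1 if unit_count > 0 else 0

for label, count in [("extracted paragraphs", 7), ("atomized sentence groups", 21)]:
    print(f"{label}: up to {max_boundaries(count)} boundaries, {max_chunks(count)} chunks")
```

With only the 7 extracted paragraphs, 14 chunks would be out of reach; the 21 atomized sentence groups leave enough gaps for them.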
Visual Chunking (PDF ↔ Chunks)
Side-by-side visualization: the source PDF with chunk-colored overlays (left) and the corresponding chunks with matching colors (right).
Example Output: 14 Chunks
Below is an example chunk set produced from Distributed_12_Topic_Semantic_Document.pdf (titles + token counts taken from generated output):
| Chunk | Title | Intent | Tokens |
|---|---|---|---|
| chunk_001 | Organizational Mobility Framework | explains | 102 |
| chunk_002 | Financial Planning in Workforce Mobility | explains | 35 |
| chunk_003 | Financial Discipline in Vendor Management | explains | 59 |
| chunk_004 | Employee Transition Support Policies | explains | 64 |
| chunk_005 | Relocation Success Factors | explains | 24 |
| chunk_006 | Domestic vs International Mobility Programs | compares | 36 |
| chunk_007 | Mitigating Compliance and Tax Communication Risks | warns | 27 |
| chunk_008 | Cross-Border Workforce Planning Risks | explains | 24 |
| chunk_009 | Employee Responsibility in Relocation | explains | 57 |
| chunk_010 | Promoting Accountability and Transparency | explains | 19 |
| chunk_011 | Change Management in Relocation Programs | explains | 32 |
| chunk_012 | Effective Communication in Organizational Change | explains | 45 |
| chunk_013 | Relocation Data Tracking and Improvement | explains | 57 |
| chunk_014 | Importance of Data in Mobility Programs | explains | 21 |
Topic Coverage Map
| Semantic area (PDF) | Chunk IDs |
|---|---|
| Governance + eligibility | chunk_001 |
| Financial planning | chunk_002 |
| Vendor management | chunk_003 |
| Housing assistance + settling-in services | chunk_004, chunk_005 |
| International compliance + tax treatment | chunk_006, chunk_007, chunk_008 |
| Employee responsibility + repayment obligations | chunk_009, chunk_010 |
| Change management + communication strategy | chunk_011, chunk_012 |
| Data tracking + continuous improvement | chunk_013, chunk_014 |
Configuration
Granularity Controls
- similarity_threshold
  - Lower values propose fewer boundaries.
  - Higher values propose more boundaries.
- atomize_sentences_per_paragraph
  - If > 0, long paragraphs can be split into sentence groups before boundary detection.
- atomize_min_sentences
  - Only atomize paragraphs that meet or exceed this sentence count.
- granularity_planning_enabled
  - If true, an initial LLM pass estimates expected chunk count and sets atomization parameters.
- granularity_max_paragraphs
  - Limits how many paragraphs are sampled and sent to the planner.
- granularity_max_chars_per_paragraph
  - Limits how much text per paragraph is included in the planner prompt.
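To make the two atomization settings concrete, here is a minimal sketch of how such gating could work. It is an illustration only: the regex sentence splitter is deliberately naive, the atomize function and its parameter names (sentences_per_group, min_sentences) are hypothetical stand-ins for atomize_sentences_per_paragraph and atomize_min_sentences, and Segmenta's actual splitting rules may differ.

```python
import re

def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

def atomize(paragraphs, sentences_per_group=2, min_sentences=6):
    """Split long paragraphs into sentence groups; leave short ones whole."""
    units = []
    for paragraph in paragraphs:
        sentences = split_sentences(paragraph)
        # Atomization is off (<= 0) or the paragraph is too short to qualify.
        if sentences_per_group <= 0 or len(sentences) < min_sentences:
            units.append(paragraph)
            continue
        for i in range(0, len(sentences), sentences_per_group):
            units.append(" ".join(sentences[i : i + sentences_per_group]))
    return units

long_paragraph = "One. Two. Three. Four. Five. Six."
short_paragraph = "Alpha. Beta."
units = atomize([long_paragraph, short_paragraph])
print(units)
```

The six-sentence paragraph meets the minimum and is split into groups of two, while the two-sentence paragraph passes through unchanged.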
YAML Example
# segmenta.yaml
chunking:
  similarity_threshold: 0.5
  atomize_sentences_per_paragraph: 2
  atomize_min_sentences: 6
  granularity_planning_enabled: true
  granularity_max_paragraphs: 60
  granularity_max_chars_per_paragraph: 280
behavior:
  retry_attempts: 2
  fallback_enabled: true
  continue_on_error: false
  verbose: false
llm:
  model: gpt-4o
  temperature: 0.1
LLM Trace Logging (Prompt/Response Audit)
When using the built-in OpenAI-compatible provider, Segmenta writes a JSONL trace file in the output directory:
Segmenta_llm_debug_<input_stem>_<utc_timestamp>.jsonl
If granularity planning is enabled, Segmenta also writes a machine-readable plan:
Segmenta_granularity_plan_<input_stem>_<utc_timestamp>.json
Each llm_call record includes:
- prompt
- system_prompt
- response
- tokens_used
- success / error
Example record:
{
  "type": "llm_call",
  "run_id": "MyDoc_20260205_001122",
  "timestamp": "2026-02-05T00:11:22.000000+00:00",
  "prompt": "...",
  "system_prompt": "...",
  "response": "...",
  "tokens_used": 452,
  "success": true,
  "error": null
}
Security note: these logs can contain sensitive document content. Store and handle them accordingly.
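Because the trace is plain JSONL, it can be summarized with the standard library. The sketch below assumes only the record fields shown in the example above (type, tokens_used, success); summarize_trace is a hypothetical helper, and the "other" record type in the sample data is invented to show that non-llm_call lines are skipped. It reads no prompt or response content.

```python
import json

def summarize_trace(lines):
    """Aggregate llm_call records from an iterable of JSONL lines."""
    summary = {"calls": 0, "failures": 0, "tokens_used": 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("type") != "llm_call":
            continue
        summary["calls"] += 1
        summary["tokens_used"] += record.get("tokens_used") or 0
        if not record.get("success"):
            summary["failures"] += 1
    return summary

# Made-up sample records; real traces also carry prompt/response fields.
sample_lines = [
    json.dumps({"type": "llm_call", "tokens_used": 452, "success": True, "error": None}),
    json.dumps({"type": "llm_call", "tokens_used": 120, "success": False, "error": "timeout"}),
    json.dumps({"type": "other"}),  # hypothetical non-call record, skipped
]
summary = summarize_trace(sample_lines)
print(summary)
```

For a trace on disk, an open file object can be passed directly, since iterating a text file yields one line at a time.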
Metrics and Observability
Each chunk(...) call returns a SegmentaResult that includes:
- chunks and output_path
- warnings / errors
- metrics (stage timings and counts)
Common metric keys:
- parse_time, segment_time, atomize_time, boundary_detect_time, boundary_validate_time, enrich_time, output_time, total_time
- paragraph_count, paragraphs_before_atomize, paragraphs_after_atomize
- boundary_proposals_count, boundaries_kept, boundaries_merged
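One way to use the timing keys is to rank stages by their share of total_time. The sketch below assumes the metrics behave like a plain dict keyed by the names listed above; stage_shares is a hypothetical helper and the sample values are made up.

```python
def stage_shares(metrics):
    """Return each *_time metric as a fraction of total_time, largest first."""
    total = metrics.get("total_time") or 0.0
    if total <= 0:
        return []
    stages = {
        key: value / total
        for key, value in metrics.items()
        if key.endswith("_time") and key != "total_time"
    }
    return sorted(stages.items(), key=lambda item: item[1], reverse=True)

# Made-up values for illustration; count metrics are ignored by the helper.
sample_metrics = {
    "parse_time": 0.5,
    "enrich_time": 3.0,
    "boundary_validate_time": 1.5,
    "total_time": 5.0,
    "paragraph_count": 7,
}
shares = stage_shares(sample_metrics)
for name, share in shares:
    print(f"{name}: {share:.0%}")
```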
Security and Data Handling
- Segmenta processes documents locally for parsing and embedding generation (Sentence Transformers by default).
- Granularity planning, boundary validation, and chunk metadata enrichment send document text to the configured LLM endpoint.
- Trace logs write prompt/response content to disk; treat output directories as sensitive data stores.
CLI
# Basic usage
segmenta document.pdf -o ./output --verbose
# Dry run (no LLM calls)
segmenta document.md -o ./output --dry-run
Provider Interop (OpenAI-Compatible Endpoints)
Segmenta can be pointed at OpenAI-compatible gateways (example: Groq) by using base_url:
import os
from segmenta import Segmenta, SegmentaConfig
from segmenta.llm import OpenAIProvider
llm = OpenAIProvider(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
    model="llama-3.3-70b-versatile",
)
segmenta = (
    Segmenta.builder()
    .with_config(SegmentaConfig(verbose=True))
    .with_llm_provider(llm)
    .build()
)
result = segmenta.chunk("document.pdf", output_dir="./output")
Extending Segmenta
Custom Parser
from segmenta.parsers.base import DocumentParser
from segmenta.parsers import ParserFactory
class DocxParser(DocumentParser):
    def supported_extensions(self):
        return [".docx"]
    def parse(self, file_path):
        ...
ParserFactory.register(".docx", DocxParser)
Custom LLM Provider
from segmenta.llm.base import LLMProvider, LLMResponse
class CustomProvider(LLMProvider):
    @property
    def model_name(self) -> str:
        return "custom-model"
    def complete(self, prompt: str, system_prompt=None) -> LLMResponse:
        ...
    def complete_json(self, prompt: str, system_prompt=None) -> dict:
        ...
License
Apache License 2.0
Key Dependencies
- pymupdf (PDF parsing)
- markdown-it-py (Markdown parsing)
- sentence-transformers (embeddings)
- openai (OpenAI-compatible LLM calls)
- tiktoken (token counting)