Semantic document chunking library

Segmenta

Semantic document chunking for retrieval (RAG), search, and downstream analytics.

Segmenta transforms unstructured documents (PDF, Markdown, plain text) into semantically isolated chunks enriched with structured metadata. It is built as a composable pipeline with pluggable embedding and LLM providers, and it exposes observability suited to enterprise use, including prompt/response traces when the provider supports them.

What You Get

Retrieval-first chunk boundaries (topic purity over broad grouping)
PDF-aware segmentation (splits the long paragraphs that PDF extraction tends to produce)
Metadata enrichment per chunk (title, summary, intent, keywords)
Deterministic, inspectable pipeline stages with metrics
LLM trace logs (what was sent to the model and what came back, when supported)

Supported Inputs

PDF (PyMuPDF)
Markdown
Plain text (.txt, .text)

Installation

pip install segmenta

From source:

git clone https://github.com/your-org/segmenta.git
cd segmenta
pip install -e .

Quick Start

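import os

from segmenta import Segmenta

# Read the API key from the environment rather than hard-coding it.
segmenta = Segmenta(
    openai_api_key=os.environ["OPENAI_API_KEY"],
    model="gpt-4o",
)

# Chunk one document; results are written under output_dir.
result = segmenta.chunk(
    input_file="document.pdf",
    output_dir="./output",
)

print("chunks:", result.chunk_count)
print("output:", result.output_path)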

How Chunking Works (Step-by-Step)

flowchart LR
  A["Input: PDF / MD / TXT"] --> B["Parse"]
  B --> C["Segment: paragraphs"]
  C --> P["Plan granularity (LLM)"]
  P --> D["Atomize (adaptive): sentence groups"]
  D --> E["Embedding similarity"]
  E --> F["Propose boundaries"]
  F --> G["LLM validates boundaries"]
  G --> H["Form chunks"]
  H --> I["LLM enriches metadata"]
  I --> J["Write output: Markdown + YAML frontmatter"]
  I --> K["Write trace: JSONL prompt/response log (provider-supported)"]

sequenceDiagram
  participant D as Document
  participant P as Parser
  participant E as Embeddings
  participant L as LLM
  participant O as Output

  D->>P: parse(input)
  P->>P: segment(paragraphs)
  P->>L: plan granularity (topics, expected chunks)
  L-->>P: plan (JSON)
  P->>P: atomize (adaptive)
  P->>E: embed(text units)
  E-->>P: adjacent similarities
  P->>L: validate(boundary candidates)
  L-->>P: KEEP/MERGE/ADJUST
  P->>P: form chunks
  P->>L: enrich(metadata per chunk)
  L-->>P: title/summary/keywords/intent
  P->>O: write Markdown output
  P->>O: write JSONL trace (provider-supported)

Stage Outputs

Parse
- Produces a Document with sections, paragraphs, and raw text.
Segment
- Extracts ordered Paragraph items in document order.
Granularity plan (LLM)
- Sends a compact sample of the document to the LLM to estimate:
  - topics covered
  - expected chunk count for retrieval
  - recommended atomization settings
Atomize (adaptive)
- Based on the plan, splits long paragraphs into smaller sentence groups before boundary detection.
- This is especially important for PDFs where extraction often collapses multiple ideas into a single "paragraph."
Boundary detect (embeddings)
- Computes semantic similarity between adjacent paragraph/atom embeddings.
- Proposes boundaries where similarity drops below similarity_threshold.
Boundary validate (LLM)
- Validates each proposed boundary for retrieval quality:
  - KEEP (split is correct)
  - MERGE (should be continuous)
  - ADJUST (boundary should move nearby)
Chunk form
- Groups paragraphs/atoms into chunks using final boundaries.
Enrich (LLM)
- Extracts structured metadata per chunk.
Output
- Writes chunked Markdown and provider-supported JSONL debug traces.
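
The boundary stages above can be illustrated with a small, dependency-free sketch. It is not Segmenta's implementation: the function names (`propose_boundaries`, `apply_verdicts`, `form_chunks`) and the toy 2-D embeddings are assumptions made for illustration, and ADJUST is deliberately left unmodelled (an adjusted boundary simply stays where it was proposed).

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def propose_boundaries(embeddings: Sequence[Sequence[float]],
                       similarity_threshold: float) -> List[int]:
    """Return each index i where similarity between unit i-1 and unit i
    drops below the threshold (a boundary is proposed before unit i)."""
    return [
        i for i in range(1, len(embeddings))
        if cosine(embeddings[i - 1], embeddings[i]) < similarity_threshold
    ]


def apply_verdicts(proposed: List[int], verdicts: Dict[int, str]) -> List[int]:
    """Drop boundaries marked MERGE; keep everything else.
    ADJUST would move a boundary, which this sketch does not model."""
    return [b for b in proposed if verdicts.get(b) != "MERGE"]


def form_chunks(units: Sequence[str], boundaries: List[int]) -> List[List[str]]:
    """Group consecutive units, starting a new chunk at each boundary index."""
    chunks, start = [], 0
    for b in sorted(boundaries) + [len(units)]:
        if b > start:
            chunks.append(list(units[start:b]))
        start = b
    return chunks


# Toy data: hypothetical units and hand-made 2-D "embeddings".
units = ["a1", "a2", "b1", "b2", "c1"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.0]]

proposed = propose_boundaries(embeddings, similarity_threshold=0.5)
final = apply_verdicts(proposed, {4: "MERGE"})  # pretend the LLM merged one split
print(form_chunks(units, final))
```

In a real run the embeddings would come from the configured embedding provider and the verdicts from the LLM validation pass; only the threshold rule itself is taken from the stage description above.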

Output Format

Each chunk is emitted as Markdown with YAML frontmatter:

---
chunk_id: chunk_001
title: Example Chunk Title
summary: 1-2 sentence summary of the chunk.
intent: explains
questions:
- What question would a user ask to retrieve this chunk?
- Another question phrased differently using key terms?
keywords:
- keyword1
- keyword2
parent_section: Some Section
token_count: 58
---

Chunk content goes here...
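
Downstream code often needs the metadata and the body separately. The following is a minimal sketch of reading one chunk file, assuming the frontmatter keeps the flat shape shown above (scalar `key: value` lines plus simple `- item` lists). `split_chunk` is a hypothetical helper, not part of Segmenta, and a full YAML parser is the safer choice for anything more complex.

```python
from typing import Dict, List, Tuple, Union

Value = Union[str, List[str]]


def split_chunk(text: str) -> Tuple[Dict[str, Value], str]:
    """Split a chunk file into (frontmatter dict, body).

    Only handles `key: value` scalars and `key:` followed by `- item` lines.
    All values are returned as strings (token_count is not converted to int).
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    try:
        # Find the closing frontmatter delimiter.
        end = next(i for i in range(1, len(lines)) if lines[i].strip() == "---")
    except StopIteration:
        return {}, text

    meta: Dict[str, Value] = {}
    current_key = None
    for line in lines[1:end]:
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("- ") and current_key is not None:
            target = meta[current_key]
            if isinstance(target, list):
                target.append(stripped[2:].strip())
            continue
        key, _, value = stripped.partition(":")
        current_key = key.strip()
        # An empty value means a list follows on the next lines.
        meta[current_key] = value.strip() if value.strip() else []
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body
```

Typical use would be `meta, body = split_chunk(open(path, encoding="utf-8").read())`, after which `meta["keywords"]` and `meta["questions"]` are plain Python lists.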

Reference: PDF Example (Distributed 12-Topic Test)

This repository uses a regression-style PDF designed to test topic separation when concepts are related but intentionally distributed.

Input Reality (PDF Parsing)

PDF extraction commonly yields a small number of long paragraphs (even when the source document is conceptually separated). Without additional processing, that limits the maximum number of chunk boundaries.

Segmenta's atomize stage addresses this by splitting long paragraphs into smaller sentence groups before boundary detection.

Distributed_12_Topic_Semantic_Document.pdf
  -> extracted paragraphs: 7
  -> atomized sentence groups: 21
  -> final semantic chunks: 14
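
To make the atomize idea concrete, here is a rough sketch of splitting one long paragraph into sentence groups. It is an illustration under assumptions, not Segmenta's code: the regex sentence splitter is naive, and reading `sentences_per_group` as the group size (mirroring `atomize_sentences_per_paragraph`) is a guess about how that setting behaves.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def atomize(paragraph: str, sentences_per_group: int, min_sentences: int) -> List[str]:
    """Split a paragraph into groups of sentences_per_group sentences.

    Paragraphs with fewer than min_sentences sentences are returned unchanged,
    as is everything when sentences_per_group is not positive.
    """
    sentences = split_sentences(paragraph)
    if sentences_per_group <= 0 or len(sentences) < min_sentences:
        return [paragraph]
    return [
        " ".join(sentences[i:i + sentences_per_group])
        for i in range(0, len(sentences), sentences_per_group)
    ]


paragraph = "A one. B two. C three. D four. E five. F six."
for group in atomize(paragraph, sentences_per_group=2, min_sentences=6):
    print(group)
```

Each group then becomes its own unit for embedding, which is what allows boundaries to fall inside what the PDF extractor reported as a single paragraph.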

Visual Chunking (PDF ↔ Chunks)

Side-by-side visualization: the source PDF with chunk-colored overlays (left) and the corresponding chunks with matching colors (right).

[Figure: Segmenta chunking visualization showing PDF highlights and the chunk list]

Example Output: 14 Chunks

Below is an example chunk set produced from Distributed_12_Topic_Semantic_Document.pdf (titles + token counts taken from generated output):

Chunk	Title	Intent	Tokens
chunk_001	Organizational Mobility Framework	explains	102
chunk_002	Financial Planning in Workforce Mobility	explains	35
chunk_003	Financial Discipline in Vendor Management	explains	59
chunk_004	Employee Transition Support Policies	explains	64
chunk_005	Relocation Success Factors	explains	24
chunk_006	Domestic vs International Mobility Programs	compares	36
chunk_007	Mitigating Compliance and Tax Communication Risks	warns	27
chunk_008	Cross-Border Workforce Planning Risks	explains	24
chunk_009	Employee Responsibility in Relocation	explains	57
chunk_010	Promoting Accountability and Transparency	explains	19
chunk_011	Change Management in Relocation Programs	explains	32
chunk_012	Effective Communication in Organizational Change	explains	45
chunk_013	Relocation Data Tracking and Improvement	explains	57
chunk_014	Importance of Data in Mobility Programs	explains	21

Topic Coverage Map

Semantic area (PDF)	Chunk IDs
Governance + eligibility	chunk_001
Financial planning	chunk_002
Vendor management	chunk_003
Housing assistance + settling-in services	chunk_004, chunk_005
International compliance + tax treatment	chunk_006, chunk_007, chunk_008
Employee responsibility + repayment obligations	chunk_009, chunk_010
Change management + communication strategy	chunk_011, chunk_012
Data tracking + continuous improvement	chunk_013, chunk_014

Configuration

Granularity Controls

similarity_threshold
- A boundary is proposed where the similarity between adjacent units drops below this value.
- Lower values propose fewer boundaries.
- Higher values propose more boundaries.
atomize_sentences_per_paragraph
- If > 0, long paragraphs can be split into sentence groups before boundary detection.
atomize_min_sentences
- Only atomize paragraphs that meet or exceed this sentence count.
granularity_planning_enabled
- If true, an initial LLM pass estimates expected chunk count and sets atomization parameters.
granularity_max_paragraphs
- Limits how many paragraphs are sampled and sent to the planner.
granularity_max_chars_per_paragraph
- Limits how much text per paragraph is included in the planner prompt.
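
The two planner limits can be pictured as a sampling step that runs before the planning prompt is built. The sketch below is an assumption about how such a step might work: `build_planner_sample` is a hypothetical name, and evenly spaced selection is only one plausible sampling strategy. The README does not say how Segmenta actually picks paragraphs.

```python
from typing import List


def build_planner_sample(paragraphs: List[str], max_paragraphs: int,
                         max_chars_per_paragraph: int) -> List[str]:
    """Pick up to max_paragraphs evenly spaced paragraphs and truncate each
    one to max_chars_per_paragraph characters."""
    if max_paragraphs <= 0 or not paragraphs:
        return []
    if len(paragraphs) <= max_paragraphs:
        picked = paragraphs
    else:
        # Evenly spaced indices across the whole document.
        step = len(paragraphs) / max_paragraphs
        picked = [paragraphs[int(i * step)] for i in range(max_paragraphs)]
    return [p[:max_chars_per_paragraph] for p in picked]


paragraphs = [f"p{i} " + "x" * 10 for i in range(10)]
print(build_planner_sample(paragraphs, max_paragraphs=5, max_chars_per_paragraph=4))
```

Whatever the real strategy, the effect of the two settings is the same: they bound the size of the planner prompt, trading planning accuracy on long documents for lower token cost.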

YAML Example

# segmenta.yaml
chunking:
  similarity_threshold: 0.5
  atomize_sentences_per_paragraph: 2
  atomize_min_sentences: 6
  granularity_planning_enabled: true
  granularity_max_paragraphs: 60
  granularity_max_chars_per_paragraph: 280

behavior:
  retry_attempts: 2
  fallback_enabled: true
  continue_on_error: false
  verbose: false

llm:
  model: gpt-4o
  temperature: 0.1

LLM Trace Logging (Prompt/Response Audit)

When using the built-in OpenAI-compatible provider, Segmenta writes a JSONL trace file in the output directory:

Segmenta_llm_debug_<input_stem>_<utc_timestamp>.jsonl

If granularity planning is enabled, Segmenta also writes a machine-readable plan:

Segmenta_granularity_plan_<input_stem>_<utc_timestamp>.json

Each llm_call record includes:

prompt
system_prompt
response
tokens_used
success / error

Example record:

{
  "type": "llm_call",
  "run_id": "MyDoc_20260205_001122",
  "timestamp": "2026-02-05T00:11:22.000000+00:00",
  "prompt": "...",
  "system_prompt": "...",
  "response": "...",
  "tokens_used": 452,
  "success": true,
  "error": null
}

Security note: these logs can contain sensitive document content. Store and handle them accordingly.
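
A trace file in this shape is easy to audit with the standard library. The following sketch totals calls, failures, and tokens; it relies only on the `type`, `tokens_used`, and `success` fields shown in the example record, and `summarize_trace` is a hypothetical helper rather than something Segmenta ships.

```python
import json
from typing import Dict, Iterable


def summarize_trace(lines: Iterable[str]) -> Dict[str, int]:
    """Count llm_call records, failed calls, and total tokens in a JSONL trace."""
    summary = {"calls": 0, "failed": 0, "tokens_used": 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record.get("type") != "llm_call":
            continue  # skip any other record types
        summary["calls"] += 1
        summary["tokens_used"] += record.get("tokens_used") or 0
        if not record.get("success", False):
            summary["failed"] += 1
    return summary


# Usage with a real trace file (path is a placeholder):
# with open("path/to/trace.jsonl", encoding="utf-8") as fh:
#     print(summarize_trace(fh))
```

Because the function only reads counters and flags, it never needs to print prompt or response text, which keeps the audit itself from leaking document content.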

Metrics and Observability

Each chunk(...) call returns a SegmentaResult that includes:

chunks and output_path
warnings / errors
metrics (stage timings and counts)

Common metric keys:

parse_time, segment_time, atomize_time, boundary_detect_time, boundary_validate_time, enrich_time, output_time, total_time
paragraph_count, paragraphs_before_atomize, paragraphs_after_atomize
boundary_proposals_count, boundaries_kept, boundaries_merged
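
One way to use these keys is to see which stage dominates a run. The sketch below assumes `metrics` behaves like a plain dict mapping the key names above to numbers, which the README does not state explicitly; `stage_shares` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

# Per-stage timing keys listed above (total_time is excluded on purpose).
STAGE_KEYS = [
    "parse_time", "segment_time", "atomize_time", "boundary_detect_time",
    "boundary_validate_time", "enrich_time", "output_time",
]


def stage_shares(metrics: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (stage, fraction of summed stage time) pairs, slowest first."""
    timings = {k: float(metrics[k]) for k in STAGE_KEYS if k in metrics}
    total = sum(timings.values())
    if total <= 0:
        return []
    return sorted(
        ((k, v / total) for k, v in timings.items()),
        key=lambda kv: kv[1],
        reverse=True,
    )


# Made-up numbers, for illustration only.
example = {"parse_time": 1.0, "enrich_time": 3.0, "total_time": 4.5, "paragraph_count": 7}
for stage, share in stage_shares(example):
    print(f"{stage}: {share:.0%}")
```

With a real result this would be called as `stage_shares(result.metrics)`, provided the metrics object supports dict-style access.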

Security and Data Handling

Segmenta processes documents locally for parsing and embedding generation (Sentence Transformers by default).
Granularity planning, boundary validation, and chunk metadata enrichment send document text to the configured LLM endpoint.
Trace logs write prompt/response content to disk; treat output directories as sensitive data stores.

CLI

# Basic usage
segmenta document.pdf -o ./output --verbose

# Dry run (no LLM calls)
segmenta document.md -o ./output --dry-run

Provider Interop (OpenAI-Compatible Endpoints)

Segmenta can be pointed at OpenAI-compatible gateways (example: Groq) by using base_url:

import os
from segmenta import Segmenta, SegmentaConfig
from segmenta.llm import OpenAIProvider

llm = OpenAIProvider(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
    model="llama-3.3-70b-versatile",
)

segmenta = (
    Segmenta.builder()
    .with_config(SegmentaConfig(verbose=True))
    .with_llm_provider(llm)
    .build()
)

result = segmenta.chunk("document.pdf", output_dir="./output")

Extending Segmenta

Custom Parser

from segmenta.parsers.base import DocumentParser
from segmenta.parsers import ParserFactory

class DocxParser(DocumentParser):
    def supported_extensions(self):
        return [".docx"]

    def parse(self, file_path):
        ...

ParserFactory.register(".docx", DocxParser)

Custom LLM Provider

from segmenta.llm.base import LLMProvider, LLMResponse

class CustomProvider(LLMProvider):
    @property
    def model_name(self) -> str:
        return "custom-model"

    def complete(self, prompt: str, system_prompt=None) -> LLMResponse:
        ...

    def complete_json(self, prompt: str, system_prompt=None) -> dict:
        ...

License

Apache License 2.0

Key Dependencies

pymupdf (PDF parsing)
markdown-it-py (Markdown parsing)
sentence-transformers (embeddings)
openai (OpenAI-compatible LLM calls)
tiktoken (token counting)

Project details

Version 1.0.0, released Feb 5, 2026.

Download files

File	Type	Size	Uploaded
segmenta-1.0.0.tar.gz	Source distribution	55.9 kB	Feb 5, 2026
segmenta-1.0.0-py3-none-any.whl	Built distribution (Python 3)	69.7 kB	Feb 5, 2026

Both files were uploaded via twine/6.2.0 CPython/3.12.10, without Trusted Publishing.

File hashes

File	Algorithm	Hash digest
segmenta-1.0.0.tar.gz	SHA256	`263261919678e14e00260b7d1bd3cb85859456853bd52f535e683d2d2157f3cd`
segmenta-1.0.0.tar.gz	MD5	`5ed12c0c78e2a7db153849524e11b5c8`
segmenta-1.0.0.tar.gz	BLAKE2b-256	`fe848863e0f1837c88cbaf17bfb7d6f9af0649b4c2bf53d76467ed6b2e29cb4a`
segmenta-1.0.0-py3-none-any.whl	SHA256	`a386cc2f7913c319d5a22a5620301f33c53f0c606de40d511474f2147c99f4f4`
segmenta-1.0.0-py3-none-any.whl	MD5	`963b5a139d2ca0557b432401dffc4c73`
segmenta-1.0.0-py3-none-any.whl	BLAKE2b-256	`4392082aa27f124f3cdf9b2a690c12ba209d13e2ebf5d4ebfdaba4395a229a22`
