Semantic document chunking library

Segmenta

Semantic document chunking for retrieval (RAG), search, and downstream analytics.

Segmenta transforms unstructured documents (PDF, Markdown, plain text) into semantically isolated chunks enriched with structured metadata. It is built as a composable pipeline with pluggable embedding and LLM providers, plus enterprise-friendly observability, including prompt/response traces when the provider supports them.

What You Get

  • Retrieval-first chunk boundaries (topic purity over broad grouping)
  • PDF-aware segmentation (handles "long paragraph" extraction realities)
  • Metadata enrichment per chunk (title, summary, intent, questions, keywords)
  • Deterministic, inspectable pipeline stages with metrics
  • LLM trace logs (what was sent to the model and what came back, when supported)

Supported Inputs

  • PDF (PyMuPDF)
  • Markdown
  • Plain text (.txt, .text)

Installation

pip install segmenta

From source:

git clone https://github.com/your-org/segmenta.git
cd segmenta
pip install -e .

Quick Start

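from segmenta import Segmenta

# Configure the pipeline with an OpenAI API key and model.
segmenta = Segmenta(
    openai_api_key="sk-...",  # placeholder; supply your own key
    model="gpt-4o",
)

# Chunk a single document; output files are written to output_dir.
result = segmenta.chunk(
    input_file="document.pdf",
    output_dir="./output",
)

# The result reports how many chunks were produced and where they were written.
print("chunks:", result.chunk_count)
print("output:", result.output_path)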

How Chunking Works (Step-by-Step)

flowchart LR
  A["Input: PDF / MD / TXT"] --> B["Parse"]
  B --> C["Segment: paragraphs"]
  C --> P["Plan granularity (LLM)"]
  P --> D["Atomize (adaptive): sentence groups"]
  D --> E["Embedding similarity"]
  E --> F["Propose boundaries"]
  F --> G["LLM validates boundaries"]
  G --> H["Form chunks"]
  H --> I["LLM enriches metadata"]
  I --> J["Write output: Markdown + YAML frontmatter"]
  I --> K["Write trace: JSONL prompt/response log (provider-supported)"]

sequenceDiagram
  participant D as Document
  participant P as Parser
  participant E as Embeddings
  participant L as LLM
  participant O as Output

  D->>P: parse(input)
  P->>P: segment(paragraphs)
  P->>L: plan granularity (topics, expected chunks)
  L-->>P: plan (JSON)
  P->>P: atomize (adaptive)
  P->>E: embed(text units)
  E-->>P: adjacent similarities
  P->>L: validate(boundary candidates)
  L-->>P: KEEP/MERGE/ADJUST
  P->>P: form chunks
  P->>L: enrich(metadata per chunk)
  L-->>P: title/summary/keywords/intent
  P->>O: write Markdown output
  P->>O: write JSONL trace (provider-supported)

Stage Outputs

  1. Parse
    • Produces a Document with sections, paragraphs, and raw text.
  2. Segment
    • Extracts ordered Paragraph items in document order.
  3. Granularity plan (LLM)
    • Sends a compact sample of the document to the LLM to estimate:
      • topics covered
      • expected chunk count for retrieval
      • recommended atomization settings
  4. Atomize (adaptive)
    • Based on the plan, splits long paragraphs into smaller sentence groups before boundary detection.
    • This is especially important for PDFs where extraction often collapses multiple ideas into a single "paragraph."
  5. Boundary detect (embeddings)
    • Computes semantic similarity between adjacent paragraph/atom embeddings.
    • Proposes boundaries where similarity drops below similarity_threshold.
  6. Boundary validate (LLM)
    • Validates each proposed boundary for retrieval quality:
      • KEEP (split is correct)
      • MERGE (should be continuous)
      • ADJUST (boundary should move nearby)
  7. Chunk form
    • Groups paragraphs/atoms into chunks using final boundaries.
  8. Enrich (LLM)
    • Extracts structured metadata per chunk.
  9. Output
    • Writes chunked Markdown and provider-supported JSONL debug traces.
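
Stages 5 to 7 above can be illustrated with a small standalone sketch. It does not use Segmenta's internal API: the function names, the shape of the `decisions` mapping, and the toy embeddings are all assumptions made for illustration. It proposes a boundary wherever adjacent cosine similarity drops below the threshold, applies KEEP/MERGE/ADJUST verdicts, and groups units into chunks.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def propose_boundaries(embeddings, similarity_threshold):
    """Return each index i where unit i-1 and unit i fall below the threshold."""
    return [
        i
        for i in range(1, len(embeddings))
        if cosine(embeddings[i - 1], embeddings[i]) < similarity_threshold
    ]

def apply_decisions(proposed, decisions):
    """Apply validator verdicts (hypothetical shape).

    decisions maps a boundary index to ("KEEP",), ("MERGE",) or
    ("ADJUST", new_index). Boundaries without a verdict are kept.
    """
    final = set()
    for boundary in proposed:
        verdict = decisions.get(boundary, ("KEEP",))
        if verdict[0] == "KEEP":
            final.add(boundary)
        elif verdict[0] == "ADJUST":
            final.add(verdict[1])
        # MERGE: drop the boundary so the neighbours stay together.
    return sorted(final)

def form_chunks(units, boundaries):
    """Group ordered units into chunks, starting a new chunk at each boundary."""
    chunks, start = [], 0
    for boundary in boundaries:
        if boundary > start:
            chunks.append(units[start:boundary])
            start = boundary
    chunks.append(units[start:])
    return chunks

# Toy data: three "topics" along three axes (illustrative values only).
units = ["a1", "a2", "b1", "b2", "c1"]
embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0],
]

proposed = propose_boundaries(embeddings, similarity_threshold=0.5)
final = apply_decisions(proposed, {4: ("MERGE",)})
chunks = form_chunks(units, final)
print(proposed, final, chunks)
```

In this toy run, raising the threshold to 0.999 proposes a boundary between every pair of units, which matches the documented behaviour that higher values propose more boundaries.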

Output Format

Each chunk is emitted as Markdown with YAML frontmatter:

---
chunk_id: chunk_001
title: Example Chunk Title
summary: 1-2 sentence summary of the chunk.
intent: explains
questions:
- What question would a user ask to retrieve this chunk?
- Another question phrased differently using key terms?
keywords:
- keyword1
- keyword2
parent_section: Some Section
token_count: 58
---

Chunk content goes here...
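
For downstream indexing, a chunk file in this format can be split back into metadata and body. The following is a minimal standard-library sketch, not part of Segmenta: `parse_chunk` is a hypothetical helper that only handles the flat keys and simple lists shown above. A full YAML parser (for example PyYAML's `yaml.safe_load`) would be more robust for anything beyond that subset.

```python
def parse_chunk(text):
    """Split a chunk file into (metadata dict, body string).

    Only supports `key: value` pairs and `- item` lists, as in the
    example frontmatter; it is not a general YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    # Find the closing frontmatter delimiter.
    end = next(
        (i for i in range(1, len(lines)) if lines[i].strip() == "---"), None
    )
    if end is None:
        return {}, text

    meta, list_key = {}, None
    for line in lines[1:end]:
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("- ") and list_key is not None:
            meta[list_key].append(stripped[2:].strip())
            continue
        key, _, value = stripped.partition(":")
        key, value = key.strip(), value.strip()
        if value:
            meta[key] = int(value) if value.isdigit() else value
            list_key = None
        else:
            # A key with no inline value starts a list.
            meta[key] = []
            list_key = key
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body

sample = """---
chunk_id: chunk_001
title: Example Chunk Title
summary: 1-2 sentence summary of the chunk.
intent: explains
keywords:
- keyword1
- keyword2
parent_section: Some Section
token_count: 58
---

Chunk content goes here...
"""

meta, body = parse_chunk(sample)
print(meta["chunk_id"], meta["keywords"], meta["token_count"])
print(body)
```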

Reference: PDF Example (Distributed 12-Topic Test)

This repository uses a regression-style PDF designed to test topic separation when concepts are related but intentionally distributed.

Input Reality (PDF Parsing)

PDF extraction commonly yields a small number of long paragraphs, even when the source document is conceptually separated. Because boundaries can only fall between extracted units, this caps the number of chunk boundaries that can be proposed without additional processing.

Segmenta's atomize stage addresses this by splitting long paragraphs into smaller sentence groups before boundary detection.

Distributed_12_Topic_Semantic_Document.pdf
  -> extracted paragraphs: 7
  -> atomized sentence groups: 21
  -> final semantic chunks: 14
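
The idea behind atomization can be sketched in a few lines. This is an illustration under stated assumptions, not Segmenta's implementation: the regex sentence splitter is deliberately naive, and the parameter names `sentences_per_group` and `min_sentences` are stand-ins that mirror the `atomize_sentences_per_paragraph` and `atomize_min_sentences` settings.

```python
import re

def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

def atomize(paragraphs, sentences_per_group=2, min_sentences=6):
    """Split long paragraphs into sentence groups; keep short ones whole."""
    atoms = []
    for paragraph in paragraphs:
        sentences = split_sentences(paragraph)
        # Leave the paragraph intact when atomization is disabled
        # or the paragraph is too short to be worth splitting.
        if sentences_per_group <= 0 or len(sentences) < min_sentences:
            atoms.append(paragraph)
            continue
        for i in range(0, len(sentences), sentences_per_group):
            atoms.append(" ".join(sentences[i:i + sentences_per_group]))
    return atoms

long_paragraph = "One. Two. Three. Four. Five. Six."
short_paragraph = "Short one. Short two."
atoms = atomize([long_paragraph, short_paragraph])
print(atoms)
```

A production splitter would need to handle abbreviations, decimals, and quoted text, which this regex does not.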

Visual Chunking (PDF ↔ Chunks)

Side-by-side visualization: the source PDF with chunk-colored overlays (left) and the corresponding chunks with matching colors (right).

Segmenta chunking visualization: PDF highlights and chunk list

Example Output: 14 Chunks

Below is an example chunk set produced from Distributed_12_Topic_Semantic_Document.pdf (titles + token counts taken from generated output):

| Chunk | Title | Intent | Tokens |
| --- | --- | --- | --- |
| chunk_001 | Organizational Mobility Framework | explains | 102 |
| chunk_002 | Financial Planning in Workforce Mobility | explains | 35 |
| chunk_003 | Financial Discipline in Vendor Management | explains | 59 |
| chunk_004 | Employee Transition Support Policies | explains | 64 |
| chunk_005 | Relocation Success Factors | explains | 24 |
| chunk_006 | Domestic vs International Mobility Programs | compares | 36 |
| chunk_007 | Mitigating Compliance and Tax Communication Risks | warns | 27 |
| chunk_008 | Cross-Border Workforce Planning Risks | explains | 24 |
| chunk_009 | Employee Responsibility in Relocation | explains | 57 |
| chunk_010 | Promoting Accountability and Transparency | explains | 19 |
| chunk_011 | Change Management in Relocation Programs | explains | 32 |
| chunk_012 | Effective Communication in Organizational Change | explains | 45 |
| chunk_013 | Relocation Data Tracking and Improvement | explains | 57 |
| chunk_014 | Importance of Data in Mobility Programs | explains | 21 |

Topic Coverage Map

| Semantic area (PDF) | Chunk IDs |
| --- | --- |
| Governance + eligibility | chunk_001 |
| Financial planning | chunk_002 |
| Vendor management | chunk_003 |
| Housing assistance + settling-in services | chunk_004, chunk_005 |
| International compliance + tax treatment | chunk_006, chunk_007, chunk_008 |
| Employee responsibility + repayment obligations | chunk_009, chunk_010 |
| Change management + communication strategy | chunk_011, chunk_012 |
| Data tracking + continuous improvement | chunk_013, chunk_014 |

Configuration

Granularity Controls

  • similarity_threshold
    • Lower values propose fewer boundaries.
    • Higher values propose more boundaries.
  • atomize_sentences_per_paragraph
    • If > 0, long paragraphs can be split into sentence groups before boundary detection.
  • atomize_min_sentences
    • Only atomize paragraphs that meet or exceed this sentence count.
  • granularity_planning_enabled
    • If true, an initial LLM pass estimates expected chunk count and sets atomization parameters.
  • granularity_max_paragraphs
    • Limits how many paragraphs are sampled and sent to the planner.
  • granularity_max_chars_per_paragraph
    • Limits how much text per paragraph is included in the planner prompt.
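
To make the two planner limits concrete, here is a hedged sketch of how a compact sample could be built. How Segmenta actually selects paragraphs is not documented here; the evenly spaced selection and the `build_planner_sample` helper are assumptions for illustration only.

```python
def build_planner_sample(paragraphs, max_paragraphs=60, max_chars_per_paragraph=280):
    """Pick at most max_paragraphs evenly spaced paragraphs and truncate each.

    Assumption: even spacing across the document. A real planner might
    instead take the first N paragraphs or weight section openings.
    """
    if max_paragraphs <= 0 or not paragraphs:
        return []
    if len(paragraphs) <= max_paragraphs:
        picked = list(paragraphs)
    else:
        step = len(paragraphs) / max_paragraphs
        picked = [paragraphs[int(i * step)] for i in range(max_paragraphs)]
    # Truncate so the planner prompt stays small regardless of paragraph length.
    return [p[:max_chars_per_paragraph] for p in picked]

# Ten long synthetic paragraphs, sampled down to four short excerpts.
paragraphs = [f"p{i} " + "x" * 500 for i in range(10)]
sample = build_planner_sample(paragraphs, max_paragraphs=4, max_chars_per_paragraph=20)
print([s[:3] for s in sample])
```

With the defaults from the YAML example below (60 paragraphs, 280 characters each), the planner would see at most 16,800 characters of document text under this scheme.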

YAML Example

# segmenta.yaml
chunking:
  similarity_threshold: 0.5
  atomize_sentences_per_paragraph: 2
  atomize_min_sentences: 6
  granularity_planning_enabled: true
  granularity_max_paragraphs: 60
  granularity_max_chars_per_paragraph: 280

behavior:
  retry_attempts: 2
  fallback_enabled: true
  continue_on_error: false
  verbose: false

llm:
  model: gpt-4o
  temperature: 0.1

LLM Trace Logging (Prompt/Response Audit)

When using the built-in OpenAI-compatible provider, Segmenta writes a JSONL trace file in the output directory:

  • Segmenta_llm_debug_<input_stem>_<utc_timestamp>.jsonl

If granularity planning is enabled, Segmenta also writes a machine-readable plan:

  • Segmenta_granularity_plan_<input_stem>_<utc_timestamp>.json

Each llm_call record includes:

  • prompt
  • system_prompt
  • response
  • tokens_used
  • success / error

Example record:

{
  "type": "llm_call",
  "run_id": "MyDoc_20260205_001122",
  "timestamp": "2026-02-05T00:11:22.000000+00:00",
  "prompt": "...",
  "system_prompt": "...",
  "response": "...",
  "tokens_used": 452,
  "success": true,
  "error": null
}

Security note: these logs can contain sensitive document content. Store and handle them accordingly.
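
Because the trace is plain JSONL, it can be audited with the standard library. The sketch below is not part of Segmenta; `summarize_trace` is a hypothetical helper that relies only on the fields shown in the example record, and the sample lines (including the "timeout" error string) are made up for illustration.

```python
import json

def summarize_trace(lines):
    """Aggregate llm_call records from an iterable of JSONL lines."""
    summary = {"calls": 0, "failures": 0, "tokens_used": 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Skip any record that is not an LLM call.
        if record.get("type") != "llm_call":
            continue
        summary["calls"] += 1
        summary["tokens_used"] += record.get("tokens_used") or 0
        if not record.get("success", False):
            summary["failures"] += 1
    return summary

# Illustrative lines; a real file would be read with open(path) instead.
sample_lines = [
    '{"type": "llm_call", "tokens_used": 452, "success": true, "error": null}',
    '{"type": "llm_call", "tokens_used": 0, "success": false, "error": "timeout"}',
    "",
]
summary = summarize_trace(sample_lines)
print(summary)
```

To run it against a real trace, pass an open file handle, since iterating a file yields one line at a time.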

Metrics and Observability

Each chunk(...) call returns a SegmentaResult that includes:

  • chunks and output_path
  • warnings / errors
  • metrics (stage timings and counts)

Common metric keys:

  • parse_time, segment_time, atomize_time, boundary_detect_time, boundary_validate_time, enrich_time, output_time, total_time
  • paragraph_count, paragraphs_before_atomize, paragraphs_after_atomize
  • boundary_proposals_count, boundaries_kept, boundaries_merged

Security and Data Handling

  • Segmenta processes documents locally for parsing and embedding generation (Sentence Transformers by default).
  • Granularity planning, boundary validation, and chunk metadata enrichment send document text to the configured LLM endpoint.
  • Trace logs write prompt/response content to disk; treat output directories as sensitive data stores.

CLI

# Basic usage
segmenta document.pdf -o ./output --verbose

# Dry run (no LLM calls)
segmenta document.md -o ./output --dry-run

Provider Interop (OpenAI-Compatible Endpoints)

Segmenta can be pointed at OpenAI-compatible gateways (example: Groq) by using base_url:

import os
from segmenta import Segmenta, SegmentaConfig
from segmenta.llm import OpenAIProvider

llm = OpenAIProvider(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
    model="llama-3.3-70b-versatile",
)

segmenta = (
    Segmenta.builder()
    .with_config(SegmentaConfig(verbose=True))
    .with_llm_provider(llm)
    .build()
)

result = segmenta.chunk("document.pdf", output_dir="./output")

Extending Segmenta

Custom Parser

from segmenta.parsers.base import DocumentParser
from segmenta.parsers import ParserFactory

class DocxParser(DocumentParser):
    def supported_extensions(self):
        return [".docx"]

    def parse(self, file_path):
        ...

ParserFactory.register(".docx", DocxParser)

Custom LLM Provider

from segmenta.llm.base import LLMProvider, LLMResponse

class CustomProvider(LLMProvider):
    @property
    def model_name(self) -> str:
        return "custom-model"

    def complete(self, prompt: str, system_prompt=None) -> LLMResponse:
        ...

    def complete_json(self, prompt: str, system_prompt=None) -> dict:
        ...

License

Apache License 2.0

Key Dependencies

  • pymupdf (PDF parsing)
  • markdown-it-py (Markdown parsing)
  • sentence-transformers (embeddings)
  • openai (OpenAI-compatible LLM calls)
  • tiktoken (token counting)
