smantic

Structure-aware semantic chunking, minus the heavyweight stack.

Good retrieval starts with good chunks. A chunk that staples two unrelated ideas together poisons its embedding. A chunk that splits a thought in half loses the context that made it useful. Most splitters cut every N characters and hope for the best. smantic cuts at the seams the document already has.

It keeps code, tables, and formulas intact (splitting only the big ones, by AST or row group or equation), finds real topic boundaries inside prose using sentence embeddings, tracks the heading trail, and merges the runts. It does all of that with no torch, no transformers, and no paddle. The semantic model is all-MiniLM-L6-v2 running on plain onnxruntime plus the Rust tokenizers library, so the whole thing installs small and runs on any CPU.

(Yes, the e fell out of "semantic". That is the joke. It is also a hint: this is about structure plus semantics.)

It is the chunking half of a parse plus chunk pipeline. Its sibling NoPaddle turns a PDF into typed regions; smantic turns those regions (or any Markdown) into chunks. They snap together.

Install

pip install "smantic[onnx]"     # with semantic boundary detection (recommended)
pip install smantic             # core only: structural chunking, no model

The [onnx] extra pulls onnxruntime, tokenizers, and huggingface_hub. The embedding model (~90 MB) is downloaded from the Hugging Face Hub the first time you chunk, then cached on disk. Without the extra, smantic still chunks on structural boundaries alone (headings, block edges, size limits); it just skips the semantic ones.

Quick start

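from pathlib import Path

import smantic

# Read the Markdown source and chunk it with the default settings.
chunks = smantic.chunk_markdown(Path("notes.md").read_text(encoding="utf-8"))

for c in chunks:
    # One summary line per chunk, then a short preview of its text.
    print(c.sequence, c.dominant_type, c.token_count, c.metadata.get("nearest_heading"))
    print(c.content[:120])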

Already have a parsed document? Feed it straight in:

import smantic

doc = smantic.from_docling_json(open("parsed.json").read())
chunks = smantic.chunk_document(doc, max_tokens=500, overlap_tokens=50)

Pairs with NoPaddle

import nopaddle, smantic

doc = nopaddle.parse_pdf("paper.pdf")          # PDF  -> typed regions
chunks = smantic.chunk_document(smantic.from_nopaddle(doc))   # regions -> chunks

from_nopaddle reads a NoPaddle ParsedDocument (object, dict, or JSON) with no conversion step: the two projects share the same region shape, and smantic's IR accepts NoPaddle's regions key directly.

Command line

smantic notes.md                       # JSONL, one chunk per line
smantic notes.md --format summary      # human-readable table
smantic parsed.json --input-format json --format json -o chunks.json
cat notes.md | smantic - --max-tokens 400 --overlap 40

--input-format defaults to auto (by file extension: .json is parsed JSON, everything else is Markdown). --format is jsonl (default), json, or summary.
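The auto rule can be sketched in a few lines. This is an illustration of the behaviour described above, not smantic's actual CLI code: the helper name guess_input_format, the "markdown" label, and the case-insensitive suffix match are all assumptions.

```python
from pathlib import Path


def guess_input_format(path: str, input_format: str = "auto") -> str:
    """Hypothetical helper: pick "json" or "markdown" for an input path."""
    if input_format != "auto":
        # An explicit --input-format always wins over the extension.
        return input_format
    # "-" (stdin) has no suffix, so it falls through to Markdown.
    # Assumption: the suffix is matched case-insensitively.
    return "json" if Path(path).suffix.lower() == ".json" else "markdown"
```

Under these assumptions parsed.json resolves to json, while notes.md and stdin resolve to Markdown.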

How it works

smantic walks the document once and classifies every element:

  1. Atomic blocks (code, table, formula, picture, ...) stay intact. A block over the size limit is split into a parent plus children, by Python or JavaScript AST for code, by row group (headers repeated) for tables, and by environment or step for formula derivations. Visual blocks become a chunk only when they carry a caption or alt text worth retrieving.

  2. Prose runs through three-tier boundary detection:

    • hard boundaries always cut: section headings, and transitions between incompatible element types.
    • soft boundaries cut when the chunk is already big enough: a drop in sentence-to-sentence cosine similarity (the semantic part), or a new paragraph.
    • emergency boundaries cut anywhere once a chunk would blow past max_tokens.

  3. Headings are accumulated into a trail (heading_trail, nearest_heading, heading_level) and folded into the first chunk under them, so the heading is searchable without wasting a chunk on it.

  4. Cleanups: consecutive prose chunks share a configurable token overlap for context continuity; chunks below a useful minimum get merged into a same-type neighbour; parser-artifact duplicates (a sentence repeated by a multi-column OCR pass, say) get collapsed; and low-value backmatter (References, Acknowledgments, Funding, and friends) is skipped.
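To make the row-group idea from step 1 concrete, here is a minimal sketch for a Markdown pipe table. It is not smantic's implementation: the function name split_table_by_row_group is hypothetical, and it groups by a fixed row count where the real chunker works against a token limit.

```python
def split_table_by_row_group(table: str, rows_per_group: int = 2) -> list[str]:
    """Hypothetical sketch: split a pipe table into groups, repeating the header."""
    lines = [ln for ln in table.strip().splitlines() if ln.strip()]
    # Assumption: the first two lines are the header row and the |---| separator.
    header, rows = lines[:2], lines[2:]
    if len(rows) <= rows_per_group:
        return ["\n".join(lines)]  # small enough to stay atomic
    groups = []
    for start in range(0, len(rows), rows_per_group):
        # Every child carries the header, so it stays readable on its own.
        groups.append("\n".join(header + rows[start:start + rows_per_group]))
    return groups
```

Repeating the header costs a few tokens per child, but each row group can then be retrieved and understood without its siblings.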

When the embedding model is not installed, the soft semantic boundary is simply skipped. Everything else still works.
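The soft and emergency tiers, and the no-model fallback, can be sketched as follows. This is a simplified illustration rather than the library's code: hard boundaries are left out, the names split_prose and min_tokens are hypothetical, and reading the threshold as "cut when similarity falls below it" is an assumption. It uses plain lists so it runs without any dependencies.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def split_prose(sentences, token_counts, embeddings=None,
                max_tokens=500, min_tokens=100, threshold=0.5):
    """Hypothetical sketch: group sentences using soft and emergency cuts.

    embeddings=None models the "no model installed" case: the semantic
    check is skipped and only the size limit applies.
    """
    chunks, current, size = [], [], 0
    for i, (sentence, n) in enumerate(zip(sentences, token_counts)):
        if current:
            # Emergency: adding this sentence would exceed max_tokens.
            emergency = size + n > max_tokens
            # Soft: the chunk is already big enough and similarity drops.
            soft = (
                embeddings is not None
                and size >= min_tokens
                and cosine(embeddings[i - 1], embeddings[i]) < threshold
            )
            if emergency or soft:
                chunks.append(current)
                current, size = [], 0
        current.append(sentence)
        size += n
    if current:
        chunks.append(current)
    return chunks
```

The min_tokens guard is what keeps a single off-topic sentence from producing a tiny chunk: a similarity drop only counts once the current chunk is worth closing.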

Input formats

| Source | Helper | Notes |
|---|---|---|
| Markdown text | from_markdown(text) | headings, code fences, pipe tables, $$/\[ math, lists, images, prose |
| NoPaddle output | from_nopaddle(doc) | object, dict, or JSON; reads the regions key natively |
| Docling-style JSON | from_docling_json(text) | {"pages": [{"elements": [...]}]} |
| The IR directly | from_dict(d) / from_json(s) | smantic's own shape |

All of them build a smantic.Document, which is what the chunker consumes.

Output

chunk_document returns a list of Chunk objects. chunk.to_dict() gives a JSON-ready dict:

{
  "content": "...",              # the chunk text
  "token_count": 312,
  "page_numbers": [4],
  "span_start": 0, "span_end": 1840,
  "chunking_method": "semantic", # semantic | atomic_block | ast_split | row_group | ...
  "dominant_type": "prose",      # prose | code_block | table_block | formula_block | visual_block
  "has_code": false, "has_math": false, "has_table": false,
  "parent_chunk_id": null,       # set by you if you persist the parent->child links
  "block_sequence": null,        # order of a child within a split block
  "metadata": {                  # heading_trail, nearest_heading, section_type, timecodes, ...
    "heading_trail": ["Methods", "Training"],
    "nearest_heading": "Training"
  },
  "sequence": 7                  # position in the returned list
}
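The heading fields in metadata follow from a simple stack discipline. The sketch below shows one way such a trail could be maintained; update_trail and heading_metadata are hypothetical names, and smantic's internals may differ.

```python
def update_trail(trail, level, title):
    """Hypothetical sketch: return the heading trail after a new heading."""
    # A new heading closes every open heading at the same or a deeper level.
    kept = [(lvl, t) for lvl, t in trail if lvl < level]
    return kept + [(level, title)]


def heading_metadata(trail):
    """Derive the three heading fields from a trail of (level, title) pairs."""
    if not trail:
        return {"heading_trail": [], "nearest_heading": None, "heading_level": None}
    return {
        "heading_trail": [title for _, title in trail],
        "nearest_heading": trail[-1][1],
        "heading_level": trail[-1][0],
    }
```

Feeding it a level-1 "Methods" followed by a level-2 "Training" gives the trail shown in the example above; a later level-1 heading resets the trail to that heading alone.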

Self-host (FastAPI + Docker)

pip install "smantic[onnx,serve]"
smantic-serve                                  # serves on 0.0.0.0:8000
curl -s localhost:8000/chunk \
  -H 'content-type: application/json' \
  -d '{"text": "# Title\n\nSome prose to chunk."}' | jq

POST /chunk takes either text (Markdown) or document (a parsed dict), plus optional max_tokens / overlap_tokens. There is a GET /healthz and a GET /info. One chunker is kept warm across requests; set SMANTIC_RELEASE_AFTER_REQUEST=1 to free the model after each call instead.
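For callers who prefer Python to curl, the request can be built with the standard library alone. This is a sketch under stated assumptions: build_chunk_request is a hypothetical helper, and it treats text and document as mutually exclusive, which the description above suggests but does not spell out. The response shape is not documented here, so the sketch stops at building the request.

```python
import json
import urllib.request


def build_chunk_request(base_url, text=None, document=None,
                        max_tokens=None, overlap_tokens=None):
    """Hypothetical helper: build a POST /chunk request for smantic-serve."""
    # Assumption: exactly one of text or document should be sent.
    if (text is None) == (document is None):
        raise ValueError("pass exactly one of text or document")
    payload = {"text": text} if text is not None else {"document": document}
    # Optional knobs are sent only when the caller sets them.
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    if overlap_tokens is not None:
        payload["overlap_tokens"] = overlap_tokens
    return urllib.request.Request(
        f"{base_url}/chunk",
        data=json.dumps(payload).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )
```

With smantic-serve running, pass the result to urllib.request.urlopen and decode the JSON body.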

Or run the image:

docker build -f docker/Dockerfile -t smantic .
docker run -p 8000:8000 -v smantic-models:/models smantic

Configuration

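Every knob has an environment variable and a code path (for example, max_tokens and overlap_tokens can be passed to chunk_document). Most variables use the SMANTIC_ prefix; MODEL_CACHE_DIR is the exception: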

| Env | Default | Meaning |
|---|---|---|
| SMANTIC_MAX_TOKENS | 500 | soft max tokens per chunk |
| SMANTIC_OVERLAP_TOKENS | 50 | token overlap between prose chunks |
| SMANTIC_BOUNDARY_THRESHOLD | 0.5 | cosine threshold for a soft semantic cut |
| SMANTIC_EMBED_REPO | sentence-transformers/all-MiniLM-L6-v2 | embedding model repo |
| MODEL_CACHE_DIR | ~/.cache/smantic/models | where the model is cached |
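If you want the same defaults in your own wrapper, the numeric variables can be mirrored like this. It is a sketch only: ChunkSettings and settings_from_env are hypothetical names and not part of smantic, which does its own environment handling.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkSettings:
    """Hypothetical container mirroring the documented defaults."""
    max_tokens: int = 500
    overlap_tokens: int = 50
    boundary_threshold: float = 0.5


def settings_from_env(env=None) -> ChunkSettings:
    """Read the numeric SMANTIC_* variables, falling back to the defaults."""
    env = os.environ if env is None else env
    return ChunkSettings(
        max_tokens=int(env.get("SMANTIC_MAX_TOKENS", 500)),
        overlap_tokens=int(env.get("SMANTIC_OVERLAP_TOKENS", 50)),
        boundary_threshold=float(env.get("SMANTIC_BOUNDARY_THRESHOLD", 0.5)),
    )
```

Passing a plain dict as env keeps the function easy to test without touching the real environment.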

Why it is light

No torch. No transformers. No paddle. The core is numpy plus stdlib; the [onnx] extra adds onnxruntime and the Rust tokenizers library and nothing else. The embedding model is the 384-dim all-MiniLM-L6-v2 ONNX graph, run with a host-side mean-pool, so there is no deep-learning framework in the dependency tree at all.
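"Host-side mean-pool" means the per-token vectors that come out of the ONNX graph are averaged into one sentence vector in ordinary Python code instead of inside the model. A minimal sketch of masked mean pooling follows. It uses plain lists to stay dependency-free (numpy would be the natural choice in practice), the name mean_pool is hypothetical, and whether smantic normalizes the result afterwards is not stated here.

```python
def mean_pool(token_embeddings, attention_mask):
    """Hypothetical sketch: average token vectors where the mask is 1.

    token_embeddings: list of per-token vectors (at least one token).
    attention_mask:   list of 0/1 flags, 0 marking padding tokens.
    """
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                total[j] += value
    # Guard against an all-padding input to avoid dividing by zero.
    count = max(count, 1)
    return [value / count for value in total]
```

The mask matters because batches are padded to a common length: averaging over the padding positions would drag every short sentence toward the same vector.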

Status

Alpha. The chunker is ported from a production ingestion pipeline and is well covered by tests. The core suite runs offline against a graceful fallback, so it is fast and needs no model download; the real-embedder path is exercised by the slow-marked tests. The Markdown parser is a pragmatic block segmenter, not a full CommonMark implementation: it handles the constructs that matter for chunking and leaves inline formatting untouched.

License

Apache-2.0.
