Structure-aware semantic chunking, minus the heavyweight stack.

smantic

Good retrieval starts with good chunks. A chunk that staples two unrelated ideas together poisons its embedding. A chunk that splits a thought in half loses the context that made it useful. Most splitters cut every N characters and hope for the best. smantic cuts at the seams the document already has.

It keeps code, tables, and formulas intact (splitting only the big ones, by AST or row group or equation), finds real topic boundaries inside prose using sentence embeddings, tracks the heading trail, and merges the runts. It does all of that with no torch, no transformers, and no paddle. The semantic model is all-MiniLM-L6-v2 running on plain onnxruntime plus the Rust tokenizers library, so the whole thing installs small and runs on any CPU.

(Yes, the e fell out of "semantic". That is the joke. It is also a hint: this is about structure plus semantics.)

It is the chunking half of a parse plus chunk pipeline. Its sibling NoPaddle turns a PDF into typed regions; smantic turns those regions (or any Markdown) into chunks. They snap together.

Install

pip install "smantic[onnx]"     # with semantic boundary detection (recommended)
pip install smantic             # core only: structural chunking, no model

The [onnx] extra pulls onnxruntime, tokenizers, and huggingface_hub. The embedding model (~90 MB) is downloaded from the Hugging Face Hub the first time you chunk, then cached on disk. Without the extra, smantic still chunks on structural boundaries alone (headings, block edges, size limits); it just skips the semantic ones.
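If a script needs to know up front which mode it will get, one option is to check whether the three packages named above are importable. This is a minimal sketch, not smantic's own detection logic; `modules_available` and `ONNX_EXTRA_MODULES` are hypothetical names introduced here for illustration.

```python
import importlib.util

# Import names of the packages the [onnx] extra is documented to pull in.
ONNX_EXTRA_MODULES = ("onnxruntime", "tokenizers", "huggingface_hub")


def modules_available(names):
    """Return True only if every named top-level module can be located."""
    return all(importlib.util.find_spec(name) is not None for name in names)


def describe_mode():
    """Report which chunking mode to expect (an assumption, not smantic's check)."""
    if modules_available(ONNX_EXTRA_MODULES):
        return "structural + semantic boundaries"
    return "structural boundaries only"
```

Calling `describe_mode()` before chunking makes it easier to spot an environment that silently fell back to structural-only cuts.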

Quick start

Feed it a Markdown string (or file) and smantic works out the structure for you:

import smantic

md = r"""
# Mixed Precision Training

Training in FP16 halves memory but underflows small gradients; loss scaling
multiplies the loss before backprop so they stay representable.

| Format | Bits | Exponent | Mantissa |
|--------|------|----------|----------|
| FP32   | 32   | 8        | 23       |
| FP16   | 16   | 5        | 10       |

The scaled loss keeps the same direction:

$$ \mathcal{L}_{scaled} = s \cdot \mathcal{L} $$
"""

chunks = smantic.chunk_markdown(md)

for c in chunks:
    print(c.sequence, c.dominant_type, f"{c.token_count}t", "|", c.metadata.get("nearest_heading"))

# Output:
# 0 prose 73t | Mixed Precision Training
# 1 table_block 39t | Mixed Precision Training
# 2 formula_block 28t | Mixed Precision Training

The prose, table, and formula each became their own chunk. The heading did not get a chunk of its own: it was folded into the first chunk and recorded in every chunk's heading trail. Prose is cut on semantic boundaries, while the table and formula stay intact as atomic blocks. Each item is a Chunk; chunk.to_dict() gives a JSON-ready record (here the table chunk):

>>> chunks[1].to_dict()
{
  "content": "| Format | Bits | Exponent | Mantissa |\n|--------|---...",
  "token_count": 39,
  "page_numbers": [1],
  "span_start": 0, "span_end": 159,
  "chunking_method": "atomic_block",
  "dominant_type": "table_block",
  "parent_chunk_id": None, "block_sequence": None,
  "has_code": False, "has_math": False, "has_table": True,
  "metadata": {
    "element_type": "table",
    "heading_trail": ["Mixed Precision Training"],
    "heading_level": 1,
    "nearest_heading": "Mixed Precision Training",
  },
  "sequence": 1,
}

Reading from a file is the same: smantic.chunk_markdown(open("notes.md").read()).

Already have a parsed document? Feed it straight in:

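from pathlib import Path

import smantic

# Read the parsed JSON as text, then convert it to a smantic Document.
doc = smantic.from_docling_json(Path("parsed.json").read_text(encoding="utf-8"))
chunks = smantic.chunk_document(doc, max_tokens=500, overlap_tokens=50)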

Pairs with NoPaddle

import nopaddle, smantic

doc = nopaddle.parse_pdf("paper.pdf")          # PDF  -> typed regions
chunks = smantic.chunk_document(smantic.from_nopaddle(doc))   # regions -> chunks

from_nopaddle reads a NoPaddle ParsedDocument (object, dict, or JSON) with no conversion step: the two projects share the same region shape, and smantic's IR accepts NoPaddle's regions key directly.

A table NoPaddle detects survives intact: it lands in one table_block chunk (or, if it is huge, splits by row group with the header repeated), never shredded into prose:

chunks = smantic.chunk_document(smantic.from_nopaddle(doc))

tables = [c for c in chunks if c.dominant_type == "table_block"]
print(tables[0].has_table)   # True
print(tables[0].content)     # the GFM table, kept whole
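To make "splits by row group with the header repeated" concrete, here is a minimal sketch of the idea on a GFM pipe table. It groups by a fixed row count, which is an assumption made for brevity (smantic's own splitter presumably works against its token limit), and `split_table_by_row_group` is a hypothetical helper, not part of the library.

```python
def split_table_by_row_group(table_md, rows_per_group):
    """Split a GFM pipe table into smaller tables, repeating the header rows."""
    if rows_per_group < 1:
        raise ValueError("rows_per_group must be at least 1")
    lines = [ln for ln in table_md.strip().splitlines() if ln.strip()]
    # First line is the header, second is the |---|---| separator.
    header, separator, body = lines[0], lines[1], lines[2:]
    if not body:
        return ["\n".join(lines)]
    groups = []
    for start in range(0, len(body), rows_per_group):
        rows = body[start:start + rows_per_group]
        # Every child table carries its own copy of the header.
        groups.append("\n".join([header, separator, *rows]))
    return groups
```

Because each piece repeats the header, every child chunk stays readable on its own when it is retrieved without its siblings.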

Command line

smantic notes.md                       # JSONL, one chunk per line
smantic notes.md --format summary      # human-readable table
smantic parsed.json --input-format json --format json -o chunks.json
cat notes.md | smantic - --max-tokens 400 --overlap 40

--input-format defaults to auto (by file extension: .json is parsed JSON, everything else is Markdown). --format is jsonl (default), json, or summary.
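The JSONL output is easy to consume downstream. The sketch below tallies chunks by type; it assumes each line carries the same fields as chunk.to_dict() (in particular dominant_type), which the text above does not state explicitly. `read_jsonl` and `count_by_type` are hypothetical helper names.

```python
import json


def read_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def count_by_type(records):
    """Count records per dominant_type (assumed to be present on each record)."""
    counts = {}
    for rec in records:
        kind = rec.get("dominant_type", "unknown")
        counts[kind] = counts.get(kind, 0) + 1
    return counts
```

With the chunks saved to a file, `count_by_type(read_jsonl(open_file))` gives a quick check on how much of a document ended up as prose versus atomic blocks.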

How it works

smantic walks the document once and classifies every element:

  1. Atomic blocks (code, table, formula, picture, ...) stay intact. A block over the size limit is split into a parent plus children, by Python or JavaScript AST for code, by row group (headers repeated) for tables, and by environment or step for formula derivations. Visual blocks become a chunk only when they carry a caption or alt text worth retrieving.

  2. Prose runs through three-tier boundary detection:

    • hard boundaries always cut: section headings, and transitions between incompatible element types.
    • soft boundaries cut when the chunk is already big enough: a drop in sentence-to-sentence cosine similarity (the semantic part), or a new paragraph.
    • emergency boundaries cut anywhere once a chunk would blow past max_tokens.

  3. Headings are accumulated into a trail (heading_trail, nearest_heading, heading_level) and folded into the first chunk under them, so the heading is searchable without wasting a chunk on it.

  4. Cleanups: consecutive prose chunks share a configurable token overlap for context continuity; chunks below a useful minimum get merged into a same-type neighbour; parser-artifact duplicates (a sentence repeated by a multi-column OCR pass, say) get collapsed; and low-value backmatter (References, Acknowledgments, Funding, and friends) is skipped.

When the embedding model is not installed, the soft semantic boundary is simply skipped. Everything else still works.
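The soft and emergency tiers can be sketched in a few lines. This is an illustration under stated assumptions, not smantic's implementation: a "drop" in similarity is modelled as falling below a fixed threshold, "big enough" is a hypothetical `min_tokens` parameter, the emergency cut lands on a sentence edge rather than anywhere, and passing `embeddings=None` stands in for the model being absent.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def chunk_sentences(sentences, token_counts, embeddings=None,
                    max_tokens=500, min_tokens=100, threshold=0.5):
    """Group sentences into chunks using soft (semantic) and emergency boundaries."""
    chunks, current, size = [], [], 0
    for i, sentence in enumerate(sentences):
        n = token_counts[i]
        if current:
            # Emergency boundary: this sentence would push the chunk past the limit.
            emergency = size + n > max_tokens
            # Soft boundary: only when embeddings exist and the chunk is big enough.
            soft = (
                embeddings is not None
                and size >= min_tokens
                and cosine(embeddings[i - 1], embeddings[i]) < threshold
            )
            if emergency or soft:
                chunks.append(current)
                current, size = [], 0
        current.append(sentence)
        size += n
    if current:
        chunks.append(current)
    return chunks
```

With `embeddings=None` the function degrades to size-only cuts, which mirrors the fallback described above.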

Input formats

| Source | Helper | Notes |
|--------|--------|-------|
| Markdown text | from_markdown(text) | headings, code fences, pipe tables, $$/\[ math, lists, images, prose |
| NoPaddle output | from_nopaddle(doc) | object, dict, or JSON; reads the regions key natively |
| Docling-style JSON | from_docling_json(text) | {"pages": [{"elements": [...]}]} |
| The IR directly | from_dict(d) / from_json(s) | smantic's own shape |

All of them build a smantic.Document, which is what the chunker consumes.

Output

chunk_document returns a list of Chunk objects. chunk.to_dict() gives a JSON-ready dict:

{
  "content": "...",              # the chunk text
  "token_count": 312,
  "page_numbers": [4],
  "span_start": 0, "span_end": 1840,
  "chunking_method": "semantic", # semantic | atomic_block | ast_split | row_group | ...
  "dominant_type": "prose",      # prose | code_block | table_block | formula_block | visual_block
  "has_code": false, "has_math": false, "has_table": false,
  "parent_chunk_id": null,       # set by you if you persist the parent->child links
  "block_sequence": null,        # order of a child within a split block
  "metadata": {                  # heading_trail, nearest_heading, section_type, timecodes, ...
    "heading_trail": ["Methods", "Training"],
    "nearest_heading": "Training"
  },
  "sequence": 7                  # position in the returned list
}
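The heading_trail and nearest_heading fields in the record above behave like a stack of open headings. One way such a trail could be maintained is sketched below; `update_heading_trail` and `heading_metadata` are hypothetical names, and the rule that a new heading closes every heading at the same or a deeper level is an assumption about how the trail is built.

```python
def update_heading_trail(trail, level, title):
    """Return a new trail after a heading of the given level (1 = top level)."""
    # Keep only true ancestors: headings with a strictly smaller level number.
    ancestors = [(lvl, text) for lvl, text in trail if lvl < level]
    return ancestors + [(level, title)]


def heading_metadata(trail):
    """Build the heading fields of a chunk's metadata from the current trail."""
    if not trail:
        return {"heading_trail": [], "nearest_heading": None, "heading_level": None}
    level, title = trail[-1]
    return {
        "heading_trail": [text for _, text in trail],
        "nearest_heading": title,
        "heading_level": level,
    }
```

Feeding it a level-1 "Methods" heading followed by a level-2 "Training" heading yields the trail ["Methods", "Training"] with nearest heading "Training", matching the example record.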

Self-host (FastAPI + Docker)

pip install "smantic[onnx,serve]"
smantic-serve                                  # serves on 0.0.0.0:8000
curl -s localhost:8000/chunk \
  -H 'content-type: application/json' \
  -d '{"text": "# Title\n\nSome prose to chunk."}' | jq

POST /chunk takes either text (Markdown) or document (a parsed dict), plus optional max_tokens / overlap_tokens. There is a GET /healthz and a GET /info. One chunker is kept warm across requests; set SMANTIC_RELEASE_AFTER_REQUEST=1 to free the model after each call instead.
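The same call can be made from Python with only the standard library. This is a sketch: the request fields (text, max_tokens, overlap_tokens) come from the description above, but the shape of the response is not documented here, so `post_chunk` simply returns the decoded JSON. Both function names and the default `base_url` are illustrative.

```python
import json
import urllib.request


def build_chunk_request(text, base_url="http://localhost:8000",
                        max_tokens=None, overlap_tokens=None):
    """Build a POST /chunk request carrying Markdown text and optional limits."""
    payload = {"text": text}
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    if overlap_tokens is not None:
        payload["overlap_tokens"] = overlap_tokens
    return urllib.request.Request(
        base_url.rstrip("/") + "/chunk",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_chunk(text, **kwargs):
    """Send the request and return the decoded JSON body (response shape assumed unknown)."""
    with urllib.request.urlopen(build_chunk_request(text, **kwargs)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a running server.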

Or run the image:

docker build -f docker/Dockerfile -t smantic .
docker run -p 8000:8000 -v smantic-models:/models smantic

Configuration

Every knob has an env var (SMANTIC_*, plus MODEL_CACHE_DIR for the cache location) and a code path:

| Env | Default | Meaning |
|-----|---------|---------|
| SMANTIC_MAX_TOKENS | 500 | soft max tokens per chunk |
| SMANTIC_OVERLAP_TOKENS | 50 | token overlap between prose chunks |
| SMANTIC_BOUNDARY_THRESHOLD | 0.5 | cosine threshold for a soft semantic cut |
| SMANTIC_EMBED_REPO | sentence-transformers/all-MiniLM-L6-v2 | embedding model repo |
| MODEL_CACHE_DIR | ~/.cache/smantic/models | where the model is cached |
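For wrapper scripts that want to see the effective settings, the variables and defaults listed above can be mirrored like this. It is a sketch of reading the environment, not smantic's own loader; `load_settings` and the keys of the returned dict are hypothetical.

```python
import os


def load_settings(env=None):
    """Resolve the documented env vars, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "max_tokens": int(env.get("SMANTIC_MAX_TOKENS", 500)),
        "overlap_tokens": int(env.get("SMANTIC_OVERLAP_TOKENS", 50)),
        "boundary_threshold": float(env.get("SMANTIC_BOUNDARY_THRESHOLD", 0.5)),
        "embed_repo": env.get(
            "SMANTIC_EMBED_REPO", "sentence-transformers/all-MiniLM-L6-v2"
        ),
        # Expand "~" so the default cache path resolves to the user's home.
        "model_cache_dir": os.path.expanduser(
            env.get("MODEL_CACHE_DIR", "~/.cache/smantic/models")
        ),
    }
```

Passing a plain dict as `env` keeps the function testable without touching the real process environment.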

Why it is light

No torch. No transformers. No paddle. The core is numpy plus stdlib; the [onnx] extra adds onnxruntime, the Rust tokenizers library, and huggingface_hub for the model download. The embedding model is the 384-dim all-MiniLM-L6-v2 ONNX graph, run with a host-side mean-pool, so there is no deep-learning framework in the dependency tree at all.
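A host-side mean-pool is small enough to show in full. The sketch below assumes the ONNX graph returns one vector per token and that the tokenizer supplies an attention mask marking padding with 0. It uses plain Python lists for readability (numpy would be the natural choice in practice), and the L2 normalisation step is an optional extra that may or may not match what smantic does.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                total[j] += value
    if count == 0:
        return total  # all padding: return the zero vector
    return [value / count for value in total]


def l2_normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector] if norm else list(vector)
```

Masking matters because padded positions would otherwise drag short sentences toward whatever vector the model emits for padding.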

Status

Alpha. The chunker is ported from a production ingestion pipeline and is well covered by tests. The core suite runs offline against a graceful fallback, so it is fast and needs no model download; the real-embedder path is exercised by the slow-marked tests. The Markdown parser is a pragmatic block segmenter, not a full CommonMark implementation: it handles the constructs that matter for chunking and leaves inline formatting untouched.

License

Apache-2.0.
