Maintainers

bei0001

smantic

Structure-aware semantic chunking, minus the heavyweight stack.

Good retrieval starts with good chunks. A chunk that staples two unrelated ideas together poisons its embedding. A chunk that splits a thought in half loses the context that made it useful. Most splitters cut every N characters and hope for the best. smantic cuts at the seams the document already has.

It keeps code, tables, and formulas intact (splitting only the big ones, by AST or row group or equation), finds real topic boundaries inside prose using sentence embeddings, tracks the heading trail, and merges the runts. It does all of that with no torch, no transformers, and no paddle. The semantic model is all-MiniLM-L6-v2 running on plain onnxruntime plus the Rust tokenizers library, so the whole thing installs small and runs on any CPU.

(Yes, the e fell out of "semantic". That is the joke. It is also a hint: this is about structure plus semantics.)

It is the chunking half of a parse plus chunk pipeline. Its sibling NoPaddle turns a PDF into typed regions; smantic turns those regions (or any Markdown) into chunks. They snap together.

Install

pip install "smantic[onnx]"     # with semantic boundary detection (recommended)
pip install smantic             # core only: structural chunking, no model

The [onnx] extra pulls onnxruntime, tokenizers, and huggingface_hub. The embedding model (~90 MB) is downloaded from the Hugging Face Hub the first time you chunk, then cached on disk. Without the extra, smantic still chunks on structural boundaries alone (headings, block edges, size limits); it just skips the semantic ones.
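
A quick way to tell which mode you will get is to check whether the optional packages are importable. This is a minimal sketch, not part of smantic: the helper name `missing_onnx_packages` is hypothetical, and it assumes the three packages listed above import under the names `onnxruntime`, `tokenizers`, and `huggingface_hub`.

```python
from importlib.util import find_spec

# Import names assumed for the packages the [onnx] extra is described as pulling in.
ONNX_EXTRA_PACKAGES = ("onnxruntime", "tokenizers", "huggingface_hub")


def missing_onnx_packages(packages=ONNX_EXTRA_PACKAGES):
    """Return the names in `packages` that cannot be imported in this environment."""
    return [name for name in packages if find_spec(name) is None]


missing = missing_onnx_packages()
if missing:
    print("structural chunking only; missing:", ", ".join(missing))
else:
    print("the packages for semantic boundary detection are importable")
```

The check only looks the packages up; it does not import them or download the model.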

Quick start

Feed it a Markdown string (or file) and smantic works out the structure for you:

import smantic

md = r"""
# Mixed Precision Training

Training in FP16 halves memory but underflows small gradients; loss scaling
multiplies the loss before backprop so they stay representable.

| Format | Bits | Exponent | Mantissa |
|--------|------|----------|----------|
| FP32   | 32   | 8        | 23       |
| FP16   | 16   | 5        | 10       |

The scaled loss keeps the same direction:

$$ \mathcal{L}_{scaled} = s \cdot \mathcal{L} $$
"""

chunks = smantic.chunk_markdown(md)

for c in chunks:
    print(c.sequence, c.dominant_type, f"{c.token_count}t", "|", c.metadata.get("nearest_heading"))

0 prose 73t | Mixed Precision Training
1 table_block 39t | Mixed Precision Training
2 formula_block 28t | Mixed Precision Training

The prose, table, and formula each became their own chunk, and the heading was folded into the first one instead of taking a chunk of its own. Prose is cut on semantic boundaries, the table and formula stay intact as atomic blocks, and every chunk carries the heading trail. Each item is a Chunk; chunk.to_dict() gives a JSON-ready record (here the table chunk):

>>> chunks[1].to_dict()
{
  "content": "| Format | Bits | Exponent | Mantissa |\n|--------|---...",
  "token_count": 39,
  "page_numbers": [1],
  "span_start": 0, "span_end": 159,
  "chunking_method": "atomic_block",
  "dominant_type": "table_block",
  "parent_chunk_id": None, "block_sequence": None,
  "has_code": False, "has_math": False, "has_table": True,
  "metadata": {
    "element_type": "table",
    "heading_trail": ["Mixed Precision Training"],
    "heading_level": 1,
    "nearest_heading": "Mixed Precision Training",
  },
  "sequence": 1,
}

Reading from a file is the same: smantic.chunk_markdown(open("notes.md").read()).
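
Because `to_dict()` returns plain JSON-ready data, records like the one above can be written straight to JSONL for an indexing job. The sketch below does not call smantic: it uses hand-written dicts that mimic a few of the fields shown above, and `to_jsonl` is a hypothetical helper, not a library function.

```python
import json


def to_jsonl(records):
    """Serialise an iterable of JSON-ready dicts, one compact object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False, sort_keys=True) for r in records)


# Hand-written stand-ins for chunk.to_dict() output (only a few fields kept).
records = [
    {"sequence": 0, "dominant_type": "prose", "token_count": 73},
    {"sequence": 1, "dominant_type": "table_block", "token_count": 39},
    {"sequence": 2, "dominant_type": "formula_block", "token_count": 28},
]

# Keep only the structured blocks, e.g. to index tables and formulas separately.
structured = [r for r in records if r["dominant_type"] != "prose"]
print(to_jsonl(structured))
```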

Already have a parsed document? Feed it straight in:

import smantic

doc = smantic.from_docling_json(open("parsed.json").read())
chunks = smantic.chunk_document(doc, max_tokens=500, overlap_tokens=50)
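
To make the two keyword arguments concrete, here is a toy illustration of what a token overlap between consecutive chunks means. It is not smantic's algorithm: it treats whitespace-separated words as tokens and cuts at fixed sizes, whereas smantic cuts on structural and semantic boundaries. `window_with_overlap` is a hypothetical name.

```python
def window_with_overlap(tokens, max_tokens, overlap_tokens):
    """Cut `tokens` into windows of at most `max_tokens`, each window
    repeating the last `overlap_tokens` tokens of the previous one."""
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("need 0 <= overlap_tokens < max_tokens")
    step = max_tokens - overlap_tokens
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the input
    return windows


words = "a b c d e f g h i j".split()
for window in window_with_overlap(words, max_tokens=4, overlap_tokens=1):
    print(" ".join(window))
```

Each printed window starts with the last word of the one before it, which is the continuity the overlap is meant to buy.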

Pairs with NoPaddle

import nopaddle, smantic

doc = nopaddle.parse_pdf("paper.pdf")          # PDF  -> typed regions
chunks = smantic.chunk_document(smantic.from_nopaddle(doc))   # regions -> chunks

from_nopaddle reads a NoPaddle ParsedDocument (object, dict, or JSON) with no conversion step: the two projects share the same region shape, and smantic's IR accepts NoPaddle's regions key directly.

A table NoPaddle detects survives intact. It lands in one table_block chunk (or, if it is huge, is split by row group with the header repeated) and is never shredded into prose:

chunks = smantic.chunk_document(smantic.from_nopaddle(doc))

tables = [c for c in chunks if c.dominant_type == "table_block"]
print(tables[0].has_table)   # True
print(tables[0].content)     # the GFM table, kept whole
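
The row-group split mentioned above can be pictured with a small standalone sketch. It is an illustration under assumptions, not smantic's code: it groups by a fixed number of body rows (the library presumably works from a token budget), it expects a well-formed pipe table with a header and delimiter row, and `split_table_by_row_group` is a hypothetical name.

```python
def split_table_by_row_group(table_md, rows_per_group):
    """Split a GFM pipe table into smaller tables of at most `rows_per_group`
    body rows, repeating the header and delimiter rows in every piece."""
    if rows_per_group < 1:
        raise ValueError("rows_per_group must be at least 1")
    lines = [ln for ln in table_md.strip().splitlines() if ln.strip()]
    header, delimiter, body = lines[0], lines[1], lines[2:]
    if not body:
        return ["\n".join([header, delimiter])]
    pieces = []
    for i in range(0, len(body), rows_per_group):
        pieces.append("\n".join([header, delimiter, *body[i:i + rows_per_group]]))
    return pieces


table = """
| Format | Bits |
|--------|------|
| FP32   | 32   |
| FP16   | 16   |
| BF16   | 16   |
"""

for piece in split_table_by_row_group(table, rows_per_group=2):
    print(piece, end="\n\n")
```

Repeating the header keeps every piece readable on its own, which matters once the pieces are embedded and retrieved independently.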

Command line

smantic notes.md                       # JSONL, one chunk per line
smantic notes.md --format summary      # human-readable table
smantic parsed.json --input-format json --format json -o chunks.json
cat notes.md | smantic - --max-tokens 400 --overlap 40

--input-format defaults to auto (by file extension: .json is parsed JSON, everything else is Markdown). --format is jsonl (default), json, or summary.
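
The auto rule is small enough to restate in code. This is a sketch of the documented behaviour, not the CLI's source: `detect_input_format` is a hypothetical helper, the return value "markdown" is an assumed name for the non-JSON case, and matching the extension case-insensitively is also an assumption.

```python
from pathlib import PurePath


def detect_input_format(path):
    """Mimic the documented `auto` rule: a .json extension means parsed JSON,
    anything else (including "-" for stdin) is treated as Markdown."""
    return "json" if PurePath(path).suffix.lower() == ".json" else "markdown"


for name in ("notes.md", "parsed.json", "-"):
    print(name, "->", detect_input_format(name))
```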

How it works

smantic walks the document once and classifies every element:

- Atomic blocks (code, table, formula, picture, ...) stay intact. A block over the size limit is split into a parent plus children, by Python or JavaScript AST for code, by row group (headers repeated) for tables, and by environment or step for formula derivations. Visual blocks become a chunk only when they carry a caption or alt text worth retrieving.
- Prose runs through three-tier boundary detection:
  - hard boundaries always cut: section headings, and transitions between incompatible element types.
  - soft boundaries cut when the chunk is already big enough: a drop in sentence-to-sentence cosine similarity (the semantic part), or a new paragraph.
  - emergency boundaries cut anywhere once a chunk would blow past max_tokens.
- Headings are accumulated into a trail (heading_trail, nearest_heading, heading_level) and folded into the first chunk under them, so the heading is searchable without wasting a chunk on it.
- Cleanups: consecutive prose chunks share a configurable token overlap for context continuity; chunks below a useful minimum get merged into a same-type neighbour; parser-artifact duplicates (a sentence repeated by a multi-column OCR pass, say) get collapsed; and low-value backmatter (References, Acknowledgments, Funding, and friends) is skipped.
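
For the code case in the first item, splitting by AST means cutting between syntactic units instead of at a character offset. The sketch below shows the idea for Python only, using the standard `ast` module and cutting at top-level statements. It is an assumption about the general technique, not smantic's implementation, and `split_python_by_top_level` is a hypothetical name.

```python
import ast


def split_python_by_top_level(source):
    """Return one source snippet per top-level statement (function, class,
    import, ...), so no piece ends in the middle of a definition.
    Comments and blank lines between statements are dropped in this sketch."""
    tree = ast.parse(source)
    lines = source.splitlines()
    pieces = []
    for node in tree.body:
        # Decorators sit above the def/class line; start from them if present.
        decorators = getattr(node, "decorator_list", [])
        first = min([node.lineno] + [d.lineno for d in decorators])
        pieces.append("\n".join(lines[first - 1:node.end_lineno]))
    return pieces


code = '''import math

def area(r):
    return math.pi * r * r

class Circle:
    def __init__(self, r):
        self.r = r
'''

for piece in split_python_by_top_level(code):
    print(piece)
    print("---")
```

A real splitter would also pack several small statements into one child until a size budget is reached; that step is left out here.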

When the embedding model is not installed, the soft semantic boundary is simply skipped. Everything else still works.
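
The three tiers, and the way the semantic tier drops out when no model is available, can be sketched in a few lines. Everything here is an illustrative assumption: the function names, the `min_tokens` parameter standing in for "already big enough", and reading a "drop" in similarity as falling below a fixed threshold. The paragraph-break soft boundary is left out for brevity.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cut_points(sentences, vectors=None, max_tokens=500, min_tokens=100, threshold=0.5):
    """Return the indices i at which a new chunk starts before sentences[i].

    sentences: list of (text, token_count, starts_section) tuples.
    vectors:   one embedding per sentence, or None when no model is available.
    """
    cuts, size = [], 0
    for i, (_text, tokens, starts_section) in enumerate(sentences):
        if i == 0:
            size = tokens
            continue
        similar = True  # without vectors the semantic tier never fires
        if vectors is not None:
            similar = cosine(vectors[i - 1], vectors[i]) >= threshold
        hard = starts_section                        # always cut
        soft = size >= min_tokens and not similar    # cut only if big enough
        emergency = size + tokens > max_tokens       # cut to respect the limit
        if hard or soft or emergency:
            cuts.append(i)
            size = tokens
        else:
            size += tokens
    return cuts


sentences = [
    ("FP16 halves memory.", 40, False),
    ("Loss scaling avoids underflow.", 40, False),
    ("Lunch is served at noon.", 40, False),
    ("Results are reported next.", 40, True),
]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # toy 2-d embeddings

print(cut_points(sentences, vectors, min_tokens=60))  # semantic tier active
print(cut_points(sentences, None, min_tokens=60))     # no model: hard cuts only
```

With the toy vectors, the off-topic third sentence triggers a soft cut; without vectors only the section start does.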

Input formats

Source	Helper	Notes
Markdown text	`from_markdown(text)`	headings, code fences, pipe tables, `$$`/`\[` math, lists, images, prose
NoPaddle output	`from_nopaddle(doc)`	object, dict, or JSON; reads the `regions` key natively
Docling-style JSON	`from_docling_json(text)`	`{"pages": [{"elements": [...]}]}`
The IR directly	`from_dict(d)` / `from_json(s)`	smantic's own shape

All of them build a smantic.Document, which is what the chunker consumes.
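
For the Docling-style row, the outer shape is the only part stated above. The sketch below walks that shape and pairs each element with a page number. The `type` and `text` keys on the elements are placeholders chosen for the example, since the element schema is not spelled out here, and `iter_elements` is a hypothetical helper, not a smantic function.

```python
import json


def iter_elements(docling_like_json):
    """Yield (page_number, element) pairs from a {"pages": [{"elements": [...]}]}
    document, numbering pages from 1 in the order they appear."""
    data = json.loads(docling_like_json)
    for page_number, page in enumerate(data.get("pages", []), start=1):
        for element in page.get("elements", []):
            yield page_number, element


# The "type" and "text" keys are illustrative; the real element schema may differ.
sample = json.dumps({
    "pages": [
        {"elements": [{"type": "heading", "text": "Methods"},
                      {"type": "paragraph", "text": "We train in FP16."}]},
        {"elements": [{"type": "table", "text": "| a | b |"}]},
    ]
})

for page_number, element in iter_elements(sample):
    print(page_number, element["type"])
```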

Output

chunk_document returns a list of Chunk objects. chunk.to_dict() gives a JSON-ready dict:

{
  "content": "...",              # the chunk text
  "token_count": 312,
  "page_numbers": [4],
  "span_start": 0, "span_end": 1840,
  "chunking_method": "semantic", # semantic | atomic_block | ast_split | row_group | ...
  "dominant_type": "prose",      # prose | code_block | table_block | formula_block | visual_block
  "has_code": False, "has_math": False, "has_table": False,
  "parent_chunk_id": None,       # set by you if you persist the parent->child links
  "block_sequence": None,        # order of a child within a split block
  "metadata": {                  # heading_trail, nearest_heading, section_type, timecodes, ...
    "heading_trail": ["Methods", "Training"],
    "nearest_heading": "Training"
  },
  "sequence": 7                  # position in the returned list
}
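
The heading fields in `metadata` follow a simple stack discipline, which the sketch below reproduces for ATX (`#`) headings. It is a simplified illustration, not smantic's parser: it ignores code fences (where `#` starts a comment) and setext headings, and `heading_trails` is a hypothetical name.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")


def heading_trails(markdown):
    """Yield (line, metadata) for every non-heading, non-blank line, where
    metadata holds the heading trail in effect at that line."""
    trail = []  # list of (level, title), outermost first
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2)
            # A new heading closes every open heading at the same or deeper level.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, title))
        elif line.strip():
            yield line, {
                "heading_trail": [t for _, t in trail],
                "nearest_heading": trail[-1][1] if trail else None,
                "heading_level": trail[-1][0] if trail else None,
            }


doc = "# Methods\nIntro.\n## Training\nWe use FP16.\n## Evaluation\nWe report accuracy.\n"
for line, meta in heading_trails(doc):
    print(meta["heading_trail"], "|", line)
```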

Self-host (FastAPI + Docker)

pip install "smantic[onnx,serve]"
smantic-serve                                  # serves on 0.0.0.0:8000

curl -s localhost:8000/chunk \
  -H 'content-type: application/json' \
  -d '{"text": "# Title\n\nSome prose to chunk."}' | jq

POST /chunk takes either text (Markdown) or document (a parsed dict), plus optional max_tokens / overlap_tokens. There is a GET /healthz and a GET /info. One chunker is kept warm across requests; set SMANTIC_RELEASE_AFTER_REQUEST=1 to free the model after each call instead.
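
The same call from Python, using only the standard library, might look like the sketch below. The helper names are hypothetical, and the response is assumed to be a JSON body since its exact shape is not described above. The request is built separately from sending so it can be inspected without a running server.

```python
import json
import urllib.request


def build_chunk_request(text, max_tokens=None, overlap_tokens=None,
                        base_url="http://localhost:8000"):
    """Build (but do not send) a POST /chunk request for Markdown text."""
    payload = {"text": text}
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    if overlap_tokens is not None:
        payload["overlap_tokens"] = overlap_tokens
    return urllib.request.Request(
        f"{base_url}/chunk",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and decode the response, assumed here to be JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_chunk_request("# Title\n\nSome prose to chunk.", max_tokens=400)
print(request.get_method(), request.full_url)
# With smantic-serve running locally: result = send(request)
```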

Or run the image:

docker build -f docker/Dockerfile -t smantic .
docker run -p 8000:8000 -v smantic-models:/models smantic

Configuration

Every knob has an env var (SMANTIC_*, plus MODEL_CACHE_DIR for the cache location) and a code path:

Env	Default	Meaning
`SMANTIC_MAX_TOKENS`	500	soft max tokens per chunk
`SMANTIC_OVERLAP_TOKENS`	50	token overlap between prose chunks
`SMANTIC_BOUNDARY_THRESHOLD`	0.5	cosine threshold for a soft semantic cut
`SMANTIC_EMBED_REPO`	`sentence-transformers/all-MiniLM-L6-v2`	embedding model repo
`MODEL_CACHE_DIR`	`~/.cache/smantic/models`	where the model is cached
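
If you want the same defaults in your own wrapper script, the table can be mirrored in a small settings object. This is a sketch, not how smantic reads its configuration internally: the `Settings` class and `load_settings` function are hypothetical, and only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    max_tokens: int = 500
    overlap_tokens: int = 50
    boundary_threshold: float = 0.5
    embed_repo: str = "sentence-transformers/all-MiniLM-L6-v2"
    model_cache_dir: str = "~/.cache/smantic/models"


def load_settings(env=os.environ):
    """Read the documented variables from `env`, falling back to the defaults."""
    d = Settings()
    return Settings(
        max_tokens=int(env.get("SMANTIC_MAX_TOKENS", d.max_tokens)),
        overlap_tokens=int(env.get("SMANTIC_OVERLAP_TOKENS", d.overlap_tokens)),
        boundary_threshold=float(env.get("SMANTIC_BOUNDARY_THRESHOLD", d.boundary_threshold)),
        embed_repo=env.get("SMANTIC_EMBED_REPO", d.embed_repo),
        # Expand "~" so the path can be handed to file APIs directly.
        model_cache_dir=os.path.expanduser(env.get("MODEL_CACHE_DIR", d.model_cache_dir)),
    )


print(load_settings({"SMANTIC_MAX_TOKENS": "400", "MODEL_CACHE_DIR": "/models"}))
```

Passing a plain dict instead of `os.environ` keeps the function easy to test.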

Why it is light

No torch. No transformers. No paddle. The core is numpy plus stdlib; the [onnx] extra adds onnxruntime, the Rust tokenizers library, and huggingface_hub for the model download, and nothing else. The embedding model is the 384-dim all-MiniLM-L6-v2 ONNX graph, run with a host-side mean-pool, so there is no deep-learning framework in the dependency tree at all.
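
"Host-side mean-pool" means the sentence vector is computed outside the ONNX graph by averaging the per-token outputs, skipping padding. The sketch below shows that step in pure Python on toy data. The function name is hypothetical, the final L2 normalisation is an assumption (it is a common companion to cosine similarity), and smantic itself would do this with numpy on 384-dimensional vectors.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, then L2-normalise the result."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    mean = [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm else mean


# Three token vectors (dimension 2 for readability); the last one is padding
# and is masked out, so it does not affect the sentence vector.
tokens = [[3.0, 0.0], [0.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))
```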

Status

Alpha. The chunker is ported from a production ingestion pipeline and is well covered by tests. The core suite runs offline against a graceful fallback, so it is fast and needs no model download; the real-embedder path is exercised by the slow-marked tests. The Markdown parser is a pragmatic block segmenter, not a full CommonMark implementation: it handles the constructs that matter for chunking and leaves inline formatting untouched.

License

Apache-2.0.

Project details


Release history

0.1.1 (this version): Jun 5, 2026
0.1.0: Jun 5, 2026

Download files

Source Distribution

smantic-0.1.1.tar.gz (79.3 kB), uploaded Jun 5, 2026, Source

Built Distribution

smantic-0.1.1-py3-none-any.whl (50.8 kB), uploaded Jun 5, 2026, Python 3

File details

Details for the file smantic-0.1.1.tar.gz.

File metadata

Download URL: smantic-0.1.1.tar.gz
Upload date: Jun 5, 2026
Size: 79.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for smantic-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`c5ecc03f6895b18d305a2615a2f74cb7ae798ee5d5c297109fc202a8c072a286`
MD5	`65dd563dcd434c86519dd73b94e2359d`
BLAKE2b-256	`2be02b834cb08bcae3401b875729639c6cb428bba4db2b9d3e08c7838510f014`
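
To check a downloaded file against the SHA256 row above, the standard library is enough. This is a generic sketch, not something shipped with smantic; the helper names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare case-insensitively against a digest copied from the table."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published_digest("smantic-0.1.1.tar.gz", "c5ecc03f6895b18d305a2615a2f74cb7ae798ee5d5c297109fc202a8c072a286")` should return True for an intact download of the source distribution.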


Provenance

The following attestation bundles were made for smantic-0.1.1.tar.gz:

Publisher: release.yml on beimichen/smantic

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: smantic-0.1.1.tar.gz
- Subject digest: c5ecc03f6895b18d305a2615a2f74cb7ae798ee5d5c297109fc202a8c072a286
- Sigstore transparency entry: 1733163102
- Sigstore integration time: Jun 5, 2026
Source repository:
- Permalink: beimichen/smantic@fc7b2cc359e62c2c2f4cda09ece1c0925d195deb
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/beimichen
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@fc7b2cc359e62c2c2f4cda09ece1c0925d195deb
- Trigger Event: push

File details

Details for the file smantic-0.1.1-py3-none-any.whl.

File metadata

Download URL: smantic-0.1.1-py3-none-any.whl
Upload date: Jun 5, 2026
Size: 50.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for smantic-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`769019f96e6ffdde0fbb8e2b5e0d9d5c36d9b7262265147af78967e6613c0b97`
MD5	`b9d735ec8fd47f1e6985c4319cb54560`
BLAKE2b-256	`0fe12be42a0f10ccaa784d2071b09d7a163efa852ad86c996436b9be3fe62a9a`


Provenance

The following attestation bundles were made for smantic-0.1.1-py3-none-any.whl:

Publisher: release.yml on beimichen/smantic

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: smantic-0.1.1-py3-none-any.whl
- Subject digest: 769019f96e6ffdde0fbb8e2b5e0d9d5c36d9b7262265147af78967e6613c0b97
- Sigstore transparency entry: 1733163163
- Sigstore integration time: Jun 5, 2026
Source repository:
- Permalink: beimichen/smantic@fc7b2cc359e62c2c2f4cda09ece1c0925d195deb
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/beimichen
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@fc7b2cc359e62c2c2f4cda09ece1c0925d195deb
- Trigger Event: push
