High-performance NLP primitives for the Kelvin Agentic OS — SIMD string ops, multi-pattern matching, FST dictionaries, sentence segmentation, BM25 retrieval, fuzzy hashing

kaos-nlp-core

Part of Kelvin Agentic OS (KAOS) — open agentic infrastructure for legal work, built by 273 Ventures. See the full KAOS package map for the rest of the stack.

kaos-nlp-core is a high-performance NLP primitives library for KAOS — a pure-Rust core with Python bindings via PyO3/Maturin. It provides the text-processing building blocks the rest of the stack relies on: SIMD-accelerated string operations, multi-pattern matching, finite-state transducers, sentence segmentation, BM25 retrieval, fuzzy hashing, and typed Python wrappers throughout.

It is dependency-light: the base install pulls in only kaos-nlp-core itself plus the bundled Punkt sentence-segmenter model (~12 MB). Optional extras layer in the rest of the KAOS ecosystem.

Install

uv add kaos-nlp-core
# or
pip install kaos-nlp-core

kaos-nlp-core requires Python 3.13 or newer. The published wheels are cp313-abi3 — one wheel per OS/architecture covers every CPython 3.13+ minor (3.13, 3.14, 3.15, …). No re-release needed when 3.15 ships.
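
To make the cp313-abi3 tag concrete, here is a minimal sketch of how such a wheel filename maps to a minimum CPython version. The helper names are hypothetical and not part of kaos-nlp-core, and the check ignores the platform tag entirely; real installers such as pip apply the full compatibility rules.

```python
import sys


def parse_wheel_tags(filename: str) -> tuple[str, str, str]:
    """Split a wheel filename into (python_tag, abi_tag, platform_tag)."""
    stem = filename.removesuffix(".whl")
    # Wheel names end in {python tag}-{abi tag}-{platform tag}.
    python_tag, abi_tag, platform_tag = stem.split("-")[-3:]
    return python_tag, abi_tag, platform_tag


def abi3_wheel_supports(filename: str, version: tuple[int, int]) -> bool:
    """Return True if a cp3XY-abi3 wheel can load on CPython `version`.

    Illustrative only: the platform tag is not checked.
    """
    python_tag, abi_tag, _platform = parse_wheel_tags(filename)
    if abi_tag != "abi3" or not python_tag.startswith("cp3"):
        return False
    floor = (3, int(python_tag[3:]))  # "cp313" -> (3, 13)
    return version >= floor


if __name__ == "__main__":
    wheel = "kaos_nlp_core-0.1.0a2-cp313-abi3-manylinux_2_28_x86_64.whl"
    print(abi3_wheel_supports(wheel, tuple(sys.version_info[:2])))
```

Because the stable ABI only sets a version floor, the same file satisfies 3.13, 3.14, and later minors, which is why no per-minor rebuild is needed.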

Platform coverage: Linux x86_64 (manylinux + musllinux), Linux aarch64 (manylinux + musllinux), macOS arm64, Windows x86_64, Windows arm64.

Quick start

from kaos_nlp_core import tokenizer, algorithms

# Two output shapes for tokenization:
#   tokenize_words → list[str]        — just the surface forms (fastest)
#   tokenize       → list[TokenSpan]  — .text / .start / .end when you
#                                       need character offsets back into
#                                       the source string
words = tokenizer.tokenize_words("kaos-nlp-core ships fast NLP primitives.")
print(words)
# ['kaos-nlp-core', 'ships', 'fast', 'NLP', 'primitives']

for s in tokenizer.tokenize("kaos-nlp-core ships fast NLP primitives.")[:3]:
    print(f"{s.start}-{s.end}: {s.text!r}")
# 0-13: 'kaos-nlp-core'
# 14-19: 'ships'
# 20-24: 'fast'

# Multi-byte safe (CJK + emoji) — offsets are CHARACTER offsets, not bytes
for s in tokenizer.tokenize("東京 emoji 😀 test"):
    print(f"{s.start}-{s.end}: {s.text!r}")
# 0-2: '東京'
# 3-8: 'emoji'
# 9-10: '😀'
# 11-15: 'test'

# Algorithms always return rich typed results
result = algorithms.levenshtein("kitten", "sitting")
print(f"distance={result.distance} similarity={result.similarity:.4f}")
# distance=3.0 similarity=0.5714
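
For readers who want to see where those two numbers come from, the following pure-Python reference reproduces them. It is a sketch, not the library's implementation: the function names are hypothetical, and the similarity formula (1 minus distance over the longer length) is an assumption that happens to match the output shown above (1 - 3/7 ≈ 0.5714).

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / max(len(a), len(b))."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


if __name__ == "__main__":
    d = levenshtein_distance("kitten", "sitting")
    s = normalized_similarity("kitten", "sitting")
    print(f"distance={d} similarity={s:.4f}")
    # distance=3 similarity=0.5714
```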

The _words shortcut exists only where skipping offsets saves meaningful work, which today means tokenization. Everywhere else (segmentation via segment_sentences, segment_paragraphs, and segment_lines; pattern matching; similarity algorithms) the API ships only the rich typed shape, because the metadata is the value.

Concepts

The package is organized around a small set of typed primitives.

| Concept | What it is |
| --- | --- |
| Algorithms | kaos_nlp_core.algorithms: Levenshtein, Hamming, Jaro-Winkler, longest common substring, edit-distance variants. SIMD fast paths via stringzilla; ASCII fast paths before Unicode fallbacks. |
| Tokenizer | kaos_nlp_core.tokenizer: Unicode-aware word/sentence tokenization with byte→char offset translation via build_byte_to_char_table(). Multi-byte safe (Latin diacritics, CJK, emoji). |
| Segmentation | kaos_nlp_core.segmentation: Punkt sentence segmenter (bundled model models/default.npkt.gz, ~12 MB, Apache-2.0 NLTK port). |
| Matching | kaos_nlp_core.matching: Aho-Corasick multi-pattern matching, FST-backed fuzzy lookup via Levenshtein automata, regex. |
| Search | kaos_nlp_core.search: BM25 retrieval, Searcher, sentence/paragraph search; pickle-safe, with a KNC magic header for index files. |
| Structures | kaos_nlp_core.structures: Vocabulary, InvertedIndex, SparseTermMatrix, SimilarityMatrix. Compact, pickle-safe, bincode-2.0 backed. |
| Hashing | kaos_nlp_core.hashing: CTPH (context-triggered piecewise hashing) via blake3, MinHash, LSH index, near-duplicate grouping. |
| Lexicon | kaos_nlp_core.lexicon: query expansion, semantic graph traversal, gazetteer lookups. |
| Documents | kaos_nlp_core.documents: Document, DocumentCollection with JSONL / HuggingFace loaders. |
| Quality | kaos_nlp_core.quality: text-quality heuristics (token ratios, Unicode block distribution). |
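
The byte→char offset translation mentioned for the tokenizer can be pictured as a lookup table from UTF-8 byte positions to character positions. The sketch below is a plain-Python illustration of that idea; sketch_byte_to_char_table is a hypothetical name, and the signature and return shape of the library's own build_byte_to_char_table() are not documented here, so nothing below should be read as its actual behaviour.

```python
def sketch_byte_to_char_table(text: str) -> list[int]:
    """Map every UTF-8 byte offset in `text` to a character offset.

    table[b] is the index of the character that byte b belongs to.
    One extra entry at the end maps the end-of-string byte offset
    to len(text), so exclusive span ends can be translated too.
    """
    table: list[int] = []
    for char_index, ch in enumerate(text):
        width = len(ch.encode("utf-8"))  # 1 to 4 bytes per code point
        table.extend([char_index] * width)
    table.append(len(text))
    return table


if __name__ == "__main__":
    text = "東京 emoji 😀 test"
    table = sketch_byte_to_char_table(text)
    # The emoji occupies bytes 13..17 in UTF-8 but characters 9..10.
    start_byte = text.encode("utf-8").index("😀".encode("utf-8"))
    end_byte = start_byte + len("😀".encode("utf-8"))
    print(table[start_byte], table[end_byte])
    # 9 10
```

Building the table once per document keeps each later span translation a constant-time index lookup, which matters when a byte-oriented core reports many matches.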

CLI

kaos-nlp-core ships a kaos-nlp administrative CLI plus an optional kaos-nlp-serve MCP server (loopback-only by default; --http requires KAOS_NLP_HTTP_TOKEN as an operator acknowledgement that a reverse proxy is fronting authentication):

kaos-nlp tokenize doc.txt --lowercase --json          # word tokenization with spans
kaos-nlp segment doc.txt --mode sentences             # sentence segmentation (Punkt)
kaos-nlp compare "Robert" "Rupert" --algorithm jaro-winkler
kaos-nlp find "pattern" doc.txt --case-insensitive    # SIMD substring search
kaos-nlp index build corpus.txt --output idx.kncidx   # native persisted index
kaos-nlp search --index idx.kncidx "query terms"      # ranked search (BM25 default)
kaos-nlp hash doc.txt --algorithm ctph                # fuzzy hash
kaos-nlp duplicates ./corpus/ --threshold 0.5         # near-duplicate detection
kaos-nlp encode "Robert" --algorithm soundex          # phonetic encoding
kaos-nlp vocab build doc.txt --type frequency         # build vocabulary
kaos-nlp analyze doc.txt --json                       # text statistics report

kaos-nlp-serve            # MCP server, stdio transport
kaos-nlp-serve --http     # MCP server, streamable HTTP (operator-token gated)

Every command supports --json for machine-readable output. CLI search reads both the native persisted index format (KNC) and legacy .json bundles.
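
Since ranked search defaults to BM25, a compact reference for that scoring function may help when interpreting scores. This is a textbook Okapi BM25 sketch over pre-tokenized documents, not the library's code: the function name is hypothetical, and the k1=1.2 and b=0.75 defaults are common choices assumed here rather than values confirmed for kaos-nlp-core.

```python
import math
from collections import Counter


def bm25_scores(
    query: list[str],
    docs: list[list[str]],
    k1: float = 1.2,   # assumed term-frequency saturation parameter
    b: float = 0.75,   # assumed length-normalization parameter
) -> list[float]:
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores: list[float] = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    corpus = [
        ["contract", "law", "contract"],
        ["tort", "law"],
        ["patent", "filing"],
    ]
    for doc, score in zip(corpus, bm25_scores(["contract"], corpus)):
        print(f"{score:.3f}  {' '.join(doc)}")
```

Two properties are worth noting: documents that lack every query term score exactly zero, and for equal term frequency a shorter document outranks a longer one because of the length normalization.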

Note: register_nlp_tools() registers 17 MCP tools. As of 0.1.0a2, the [mcp] extra is reserved but unpopulated, so run pip install kaos-core kaos-mcp manually before using kaos-nlp-serve. Once the sibling packages are published to PyPI, pip install kaos-nlp-core[mcp] will cover the full install. Until then, kaos-nlp-serve exits with an actionable install hint if kaos-core or kaos-mcp is missing.

Compatibility & status

| Aspect | Status |
| --- | --- |
| Python | 3.13, 3.14 (informational matrix entries for 3.14t free-threaded and 3.15-dev). One cp313-abi3 wheel per OS/arch covers all 3.13+ minors. |
| OS | Linux (manylinux + musllinux, x86_64 + aarch64), macOS arm64, Windows x86_64, Windows arm64. macOS x86_64 deliberately skipped (Apple ended Intel sales in 2023). |
| Maturity | Alpha. The public API is documented in kaos_nlp_core.__all__. |
| Stability policy | Pre-1.0: minor bumps may change behaviour. Every change is documented in CHANGELOG.md. |
| Test coverage | 298 Rust unit tests + Python pytest suite. Round-trip offset tests cover ASCII, multi-byte Latin, CJK, and emoji. |
| Type checker | Validated with ty, Astral's Python type checker. |

Companion packages

kaos-nlp-core is one of the packages in the Kelvin Agentic OS. The broader stack:

| Package | Layer | What it does |
| --- | --- | --- |
| kaos-core | Core | Foundational runtime, MCP-native types, registries, execution engine, VFS |
| kaos-content | Core | Typed document AST: Block/Inline, provenance, views |
| kaos-mcp | Bridge | FastMCP server, kaos management CLI, MCP resource templates |
| kaos-pdf | Extraction | PDF → AST with provenance |
| kaos-web | Extraction | Web extraction, browser automation, search, domain intelligence |
| kaos-office | Extraction | DOCX / PPTX / XLSX readers + writers to AST |
| kaos-tabular | Extraction | DuckDB-powered SQL analytics |
| kaos-source | Data | Government + financial data connectors (Federal Register, eCFR, EDGAR, GovInfo, PACER, GLEIF) |
| kaos-llm-client | LLM | Multi-provider LLM transport |
| kaos-llm-core | LLM | Typed LLM programming (Signatures, Programs, Optimizers) |
| kaos-nlp-core | Primitives (Rust) | High-performance NLP primitives |
| kaos-nlp-transformers | ML | Dense embeddings + retrieval |
| kaos-graph | Primitives (Rust) | Graph algorithms + RDF/SPARQL |
| kaos-ml-core | Primitives (Rust) | Classical ML on the document AST |
| kaos-citations | Legal | Legal citation extraction, resolution, verification |
| kaos-agents | Agentic | Agent runtime, memory, recipes |
| kaos-reference | Sample | Reference module for module authors |

Packages depend on kaos-core; everything else is opt-in. Mix and match the ones you need.

Development

git clone https://github.com/273v/kaos-nlp-core
cd kaos-nlp-core
uv sync --group dev
uv run maturin develop --release

Install pre-commit hooks (recommended — they run the same checks as CI on every commit, scoped to staged files):

uvx pre-commit install
uvx pre-commit run --all-files     # one-time full sweep

Manual QA commands (the same set CI runs):

cargo fmt --check
cargo clippy --no-default-features --all-targets -- -D warnings
cargo test --no-default-features --lib
uv run ruff format --check python/kaos_nlp_core tests
uv run ruff check python/kaos_nlp_core tests
uv run ty check python/kaos_nlp_core tests
uv run pytest tests/

Build from source

uv build
uv pip install dist/*.whl

Contributing

Issues and pull requests are welcome. By contributing you certify the Developer Certificate of Origin v1.1 — sign every commit with git commit -s. Please open an issue before starting on a non-trivial change so we can align on scope.

Security

For security issues, please do not file a public issue. Report privately via GitHub Private Vulnerability Reporting or email security@273ventures.com. See SECURITY.md for the full disclosure policy.

License

Apache License 2.0 — see LICENSE and NOTICE.

Copyright 2026 273 Ventures LLC. Built for kelvin.legal.
