Nôm 喃

Open-source Python toolkit for building Vietnamese AI applications.

Named after chữ Nôm — the script Vietnam wrote in for a millennium.

A local-first toolkit. No data leaves your machine. Use any LLM (Ollama by default), any embedder, any document type — Nôm wires them into a Vietnamese-aware RAG pipeline you can ship as either a Python library or a deployable chat web app.

The 3-line demo

pip install "nom-vn[chat]"     # FastAPI + React UI + parsers + embeddings
nom serve                       # opens http://localhost:8080
# upload PDFs/Word/Excel/PowerPoint/images, ask questions in Vietnamese

Screenshot: Nôm chat with citations grounded in indexed Vietnamese documents.

The web app is built into the wheel — there's nothing else to install.

What ships today

| Module | What it does | Status |
| --- | --- | --- |
| `nom.text` | Vietnamese text utilities: NFC, diacritic restoration, word/sentence tokenization | ✅ |
| `nom.chunking` | VN-aware document chunking | ✅ |
| `nom.embeddings` | `Embedder` Protocol + `VietnameseEmbedder` (BGE-base ft) + `AITeamVNEmbedder` (BGE-M3 ft) | ✅ |
| `nom.retrieve` | `BM25Retriever`, `DenseRetriever`, hybrid RRF fusion | ✅ |
| `nom.doc` | Document pipeline: PDF / DOCX / XLSX / PPTX / HTML / JSON / image (OCR) → text | ✅ |
| `nom.llm` | `LLM` Protocol + `Ollama` adapter (any model: Qwen3, Sailor2, Phi-4, …) | ✅ |
| `nom.rag` | One-line RAG composition (`RAG.from_documents(...)`) | ✅ |
| `nom.chat` | FastAPI server + React/ShadCN UI, `MemoryStore` + `SqliteStore` + pluggable `EmbeddingsCache` | ✅ |
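The table lists hybrid RRF fusion for `nom.retrieve`. As a rough illustration of what Reciprocal Rank Fusion does, here is a minimal standalone sketch. It does not use or mirror Nôm's own code: the function name `rrf_fuse`, the constant `k=60`, and the document ids are assumptions chosen for the example.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[Hashable]], k: int = 60) -> List[Hashable]:
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) to a document's score, with rank
    starting at 1. k=60 is a commonly used default, assumed here.
    """
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


bm25_hits = ["d3", "d1", "d2"]    # hypothetical lexical ranking
dense_hits = ["d1", "d4", "d3"]   # hypothetical embedding ranking
fused = rrf_fuse([bm25_hits, dense_hits])
print(fused)  # ['d1', 'd3', 'd4', 'd2']
```

Documents that appear near the top of both lists (`d1`, `d3`) outrank documents that appear in only one, which is the point of fusing a lexical and a dense ranking.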

NotebookLM-style document Q&A web app

Three-pane editorial layout: spaces sidebar / chat thread / sources + studio. Dark editorial palette, sharp corners, citation traceability.

Screenshot (1920×1080 desktop): default chat view with a space selected, materials indexed, and suggested questions shown.

Citations are first-class. Every chunk number is a chip you can click to see the source passage:

Screenshot: citations expanded, with Vietnamese chunks shown inline.

Browser viewers for every supported format

Click any material in the right panel: the Original tab renders the file natively, and the Extracted tab shows what the chunker and embedder saw. PDFs and images use the browser's native viewer; Office formats are rendered as structured HTML, so the browser can show them without LibreOffice.

Screenshots: DOCX rendered as editorial paragraphs, PPTX as 16:10 slide cards, XLSX as HTML tables with a sheet picker.
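To make "Office formats render as structured HTML" concrete, the sketch below turns one sheet's rows into an HTML table using only the standard library. It is an illustration of the general idea, not Nôm's renderer: `rows_to_html_table` and the sample rows are hypothetical, and reading the XLSX file itself is left out.

```python
from html import escape
from typing import Iterable, Sequence


def rows_to_html_table(rows: Iterable[Sequence[object]], header: bool = True) -> str:
    """Render rows of cell values as a plain HTML table.

    The first row becomes <th> cells when header=True. Cell text is escaped
    so that characters such as & and < cannot break the markup.
    """
    parts = ["<table>"]
    for i, row in enumerate(rows):
        tag = "th" if header and i == 0 else "td"
        cells = "".join(
            f"<{tag}>{escape('' if value is None else str(value))}</{tag}>"
            for value in row
        )
        parts.append(f"<tr>{cells}</tr>")
    parts.append("</table>")
    return "\n".join(parts)


sheet = [["Item", "Price"], ["A&B", 10]]  # made-up sample rows
print(rows_to_html_table(sheet))
```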

Library use (no web app)

from nom.rag import RAG
from nom.llm import Ollama

rag = RAG.from_documents(
    ["contract.pdf", "letter.docx", "Hợp đồng số HD-001..."],
    llm=Ollama(model="qwen3:8b"),
)

answer = rag.ask("Có bao nhiêu hợp đồng có phạt vi phạm?")
print(answer.text)         # the LLM's response
print(answer.citations)    # [(doc_idx, chunk_idx, score, text), ...]
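The comment above gives the citation shape as `(doc_idx, chunk_idx, score, text)`. Assuming plain tuples of that shape, a small helper can turn them into numbered references. This is a sketch with made-up stand-in data: `format_citations` is not part of Nôm, and the real `answer.citations` objects may carry more than a bare tuple.

```python
from typing import List, Tuple

Citation = Tuple[int, int, float, str]  # (doc_idx, chunk_idx, score, text)


def format_citations(citations: List[Citation], sources: List[str], width: int = 60) -> List[str]:
    """Return one numbered reference line per citation, best score first."""
    lines = []
    ranked = sorted(citations, key=lambda c: c[2], reverse=True)
    for n, (doc_idx, chunk_idx, score, text) in enumerate(ranked, start=1):
        # Truncate long passages so each reference stays on one line.
        snippet = text if len(text) <= width else text[: width - 1] + "…"
        lines.append(f"[{n}] {sources[doc_idx]} (chunk {chunk_idx}, score {score:.2f}): {snippet}")
    return lines


# Made-up stand-in values, in the tuple shape shown above.
sample = [
    (1, 0, 0.42, "Thư xác nhận đã nhận hồ sơ."),
    (0, 3, 0.87, "Bên vi phạm chịu phạt 8% giá trị hợp đồng."),
]
for line in format_citations(sample, ["contract.pdf", "letter.docx"]):
    print(line)
```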

Document extraction without RAG:

from nom.doc import extract
from nom.llm import Ollama

result = extract(
    "hop_dong.pdf",
    schema={"so_hop_dong": str, "ngay_ky": "date", "tong_gia_tri": "amount_vnd"},
    llm=Ollama(model="qwen3:8b"),
)
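The schema above uses the field types `"date"` and `"amount_vnd"`. How Nôm normalizes them is not described here, so the following is only a guess at what such normalization could look like: two hypothetical stdlib parsers that assume day-first dates and dot-separated thousands, as is common in Vietnamese documents.

```python
import re
from datetime import date


def parse_amount_vnd(raw: str) -> int:
    """Parse an amount such as '1.250.000 đ' into whole dong.

    Assumes '.' separates thousands and ',' starts a decimal part,
    which is dropped.
    """
    whole = raw.split(",")[0]
    digits = re.sub(r"\D", "", whole)
    if not digits:
        raise ValueError(f"no digits in amount: {raw!r}")
    return int(digits)


def parse_date_vn(raw: str) -> date:
    """Parse a day-first date such as '25/04/2026' or '25-4-2026'."""
    match = re.fullmatch(r"\s*(\d{1,2})[/.\-](\d{1,2})[/.\-](\d{4})\s*", raw)
    if match is None:
        raise ValueError(f"unrecognized date: {raw!r}")
    day, month, year = (int(part) for part in match.groups())
    return date(year, month, day)


print(parse_amount_vnd("1.250.000 đ"))
print(parse_date_vn("25/04/2026"))
```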

Text utilities without the rest:

from nom.text import normalize, fix_diacritics, word_tokenize

clean = normalize("Hợp đồng số 02/HĐ/2025")
fixed = fix_diacritics("Hop dong nay duoc lap")  # → "Hợp đồng này được lập"
toks  = word_tokenize("Thành phố Hồ Chí Minh")    # ["Thành phố", "Hồ Chí Minh"]
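NFC normalization matters for Vietnamese because the same visible text can be stored either as precomposed characters or as base letters followed by combining marks, and the two forms do not compare equal. The sketch below shows the effect with the standard library's `unicodedata` only; what `nom.text.normalize` does beyond NFC is not covered here.

```python
import unicodedata

composed = unicodedata.normalize("NFC", "Hợp đồng")
decomposed = unicodedata.normalize("NFD", composed)  # base letters + combining marks

# The two strings render identically but differ code point by code point.
print(composed == decomposed)
print(len(composed), len(decomposed))

# Normalizing both sides to NFC makes comparison and indexing reliable.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Normalizing before tokenizing, chunking, or keyword matching avoids lookups that silently miss because the query and the document use different forms.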

Install

pip install nom-vn                            # text + chunking + retrieve + rag (no I/O deps)
pip install "nom-vn[doc]"                     # + PDF / Office / OCR parsers
pip install "nom-vn[embeddings]"              # + sentence-transformers
pip install "nom-vn[llm]"                     # + httpx for Ollama / OpenAI-compat
pip install "nom-vn[chat]"                    # + FastAPI / uvicorn + everything above
pip install "nom-vn[all]"                     # the lot
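To see which optional dependencies are present in the current environment, you can probe for them without importing anything. In this sketch the mapping from extra to import name is an assumption read off the install comments above (`sentence_transformers`, `httpx`, `fastapi`); the full dependency set of each extra is not documented here.

```python
from importlib.util import find_spec
from typing import Dict, Mapping

# Assumed import names, taken from the comments on the install commands.
EXTRA_MODULES: Mapping[str, str] = {
    "embeddings": "sentence_transformers",
    "llm": "httpx",
    "chat": "fastapi",
}


def available_extras(modules: Mapping[str, str] = EXTRA_MODULES) -> Dict[str, bool]:
    """Report, per extra, whether its marker module can be found."""
    return {extra: find_spec(module) is not None for extra, module in modules.items()}


for extra, present in available_extras().items():
    print(f"{extra}: {'installed' if present else 'missing'}")
```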

OCR (image / scanned PDF) needs Tesseract installed system-wide:

# Debian/Ubuntu
sudo apt install tesseract-ocr tesseract-ocr-vie
# Conda
conda install -c conda-forge tesseract
# macOS
brew install tesseract tesseract-lang

`nom serve` auto-detects the Tesseract binary and locates `vie.traineddata`. If they are absent, image uploads are indexed as zero chunks rather than failing.
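To check up front whether OCR will work, you can run a similar detection yourself. The sketch below is not Nôm's detection logic: `tesseract_has_language` is a hypothetical helper that looks for the binary on `PATH` and asks it for its installed languages via `tesseract --list-langs`.

```python
import shutil
import subprocess


def tesseract_has_language(lang: str = "vie") -> bool:
    """Return True if a tesseract binary is on PATH and lists `lang`."""
    binary = shutil.which("tesseract")
    if binary is None:
        return False
    try:
        proc = subprocess.run(
            [binary, "--list-langs"],
            capture_output=True,
            text=True,
            timeout=10,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # Check both streams, since the list may appear on either one.
    listed = (proc.stdout + "\n" + proc.stderr).split()
    return lang in listed


print(tesseract_has_language("vie"))
```

If this prints `False`, install the Vietnamese language data before uploading images or scanned PDFs.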

Architecture in one line

Seven layers (Primitives / Models / Retrieval / RAG / Storage / Application / Deployment), where every meaningful boundary is a `typing.Protocol`. Nôm runs as a local single process today; the cloud path replaces three Protocol implementations and changes nothing in the application layer.

See docs/architecture.md for the full layered model, Protocol seam table, and scaling-path reference.
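The Protocol-seam idea can be sketched in a few lines. The names below (`Retriever`, `search`, `KeywordRetriever`, `CannedRetriever`, `top_hit`) are hypothetical and are not Nôm's actual interfaces; the point is only that application code typed against a `typing.Protocol` keeps working when the implementation behind it is swapped.

```python
from typing import List, Protocol, runtime_checkable


@runtime_checkable
class Retriever(Protocol):
    """Hypothetical seam: anything with a matching `search` method qualifies."""

    def search(self, query: str, top_k: int) -> List[str]: ...


class KeywordRetriever:
    """Local implementation: naive case-insensitive substring match."""

    def __init__(self, docs: List[str]) -> None:
        self._docs = docs

    def search(self, query: str, top_k: int) -> List[str]:
        terms = query.lower().split()
        hits = [d for d in self._docs if any(t in d.lower() for t in terms)]
        return hits[:top_k]


class CannedRetriever:
    """Stand-in for a remote backend; returns a fixed result list."""

    def __init__(self, results: List[str]) -> None:
        self._results = results

    def search(self, query: str, top_k: int) -> List[str]:
        return self._results[:top_k]


def top_hit(question: str, retriever: Retriever) -> str:
    """Application-layer code: depends only on the Protocol, not on a class."""
    hits = retriever.search(question, top_k=1)
    return hits[0] if hits else "no match"


local = KeywordRetriever(["invoice 2025", "contract HD-001"])
remote = CannedRetriever(["contract HD-001 (from remote index)"])
print(top_hit("contract", local))
print(top_hit("contract", remote))
```

Neither implementation inherits from `Retriever`; structural typing is what lets a cloud-backed class replace a local one without touching `top_hit`.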

Documentation

docs/architecture.md — the 7-layer model, Protocol seams, scaling path, anti-architecture rules
docs/pipeline.md — the document-extraction pipeline end-to-end with per-stage picks
docs/benchmark.md — measured numbers per module
docs/sota_vn_2026q2.md — SOTA local LLM / embedding / OCR for Vietnamese (April 2026 snapshot, every claim cited)
docs/oss_landscape_2026q2.md — OSS local-AI / RAG landscape: patterns to steal, traps to avoid
benchmarks/ — reproducible measurement scripts (perf + retrieval + accuracy)
CONTRIBUTING.md — dev setup, PR rules
CHANGELOG.md — version history

License

Apache 2.0. Fine-tune, redistribute, commercialize freely. Please keep attribution.

Citation

@software{nom2026,
  title  = {Nôm: an open Python toolkit for Vietnamese AI applications},
  author = {Nguyen, Viet Anh and {Neural Research Lab}},
  year   = {2026},
  url    = {https://nrl.ai/nom},
  note   = {Apache 2.0}
}

Built by

Neural Research Lab — open-source AI tooling. Edge inference, private assistants, training, labeling.

This version: 0.2.2, released Apr 25, 2026.
