Nôm 喃

Open-source Python toolkit for building Vietnamese AI applications.

Named after chữ Nôm — the script Vietnam wrote in for a millennium.

A local-first toolkit. No data leaves your machine. Use any LLM (Ollama by default), any embedder, any document type — Nôm wires them into a Vietnamese-aware RAG pipeline you can ship as either a Python library or a deployable chat web app.


The 3-line demo

pip install "nom-vn[chat]"     # FastAPI + React UI + parsers + embeddings
nom serve                       # opens http://localhost:8080
# upload PDFs/Word/Excel/PowerPoint/images, ask questions in Vietnamese

Screenshot: chat with citations grounded in indexed Vietnamese documents.

The web app is built into the wheel — there's nothing else to install.


What ships today

Module          What it does
nom.text        Vietnamese text utilities: NFC normalization, diacritic restoration, word/sentence tokenization
nom.chunking    Vietnamese-aware document chunking
nom.embeddings  Embedder Protocol + VietnameseEmbedder (BGE-base fine-tune) + AITeamVNEmbedder (BGE-M3 fine-tune)
nom.retrieve    BM25Retriever, DenseRetriever, hybrid RRF fusion
nom.doc         Document pipeline: PDF / DOCX / XLSX / PPTX / HTML / JSON / image (OCR) → text
nom.llm         LLM Protocol + Ollama adapter (any model: Qwen3, Sailor2, Phi-4, …)
nom.rag         One-line RAG composition (RAG.from_documents(...))
nom.chat        FastAPI server + React/ShadCN UI, MemoryStore + SqliteStore + pluggable EmbeddingsCache
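
For readers unfamiliar with the "hybrid RRF fusion" listed under nom.retrieve, here is a minimal sketch of generic reciprocal rank fusion. It is not Nôm's implementation: the function name rrf_fuse, the constant k=60 (a commonly used default), and the sample rankings are all assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Hypothetical helper, not part of nom.retrieve.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_ranking = ["d3", "d1", "d2"]   # made-up lexical (BM25) ranking
dense_ranking = ["d1", "d4", "d3"]  # made-up embedding ranking
fused = rrf_fuse([bm25_ranking, dense_ranking])
print([doc_id for doc_id, _ in fused])  # ['d1', 'd3', 'd4', 'd2']
```

Documents that rank well in both lists (d1, d3) rise above documents that appear in only one, which is the point of fusing a lexical and a dense retriever.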

NotebookLM-style document Q&A web app

Three-pane editorial layout: spaces sidebar / chat thread / sources + studio. Dark editorial palette, sharp corners, citation traceability.

Screenshot (1920×1080 desktop): default chat view with a space selected, materials indexed, and suggested questions shown.

Citations are first-class. Every chunk number is a chip you can click to see the source passage.

Screenshot: citations expanded, with Vietnamese chunks shown inline.


Browser viewers for every supported format

Click any material in the right panel: the Original tab renders the file natively, and the Extracted tab shows the text the chunker and embedder saw. PDFs and images use the browser's native viewer; Office formats render as structured HTML, so the browser can show them without LibreOffice.

Format  Rendered as
DOCX    editorial paragraphs
PPTX    16:10 slide cards
XLSX    HTML tables with sheet picker

Library use (no web app)

from nom.rag import RAG
from nom.llm import Ollama

rag = RAG.from_documents(
    ["contract.pdf", "letter.docx", "Hợp đồng số HD-001..."],
    llm=Ollama(model="qwen3:8b"),
)

answer = rag.ask("Có bao nhiêu hợp đồng có phạt vi phạm?")
print(answer.text)         # the LLM's response
print(answer.citations)    # [(doc_idx, chunk_idx, score, text), ...]
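
If the citations really are plain tuples in the (doc_idx, chunk_idx, score, text) shape noted in the comment above, they could be rendered as a numbered source list along these lines. This is a sketch under that assumption: format_citations is a hypothetical helper and the sample tuples are made up, not output from Nôm.

```python
# Made-up citations in the (doc_idx, chunk_idx, score, text) shape.
citations = [
    (0, 2, 0.91, "Bên vi phạm chịu phạt 8% giá trị hợp đồng."),
    (1, 0, 0.74, "Hợp đồng này không quy định phạt vi phạm."),
]


def format_citations(citations, max_chars=60):
    """Return one display line per citation, truncating long passages."""
    lines = []
    for n, (doc_idx, chunk_idx, score, text) in enumerate(citations, start=1):
        # Keep the line short: cut long passages and mark the cut with "...".
        snippet = text if len(text) <= max_chars else text[: max_chars - 3] + "..."
        lines.append(
            f"[{n}] doc {doc_idx}, chunk {chunk_idx} (score {score:.2f}): {snippet}"
        )
    return lines


for line in format_citations(citations):
    print(line)
```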

Document extraction without RAG:

from nom.doc import extract
from nom.llm import Ollama

result = extract(
    "hop_dong.pdf",
    schema={"so_hop_dong": str, "ngay_ky": "date", "tong_gia_tri": "amount_vnd"},
    llm=Ollama(model="qwen3:8b"),
)

Text utilities without the rest:

from nom.text import normalize, fix_diacritics, word_tokenize

clean = normalize("Hợp đồng số 02/HĐ/2025")
fixed = fix_diacritics("Hop dong nay duoc lap")  # → "Hợp đồng này được lập"
toks  = word_tokenize("Thành phố Hồ Chí Minh")    # ["Thành phố", "Hồ Chí Minh"]
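
To show why NFC normalization matters for Vietnamese, the sketch below uses only the standard library's unicodedata module. It illustrates the general Unicode behavior and is not a description of what nom.text.normalize does internally; the helper name to_nfc is an assumption.

```python
import unicodedata


def to_nfc(text):
    """Compose base letters and combining marks into precomposed characters."""
    return unicodedata.normalize("NFC", text)


composed = to_nfc("Hợp đồng")
# NFD splits each accented letter into a base letter plus combining marks,
# which is how some keyboards and PDF extractors emit Vietnamese text.
decomposed = unicodedata.normalize("NFD", composed)

print(composed == decomposed)            # False: same glyphs, different code points
print(len(decomposed) > len(composed))   # True: combining marks add code points
print(to_nfc(decomposed) == composed)    # True: NFC restores the composed form
```

Without this step, two visually identical strings can fail an equality check or miss a lexical index lookup.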

Install

pip install nom-vn                            # text + chunking + retrieve + rag (no I/O deps)
pip install "nom-vn[doc]"                     # + PDF / Office / OCR parsers
pip install "nom-vn[embeddings]"              # + sentence-transformers
pip install "nom-vn[llm]"                     # + httpx for Ollama / OpenAI-compat
pip install "nom-vn[chat]"                    # + FastAPI / uvicorn + everything above
pip install "nom-vn[all]"                     # the lot

OCR (image / scanned PDF) needs Tesseract installed system-wide:

# Debian/Ubuntu
sudo apt install tesseract-ocr tesseract-ocr-vie
# Conda
conda install -c conda-forge tesseract
# macOS
brew install tesseract tesseract-lang

nom serve auto-detects the Tesseract binary and locates vie.traineddata. If Tesseract or the Vietnamese language data is missing, image uploads index as zero chunks rather than failing.
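
To check a machine before uploading images, something like the following could confirm that Tesseract is on PATH and reports the Vietnamese language pack. This is an independent sketch, not the detection logic nom serve uses; the function name tesseract_has_vietnamese is hypothetical, and it relies only on Tesseract's --list-langs flag.

```python
import shutil
import subprocess


def tesseract_has_vietnamese(binary="tesseract"):
    """Return True if the binary is on PATH and lists the 'vie' language."""
    path = shutil.which(binary)
    if path is None:
        return False
    try:
        result = subprocess.run(
            [path, "--list-langs"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # Some Tesseract versions print the language list to stderr, so check both.
    langs = (result.stdout + result.stderr).split()
    return "vie" in langs


print(tesseract_has_vietnamese())
```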


Architecture in one line

7 layers (Primitives / Models / Retrieval / RAG / Storage / Application / Deployment). Every meaningful boundary is a typing.Protocol. Today everything runs locally in a single process; the cloud path replaces three Protocol implementations and changes nothing in the application layer.

See docs/architecture.md for the full layered model, Protocol seam table, and scaling-path reference.
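
As a rough illustration of what a typing.Protocol boundary buys, the sketch below defines a toy embedder seam. The method name embed, its signature, and both classes are assumptions for the example; Nôm's actual Protocols may look different.

```python
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class Embedder(Protocol):
    """Hypothetical seam: anything with a matching embed() method qualifies."""

    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...


class CharCountEmbedder:
    """Toy implementation; note that it does not inherit from Embedder."""

    def embed(self, texts):
        # Two made-up features per text: length and number of spaces.
        return [[float(len(t)), float(t.count(" "))] for t in texts]


def index(texts, embedder: Embedder):
    """Application-layer code depends only on the Protocol, not the class."""
    return list(zip(texts, embedder.embed(texts)))


print(isinstance(CharCountEmbedder(), Embedder))
print(index(["Hồ Chí Minh"], CharCountEmbedder()))
```

Because the match is structural, swapping a local implementation for a remote one means writing a new class with the same method, while index() stays untouched.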


License

Apache 2.0. Fine-tune, redistribute, commercialize freely. Please keep attribution.

Citation

@software{nom2026,
  title  = {Nôm: an open Python toolkit for Vietnamese AI applications},
  author = {Nguyen, Viet Anh and {Neural Research Lab}},
  year   = {2026},
  url    = {https://nrl.ai/nom},
  note   = {Apache 2.0}
}

Built by

Neural Research Lab — open-source AI tooling. Edge inference, private assistants, training, labeling.
