Open-source Indian vernacular AI text normalization toolkit (Sarvam-first, globally extensible).

Open Vernacular AI Kit

open-vernacular-ai-kit is an open-source SDK and CLI for cleaning up code-mixed text that blends Indian vernacular languages with English. This release is India-first, with Sarvam AI integrations; later releases are intended to expand globally through community-contributed language and provider adapters. It targets messy WhatsApp-style inputs in which vernacular text may appear as:

- Native script (example: ગુજરાતી)
- Romanized vernacular text (example: Gujlish)
- Mixed script within the same sentence
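
To make the three input forms above concrete, here is a minimal sketch that labels each whitespace-separated token by script. It is an illustration only, not the toolkit's implementation: the function names and the labels (`native`, `latin`, `mixed`, `other`) are hypothetical, and only the Gujarati Unicode block (U+0A80 to U+0AFF) is checked.

```python
# Illustrative sketch only; names and labels are hypothetical, not the toolkit's API.
GUJARATI_BLOCK = range(0x0A80, 0x0B00)  # Unicode Gujarati block, U+0A80..U+0AFF


def token_script(token: str) -> str:
    """Label a token as 'native', 'latin', 'mixed', or 'other' by its characters."""
    has_native = any(ord(ch) in GUJARATI_BLOCK for ch in token)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in token)
    if has_native and has_latin:
        return "mixed"
    if has_native:
        return "native"
    if has_latin:
        return "latin"
    return "other"  # digits, punctuation, other scripts


def sentence_scripts(text: str) -> list[str]:
    """Label every whitespace-separated token in a sentence."""
    return [token_script(tok) for tok in text.split()]


print(sentence_scripts("મારું business plan ready છે"))
# -> ['native', 'latin', 'latin', 'latin', 'native']
```

Script detection alone cannot tell English from romanized vernacular text, since both are Latin; that distinction needs a lexicon or a language-ID model on top.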

The goal is to normalize text before sending it to downstream models (Sarvam-M / Mayura / Sarvam-Translate), and to provide a reusable open-source foundation for vernacular AI workflows. Global language/provider expansion is planned and PR-friendly.

This repo is alpha-quality but SDK-first: the public API centers on CodeMixConfig + CodeMixPipeline.

Quick example:

gck codemix "maru business plan ready chhe!!!"
# -> મારું business plan ready છે!!

What We Solve

This project is a production-oriented normalization layer for India-focused AI applications. It cleans noisy mixed-script chat text before downstream LLM, retrieval, and support workflows.

- Product positioning + landscape matrix: docs/what-we-solve.md
- North-star metrics definitions and measurement method: docs/north-star-metrics.md

Developer Adoption Assets

- Integration snippets (OpenAI, LangChain, RAG): docs/cookbook/integrations.md
- Batch CLI recipes (support + ecommerce): docs/cookbook/batch-cli-recipes.md
- Sarvam teacher mining for offline language improvement: docs/cookbook/sarvam-teacher.md
- Before/after LLM uplift notebook: notebooks/before_after_llm_output.ipynb
- Notebook dataset: docs/data/llm_uplift_examples.jsonl

Hard Cases (WhatsApp-Style)

Canonical output format (Gujarati-first profile):

- Native-script tokens stay in their native script
- English stays in Latin
- Romanized vernacular tokens are transliterated to native script when possible

Input (messy)	Output (canonical code-mix)
`maru business plan ready chhe!!!`	`મારું business plan ready છે!!`
`maru mobile number 123 chhe`	`મારું mobile number 123 છે`
`maru order confirm chhe`	`મારું order confirm છે`
`aaje maru kaam ready chhe`	`આજે મારું કામ ready છે`
`tame ok chho?`	`તમે ok છો?`

Note: outputs depend on the selected transliteration backend and configuration. Pass --stats to log what happened (the backend used, how many romanized vernacular tokens were transliterated, and so on). For stricter "keep English as English" behavior, consider enabling an optional Latin-token LID backend (see docs).
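
The canonical-output rules above can be sketched with a toy lexicon lookup. This is a deliberately tiny illustration under stated assumptions, not how the toolkit's backends work: `TOY_LEXICON` and `toy_codemix` are hypothetical names, the lexicon holds only the romanized words from the table, and the "collapse three or more repeated marks to two" rule is inferred from the `!!!` to `!!` example rather than documented.

```python
import re

# Hypothetical toy lexicon, built only from the examples in the table above.
TOY_LEXICON = {
    "maru": "મારું",
    "chhe": "છે",
    "aaje": "આજે",
    "kaam": "કામ",
    "tame": "તમે",
    "chho": "છો",
}

TRAILING_PUNCT = "!?.,"


def toy_codemix(text: str):
    """Return (canonical_text, stats) using the toy lexicon only."""
    # Assumption: runs of 3+ identical marks collapse to 2 (inferred from "!!!" -> "!!").
    text = re.sub(r"([!?.])\1{2,}", r"\1\1", text)
    out, converted = [], 0
    for tok in text.split():
        core = tok.rstrip(TRAILING_PUNCT)   # word without trailing punctuation
        trail = tok[len(core):]             # the punctuation that was stripped
        native = TOY_LEXICON.get(core.lower())
        if native is not None:              # known romanized token: transliterate
            core = native
            converted += 1
        out.append(core + trail)            # unknown tokens (English, digits) pass through
    return " ".join(out), {"n_tokens": len(out), "n_converted": converted}


result, stats = toy_codemix("maru business plan ready chhe!!!")
print(result)
print(stats)
```

A real backend has to handle spelling variation and ambiguity between English and romanized words, which a fixed dictionary cannot do; the stats dictionary here only mimics the kind of counts the note above says `--stats` reports.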

Reproducible Eval (Gujarati Baseline)

This repo includes a lightweight "coverage-style" eval harness (not translation quality) for the current Gujarati baseline:

gck eval --dataset gujlish --report eval/out/report.json

Example result from one local run (topk=1, max_rows=2000):

Split in22: pct_has_gujarati_codemix ~= 0.987
Split xnli: pct_has_gujarati_codemix ~= 0.956

See docs/benchmarks.md for details.
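
As a rough illustration of what a "coverage-style" number measures, the sketch below computes the fraction of outputs that contain at least one Gujarati-script character. This is only one plausible reading of a metric named `pct_has_gujarati_codemix`; the exact definition lives in docs/benchmarks.md, and the function names here are hypothetical.

```python
def has_gujarati(text: str) -> bool:
    """True if any character falls in the Gujarati Unicode block (U+0A80..U+0AFF)."""
    return any("\u0a80" <= ch <= "\u0aff" for ch in text)


def pct_with_gujarati(outputs: list[str]) -> float:
    """Fraction of outputs that contain Gujarati script (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    return sum(has_gujarati(o) for o in outputs) / len(outputs)


sample_outputs = [
    "મારું order confirm છે",
    "order confirmed",          # nothing was transliterated
    "તમે ok છો?",
    "આજે ready",
]
print(pct_with_gujarati(sample_outputs))  # -> 0.75
```

A score like this says how often transliteration produced native script at all; it says nothing about whether the transliteration is correct.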

North-Star Baseline Snapshot (Current Release)

Generate the snapshot:

python3 scripts/snapshot_north_star_metrics.py --output docs/data/north_star_metrics_snapshot.json --iterations 200

Current snapshot (2026-03-20T19:56:48Z):

Metric	Value	Notes
`transliteration_success`	`1.000`	Golden transliteration accuracy across packaged Hindi/Gujarati cases (`90/90`; backend=`none`)
`dialect_accuracy`	`0.833`	Heuristic dialect-id accuracy (`5/6`)
`p95_latency_ms`	`0.216`	Pipeline p95 latency in ms (`iterations=200`, `n_calls=1200`)
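
For readers who want to reproduce a p95 figure on their own inputs, here is a minimal timing sketch. It is not the snapshot script: the nearest-rank percentile method, the function names, and the stand-in callable (`str.lower`) are all assumptions chosen for illustration.

```python
import time


def p95(values):
    """Nearest-rank 95th percentile of a non-empty sequence."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integer math
    return ordered[rank - 1]


def measure_latencies_ms(fn, inputs, iterations):
    """Call fn on every input, `iterations` times over, and return per-call latency in ms."""
    latencies = []
    for _ in range(iterations):
        for item in inputs:
            start = time.perf_counter()
            fn(item)
            latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


# str.lower is a stand-in for whatever pipeline call you want to time.
samples = measure_latencies_ms(str.lower, ["Maru ORDER", "tame OK chho?"], iterations=50)
print(len(samples), p95(samples))
```

With this layout the number of timed calls is `iterations * len(inputs)`, which would match the table's `n_calls=1200` at `iterations=200` if six inputs were used; that count is an inference, not something the snapshot states.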

Indian Language Coverage (This Release)

Current scope: India-first release. Gujarati is production-ready in this repo today and Hindi is partially ready (beta profile); the remaining Scheduled Indian languages are planned next and open for community PRs.

Language	Ready	Partially Ready	Planned (PR welcome)
Assamese	⬜	⬜	✅
Bengali	⬜	⬜	✅
Bodo	⬜	⬜	✅
Dogri	⬜	⬜	✅
Gujarati	✅	⬜	⬜
Hindi	⬜	✅	✅
Kannada	⬜	⬜	✅
Kashmiri	⬜	⬜	✅
Konkani	⬜	⬜	✅
Maithili	⬜	⬜	✅
Malayalam	⬜	⬜	✅
Manipuri	⬜	⬜	✅
Marathi	⬜	⬜	✅
Nepali	⬜	⬜	✅
Odia	⬜	⬜	✅
Punjabi	⬜	⬜	✅
Sanskrit	⬜	⬜	✅
Santali	⬜	⬜	✅
Sindhi	⬜	⬜	✅
Tamil	⬜	⬜	✅
Telugu	⬜	⬜	✅
Urdu	⬜	⬜	✅

Contribute

This release focuses on Indian languages and Sarvam-first hosted API flows. If you want to help expand global language coverage or add provider adapters, open a GitHub issue or submit a PR.

Open-Source Governance

This repository now includes a full public OSS governance baseline:

- Contribution workflow: CONTRIBUTING.md
- Security reporting policy: SECURITY.md
- Community behavior policy: CODE_OF_CONDUCT.md
- Support channels: SUPPORT.md
- Ownership/review routing: .github/CODEOWNERS
- PR and issue intake templates: .github/pull_request_template.md, .github/ISSUE_TEMPLATE/
- Dependency update automation: .github/dependabot.yml
- Release and RC process: RELEASE.md
- Maintainer model and merge policy: GOVERNANCE.md

Repository Rename Notes

Repository target name for this release: open-vernacular-ai-kit.

If your local clone still points to the old remote, update it:

git remote set-url origin https://github.com/SudhirGadhvi/open-vernacular-ai-kit.git
git remote -v

Compatibility note: Python import path is open_vernacular_ai_kit.

Install

For full functionality (recommended):

python3 -m venv .venv
.venv/bin/python -m pip install -U pip
.venv/bin/pip install -e ".[api,indic,ml,eval,demo,dev,dialect-ml,rag]"

Minimal (CLI + basic heuristics only):

pip install -e .

CLI

Normalize text:

gck normalize "મારું business plan ready છે!!!"

Render clean code-mix (native-script tokens preserved, English preserved):

gck codemix "maru business plan ready chhe!!!"

Hindi beta language profile:

gck codemix --language hi "mera naam Sudhir hai"

Quick success metric (% Gujlish tokens transliterated):

gck codemix --stats "maru business plan ready chhe!!!" 1>/dev/null

Run eval (downloads public Gujlish eval CSVs into ~/.cache/open-vernacular-ai-kit):

gck eval --dataset gujlish --report eval/out/report.json

Dialect evals (uses a tiny packaged JSONL by default, or provide your own):

gck eval --dataset dialect_id
gck eval --dataset dialect_normalization

API Service (FastAPI)

Install API extras:

pip install -e ".[api,indic,ml,lexicon]"

Run service:

uvicorn open_vernacular_ai_kit.api_service:app --host 0.0.0.0 --port 8000

Endpoints:

GET /healthz
POST /normalize
POST /codemix
POST /analyze

See:

docs/api-service.md
docs/deploy.md
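
A client call against the service might look like the sketch below, which uses only the standard library. The request body shape is an assumption: the `"text"` field name is hypothetical and is not confirmed by this README, so check docs/api-service.md for the real schema. The helper name `build_codemix_request` is also made up for this example.

```python
import json
import urllib.request


def build_codemix_request(text, base_url="http://localhost:8000"):
    """Build (but do not send) a POST /codemix request."""
    # Assumption: the service accepts a JSON object with a "text" field.
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/codemix",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_codemix_request("maru order confirm chhe")
print(req.get_method(), req.full_url)

# To send it against a running service:
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(json.loads(resp.read().decode("utf-8")))
```

Building the request separately from sending it keeps the example runnable without a server and makes the payload easy to inspect.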

Docker Image

Build and run locally:

docker build -t ovak-api:local .
docker run --rm -p 8000:8000 ovak-api:local

Image publishing:

Docker workflow publishes to GHCR on v* tags (.github/workflows/docker.yml).

Demo (Streamlit)

Run On Localhost

Create and activate a virtual environment:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip

Install demo dependencies:

pip install -e ".[demo,indic]"

Start the demo server:

.venv/bin/streamlit run demo/streamlit_app.py

Open in browser:

http://localhost:8501

Optional (enable Sarvam AI comparison in the UI):

pip install -e ".[sarvam]"
export SARVAM_API_KEY="your_key_here"

Then restart:

streamlit run demo/streamlit_app.py

If you export SARVAM_API_KEY, the demo can optionally call Sarvam APIs.
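
In your own scripts, the same opt-in pattern can be sketched as an environment check. How the demo itself reads the key is not described here, so treat `sarvam_enabled` as a hypothetical helper rather than toolkit behavior.

```python
import os


def sarvam_enabled(env=None):
    """True when a non-empty SARVAM_API_KEY is present in the environment mapping."""
    env = os.environ if env is None else env
    return bool(env.get("SARVAM_API_KEY", "").strip())


# Passing a mapping explicitly makes the check easy to test.
print(sarvam_enabled({"SARVAM_API_KEY": "your_key_here"}))  # -> True
print(sarvam_enabled({}))                                   # -> False
```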

Troubleshooting (Local Demo)

streamlit: command not found
- Run with the virtualenv binary:
```
.venv/bin/streamlit run demo/streamlit_app.py
```

Port 8501 already in use
- Start on a different port:
```
.venv/bin/streamlit run demo/streamlit_app.py --server.port 8502
```
- Then open http://localhost:8502.

Import/dependency errors in demo
- Reinstall required extras:
```
pip install -e ".[demo,indic]"
```
- For Sarvam features:
```
pip install -e ".[sarvam]"
```

Demo Screenshots

1) Landing / value overview

[Screenshot: Open Vernacular AI Kit Hero]

What this shows:

The app focus: normalize mixed vernacular + English text before LLM/search/routing.
Product value areas: LLM quality, retrieval quality, and analytics signal cleanup.
Starting point before running any analysis.

2) Live analysis (Before -> After)

[Screenshot: Live Analysis Before and After]

What this shows:

A raw romanized message in Before.
Canonicalized output in After with native-script conversions.
Conversion metrics (romanized tokens, converted count, conversion rate, backend).
Token-level changes table to inspect exactly what was transformed.

3) RAG section

[Screenshot: RAG Mini-KB Section]

What this shows:

The India-focused mini-KB retrieval panel.
Query input, preprocessing toggle, embeddings mode, and top-k controls.
A quick way to test retrieval quality on canonicalized inputs.

4) Settings panel (expanded)

[Screenshot: Settings Panel Expanded]

What this shows:

Full runtime controls for transliteration, numerals, backends, and model options.
Sarvam comparison toggles and advanced dialect-related settings.
The main place to configure behavior before running analysis.

5) Token LID panel (expanded)

[Screenshot: Token LID Expanded]

What this shows:

Token-by-token language tags and confidence scores.
Why each token was classified as native script, romanized, English, or other.
Useful for debugging lexicon rules and transliteration decisions.

6) Code-switching + dialect panel (expanded)

[Screenshot: Code Switching and Dialect Expanded]

What this shows:

CMI/switch-point metrics for mixed-language inputs.
Detected dialect label and confidence.
Quick diagnostics to understand how mixed or dialect-heavy an input is.
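
For orientation, the sketch below computes a Code-Mixing Index and a switch-point count from per-token language tags. It uses one widely used CMI formulation, `100 * (1 - max_lang_count / (n - u))`, where `n` is the token count and `u` the number of language-neutral tokens; whether the demo uses exactly this formula is an assumption. The tag names (`gu`, `en`, `other`) are hypothetical.

```python
from collections import Counter


def code_mixing_index(tags, neutral="other"):
    """CMI in [0, 100]: 0 for monolingual input, higher for more evenly mixed input."""
    lang_counts = Counter(t for t in tags if t != neutral)
    total = sum(lang_counts.values())      # n - u: tokens that carry a language
    if total == 0:
        return 0.0
    dominant = max(lang_counts.values())   # tokens in the most frequent language
    return 100.0 * (total - dominant) / total


def switch_points(tags, neutral="other"):
    """Number of language changes between consecutive language-bearing tokens."""
    langs = [t for t in tags if t != neutral]
    return sum(1 for a, b in zip(langs, langs[1:]) if a != b)


tags = ["gu", "en", "en", "en", "gu", "other"]  # e.g. tags for "મારું business plan ready છે !!"
print(code_mixing_index(tags))  # -> 40.0
print(switch_points(tags))      # -> 2
```

Neutral tokens such as punctuation and digits are dropped before counting so they neither dilute the index nor create spurious switches.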

7) Batch helpers panel (expanded)

[Screenshot: Batch Helpers Expanded]

What this shows:

CSV and JSONL upload flows for bulk preprocessing.
Download-ready processed outputs for production pipelines.
The easiest way to run large input sets through the same normalization logic.
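
Outside the demo, the same kind of bulk pass over a JSONL file can be sketched in a few lines. The field names (`text`, `normalized`) and the helper `process_jsonl` are hypothetical, and the whitespace-collapsing lambda is only a stand-in for a real normalization callable from the toolkit.

```python
import io
import json


def process_jsonl(src, dst, transform, field="text", out_field="normalized"):
    """Read JSONL rows from src, add transform(row[field]) as out_field, write to dst."""
    count = 0
    for line in src:
        line = line.strip()
        if not line:                      # skip blank lines
            continue
        row = json.loads(line)
        row[out_field] = transform(row[field])
        # ensure_ascii=False keeps native-script text readable in the output file
        dst.write(json.dumps(row, ensure_ascii=False) + "\n")
        count += 1
    return count


src = io.StringIO('{"text": "  maru   order confirm chhe "}\n\n{"text": "tame ok chho?"}\n')
dst = io.StringIO()
n = process_jsonl(src, dst, transform=lambda s: " ".join(s.split()))
print(n)
print(dst.getvalue(), end="")
```

With real files, replace the `StringIO` objects with `open(path, encoding="utf-8")` handles; processing line by line keeps memory use flat for large inputs.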

How To Use The Demo

1. Open the app and load an example (or paste your own user message).
2. Click Analyze to produce canonical text and conversion metrics.
3. Review Before vs After and the What Changed table.
4. (Optional) Open RAG and run retrieval on the same canonicalized text.
5. (Optional) Add SARVAM_API_KEY to enable model comparison in the AI section.

RAG Utilities (v0.5)

v0.5 adds small, optional RAG helpers intended for tiny curated corpora and demos:

RagIndex: build a small embeddings index and do top-k retrieval
load_vernacular_facts_tiny(): packaged India-focused mini dataset (docs + queries) for quick recall evals and demos
download_vernacular_facts_dataset(...): opt-in download helper (URLs required; offline-first by default)

To enable HF embedding models:

.venv/bin/pip install -e ".[rag-embeddings]"

Example (keyword embedder, no ML deps):

from open_vernacular_ai_kit import RagIndex, load_vernacular_facts_tiny

# Load the packaged mini dataset (docs + queries).
ds = load_vernacular_facts_tiny()

def keyword_embed(texts: list[str]) -> list[list[float]]:
    # One dimension per keyword: 1.0 if the keyword occurs in the text, else 0.0.
    keys = ["gujarati", "hindi", "tamil", "kannada", "bengali", "marathi"]
    return [[1.0 if k in (t or "").lower() else 0.0 for k in keys] for t in texts]

# Build the index with the keyword embedder, then retrieve the top 3 docs for a query.
idx = RagIndex.build(docs=ds.docs, embed_texts=keyword_embed, embedding_model="keywords")
hits = idx.search(
    query="which language is commonly used in gujarat customer support workflows (gujarati)?",
    embed_texts=keyword_embed,
    topk=3,
)
print([h.doc_id for h in hits])
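
To show what top-k retrieval over embedding vectors involves, here is a standalone sketch that ranks documents by cosine similarity. It does not describe RagIndex internals: the similarity measure, the tie-breaking rule, and the function names are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query_vec, doc_vecs, k=3):
    """Indices of the k most similar documents, best first (ties broken by index)."""
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]


doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
print(top_k([1.0, 0.0, 0.0], doc_vecs, k=2))  # -> [0, 2]
```

With a keyword embedder like the one above, many documents share identical vectors, so a deterministic tie-break matters more than it would with dense model embeddings.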

Dialects (Full SDK)

Dialect support is offline-first and pluggable:

Dialect ID backends: heuristic (default), transformers (fine-tuned model), none
Dialect normalization backends: heuristic (rules), seq2seq (optional), auto (rules + optional seq2seq)

Safety default: remote HuggingFace model downloads are disabled unless you explicitly enable them:

SDK: allow_remote_models=False (default)
Demo: "Allow remote model downloads" checkbox (off by default)

Example (heuristic dialect normalization gated by confidence):

from open_vernacular_ai_kit import analyze_codemix

a = analyze_codemix(
    "kamaad thaalu rakhje",
    dialect_backend="heuristic",
    dialect_normalize=True,
    dialect_min_confidence=0.7,
)
print(a.codemix)
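
The idea of gating normalization on confidence can be sketched independently of the SDK. Everything below is hypothetical: the helper name, the rule table, and the placeholder tokens are not real dialect data, and the SDK's own gating logic may differ.

```python
def normalize_if_confident(tokens, dialect_confidence, rules, min_confidence=0.7):
    """Apply token-level rewrite rules only when dialect detection is confident enough."""
    if dialect_confidence < min_confidence:
        return list(tokens)                      # low confidence: leave text untouched
    return [rules.get(tok, tok) for tok in tokens]


# Placeholder rule table; real rules would map dialect forms to standard forms.
rules = {"dialect_word": "standard_word"}

print(normalize_if_confident(["dialect_word", "ok"], 0.9, rules))
# -> ['standard_word', 'ok']
print(normalize_if_confident(["dialect_word", "ok"], 0.4, rules))
# -> ['dialect_word', 'ok']
```

Leaving the input unchanged below the threshold trades recall for safety: a wrong dialect guess then cannot rewrite text that was already correct.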

Training (Optional)

This repo includes simple training scripts (you provide data):

python3 scripts/train_dialect_id.py --train path/to/dialect_id_train.jsonl --output-dir out/dialect_id
python3 scripts/train_dialect_normalizer.py --train path/to/dialect_norm_train.jsonl --output-dir out/dialect_norm

Disclaimer

This is alpha software. Core code-mix rendering is designed to be stable, but dialect detection and normalization are limited by available labeled data and model choices.

Project details

Maintainers

SudhirGadhvi

Release history

- 1.3.0: Mar 29, 2026
- 1.2.0: Mar 20, 2026 (this version)
- 1.1.0: Mar 9, 2026
- 1.0.2: Feb 28, 2026
- 1.0.0: Feb 28, 2026

Download files

Source Distribution

open_vernacular_ai_kit-1.2.0.tar.gz (5.0 MB), uploaded Mar 20, 2026

Built Distribution

open_vernacular_ai_kit-1.2.0-py3-none-any.whl (2.7 MB, Python 3), uploaded Mar 20, 2026

File details

Details for the file open_vernacular_ai_kit-1.2.0.tar.gz.

File metadata

Download URL: open_vernacular_ai_kit-1.2.0.tar.gz
Upload date: Mar 20, 2026
Size: 5.0 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for open_vernacular_ai_kit-1.2.0.tar.gz
Algorithm	Hash digest
SHA256	`395485d7efbb592be4b08aa0ca812b8d66f8a67677d2f7b7024629453b651c7e`
MD5	`50438b4461bab8953f9201984d2db023`
BLAKE2b-256	`07c67b7c05cc9ab9e64a672efa9789f8f6b95bba2eba1371bfede2047c3e66f5`
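
A downloaded archive can be checked against the SHA256 digest listed above with the standard library. The helper names below are hypothetical, and the demo runs on a throwaway temporary file so the snippet stays self-contained.

```python
import hashlib
import hmac
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Hex SHA256 of a file, read in chunks so large archives do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare the file's digest with a published hex digest (case-insensitive)."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())


# Demo on a throwaway file; in practice pass the downloaded archive path instead.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example bytes")
try:
    expected = hashlib.sha256(b"example bytes").hexdigest()
    print(matches_published(tmp.name, expected))   # -> True
    print(matches_published(tmp.name, "0" * 64))   # -> False
finally:
    os.remove(tmp.name)
```

For the real check, call `matches_published("open_vernacular_ai_kit-1.2.0.tar.gz", "<SHA256 from the table>")` after downloading the file.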

Provenance

The following attestation bundles were made for open_vernacular_ai_kit-1.2.0.tar.gz:

Publisher: release.yml on SudhirGadhvi/open-vernacular-ai-kit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: open_vernacular_ai_kit-1.2.0.tar.gz
- Subject digest: 395485d7efbb592be4b08aa0ca812b8d66f8a67677d2f7b7024629453b651c7e
- Sigstore transparency entry: 1150105861
- Sigstore integration time: Mar 20, 2026
Source repository:
- Permalink: SudhirGadhvi/open-vernacular-ai-kit@2a6087cd2a9d18daa7ed2e226d611d984a3b73b6
- Branch / Tag: refs/tags/v1.2.0
- Owner: https://github.com/SudhirGadhvi
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@2a6087cd2a9d18daa7ed2e226d611d984a3b73b6
- Trigger Event: push

File details

Details for the file open_vernacular_ai_kit-1.2.0-py3-none-any.whl.

File metadata

Download URL: open_vernacular_ai_kit-1.2.0-py3-none-any.whl
Upload date: Mar 20, 2026
Size: 2.7 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for open_vernacular_ai_kit-1.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`185bce72030a571d10a95c6f83c183675f1bbbd1059c184b46bec954c2b3f128`
MD5	`1d47cf6cc458f75cc2cd05975eab20ce`
BLAKE2b-256	`c5535751f4d8d2c45393b4ac09f4c2cda51734293e8616b9885f5dbcafd8d75a`

Provenance

The following attestation bundles were made for open_vernacular_ai_kit-1.2.0-py3-none-any.whl:

Publisher: release.yml on SudhirGadhvi/open-vernacular-ai-kit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: open_vernacular_ai_kit-1.2.0-py3-none-any.whl
- Subject digest: 185bce72030a571d10a95c6f83c183675f1bbbd1059c184b46bec954c2b3f128
- Sigstore transparency entry: 1150105934
- Sigstore integration time: Mar 20, 2026
Source repository:
- Permalink: SudhirGadhvi/open-vernacular-ai-kit@2a6087cd2a9d18daa7ed2e226d611d984a3b73b6
- Branch / Tag: refs/tags/v1.2.0
- Owner: https://github.com/SudhirGadhvi
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@2a6087cd2a9d18daa7ed2e226d611d984a3b73b6
- Trigger Event: push
