dhvani
Phonetic normalization for Hinglish text.
dhvani resolves the spelling chaos of Romanized Hindi. It knows that "bahut", "bohot", "boht", and "bhot" are all the same word, and normalizes them to a canonical form using IPA as a bridge representation.
pip install dhvani
import dhvani
dhvani.to_devanagari("bohotttt achaaa movie thi yaar")
# -> "बहुत अच्छा movie थी यार"
dhvani.are_same("bahut", "bohot") # True
dhvani.are_same("bahut", "बहुत") # True (cross-script)
dhvani.to_ipa("kaise ho bhai")
# -> "kɛːseː ɦoː bʱaːiː"
The Problem
600M+ Indians write online in Hinglish (Hindi in Latin script, mixed with English). There is no standardized spelling:
| Word | Variants typed online |
|---|---|
| बहुत (very) | bahut, bohot, boht, bhot, bahot, bht, bhaut |
| अच्छा (good) | accha, achha, acha, achaa, aacha |
| कैसे (how) | kaise, kese, kayse, kse |
This breaks exact-match search, sentiment analysis, content moderation, and any other NLP tool that keys on surface spelling. dhvani fixes it by mapping every variant to one canonical form.
Install
pip install dhvani
That's it. No model downloads, no API keys, no GPU needed. The 1M+ word lexicon ships with the package.
Usage
Transliteration
import dhvani
# Handles messy social media text
dhvani.to_devanagari("kya karra h tu")
# -> "क्या कर रहा है तू"
# Handles elongated text
dhvani.to_devanagari("bohotttt achaaa yaaaar")
# -> "बहुत अच्छा यार"
# Preserves English words and punctuation
dhvani.to_devanagari("the movie was really acchi thi!")
# -> "the movie was really अच्छी थी!"
Phonetic Matching
# Same word, different spellings
dhvani.are_same("bahut", "bohot") # True
dhvani.are_same("theek", "tik") # True
dhvani.are_same("yaar", "yr") # True
# Cross-script matching
dhvani.are_same("bahut", "बहुत") # True
dhvani.are_same("achaaaa", "अच्छा") # True (handles elongation)
# Different words correctly rejected
dhvani.are_same("bahut", "accha") # False
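One way to picture matching like this is to compare the IPA forms of the two inputs rather than their spellings. The following is a minimal sketch under stated assumptions, not dhvani's implementation: the `TO_IPA` table, the `similarity` and `same_word` helpers, and the `0.85` threshold are all hypothetical, and the scoring uses Python's standard `difflib`, which may differ from whatever dhvani uses internally.

```python
from difflib import SequenceMatcher

# Hypothetical toy table mapping spellings (in either script) to IPA.
TO_IPA = {
    "bahut": "bəɦʊt̪",
    "bohot": "bəɦʊt̪",
    "बहुत": "bəɦʊt̪",
    "accha": "ət͡ʃːʰaː",
}


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity between the IPA forms of a and b.

    Unknown words score 0.0 in this toy version.
    """
    ipa_a, ipa_b = TO_IPA.get(a), TO_IPA.get(b)
    if ipa_a is None or ipa_b is None:
        return 0.0
    return SequenceMatcher(None, ipa_a, ipa_b).ratio()


def same_word(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two spellings as the same word if their IPA forms are close enough."""
    return similarity(a, b) >= threshold


print(same_word("bahut", "bohot"))  # True: identical IPA
print(same_word("bahut", "बहुत"))   # True: cross-script, identical IPA
print(same_word("bahut", "accha"))  # False: IPA forms share little
```

Because both scripts map into the same IPA space, cross-script comparison needs no special case in this sketch.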
IPA Conversion
dhvani.to_ipa("kaise ho bhai")
# -> "kɛːseː ɦoː bʱaːiː"
dhvani.to_ipa("bahut accha")
# -> "bəɦʊt̪ ət͡ʃːʰaː"
Language Identification
dhvani.identify_languages("the movie was really acchi thi")
# -> [("the", "en"), ("movie", "en"), ("was", "en"),
#     ("really", "en"), ("acchi", "hi"), ("thi", "hi")]
# Context-aware: "are" resolves differently based on neighbors
dhvani.identify_languages("are you kidding me")
# -> all English
dhvani.identify_languages("are bhai kya kar raha hai")
# -> "are" tagged as Hindi (अरे)
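The context-aware behaviour shown for "are" can be approximated with a two-pass tagger: tag the unambiguous tokens first, then let nearby tags vote on the ambiguous ones. This is an illustrative sketch only; the vocabularies, the `tag_languages` name, and the `window` parameter are assumptions, and dhvani's actual language identifier may work quite differently.

```python
from typing import List, Optional, Tuple

# Hypothetical toy vocabularies; anything not listed defaults to English here.
HINDI = {"bhai", "kya", "kar", "raha", "hai", "acchi", "thi"}
AMBIGUOUS = {"are"}  # English "are" vs Hindi "अरे"


def tag_languages(text: str, window: int = 2) -> List[Tuple[str, str]]:
    tokens = text.lower().split()

    # First pass: decide what the vocabularies can decide; None means undecided.
    tags: List[Optional[str]] = []
    for tok in tokens:
        if tok in AMBIGUOUS:
            tags.append(None)
        elif tok in HINDI:
            tags.append("hi")
        else:
            tags.append("en")

    # Second pass: resolve undecided tokens by majority vote of decided neighbors.
    resolved = list(tags)
    for i, tag in enumerate(tags):
        if tag is None:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            neighbors = [t for t in tags[lo:hi] if t is not None]
            is_hindi = neighbors.count("hi") > neighbors.count("en")
            resolved[i] = "hi" if is_hindi else "en"

    return [(tok, tag or "en") for tok, tag in zip(tokens, resolved)]


print(tag_languages("are you kidding me")[0])         # ('are', 'en')
print(tag_languages("are bhai kya kar raha hai")[0])  # ('are', 'hi')
```

Ties fall back to English in this sketch, which is an arbitrary choice made for brevity.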
CLI
dhvani devanagari "bohot accha movie thi yaar"
# बहुत अच्छा movie थी यार
dhvani ipa "kaise ho bhai"
# kɛːseː ɦoː bʱaːiː
dhvani same "bahut" "bohot"
# True (similarity: 1.00)
How It Works
All variant spellings of a Hindi word are attempts to write the same sound. dhvani uses IPA (International Phonetic Alphabet) as a universal bridge between them:
"bahut"    ─┐
"bohot"    ─┤
"boht"     ─┼──> /bəɦʊt̪/ ──> बहुत
"bhot"     ─┤
"bahotttt" ─┘
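The diagram can be read as two lookups chained together: Romanized spelling to IPA, then IPA to Devanagari. The sketch below illustrates that chain with toy data. The tables, `collapse_elongation`, and `normalize` are hypothetical names introduced here, and the tables contain only the spellings from the diagram.

```python
import re

# Hypothetical toy tables; the real lexicon is far larger.
ROMAN_TO_IPA = {
    "bahut": "bəɦʊt̪",
    "bohot": "bəɦʊt̪",
    "boht": "bəɦʊt̪",
    "bhot": "bəɦʊt̪",
    "bahot": "bəɦʊt̪",
}
IPA_TO_DEVANAGARI = {"bəɦʊt̪": "बहुत"}


def collapse_elongation(word: str) -> str:
    """Collapse runs of three or more identical characters to one."""
    return re.sub(r"(.)\1{2,}", r"\1", word)


def normalize(word: str) -> str:
    """Map a Romanized spelling to Devanagari via IPA; return the input on a miss."""
    key = collapse_elongation(word.lower())
    ipa = ROMAN_TO_IPA.get(key)
    if ipa is None:
        return word
    return IPA_TO_DEVANAGARI.get(ipa, word)


for spelling in ["bahut", "bohot", "boht", "bhot", "bahotttt"]:
    print(spelling, "->", normalize(spelling))  # every line ends in बहुत
```

Keeping IPA as the middle key means a new spelling variant only needs one new entry on the Roman side; the Devanagari side stays untouched.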
Architecture
| Tier | Method | Latency | When used |
|---|---|---|---|
| 1 | Lexicon lookup (1M+ entries) | <1ms | ~99% of words |
| 2 | AI model (IndicXlit + epitran) | ~4s | Rare/novel words |
| 3 | Rule-based G2P | <1ms | Fallback (no deps) |
The lexicon was built from Hindi Wikipedia (50K articles), IITB parallel corpus (500K sentences), and MASSIVE/XNLI datasets, generating 10 romanized spelling variants per word via IPA-to-Roman rules.
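The tiered design in the table amounts to a fallback chain: try the cheapest source first and only move on when it has no answer. Here is a minimal sketch of that control flow. `make_tiered_lookup` and its parameters are hypothetical, the model tier is represented by an optional callable that returns `None` when it cannot answer, and the rule function is a placeholder rather than a real G2P.

```python
from typing import Callable, Mapping, Optional, Tuple

Transliterator = Callable[[str], Optional[str]]


def make_tiered_lookup(
    lexicon: Mapping[str, str],
    model: Optional[Transliterator],
    rules: Callable[[str], str],
) -> Callable[[str], Tuple[str, int]]:
    """Return a function that tries each tier in order and reports which one answered."""

    def lookup(word: str) -> Tuple[str, int]:
        # Tier 1: direct lexicon hit.
        hit = lexicon.get(word)
        if hit is not None:
            return hit, 1
        # Tier 2: optional model, skipped entirely when not installed.
        if model is not None:
            guess = model(word)
            if guess is not None:
                return guess, 2
        # Tier 3: rules always produce something.
        return rules(word), 3

    return lookup


# Toy stand-ins: a two-entry lexicon, no model, and a placeholder rule function
# that only marks the word as unresolved.
lookup = make_tiered_lookup(
    lexicon={"bahut": "बहुत", "accha": "अच्छा"},
    model=None,
    rules=lambda w: f"<{w}>",
)
print(lookup("bahut"))  # ('बहुत', 1)
print(lookup("zzyzx"))  # ('<zzyzx>', 3)
```

Returning the tier number alongside the result makes it easy to measure how often each tier is actually hit on a given corpus.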
Preprocessing Pipeline
Before lookup, input goes through:
- Punctuation stripping (preserved and reattached after conversion)
- Repeated character collapsing ("bohotttt" -> "bohot")
- Double consonant fallback (tries the single-consonant form if the doubled spelling is not in the lexicon)
- Context-aware language ID (disambiguates words like "are", "the", "bus")
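The first three steps above can be sketched as a small per-token routine. This is one plausible ordering, not necessarily the one dhvani uses: here runs of three or more characters are first squeezed to two, and doubled letters are reduced to one only if the squeezed form misses. `LEXICON`, `split_punctuation`, `candidates`, and `convert_token` are hypothetical names.

```python
import re
import string
from typing import Iterator, Mapping, Tuple

# Hypothetical toy lexicon.
LEXICON = {"bohot": "बहुत", "acha": "अच्छा", "yaar": "यार"}


def split_punctuation(token: str) -> Tuple[str, str, str]:
    """Split a token into (leading punctuation, core, trailing punctuation)."""
    stripped_left = token.lstrip(string.punctuation)
    lead = token[: len(token) - len(stripped_left)]
    core = stripped_left.rstrip(string.punctuation)
    trail = stripped_left[len(core):]
    return lead, core, trail


def candidates(word: str) -> Iterator[str]:
    """Yield progressively more collapsed spellings of word."""
    squeezed = re.sub(r"(.)\1{2,}", r"\1\1", word)  # runs of 3+ become 2
    yield squeezed
    single = re.sub(r"(.)\1+", r"\1", squeezed)     # doubled letters become 1
    if single != squeezed:
        yield single


def convert_token(token: str, lexicon: Mapping[str, str] = LEXICON) -> str:
    """Look up a token, reattaching its punctuation; return it unchanged on a miss."""
    lead, core, trail = split_punctuation(token)
    for cand in candidates(core.lower()):
        if cand in lexicon:
            return lead + lexicon[cand] + trail
    return token


print(convert_token("bohotttt!"))  # बहुत!
print(convert_token("yaaaar"))     # यार
print(convert_token("movie"))      # movie
```

Squeezing to two before one keeps legitimately doubled spellings such as "yaar" reachable, while still letting "bohotttt" fall through to "bohot".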
Use Cases
Search & Retrieval -- Index Hinglish content once, find it regardless of spelling. A search for "accha" finds posts containing "achha", "acha", "achaa".
Sentiment Analysis -- Normalize text before classification. Spelling variants of sentiment words ("bakwas", "bakwaas", "bakwass") all resolve to the same form.
Content Moderation -- Detect abusive content regardless of spelling obfuscation.
Preprocessing for LLMs -- Reduce vocabulary size and improve tokenization for Hindi/Hinglish fine-tuning.
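For the search use case, the idea is to index and query by normalized form rather than raw spelling. The sketch below shows this with a pluggable `normalize` function; `CANONICAL`, `toy_normalize`, `build_index`, and `search` are hypothetical, and the toy normalizer stands in for a real one so the example runs without dhvani installed.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Set

# Hypothetical toy normalizer standing in for a real one.
CANONICAL = {"accha": "अच्छा", "achha": "अच्छा", "acha": "अच्छा", "achaa": "अच्छा"}


def toy_normalize(token: str) -> str:
    token = token.lower()
    return CANONICAL.get(token, token)


def build_index(
    posts: Sequence[str], normalize: Callable[[str], str]
) -> Dict[str, Set[int]]:
    """Map each normalized token to the set of post indices containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for post_id, text in enumerate(posts):
        for token in text.split():
            index[normalize(token)].add(post_id)
    return index


def search(
    index: Dict[str, Set[int]], query: str, normalize: Callable[[str], str]
) -> List[int]:
    """Return sorted indices of posts whose normalized tokens match the query."""
    return sorted(index.get(normalize(query), set()))


posts = ["movie achha tha", "acha laga", "bakwas film"]
index = build_index(posts, toy_normalize)
print(search(index, "accha", toy_normalize))  # [0, 1]
```

With dhvani installed, a function such as `dhvani.to_devanagari` could presumably be passed as `normalize`, though that pairing is an assumption and has not been checked here.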
Performance
- 1,072,153 lexicon entries
- <1ms per word (lexicon hit)
- ~2s cold start (lexicon load), then instant
- No model needed at inference for lexicon hits and the rule-based fallback (pure lookup + rules)
- Tested on Cardiff Hindi Tweet Sentiment dataset: +1.2% macro F1 improvement over raw text
Research
Built on findings from IPA-GPT research at Ohio State University, which demonstrated that phonetic (IPA) representations enable significant cross-lingual transfer improvements for script-divergent languages like Hindi-Urdu.
License
MIT
Links
- PyPI: dhvani
- GitHub: Kkoundinyaa/dhwani
- Author: Krishna Badikela, Ohio State University
File details
Details for the file dhvani-0.2.3.tar.gz.
File metadata
- Download URL: dhvani-0.2.3.tar.gz
- Upload date:
- Size: 10.0 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2111bbcf6aa2eeeeb2be491c5a7147135d1741a8133c5979d08bef667f3fb59c |
| MD5 | bf9753e20dff927b73066da37984098c |
| BLAKE2b-256 | c99fe7afec462554903fd4845c1d819aa3b79670c4461123e0ea9bfd7fc5783c |
File details
Details for the file dhvani-0.2.3-py3-none-any.whl.
File metadata
- Download URL: dhvani-0.2.3-py3-none-any.whl
- Upload date:
- Size: 10.1 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8eb1736bed2002cc1964eef56c757886e8de7cb651caedcf3da2386517c5830f |
| MD5 | 55627ee7aff507af81acebdf9502c453 |
| BLAKE2b-256 | e7c6852a3b466900fdf0a2a29ba12a648c966680ae1c477886af2069a7f2d607 |