Phonetic normalization for Hinglish text. Resolves spelling variation in Romanized Hindi using IPA as a bridge representation.
dhwani
dhwani (ध्वनि = "sound") treats "bahut", "bohot", "boht", and "bhot" as spellings of the same word. It normalizes the spelling variation of Romanized Hindi into a canonical form that downstream tools can work with.
Why?
600M+ Indians write online in Hinglish (Hindi in Latin script mixed with English). But there's no standard spelling:
"बहुत" gets written as: bahut, bohot, boht, bhot, bahot
"अच्छा" gets written as: accha, achha, acha, achaa
"कैसे" gets written as: kaise, kese, kayse
Most NLP tools treat each spelling as a different word and break on this. dhwani fixes it.
Install
pip install git+https://github.com/Kkoundinyaa/dhwani.git
For higher accuracy on rare words (optional):
pip install "dhwani[models] @ git+https://github.com/Kkoundinyaa/dhwani.git"
Usage
import dhwani
# Check if two words are the same (variant spellings)
dhwani.are_same("bahut", "bohot") # True
dhwani.are_same("accha", "achha") # True
dhwani.are_same("bahut", "accha") # False
# Convert Hinglish to Devanagari
dhwani.to_devanagari("bohot accha movie thi yaar")
# -> "बहुत अच्छा movie थी यार"
# Convert to IPA (phonetic representation)
dhwani.to_ipa("kaise ho bhai")
# -> "kɛːseː ɦoː bʱaːiː"
# Word-level language identification
dhwani.identify_languages("ye movie really acchi thi bro")
# -> [("ye", "hi"), ("movie", "en"), ("really", "en"), ("acchi", "hi"), ("thi", "hi"), ("bro", "hi")]
# Normalize text
dhwani.normalize("bohot acha movie thi")
# -> canonical normalized form
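One common use of an equivalence check like `are_same` is collapsing a word list into groups of spelling variants. The sketch below is illustrative only: `group_variants` and `toy_same` are hypothetical helpers, not part of dhwani, and `toy_same` is a crude consonant-skeleton comparison standing in for the library's phonetic matching so that the example runs without dhwani installed.

```python
from typing import Callable, List


def skeleton(word: str) -> str:
    """Drop vowels and collapse repeated letters (a crude stand-in key)."""
    out: List[str] = []
    for ch in word.lower():
        if ch in "aeiou":
            continue
        if out and out[-1] == ch:
            continue
        out.append(ch)
    return "".join(out)


def toy_same(a: str, b: str) -> bool:
    """Hypothetical stand-in for an equivalence check; NOT dhwani's algorithm."""
    return skeleton(a) == skeleton(b)


def group_variants(words: List[str], same: Callable[[str, str], bool]) -> List[List[str]]:
    """Group words so that each group holds spellings judged equivalent by `same`."""
    groups: List[List[str]] = []
    for word in words:
        for group in groups:
            # Compare against the first member of each existing group.
            if same(group[0], word):
                group.append(word)
                break
        else:
            groups.append([word])
    return groups


groups = group_variants(["bahut", "accha", "bohot", "achha", "boht"], toy_same)
print(groups)
```

With dhwani installed, passing `dhwani.are_same` in place of `toy_same` should group words by the library's own phonetic comparison instead of this toy key.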
CLI
dhwani devanagari "bohot accha movie thi yaar"
# बहुत अच्छा movie थी यार
dhwani ipa "kaise ho bhai"
# kɛːseː ɦoː bʱaːiː
dhwani same "bahut" "bohot"
# True (phonetic similarity: 1.00)
dhwani langs "ye movie bohot acchi thi"
# ye[hi] movie[en] bohot[hi] acchi[hi] thi[hi]
How It Works
dhwani routes through IPA (International Phonetic Alphabet) as a bridge representation. Variant spellings of a word stand for the same pronunciation, so they map to the same IPA string:
"bahut" ─┐
"bohot" ─┤──> /bəɦʊt̪/ ──> बहुत
"boht" ─┤
"bhot" ─┘
Three-tier architecture for speed:
| Tier | Method | Latency | Coverage |
|---|---|---|---|
| 1 | Lexicon lookup (151K entries) | 0.001 ms | ~95% of common words |
| 2 | AI model (IndicXlit + epitran) | ~4 s | Handles anything |
| 3 | Rule-based G2P | 0.005 ms | Always available |
A runtime cache sits alongside the tiers: words processed by Tier 2 are cached permanently, so repeated words skip the slow model and the library gets faster over time.
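The tiering and cache described above can be sketched as a small resolver class. This is a sketch under stated assumptions, not dhwani's code: it assumes the tiers are tried in numeric order, that Tier 3 is the fallback when the optional model is absent or returns nothing, and it uses an in-memory dict where the library's cache is described as permanent. `TieredResolver` and its parameters are hypothetical names.

```python
from typing import Callable, Dict, Optional


class TieredResolver:
    """Toy three-tier lookup with a cache for slow-model results."""

    def __init__(
        self,
        lexicon: Dict[str, str],
        rules: Callable[[str], str],
        slow_model: Optional[Callable[[str], Optional[str]]] = None,
    ) -> None:
        self.lexicon = lexicon        # Tier 1: precomputed entries
        self.slow_model = slow_model  # Tier 2: optional, expensive
        self.rules = rules            # Tier 3: always available
        self.cache: Dict[str, str] = {}  # in-memory here (assumption)

    def resolve(self, word: str) -> str:
        key = word.lower()
        # Tier 1: fast dictionary lookup.
        if key in self.lexicon:
            return self.lexicon[key]
        # Cached Tier 2 answers avoid calling the slow model again.
        if key in self.cache:
            return self.cache[key]
        # Tier 2: only when the optional model is available.
        if self.slow_model is not None:
            result = self.slow_model(key)
            if result is not None:
                self.cache[key] = result
                return result
        # Tier 3: rule-based fallback.
        return self.rules(key)


model_calls = []


def fake_model(word: str) -> Optional[str]:
    """Stand-in for an expensive model; records each call."""
    model_calls.append(word)
    return word.upper()


resolver = TieredResolver({"bahut": "बहुत"}, rules=lambda w: "?" + w, slow_model=fake_model)
print(resolver.resolve("bahut"), resolver.resolve("xyz"), resolver.resolve("xyz"))
print(model_calls)
```

The second `resolve("xyz")` call is served from the cache, so the stand-in model is invoked only once for that word.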
Features
- Phonetic equivalence: Detect if two words are the same regardless of spelling
- Transliteration: Romanized Hindi to Devanagari (and back)
- IPA conversion: Any Hindi text (Roman or Devanagari) to IPA
- Language ID: Word-level Hindi/English classification in mixed text
- Zero dependencies for basic use (lexicon + rules)
- 151K-word lexicon built from real Hindi corpora
- Runtime learning: Tier 2 results are cached, so repeated words resolve faster the more you use it
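To make the word-level language ID feature concrete, here is a toy tagger based on word-list membership. The source does not say how dhwani classifies words, so this is only an assumed baseline for illustration; `HINDI_WORDS`, `ENGLISH_WORDS`, and `tag_languages` are hypothetical names, and the word lists are tiny samples.

```python
from typing import List, Tuple

# Tiny hand-picked word lists, for illustration only.
HINDI_WORDS = {"ye", "bohot", "acchi", "thi"}
ENGLISH_WORDS = {"movie", "really"}


def tag_languages(text: str) -> List[Tuple[str, str]]:
    """Tag each whitespace-separated token as 'hi', 'en', or 'unk'."""
    tags: List[Tuple[str, str]] = []
    for token in text.split():
        word = token.lower()
        if word in HINDI_WORDS:
            lang = "hi"
        elif word in ENGLISH_WORDS:
            lang = "en"
        else:
            lang = "unk"  # not in either list
        tags.append((token, lang))
    return tags


print(tag_languages("ye movie bohot acchi thi"))
```

A membership test like this cannot handle words that appear in neither list or in both, which is where a phonetic or model-based classifier would be needed.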
Research
Built on findings from IPA-GPT research at Ohio State University, which showed that phonetic (IPA) representations dramatically improve cross-lingual NLP for script-divergent languages like Hindi-Urdu.
License
MIT