Phonetic normalization for Hinglish text. Resolves spelling variation in Romanized Hindi using IPA as a bridge representation.

dhvani

dhvani resolves the spelling chaos of Romanized Hindi. It knows that "bahut", "bohot", "boht", and "bhot" are all the same word, and normalizes them to a canonical form using IPA as a bridge representation.

pip install dhvani

import dhvani

dhvani.to_devanagari("bohotttt achaaa movie thi yaar")
# -> "बहुत अच्छा movie थी यार"

dhvani.are_same("bahut", "bohot")   # True
dhvani.are_same("bahut", "बहुत")    # True (cross-script)

dhvani.to_ipa("kaise ho bhai")
# -> "kɛːseː ɦoː bʱaːiː"

The Problem

600M+ Indians write online in Hinglish (Hindi in Latin script, mixed with English). There is no standardized spelling:

Word          | Variants typed online
------------- | ---------------------
बहुत (very)   | bahut, bohot, boht, bhot, bahot, bht, bhaut
अच्छा (good)  | accha, achha, acha, achaa, aacha
कैसे (how)    | kaise, kese, kayse, kse

This variation degrades search, sentiment analysis, content moderation, and any other NLP tool that matches on surface spelling. dhvani normalizes it away.


Install

pip install dhvani

That's it. No model downloads, no API keys, no GPU needed. The 1M+ word lexicon ships with the package.


Usage

Transliteration

import dhvani

# Handles messy social media text
dhvani.to_devanagari("kya karra h tu")
# -> "क्या कर रहा है तू"

# Handles elongated text
dhvani.to_devanagari("bohotttt achaaa yaaaar")
# -> "बहुत अच्छा यार"

# Preserves English words and punctuation
dhvani.to_devanagari("the movie was really acchi thi!")
# -> "the movie was really अच्छी थी!"

Phonetic Matching

# Same word, different spellings
dhvani.are_same("bahut", "bohot")     # True
dhvani.are_same("theek", "tik")       # True
dhvani.are_same("yaar", "yr")         # True

# Cross-script matching
dhvani.are_same("bahut", "बहुत")      # True
dhvani.are_same("achaaaa", "अच्छा")   # True (handles elongation)

# Different words correctly rejected
dhvani.are_same("bahut", "accha")     # False

IPA Conversion

dhvani.to_ipa("kaise ho bhai")
# -> "kɛːseː ɦoː bʱaːiː"

dhvani.to_ipa("bahut accha")
# -> "bəɦʊt̪ ət͡ʃːʰaː"

Language Identification

dhvani.identify_languages("the movie was really acchi thi")
# -> [("the", "en"), ("movie", "en"), ("was", "en"),
#     ("really", "en"), ("acchi", "hi"), ("thi", "hi")]

# Context-aware: "are" resolves differently based on neighbors
dhvani.identify_languages("are you kidding me")
# -> all English

dhvani.identify_languages("are bhai kya kar raha hai")
# -> "are" tagged as Hindi (अरे)
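
To make the idea of context-aware tagging concrete, here is a toy sketch. It is not dhvani's implementation: the word lists, the `identify_languages_toy` name, the two-token window, and the tie-breaking rule are all assumptions made for illustration. Unambiguous words are tagged first, then each ambiguous word takes the majority language of its neighbors.

```python
# Hypothetical word lists, just large enough for the two examples above.
HINDI = {"bhai", "kya", "kar", "raha", "hai", "thi", "acchi"}
ENGLISH = {"you", "kidding", "me", "the", "movie", "was", "really"}
AMBIGUOUS = {"are"}  # valid in both languages ("are" vs. अरे)


def identify_languages_toy(text, window=2):
    """Tag each token as 'hi', 'en', or 'und' (unknown), using neighbors
    to resolve ambiguous tokens."""
    tokens = text.lower().split()

    # First pass: tag only what can be decided without context.
    first = []
    for tok in tokens:
        if tok in AMBIGUOUS:
            first.append(None)
        elif tok in HINDI:
            first.append("hi")
        elif tok in ENGLISH:
            first.append("en")
        else:
            first.append("und")

    # Second pass: ambiguous tokens take the majority tag of nearby tokens.
    tagged = []
    for i, (tok, tag) in enumerate(zip(tokens, first)):
        if tag is None:
            neighbors = first[max(0, i - window):i] + first[i + 1:i + 1 + window]
            hi_votes = neighbors.count("hi")
            en_votes = neighbors.count("en")
            # Ties fall back to English; this is an arbitrary choice.
            tag = "hi" if hi_votes > en_votes else "en"
        tagged.append((tok, tag))
    return tagged
```

With these toy lists, "are" comes out as English in "are you kidding me" and as Hindi in "are bhai kya kar raha hai", which mirrors the behavior described above.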

CLI

dhvani devanagari "bohot accha movie thi yaar"
# बहुत अच्छा movie थी यार

dhvani ipa "kaise ho bhai"
# kɛːseː ɦoː bʱaːiː

dhvani same "bahut" "bohot"
# True (similarity: 1.00)

How It Works

Variant spellings of a Hindi word are different attempts to write the same sound. dhvani uses IPA (International Phonetic Alphabet) as a universal bridge:

"bahut"    ─┐
"bohot"    ─┤
"boht"     ─┼──> /bəɦʊt̪/ ──> बहुत
"bhot"     ─┤
"bahotttt" ─┘

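The bridge idea can be sketched with two small lookup tables: spelling to IPA, and IPA to Devanagari. Everything here is a toy assumption (the table contents and the names `TOY_TO_IPA`, `same_word`, and `group_by_sound` are hypothetical); only the two IPA strings are taken from the examples in this README.

```python
from collections import defaultdict

# IPA forms as shown in the examples in this README.
BAHUT = "bəɦʊt̪"
ACCHA = "ət͡ʃːʰaː"

# Hypothetical toy tables; the shipped lexicon is far larger.
TOY_TO_IPA = {
    "bahut": BAHUT, "bohot": BAHUT, "boht": BAHUT, "bhot": BAHUT, "बहुत": BAHUT,
    "accha": ACCHA, "achha": ACCHA, "acha": ACCHA, "अच्छा": ACCHA,
}
TOY_IPA_TO_DEVANAGARI = {BAHUT: "बहुत", ACCHA: "अच्छा"}


def same_word(a, b):
    """Two spellings match when both map to the same known IPA key."""
    key_a, key_b = TOY_TO_IPA.get(a), TOY_TO_IPA.get(b)
    return key_a is not None and key_a == key_b


def group_by_sound(words):
    """Group spellings by IPA key; unknown words form their own group."""
    groups = defaultdict(list)
    for word in words:
        groups[TOY_TO_IPA.get(word, word)].append(word)
    return dict(groups)
```

Because Devanagari and Roman spellings share one key, cross-script comparison needs no special case: `same_word("bahut", "बहुत")` goes through the same lookup as `same_word("bahut", "bohot")`.
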
Architecture

Tier | Method                         | Latency | When used
---- | ------------------------------ | ------- | ------------------
1    | Lexicon lookup (1M+ entries)   | <1ms    | ~99% of words
2    | AI model (IndicXlit + epitran) | ~4s     | Rare/novel words
3    | Rule-based G2P                 | <1ms    | Fallback (no deps)

The lexicon was built from Hindi Wikipedia (50K articles), IITB parallel corpus (500K sentences), and MASSIVE/XNLI datasets, generating 10 romanized spelling variants per word via IPA-to-Roman rules.
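
The tier table reads as a cascade: try the cheapest resolver first and fall through on a miss. The following is a minimal sketch of that control flow only, assuming each tier can be modeled as a function that returns a result or `None`. The names (`make_cascade`, `lexicon_tier`, `model_tier`, `rules_tier`) and the stub bodies are hypothetical and do not reflect dhvani's internals.

```python
from typing import Callable, Optional

Resolver = Callable[[str], Optional[str]]


def make_cascade(*tiers: Resolver) -> Resolver:
    """Combine resolvers so the first non-None answer wins."""
    def resolve(word: str) -> Optional[str]:
        for tier in tiers:
            result = tier(word)
            if result is not None:
                return result
        return None
    return resolve


# Hypothetical stand-ins for the three tiers.
TOY_LEXICON = {"bahut": "बहुत", "bohot": "बहुत"}


def lexicon_tier(word: str) -> Optional[str]:
    return TOY_LEXICON.get(word)  # dictionary lookup


def model_tier(word: str) -> Optional[str]:
    return None  # stub: behaves as if the optional model is unavailable


def rules_tier(word: str) -> Optional[str]:
    return f"<rules:{word}>"  # placeholder for rule-based output


resolve = make_cascade(lexicon_tier, model_tier, rules_tier)
```

Ordering the tiers by cost keeps the common case at dictionary-lookup speed, and a tier that is missing can simply return `None` so the next one takes over.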

Preprocessing Pipeline

Before lookup, input goes through:

  1. Punctuation stripping (preserved and reattached after conversion)
  2. Repeated character collapsing ("bohotttt" -> "bohot")
  3. Double consonant fallback (tries collapsed form if double misses)
  4. Context-aware language ID (disambiguates words like "are", "the", "bus")
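
Steps 1 to 3 above can be sketched with the standard library. This is an illustrative approximation, not dhvani's code: the helper names, the toy lexicon passed in by the caller, and the order in which collapsed forms are tried are all assumptions.

```python
import re
import string


def split_punctuation(token):
    """Split a token into (leading punctuation, core, trailing punctuation)."""
    core = token.strip(string.punctuation)  # ASCII punctuation only
    if not core:
        return "", "", token  # the token is punctuation only
    start = token.index(core)
    return token[:start], core, token[start + len(core):]


def collapse_repeats(word, keep=2):
    """Shorten any run of one character longer than `keep` down to `keep`."""
    return re.sub(r"(.)\1{%d,}" % keep, lambda m: m.group(1) * keep, word)


def candidates(word):
    """Yield lookup candidates: as typed, runs cut to 2, then runs cut to 1."""
    seen = set()
    for form in (word, collapse_repeats(word, keep=2), collapse_repeats(word, keep=1)):
        if form not in seen:
            seen.add(form)
            yield form


def normalize_token(token, lexicon):
    """Look up the first candidate found in `lexicon`, reattaching punctuation."""
    lead, core, trail = split_punctuation(token)
    for form in candidates(core.lower()):
        if form in lexicon:
            return lead + lexicon[form] + trail
    return token  # unknown words pass through unchanged
```

Trying the doubled form before the fully collapsed one means a lexicon entry such as "accha" is preferred when it exists, and "acha" is only tried if the double misses.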

Use Cases

Search & Retrieval -- Index Hinglish content once, find it regardless of spelling. A search for "accha" finds posts containing "achha", "acha", "achaa".

Sentiment Analysis -- Normalize text before classification. Spelling variants of sentiment words ("bakwas", "bakwaas", "bakwass") all resolve to the same form.

Content Moderation -- Detect abusive content regardless of spelling obfuscation.

Preprocessing for LLMs -- Reduce vocabulary size and improve tokenization for Hindi/Hinglish fine-tuning.
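
For the Search & Retrieval case, one possible shape is an inverted index keyed on normalized tokens. The sketch below takes the normalizer as a parameter so it runs without dhvani installed; `build_index`, `search`, and `toy_normalize` are hypothetical names, and the toy mapping stands in for a real normalizer.

```python
from collections import defaultdict


def build_index(docs, normalize):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in normalize(text).split():
            index[token].add(doc_id)
    return index


def search(index, query, normalize):
    """Return ids of documents containing every normalized query token."""
    tokens = normalize(query).split()
    if not tokens:
        return set()
    return set.intersection(*(index.get(t, set()) for t in tokens))


# Hypothetical stand-in normalizer for demonstration only.
TOY_CANONICAL = {"accha": "अच्छा", "achha": "अच्छा", "acha": "अच्छा", "achaa": "अच्छा"}


def toy_normalize(text):
    return " ".join(TOY_CANONICAL.get(w, w) for w in text.lower().split())


docs = {1: "movie achha tha", 2: "acha song", 3: "bakwas movie"}
index = build_index(docs, toy_normalize)
```

Here `search(index, "accha", toy_normalize)` finds documents 1 and 2 even though neither contains that exact spelling. Since the examples above show `dhvani.to_devanagari` taking a string and returning a string, it could plausibly be passed as `normalize` in place of the toy function.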


Performance

  • 1,072,153 lexicon entries
  • <1ms per word (lexicon hit)
  • ~2s cold start (lexicon load), then instant
  • No model needed at inference (pure lookup + rules)
  • Tested on Cardiff Hindi Tweet Sentiment dataset: +1.2% macro F1 improvement over raw text

Research

Built on findings from IPA-GPT research at Ohio State University, which demonstrated that phonetic (IPA) representations enable significant cross-lingual transfer improvements for script-divergent languages like Hindi-Urdu.


License

MIT

