dhvani

Phonetic normalization for Hinglish text.

dhvani resolves the spelling chaos of Romanized Hindi. It knows that "bahut", "bohot", "boht", and "bhot" are all the same word, and normalizes them to a canonical form using IPA as a bridge representation.

pip install dhvani

import dhvani

dhvani.to_devanagari("bohotttt achaaa movie thi yaar")
# -> "बहुत अच्छा movie थी यार"

dhvani.are_same("bahut", "bohot")   # True
dhvani.are_same("bahut", "बहुत")    # True (cross-script)

dhvani.to_ipa("kaise ho bhai")
# -> "kɛːseː ɦoː bʱaːiː"

The Problem

600M+ Indians write online in Hinglish (Hindi in Latin script, mixed with English). There is no standardized spelling:

| Word | Variants typed online |
|---|---|
| बहुत (very) | bahut, bohot, boht, bhot, bahot, bht, bhaut |
| अच्छा (good) | accha, achha, acha, achaa, aacha |
| कैसे (how) | kaise, kese, kayse, kse |

This breaks search, sentiment analysis, content moderation, and any other NLP tool that matches on exact spelling. dhvani fixes it.


Install

pip install dhvani

That's it. No model downloads, no API keys, no GPU needed. The 1M+ word lexicon ships with the package.


Usage

Transliteration

import dhvani

# Handles messy social media text
dhvani.to_devanagari("kya karra h tu")
# -> "क्या कर रहा है तू"

# Handles elongated text
dhvani.to_devanagari("bohotttt achaaa yaaaar")
# -> "बहुत अच्छा यार"

# Preserves English words and punctuation
dhvani.to_devanagari("the movie was really acchi thi!")
# -> "the movie was really अच्छी थी!"

Phonetic Matching

# Same word, different spellings
dhvani.are_same("bahut", "bohot")     # True
dhvani.are_same("theek", "tik")       # True
dhvani.are_same("yaar", "yr")         # True

# Cross-script matching
dhvani.are_same("bahut", "बहुत")      # True
dhvani.are_same("achaaaa", "अच्छा")   # True (handles elongation)

# Different words correctly rejected
dhvani.are_same("bahut", "accha")     # False
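
Pairwise matching can also be used to cluster a list of spellings. The sketch below is not part of dhvani: `group_variants` is a hypothetical helper that accepts any two-argument predicate, and `toy_same` is a crude stand-in (it compares consonant skeletons) so the example runs without the package installed. It assumes the predicate behaves like an equivalence relation, so each word is compared against only one representative per group.

```python
from typing import Callable, Iterable, List


def group_variants(words: Iterable[str],
                   same: Callable[[str, str], bool]) -> List[List[str]]:
    """Greedily cluster words: each word joins the first group whose
    representative (its first member) matches it under `same`."""
    groups: List[List[str]] = []
    for word in words:
        for group in groups:
            if same(group[0], word):
                group.append(word)
                break
        else:
            # No existing group matched, so start a new one.
            groups.append([word])
    return groups


def skeleton(word: str) -> str:
    """Drop vowels and collapse adjacent repeated letters (toy heuristic)."""
    out: List[str] = []
    for ch in word.lower():
        if ch in "aeiou":
            continue
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)


def toy_same(a: str, b: str) -> bool:
    """Stand-in predicate; far cruder than IPA-based matching."""
    return skeleton(a) == skeleton(b)


print(group_variants(["bahut", "accha", "bohot", "acha"], toy_same))
# [['bahut', 'bohot'], ['accha', 'acha']]
```

With dhvani installed, the same helper could presumably be called as `group_variants(words, dhvani.are_same)`.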

IPA Conversion

dhvani.to_ipa("kaise ho bhai")
# -> "kɛːseː ɦoː bʱaːiː"

dhvani.to_ipa("bahut accha")
# -> "bəɦʊt̪ ət͡ʃːʰaː"

Language Identification

dhvani.identify_languages("the movie was really acchi thi")
# -> [("the", "en"), ("movie", "en"), ("was", "en"),
#     ("really", "en"), ("acchi", "hi"), ("thi", "hi")]

# Context-aware: "are" resolves differently based on neighbors
dhvani.identify_languages("are you kidding me")
# -> all English

dhvani.identify_languages("are bhai kya kar raha hai")
# -> "are" tagged as Hindi (अरे)
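
The README does not describe how the context-aware step is implemented. As a rough illustration of the idea only, the sketch below resolves an ambiguous word by a majority vote over its already-tagged neighbours. The word lists, the window size, the default of "en" for unknown words, and the tie-break toward English are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

# Tiny hypothetical word lists, just enough for the two examples above.
HINDI = {"bhai", "kya", "kar", "raha", "hai", "thi", "acchi"}
AMBIGUOUS = {"are", "bus"}


def toy_identify(text: str, window: int = 2) -> List[Tuple[str, str]]:
    words = text.lower().split()

    # First pass: tag unambiguous words; leave ambiguous ones as None.
    tags: List[Optional[str]] = []
    for w in words:
        if w in AMBIGUOUS:
            tags.append(None)
        elif w in HINDI:
            tags.append("hi")
        else:
            tags.append("en")  # assumption: unknown words default to English

    # Second pass: each ambiguous word takes the majority tag of its neighbours.
    for i, tag in enumerate(tags):
        if tag is None:
            nearby = tags[max(0, i - window):i] + tags[i + 1:i + 1 + window]
            tags[i] = "hi" if nearby.count("hi") > nearby.count("en") else "en"

    return [(w, t) for w, t in zip(words, tags)]


print(toy_identify("are you kidding me")[0])         # ('are', 'en')
print(toy_identify("are bhai kya kar raha hai")[0])  # ('are', 'hi')
```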

CLI

dhvani devanagari "bohot accha movie thi yaar"
# बहुत अच्छा movie थी यार

dhvani ipa "kaise ho bhai"
# kɛːseː ɦoː bʱaːiː

dhvani same "bahut" "bohot"
# True (similarity: 1.00)

How It Works

All variant spellings of a Hindi word stand for the same pronunciation, so dhvani uses IPA (International Phonetic Alphabet) as a universal bridge:

"bahut"  ─┐
"bohot"  ─┤
"boht"   ─┼──> /bəɦʊt̪/ ──> बहुत
"bhot"   ─┤
"bahotttt"─┘

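The bridge idea in the diagram above can be mimicked with two small tables: one from spellings to IPA, one from IPA to Devanagari. This is a toy sketch, not dhvani's lexicon or code; the `toy_*` names and the four-entry table are made up for illustration, and the IPA string is the one shown in the diagram.

```python
from typing import Optional

# Spelling -> IPA (entries copied from the diagram above).
ROMAN_TO_IPA = {
    "bahut": "bəɦʊt̪",
    "bohot": "bəɦʊt̪",
    "boht": "bəɦʊt̪",
    "bhot": "bəɦʊt̪",
}
# IPA -> canonical Devanagari form, and the reverse for cross-script input.
IPA_TO_DEVANAGARI = {"bəɦʊt̪": "बहुत"}
DEVANAGARI_TO_IPA = {dev: ipa for ipa, dev in IPA_TO_DEVANAGARI.items()}


def toy_to_ipa(word: str) -> Optional[str]:
    """Look the word up as a Roman spelling first, then as Devanagari."""
    return ROMAN_TO_IPA.get(word.lower()) or DEVANAGARI_TO_IPA.get(word)


def toy_are_same(a: str, b: str) -> bool:
    """Two words match when both map to the same known IPA string."""
    ipa_a, ipa_b = toy_to_ipa(a), toy_to_ipa(b)
    return ipa_a is not None and ipa_a == ipa_b


def toy_to_devanagari(word: str) -> str:
    """Go spelling -> IPA -> Devanagari; unknown words pass through."""
    return IPA_TO_DEVANAGARI.get(toy_to_ipa(word), word)


print(toy_are_same("bahut", "bohot"))  # True
print(toy_are_same("bahut", "बहुत"))   # True
print(toy_to_devanagari("bhot"))       # बहुत
print(toy_to_devanagari("movie"))      # movie
```

Because both scripts meet at the IPA key, cross-script matching needs no extra logic beyond the reverse table.
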
Architecture

| Tier | Method | Latency | When used |
|---|---|---|---|
| 1 | Lexicon lookup (1M+ entries) | <1ms | ~99% of words |
| 2 | AI model (IndicXlit + epitran) | ~4s | Rare/novel words |
| 3 | Rule-based G2P | <1ms | Fallback (no deps) |

The lexicon was built from Hindi Wikipedia (50K articles), the IITB parallel corpus (500K sentences), and the MASSIVE and XNLI datasets, with 10 romanized spelling variants generated per word via IPA-to-Roman rules.
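
The three tiers amount to a chain of fallbacks. The sketch below shows that control flow under stated assumptions: `make_resolver`, the `model` callable, and the `rules` callable are hypothetical names, and the stubs only stand in for the real lexicon, the IndicXlit/epitran model, and the rule-based G2P.

```python
from typing import Callable, Dict, Optional


def make_resolver(lexicon: Dict[str, str],
                  model: Optional[Callable[[str], Optional[str]]],
                  rules: Callable[[str], str]) -> Callable[[str], str]:
    """Build a function that tries each tier in order."""
    def resolve(word: str) -> str:
        # Tier 1: dictionary lookup, the fast path.
        hit = lexicon.get(word)
        if hit is not None:
            return hit
        # Tier 2: optional model; skipped entirely when none is configured.
        if model is not None:
            guess = model(word)
            if guess is not None:
                return guess
        # Tier 3: rule-based fallback that always returns something.
        return rules(word)
    return resolve


# Placeholder tiers for demonstration only.
stub_lexicon = {"bahut": "बहुत"}
stub_model = lambda w: "यार" if w == "yaar" else None
stub_rules = lambda w: w.upper()

resolve = make_resolver(stub_lexicon, stub_model, stub_rules)
print(resolve("bahut"))  # बहुत  (tier 1)
print(resolve("yaar"))   # यार   (tier 2)
print(resolve("xyz"))    # XYZ   (tier 3)
```

Passing `model=None` gives a dependency-free resolver that goes straight from the lexicon to the rules.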

Preprocessing Pipeline

Before lookup, input goes through:

  1. Punctuation stripping (preserved and reattached after conversion)
  2. Repeated character collapsing ("bohotttt" -> "bohot")
  3. Double consonant fallback (tries collapsed form if double misses)
  4. Context-aware language ID (disambiguates words like "are", "the", "bus")
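
Steps 1 to 3 above can be sketched with a few regular expressions. This is an illustrative approximation, not dhvani's code: the helper names are hypothetical, treating a run of three or more identical characters as elongation is an assumption, and the two-entry lexicon is a stand-in.

```python
import re
from typing import Dict, List, Tuple

_RUN = re.compile(r"(.)\1{2,}")   # three or more of the same character
_DOUBLE = re.compile(r"(.)\1")    # a doubled character


def split_punctuation(token: str) -> Tuple[str, str, str]:
    """Step 1: peel off leading/trailing punctuation for later reattachment."""
    m = re.match(r"^(\W*)(.*?)(\W*)$", token, re.DOTALL)
    return m.group(1), m.group(2), m.group(3)


def candidates(word: str) -> List[str]:
    """Steps 2 and 3: lookup keys to try, in order, without duplicates."""
    word = word.lower()
    collapsed = _RUN.sub(r"\1", word)          # "bohotttt" -> "bohot"
    undoubled = _DOUBLE.sub(r"\1", collapsed)  # "accha" -> "acha"
    return list(dict.fromkeys([word, collapsed, undoubled]))


def normalize_token(token: str, lexicon: Dict[str, str]) -> str:
    lead, core, trail = split_punctuation(token)
    for cand in candidates(core):
        if cand in lexicon:
            return lead + lexicon[cand] + trail
    return token  # no hit: leave the token untouched


toy_lexicon = {"bohot": "बहुत", "acha": "अच्छा"}
print(normalize_token("bohotttt!", toy_lexicon))  # बहुत!
print(normalize_token("(accha)", toy_lexicon))    # (अच्छा)
print(normalize_token("movie", toy_lexicon))      # movie
```

Trying the original spelling first means a legitimate double consonant is only collapsed when the exact form misses, which matches the fallback order described in step 3.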

Use Cases

Search & Retrieval -- Index Hinglish content once, find it regardless of spelling. A search for "accha" finds posts containing "achha", "acha", "achaa".
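
A minimal sketch of spelling-insensitive search: index each post under the normalized form of its tokens, then normalize the query the same way. `build_index`, `search`, and `toy_normalize` are hypothetical helpers, and the normalizer here is a hand-written table rather than dhvani.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set


def build_index(posts: Dict[int, str],
                normalize: Callable[[str], str]) -> Dict[str, Set[int]]:
    """Map each normalized token to the ids of the posts containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for post_id, text in posts.items():
        for token in text.lower().split():
            index[normalize(token)].add(post_id)
    return index


def search(index: Dict[str, Set[int]], query: str,
           normalize: Callable[[str], str]) -> List[int]:
    """Look up the normalized query; .get avoids creating empty entries."""
    return sorted(index.get(normalize(query.lower()), set()))


# Stand-in normalizer: a tiny variant table, identity for everything else.
VARIANTS = {"achha": "accha", "acha": "accha", "achaa": "accha"}
toy_normalize = lambda token: VARIANTS.get(token, token)

posts = {1: "movie achha tha", 2: "acha nahi laga", 3: "bakwas movie"}
index = build_index(posts, toy_normalize)
print(search(index, "accha", toy_normalize))  # [1, 2]
print(search(index, "movie", toy_normalize))  # [1, 3]
```

In practice a function such as `dhvani.to_devanagari` could be passed as `normalize`, assuming it accepts single tokens.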

Sentiment Analysis -- Normalize text before classification. Spelling variants of sentiment words ("bakwas", "bakwaas", "bakwass") all resolve to the same form.

Content Moderation -- Detect abusive content regardless of spelling obfuscation.

Preprocessing for LLMs -- Reduce vocabulary size and improve tokenization for Hindi/Hinglish fine-tuning.


Performance

  • 1,072,153 lexicon entries
  • <1ms per word (lexicon hit)
  • ~2s cold start (lexicon load), then instant
  • No model needed at inference for words that hit the lexicon or the rule-based fallback (pure lookup + rules)
  • Tested on Cardiff Hindi Tweet Sentiment dataset: +1.2% macro F1 improvement over raw text

Research

Built on findings from IPA-GPT research at Ohio State University, which demonstrated that phonetic (IPA) representations enable significant cross-lingual transfer improvements for script-divergent languages like Hindi-Urdu.


License

MIT

