dhvani

Phonetic normalization for Hinglish text.

dhvani resolves the spelling chaos of Romanized Hindi. It knows that "bahut", "bohot", "boht", and "bhot" are all the same word, and normalizes them to a canonical form using IPA as a bridge representation.

pip install dhvani

import dhvani

dhvani.to_devanagari("bohotttt achaaa movie thi yaar")
# -> "बहुत अच्छा movie थी यार"

dhvani.are_same("bahut", "bohot")   # True
dhvani.are_same("bahut", "बहुत")    # True (cross-script)

dhvani.to_ipa("kaise ho bhai")
# -> "kɛːseː ɦoː bʱaːiː"

The Problem

600M+ Indians write online in Hinglish (Hindi in Latin script, mixed with English). There is no standardized spelling:

| Word | Variants typed online |
|------|-----------------------|
| बहुत (very) | bahut, bohot, boht, bhot, bahot, bht, bhaut |
| अच्छा (good) | accha, achha, acha, achaa, aacha |
| कैसे (how) | kaise, kese, kayse, kse |

This breaks search, sentiment analysis, content moderation, and any other NLP tool that matches on surface spelling. dhvani fixes it by mapping every variant to one canonical form.


Install

pip install dhvani

That's it. No model downloads, no API keys, no GPU needed. The 1M+ word lexicon ships with the package.


Usage

Transliteration

import dhvani

# Handles messy social media text
dhvani.to_devanagari("kya karra h tu")
# -> "क्या कर रहा है तू"

# Handles elongated text
dhvani.to_devanagari("bohotttt achaaa yaaaar")
# -> "बहुत अच्छा यार"

# Preserves English words and punctuation
dhvani.to_devanagari("the movie was really acchi thi!")
# -> "the movie was really अच्छी थी!"

Phonetic Matching

# Same word, different spellings
dhvani.are_same("bahut", "bohot")     # True
dhvani.are_same("theek", "tik")       # True
dhvani.are_same("yaar", "yr")         # True

# Cross-script matching
dhvani.are_same("bahut", "बहुत")      # True
dhvani.are_same("achaaaa", "अच्छा")   # True (handles elongation)

# Different words correctly rejected
dhvani.are_same("bahut", "accha")     # False
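
A pairwise matcher like this can also be used to deduplicate a vocabulary. The following is a minimal sketch, not part of dhvani: `group_variants` and `toy_same` are hypothetical names, and `toy_same` is a crude consonant-skeleton comparison standing in for a real phonetic matcher so the example runs without any dependencies.

```python
import re


def toy_same(a, b):
    """Crude stand-in for a phonetic matcher (assumption, not dhvani's logic)."""
    def skeleton(word):
        no_vowels = re.sub(r"[aeiou]", "", word.lower())
        return re.sub(r"(.)\1+", r"\1", no_vowels)  # collapse repeated letters
    return skeleton(a) == skeleton(b)


def group_variants(words, same):
    """Put each word into the first group whose representative matches it."""
    groups = []
    for word in words:
        for group in groups:
            if same(group[0], word):
                group.append(word)
                break
        else:
            groups.append([word])  # no existing group matched
    return groups


print(group_variants(["bahut", "accha", "bohot", "acha", "bhot"], toy_same))
```

In real use, the `same` argument would presumably be a phonetic matcher such as the `are_same` function shown above rather than the toy predicate.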

IPA Conversion

dhvani.to_ipa("kaise ho bhai")
# -> "kɛːseː ɦoː bʱaːiː"

dhvani.to_ipa("bahut accha")
# -> "bəɦʊt̪ ət͡ʃːʰaː"
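
To illustrate why a shared IPA form is enough for matching, here is a small sketch that uses the two IPA strings above as lookup keys. The tables and the names `same_word` and `canonical` are hypothetical and exist only for this example; they say nothing about how dhvani stores its lexicon internally.

```python
# IPA strings copied from the to_ipa output shown above.
BAHUT_IPA = "bəɦʊt̪"
ACCHA_IPA = "ət͡ʃːʰaː"

# Hypothetical toy tables: every spelling (either script) points at one IPA key.
TO_IPA = {
    "bahut": BAHUT_IPA, "bohot": BAHUT_IPA, "boht": BAHUT_IPA,
    "bhot": BAHUT_IPA, "बहुत": BAHUT_IPA,
    "accha": ACCHA_IPA, "achha": ACCHA_IPA, "acha": ACCHA_IPA,
    "अच्छा": ACCHA_IPA,
}
FROM_IPA = {BAHUT_IPA: "बहुत", ACCHA_IPA: "अच्छा"}


def same_word(a, b):
    """Two spellings match when both are known and share an IPA key."""
    key_a, key_b = TO_IPA.get(a), TO_IPA.get(b)
    return key_a is not None and key_a == key_b


def canonical(word):
    """Map a spelling to its Devanagari form; unknown words pass through."""
    return FROM_IPA.get(TO_IPA.get(word), word)


print(same_word("bhot", "बहुत"), canonical("boht"), canonical("movie"))
```

Because both scripts resolve to the same key, cross-script comparison needs no special case in this sketch.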

Language Identification

dhvani.identify_languages("the movie was really acchi thi")
# -> [("the", "en"), ("movie", "en"), ("was", "en"),
#     ("really", "en"), ("acchi", "hi"), ("thi", "hi")]

# Context-aware: "are" resolves differently based on neighbors
dhvani.identify_languages("are you kidding me")
# -> all English

dhvani.identify_languages("are bhai kya kar raha hai")
# -> "are" tagged as Hindi (अरे)
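
One simple way to get this kind of context sensitivity is a neighbor vote. The sketch below is an assumption about how such disambiguation could work, not dhvani's actual method: the word sets are tiny hypothetical samples, unknown words default to English, and `tag_languages` is an illustrative name.

```python
# Hypothetical toy word lists for illustration only.
HINDI = {"bhai", "kya", "kar", "raha", "hai", "acchi", "thi"}
AMBIGUOUS = {"are", "the", "bus"}


def tag_languages(text, window=2):
    """Tag tokens as "hi" or "en"; ambiguous tokens follow their neighbors."""
    tokens = text.lower().split()
    # First pass: tag everything except ambiguous tokens (left as None).
    first = []
    for tok in tokens:
        if tok in AMBIGUOUS:
            first.append(None)
        elif tok in HINDI:
            first.append("hi")
        else:
            first.append("en")  # assumption: unknown words count as English
    # Second pass: ambiguous tokens take the majority tag within the window.
    tags = []
    for i, tag in enumerate(first):
        if tag is None:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            near = [first[j] for j in range(lo, hi) if j != i and first[j]]
            tag = "hi" if near.count("hi") > near.count("en") else "en"
        tags.append(tag)
    return list(zip(tokens, tags))


print(tag_languages("are bhai kya kar raha hai")[0])
print(tag_languages("are you kidding me")[0])
```

Ties fall back to English here; a real system would need a better tie-break and a far larger vocabulary.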

CLI

dhvani devanagari "bohot accha movie thi yaar"
# बहुत अच्छा movie थी यार

dhvani ipa "kaise ho bhai"
# kɛːseː ɦoː bʱaːiː

dhvani same "bahut" "bohot"
# True (similarity: 1.00)

How It Works

Variant spellings of a Hindi word all stand for the same pronunciation. dhvani uses IPA (International Phonetic Alphabet) as a bridge: every variant maps to one phonemic form, which in turn maps to one Devanagari word:

"bahut"  ─┐
"bohot"  ─┤
"boht"   ─┼──> /bəɦʊt̪/ ──> बहुत
"bhot"   ─┤
"bahotttt"─┘

Architecture

| Tier | Method | Latency | When used |
|------|--------|---------|-----------|
| 1 | Lexicon lookup (1M+ entries) | <1ms | ~99% of words |
| 2 | AI model (IndicXlit + epitran) | ~4s | Rare/novel words |
| 3 | Rule-based G2P | <1ms | Fallback (no deps) |

The lexicon was built from Hindi Wikipedia (50K articles), the IITB parallel corpus (500K sentences), and the MASSIVE and XNLI datasets. For each word, IPA-to-Roman rules generated 10 romanized spelling variants.
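
The tier order can be sketched as a chain that tries the cheapest source first. This is an illustration under stated assumptions, not dhvani's internals: `tiered_lookup`, `toy_rules`, and the callable-based `model` and `rules` parameters are hypothetical, and the stubs stand in for the real model and G2P rules.

```python
from typing import Callable, Dict, Optional, Tuple

Converter = Callable[[str], Optional[str]]


def tiered_lookup(
    word: str,
    lexicon: Dict[str, str],
    model: Optional[Converter],
    rules: Converter,
) -> Tuple[str, str]:
    """Return (result, tier_name), trying lexicon, then model, then rules."""
    hit = lexicon.get(word)
    if hit is not None:
        return hit, "lexicon"          # tier 1: dictionary lookup
    if model is not None:
        guess = model(word)
        if guess:
            return guess, "model"      # tier 2: optional, slow
    fallback = rules(word)
    return (fallback or word), "rules"  # tier 3: always available


def toy_rules(word: str) -> str:
    """Placeholder for rule-based G2P; it only marks the word."""
    return f"[rules:{word}]"


toy_lexicon = {"bahut": "बहुत"}
print(tiered_lookup("bahut", toy_lexicon, None, toy_rules))
print(tiered_lookup("zzyzx", toy_lexicon, None, toy_rules))
```

Passing `model=None` mirrors an install without the optional model dependencies: misses go straight to the rule-based fallback.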

Preprocessing Pipeline

Before lookup, input goes through:

  1. Punctuation stripping (preserved and reattached after conversion)
  2. Repeated character collapsing ("bohotttt" -> "bohot")
  3. Double consonant fallback (tries collapsed form if double misses)
  4. Context-aware language ID (disambiguates words like "are", "the", "bus")
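
Steps 1 to 3 can be approximated with a few regular expressions. The sketch below is one possible reading of those steps, not dhvani's code: the function names and the toy lexicon are hypothetical, and the order in which collapsed candidates are tried is an assumption.

```python
import re

# Toy lexicon for illustration only; the packaged lexicon is far larger.
TOY_LEXICON = {"bohot": "बहुत", "acha": "अच्छा", "yaar": "यार"}


def split_punctuation(token):
    """Split a token into (leading, core, trailing) non-word padding."""
    match = re.match(r"^(\W*)(.*?)(\W*)$", token, flags=re.DOTALL)
    return match.group(1), match.group(2), match.group(3)


def candidates(word):
    """Yield progressively more collapsed spellings of `word`."""
    yield word
    yield re.sub(r"(.)\1{2,}", r"\1\1", word)  # runs of 3+ become 2
    yield re.sub(r"(.)\1{2,}", r"\1", word)    # runs of 3+ become 1
    yield re.sub(r"(.)\1+", r"\1", word)       # every run becomes 1


def normalize_token(token, lexicon=TOY_LEXICON):
    """Look up the first candidate found, then reattach punctuation."""
    lead, core, trail = split_punctuation(token)
    for form in candidates(core.lower()):
        if form in lexicon:
            return lead + lexicon[form] + trail
    return token  # unknown words (e.g. English) pass through unchanged


print(" ".join(normalize_token(t) for t in "bohotttt achaaa movie, yaaaar!".split()))
```

Trying the least collapsed form first keeps legitimate double letters intact whenever the lexicon already knows them.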

Use Cases

Search & Retrieval -- Index Hinglish content once, find it regardless of spelling. A search for "accha" finds posts containing "achha", "acha", "achaa".
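
As a rough sketch of the search use case, an inverted index can be keyed on normalized tokens so that every spelling lands in the same bucket. `build_index`, `search`, and `toy_normalize` are hypothetical names; the toy normalizer is a hard-coded stand-in so the example runs on its own.

```python
from collections import defaultdict

# Hypothetical stand-in for a real normalizer.
_TOY_FORMS = {"accha": "अच्छा", "achha": "अच्छा", "acha": "अच्छा", "achaa": "अच्छा"}


def toy_normalize(token):
    """Map known variants to one form; leave everything else unchanged."""
    return _TOY_FORMS.get(token, token)


def build_index(docs, normalize):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return index


def search(index, query, normalize):
    """Return sorted ids of documents matching the normalized query term."""
    return sorted(index.get(normalize(query.lower()), set()))


docs = {1: "movie achha tha", 2: "acha laga", 3: "bakwas movie"}
index = build_index(docs, toy_normalize)
print(search(index, "accha", toy_normalize))
```

The same normalizer has to be applied at index time and at query time; in practice a function such as `dhvani.to_devanagari` could presumably fill that role, though that pairing is not tested here.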

Sentiment Analysis -- Normalize text before classification. Spelling variants of sentiment words ("bakwas", "bakwaas", "bakwass") all resolve to the same form.

Content Moderation -- Detect abusive content regardless of spelling obfuscation.

Preprocessing for LLMs -- Reduce vocabulary size and improve tokenization for Hindi/Hinglish fine-tuning.


Performance

  • 1,072,153 lexicon entries
  • <1ms per word (lexicon hit)
  • ~2s cold start (lexicon load), then instant
  • No model needed at inference (pure lookup + rules)
  • Tested on Cardiff Hindi Tweet Sentiment dataset: +1.2% macro F1 improvement over raw text

Research

Built on findings from IPA-GPT research at Ohio State University, which demonstrated that phonetic (IPA) representations enable significant cross-lingual transfer improvements for script-divergent languages like Hindi-Urdu.


License

MIT

