Phonetic normalization for Hinglish text. Resolves spelling variation in Romanized Hindi using IPA as a bridge representation.

dhvani

Phonetic normalization for Hinglish text.

dhvani resolves the spelling chaos of Romanized Hindi. It knows that "bahut", "bohot", "boht", and "bhot" are all the same word, and normalizes them to a canonical form using IPA as a bridge representation.

pip install dhvani

import dhvani

dhvani.to_devanagari("bohotttt achaaa movie thi yaar")
# -> "बहुत अच्छा movie थी यार"

dhvani.are_same("bahut", "bohot")   # True
dhvani.are_same("bahut", "बहुत")    # True (cross-script)

dhvani.to_ipa("kaise ho bhai")
# -> "kɛːseː ɦoː bʱaːiː"

The Problem

600M+ Indians write online in Hinglish (Hindi in Latin script, mixed with English). There is no standardized spelling:

Word	Variants typed online
बहुत (very)	bahut, bohot, boht, bhot, bahot, bht, bhaut
अच्छा (good)	accha, achha, acha, achaa, aacha
कैसे (how)	kaise, kese, kayse, kse

This variation degrades search, sentiment analysis, content moderation, and most other NLP tools that depend on exact token matches. dhvani normalizes it away.
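
To make the effect concrete, the sketch below counts tokens in a toy corpus with and without normalization. The `CANONICAL` map is a hand-written stand-in assumed for illustration; it is not dhvani's lexicon.

```python
from collections import Counter

# Hypothetical hand-written variant map, for illustration only.
CANONICAL = {
    "bahut": "बहुत", "bohot": "बहुत", "boht": "बहुत", "bhot": "बहुत",
    "accha": "अच्छा", "achha": "अच्छा", "acha": "अच्छा",
}

def count_tokens(posts, normalize=False):
    """Count whitespace-separated tokens, optionally mapping variants to one form."""
    counts = Counter()
    for post in posts:
        for token in post.lower().split():
            key = CANONICAL.get(token, token) if normalize else token
            counts[key] += 1
    return counts

posts = ["bahut accha", "bohot acha", "boht achha", "bhot accha"]
raw = count_tokens(posts)
norm = count_tokens(posts, normalize=True)
print(len(raw), "distinct tokens before,", len(norm), "after")
```

Without normalization, the same two words are spread over seven distinct vocabulary entries, so any exact-match lookup sees only a fraction of the evidence.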

Install

pip install dhvani

That's it. No model downloads, no API keys, no GPU needed. The 1M+ word lexicon ships with the package.

Usage

Transliteration

import dhvani

# Handles messy social media text
dhvani.to_devanagari("kya karra h tu")
# -> "क्या कर रहा है तू"

# Handles elongated text
dhvani.to_devanagari("bohotttt achaaa yaaaar")
# -> "बहुत अच्छा यार"

# Preserves English words and punctuation
dhvani.to_devanagari("the movie was really acchi thi!")
# -> "the movie was really अच्छी थी!"

Phonetic Matching

# Same word, different spellings
dhvani.are_same("bahut", "bohot")     # True
dhvani.are_same("theek", "tik")       # True
dhvani.are_same("yaar", "yr")         # True

# Cross-script matching
dhvani.are_same("bahut", "बहुत")      # True
dhvani.are_same("achaaaa", "अच्छा")   # True (handles elongation)

# Different words correctly rejected
dhvani.are_same("bahut", "accha")     # False
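
A pairwise predicate like this can also be used to cluster spellings. The sketch below is one possible approach, not part of dhvani: `group_by_equivalence` and `toy_same` are hypothetical names, and `toy_same` is a crude consonant-skeleton comparison that stands in for the real matcher so the example runs without the package. Passing `dhvani.are_same` as the `same` argument should work given the two-argument form shown above, though that is untested here.

```python
def skeleton(word):
    """Drop Latin vowels and collapse repeated letters: 'achha' -> 'ch'."""
    out = []
    for ch in word.lower():
        if ch in "aeiou":
            continue
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)

def toy_same(a, b):
    # Stand-in predicate for illustration; not dhvani's algorithm.
    return skeleton(a) == skeleton(b)

def group_by_equivalence(words, same):
    """Greedy grouping: add each word to the first group whose first member matches."""
    groups = []
    for word in words:
        for group in groups:
            if same(group[0], word):
                group.append(word)
                break
        else:
            groups.append([word])
    return groups

words = ["bahut", "accha", "bohot", "achha", "boht", "acha"]
print(group_by_equivalence(words, toy_same))
```

The greedy strategy compares each word only against one representative per group, which keeps the number of predicate calls low but assumes the predicate behaves like an equivalence relation.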

IPA Conversion

dhvani.to_ipa("kaise ho bhai")
# -> "kɛːseː ɦoː bʱaːiː"

dhvani.to_ipa("bahut accha")
# -> "bəɦʊt̪ ət͡ʃːʰaː"

Language Identification

dhvani.identify_languages("the movie was really acchi thi")
# -> [("the", "en"), ("movie", "en"), ("was", "en"),
#     ("really", "en"), ("acchi", "hi"), ("thi", "hi")]

# Context-aware: "are" resolves differently based on neighbors
dhvani.identify_languages("are you kidding me")
# -> all English

dhvani.identify_languages("are bhai kya kar raha hai")
# -> "are" tagged as Hindi (अरे)
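
One simple way to get this kind of context sensitivity is a two-pass tagger: tag the unambiguous words first, then let their tags vote on the ambiguous ones. The sketch below illustrates that idea only; the word lists, the `tag_languages` name, the window size, and the default-to-English rule are all assumptions, and dhvani's actual method may differ.

```python
# Toy word lists, assumed for illustration.
HINDI = {"bhai", "kya", "kar", "raha", "hai", "thi", "acchi"}
AMBIGUOUS = {"are", "the", "bus"}

def tag_languages(text, window=2):
    tokens = text.lower().split()
    # Pass 1: tag unambiguous tokens; unknown words default to English.
    tags = [None if t in AMBIGUOUS else ("hi" if t in HINDI else "en") for t in tokens]
    # Pass 2: resolve each ambiguous token by majority vote of tagged neighbors.
    resolved = list(tags)
    for i, tag in enumerate(tags):
        if tag is not None:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        votes = [tags[j] for j in range(lo, hi) if j != i and tags[j] is not None]
        # Ties (including no votes at all) fall back to English.
        resolved[i] = "hi" if votes.count("hi") > votes.count("en") else "en"
    return list(zip(tokens, resolved))

print(tag_languages("are you kidding me"))
print(tag_languages("are bhai kya kar raha hai"))
```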

CLI

dhvani devanagari "bohot accha movie thi yaar"
# बहुत अच्छा movie थी यार

dhvani ipa "kaise ho bhai"
# kɛːseː ɦoː bʱaːiː

dhvani same "bahut" "bohot"
# True (similarity: 1.00)

How It Works

Variant spellings of a Hindi word are all attempts to write the same pronunciation. dhvani uses IPA (International Phonetic Alphabet) as a universal bridge:

"bahut"    ─┐
"bohot"    ─┤
"boht"     ─┼──> /bəɦʊt̪/ ──> बहुत
"bhot"     ─┤
"bahotttt" ─┘
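
The diagram can be read as two table lookups joined on a shared IPA key. The sketch below assumes tiny hand-written tables and hypothetical function names (`to_bridge`, `same_word`, `canonical`); it shows the shape of the idea, not dhvani's internals.

```python
import re

BAHUT_IPA = "bəɦʊt̪"  # IPA key taken from the diagram above

# Toy tables, assumed for illustration.
ROMAN_TO_IPA = {w: BAHUT_IPA for w in ("bahut", "bohot", "boht", "bhot", "bahot")}
IPA_TO_DEVANAGARI = {BAHUT_IPA: "बहुत"}
DEVANAGARI_TO_IPA = {dev: ipa for ipa, dev in IPA_TO_DEVANAGARI.items()}

def to_bridge(word):
    """Map a Roman or Devanagari spelling to its IPA key, or None if unknown."""
    if word in DEVANAGARI_TO_IPA:
        return DEVANAGARI_TO_IPA[word]
    word = word.lower()
    if word in ROMAN_TO_IPA:
        return ROMAN_TO_IPA[word]
    # Retry with elongations collapsed: "bahotttt" -> "bahot".
    collapsed = re.sub(r"(.)\1+", r"\1", word)
    return ROMAN_TO_IPA.get(collapsed)

def same_word(a, b):
    """Two spellings match when both resolve to the same IPA key."""
    key_a, key_b = to_bridge(a), to_bridge(b)
    return key_a is not None and key_a == key_b

def canonical(word):
    """Return the Devanagari form for a known word, else the word unchanged."""
    return IPA_TO_DEVANAGARI.get(to_bridge(word), word)

print(canonical("bahotttt"), same_word("bohot", "बहुत"))
```

Because both scripts resolve to the same key, cross-script matching needs no extra machinery beyond the reverse table.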

Architecture

Tier	Method	Latency	When used
1	Lexicon lookup (1M+ entries)	<1ms	~99% of words
2	AI model (IndicXlit + epitran)	~4s	Rare/novel words
3	Rule-based G2P	<1ms	Fallback (no deps)

The lexicon was built from Hindi Wikipedia (50K articles), the IITB parallel corpus (500K sentences), and the MASSIVE and XNLI datasets. For each word, IPA-to-Roman rules generated 10 romanized spelling variants.
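
The three tiers amount to a fallback cascade. The sketch below shows one way such a cascade could be wired, under the assumption that the model tier is optional and may decline to answer; `make_resolver` and its parameters are hypothetical names, not dhvani's API.

```python
from typing import Callable, Dict, Optional, Tuple

def make_resolver(
    lexicon: Dict[str, str],
    rules: Callable[[str], str],
    model: Optional[Callable[[str], Optional[str]]] = None,
) -> Callable[[str], Tuple[str, str]]:
    """Build a resolver that tries lexicon, then model (if present), then rules."""
    def resolve(word: str) -> Tuple[str, str]:
        hit = lexicon.get(word)
        if hit is not None:
            return hit, "lexicon"          # Tier 1: dictionary lookup
        if model is not None:
            guess = model(word)
            if guess is not None:
                return guess, "model"      # Tier 2: slower learned model
        return rules(word), "rules"        # Tier 3: always-available fallback
    return resolve

# Toy stand-ins so the example runs without any dependencies.
resolve = make_resolver({"bahut": "बहुत"}, rules=lambda w: w.upper())
print(resolve("bahut"))
print(resolve("xyz"))
```

Returning the tier name alongside the result makes it easy to measure how often each tier is actually hit.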

Preprocessing Pipeline

Before lookup, input goes through:

1. Punctuation stripping (preserved and reattached after conversion)
2. Repeated character collapsing ("bohotttt" -> "bohot")
3. Double consonant fallback (tries collapsed form if double misses)
4. Context-aware language ID (disambiguates words like "are", "the", "bus")
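
The first three steps can be sketched with the standard library alone. This is an approximation under stated assumptions: the toy `LEXICON`, the function names, and the choice to squeeze runs of three or more letters down to two before trying the fully collapsed form are all illustrative, and the fallback here collapses every doubled letter rather than consonants only. Language ID is left out.

```python
import re
import string

# Toy lexicon, assumed for illustration.
LEXICON = {"bohot": "बहुत", "acha": "अच्छा", "accha": "अच्छा", "yaar": "यार"}

def split_punctuation(token):
    """Split a token into (leading punctuation, core, trailing punctuation)."""
    lead_len = len(token) - len(token.lstrip(string.punctuation))
    rest = token[lead_len:]
    core = rest.rstrip(string.punctuation)
    return token[:lead_len], core, rest[len(core):]

def candidates(word):
    """Spellings to try, in order: runs squeezed to two letters, then to one."""
    squeezed = re.sub(r"(.)\1{2,}", r"\1\1", word)   # "bohotttt" -> "bohott"
    single = re.sub(r"(.)\1+", r"\1", squeezed)      # "bohott"   -> "bohot"
    return [squeezed, single]

def normalize_token(token):
    lead, core, tail = split_punctuation(token)
    for cand in candidates(core.lower()):
        if cand in LEXICON:
            return lead + LEXICON[cand] + tail       # reattach punctuation
    return token                                     # unknown words pass through

print(" ".join(normalize_token(t) for t in "bohotttt achaaa yaaaar!".split()))
```

Trying the two-letter form first keeps legitimate doubles such as "yaar" intact, and the single-letter form catches the rest.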

Use Cases

Search & Retrieval -- Index Hinglish content once, find it regardless of spelling. A search for "accha" finds posts containing "achha", "acha", "achaa".

Sentiment Analysis -- Normalize text before classification. Spelling variants of sentiment words ("bakwas", "bakwaas", "bakwass") all resolve to the same form.

Content Moderation -- Detect abusive content regardless of spelling obfuscation.

Preprocessing for LLMs -- Reduce vocabulary size and improve tokenization for Hindi/Hinglish fine-tuning.
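
For the search case, the usual pattern is to normalize at both index time and query time so that every variant lands on the same key. The sketch below assumes a toy `normalize` function backed by a hand-written map; `build_index` and `search` are hypothetical helpers, and any word-level normalizer could be passed in instead.

```python
from collections import defaultdict

# Toy variant map, assumed for illustration.
VARIANTS = {"accha": "अच्छा", "achha": "अच्छा", "acha": "अच्छा", "achaa": "अच्छा"}

def normalize(token):
    return VARIANTS.get(token, token)

def build_index(posts, normalize):
    """Inverted index from normalized token to the set of post ids containing it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for token in text.lower().split():
            index[normalize(token)].add(post_id)
    return index

def search(index, query, normalize):
    """Look up a single-word query under the same normalization as the index."""
    return sorted(index.get(normalize(query.lower()), set()))

posts = {1: "achha movie", 2: "acha gaana", 3: "bakwas film", 4: "achaa yaar"}
index = build_index(posts, normalize)
print(search(index, "accha", normalize))
```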

Performance

1,072,153 lexicon entries
<1ms per word (lexicon hit)
~2s cold start (lexicon load), then instant
No model needed at inference for lexicon hits or the rule-based fallback (pure lookup + rules)
Tested on the Cardiff Hindi Tweet Sentiment dataset: +1.2% macro F1 improvement over raw text
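
For readers unfamiliar with the metric, macro F1 is the unweighted mean of the per-class F1 scores, so minority classes count as much as the majority class. The sketch below computes it on made-up labels; the data is purely illustrative and unrelated to the benchmark above.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, with F1 = 2*TP / (2*TP + FP + FN)."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up labels, for illustration only.
y_true = ["pos", "pos", "neg", "neg", "neu", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "neu", "pos"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```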

Research

Built on findings from IPA-GPT research at Ohio State University, which demonstrated that phonetic (IPA) representations enable significant cross-lingual transfer improvements for script-divergent languages like Hindi-Urdu.

License

MIT

Release history

Version	Released
0.2.5	May 11, 2026
0.2.4	May 7, 2026
0.2.3	May 6, 2026
0.2.2 (this version)	May 3, 2026
0.2.1	May 3, 2026
0.2.0	May 3, 2026

Download files

Source Distribution

dhvani-0.2.2.tar.gz (10.0 MB)

Uploaded May 3, 2026 Source

Built Distribution

dhvani-0.2.2-py3-none-any.whl (10.1 MB)

Uploaded May 3, 2026 Python 3

File details

Details for the file dhvani-0.2.2.tar.gz.

File metadata

Download URL: dhvani-0.2.2.tar.gz
Upload date: May 3, 2026
Size: 10.0 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for dhvani-0.2.2.tar.gz
Algorithm	Hash digest
SHA256	`4fe29c90f3680d3620c13d08b95cbef8ab293881cbae7a373014dd730d0775e5`
MD5	`cbb6082660b30bdba04bca0be239b38f`
BLAKE2b-256	`87d94b9a90975e18f6c99e8c0f39b422d12c627d57ab1a61cf5ec7e72fbe7416`
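
A downloaded file can be checked against the published SHA256 digest before installing it. The sketch below uses only the standard library; the `sha256_of` helper name and the local file path in the usage comment are assumptions about where and how the check is run.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Digest copied from the table above.
EXPECTED = "4fe29c90f3680d3620c13d08b95cbef8ab293881cbae7a373014dd730d0775e5"

# Usage, assuming the sdist was saved to the current directory:
#   if sha256_of("dhvani-0.2.2.tar.gz") != EXPECTED:
#       raise SystemExit("hash mismatch: do not install this file")
```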

File details

Details for the file dhvani-0.2.2-py3-none-any.whl.

File metadata

Download URL: dhvani-0.2.2-py3-none-any.whl
Upload date: May 3, 2026
Size: 10.1 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for dhvani-0.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`34b66da2154d106958ae24565f1272ead26788ee1fe5391db2607004a193b1bc`
MD5	`b0aaefb14f731003e25c0d73a0d2520c`
BLAKE2b-256	`279d1740e656e36aaed77109c92edffa8b503c1636e961a7244dd1efeb8a30c0`
