Phonetic normalization for Hinglish text. Resolves spelling variation in Romanized Hindi using IPA as a bridge representation.

dhwani

Phonetic normalization for Hinglish text.

dhwani (ध्वनि = "sound") understands that "bahut", "bohot", "boht", and "bhot" are all the same word. It normalizes the chaos of Romanized Hindi into something computers can actually work with.

Why?

600M+ Indians write online in Hinglish (Hindi in Latin script mixed with English). But there's no standard spelling:

"बहुत" gets written as: bahut, bohot, boht, bhot, bahot
"अच्छा" gets written as: accha, achha, acha, achaa
"कैसे" gets written as: kaise, kese, kayse

Most NLP tools treat each of these spellings as a different word. dhwani maps them to one.

Install

pip install git+https://github.com/Kkoundinyaa/dhwani.git

For higher accuracy on rare words (optional):

pip install "dhwani[models] @ git+https://github.com/Kkoundinyaa/dhwani.git"

Usage

import dhwani

# Check if two words are the same (variant spellings)
dhwani.are_same("bahut", "bohot")   # True
dhwani.are_same("accha", "achha")   # True
dhwani.are_same("bahut", "accha")   # False

# Convert Hinglish to Devanagari
dhwani.to_devanagari("bohot accha movie thi yaar")
# -> "बहुत अच्छा movie थी यार"

# Convert to IPA (phonetic representation)
dhwani.to_ipa("kaise ho bhai")
# -> "kɛːseː ɦoː bʱaːiː"

# Word-level language identification
dhwani.identify_languages("ye movie really acchi thi bro")
# -> [("ye", "hi"), ("movie", "en"), ("really", "en"), ("acchi", "hi"), ("thi", "hi"), ("bro", "hi")]

# Normalize text
dhwani.normalize("bohot acha movie thi")
# -> canonical normalized form
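
To apply the pairwise check to a whole vocabulary, one option is to cluster tokens greedily against the first member of each group. This is a minimal sketch, not part of dhwani: the helper name group_variants is hypothetical, and it accepts any two-argument predicate so that dhwani.are_same can be passed in.

```python
from typing import Callable, Iterable, List


def group_variants(
    words: Iterable[str],
    same: Callable[[str, str], bool],
) -> List[List[str]]:
    """Greedily cluster words; `same` decides whether two words match.

    Hypothetical helper for illustration. Each word is compared against
    the first member of every existing group and joins the first match.
    """
    groups: List[List[str]] = []
    for word in words:
        for group in groups:
            if same(group[0], word):
                group.append(word)
                break
        else:
            # No existing group matched, so start a new one.
            groups.append([word])
    return groups
```

With dhwani installed, the call would be group_variants(tokens, dhwani.are_same). The greedy pass makes one predicate call per (word, group) pair, which is fine for small vocabularies but quadratic in the worst case.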

CLI

dhwani devanagari "bohot accha movie thi yaar"
# बहुत अच्छा movie थी यार

dhwani ipa "kaise ho bhai"
# kɛːseː ɦoː bʱaːiː

dhwani same "bahut" "bohot"
# True (phonetic similarity: 1.00)

dhwani langs "ye movie bohot acchi thi"
# ye[hi] movie[en] bohot[hi] acchi[hi] thi[hi]
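
The same command prints a similarity score next to the verdict. How dhwani computes that score is not stated here; the sketch below shows one plausible way to score two IPA strings, using the standard library's difflib.SequenceMatcher. The function name ipa_similarity is hypothetical.

```python
from difflib import SequenceMatcher


def ipa_similarity(ipa_a: str, ipa_b: str) -> float:
    """Return a similarity ratio in [0.0, 1.0] for two IPA strings.

    Assumption: a character-level ratio stands in for whatever metric
    dhwani actually uses. Identical strings score 1.0.
    """
    if not ipa_a and not ipa_b:
        return 1.0
    return SequenceMatcher(None, ipa_a, ipa_b).ratio()


print(ipa_similarity("bəɦʊt̪", "bəɦʊt̪"))  # -> 1.0
```

A character-level ratio treats combining diacritics as separate symbols, so a production metric would more likely compare phoneme segments.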

How It Works

dhwani routes through IPA (International Phonetic Alphabet) as a bridge representation. All variant spellings of a word produce the same sound, so they map to the same IPA:

"bahut" ─┐
"bohot" ─┤──> /bəɦʊt̪/ ──> बहुत
"boht"  ─┤
"bhot"  ─┘

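The many-spellings-to-one-key idea can be illustrated with a much cruder key than IPA. The sketch below is not dhwani's algorithm: it uses a consonant skeleton (vowels dropped, doubled letters collapsed) as a stand-in for the IPA step, and the names toy_key and group_by_key are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

VOWELS = set("aeiou")


def toy_key(word: str) -> str:
    """Crude stand-in for an IPA key: consonant skeleton of the word.

    Drops vowels and collapses immediately repeated letters. It also
    conflates unrelated words (e.g. "bahut" and "bhoot"), which is one
    reason a real phonetic representation is needed.
    """
    key = []
    prev = ""
    for ch in word.lower():
        if ch != prev and ch not in VOWELS:
            key.append(ch)
        prev = ch
    return "".join(key)


def group_by_key(words: Iterable[str]) -> Dict[str, List[str]]:
    """Bucket words that share the same toy key."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for word in words:
        groups[toy_key(word)].append(word)
    return dict(groups)


print(group_by_key(["bahut", "bohot", "boht", "bhot", "accha", "achha"]))
# -> {'bht': ['bahut', 'bohot', 'boht', 'bhot'], 'ch': ['accha', 'achha']}
```
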
Three-tier architecture for speed:

Tier  Method                          Speed     Coverage
1     Lexicon lookup (151K entries)   0.001ms   ~95% of common words
2     AI model (IndicXlit + epitran)  ~4s       Handles anything
3     Rule-based G2P                  0.005ms   Always available

Plus a runtime cache that learns: words processed by Tier 2 get cached permanently, so the library gets faster over time.
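
The tier order and the cache can be sketched as a small resolver. This is an illustrative sketch, not dhwani's internals: the class name TieredResolver, the callables passed in, and the fallback order (lexicon, cache, model, then rules) are assumptions, and the cache here is an in-memory dict rather than a persistent store.

```python
from typing import Callable, Dict, Optional, Tuple


class TieredResolver:
    """Hypothetical three-tier lookup with a cache for model results."""

    def __init__(
        self,
        lexicon: Dict[str, str],
        model: Optional[Callable[[str], Optional[str]]],
        rules: Callable[[str], str],
    ) -> None:
        self.lexicon = lexicon   # Tier 1: fast dictionary lookup
        self.model = model       # Tier 2: slow, optional model call
        self.rules = rules       # Tier 3: rule-based fallback
        self.cache: Dict[str, str] = {}

    def resolve(self, word: str) -> Tuple[str, str]:
        """Return (result, source) where source names the tier used."""
        key = word.lower()
        if key in self.lexicon:
            return self.lexicon[key], "lexicon"
        if key in self.cache:
            return self.cache[key], "cache"
        if self.model is not None:
            result = self.model(key)
            if result is not None:
                # Remember the expensive answer so the model runs once per word.
                self.cache[key] = result
                return result, "model"
        return self.rules(key), "rules"
```

Under these assumptions, the first call for an out-of-lexicon word pays the model cost and every later call for the same word is a dict lookup. Persisting self.cache to disk would be needed to match the "cached permanently" behaviour described above.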

Features

  • Phonetic equivalence: Detect if two words are the same regardless of spelling
  • Transliteration: Romanized Hindi to Devanagari (and back)
  • IPA conversion: Any Hindi text (Roman or Devanagari) to IPA
  • Language ID: Word-level Hindi/English classification in mixed text
  • Zero dependencies for basic use (lexicon + rules)
  • 151K-word lexicon built from real Hindi corpora
  • Runtime learning: Tier 2 results are cached, so repeated words resolve faster the more you use it

Research

Built on findings from IPA-GPT research at Ohio State University, which showed that phonetic (IPA) representations dramatically improve cross-lingual NLP for script-divergent languages like Hindi-Urdu.

License

MIT
