Phonetic normalization for Hinglish text. Resolves spelling variation in Romanized Hindi using IPA as a bridge representation.

dhwani

Phonetic normalization for Hinglish text.

dhwani (ध्वनि = "sound") understands that "bahut", "bohot", "boht", and "bhot" are all the same word. It normalizes the chaos of Romanized Hindi into something computers can actually work with.

Why?

600M+ Indians write online in Hinglish (Hindi in Latin script mixed with English). But there's no standard spelling:

"बहुत" gets written as: bahut, bohot, boht, bhot, bahot
"अच्छा" gets written as: accha, achha, acha, achaa
"कैसे" gets written as: kaise, kese, kayse

Most NLP tools treat each of these spellings as a different word. dhwani maps them back to one.

Install

pip install git+https://github.com/Kkoundinyaa/dhwani.git

For higher accuracy on rare words (optional):

pip install "dhwani[models] @ git+https://github.com/Kkoundinyaa/dhwani.git"

Usage

import dhwani

# Check if two words are the same (variant spellings)
dhwani.are_same("bahut", "bohot")   # True
dhwani.are_same("accha", "achha")   # True
dhwani.are_same("bahut", "accha")   # False

# Convert Hinglish to Devanagari
dhwani.to_devanagari("bohot accha movie thi yaar")
# -> "बहुत अच्छा movie थी यार"

# Convert to IPA (phonetic representation)
dhwani.to_ipa("kaise ho bhai")
# -> "kɛːseː ɦoː bʱaːiː"

# Word-level language identification
dhwani.identify_languages("ye movie really acchi thi bro")
# -> [("ye", "hi"), ("movie", "en"), ("really", "en"), ("acchi", "hi"), ("thi", "hi"), ("bro", "hi")]

# Normalize text
dhwani.normalize("bohot acha movie thi")
# -> canonical normalized form
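
A common use of the equivalence check is collapsing spelling variants before counting or indexing words. The sketch below shows one way to do that with any two-argument predicate. `group_variants` and `demo_same` are hypothetical helpers written for this example, not part of dhwani; `demo_same` is only a stand-in so the snippet runs without the library installed. With dhwani available, passing `dhwani.are_same` as the predicate should work the same way.

```python
from typing import Callable, Iterable


def group_variants(
    words: Iterable[str], same: Callable[[str, str], bool]
) -> list[list[str]]:
    """Greedily group words: each word joins the first group whose
    first member the predicate considers equivalent to it."""
    groups: list[list[str]] = []
    for word in words:
        for group in groups:
            if same(group[0], word):
                group.append(word)
                break
        else:
            # No existing group matched, so start a new one.
            groups.append([word])
    return groups


# Stand-in predicate (assumption): a tiny hand-written equivalence table.
_DEMO_CLASSES = {"bahut": 0, "bohot": 0, "boht": 0, "accha": 1, "achha": 1}


def demo_same(a: str, b: str) -> bool:
    if a == b:
        return True
    return a in _DEMO_CLASSES and _DEMO_CLASSES.get(a) == _DEMO_CLASSES.get(b)


words = ["bahut", "accha", "bohot", "movie", "achha", "boht"]
print(group_variants(words, demo_same))
# [['bahut', 'bohot', 'boht'], ['accha', 'achha'], ['movie']]
```

The grouping is greedy and compares only against the first member of each group, so it makes one predicate call per existing group in the worst case. For large vocabularies, grouping by a phonetic key would scale better.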

CLI

dhwani devanagari "bohot accha movie thi yaar"
# बहुत अच्छा movie थी यार

dhwani ipa "kaise ho bhai"
# kɛːseː ɦoː bʱaːiː

dhwani same "bahut" "bohot"
# True (phonetic similarity: 1.00)

dhwani langs "ye movie bohot acchi thi"
# ye[hi] movie[en] bohot[hi] acchi[hi] thi[hi]
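
The `same` command above prints a phonetic similarity score next to the verdict. This README does not say how that score is computed. As an illustration only, one plausible approach is a normalized sequence similarity over the two IPA strings, sketched here with the standard library's `difflib`. `ipa_similarity` is a hypothetical name, and this is not claimed to be dhwani's scoring method.

```python
from difflib import SequenceMatcher


def ipa_similarity(ipa_a: str, ipa_b: str) -> float:
    """Return a similarity ratio in [0.0, 1.0] between two IPA strings.

    Assumption: plain code-point comparison, so combining diacritics
    (such as the dental mark in t̪) count as separate symbols.
    """
    return SequenceMatcher(None, ipa_a, ipa_b).ratio()


# Identical IPA strings always score 1.0.
print(ipa_similarity("bəɦʊt̪", "bəɦʊt̪"))  # 1.0
```

A score like this only becomes a yes/no answer once a threshold is chosen; the right cutoff depends on the data.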

How It Works

dhwani routes through IPA (International Phonetic Alphabet) as a bridge representation. All variant spellings of a word produce the same sound, so they map to the same IPA:

"bahut" ─┐
"bohot" ─┤──> /bəɦʊt̪/ ──> बहुत
"boht"  ─┤
"bhot"  ─┘
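
The diagram can be sketched in a few lines: if every variant spelling resolves to the same IPA string, equivalence is a string comparison and transliteration is a lookup keyed on IPA. The tables and `toy_*` functions below are hypothetical and hold only the four spellings shown above; they illustrate the bridge idea and are not dhwani's implementation or data.

```python
from typing import Optional

# Toy spelling -> IPA table (illustrative entries only).
TOY_IPA = {
    "bahut": "bəɦʊt̪",
    "bohot": "bəɦʊt̪",
    "boht": "bəɦʊt̪",
    "bhot": "bəɦʊt̪",
}

# Toy IPA -> Devanagari table.
TOY_DEVANAGARI = {"bəɦʊt̪": "बहुत"}


def toy_to_ipa(word: str) -> Optional[str]:
    """Look up the IPA form of a Romanized word, or None if unknown."""
    return TOY_IPA.get(word.lower())


def toy_are_same(a: str, b: str) -> bool:
    """Two words match when both are known and share one IPA form."""
    ipa_a, ipa_b = toy_to_ipa(a), toy_to_ipa(b)
    return ipa_a is not None and ipa_a == ipa_b


def toy_to_devanagari(word: str) -> str:
    """Route spelling -> IPA -> Devanagari; unknown words pass through."""
    return TOY_DEVANAGARI.get(toy_to_ipa(word), word)


print(toy_are_same("bahut", "bhot"))   # True
print(toy_to_devanagari("boht"))       # बहुत
print(toy_to_devanagari("movie"))      # movie
```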

Three-tier architecture for speed:

Tier | Method                         | Speed    | Coverage
1    | Lexicon lookup (151K entries)  | 0.001 ms | ~95% of common words
2    | AI model (IndicXlit + epitran) | ~4 s     | Handles anything
3    | Rule-based G2P                 | 0.005 ms | Always available

Plus a runtime cache that learns: words processed by Tier 2 get cached permanently, so the library gets faster over time.
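
A minimal sketch of how such a tiered lookup with a cache could be wired together. It assumes the tiers are tried in numeric order and that the model tier is optional; the real dispatch order and cache storage are not specified here. `TieredLookup`, `slow_model`, and `placeholder_rules` are hypothetical names, and the cache is an in-memory dict rather than a permanent store.

```python
from typing import Callable, Dict, List, Optional


class TieredLookup:
    """Toy word -> IPA resolver: lexicon, then cached/slow model, then rules."""

    def __init__(
        self,
        lexicon: Dict[str, str],
        rules: Callable[[str], str],
        model: Optional[Callable[[str], Optional[str]]] = None,
    ) -> None:
        self.lexicon = lexicon
        self.rules = rules
        self.model = model
        self.cache: Dict[str, str] = {}

    def to_ipa(self, word: str) -> str:
        key = word.lower()
        # Tier 1: fast dictionary lookup.
        if key in self.lexicon:
            return self.lexicon[key]
        # Results from earlier slow-tier calls are reused.
        if key in self.cache:
            return self.cache[key]
        # Tier 2: slow model, only if one was provided.
        if self.model is not None:
            result = self.model(key)
            if result:
                self.cache[key] = result
                return result
        # Tier 3: rule-based fallback that always returns something.
        return self.rules(key)


model_calls: List[str] = []


def slow_model(word: str) -> Optional[str]:
    """Stand-in for an expensive model; records each call."""
    model_calls.append(word)
    return {"bahot": "bəɦʊt̪"}.get(word)


def placeholder_rules(word: str) -> str:
    """Stand-in for rule-based G2P; returns the input unchanged."""
    return word


lookup = TieredLookup({"bahut": "bəɦʊt̪"}, placeholder_rules, slow_model)
lookup.to_ipa("bahut")  # served by the lexicon
lookup.to_ipa("bahot")  # served by the model, then cached
lookup.to_ipa("bahot")  # served by the cache
print(model_calls)      # ['bahot']
```

The slow tier is hit once per unseen word, which is the effect the runtime cache is described as having.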

Features

  • Phonetic equivalence: Detect if two words are the same regardless of spelling
  • Transliteration: Romanized Hindi to Devanagari (and back)
  • IPA conversion: Any Hindi text (Roman or Devanagari) to IPA
  • Language ID: Word-level Hindi/English classification in mixed text
  • Zero dependencies for basic use (lexicon + rules)
  • 151K-word lexicon built from real Hindi corpora
  • Runtime learning: Gets smarter the more you use it

Research

Built on findings from IPA-GPT research at Ohio State University, which showed that phonetic (IPA) representations dramatically improve cross-lingual NLP for script-divergent languages like Hindi-Urdu.

License

MIT
