Transliterations to/from Indian languages

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

These details have not been verified by PyPI

Project description

Indicate: Transliterate Indic Languages with PyTorch and LLMs

Indicate provides high-quality transliteration between Indic languages and English using both a traditional PyTorch model and state-of-the-art LLMs (Large Language Models).

🚀 Features

🧠 Dual Backend Support: Choose between the PyTorch model or LLM-based transliteration
🌍 Multi-Language: 12+ Indic languages (Hindi, Tamil, Telugu, Bengali, etc.)
🔄 Bidirectional: Supports both Indic→English and English→Indic transliteration
🛡️ Production Ready: Safe file handling, atomic writes, backup support
📊 Structured Output: Rich JSON format with metadata and error handling
⚡ Batch Processing: Efficient processing of large files with progress tracking

🎯 Supported Languages

Hindi • Tamil • Telugu • Bengali • Gujarati • Kannada • Malayalam • Punjabi • Marathi • Odia • Urdu • Sanskrit ↔ English

Install

We strongly recommend installing indicate inside a Python virtual environment (see venv documentation)

Requirements: Python 3.13+

pip install indicate

🔧 Quick Setup

For LLM-based transliteration (recommended):

pip install indicate

# Set your API key (choose one):
export OPENAI_API_KEY=your-key
export ANTHROPIC_API_KEY=your-key  
export GOOGLE_API_KEY=your-key

For the local model (no API key):

pip install indicate
# No API key needed. The PyTorch weights are downloaded once from Hugging Face
# (soodoku/indicate) on first transliterate and cached locally; tokenizers ship
# in the wheel. After the first run it works fully offline.

🎯 Usage

🧠 LLM-Based Transliteration (New!)

The LLM backend provides higher accuracy and supports all Indic languages:

# Simple transliteration (auto-detects Hindi)
indicate llm "राजशेखर चिंतालपति"
# Output: Rajashekar Chintalapati

# Specify languages explicitly  
indicate llm "முருகன்" --source tamil --target english
# Output: Murugan

# Between Indic languages
indicate llm "नमस्ते" --source hindi --target tamil  
# Output: நமஸ்தே

# Safe batch processing with structured JSON output
indicate llm --input names.txt --output results.json --format json --batch --backup

# Dry run to preview changes
indicate llm --input large_file.txt --dry-run

Python API:

from indicate import IndicLLMTransliterator

# Initialize for any language pair
transliterator = IndicLLMTransliterator('hindi', 'english')
result = transliterator.transliterate('राजशेखर चिंतालपति')
print(result)  # Output: Rajashekar Chintalapati

# Batch processing
texts = ["राजेश", "गौरव", "प्रिया"]
results = transliterator.transliterate_batch(texts)
print(results)  # ['Rajesh', 'Gaurav', 'Priya']

🤖 PyTorch Backend (Traditional)

Local offline models are available for Hindi (Devanagari) and Punjabi (Gurmukhi):

# Hindi to English using the local PyTorch model
indicate hindi2english "राजशेखर चिंतालपति"
# Output: rajshekhar chintapalati

# Punjabi (Gurmukhi) to English
indicate punjabi2english "ਰਵਿ ਸ਼ਰਮਾ"
# Output: ravi sharma

# From file
indicate hindi2english --input hindi.txt --output english.txt

# Batch processing
indicate hindi2english --input large_file.txt --batch

Python API:

from indicate import hindi2english, punjabi2english
print(hindi2english("हिंदी"))                # "hindi"
print(punjabi2english("ਰਵਿ"))                # "ravi"

# Top-k candidates (n > 1 returns a list)
print(hindi2english("नमस्ते", n=3))          # ["namaste", "namastey", "namste"]

# Batched (much faster for many inputs)
from indicate.hindi2english import HindiToEnglish
HindiToEnglish.transliterate_batch(["हिंदी", "मुंबई", "गौरव सूद"])
# -> ["hindi", "mumbai", "gaurav sood"]

📊 JSON Output Format

The LLM backend provides rich, structured output perfect for data processing:

{
  "metadata": {
    "source_language": "hindi",
    "target_language": "english", 
    "timestamp": "2024-12-09T12:00:00Z",
    "total_lines": 3,
    "successful_lines": 3,
    "failed_lines": 0,
    "encoding": "utf-8"
  },
  "results": [
    {
      "line_number": 1,
      "input_text": "राजेश कुमार",
      "output_text": "Rajesh Kumar", 
      "source_lang": "hindi",
      "target_lang": "english",
      "confidence": "high",
      "processing_time": 1.2,
      "timestamp": "2024-12-09T12:00:01Z"
    }
  ]
}

🛡️ Safety Features

🔒 Input/Output Validation: Prevents accidental file overwrites
⚛️ Atomic Writing: Safe file operations using temporary files
💾 Automatic Backups: Optional timestamped backups of existing files
🔄 Resume Support: Resume interrupted batch operations
👁️ Dry Run Mode: Preview operations before execution

🎛️ Advanced Usage

# Show few-shot examples being used
indicate llm --show-examples --source bengali --target english

# Resume interrupted batch job
indicate llm --input large_file.txt --output results.txt --resume

# Use specific LLM provider/model
indicate llm "text" --provider anthropic --model claude-3-opus

# Process JSON from previous results
indicate llm --input results.json --source english --target hindi

🔄 Backend Comparison

Feature	PyTorch Backend	LLM Backend
Languages	Hindi ↔ English only	12+ Indic languages ↔ English + Inter-Indic
Setup	No API key needed	Requires LLM API key
Speed	Very fast (local)	Moderate (API calls)
Accuracy	Good for common words	Excellent for all types
Cost	Free	Pay per API call
Offline	✅ Works offline	❌ Requires internet
Batch Processing	✅	✅ with safety features

🧪 Testing Locally

Clone and install:

git clone https://github.com/in-rolls/indicate.git
cd indicate
uv sync  # or pip install -e .

Run tests:

# All tests
python -m pytest

# Specific tests
python -m pytest tests/test_llm_indic.py
python -m pytest tests/test_file_safety.py

Test both backends:

# PyTorch backend
indicate hindi2english "हिंदी"

# LLM backend (set API key first)
export OPENAI_API_KEY=your-key
indicate llm "हिंदी"

Data

The datasets used to train the model:

Indian Election affidavits
Google Dakshina dataset
ESPN Cric Info for hindi version of the english scorecard
IIT Bombay English-Hindi Corpus

Evaluation

The v2 models (trained on our data + the public Aksharantar corpus) are benchmarked against AI4Bharat IndicXlit — the same direction (native→Latin), the same test sets, the same metric (Top-1 exact-match, match-any-reference). Training is leakage-filtered so no eval word appears in it.

Model	Dakshina (gold)	Held-out-own names¹
Hindi → English	74.4% (IndicXlit 73.2%)	52.8% (IndicXlit 49.7%)
Punjabi → English	71.9% (IndicXlit 73.2%)	56.9% (IndicXlit 53.5%)

¹ Held-out slice of our own electoral/affidavit names — the cleanest comparison, since IndicXlit never trained on it. v2 matches or edges IndicXlit on the gold benchmark and beats it on the deployment domain. Primary metric is Top-1 exact-match; CER (character error rate) is the soft companion. Reproduce with training/eval.py and training/compare.py.

Below is the edit-distance distribution on the test set (0 = exact match):

Edit distance metrics of model on Google Dakshina test dataset

Authors

Rajashekar Chintalapati and Gaurav Sood

Contributor Code of Conduct

The project welcomes contributions from everyone! In fact, it depends on it. To maintain this welcoming atmosphere, and to collaborate in a fun and productive way, we expect contributors to the project to abide by the Contributor Code of Conduct.

License

The package is released under the MIT License.

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

rajashekar soodoku

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

0.7.0

Jun 14, 2026

0.5.1

Dec 10, 2025

0.5.0

Dec 9, 2025

0.4.0

Dec 4, 2025

0.3.1

Sep 3, 2025

0.3.0

Sep 3, 2025

0.2.1

May 1, 2025

0.2.0

Feb 15, 2025

0.1.0

Aug 22, 2022

0.0.9

Jan 20, 2022

0.0.8

Jan 20, 2022

0.0.7

Jan 10, 2022

0.0.6

Jan 10, 2022

0.0.5

Dec 16, 2021

0.0.4

Dec 16, 2021

0.0.3

Nov 12, 2021

0.0.2

Nov 12, 2021

0.0.1

Nov 12, 2021

0.0.1rc1 pre-release

Nov 12, 2021

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

indicate-0.7.0.tar.gz (47.0 kB view details)

Uploaded Jun 14, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

indicate-0.7.0-py3-none-any.whl (52.1 kB view details)

Uploaded Jun 14, 2026 Python 3

File details

Details for the file indicate-0.7.0.tar.gz.

File metadata

Download URL: indicate-0.7.0.tar.gz
Upload date: Jun 14, 2026
Size: 47.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for indicate-0.7.0.tar.gz
Algorithm	Hash digest
SHA256	`af0f4e27c54a1bafac685dec36f0069b41d1af6df6eb39e2fa9a8cdd191b1aa7`
MD5	`bf4ed6c42ee3f6a73678842514b9e7ae`
BLAKE2b-256	`6fcd696bb206b227f5949883f05ee51e2ed3f6b38e17e4588afd211101adf33d`

See more details on using hashes here.

Provenance

The following attestation bundles were made for indicate-0.7.0.tar.gz:

Publisher: python-publish.yml on in-rolls/indicate

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: indicate-0.7.0.tar.gz
- Subject digest: af0f4e27c54a1bafac685dec36f0069b41d1af6df6eb39e2fa9a8cdd191b1aa7
- Sigstore transparency entry: 1818756120
- Sigstore integration time: Jun 14, 2026
Source repository:
- Permalink: in-rolls/indicate@980b2a8ab4318fae59e9d3062e5a75fb2bafe087
- Branch / Tag: refs/heads/master
- Owner: https://github.com/in-rolls
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@980b2a8ab4318fae59e9d3062e5a75fb2bafe087
- Trigger Event: workflow_dispatch

File details

Details for the file indicate-0.7.0-py3-none-any.whl.

File metadata

Download URL: indicate-0.7.0-py3-none-any.whl
Upload date: Jun 14, 2026
Size: 52.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for indicate-0.7.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`309e4ebe06128be53bb5e70dbb0d23b9e463bec0a26dd0d272c6a037db383868`
MD5	`fe2e8c76c0d0c45bedd20c5eac19bba3`
BLAKE2b-256	`5f01c4f5b1e34051106afe42baf9efaa461ca0f4d197fa2d8f4686dd799371c1`

See more details on using hashes here.

Provenance

The following attestation bundles were made for indicate-0.7.0-py3-none-any.whl:

Publisher: python-publish.yml on in-rolls/indicate

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: indicate-0.7.0-py3-none-any.whl
- Subject digest: 309e4ebe06128be53bb5e70dbb0d23b9e463bec0a26dd0d272c6a037db383868
- Sigstore transparency entry: 1818756420
- Sigstore integration time: Jun 14, 2026
Source repository:
- Permalink: in-rolls/indicate@980b2a8ab4318fae59e9d3062e5a75fb2bafe087
- Branch / Tag: refs/heads/master
- Owner: https://github.com/in-rolls
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@980b2a8ab4318fae59e9d3062e5a75fb2bafe087
- Trigger Event: workflow_dispatch

indicate 0.7.0

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

Indicate: Transliterate Indic Languages with PyTorch and LLMs

🚀 Features

🎯 Supported Languages

Install

🔧 Quick Setup

For LLM-based transliteration (recommended):

For the local model (no API key):

🎯 Usage

🧠 LLM-Based Transliteration (New!)

🤖 PyTorch Backend (Traditional)

📊 JSON Output Format

🛡️ Safety Features

🎛️ Advanced Usage

🔄 Backend Comparison

🧪 Testing Locally

Data

Evaluation

Authors

Contributor Code of Conduct

License

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance