TransFuzzy is a fuzzy name-matching system that transliterates names written in Indic scripts to the Latin alphabet before comparing them.


🔤 TransFuzzy

Multilingual Fuzzy Name Matching — phonetic + semantic + ML, all in one pipeline.


Features • Architecture • Quick Start • API Reference • Training • Contributing

✨ Features

🌐 Multilingual — Supports English, Hindi (Devanagari), Telugu, Tamil, Kannada, Malayalam, Gujarati, and Gurmukhi out of the box
🔊 Phonetic matching — Soundex and Metaphone codes to catch phonetically similar spellings
📐 String distance — Levenshtein and Jaro-Winkler similarity
🧠 Semantic embeddings — all-MiniLM-L6-v2 sentence transformer for semantic closeness
🌲 ML classifier — A trained Random Forest model that combines all metrics for a final confident prediction
⚡ Fast — Pre-filters candidates by first letter, batch-encodes embeddings, and loads models once at startup
🖥️ Web UI — Clean browser-based interface, zero frontend framework required

🏛️ Architecture

transfuzzy/
├── main.py               # Flask app — routes, transliteration, orchestration
├── dir/
│   ├── create_csv.py     # Step 1: pair input name against the names database
│   ├── calculate_ratios.py  # Step 2: compute 8 similarity metrics per pair
│   ├── compute_metrics.py   # Step 3: RF model predicts + hybrid scoring
│   ├── enrich_data.py    # (Training) generate positive/negative training pairs
│   └── train_model.py    # (Training) GridSearchCV to train & save best RF model
├── utils/
│   └── response.py       # Standardised JSON response helper
├── db/
│   ├── names_2.txt              # Runtime names database (one name per line)
│   ├── names2.txt               # Clustered names for training (Name > Variant1, Variant2)
│   ├── names.csv                # Enriched training data (generated)
│   └── best_random_forest_model.pkl  # Pre-trained model (committed)
├── templates/
│   └── index.html        # Jinja2 template for the web UI
├── static/
│   ├── styles.css
│   ├── api.js
│   ├── ui.js
│   └── app.js
├── pyproject.toml        # Project metadata & dependencies (uv)
└── scripts/
    ├── dev.py            # Cross-platform dev launcher (uv run + open browser)
    ├── enrich.py         # Convenience wrapper: enrich_data pipeline
    └── train.py          # Convenience wrapper: train model pipeline

Inference Pipeline

Input Name
    │
    ▼
[Script Detection]  ──── Devanagari/Telugu/etc? ──► Transliterate to ITRANS
    │
    ▼
[Create Pairs]      ──── Compare against ~73k names (pre-filtered by 1st char)
    │
    ▼
[Calculate Ratios]  ──── 8 metrics: Soundex, Metaphone, Levenshtein,
    │                    Jaro-Winkler, Cosine, Euclidean, Manhattan, Pearson
    ▼
[RF Classifier]     ──── Predict probability of match (class 'y')
    │
    ▼
[Hybrid Filter]     ──── Accept if: high RF confidence OR phonetic match
    │                    Reject if composite score < 0.70
    ▼
[Results]           ──── Sorted by composite score, transliterated back
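
The first two stages of the pipeline can be sketched with the standard library alone. This is a minimal illustration, not the code in main.py or dir/create_csv.py: the function names are hypothetical, the script is inferred from Unicode block ranges, and anything outside those ranges is simply treated as Latin. The actual transliteration to ITRANS is left out.

```python
from collections import defaultdict

# Unicode block ranges for the Indic scripts listed in this README.
SCRIPT_RANGES = {
    "devanagari": (0x0900, 0x097F),
    "gurmukhi": (0x0A00, 0x0A7F),
    "gujarati": (0x0A80, 0x0AFF),
    "tamil": (0x0B80, 0x0BFF),
    "telugu": (0x0C00, 0x0C7F),
    "kannada": (0x0C80, 0x0CFF),
    "malayalam": (0x0D00, 0x0D7F),
}


def detect_script(name: str) -> str:
    """Return the first Indic script found in `name`, else 'latin' (assumed fallback)."""
    for ch in name:
        code = ord(ch)
        for script, (low, high) in SCRIPT_RANGES.items():
            if low <= code <= high:
                return script
    return "latin"


def index_by_first_char(names):
    """Group database names by lowercase first character (hypothetical helper)."""
    index = defaultdict(list)
    for raw in names:
        name = raw.strip()
        if name:
            index[name[0].lower()].append(name)
    return index


def candidates_for(query: str, index) -> list:
    """Return only the names that share the query's first character."""
    query = query.strip()
    if not query:
        return []
    return list(index.get(query[0].lower(), []))
```

Building the index once at startup means each request only touches the bucket for its first letter instead of the full names database.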

🚀 Quick Start

Prerequisites

Tool	Version	Install
Python	≥ 3.11	python.org
uv	latest	`pip install uv` or docs.astral.sh/uv

1. Clone the repository

git clone https://github.com/your-username/transfuzzy.git
cd transfuzzy

2. Install dependencies

uv sync

That's it. uv sync reads pyproject.toml, creates a virtual environment (.venv), and installs all pinned dependencies from uv.lock.

3. Run the development server

# Cross-platform launcher — starts Flask AND opens your browser automatically
uv run python scripts/dev.py

Or, to start the app directly without the launcher:

uv run python main.py

The app will be available at http://localhost:5000

📡 API Reference

`POST /similar_names`

Find names phonetically/semantically similar to the input.

Request

POST /similar_names
Content-Type: application/json

{
  "name": "Rahul"
}

Supported scripts — you can also pass names in:

Devanagari: "राहुल"
Telugu: "రాహుల్"
Tamil, Kannada, Malayalam, Gujarati, Gurmukhi

Response (200 OK)

{
  "similar_names": ["Rahul", "Raahul", "Rahool", "Rahil"]
}

Error Response

{
  "error": "name parameter is required"
}

Status	Meaning
`200`	Success — `similar_names` array returned
`400`	Bad request — missing/invalid `name` field
`500`	Server error — model or database file issue

cURL Example

curl -X POST http://localhost:5000/similar_names \
  -H "Content-Type: application/json" \
  -d '{"name": "Priya"}'
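
The same call can be made from Python with only the standard library. This is a sketch under the assumptions already stated above (the server runs at http://localhost:5000 and returns a `similar_names` array); `build_request` and `similar_names` are illustrative helper names, not part of the package.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # default dev address from Quick Start


def build_request(name: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /similar_names request without sending it."""
    body = json.dumps({"name": name}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/similar_names",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def similar_names(name: str, base_url: str = BASE_URL) -> list:
    """Send the request and return the `similar_names` array from the response."""
    with urllib.request.urlopen(build_request(name, base_url), timeout=30) as resp:
        payload = json.load(resp)
    return payload["similar_names"]
```

With the dev server running, `similar_names("Priya")` returns the list of matches. For `400` and `500` responses, `urlopen` raises `urllib.error.HTTPError`, so callers that need the `error` message should catch it and read the response body.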

🎓 Training Your Own Model

If you want to retrain the Random Forest model on your own name data, follow these steps.

Step 1 — Prepare your names data

Edit db/names2.txt. Each line defines a cluster of similar names:

Rahul > Raahul, Rahool, Rahil
Priya > Preya, Priyah, Pria
Arjun > Arjoon, Arjuun, Arjan

Names within the same cluster form positive pairs. Names from different clusters that start with the same letter form hard negative pairs.
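
As an illustration of the cluster format, the lines can be parsed roughly like this. It is a sketch, not the parser in dir/enrich_data.py: the function names are hypothetical, and skipping blank or malformed lines is an assumption.

```python
def parse_cluster_line(line: str):
    """Parse 'Canonical > Variant1, Variant2' into (canonical, [variants])."""
    canonical, sep, rest = line.partition(">")
    canonical = canonical.strip()
    if not sep or not canonical:
        return None  # assumed behaviour: ignore blank or malformed lines
    variants = [v.strip() for v in rest.split(",") if v.strip()]
    return canonical, variants


def parse_clusters(lines):
    """Build {canonical: [variants]} from an iterable of cluster lines."""
    clusters = {}
    for line in lines:
        parsed = parse_cluster_line(line)
        if parsed is not None:
            canonical, variants = parsed
            clusters[canonical] = variants
    return clusters
```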

Step 2 — Enrich the data (compute similarity metrics)

uv run python scripts/enrich.py

This runs dir/enrich_data.py which:

Parses clusters from db/names2.txt
Generates positive + hard-negative name pairs
Computes all 8 similarity metrics for each pair
Saves enriched training data to db/names.csv
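
The pair-generation step could look something like the sketch below. It assumes every unordered pair inside a cluster counts as a positive, and it labels pairs `'y'` and `'n'`; the `'y'` class appears in the inference pipeline, while `'n'` and the helper names are assumptions made for this example.

```python
from itertools import combinations


def positive_pairs(clusters):
    """All unordered pairs of names inside the same cluster, labelled 'y'."""
    pairs = []
    for canonical, variants in clusters.items():
        members = [canonical, *variants]
        pairs.extend((a, b, "y") for a, b in combinations(members, 2))
    return pairs


def hard_negative_pairs(clusters):
    """Pairs from different clusters whose names share a first letter, labelled 'n'."""
    groups = [[canonical, *variants] for canonical, variants in clusters.items()]
    pairs = []
    for group_a, group_b in combinations(groups, 2):
        for a in group_a:
            for b in group_b:
                if a[0].lower() == b[0].lower():
                    pairs.append((a, b, "n"))
    return pairs
```

Restricting negatives to names with the same first letter mirrors the first-character pre-filter used at inference, so the classifier is trained on the kind of near-misses it will actually have to reject.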

⚠️ This step loads the sentence-transformer model and may take 5–15 minutes depending on the size of your dataset.

Step 3 — Train the model

uv run python scripts/train.py

This runs dir/train_model.py which:

Loads db/names.csv
Runs GridSearchCV over Random Forest hyperparameters
Evaluates on a 25% held-out test set
Saves the best model to db/best_random_forest_model.pkl
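
A condensed sketch of this training loop with scikit-learn follows. It is not the contents of dir/train_model.py: the parameter grid, fold count, and accuracy scoring are assumptions, and `make_demo_data` generates synthetic 8-feature data as a stand-in for db/names.csv.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split


def make_demo_data(n_samples=200, seed=0):
    """Synthetic stand-in for db/names.csv: 8 numeric features, binary label."""
    return make_classification(
        n_samples=n_samples, n_features=8, n_informative=5, random_state=seed
    )


def train_best_forest(X, y, param_grid, seed=0):
    """Grid-search a Random Forest and score it on a 25% held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    search = GridSearchCV(
        RandomForestClassifier(random_state=seed),
        param_grid=param_grid,
        cv=3,                # assumed fold count
        scoring="accuracy",  # assumed model-selection metric
    )
    search.fit(X_train, y_train)
    # Accuracy of the refit best model on data it never saw during the search.
    test_accuracy = search.best_estimator_.score(X_test, y_test)
    return search.best_estimator_, search.best_params_, test_accuracy
```

The returned estimator can then be written to disk with `joblib.dump(model, path)`, which matches the `.pkl` file the app loads at startup.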

🧩 Similarity Metrics Explained

Metric	Type	Description
`soundex_ratio`	Phonetic	Similarity of Soundex codes (letter+digit hash)
`metaphone_ratio`	Phonetic	Similarity of Metaphone codes (pronunciation hash)
`levenshtein_ratio`	String	1 − (edit distance / max length)
`jaro_winkler_ratio`	String	Jaro-Winkler similarity (best for short strings)
`cosine_similarity`	Embedding	Cosine of the angle between MiniLM embeddings
`euclidean_similarity`	Embedding	`1 / (1 + euclidean distance)`
`manhattan_similarity`	Embedding	`1 / (1 + L1 distance)`
`pearson_similarity`	Embedding	`(Pearson correlation + 1) / 2`
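
The formulas in the table can be written out in plain Python to make them concrete. This sketch works on ordinary lists of floats rather than real MiniLM embeddings and is not the code in dir/calculate_ratios.py; returning 0.0 for zero-length or constant vectors is an assumption made here to avoid division by zero.

```python
import math


def levenshtein_distance(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_ratio(a: str, b: str) -> float:
    """1 - (edit distance / max length); two empty strings count as identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def cosine_similarity(u, v) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def euclidean_similarity(u, v) -> float:
    return 1.0 / (1.0 + math.dist(u, v))


def manhattan_similarity(u, v) -> float:
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(u, v)))


def pearson_similarity(u, v) -> float:
    """Map the Pearson correlation from [-1, 1] to [0, 1]."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((x - mean_u) * (y - mean_v) for x, y in zip(u, v))
    std_u = math.sqrt(sum((x - mean_u) ** 2 for x in u))
    std_v = math.sqrt(sum((y - mean_v) ** 2 for y in v))
    r = cov / (std_u * std_v) if std_u and std_v else 0.0
    return (r + 1.0) / 2.0
```

All of these land in the range 0 to 1 for typical inputs, with larger values meaning more similar, which keeps the features on comparable scales for the classifier.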

The Random Forest classifier is trained on all 8 features.
At inference, a candidate passes the hybrid filter if any of the following holds:

RF confidence ≥ 0.60, OR
RF confidence ≥ 0.20 AND phonetic match (Soundex/Metaphone), OR
Jaro-Winkler ≥ 0.92 (obvious variants)

Then a composite weighted score filters out low-quality matches (threshold: 0.70).
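
Expressed as code, the acceptance rule reads roughly as follows. The thresholds are the ones quoted above; the function names are hypothetical, and the composite score is taken as an input because its weights are not documented here.

```python
def passes_hybrid_filter(rf_confidence: float, phonetic_match: bool, jaro_winkler: float) -> bool:
    """True if any of the three acceptance conditions holds."""
    return (
        rf_confidence >= 0.60
        or (rf_confidence >= 0.20 and phonetic_match)
        or jaro_winkler >= 0.92
    )


def keep_match(rf_confidence: float, phonetic_match: bool,
               jaro_winkler: float, composite_score: float) -> bool:
    """Apply the hybrid filter, then the composite-score threshold of 0.70."""
    return (
        passes_hybrid_filter(rf_confidence, phonetic_match, jaro_winkler)
        and composite_score >= 0.70
    )
```

The second and third conditions let a phonetic or near-identical spelling match through even when the classifier is unsure, while the composite threshold still removes weak candidates overall.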

🗂️ Project Structure Details

db/
├── names_2.txt          # Runtime names database (73k+ names, one per line)
├── names2.txt           # Clustered names for training (Name > Variant1, Variant2)
├── names.csv            # Training data with computed metrics (generated by enrich.py)
└── best_random_forest_model.pkl  # Trained classifier

Note: db/names.csv is auto-generated and gitignored. best_random_forest_model.pkl IS committed so contributors can run the app without retraining.

🤝 Contributing

Contributions are welcome! Here are some ways you can help:

🌐 Add more names to the database (db/names_2.txt)
📊 Add more name clusters for training (db/names2.txt)
🔤 Add support for new Indic scripts
🐛 Report bugs via GitHub Issues
✨ Improve the matching pipeline or scoring thresholds

Development Setup

git clone https://github.com/your-username/transfuzzy.git
cd transfuzzy
uv sync
uv run python scripts/dev.py

Submitting a Pull Request

1. Fork the repository
2. Create a feature branch: git checkout -b feat/your-feature
3. Commit your changes: git commit -m 'feat: add support for Bengali script'
4. Push and open a PR

Please follow Conventional Commits for commit messages.

📋 Requirements Summary

All dependencies are managed via uv and pinned in uv.lock:

Package	Purpose
`flask`	Web framework
`flask-cors`	Cross-origin support
`fuzzywuzzy`	Fuzzy string matching
`python-levenshtein`	Fast Levenshtein distance
`jellyfish`	Phonetic codes (Soundex, Metaphone) and Jaro-Winkler similarity
`sentence-transformers`	Semantic embeddings (`all-MiniLM-L6-v2`)
`scikit-learn`	Random Forest classifier
`indic-transliteration`	Devanagari/Telugu/Tamil etc. → ITRANS
`pandas`, `numpy`, `scipy`	Data manipulation and math
`joblib`	Model serialization
`matplotlib`	Feature importance plots (training only)

📄 License

MIT © Goutham Dechineni

Made with ❤️ for the open-source community

Release history

0.1.2 (Mar 30, 2026)
0.1.1 (Mar 30, 2026)
0.1.0 (Mar 30, 2026, this version)

