TransFuzzy is a robust transliteration system that bridges the gap between Indic scripts and the Latin alphabet.
# TransFuzzy

Multilingual fuzzy name matching: phonetic + semantic + ML, all in one pipeline.

Features • Architecture • Quick Start • API Reference • Training • Contributing
## Features

- **Multilingual**: supports English, Hindi (Devanagari), Telugu, Tamil, Kannada, Malayalam, Gujarati, and Gurmukhi out of the box
- **Phonetic matching**: Soundex and Metaphone codes catch phonetically similar spellings
- **String distance**: Levenshtein and Jaro-Winkler similarity
- **Semantic embeddings**: the all-MiniLM-L6-v2 sentence transformer measures semantic closeness
- **ML classifier**: a trained Random Forest model combines all metrics into a final confident prediction
- **Fast**: pre-filters candidates by first letter, batch-encodes embeddings, and loads models once at startup
- **Web UI**: clean browser-based interface, zero frontend framework required
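To make the phonetic layer concrete, here is a minimal pure-Python Soundex sketch. This is an illustration only; the project itself uses the jellyfish library for Soundex and Metaphone.

```python
def soundex(name: str) -> str:
    """Classic 4-character Soundex code: first letter + up to 3 digits."""
    codes = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
             "l": "4", "mn": "5", "r": "6"}

    def digit(ch):
        for letters, d in codes.items():
            if ch in letters:
                return d
        return ""  # vowels and h/w/y carry no code

    name = name.lower()
    result = name[0].upper()
    prev = digit(name[0])
    for ch in name[1:]:
        d = digit(ch)
        if d and d != prev:
            result += d
        if ch not in "hw":  # h/w do not break runs of equal digits
            prev = d
    return (result + "000")[:4]
```

Spelling variants like "Rahul" and "Raahul" collapse to the same code, which is exactly what makes Soundex useful as a matching feature.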
## Architecture

    transfuzzy/
    ├── main.py                          # Flask app: routes, transliteration, orchestration
    ├── dir/
    │   ├── create_csv.py                # Step 1: pair input name against the names database
    │   ├── calculate_ratios.py          # Step 2: compute 8 similarity metrics per pair
    │   ├── compute_metrics.py           # Step 3: RF model predicts + hybrid scoring
    │   ├── enrich_data.py               # (Training) generate positive/negative training pairs
    │   └── train_model.py               # (Training) GridSearchCV to train & save best RF model
    ├── utils/
    │   └── response.py                  # Standardised JSON response helper
    ├── db/
    │   ├── names_2.txt                  # Names database (one name per line)
    │   ├── names.csv                    # Enriched training data
    │   └── best_random_forest_model.pkl # Pre-trained model (committed)
    ├── templates/
    │   └── index.html                   # Jinja2 template for the web UI
    ├── static/
    │   ├── styles.css
    │   ├── api.js
    │   ├── ui.js
    │   └── app.js
    ├── pyproject.toml                   # Project metadata & dependencies (uv)
    └── scripts/
        ├── dev.py                       # Cross-platform dev launcher (uv run + open browser)
        ├── enrich.py                    # Convenience wrapper: enrich_data pipeline
        └── train.py                     # Convenience wrapper: train model pipeline
### Inference Pipeline

    Input Name
        │
        ▼
    [Script Detection] ──── Devanagari/Telugu/etc? ──► Transliterate to ITRANS
        │
        ▼
    [Create Pairs] ──── Compare against ~73k names (pre-filtered by 1st char)
        │
        ▼
    [Calculate Ratios] ──── 8 metrics: Soundex, Metaphone, Levenshtein,
        │                   Jaro-Winkler, Cosine, Euclidean, Manhattan, Pearson
        ▼
    [RF Classifier] ──── Predict probability of match (class 'y')
        │
        ▼
    [Hybrid Filter] ──── Accept if: high RF confidence OR phonetic match
        │                Reject if composite score < 0.70
        ▼
    [Results] ──── Sorted by composite score, transliterated back
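The script-detection stage can be sketched with plain Unicode block ranges. This is an illustrative stand-in using only the standard library; the real app relies on the indic-transliteration package for detection and ITRANS conversion.

```python
# Unicode block ranges for the supported Indic scripts.
SCRIPT_RANGES = {
    "devanagari": (0x0900, 0x097F),
    "gurmukhi":   (0x0A00, 0x0A7F),
    "gujarati":   (0x0A80, 0x0AFF),
    "tamil":      (0x0B80, 0x0BFF),
    "telugu":     (0x0C00, 0x0C7F),
    "kannada":    (0x0C80, 0x0CFF),
    "malayalam":  (0x0D00, 0x0D7F),
}

def detect_script(name: str) -> str:
    """Return the script of the first Indic character found, else 'latin'."""
    for ch in name:
        cp = ord(ch)
        for script, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                return script
    return "latin"
```

A Latin-script input skips transliteration entirely; any Indic hit is routed through the ITRANS step before pairing.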
## Quick Start

### Prerequisites

| Tool | Version | Install |
|---|---|---|
| Python | ≥ 3.11 | python.org |
| uv | latest | `pip install uv` or docs.astral.sh/uv |
### 1. Clone the repository

    git clone https://github.com/your-username/transfuzzy.git
    cd transfuzzy

### 2. Install dependencies

    uv sync

That's it. `uv sync` reads `pyproject.toml`, creates a virtual environment (`.venv`), and installs all pinned dependencies from `uv.lock`.

### 3. Run the development server

    # Cross-platform launcher: starts Flask AND opens your browser automatically
    uv run python scripts/dev.py

Or, if you prefer the raw Flask command:

    uv run python main.py

The app will be available at http://localhost:5000.
## API Reference

### POST /similar_names

Find names phonetically/semantically similar to the input.

**Request**

    POST /similar_names
    Content-Type: application/json

    {
      "name": "Rahul"
    }

Supported scripts: you can also pass names in

- Devanagari: "राहुल"
- Telugu: "రాహుల్"
- Tamil, Kannada, Malayalam, Gujarati, Gurmukhi

**Response (200 OK)**

    {
      "similar_names": ["Rahul", "Raahul", "Rahool", "Rahil"]
    }

**Error Response**

    {
      "error": "name parameter is required"
    }
| Status | Meaning |
|---|---|
| 200 | Success: `similar_names` array returned |
| 400 | Bad request: missing/invalid `name` field |
| 500 | Server error: model or database file issue |
**cURL Example**

    curl -X POST http://localhost:5000/similar_names \
      -H "Content-Type: application/json" \
      -d '{"name": "Priya"}'
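The same request can be made from Python with only the standard library. A sketch, assuming the dev server from the Quick Start is running on localhost:5000:

```python
import json
import urllib.request

def similar_names(name, base_url="http://localhost:5000"):
    """POST a name to /similar_names and return the list of matches."""
    payload = json.dumps({"name": name}).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/similar_names",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["similar_names"]
```

Usage: `similar_names("Priya")` returns the same array as the cURL call above.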
## Training Your Own Model

If you want to retrain the Random Forest model on your own name data, follow these steps.

### Step 1: Prepare your names data

Edit `db/names2.txt`. Each line defines a cluster of similar names:

    Rahul > Raahul, Rahool, Rahil
    Priya > Preya, Priyah, Pria
    Arjun > Arjoon, Arjuun, Arjan

Names within the same cluster form positive pairs. Names across different clusters (but starting with the same letter) form hard negative pairs.
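The pair-generation rule above can be sketched in a few lines. This is an illustration of the described scheme, not the actual code in `dir/enrich_data.py`:

```python
from itertools import combinations

def parse_clusters(lines):
    """Parse 'Canonical > Variant1, Variant2' lines into name clusters."""
    clusters = []
    for line in lines:
        if ">" not in line:
            continue
        head, tail = line.split(">", 1)
        names = [head.strip()] + [v.strip() for v in tail.split(",")]
        clusters.append([n for n in names if n])
    return clusters

def make_pairs(clusters):
    """Positive pairs within a cluster; hard negatives across clusters
    whose names share a first letter."""
    positives = [p for c in clusters for p in combinations(c, 2)]
    negatives = []
    for c1, c2 in combinations(clusters, 2):
        for a in c1:
            for b in c2:
                if a[0].lower() == b[0].lower():
                    negatives.append((a, b))
    return positives, negatives
```

Restricting negatives to same-first-letter pairs keeps them "hard": the model learns to separate genuinely confusable names rather than trivially different ones.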
### Step 2: Enrich the data (compute similarity metrics)

    uv run python scripts/enrich.py

This runs `dir/enrich_data.py`, which:

- Parses clusters from `db/names2.txt`
- Generates positive and hard-negative name pairs
- Computes all 8 similarity metrics for each pair
- Saves enriched training data to `db/names.csv`

⚠️ This step loads the sentence-transformer model and may take 5–15 minutes depending on the size of your dataset.
### Step 3: Train the model

    uv run python scripts/train.py

This runs `dir/train_model.py`, which:

- Loads `db/names.csv`
- Runs `GridSearchCV` over Random Forest hyperparameters
- Evaluates on a 25% held-out test set
- Saves the best model to `db/best_random_forest_model.pkl`
## Similarity Metrics Explained

| Metric | Type | Description |
|---|---|---|
| `soundex_ratio` | Phonetic | Similarity of Soundex codes (letter+digit hash) |
| `metaphone_ratio` | Phonetic | Similarity of Metaphone codes (pronunciation hash) |
| `levenshtein_ratio` | String | 1 − (edit distance / max length) |
| `jaro_winkler_ratio` | String | Jaro-Winkler similarity (best for short strings) |
| `cosine_similarity` | Embedding | Cosine of the angle between MiniLM embeddings |
| `euclidean_similarity` | Embedding | 1 / (1 + Euclidean distance) |
| `manhattan_similarity` | Embedding | 1 / (1 + L1 distance) |
| `pearson_similarity` | Embedding | (Pearson correlation + 1) / 2 |
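The three distance-derived embedding similarities in the table rescale raw distances and correlation into [0, 1]. A minimal standard-library sketch of those formulas:

```python
import math

def euclidean_similarity(u, v):
    """1 / (1 + L2 distance): 1.0 for identical vectors, -> 0 as they diverge."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return 1.0 / (1.0 + d)

def manhattan_similarity(u, v):
    """1 / (1 + L1 distance)."""
    return 1.0 / (1.0 + sum(abs(a - b) for a, b in zip(u, v)))

def pearson_similarity(u, v):
    """(Pearson correlation + 1) / 2, rescaling [-1, 1] into [0, 1]."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return (cov / (su * sv) + 1.0) / 2.0
```

In the real pipeline `u` and `v` are 384-dimensional MiniLM sentence embeddings; the formulas are dimension-agnostic.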
The Random Forest classifier is trained on all 8 features. At inference, results are filtered using a hybrid scoring system; a candidate is kept if any of the following holds:

- RF confidence ≥ 0.60, OR
- RF confidence ≥ 0.20 AND a phonetic match (Soundex/Metaphone), OR
- Jaro-Winkler ≥ 0.92 (obvious variants)

Then a composite weighted score filters out low-quality matches (threshold: 0.70).
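The acceptance rules above can be expressed as a single predicate. A sketch using the thresholds stated in this README (treat them as illustrative, not as the canonical implementation):

```python
def passes_hybrid_filter(rf_prob, phonetic_match, jaro_winkler,
                         composite_score):
    """Return True if a candidate survives the hybrid filter:
    any acceptance rule fires AND the composite score clears 0.70."""
    accepted = (
        rf_prob >= 0.60
        or (rf_prob >= 0.20 and phonetic_match)
        or jaro_winkler >= 0.92
    )
    return accepted and composite_score >= 0.70
```

Note that the composite-score threshold is applied last: even a confident RF prediction is dropped if the weighted score falls below 0.70.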
## Project Structure Details

    db/
    ├── names_2.txt                  # Runtime names database (73k+ names, one per line)
    ├── names2.txt                   # Clustered names for training (Name > Variant1, Variant2)
    ├── names.csv                    # Training data with computed metrics (generated by enrich.py)
    └── best_random_forest_model.pkl # Trained classifier

Note: `db/names.csv` is auto-generated and gitignored; `best_random_forest_model.pkl` IS committed so contributors can run the app without retraining.
## Contributing

Contributions are welcome! Here are some ways you can help:

- Add more names to the database (`db/names_2.txt`)
- Add more name clusters for training (`db/names2.txt`)
- Add support for new Indic scripts
- Report bugs via GitHub Issues
- Improve the matching pipeline or scoring thresholds
### Development Setup

    git clone https://github.com/your-username/transfuzzy.git
    cd transfuzzy
    uv sync
    uv run python scripts/dev.py

### Submitting a Pull Request

1. Fork the repository
2. Create a feature branch: `git checkout -b feat/your-feature`
3. Commit your changes: `git commit -m 'feat: add support for Bengali script'`
4. Push and open a PR

Please follow Conventional Commits for commit messages.
## Requirements Summary

All dependencies are managed via uv and pinned in `uv.lock`:

| Package | Purpose |
|---|---|
| `flask` | Web framework |
| `flask-cors` | Cross-origin support |
| `fuzzywuzzy` | Fuzzy string matching |
| `python-levenshtein` | Fast Levenshtein distance |
| `jellyfish` | Phonetic algorithms (Soundex, Metaphone, Jaro-Winkler) |
| `sentence-transformers` | Semantic embeddings (all-MiniLM-L6-v2) |
| `scikit-learn` | Random Forest classifier |
| `indic-transliteration` | Devanagari/Telugu/Tamil etc. → ITRANS |
| `pandas`, `numpy`, `scipy` | Data manipulation and math |
| `joblib` | Model serialization |
| `matplotlib` | Feature importance plots (training only) |
## License

MIT © Goutham Dechineni

Made with ❤️ for the open-source community.
## File details

Details for the file `transfuzzy-0.1.0.tar.gz`.

- Download URL: transfuzzy-0.1.0.tar.gz
- Upload date:
- Size: 20.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9

**File hashes**

| Algorithm | Hash digest |
|---|---|
| SHA256 | `2900b7d06d47e86ac5f69214ebbf422cec61eeef89ed527690e5a0ebac5d491d` |
| MD5 | `7a6fb9a160bf02313bdffee63042e48e` |
| BLAKE2b-256 | `0f16a1c6e75d567b30ad6fcbf55cebbb28ca06b91922d8482929dd6dff039ce6` |
## File details

Details for the file `transfuzzy-0.1.0-py3-none-any.whl`.

- Download URL: transfuzzy-0.1.0-py3-none-any.whl
- Upload date:
- Size: 18.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9

**File hashes**

| Algorithm | Hash digest |
|---|---|
| SHA256 | `89a62e4b11bca8fc7ff110cfa1519c426076d3c22c429b0956c206afa7aeff9c` |
| MD5 | `53c193e60bdc51961e90c3914e1d3595` |
| BLAKE2b-256 | `21de095d723afee3c30cb786c3dc57848abb4ddbb9ac4a76f66a235de7af6183` |