TransFuzzy is a robust transliteration system that bridges the gap between Indic scripts and the Latin alphabet.

TransFuzzy

Multilingual Fuzzy Name Matching: phonetic + semantic + ML, all in one pipeline.

Features • Architecture • Quick Start • API Reference • Training • Contributing


Features

  • Multilingual: supports English, Hindi (Devanagari), Telugu, Tamil, Kannada, Malayalam, Gujarati, and Gurmukhi out of the box
  • Phonetic matching: Soundex and Metaphone codes to catch phonetically similar spellings
  • String distance: Levenshtein and Jaro-Winkler similarity
  • Semantic embeddings: all-MiniLM-L6-v2 sentence transformer for semantic closeness
  • ML classifier: a trained Random Forest model that combines all metrics for a final confident prediction
  • Fast: pre-filters candidates by first letter, batch-encodes embeddings, and loads models once at startup
  • Web UI: clean browser-based interface, zero frontend framework required
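
The first-letter pre-filter can be pictured as a dictionary of buckets keyed by initial character. The following is a minimal sketch rather than the project's code: `build_index` and `candidates_for` are hypothetical names, and case-insensitive bucketing is an assumption.

```python
from collections import defaultdict


def build_index(names):
    """Group names by their lowercased first character (assumed behaviour)."""
    index = defaultdict(list)
    for name in names:
        name = name.strip()
        if name:
            index[name[0].lower()].append(name)
    return index


def candidates_for(query, index):
    """Return only the names that share the query's first character."""
    query = query.strip()
    if not query:
        return []
    return index.get(query[0].lower(), [])


index = build_index(["Rahul", "Raahul", "Priya", "Arjun", "rahil"])
print(candidates_for("Rahool", index))  # ['Rahul', 'Raahul', 'rahil']
```

Building the index once at startup means each query only touches one bucket instead of the full names list.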

๐Ÿ›๏ธ Architecture

transfuzzy/
โ”œโ”€โ”€ main.py               # Flask app โ€” routes, transliteration, orchestration
โ”œโ”€โ”€ dir/
โ”‚   โ”œโ”€โ”€ create_csv.py     # Step 1: pair input name against the names database
โ”‚   โ”œโ”€โ”€ calculate_ratios.py  # Step 2: compute 8 similarity metrics per pair
โ”‚   โ”œโ”€โ”€ compute_metrics.py   # Step 3: RF model predicts + hybrid scoring
โ”‚   โ”œโ”€โ”€ enrich_data.py    # (Training) generate positive/negative training pairs
โ”‚   โ””โ”€โ”€ train_model.py    # (Training) GridSearchCV to train & save best RF model
โ”œโ”€โ”€ utils/
โ”‚   โ””โ”€โ”€ response.py       # Standardised JSON response helper
โ”œโ”€โ”€ db/
โ”‚   โ”œโ”€โ”€ names_2.txt              # Names database (one name per line)
โ”‚   โ”œโ”€โ”€ names.csv                # Enriched training data
โ”‚   โ””โ”€โ”€ best_random_forest_model.pkl  # Pre-trained model (committed)
โ”œโ”€โ”€ templates/
โ”‚   โ””โ”€โ”€ index.html        # Jinja2 template for the web UI
โ”œโ”€โ”€ static/
โ”‚   โ”œโ”€โ”€ styles.css
โ”‚   โ”œโ”€โ”€ api.js
โ”‚   โ”œโ”€โ”€ ui.js
โ”‚   โ””โ”€โ”€ app.js
โ”œโ”€โ”€ pyproject.toml        # Project metadata & dependencies (uv)
โ””โ”€โ”€ scripts/
    โ”œโ”€โ”€ dev.py            # Cross-platform dev launcher (uv run + open browser)
    โ”œโ”€โ”€ enrich.py         # Convenience wrapper: enrich_data pipeline
    โ””โ”€โ”€ train.py          # Convenience wrapper: train model pipeline

Inference Pipeline

Input Name
    │
    ▼
[Script Detection]  ──── Devanagari/Telugu/etc? ──► Transliterate to ITRANS
    │
    ▼
[Create Pairs]      ──── Compare against ~73k names (pre-filtered by 1st char)
    │
    ▼
[Calculate Ratios]  ──── 8 metrics: Soundex, Metaphone, Levenshtein,
    │                    Jaro-Winkler, Cosine, Euclidean, Manhattan, Pearson
    ▼
[RF Classifier]     ──── Predict probability of match (class 'y')
    │
    ▼
[Hybrid Filter]     ──── Accept if: high RF confidence OR phonetic match
    │                    Reject if composite score < 0.70
    ▼
[Results]           ──── Sorted by composite score, transliterated back
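
The script-detection step at the top of the pipeline can be approximated by checking each character against the Unicode block of every supported script. This is only an illustrative sketch, not the code in main.py: `SCRIPT_RANGES` and `detect_script` are hypothetical names, and returning "latin" as the fallback is an assumption. The transliteration to ITRANS itself is handled by the indic-transliteration package and is not shown here.

```python
# Unicode block ranges for the Indic scripts listed in this README.
SCRIPT_RANGES = {
    "devanagari": (0x0900, 0x097F),
    "gurmukhi": (0x0A00, 0x0A7F),
    "gujarati": (0x0A80, 0x0AFF),
    "tamil": (0x0B80, 0x0BFF),
    "telugu": (0x0C00, 0x0C7F),
    "kannada": (0x0C80, 0x0CFF),
    "malayalam": (0x0D00, 0x0D7F),
}


def detect_script(text):
    """Return the first Indic script found in text, or 'latin' if none is found."""
    for ch in text:
        code = ord(ch)
        for script, (start, end) in SCRIPT_RANGES.items():
            if start <= code <= end:
                return script
    return "latin"  # assumed fallback: no transliteration needed


print(detect_script("Rahul"))   # latin
print(detect_script("राहुल"))    # devanagari
print(detect_script("రాహుల్"))   # telugu
```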

Quick Start

Prerequisites

| Tool | Version | Install |
|------|---------|---------|
| Python | ≥ 3.11 | python.org |
| uv | latest | pip install uv or docs.astral.sh/uv |

1. Clone the repository

git clone https://github.com/your-username/transfuzzy.git
cd transfuzzy

2. Install dependencies

uv sync

That's it. uv sync reads pyproject.toml, creates a virtual environment (.venv), and installs all pinned dependencies from uv.lock.

3. Run the development server

# Cross-platform launcher: starts Flask AND opens your browser automatically
uv run python scripts/dev.py

Or, if you prefer to start the Flask app directly:

uv run python main.py

The app will be available at http://localhost:5000


API Reference

POST /similar_names

Find names phonetically/semantically similar to the input.

Request

POST /similar_names
Content-Type: application/json

{
  "name": "Rahul"
}

Supported scripts: you can also pass names in any of the following.

  • Devanagari: "राहुल"
  • Telugu: "రాహుల్"
  • Tamil, Kannada, Malayalam, Gujarati, Gurmukhi

Response (200 OK)

{
  "similar_names": ["Rahul", "Raahul", "Rahool", "Rahil"]
}

Error Response

{
  "error": "name parameter is required"
}

| Status | Meaning |
|--------|---------|
| 200 | Success: similar_names array returned |
| 400 | Bad request: missing/invalid name field |
| 500 | Server error: model or database file issue |

cURL Example

curl -X POST http://localhost:5000/similar_names \
  -H "Content-Type: application/json" \
  -d '{"name": "Priya"}'
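
The same call can be made from Python using only the standard library. This is a sketch that assumes the development server is running at http://localhost:5000 as described in Quick Start; `build_request` and `similar_names` are illustrative helper names, not part of TransFuzzy.

```python
import json
import urllib.request


def build_request(name, base_url="http://localhost:5000"):
    """Build (but do not send) a POST request for /similar_names."""
    body = json.dumps({"name": name}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/similar_names",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def similar_names(name, base_url="http://localhost:5000"):
    """Send the request and return the similar_names list from the response."""
    with urllib.request.urlopen(build_request(name, base_url)) as resp:
        return json.load(resp)["similar_names"]
```

Calling `similar_names("Priya")` needs a live server. For 400 and 500 responses, `urllib.request.urlopen` raises `urllib.error.HTTPError`, so wrap the call in a try/except if you want to read the error body.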

Training Your Own Model

If you want to retrain the Random Forest model on your own name data, follow these steps.

Step 1: Prepare your names data

Edit db/names2.txt. Each line defines a cluster of similar names:

Rahul > Raahul, Rahool, Rahil
Priya > Preya, Priyah, Pria
Arjun > Arjoon, Arjuun, Arjan

Names within the same cluster form positive pairs. Names from different clusters that start with the same letter form hard negative pairs.
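
To make the pairing rule concrete, here is a small sketch of how cluster lines could be turned into positive and hard-negative pairs. It is not the code in dir/enrich_data.py: `parse_clusters` and `make_pairs` are hypothetical names, and treating the name before `>` as a member of its own cluster is an assumption.

```python
from itertools import combinations


def parse_clusters(lines):
    """Parse 'Name > Variant1, Variant2' lines into lists of names."""
    clusters = []
    for line in lines:
        if ">" not in line:
            continue  # skip blank or malformed lines
        head, variants = line.split(">", 1)
        names = [head.strip()] + [v.strip() for v in variants.split(",")]
        clusters.append([n for n in names if n])
    return clusters


def make_pairs(clusters):
    """Return (positive_pairs, hard_negative_pairs) as lists of tuples."""
    # Positive pairs: every two names inside the same cluster.
    positives = [p for c in clusters for p in combinations(c, 2)]
    # Hard negatives: names from different clusters sharing a first letter.
    negatives = [
        (a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
        if a[0].lower() == b[0].lower()
    ]
    return positives, negatives


lines = ["Rahul > Raahul, Rahool", "Ravi > Ravee", "Priya > Preya"]
positives, negatives = make_pairs(parse_clusters(lines))
print(len(positives), len(negatives))  # 5 6
```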

Step 2: Enrich the data (compute similarity metrics)

uv run python scripts/enrich.py

This runs dir/enrich_data.py which:

  1. Parses clusters from db/names2.txt
  2. Generates positive + hard-negative name pairs
  3. Computes all 8 similarity metrics for each pair
  4. Saves enriched training data to db/names.csv

Warning: This step loads the sentence-transformer model and may take 5 to 15 minutes depending on the size of your dataset.

Step 3: Train the model

uv run python scripts/train.py

This runs dir/train_model.py which:

  1. Loads db/names.csv
  2. Runs GridSearchCV over Random Forest hyperparameters
  3. Evaluates on a 25% held-out test set
  4. Saves the best model to db/best_random_forest_model.pkl

Similarity Metrics Explained

| Metric | Type | Description |
|--------|------|-------------|
| soundex_ratio | Phonetic | Similarity of Soundex codes (letter+digit hash) |
| metaphone_ratio | Phonetic | Similarity of Metaphone codes (pronunciation hash) |
| levenshtein_ratio | String | 1 − (edit distance / max length) |
| jaro_winkler_ratio | String | Jaro-Winkler similarity (best for short strings) |
| cosine_similarity | Embedding | Cosine of the angle between MiniLM embeddings |
| euclidean_similarity | Embedding | 1 / (1 + euclidean distance) |
| manhattan_similarity | Embedding | 1 / (1 + L1 distance) |
| pearson_similarity | Embedding | (Pearson correlation + 1) / 2 |
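
The formulas in the table can be written out in plain Python. This is a sketch that follows the table's definitions only; the project may compute these values through its listed libraries, and the function bodies here are assumptions. The embedding functions expect non-zero, non-constant vectors of equal length. Soundex, Metaphone, and Jaro-Winkler are left out because they come from library code.

```python
import math


def levenshtein_ratio(a, b):
    """1 - (edit distance / max length), as stated in the table."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1 - prev[-1] / max(len(a), len(b))


def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def euclidean_similarity(u, v):
    return 1 / (1 + math.dist(u, v))


def manhattan_similarity(u, v):
    return 1 / (1 + sum(abs(x - y) for x, y in zip(u, v)))


def pearson_similarity(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    du, dv = [x - mu for x in u], [y - mv for y in v]
    r = sum(x * y for x, y in zip(du, dv)) / (math.hypot(*du) * math.hypot(*dv))
    return (r + 1) / 2  # rescale correlation from [-1, 1] to [0, 1]


print(round(levenshtein_ratio("Rahul", "Raahul"), 3))  # 0.833
```

Each embedding metric is mapped to a similarity where larger means closer, which keeps the eight features on a comparable footing for the classifier.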

The Random Forest classifier is trained on all 8 features.
At inference, results are filtered using a hybrid scoring system:

  • RF confidence ≥ 0.60, OR
  • RF confidence ≥ 0.20 AND phonetic match (Soundex/Metaphone), OR
  • Jaro-Winkler ≥ 0.92 (obvious variants)

Then a composite weighted score filters out low-quality matches (threshold: 0.70).
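
Read literally, the rules above amount to a boolean check followed by a score cut-off. The sketch below shows that logic under stated assumptions: `passes_hybrid_filter` is a hypothetical name, and because the weights of the composite score are not given in this README, the composite score is taken as an already-computed input.

```python
RF_HIGH = 0.60         # RF confidence that is accepted on its own
RF_LOW = 0.20          # RF confidence that needs a phonetic match
JW_OBVIOUS = 0.92      # Jaro-Winkler level treated as an obvious variant
COMPOSITE_MIN = 0.70   # composite score threshold


def passes_hybrid_filter(rf_conf, phonetic_match, jaro_winkler, composite):
    """Apply the three acceptance rules, then the composite cut-off."""
    accepted = (
        rf_conf >= RF_HIGH
        or (rf_conf >= RF_LOW and phonetic_match)
        or jaro_winkler >= JW_OBVIOUS
    )
    return accepted and composite >= COMPOSITE_MIN


print(passes_hybrid_filter(0.25, True, 0.80, 0.75))   # True
print(passes_hybrid_filter(0.25, False, 0.80, 0.75))  # False
```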


Project Structure Details

db/
├── names_2.txt          # Runtime names database (73k+ names, one per line)
├── names2.txt           # Clustered names for training (Name > Variant1, Variant2)
├── names.csv            # Training data with computed metrics (generated by enrich.py)
└── best_random_forest_model.pkl  # Trained classifier

Note: db/names.csv is auto-generated and gitignored. best_random_forest_model.pkl IS committed so contributors can run the app without retraining.


Contributing

Contributions are welcome! Here are some ways you can help:

  • Add more names to the database (db/names_2.txt)
  • Add more name clusters for training (db/names2.txt)
  • Add support for new Indic scripts
  • Report bugs via GitHub Issues
  • Improve the matching pipeline or scoring thresholds

Development Setup

git clone https://github.com/your-username/transfuzzy.git
cd transfuzzy
uv sync
uv run python scripts/dev.py

Submitting a Pull Request

  1. Fork the repository
  2. Create a feature branch: git checkout -b feat/your-feature
  3. Commit your changes: git commit -m 'feat: add support for Bengali script'
  4. Push and open a PR

Please follow Conventional Commits for commit messages.


Requirements Summary

All dependencies are managed via uv and pinned in uv.lock:

| Package | Purpose |
|---------|---------|
| flask | Web framework |
| flask-cors | Cross-origin support |
| fuzzywuzzy | Fuzzy string matching |
| python-levenshtein | Fast Levenshtein distance |
| jellyfish | Phonetic algorithms (Soundex, Metaphone, Jaro-Winkler) |
| sentence-transformers | Semantic embeddings (all-MiniLM-L6-v2) |
| scikit-learn | Random Forest classifier |
| indic-transliteration | Devanagari/Telugu/Tamil etc. → ITRANS |
| pandas, numpy, scipy | Data manipulation and math |
| joblib | Model serialization |
| matplotlib | Feature importance plots (training only) |

License

MIT © Goutham Dechineni


Made with ❤️ for the open-source community
