TransFuzzy

TransFuzzy is a Python package for multilingual personal-name matching across Latin and several Indic scripts. It exposes the same matching pipeline through a CLI and a Flask API, and it supports switching between the bundled dataset and user-managed datasets.

What It Does

  • Accepts names in Latin, Devanagari, Telugu, Tamil, Kannada, Malayalam, Gujarati, and Gurmukhi.
  • Transliterates non-Latin input before matching.
  • Scores candidate names with phonetic, edit-distance, and embedding-based features.
  • Returns the best matches through transfuzzy predict or the HTTP API.
  • Lets you upload, activate, list, and delete datasets without modifying package files.

Installation

From PyPI

pip install transfuzzy

Local development setup

uv sync

Python 3.11+ is required.

Runtime Notes

TransFuzzy currently loads the sentence-transformers/all-MiniLM-L6-v2 model during module import. On a fresh machine, the first CLI, API, or test run may download model files from Hugging Face before the command can complete.

That has two practical consequences:

  • The first run can be noticeably slower than later runs.
  • In offline or restricted-network environments, the import can fail before the CLI prints its help text, the API server starts, or the tests begin running.
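One workaround for offline or CI machines (this relies on standard Hugging Face cache behavior and is an assumption about your environment, not a TransFuzzy feature) is to point the cache at a persistent location and warm it once while online:

```shell
# Optional: keep the Hugging Face cache in a known, persistent location.
# HF_HOME is a standard Hugging Face environment variable; the path is hypothetical.
export HF_HOME=/path/to/hf-cache

# Pre-download the model once while online so later runs hit the cache.
python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')"
```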

Quick Start

Start the API server

transfuzzy

or

transfuzzy serve --port 3000

The Flask server listens on http://localhost:3000 and opens that URL in your default browser on startup.

Query from the CLI

transfuzzy predict "Rahul"

Limit results:

transfuzzy predict "Rahul" --top 5

Return JSON:

transfuzzy predict "Rahul" --json

Use a specific text dataset file directly:

transfuzzy predict "Rahul" --db .\names.txt --top 5 --json

Supported Input Scripts

Examples of valid input:

Rahul
राहुल
రాహుల్

The output is transliterated back to the original script when the input was converted from a supported Indic script.
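As a rough illustration of how an input's script can be identified, the standard unicodedata module classifies the first character by its Unicode name (this is illustrative only, not TransFuzzy's internal detection logic):

```python
import unicodedata

def script_of(name: str) -> str:
    """Guess the script from the Unicode name of the first character."""
    return unicodedata.name(name[0]).split()[0]

script_of("Rahul")    # -> "LATIN"
script_of("राहुल")     # -> "DEVANAGARI"
script_of("రాహుల్")    # -> "TELUGU"
```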

CLI Reference

transfuzzy

Starts the API server on port 3000 and opens the browser automatically.

transfuzzy serve

Run the API server explicitly.

transfuzzy serve --port 3000

Use --no-browser to skip opening the browser:

transfuzzy serve --port 3000 --no-browser

transfuzzy predict

Find similar names for a single input.

transfuzzy predict <name> [--top N] [--json] [--db PATH]

Arguments:

  • <name>: required input string.
  • --top: maximum number of matches to return. Default: 10.
  • --json: print a JSON object with similar_names.
  • --db: use a dataset file path directly instead of the active managed dataset.

transfuzzy db

Manage datasets stored in the TransFuzzy home directory.

Add a dataset:

transfuzzy db add .\names.txt

List managed datasets:

transfuzzy db list

Set the active dataset:

transfuzzy db use names.txt

Delete a managed dataset:

transfuzzy db delete names.txt
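The four subcommands above map onto a simple files-plus-config layout. Here is a minimal sketch of what such a managed-dataset store can look like; DatasetStore is illustrative and is NOT TransFuzzy's actual db_manager implementation:

```python
import json
import shutil
import tempfile
from pathlib import Path

class DatasetStore:
    """Illustrative managed-dataset store with the layout the db
    subcommands imply. Not TransFuzzy's real db_manager."""

    def __init__(self, home: Path):
        self.datasets = home / "datasets"
        self.config = home / "config.json"
        self.datasets.mkdir(parents=True, exist_ok=True)

    def add(self, path: Path) -> None:
        shutil.copy(path, self.datasets / path.name)            # db add

    def list(self) -> list[str]:
        return sorted(p.name for p in self.datasets.iterdir())  # db list

    def use(self, name: str) -> None:
        self.config.write_text(json.dumps({"active_db": name})) # db use

    def delete(self, name: str) -> None:
        (self.datasets / name).unlink()                         # db delete

# Demo against a throwaway directory:
home = Path(tempfile.mkdtemp())
(home / "names.txt").write_text("Rahul\n", encoding="utf-8")
store = DatasetStore(home)
store.add(home / "names.txt")
store.use("names.txt")
```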

API Reference

POST /similar_names

Request body:

{
  "name": "Rahul"
}

Success response:

{
  "similar_names": ["Rahul", "Raahul", "Rahool"]
}

Validation errors are returned as JSON with an error field and the appropriate HTTP status code.
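The endpoint can be called with a small stdlib client. The sketch below only builds and parses the documented request and response shapes; the URL assumes the default port from transfuzzy serve, and this is not a client bundled with the package:

```python
import json
from urllib import request

API_URL = "http://localhost:3000/similar_names"  # default serve port

def build_request(name: str) -> request.Request:
    """Construct the POST described above; send with urllib.request.urlopen."""
    body = json.dumps({"name": name}).encode("utf-8")
    return request.Request(API_URL, data=body,
                           headers={"Content-Type": "application/json"},
                           method="POST")

def parse_similar_names(raw: bytes) -> list[str]:
    """Pull similar_names out of a success response body."""
    return json.loads(raw)["similar_names"]

req = build_request("Rahul")
# matches = parse_similar_names(request.urlopen(req).read())  # needs a running server
```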

POST /upload_db

Uploads a dataset file using multipart/form-data with the field name file.

Success response shape:

{
  "message": "Dataset 'demo.txt' uploaded",
  "dataset_name": "demo.txt",
  "active_db": null
}

GET /list_dbs

Returns the stored managed datasets and the active dataset name.

{
  "datasets": ["demo.txt"],
  "active_db": "demo.txt"
}

POST /use_db

Request body:

{
  "name": "demo.txt"
}

DELETE /delete_db

Request body:

{
  "name": "demo.txt"
}

Dataset Management

There are two ways to provide names:

  1. Pass a file path directly with --db.
  2. Store datasets with transfuzzy db ... or the dataset API routes and switch the active dataset.

Managed datasets are stored under:

%USERPROFILE%\.transfuzzy\datasets

The active dataset name is stored in:

%USERPROFILE%\.transfuzzy\config.json

To override the base directory, set:

$env:TRANSFUZZY_HOME = "C:\path\to\custom-home"

Each dataset should contain one name per line.
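The resolution order and file format described above can be sketched as follows; transfuzzy_home here is illustrative and is not a function the package exports:

```python
import os
import tempfile
from pathlib import Path

def transfuzzy_home() -> Path:
    """Mirror the documented lookup: TRANSFUZZY_HOME if set, else ~/.transfuzzy."""
    return Path(os.environ.get("TRANSFUZZY_HOME", str(Path.home() / ".transfuzzy")))

# A dataset file is plain UTF-8 text, one name per line:
dataset = Path(tempfile.mkdtemp()) / "names.txt"
dataset.write_text("Rahul\nRaahul\nRahool\n", encoding="utf-8")
```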

How Matching Works

The current pipeline is:

Input name
-> transliteration to Latin when needed
-> candidate pair generation from the selected dataset
-> feature computation
   - Soundex ratio
   - Metaphone ratio
   - Levenshtein ratio
   - Jaro-Winkler similarity
   - Cosine similarity
   - Euclidean similarity
   - Manhattan similarity
   - Pearson similarity
-> trained model ranking
-> optional transliteration back to the input script
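To make the feature step concrete, here is a stdlib-only sketch of two of the listed features and a naive ranking. The real pipeline uses proper phonetic algorithms, sentence-transformer embeddings, and a trained model; averaging two crude proxies is only a stand-in for that:

```python
import math
from collections import Counter
from difflib import SequenceMatcher

def levenshtein_ratio(a: str, b: str) -> float:
    # difflib's ratio stands in for a true Levenshtein ratio here.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def bigrams(s: str) -> Counter:
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def cosine_similarity(a: str, b: str) -> float:
    # Character-bigram cosine similarity, a crude proxy for the
    # embedding-based cosine feature.
    va, vb = bigrams(a.lower()), bigrams(b.lower())
    dot = sum(va[g] * vb[g] for g in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def score(query: str, candidate: str) -> float:
    # TransFuzzy feeds its features to a trained model; a plain average
    # replaces that ranking step in this sketch.
    return (levenshtein_ratio(query, candidate)
            + cosine_similarity(query, candidate)) / 2

ranked = sorted(["Ramesh", "Rahul", "Raahul"],
                key=lambda c: score("Rahul", c), reverse=True)
# ranked -> ["Rahul", "Raahul", "Ramesh"]
```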

Project Structure

src/transfuzzy/
├── app.py              Flask app and HTTP routes
├── cli.py              CLI entrypoint
├── core/
│   ├── config.py       package constants and paths
│   ├── db_manager.py   managed dataset storage
│   └── pipeline.py     top-level matching pipeline
├── datasets/
│   └── default.txt     bundled dataset asset
├── db/                 packaged training/runtime artifacts
├── dir/                feature generation and training scripts
├── static/             browser-side assets
├── templates/          HTML templates
└── utils/              helper and response utilities

Development

Run the app locally:

uv run transfuzzy serve

Run tests:

uv run python -m unittest discover -s tests -v

Run training-related scripts:

uv run python src/transfuzzy/dir/enrich_data.py
uv run python src/transfuzzy/dir/train_model.py

More development notes are in docs/DEVELOPMENT.md.

Current Limitations

  • Import-time model loading makes commands and tests depend on model availability.
  • The package metadata says the project is a transliteration system, while the implementation is broader name matching.
  • The repository contains packaged model and dataset artifacts in src/transfuzzy/db, so development and release size are coupled to those files.

License

MIT. See LICENSE.
