
tometo_tomato

A fuzzy join utility using DuckDB

tometo_tomato is a Python CLI tool for performing fuzzy joins between two CSV files, even in the presence of typos, abbreviations, or different formatting. It leverages DuckDB and the rapidfuzz extension to associate similar records across different sources.

Features

  • Join between two CSV files based on textual similarity
  • Multi-column support (you can specify multiple column pairs or automatically use those with the same name)
  • Configurable similarity threshold
  • Separate output for clean and ambiguous matches
  • Execution statistics logging

Installation

Prerequisites

  • Python 3.8 or higher
  • pip or uv (recommended)

Install from source

  1. Clone the repository:
    git clone https://github.com/aborruso/tometo_tomato.git
    cd tometo_tomato
    
  2. Install the package.
    Using uv (recommended):
    uv pip install .
    # Or, for an isolated CLI tool:
    uv tool install .

    Using pip:
    pip install .


After installation, the tometo_tomato command will be available in your terminal.

Usage

Basic example

Suppose we have two files:

  • data/input.csv (input data with possible variations)
  • data/ref.csv (reference data)

These sample files contain names with various formatting issues:

  • Case variations (Mario Rossi vs MARIO ROSSI)
  • Extra whitespace ("Anna  Bianchi" vs "Anna Bianchi", with doubled or trailing spaces)
  • Accented characters (José Álvarez vs Jose Alvarez)
  • Different character encodings (Müller vs Muller)

If the columns to compare have the same name in both files:

tometo_tomato data/input.csv data/ref.csv --add-field city_code --threshold 90 --show-score --output-clean output.csv

If the columns have different names:

tometo_tomato input.csv ref.csv \
  --join-pair regione,regio \
  --join-pair comune,comu \
  --add-field codice_comune \
  --threshold 90 \
  --show-score \
  --output-clean output.csv

Automated Processing

For scripts and automated workflows, use --force to skip file overwrite confirmations:

tometo_tomato input.csv ref.csv \
  --join-pair regione,regio \
  --join-pair comune,comu \
  --add-field codice_comune \
  --threshold 90 \
  --show-score \
  --output-clean output.csv \
  --force

Handling Field Names with Spaces

If your column names in the CSV files contain spaces, you must enclose the arguments for --join-pair (-j) and --add-field (-a) in quotes. This ensures that the shell treats them as a single argument.

Example:

Suppose your files have columns named "Input City" and "Reference City", and you want to add a field named "Special Code".

tometo_tomato "input data.csv" "reference data.csv" \
  --join-pair "Input City,Reference City" \
  --add-field "Special Code" \
  --output-clean "clean output.csv"

Normalization Options

Remove Special Characters with --keep-alphanumeric

To ignore punctuation and special characters in join columns, use --keep-alphanumeric (or -k). This keeps only letters, numbers, and spaces before matching:

tometo_tomato data/input.csv data/ref.csv \
  --join-pair name,ref_name \
  --keep-alphanumeric \
  --add-field city_code \
  --threshold 85 \
  --output-clean output.csv

This will match names like:

  • John O'Connor with John OConnor
  • Mary-Jane Smith with Mary Jane Smith
  • Anna & Co. with Anna Co

You can also use the short version:

tometo_tomato data/input.csv data/ref.csv -j name,ref_name -k -o output.csv
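The effect of this filter can be sketched in a few lines of Python. This is an illustrative stand-in for what --keep-alphanumeric does before matching, not the tool's actual implementation:

```python
def keep_alphanumeric(text: str) -> str:
    """Keep only letters, digits, and spaces, then collapse leftover whitespace.

    Illustrative sketch of --keep-alphanumeric-style normalization;
    the tool's internal implementation may differ.
    """
    kept = "".join(ch for ch in text if ch.isalnum() or ch == " ")
    return " ".join(kept.split())  # collapse runs of spaces left by removals

print(keep_alphanumeric("John O'Connor"))  # John OConnor
print(keep_alphanumeric("Anna & Co."))     # Anna Co
```

Note that "Mary-Jane Smith" becomes "MaryJane Smith" rather than "Mary Jane Smith"; the fuzzy scorer still matches the two forms well above typical thresholds.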

Character Normalization with --latinize

For data with accented characters or different character encodings, use --latinize to normalize characters before matching:

tometo_tomato data/input.csv data/ref.csv \
  --join-pair name,ref_name \
  --latinize \
  --add-field city_code \
  --threshold 85 \
  --output-clean output.csv

This will match names like:

  • José Álvarez with Jose Alvarez
  • Müller with Muller
  • François with Francois

The original accented characters are preserved in the output.
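Latinization of this kind can be sketched with the standard library's unicodedata module: decompose each character (NFKD) and drop the combining marks. This is a sketch of --latinize-style normalization, not necessarily how the tool implements it; as noted, it affects only the matching step:

```python
import unicodedata

def latinize(text: str) -> str:
    """Strip diacritics: decompose (NFKD), then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(latinize("José Álvarez"))  # Jose Alvarez
print(latinize("Müller"))        # Muller
print(latinize("François"))      # Francois
```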

Case Sensitivity Control

By default, matching is case-insensitive. Use --raw-case for case-sensitive matching:

tometo_tomato data/input.csv data/ref.csv \
  --join-pair name,ref_name \
  --raw-case \
  --threshold 90 \
  --output-clean output.csv

Whitespace Handling

By default, whitespace is normalized (values are trimmed and runs of spaces collapsed to a single space). Use --raw-whitespace to disable this:

tometo_tomato data/input.csv data/ref.csv \
  --join-pair name,ref_name \
  --raw-whitespace \
  --threshold 90 \
  --output-clean output.csv
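Taken together, the default normalization described above (case-insensitive, whitespace trimmed and collapsed) amounts to something like the following sketch. It models the documented behavior rather than the tool's actual code:

```python
def normalize_default(text: str) -> str:
    """Default pre-match normalization (assumption based on the docs):
    trim, collapse internal runs of whitespace, and lowercase."""
    return " ".join(text.split()).lower()

print(normalize_default("  Anna   Bianchi "))  # anna bianchi
```

The --raw-case and --raw-whitespace flags each switch off the corresponding step.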

Main parameters

  • input.csv : CSV file to enrich/correct

  • ref.csv : Reference CSV file

  • --join-pair colA,colB : Pair of columns to compare (repeatable)

  • --add-field field : Field from the reference file to add to the output

  • --threshold N : Minimum similarity threshold (default: 90)

  • --show-score : Show average similarity score

  • --output-clean : Output file for clean matches (mandatory)

  • --output-ambiguous : Output file for ambiguous matches (optional)

  • --scorer ALGO : Fuzzy matching algorithm (ratio or token_set_ratio). Default: ratio.

  • --clean-whitespace : Remove redundant whitespace from columns before fuzzy matching

  • --force, -f : Overwrite existing output files without prompting

Logic and Behavior

  • Fuzzy comparison is case-insensitive by default (the tool applies LOWER() during matching). Use --raw-case to enable case-sensitive comparisons.
  • Matching uses the configured join pairs and returns the best match per input row when that match's average similarity (avg_score) meets or exceeds the configured --threshold.
  • Ambiguity definition and handling: an input row is considered "ambiguous" when two or more reference rows obtain the same maximum avg_score for that input (and that maximum score is >= --threshold). Ambiguous input rows are excluded from the "clean" output to avoid adding potentially incorrect data.
  • Use --output-ambiguous to save a separate CSV with the ambiguous candidate reference rows (the rows having the equal maximum score). If --output-ambiguous is not provided, the program will still detect ambiguous inputs and will print a warning advising how to inspect them.
  • The clean output file contains only inputs with a single best reference match (unique maximum score >= threshold). If multiple reference rows tie for the best score, that input will not appear in the clean output and will be reported as ambiguous.
  • You can add extra fields from the reference file using --add-field.
  • The --latinize and whitespace/case options control normalization before scoring (see the "Normalization Options" section above).
  • File Overwrite Protection: By default, the script will prompt for confirmation before overwriting existing output files. Use --force to bypass this protection for automated scripts.
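The clean/ambiguous/no-match decision described above can be sketched per input value as follows. Here difflib's SequenceMatcher stands in for rapidfuzz's scorer, and the function names are illustrative, not the tool's API:

```python
import difflib

def score(a: str, b: str) -> float:
    # 0-100 similarity; difflib stands in for rapidfuzz's ratio here
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100

def classify(value, references, threshold=90):
    """Return ('clean', match), ('ambiguous', tied_matches), or ('no_match', None)."""
    scored = [(ref, score(value, ref)) for ref in references]
    best = max(s for _, s in scored)
    if best < threshold:
        return "no_match", None
    winners = [ref for ref, s in scored if s == best]
    return ("clean", winners[0]) if len(winners) == 1 else ("ambiguous", winners)

print(classify("Mario Rossi", ["MARIO ROSSI", "Anna Bianchi"]))
# ('clean', 'MARIO ROSSI')
```

With two reference rows that tie at the maximum score (e.g. "MARIO ROSSI" and "mario rossi"), the same input is classified as ambiguous and excluded from the clean output.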

Output

  • A CSV file with clean matches (name and path always specified with --output-clean).
  • A CSV file with ambiguous matches (only if you specify --output-ambiguous).

Use case example

See the file docs/PRD.md for a detailed description and practical example.

Notes

  • The --scorer token_set_ratio is recommended for cases where names have different word counts (e.g., "Reggio Calabria" vs. "Reggio di Calabria").
  • If you don't specify --join-pair, all columns with the same name in both files will be used.
  • Use --clean-whitespace when your data contains inconsistent spacing (e.g., "Rome City" vs " Rome City ") to improve matching accuracy.
  • The tool is designed to be simple, robust, and easily integrable into data cleaning workflows.
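The difference between the two scorers can be illustrated with a rough pure-Python approximation. difflib stands in for rapidfuzz, and this token_set_ratio is a deliberate simplification of the real algorithm:

```python
import difflib

def ratio(a: str, b: str) -> float:
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100

def token_set_ratio(a: str, b: str) -> float:
    # simplified: also score the shared tokens against each full token string,
    # so a phrase that merely adds a word (e.g. "di") can still score 100
    ta, tb = set(a.lower().split()), set(b.lower().split())
    inter = " ".join(sorted(ta & tb))
    sa, sb = " ".join(sorted(ta)), " ".join(sorted(tb))
    return max(ratio(inter, sa), ratio(inter, sb), ratio(sa, sb))

print(ratio("Reggio Calabria", "Reggio di Calabria"))            # ~90.9
print(token_set_ratio("Reggio Calabria", "Reggio di Calabria"))  # 100.0
```

The plain ratio penalizes the extra word, while the token-set approach treats the shared tokens as a full match.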

Development

Running tests

Recommended: use a virtual environment (venv) to isolate dependencies:

python3 -m venv .venv_test_tometo
source .venv_test_tometo/bin/activate
pip install -e .
pip install pytest
pytest -v

Alternative: you can also run the tests without installing the package by setting PYTHONPATH:

# Run all tests
PYTHONPATH=$(pwd)/src pytest -v

# Run specific test files
PYTHONPATH=$(pwd)/src pytest tests/test_read_header.py -v

If you don't have pytest installed, add it with:

pip install pytest

For questions, suggestions, or bugs, open an issue on GitHub!
