Valentine: find relationships between columns of different tabular datasets

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

delftdata

These details have not been verified by PyPI

Project description

Valentine 💘

(Schema-) Matching DataFrames Made Easy

A Python package for capturing potential relationships among columns of different tabular datasets, given as pandas DataFrames.
Valentine is based on the paper Valentine: Evaluating Matching Techniques for Dataset Discovery.

You can find more information about the research supporting Valentine here.

Experimental suite version

The original experimental suite version of Valentine, as first published for the needs of the research paper, can be still found here.

Installation instructions

Requirements

Python >=3.10,<3.15
Java: Only required for the legacy Coma matcher (deprecated). The recommended ComaPy matcher is pure Python and has no Java dependency.

To install Valentine simply run:

pip install valentine

Usage

Valentine can be used to find matches among columns of a given pair of pandas DataFrames.

Matching methods

In order to do so, the user can choose one of the following matching methods:

ComaPy(int: max_n, bool: use_instances, bool: use_schema, float: delta, float: threshold) is a pure Python implementation of the COMA 3.0 schema matching algorithm. It will be renamed to Coma in v1.0.0.
- Parameters:
  - max_n(int) - Maximum number of matches to keep per column, 0 means unlimited (default: 0).
  - use_instances(bool) - Whether to use TF-IDF instance-based matching on data values (default: False).
  - use_schema(bool) - Whether to use schema-based matching on column names, paths, and structure (default: True).
  - delta(float) - Fraction from the best score within which matches are kept (default: 0.15).
  - threshold(float) - Absolute minimum similarity score to keep a match (default: 0.0).
Coma(int: max_n, bool: use_instances, str: java_xmx) (deprecated, will be removed in v1.0.0) is a Java wrapper around COMA 3.0 Community edition. Requires Java. Use ComaPy instead.
- Parameters:
  - max_n(int) - Maximum number of matches to keep per column, (default: 0).
  - use_instances(bool) - Whether Coma will make use of the data instances or just the schema information, (default: False).
  - java_xmx(str) - The amount of RAM that Coma is allowed to use, (default: "1024m") .
Cupid(float: w_struct, float: leaf_w_struct, float: th_accept) is the python implementation of the paper Generic Schema Matching with Cupid
- Parameters:
  - w_struct(float) - Structural similarity threshold, default is 0.2.
  - leaf_w_struct(float) - Structural similarity threshold, leaf level, default is 0.2.
  - th_accept(float) - Accept similarity threshold, default is 0.7.
DistributionBased(float: threshold1, float: threshold2) is the python implementation of the paper Automatic Discovery of Attributes in Relational Databases
- Parameters:
  - threshold1(float) - The threshold for phase 1 of the method, default is 0.15.
  - threshold2(float) - The threshold for phase 2 of the method, default is 0.15.
JaccardDistanceMatcher(float: threshold_dist) is a baseline method that uses Jaccard Similarity between columns to assess their correspondence score, optionally enhanced by a string similarity measure of choice.
- Parameters:
  - threshold_dist(float) - Acceptance threshold for assessing two strings as equal, default is 0.8.
  - distance_fun(StringDistanceFunction) - String similarity function used to assess whether two strings are equal. The enumeration class type StringDistanceFunction can be imported from valentine.algorithms.jaccard_distance. Functions currently supported are:
    - StringDistanceFunction.Levenshtein: Levenshtein distance
    - StringDistanceFunction.DamerauLevenshtein: Damerau-Levenshtein distance
    - StringDistanceFunction.Hamming: Hamming distance
    - StringDistanceFunction.Jaro: Jaro distance
    - StringDistanceFunction.JaroWinkler: Jaro-Winkler distance * StringDistanceFunction.Exact: String equality ==
SimilarityFlooding(str: coeff_policy, str: formula) is the python implementation of the paper Similarity Flooding: A Versatile Graph Matching Algorithmand its Application to Schema Matching
- Parameters:
  - coeff_policy(str) - Policy for deciding the weight coefficients of the propagation graph. Choice of "inverse_product" or "inverse_average" (default).
  - formula(str) - Formula on which iterative fixpoint computation is based. Choice of "basic", "formula_a", "formula_b" and "formula_c" (default).

Matching DataFrame Pair

After selecting one of the matching methods, the user can initiate the pairwise matching process in the following way:

matches = valentine_match(df1, df2, matcher, df1_name, df2_name)

where df1 and df2 are the two pandas DataFrames for which we want to find matches and matcher is one of ComaPy, Cupid, DistributionBased, JaccardLevenMatcher or SimilarityFlooding. The user can also input a name for each DataFrame (defaults are "table_1" and "table_2"). Function valentine_match returns a MatcherResults object, which is a dictionary with additional convenience methods, such as one_to_one, take_top_percent, get_metrics and more. It stores as keys column pairs from the two DataFrames and as values the corresponding similarity scores.

Matching DataFrame Batch

After selecting one of the matching methods, the user can initiate the batch matching process in the following way:

matches = valentine_match_batch(df_iter_1, df_iter_2, matcher, df_iter_1_names, df_iter_2_names)

where df_iter_1 and df_iter_2 are the two iterable structures containing pandas DataFrames for which we want to find matches and matcher is one of ComaPy, Cupid, DistributionBased, JaccardLevenMatcher or SimilarityFlooding. The user can also input an iterable with names for each DataFrame. Function valentine_match_batch returns a MatcherResults object, which is a dictionary with additional convenience methods, such as one_to_one, take_top_percent, get_metrics and more. It stores as keys column pairs from the two DataFrames and as values the corresponding similarity scores.

MatcherResults instance

The MatcherResults instance has some convenience methods that the user can use to either obtain a subset of the data or to transform the data. This instance is a dictionary and is sorted upon instantiation, from high similarity to low similarity.

top_n_matches = matches.take_top_n(5)

top_n_percent_matches = matches.take_top_percent(25)

one_to_one_matches = matches.one_to_one()

Measuring effectiveness

The MatcherResults instance that is returned by valentine_match or valentine_match_batch also has a get_metrics method that the user can use

metrics = matches.get_metrics(ground_truth)

in order to get all effectiveness metrics, such as Precision, Recall, F1-score and others as described in the original Valentine paper. In order to do so, the user needs to also input the ground truth of matches based on which the metrics will be calculated. The ground truth can be given as a list of tuples representing column matches that should hold (see example below).

By default, all the core metrics will be used for this with default parameters, but the user can also customize which metrics to run with what parameters, and implement own custom metrics by extending from the Metric base class. Some sets of metrics are available as well.

from valentine.metrics import F1Score, PrecisionTopNPercent, METRICS_PRECISION_INCREASING_N
metrics_custom = matches.get_metrics(ground_truth, metrics={F1Score(one_to_one=False), PrecisionTopNPercent(n=70)})
metrics_prefefined_set = matches.get_metrics(ground_truth, metrics=METRICS_PRECISION_INCREASING_N)

Example

The following block of code shows: 1) how to run a matcher from Valentine on two DataFrames storing information about authors and their publications, and then 2) how to assess its effectiveness based on a given ground truth (a more extensive example is shown in valentine_example.py):

import os
import pandas as pd
from valentine import valentine_match
from valentine.algorithms import ComaPy

# Load data using pandas
d1_path = os.path.join('data', 'authors1.csv')
d2_path = os.path.join('data', 'authors2.csv')
df1 = pd.read_csv(d1_path)
df2 = pd.read_csv(d2_path)

# Instantiate matcher and run
matcher = ComaPy(use_instances=True)
matches = valentine_match(df1, df2, matcher)

print(matches)

# If ground truth available valentine could calculate the metrics
ground_truth = [('Cited by', 'Cited by'),
                ('Authors', 'Authors'),
                ('EID', 'EID')]

metrics = matches.get_metrics(ground_truth)
    
print(metrics)

The output of the above code block is:

{
     (('table_1', 'Cited by'), ('table_2', 'Cited by')): 0.86994505, 
     (('table_1', 'Authors'), ('table_2', 'Authors')): 0.8679843, 
     (('table_1', 'EID'), ('table_2', 'EID')): 0.8571245
}
{
     'Recall': 1.0, 
     'F1Score': 1.0, 
     'RecallAtSizeofGroundTruth': 1.0, 
     'Precision': 1.0, 
     'PrecisionTop10Percent': 1.0
}

Cite Valentine

Original Valentine paper:
@inproceedings{koutras2021valentine,
  title={Valentine: Evaluating Matching Techniques for Dataset Discovery},
  author={Koutras, Christos and Siachamis, George and Ionescu, Andra and Psarakis, Kyriakos and Brons, Jerry and Fragkoulis, Marios and Lofi, Christoph and Bonifati, Angela and Katsifodimos, Asterios},
  booktitle={2021 IEEE 37th International Conference on Data Engineering (ICDE)},
  pages={468--479},
  year={2021},
  organization={IEEE}
}
Demo Paper:
@article{koutras2021demo,
  title={Valentine in Action: Matching Tabular Data at Scale},
  author={Koutras, Christos and Psarakis, Kyriakos and Siachamis, George and Ionescu, Andra and Fragkoulis, Marios and Bonifati, Angela and Katsifodimos, Asterios},
  journal={VLDB},
  volume={14},
  number={12},
  pages={2871--2874},
  year={2021},
  publisher={VLDB Endowment}
}

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

delftdata

These details have not been verified by PyPI

Release history Release notifications | RSS feed

1.0.0

May 14, 2026

This version

0.5.0

Apr 5, 2026

0.4.1

Oct 14, 2025

0.3.0

Nov 28, 2024

0.2.1

Aug 22, 2024

0.2.0

Feb 14, 2024

0.1.9

Nov 9, 2023

0.1.8

Oct 13, 2023

0.1.7

May 18, 2023

0.1.6

Apr 11, 2023

0.1.5

Oct 25, 2022

0.1.4

Nov 16, 2021

0.1.3

Oct 19, 2021

0.1.2

Oct 18, 2021

0.1.1

Oct 18, 2021

0.1.0

Oct 5, 2021

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

valentine-0.5.0.tar.gz (38.3 MB view details)

Uploaded Apr 5, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

valentine-0.5.0-py3-none-any.whl (38.3 MB view details)

Uploaded Apr 5, 2026 Python 3

File details

Details for the file valentine-0.5.0.tar.gz.

File metadata

Download URL: valentine-0.5.0.tar.gz
Upload date: Apr 5, 2026
Size: 38.3 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for valentine-0.5.0.tar.gz
Algorithm	Hash digest
SHA256	`37eabea582e0e3aee50bce64ee274d0f2f709682b5063d727aa4b2f6d8ee0450`
MD5	`b33e0690be2d10ab00b6d08114bb3653`
BLAKE2b-256	`21b03bbd77db15e9b1906051b0d3439c06db0d259ce731216ae83760e6106433`

See more details on using hashes here.

Provenance

The following attestation bundles were made for valentine-0.5.0.tar.gz:

Publisher: ci-build-test-publish.yml on delftdata/valentine

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: valentine-0.5.0.tar.gz
- Subject digest: 37eabea582e0e3aee50bce64ee274d0f2f709682b5063d727aa4b2f6d8ee0450
- Sigstore transparency entry: 1238718953
- Sigstore integration time: Apr 5, 2026
Source repository:
- Permalink: delftdata/valentine@aa9d9c8ce7aeb258aea6da0b546d1f935f608cab
- Branch / Tag: refs/tags/v0.5.0
- Owner: https://github.com/delftdata
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci-build-test-publish.yml@aa9d9c8ce7aeb258aea6da0b546d1f935f608cab
- Trigger Event: release

File details

Details for the file valentine-0.5.0-py3-none-any.whl.

File metadata

Download URL: valentine-0.5.0-py3-none-any.whl
Upload date: Apr 5, 2026
Size: 38.3 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for valentine-0.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a884d8729a3191d43fe3504f788549764b44a9547a7fe6d71ee0a42622b51fd7`
MD5	`c6f12be32f124e02e0fb99e57e9dc109`
BLAKE2b-256	`a2b25564ff131b9896d8ea2f21693bd13aabc6b96581e1b27c5e411ec827c40f`

See more details on using hashes here.

Provenance

The following attestation bundles were made for valentine-0.5.0-py3-none-any.whl:

Publisher: ci-build-test-publish.yml on delftdata/valentine

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: valentine-0.5.0-py3-none-any.whl
- Subject digest: a884d8729a3191d43fe3504f788549764b44a9547a7fe6d71ee0a42622b51fd7
- Sigstore transparency entry: 1238718955
- Sigstore integration time: Apr 5, 2026
Source repository:
- Permalink: delftdata/valentine@aa9d9c8ce7aeb258aea6da0b546d1f935f608cab
- Branch / Tag: refs/tags/v0.5.0
- Owner: https://github.com/delftdata
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci-build-test-publish.yml@aa9d9c8ce7aeb258aea6da0b546d1f935f608cab
- Trigger Event: release

valentine 0.5.0

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

Valentine 💘

Experimental suite version

Installation instructions

Requirements

Usage

Matching methods

Matching DataFrame Pair

Matching DataFrame Batch

MatcherResults instance

Measuring effectiveness

Example

Cite Valentine

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance