Valentine Matcher

These details have not been verified by PyPI

Project description

Valentine: Matching DataFrames Easily

A Python package for capturing potential relationships among columns of different tabular datasets, which are given in the form of pandas DataFrames. Valentine is based on the paper "Valentine: Evaluating Matching Techniques for Dataset Discovery".

Installation instructions

To install Valentine, simply run:

pip install valentine

Installation requirements

Python 3.7 or later

Usage

Valentine can be used to find matches among columns of a given pair of pandas DataFrames.

Matching methods

In order to do so, the user can choose one of the following 5 matching methods:

Coma(int: max_n, str: strategy) is a Python wrapper around COMA 3.0 Community edition
- Parameters:
  - max_n(int) - Accept similarity threshold, default is 0.
  - strategy(str) - Choice of "COMA_OPT" (schema based matching - default) or "COMA_OPT_INST" (schema and instance based matching)
Cupid(float: w_struct, float: leaf_w_struct, float: th_accept) is the Python implementation of the paper Generic Schema Matching with Cupid
- Parameters:
  - w_struct(float) - Structural similarity threshold, default is 0.2.
  - leaf_w_struct(float) - Structural similarity threshold, leaf level, default is 0.2.
  - th_accept(float) - Accept similarity threshold, default is 0.7.
DistributionBased(float: threshold1, float: threshold2) is the Python implementation of the paper Automatic Discovery of Attributes in Relational Databases
- Parameters:
  - threshold1(float) - The threshold for phase 1 of the method, default is 0.15.
  - threshold2(float) - The threshold for phase 2 of the method, default is 0.15.
JaccardLevenMatcher(float: threshold_leven) is a baseline method that uses Jaccard Similarity between columns to assess their correspondence score, enhanced by Levenshtein Distance
- Parameters:
  - threshold_leven(float) - Levenshtein ratio threshold for deciding whether two instances are the same or not, default is 0.8.
SimilarityFlooding(str: coeff_policy, str: formula) is the Python implementation of the paper Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching
- Parameters:
  - coeff_policy(str) - Policy for deciding the weight coefficients of the propagation graph. Choice of "inverse_product" or "inverse_average" (default).
  - formula(str) - Formula on which iterative fixpoint computation is based. Choice of "basic", "formula_a", "formula_b" and "formula_c" (default).
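
To make the idea behind the JaccardLevenMatcher baseline more concrete, here is a minimal, self-contained sketch of a Jaccard similarity in which two values count as the same once their edit-distance-based similarity reaches a threshold. It is not Valentine's implementation: the function names (`levenshtein_distance`, `similarity_ratio`, `fuzzy_jaccard`) are hypothetical, and the normalisation used for the ratio is an assumption that may differ from the one the library uses.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_ratio(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, lower as more edits are needed."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def fuzzy_jaccard(col_a, col_b, threshold_leven=0.8):
    """Jaccard-style score where values match if their ratio reaches the threshold."""
    set_a = {str(v) for v in col_a}
    set_b = {str(v) for v in col_b}
    if not set_a and not set_b:
        return 0.0
    # Count the distinct values of column A that have a close-enough value in column B.
    matched = sum(
        1 for a in set_a
        if any(similarity_ratio(a, b) >= threshold_leven for b in set_b)
    )
    union = len(set_a) + len(set_b) - matched
    return matched / union


cities_1 = ["Amsterdam", "Rotterdam", "Delft"]
cities_2 = ["Amsterdam", "Roterdam", "Utrecht"]  # "Roterdam" is a deliberate typo
score_fuzzy = fuzzy_jaccard(cities_1, cities_2, threshold_leven=0.8)
score_exact = fuzzy_jaccard(cities_1, cities_2, threshold_leven=1.0)
print(score_fuzzy, score_exact)
```

With the lower threshold the misspelled value still counts as a match, so the score is higher than with exact matching. This is the effect the threshold_leven parameter is described as controlling.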

Matching DataFrames

After selecting one of the 5 matching methods, the user can initiate the matching process in the following way:

matches = valentine_match(df1, df2, matcher, df1_name, df2_name)

where df1 and df2 are the two pandas DataFrames for which we want to find matches and matcher is one of Coma, Cupid, DistributionBased, JaccardLevenMatcher or SimilarityFlooding. The user can also input a name for each DataFrame (defaults are "table_1" and "table_2"). The function valentine_match returns a dictionary whose keys are column pairs from the two DataFrames and whose values are the corresponding similarity scores.
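
As a rough illustration of working with such a dictionary, the sketch below filters and ranks matches by score. It assumes each key has the form ((table_name, column_name), (table_name, column_name)). The `matches` dictionary is a hand-written stand-in rather than real output, and `top_matches` is a hypothetical helper, not part of Valentine.

```python
# Hand-written stand-in for a dictionary returned by valentine_match.
# Assumed key layout: ((table_name, column_name), (table_name, column_name)).
matches = {
    (("table_1", "Cited by"), ("table_2", "Cited by")): 0.8374313,
    (("table_1", "Authors"), ("table_2", "Authors")): 0.83498037,
    (("table_1", "EID"), ("table_2", "EID")): 0.8214057,
}


def top_matches(matches, min_score=0.0):
    """Return (column_1, column_2, score) triples with score >= min_score, best first."""
    rows = [
        (left[1], right[1], score)
        for (left, right), score in matches.items()
        if score >= min_score
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


ranked = top_matches(matches, min_score=0.83)
for column_1, column_2, score in ranked:
    print(f"{column_1} <-> {column_2}: {score:.3f}")
```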

Measuring effectiveness

Based on the matches retrieved by calling valentine_match the user can use

metrics = valentine_metrics.all_metrics(matches, ground_truth)

in order to get all effectiveness metrics, such as Precision, Recall, F1-score and others as described in the original Valentine paper. In order to do so, the user needs to also input the ground truth of matches based on which the metrics will be calculated. The ground truth can be given as a list of tuples representing column matches that should hold.
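
To show how precision, recall and F1-score relate matches to a ground truth, here is a minimal sketch that computes them by hand. It is not the code behind valentine_metrics.all_metrics: `basic_metrics` is a hypothetical helper, every returned match is treated as a prediction regardless of its score, and the example data (including the "Title" and "Source title" columns) is made up for illustration.

```python
def basic_metrics(matches, ground_truth):
    """Precision, recall and F1 over column-name pairs (illustrative only)."""
    # Drop the table names and keep only (column_1, column_2) pairs.
    predicted = {(left[1], right[1]) for (left, right) in matches}
    expected = set(ground_truth)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        f1_score = 0.0
    else:
        f1_score = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1_score": f1_score}


# Made-up matches: two correct pairs and one incorrect pair.
example_matches = {
    (("table_1", "Authors"), ("table_2", "Authors")): 0.9,
    (("table_1", "EID"), ("table_2", "EID")): 0.8,
    (("table_1", "Title"), ("table_2", "Source title")): 0.6,
}
example_ground_truth = [("Cited by", "Cited by"),
                        ("Authors", "Authors"),
                        ("EID", "EID")]

example_metrics = basic_metrics(example_matches, example_ground_truth)
print(example_metrics)
```

Here two of the three predicted pairs are correct and two of the three expected pairs are found, so precision, recall and F1-score all come out at 2/3.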

Example

The following block of code shows: 1) how to run a matcher from Valentine on two DataFrames storing information about authors and their publications, and then 2) how to assess its effectiveness based on a given ground truth (as found in valentine_example.py):

import os
import pandas as pd
from valentine import valentine_match, valentine_metrics
from valentine.algorithms import Coma

# Load data using pandas
d1_path = os.path.join('data', 'authors1.csv')
d2_path = os.path.join('data', 'authors2.csv')
df1 = pd.read_csv(d1_path)
df2 = pd.read_csv(d2_path)

# Instantiate matcher and run
matcher = Coma(strategy="COMA_OPT")
matches = valentine_match(df1, df2, matcher)

print(matches)

# If a ground truth is available, Valentine can calculate the metrics
ground_truth = [('Cited by', 'Cited by'),
                ('Authors', 'Authors'),
                ('EID', 'EID')]

metrics = valentine_metrics.all_metrics(matches, ground_truth)

print(metrics)

The output of the above code block is:

{(('table_1', 'Cited by'), ('table_2', 'Cited by')): 0.8374313, 
(('table_1', 'Authors'), ('table_2', 'Authors')): 0.83498037, 
(('table_1', 'EID'), ('table_2', 'EID')): 0.8214057}
{'precision': 1.0, 'recall': 1.0, 'f1_score': 1.0, 
'precision_at_10_percent': 1.0, 
'precision_at_30_percent': 1.0,
'precision_at_50_percent': 1.0, 
'precision_at_70_percent': 1.0, 
'precision_at_90_percent': 1.0, 
'recall_at_sizeof_ground_truth': 1.0}

Experimental suite version

The original experimental suite version of Valentine, as first published for the needs of the research paper, can still be found here.

Project page

The project page containing information about the research supporting Valentine can be accessed here.

Cite Valentine

Original Valentine paper:
@inproceedings{koutras2021valentine,
  title={Valentine: Evaluating Matching Techniques for Dataset Discovery},
  author={Koutras, Christos and Siachamis, George and Ionescu, Andra and Psarakis, Kyriakos and Brons, Jerry and Fragkoulis, Marios and Lofi, Christoph and Bonifati, Angela and Katsifodimos, Asterios},
  booktitle={2021 IEEE 37th International Conference on Data Engineering (ICDE)},
  pages={468--479},
  year={2021},
  organization={IEEE}
}
Demo Paper:
@article{koutras2021demo,
  title={Valentine in Action: Matching Tabular Data at Scale},
  author={Koutras, Christos and Psarakis, Kyriakos and Ionescu, Andra and Fragkoulis, Marios and Bonifati, Angela and Katsifodimos, Asterios},
  journal={VLDB},
  volume={14},
  number={12},
  pages={2871--2874},
  year={2021},
  publisher={VLDB Endowment}
}

Project details

Release history

0.2.1

Aug 22, 2024

0.2.0

Feb 14, 2024

0.1.9

Nov 9, 2023

0.1.8

Oct 13, 2023

0.1.7

May 18, 2023

0.1.6

Apr 11, 2023

0.1.5

Oct 25, 2022

0.1.4

Nov 16, 2021

0.1.3

Oct 19, 2021

0.1.2

Oct 18, 2021

0.1.1

Oct 18, 2021

This version

0.1.0

Oct 5, 2021

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

valentine-0.1.0.tar.gz (38.2 MB)

Uploaded Oct 5, 2021 Source

Hashes for valentine-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`4358713ff85305e5b661bf59ec1f9a3c8c1327beb0022bb5d42bbfa4e5867b10`
MD5	`a54307086e42dbc719e816e092d1a001`
BLAKE2b-256	`2ff407f0d04bb15b5a5e21bc5ff8131c410b5e51e23b3326502bf1a1df729609`