Auto-Program Fuzzy Similarity Joins Without Labeled Examples

AutoFJ

The official code for our SIGMOD 2021 paper: Auto-FuzzyJoin: Auto-Program Fuzzy Similarity Joins Without Labeled Examples. To reproduce the main results in our paper, switch to the reproduce branch.

AutoFJ automatically produces record pairs that approximately match in two input tables without requiring explicit human input such as labeled training data. Users only need to provide two input tables and a desired precision target (say 0.9). AutoFJ leverages the fact that one of the inputs is a reference table to automatically program fuzzy joins that meet the precision target in expectation, while maximizing fuzzy-join recall (defined as the number of correctly joined records).

In AutoFJ, the left table is the reference table, which is assumed to be almost "duplicate-free". AutoFJ attempts to solve many-to-one join problems, where each record in the right table is joined with at most one record in the left table, but each record in the left table can be joined with multiple records in the right table.
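
The many-to-one behavior described above can be illustrated with a minimal sketch. It does not use AutoFJ's algorithm: the distance function (based on Python's difflib), the fixed threshold, and the helper name many_to_one_join are assumptions chosen only to show the shape of the output.

```python
from difflib import SequenceMatcher


def distance(a: str, b: str) -> float:
    """Toy distance in [0, 1]: 1 minus difflib's similarity ratio."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def many_to_one_join(left, right, threshold):
    """Join each right record to at most one left (reference) record.

    left, right: dicts mapping record id -> string value.
    Returns a dict right_id -> left_id for pairs within the threshold.
    """
    matches = {}
    for r_id, r_val in right.items():
        # Pick the closest reference record for this right record.
        best_l = min(left, key=lambda l_id: distance(left[l_id], r_val))
        if distance(left[best_l], r_val) <= threshold:
            matches[r_id] = best_l
    return matches


left = {1: "Wimbledon Championships", 2: "French Open"}
right = {
    10: "wimbledon championship",
    11: "The Wimbledon Championships",
    12: "Australian Open 2019",
}
# threshold=0.2 is an arbitrary value for this toy distance.
matches = many_to_one_join(left, right, threshold=0.2)
print(matches)
```

Right records 10 and 11 both map to left record 1, while record 12 stays unmatched: one left record may receive several right records, but no right record is joined more than once.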

AutoFJ also provides a benchmark that contains 50 diverse datasets for single-column fuzzy-join tasks constructed from DBPedia.

Installation

Install the package using pip

pip install autofj

Usage

Let left_table be the reference table and right_table be the other input table. The two tables are assumed to have the same schema and to share an id column, whose name is passed as id_column. To join left_table and right_table with a precision target of 0.9, run the following code. The result is a table of tuple pairs from the two input tables that are identified as matches.

from autofj import AutoFJ
fj = AutoFJ(precision_target=0.9)
result = fj.join(left_table, right_table, id_column)

To load a benchmark dataset, run the following code. Each dataset contains a left table (reference table), a right table and a ground-truth table of matched tuple pairs. The id column of each dataset is named "id" and the column to be joined is named "title". The names of datasets are listed here.

from autofj.datasets import load_data
left_table, right_table, gt_table = load_data(dataset_name)

Example

Run the following code to join the left and right tables of the TennisTournament dataset.

from autofj.datasets import load_data
from autofj import AutoFJ
left_table, right_table, gt_table = load_data("TennisTournament")
fj = AutoFJ(precision_target=0.9)
result = fj.join(left_table, right_table, "id")
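
Since each benchmark dataset ships with a ground-truth table, the joined result can be scored against it. The sketch below is one way to do that on plain (left_id, right_id) pairs; the helper name evaluate and the toy pairs are assumptions. With real DataFrames, the pairs could be built from the "id_l" and "id_r" columns of the result (following the "_l" and "_r" suffix rule); the column names of gt_table are not stated here, so check them before use.

```python
def evaluate(predicted_pairs, true_pairs):
    """Score predicted (left_id, right_id) pairs against ground truth.

    Returns (precision, correct), where correct is the number of
    correctly joined pairs. Precision is reported as 0.0 when nothing
    was predicted (a convention chosen for this sketch).
    """
    predicted = set(predicted_pairs)
    truth = set(true_pairs)
    correct = len(predicted & truth)
    precision = correct / len(predicted) if predicted else 0.0
    return precision, correct


# Toy pairs for illustration; not taken from any benchmark dataset.
predicted = [(1, 10), (1, 11), (2, 12)]
truth = [(1, 10), (1, 11), (3, 13)]

precision, correct = evaluate(predicted, truth)
print(f"precision={precision:.2f}, correctly joined={correct}")
```

Here two of the three predicted pairs are in the ground truth, so the precision is 2/3 and the number of correctly joined pairs is 2.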

Documentation

class AutoFJ(object):
    def __init__(self,
                 precision_target=0.9,
                 join_function_space="autofj_sm",
                 distance_threshold_space=50,
                 column_weight_space=10,
                 blocker=None,
                 n_jobs=-1,
                 verbose=False):

Parameters

  • precision_target: float, default=0.9
    Precision target, a value between 0 and 1.

  • join_function_space: string, dict or list of objects, default="autofj_sm"
    Space of join functions. There are three ways to define the space of join functions:

    1. Use the name (string) of a built-in join function space. There are three options, "autofj_lg", "autofj_md" and "autofj_sm", which use 136, 68 and 14 join functions, respectively. Using fewer join functions can improve efficiency but may worsen performance.
    2. Use a dict specifying the options for preprocessing methods, tokenization methods, token weighting methods and distance functions. The space will be the Cartesian product of all options in the dict. See options.py for defining join functions using a dict.
    3. Use a list of customized JoinFunction objects. Define a JoinFunction class following the prototype in join_funtion.py.
  • distance_threshold_space: int or list of floats, default=50
    The number of candidate distance thresholds or a list of candidate distance thresholds in the space. If the number of distance thresholds (integer) is given, distance thresholds are spaced evenly from 0 to 1. Otherwise, it should be a list of floats from 0 to 1. Using fewer candidates can improve efficiency but may worsen performance.

  • column_weight_space: int or list of floats, default=10
    The number of candidate column weights or a list of candidate column weights in the space. If the number of column weights (integer) is given, column weights are spaced evenly from 0 to 1. Otherwise, it should be a list of floats from 0 to 1. Using fewer candidates can improve efficiency but may worsen performance.

  • blocker: None or a Blocker object, default None
    A Blocker object that performs blocking on two tables. If None, the built-in blocker is used. To use a customized blocker, define a Blocker class following the prototype in blocker.py.

  • n_jobs: int, default=-1
    Number of CPU cores used. -1 means using all processors.

  • verbose: bool, default=False
    Whether to print logging output.
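
To give a feel for how the search space grows with these parameters, the sketch below builds a join function space as a Cartesian product of options and pairs it with evenly spaced thresholds. The option names and values are hypothetical (the real ones are defined in options.py), and including both endpoints 0 and 1 in the spacing is an assumption, since the exact spacing rule is not stated above.

```python
from itertools import product

# Hypothetical option names and values, for illustration only;
# the real keys and choices are defined in options.py.
options = {
    "preprocess": ["lower", "lower_stem"],
    "tokenize": ["3gram", "space"],
    "token_weight": ["uniform", "idf"],
    "distance": ["jaccard", "cosine"],
}

# One join function per combination of options (Cartesian product).
join_functions = [
    dict(zip(options, combo)) for combo in product(*options.values())
]


def evenly_spaced(n):
    """n values spread evenly over [0, 1], endpoints included (an assumption)."""
    if n < 2:
        raise ValueError("need at least two values to span [0, 1]")
    return [i / (n - 1) for i in range(n)]


thresholds = evenly_spaced(5)
search_space_size = len(join_functions) * len(thresholds)
print(len(join_functions), thresholds, search_space_size)
```

With two choices for each of four options and five thresholds, there are already 16 join functions and 80 (join function, threshold) candidates, which is why smaller spaces run faster.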

Attributes

  • selected_column_weights: dict
    The columns and column weights selected by the algorithm. The key is the column name, the value is the weight selected for the column.

  • selected_join_configs: list of tuples
    The union of join configurations selected by the algorithm. Each tuple (join_function, threshold) in the list is a join configuration that consists of the name of the join function and its distance threshold.
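
One way to read "the union of join configurations" is that a pair counts as a match if at least one selected (join_function, threshold) configuration accepts it. The sketch below shows that reading with two toy distance functions; their names and the thresholds are hypothetical stand-ins, not AutoFJ's built-in join functions or its actual matching logic.

```python
def jaccard_distance(set_a, set_b):
    """Jaccard distance between two sets; 0.0 if both are empty."""
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def token_jaccard(a, b):
    """Distance over lowercase whitespace-separated tokens."""
    return jaccard_distance(set(a.lower().split()), set(b.lower().split()))


def trigrams(s):
    """Set of lowercase character 3-grams (the whole string if shorter)."""
    s = s.lower()
    return {s[i:i + 3] for i in range(max(len(s) - 2, 1))}


def trigram_jaccard(a, b):
    """Distance over character 3-grams."""
    return jaccard_distance(trigrams(a), trigrams(b))


# Hypothetical registry and selected configurations.
DISTANCES = {"token_jaccard": token_jaccard, "trigram_jaccard": trigram_jaccard}
configs = [("token_jaccard", 0.4), ("trigram_jaccard", 0.3)]


def is_match(left_value, right_value, configs):
    """True if any configuration's distance is within its threshold."""
    return any(
        DISTANCES[name](left_value, right_value) <= threshold
        for name, threshold in configs
    )


print(is_match("French Open", "french open tennis", configs))  # token-level match
print(is_match("Wimbledon", "Wimbledom", configs))             # trigram-level match
print(is_match("French Open", "US Open", configs))             # no configuration accepts
```

The first pair is accepted by the token-based configuration and the second by the trigram-based one, so a union can recover matches that a single configuration would miss.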

Methods

join(left_table, right_table, id_column, on=None)   #Join left table and right table

Parameters

  • left_table: pd.DataFrame
    Reference table. The left table is assumed to be almost duplicate-free, which means it has no or only a few duplicates.

  • right_table: pd.DataFrame
    The other input table, whose records are matched against the reference table.

  • id_column: string
    The name of the id column in the two tables. This column will not be used to join the two tables.

  • on: list or None, default=None
    A list of column names (multi-column fuzzy join) that the two tables will be joined on. If None, two tables will be joined on all columns that exist in both tables, excluding the id column.
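
The default behavior of on described above can be sketched as a small helper. The function name default_join_columns is hypothetical, and the sketch only mirrors the rule as documented here (shared columns, id column excluded); it is not taken from the library's source.

```python
def default_join_columns(left_columns, right_columns, id_column, on=None):
    """Columns to join on: `on` if given, else shared non-id columns."""
    if on is not None:
        return list(on)
    right = set(right_columns)
    # Keep the left table's column order for a deterministic result.
    return [c for c in left_columns if c in right and c != id_column]


left_columns = ["id", "title", "year"]
right_columns = ["id", "title", "city"]

join_columns = default_join_columns(left_columns, right_columns, "id")
print(join_columns)
```

In this toy schema only "title" appears in both tables besides the id column, so it is the single column joined on when on is None.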

Return

  • pd.DataFrame
    A table of joining pairs. The columns of left table are suffixed with "_l" and the columns of right table are suffixed with "_r".
