
Record linkage - simple, flexible, efficient.


MatChain: Simple, Flexible, Efficient

MatChain is an experimental package designed for record linkage. Record linkage is the process of matching records that correspond to the same real-world entity in two or more datasets. This process typically includes several steps, such as blocking and the final matching decision, with a wide range of methods available, including probabilistic, rule-based, and machine learning approaches.

MatChain was created with three core objectives in mind: simplicity, flexibility, and efficiency. It focuses on unsupervised approaches to minimize manual efforts, allows for customization of matching steps, and offers fast and resource-efficient implementations.

MatChain leverages libraries like Pandas, NumPy, and SciPy for vectorized data handling, advanced indexing, and support for sparse matrices. It also utilizes scikit-learn and SentenceTransformers to convert strings into sparse vectors and dense vectors, respectively. Blocking can then be performed as an approximate nearest neighbour search over the resulting vectors, using libraries like NMSLIB and Faiss.
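
To illustrate this combination (independently of MatChain's own code), the sketch below embeds a handful of strings with SentenceTransformers and searches for nearest neighbours with Faiss. The model name, the use of an exact inner-product index, and the value of k are arbitrary choices for the example.

# Minimal sketch of dense-vector nearest neighbour search, not MatChain's implementation.
# The model name and k are arbitrary example choices.
import faiss
from sentence_transformers import SentenceTransformer

titles_a = ['Efficient record linkage', 'A survey of entity resolution']
titles_b = ['Record linkage made efficient', 'Entity resolution: a survey']

model = SentenceTransformer('all-MiniLM-L6-v2')
emb_a = model.encode(titles_a, convert_to_numpy=True).astype('float32')
emb_b = model.encode(titles_b, convert_to_numpy=True).astype('float32')

# Normalise so that inner product equals cosine similarity.
faiss.normalize_L2(emb_a)
faiss.normalize_L2(emb_b)

index = faiss.IndexFlatIP(emb_b.shape[1])    # exact index; Faiss also offers approximate ones
index.add(emb_b)
scores, neighbours = index.search(emb_a, 1)  # closest record in B for each record in A
print(neighbours, scores)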

The currently published version of MatChain provides AutoCal as its only matching algorithm. AutoCal is an unsupervised method that was originally designed for instance matching with Ontomatch in the context of The World Avatar. MatChain's implementation is highly efficient and allows AutoCal to be combined with various procedures for blocking and for computing similarity scores.

Installation

MatChain requires Python 3.8 or higher and can be installed with pip:

pip install matchain

However, this only installs PyTorch's CPU version. If you want to use the GPU version, you need to install it separately:

pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2+cu118 --extra-index-url https://download.pytorch.org/whl/cu118

Basic Example Using the API

In this example, we demonstrate how to match two datasets, denoted as A and B, based on columns with the same names: "year," "title," "authors," and "venue." You can run this example in the accompanying notebook run_matchain_api.ipynb, which provides a detailed explanation of MatChain's API, including how to specify parameters.

First, we read the data into Pandas dataframes and use them to initialize an instance of the class MatChain.

import pandas as pd
import matchain.api

data_dir = './data/Structured/DBLP-ACM'
dfa = pd.read_csv(f'{data_dir}/tableA.csv')
dfb = pd.read_csv(f'{data_dir}/tableB.csv')

mat = matchain.api.MatChain(dfa, dfb)

Next, we specify one or more similarity functions for each matching column using the property method. These similarity functions calculate scores between 0 and 1 for pairs of column values. In this example, we use equal for the integer-valued "year" column, which returns 1 if two years are equal and 0 otherwise. For each of the remaining string-valued columns, we apply shingle_tfidf, which generates a sparse vector for each string based on its shingles (character-level n-grams) and computes the cosine similarity between the sparse vectors of a pair of strings (a rough sketch of this idea follows the listing below):

mat.property('year', simfct='equal')
mat.property('title', simfct='shingle_tfidf')
mat.property('authors', simfct='shingle_tfidf')
mat.property('venue', simfct='shingle_tfidf')
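
To give a rough idea of what such a shingle-based similarity computes (this is an illustration, not MatChain's shingle_tfidf implementation), the snippet below builds character 3-gram TF-IDF vectors with scikit-learn and takes cosine similarities between strings from A and B; the n-gram length is an arbitrary example choice.

# Rough sketch of a shingle-based similarity (character n-grams + TF-IDF + cosine),
# not MatChain's actual shingle_tfidf implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles_a = ['data integration for dummies', 'record linkage at scale']
titles_b = ['record linkage at very large scale', 'data integration for everyone']

vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 3))
vectorizer.fit(titles_a + titles_b)              # shared shingle vocabulary
sim = cosine_similarity(vectorizer.transform(titles_a),
                        vectorizer.transform(titles_b))
print(sim.round(2))                              # one score in [0, 1] per pair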

As the total number of record pairs grows with the product of the numbers of records in datasets A and B, classifying every pair as matching or non-matching becomes computationally expensive, especially for large datasets. Blocking effectively reduces the number of pairs while discarding only a small fraction of the true matching pairs. The following line specifies three columns to use for blocking. By default, MatChain uses the library sparse_dot_topn to perform blocking as a nearest neighbour search on the same shingle vectors mentioned above:

mat.blocking(blocking_props=['title', 'authors', 'venue'])
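
The effect of blocking can be sketched as follows (again only an illustration, not MatChain's internals): concatenate the blocking columns, vectorize the resulting strings with character n-gram TF-IDF, and keep for every record in A only the k most similar records in B as candidate pairs. The values of k and the n-gram length below are arbitrary example choices.

# Illustrative blocking sketch: keep only the k most similar records in B for each record in A.
# Not MatChain's internals; k and the n-gram length are arbitrary example values.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

def concat_columns(df, columns):
    return df[columns].astype(str).agg(' '.join, axis=1)

texts_a = concat_columns(dfa, ['title', 'authors', 'venue'])
texts_b = concat_columns(dfb, ['title', 'authors', 'venue'])

vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 3))
vectors_b = vectorizer.fit_transform(texts_b)
vectors_a = vectorizer.transform(texts_a)

nn = NearestNeighbors(n_neighbors=5, metric='cosine').fit(vectors_b)
_, neighbours = nn.kneighbors(vectors_a)
candidate_pairs = [(i, j) for i, row in enumerate(neighbours) for j in row]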

Finally, we call autocal to execute the matching algorithm AutoCal and predict to get the predicted matching pairs:

mat.autocal()
predicted_matches = mat.predict()
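
If ground-truth matches are available, as for the example datasets described below, the predicted pairs can be evaluated in the usual way. The following lines are only a sketch: they assume that predicted_matches can be iterated as (index in A, index in B) pairs and that the ground-truth pairs have been loaded into a set called true_matches; neither detail is prescribed by this example.

# Hedged evaluation sketch: assumes predicted_matches and true_matches can both be
# treated as sets of (index_in_A, index_in_B) pairs.
predicted = set(map(tuple, predicted_matches))
true_matches = set()  # load the ground-truth pairs of the dataset here

tp = len(predicted & true_matches)
precision = tp / len(predicted) if predicted else 0.0
recall = tp / len(true_matches) if true_matches else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f'precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}')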

Configuration File

While the example above demonstrates how to use MatChain's API to match two datasets, an alternative and streamlined approach is to utilize a configuration file. This method allows us to specify datasets, matching chains, and parameters in a separate file:

python matchain --config ./config/mccommands.yaml

For more detailed information about configuration options, run the notebook run_matchain_config.ipynb.

Datasets

The data subdirectory includes pairs of example datasets together with ground truth data for evaluating MatChain's performance. These datasets cover various domains, including restaurants, bibliographies, products, and power plants. Four of them originate from this paper and were downloaded from the DeepMatcher Data Repository. Two additional dataset pairs belong to the power plant domain and were originally used for AutoCal.
