Record linkage - simple, flexible, efficient.

MatChain: Simple, Flexible, Efficient

MatChain is an experimental package designed for record linkage. Record linkage is the process of matching records that correspond to the same real-world entity in two or more datasets. This process typically includes several steps, such as blocking and the final matching decision, with a wide range of methods available, including probabilistic, rule-based, and machine learning approaches.

MatChain was created with three core objectives in mind: simplicity, flexibility, and efficiency. It focuses on unsupervised approaches to minimize manual effort, allows the matching steps to be customized, and offers fast, resource-efficient implementations.

MatChain leverages libraries like Pandas, NumPy, and SciPy for vectorized data handling, advanced indexing, and sparse-matrix support. It also uses scikit-learn and SentenceTransformers to convert strings into sparse vectors and dense vectors, respectively. This makes it possible to perform blocking as an approximate nearest neighbor search over the resulting vectors, using libraries like NMSLIB and Faiss.

The currently published version of MatChain exclusively provides AutoCal as the matching algorithm. AutoCal is an unsupervised method initially designed for instance matching with Ontomatch in the context of The World Avatar. MatChain's implementation is highly efficient and allows for the combination of AutoCal with various procedures for blocking and computing similarity scores.

Installation

MatChain requires Python 3.8 or higher and can be installed with pip:

pip install matchain

However, this only installs PyTorch's CPU version. To use a GPU, install a CUDA-enabled PyTorch build separately, for example for CUDA 11.8:

pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
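After installing, it can be useful to check which PyTorch build is actually active. The following is a minimal sketch, not part of MatChain: the helper name gpu_available is hypothetical, and it relies only on the standard torch.cuda.is_available() call.

```python
import importlib.util


def gpu_available() -> bool:
    """Return True only if PyTorch is importable and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False  # PyTorch is not installed in this environment
    import torch

    return bool(torch.cuda.is_available())


print("CUDA-enabled PyTorch detected:", gpu_available())
```

If this prints False after the GPU install, the CPU-only wheel is probably still the one being imported.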

Basic Example Using the API

In this example, we demonstrate how to match two datasets, denoted as A and B, based on columns with the same names: "year," "title," "authors," and "venue." You can run this example in the accompanying notebook run_matchain_api.ipynb, which provides a detailed explanation of MatChain's API, including how to specify parameters.

First, we read the data into two pandas DataFrames and use them to initialize an instance of the class MatChain.

import pandas as pd
import matchain.api

# Load the two tables to be matched
data_dir = './data/Structured/DBLP-ACM'
dfa = pd.read_csv(f'{data_dir}/tableA.csv')
dfb = pd.read_csv(f'{data_dir}/tableB.csv')

mat = matchain.api.MatChain(dfa, dfb)

Next, we specify one or more similarity functions for each matching column with the property method. Each similarity function computes a score between 0 and 1 for a pair of column values. In this example, we use equal for the integer-valued "year" column, which returns 1 if two years are equal and 0 otherwise. For each of the remaining string-valued columns, we apply shingle_tfidf, which generates a sparse vector for each string from its shingles (character-level n-grams) and computes the cosine similarity between the sparse vectors of each pair of strings:

mat.property('year', simfct='equal')
mat.property('title', simfct='shingle_tfidf')
mat.property('authors', simfct='shingle_tfidf')
mat.property('venue', simfct='shingle_tfidf')
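To illustrate what these two similarity functions compute, here is a minimal sketch built directly on scikit-learn. It is not MatChain's internal implementation: the shingle size of 3, the sample titles, and the helper equal_sim are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

titles_a = ["Efficient record linkage", "Query optimization in databases"]
titles_b = ["Eficient record linkage", "Deep learning for vision"]

# Character 3-grams ("shingles") weighted by TF-IDF; the n-gram size is an assumption.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3), lowercase=True)
vectorizer.fit(titles_a + titles_b)
va = vectorizer.transform(titles_a)  # sparse rows, L2-normalized by default
vb = vectorizer.transform(titles_b)

# For L2-normalized rows, the dot product equals the cosine similarity.
sim = (va @ vb.T).toarray()


def equal_sim(x, y) -> float:
    """Score 1.0 for identical values and 0.0 otherwise, like 'equal' above."""
    return 1.0 if x == y else 0.0


print(sim.round(2))
print(equal_sim(1999, 1999), equal_sim(1999, 2001))
```

The misspelled title still scores high against its counterpart because most of its character shingles are shared, which is why shingle-based similarity tolerates typos.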

As the total number of record pairs grows with the product of the numbers of records in datasets A and B, classifying each pair as matching or non-matching can be computationally expensive, especially for large datasets. Blocking substantially reduces the number of candidate pairs while discarding only a small fraction of the true matching pairs. The following line specifies three columns to use for blocking. By default, MatChain uses the library sparse_dot_topn to perform blocking as a nearest neighbor search on the same shingle vectors mentioned earlier:

mat.blocking(blocking_props=['title', 'authors', 'venue'])
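The idea behind this kind of blocking can be sketched as a top-n search over shingle vectors. The example below is an illustration under stated assumptions, not MatChain's code: it uses an exact, dense search with NumPy instead of sparse_dot_topn, and the function top_n_candidates, the shingle size, and the sample strings are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def top_n_candidates(strings_a, strings_b, n=1):
    """For each record i in A, keep the n most similar records j in B as pairs (i, j)."""
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
    vectorizer.fit(list(strings_a) + list(strings_b))
    va = vectorizer.transform(strings_a)
    vb = vectorizer.transform(strings_b)
    scores = (va @ vb.T).toarray()  # dense only because this toy example is tiny
    n = min(n, scores.shape[1])
    # Column indices of the n largest scores in every row (unsorted within the top n).
    top = np.argpartition(-scores, n - 1, axis=1)[:, :n]
    return {
        (i, int(j))
        for i in range(scores.shape[0])
        for j in top[i]
        if scores[i, j] > 0  # drop pairs that share no shingle at all
    }


titles_a = ["record linkage at scale", "graph neural networks", "query optimization"]
titles_b = ["scalable record linkage", "neural networks on graphs",
            "optimizing queries", "image segmentation"]

pairs = top_n_candidates(titles_a, titles_b, n=1)
print(f"{len(pairs)} candidate pairs instead of {len(titles_a) * len(titles_b)}")
```

Only the retained candidate pairs need to be scored and classified afterwards, so the cost no longer grows with the full product of the two dataset sizes.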

Finally, we call autocal to execute the matching algorithm AutoCal and predict to get the predicted matching pairs:

mat.autocal()
predicted_matches = mat.predict()

Configuration File

While the example above demonstrates how to use MatChain's API to match two datasets, an alternative and streamlined approach is to utilize a configuration file. This method allows us to specify datasets, matching chains, and parameters in a separate file:

python matchain --config ./config/mccommands.yaml

For more detailed information about configuration options, run the notebook run_matchain_config.ipynb.

Datasets

The data subdirectory includes pairs of example datasets and ground truth data for evaluating MatChain's performance. These datasets cover various domains, including restaurant, bibliography, product, and powerplants. Specifically, four of them originate from this paper and were downloaded from the DeepMatcher Data Repository. Two additional dataset pairs are related to the powerplants domain and were originally used for AutoCal.
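Evaluating predictions against ground truth usually comes down to precision, recall, and F1 over sets of matching pairs. The sketch below assumes that both the predictions and the ground truth are available as sets of (index in A, index in B) tuples; that representation and the function evaluate are assumptions for illustration, not the format MatChain uses.

```python
def evaluate(predicted, ground_truth):
    """Return (precision, recall, f1) for two collections of matching pairs."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: 2 of 3 predictions are correct, 2 of 4 true matches are found.
predicted = {(0, 0), (1, 1), (2, 3)}
truth = {(0, 0), (1, 1), (2, 2), (3, 4)}

precision, recall, f1 = evaluate(predicted, truth)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```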

Release history

0.1.2: Nov 3, 2023 (this version)
0.1.1: Nov 2, 2023

Download files

Source Distribution

matchain-0.1.2.tar.gz (58.6 kB)

Uploaded Nov 3, 2023 Source

Built Distribution

matchain-0.1.2-py3-none-any.whl (62.9 kB)

Uploaded Nov 3, 2023 Python 3

File details

Details for the file matchain-0.1.2.tar.gz.

File metadata

Download URL: matchain-0.1.2.tar.gz
Upload date: Nov 3, 2023
Size: 58.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.8.16

File hashes

Hashes for matchain-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`2830cd944c7346d81a8262f790da865625bf2719230958c85b81254b35c129f4`
MD5	`f249e5c1feb91113ac7d26cd1a2107be`
BLAKE2b-256	`8a591f0e6c5c79999b66dea71d57c4cab18c71c7ff9b5d4e404dac41f07d1a31`

File details

Details for the file matchain-0.1.2-py3-none-any.whl.

File metadata

Download URL: matchain-0.1.2-py3-none-any.whl
Upload date: Nov 3, 2023
Size: 62.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.8.16

File hashes

Hashes for matchain-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6145064dc343a3c609b93fbee9786ca851c574578441898a14cfe6620ca63492`
MD5	`2b9bfa290205ccb9fd1dd77a9201067d`
BLAKE2b-256	`252fde2c55a11b8647c570a366a3cfd3ec3879a0376527d868dbbe8397a8ea4b`
