Record linkage - simple, flexible, efficient.

MatChain: Simple, Flexible, Efficient

MatChain is an experimental package designed for record linkage. Record linkage is the process of matching records that correspond to the same real-world entity in two or more datasets. This process typically includes several steps, such as blocking and the final matching decision, with a wide range of methods available, including probabilistic, rule-based, and machine learning approaches.

MatChain was created with three core objectives in mind: simplicity, flexibility, and efficiency. It focuses on unsupervised approaches to minimize manual effort, allows customization of the matching steps, and offers fast and resource-efficient implementations.

MatChain leverages libraries like Pandas, NumPy, and SciPy for vectorized data handling, advanced indexing, and support for sparse matrices. It also uses scikit-learn and SentenceTransformers to convert strings into sparse vectors and dense vectors, respectively. This makes it possible to perform blocking as an approximate nearest neighbour search over the resulting vectors, using libraries like NMSLIB and Faiss.

The currently published version of MatChain exclusively provides AutoCal as the matching algorithm. AutoCal is an unsupervised method initially designed for instance matching with Ontomatch in the context of The World Avatar. MatChain's implementation is highly efficient and allows for the combination of AutoCal with various procedures for blocking and computing similarity scores.

Installation

MatChain requires Python 3.8 or higher and can be installed with pip:

pip install matchain

However, this only installs PyTorch's CPU version. If you want to use the GPU version, you need to install it separately:

pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
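
After installation, it can be useful to confirm which PyTorch build is actually active. The following is a minimal sketch; the helper name torch_device is hypothetical and not part of MatChain. It only relies on torch.cuda.is_available() and returns None if PyTorch cannot be imported.

```python
def torch_device():
    """Report which device PyTorch would use, or None if PyTorch is missing.

    Hypothetical helper for illustration; not part of MatChain.
    """
    try:
        import torch
    except ImportError:
        return None
    # 'cuda' is reported only if a GPU build is installed and a GPU is visible
    return 'cuda' if torch.cuda.is_available() else 'cpu'


print(torch_device())
```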

Basic Example Using the API

In this example, we demonstrate how to match two datasets, denoted as A and B, based on columns with the same names: "year", "title", "authors", and "venue". You can run this example in the accompanying notebook run_matchain_api.ipynb, which provides a detailed explanation of MatChain's API, including how to specify parameters.

First, we read the data and initialize an instance of the class MatChain using Pandas' dataframes.

import pandas as pd
import matchain.api

# Load the two datasets to be matched
data_dir = './data/Structured/DBLP-ACM'
dfa = pd.read_csv(f'{data_dir}/tableA.csv')
dfb = pd.read_csv(f'{data_dir}/tableB.csv')

mat = matchain.api.MatChain(dfa, dfb)

Next, we specify one or more similarity functions for each matching column with the property method. These similarity functions calculate scores between 0 and 1 for pairs of column values. In this example, we use equal for the integer-valued "year" column, which returns 1 if two years are equal and 0 otherwise. For each of the remaining string-valued columns, we apply shingle_tfidf to generate a sparse vector for each string based on its shingles (n-grams on the character level) and compute the cosine similarity between the sparse vectors for pairs of strings:

mat.property('year', simfct='equal')
mat.property('title', simfct='shingle_tfidf')
mat.property('authors', simfct='shingle_tfidf')
mat.property('venue', simfct='shingle_tfidf')
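
To illustrate what these two kinds of similarity functions compute, here is a minimal sketch built directly on scikit-learn. It is not MatChain's implementation: the function names equal_sim and shingle_tfidf_sim, the shingle length of 3, and the sample titles are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def equal_sim(a, b):
    """Return 1.0 if the two values are equal and 0.0 otherwise."""
    return 1.0 if a == b else 0.0


def shingle_tfidf_sim(strings_a, strings_b, n=3):
    """Cosine similarity between character n-gram TF-IDF vectors.

    Returns a dense matrix with one row per string in strings_a
    and one column per string in strings_b.
    """
    vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(n, n))
    # Fit on both sides so that both share one vocabulary and IDF weights
    vectorizer.fit(list(strings_a) + list(strings_b))
    va = vectorizer.transform(strings_a)
    vb = vectorizer.transform(strings_b)
    # Rows are L2-normalised by default, so the dot product is the cosine
    return (va @ vb.T).toarray()


titles_a = ['Efficient record linkage', 'Query optimization in databases']
titles_b = ['Efficient record linkage.', 'Deep learning for vision']
sim = shingle_tfidf_sim(titles_a, titles_b)
print(sim.round(2))
print(equal_sim(2003, 2003), equal_sim(2003, 2004))
```

Near-identical titles share almost all of their shingles and therefore score close to 1, while unrelated titles score close to 0.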

As the total number of record pairs grows with the product of the numbers of records in datasets A and B, classifying each pair as matching or non-matching can be computationally expensive, especially for large datasets. Blocking effectively reduces the number of pairs while only discarding a small fraction of true matching pairs. The following line specifies three columns to use for blocking. By default, MatChain utilizes the library sparse_dot_topn to perform blocking by conducting a nearest neighbor search on the same shingle vectors mentioned earlier:

mat.blocking(blocking_props=['title', 'authors', 'venue'])
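
The idea behind this kind of blocking can be sketched without MatChain: compute a sparse similarity matrix between the shingle vectors of both datasets and keep, for each record in A, only its top-scoring records in B. The sketch below uses a plain SciPy sparse product in place of sparse_dot_topn; the function name block_top_n, the parameters top_n and min_sim, and the sample strings are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def block_top_n(strings_a, strings_b, top_n=1, min_sim=0.1):
    """Return candidate pairs (i, j): for each string i in A,
    the top_n most similar strings j in B with similarity >= min_sim."""
    vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(3, 3))
    vectorizer.fit(list(strings_a) + list(strings_b))
    va = vectorizer.transform(strings_a)
    vb = vectorizer.transform(strings_b)
    scores = (va @ vb.T).tocsr()  # sparse cosine similarity matrix
    candidates = set()
    for i in range(scores.shape[0]):
        row = scores.getrow(i)
        # Positions of the top_n largest non-zero scores in this row
        order = np.argsort(-row.data)[:top_n]
        for k in order:
            if row.data[k] >= min_sim:
                candidates.add((i, int(row.indices[k])))
    return candidates


a = ['Efficient record linkage',
     'Query optimization in databases',
     'Neural networks for vision']
b = ['efficient record linkage at scale',
     'query optimisation for databases',
     'neural network vision models',
     'a survey of graph algorithms']
candidates = block_top_n(a, b, top_n=1)
print(sorted(candidates))
print(f'{len(candidates)} candidate pairs instead of {len(a) * len(b)}')
```

Here the full comparison would cover 3 x 4 = 12 pairs, whereas blocking keeps at most one candidate per record in A.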

Finally, we call autocal to execute the matching algorithm AutoCal and predict to get the predicted matching pairs:

mat.autocal()
predicted_matches = mat.predict()

Configuration File

While the example above demonstrates how to use MatChain's API to match two datasets, an alternative and streamlined approach is to utilize a configuration file. This method allows us to specify datasets, matching chains, and parameters in a separate file:

python matchain --config ./config/mccommands.yaml

For more detailed information about configuration options, run the notebook run_matchain_config.ipynb.

Datasets

The data subdirectory includes pairs of example datasets and ground truth data for evaluating MatChain's performance. These datasets cover various domains, including restaurant, bibliography, product, and powerplants. Specifically, four of them originate from this paper and were downloaded from the DeepMatcher Data Repository. Two additional dataset pairs are related to the powerplants domain and were originally used for AutoCal.
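
Given ground truth, predicted matches are commonly evaluated with precision, recall, and F1 score. The sketch below assumes that both the predictions and the ground truth are available as collections of (index in A, index in B) pairs. The actual return type of predict and the format of the ground truth files may differ, and the function name evaluate and the sample pairs are hypothetical.

```python
def evaluate(predicted_pairs, true_pairs):
    """Return (precision, recall, f1) for predicted vs. true matching pairs."""
    predicted, truth = set(predicted_pairs), set(true_pairs)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up pairs for illustration only
predicted = [(0, 0), (1, 1), (2, 5)]
truth = [(0, 0), (1, 1), (3, 3), (4, 4)]
precision, recall, f1 = evaluate(predicted, truth)
print(f'precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}')
```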
