Extensible entity resolution framework
ResolvER
ResolvER is an extensible framework for building Entity Resolution pipelines in order to merge datasets around "things" based on complex join logic and transitive linking.
Entity Resolution is a complex and computationally expensive process. ResolvER seeks to provide tools that cover the majority of use cases, the ability to enhance those tools with machine learning, and a way to leverage developers' experiential knowledge of their data, yielding a flexible and efficient solution to the Entity Resolution problem.
Quick/Simple Example
The University of Leipzig provides test datasets for Entity Resolution. Suppose you're working with the DBLP-ACM dataset.
The dataset provides two files, both describing published papers, with similar columns:
- A unique id (unique to that file only)
- A title
- A list of authors
- A venue
- A year
The titles vary slightly between files, and different authors may be listed for a given paper - in short, there's no clean or consistent way to deduplicate the data.
Using ResolvER, you can specify any number of complex operations to determine whether two given records are duplicates of each other in what are called "Strategies." One such strategy may be:
- title has a Levenshtein ratio of at least 0.9, AND
- authors has a Jaccard distance of at least 0.4, AND
- year is an exact match
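To make the logic of such a strategy concrete, here is a minimal standalone sketch in plain Python. It does not use ResolvER's API, which this page does not show; `Paper`, `levenshtein_ratio`, `jaccard_similarity`, and `is_duplicate` are hypothetical names introduced only for illustration. Two assumptions are baked in: the Levenshtein ratio is taken as 1 minus the edit distance divided by the longer string's length (libraries define it in slightly different ways), and the authors threshold is read as a minimum overlap (Jaccard similarity) between the two author sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    """Hypothetical record type mirroring the DBLP-ACM columns."""
    id: str
    title: str
    authors: frozenset
    venue: str
    year: int


def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_ratio(a: str, b: str) -> float:
    """Assumed normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def jaccard_similarity(a: frozenset, b: frozenset) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_duplicate(left: Paper, right: Paper,
                 title_threshold: float = 0.9,
                 author_threshold: float = 0.4) -> bool:
    """All three conditions of the example strategy must hold."""
    return (
        levenshtein_ratio(left.title.lower(), right.title.lower()) >= title_threshold
        and jaccard_similarity(left.authors, right.authors) >= author_threshold
        and left.year == right.year
    )


# Made-up records, not taken from the real dataset.
dblp = Paper("d1", "Entity Resolution at Scale",
             frozenset({"a. smith", "b. jones"}), "VLDB", 2001)
acm = Paper("a1", "Entity Resolution at Scale.",
            frozenset({"a. smith", "b. jones", "c. lee"}), "VLDB", 2001)
print(is_duplicate(dblp, acm))
```

Comparing every record in one file against every record in the other grows quadratically, which is part of why Entity Resolution is computationally expensive.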
In-Depth Examples
For more in-depth workflows and explanations of the methodology, see the notebooks folder.
Install
Install the latest version of ResolvER:
$ pip install entity-resolution
Research
ResolvER is most notably inspired by the following publications:
- Collective Entity Resolution in Relational Data
- Comparative Analysis of Approximate Blocking Techniques for Entity Resolution
License
Released under the standard MIT license (see LICENSE.txt):
Copyright (c) 2021 ResolvER Developers
Carl Best