Extensible entity resolution framework

ResolvER

ResolvER is an extensible framework for building Entity Resolution pipelines in order to merge datasets around "things" based on complex join logic and transitive linking.

Entity Resolution is a complex and computationally expensive process. ResolvER seeks to provide tools that cover the majority of use cases, the ability to enhance those tools with machine learning, and a way to leverage developers' experiential knowledge of their data, yielding a flexible and efficient solution to the Entity Resolution problem.
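To make "transitive linking" concrete: if record A matches B and B matches C, all three belong to the same entity even if A and C were never compared directly. The sketch below illustrates that idea with a plain union-find structure. It is not ResolvER's API; the function name `link_transitively` and the record ids are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def link_transitively(pairs):
    """Group record ids into entities from pairwise matches (union-find).

    Hypothetical helper for illustration; not part of ResolvER.
    """
    parent = {}

    def find(x):
        # Register unseen ids as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = defaultdict(set)
    for record_id in list(parent):
        groups[find(record_id)].add(record_id)
    return list(groups.values())


# Made-up pairwise matches: a1~b7 and b7~c3 together imply a1~c3.
matches = [("a1", "b7"), ("b7", "c3"), ("a2", "b9")]
entities = link_transitively(matches)
```

Here `entities` contains two groups: one holding `a1`, `b7`, and `c3`, and one holding `a2` and `b9`.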

Quick/Simple Example

The University of Leipzig provides test datasets for Entity Resolution. Suppose you're working with the DBLP-ACM dataset.

The dataset provides two files, both describing published papers, with similar columns:

  • A unique id (unique to that file only)
  • A title
  • A list of authors
  • A venue
  • A year

The titles vary slightly between files, and different authors may be listed for a given paper. In short, there's no clean or consistent way to deduplicate the data.

Using ResolvER, you can specify any number of complex operations to determine whether two given records are duplicates of each other in what are called "Strategies." One such strategy may be:

  • title has Levenshtein ratio of at least 0.9 AND
  • authors has Jaccard similarity of at least 0.4 AND
  • year is an exact match
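The strategy above can be sketched in plain Python to show what each condition checks. This is an illustration only and does not use ResolvER's own API: the `Paper` class, the helper functions, and the sample records are all hypothetical. Two assumptions are worth flagging. The Levenshtein ratio is computed as one minus the edit distance divided by the longer string's length, which is one common normalization among several. The authors comparison is computed as Jaccard similarity over sets of author names, on the assumption that a higher score means a closer match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    """Hypothetical record type mirroring the columns listed above."""
    id: str
    title: str
    authors: frozenset
    venue: str
    year: int


def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance with unit costs."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_ratio(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / max(len(a), len(b))


def jaccard_similarity(a: frozenset, b: frozenset) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_duplicate(left: Paper, right: Paper) -> bool:
    """All three conditions of the example strategy must hold."""
    return (levenshtein_ratio(left.title, right.title) >= 0.9
            and jaccard_similarity(left.authors, right.authors) >= 0.4
            and left.year == right.year)


# Made-up records, not taken from the DBLP-ACM files.
dblp = Paper("d1", "A Survey of Entity Resolution Methods",
             frozenset({"a. smith", "b. jones", "c. lee"}), "VLDB", 2003)
acm = Paper("a9", "A Survey of Entity Resolution Method",
            frozenset({"a. smith", "b. jones"}), "VLDB", 2003)
reprint = Paper("a10", "A Survey of Entity Resolution Methods",
                frozenset({"a. smith", "b. jones", "c. lee"}), "VLDB", 2004)
```

With these records, `is_duplicate(dblp, acm)` holds because the titles differ by a single character and two of the three authors overlap, while `is_duplicate(dblp, reprint)` fails on the year check alone.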

In-Depth Examples

For more in-depth workflows and explanations of the methodology, see the notebooks folder.

Install

Install the latest version of ResolvER:

$ pip install entity-resolution

Research

ResolvER is (most notably) inspired by the below publications:

License

Released under the standard MIT license (see LICENSE.txt):

Copyright (c) 2021 ResolvER Developers
Carl Best
