Extensible entity resolution framework

ResolvER

ResolvER is an extensible framework for building Entity Resolution pipelines that merge datasets around "things" using complex join logic and transitive linking.

Entity Resolution is a complex and computationally expensive process. ResolvER seeks to provide tools that cover the majority of use cases, the ability to enhance those tools with machine learning, and a way to leverage developers' experiential knowledge of their data, resulting in a flexible and efficient solution to the Entity Resolution problem.

Quick/Simple Example

The University of Leipzig provides test datasets for Entity Resolution. Suppose you're working with the DBLP-ACM dataset.

The dataset provides two files, both describing published papers, with similar columns:

  • A unique id (unique to that file only)
  • A title
  • A list of authors
  • A venue
  • A year

The titles vary slightly between files, and different authors may be listed for a given paper - in short, there's no clean or consistent way to deduplicate the data.
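As a rough illustration of the record layout described above, the sketch below models one row with a small dataclass and reads rows with Python's standard csv module. The Paper class, the read_papers helper, the lowercase column names, and the sample row are all hypothetical; they are not part of ResolvER or of the dataset itself. Because ids are unique only within their own file, each record also carries a label naming the file it came from.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Paper:
    """One row from either file (hypothetical model, not a ResolvER class)."""
    source: str   # label for the originating file; ids are only unique per file
    id: str
    title: str
    authors: str  # raw author list, kept as a single string here
    venue: str
    year: int


def read_papers(handle, source: str) -> List[Paper]:
    """Read rows from an open CSV handle, assuming the column names below."""
    reader = csv.DictReader(handle)
    return [
        Paper(
            source=source,
            id=row["id"],
            title=row["title"],
            authors=row["authors"],
            venue=row["venue"],
            year=int(row["year"]),
        )
        for row in reader
    ]


# Made-up sample row standing in for a real file.
sample = io.StringIO(
    "id,title,authors,venue,year\n"
    '1,"A Sample Title","A. Author, B. Author",Sample Venue,1999\n'
)
papers = read_papers(sample, source="dblp")
```

Pairing source with id gives each record a key that stays unique once both files are combined.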

Using ResolvER, you can specify any number of complex operations, called "Strategies," that determine whether two given records are duplicates of each other. One such strategy might be:

  • title has Levenshtein ratio of at least 0.9 AND
  • authors has Jaccard similarity of at least 0.4 AND
  • year is an exact match
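To make the three conditions concrete, here is a minimal standard-library sketch of how such a strategy could be evaluated for one pair of records. It does not use ResolvER's API; every function name and the sample records are hypothetical. Two assumptions are worth flagging: the Levenshtein ratio is taken to be 1 minus the edit distance divided by the longer string's length (other libraries define the ratio differently), and the authors threshold is read as Jaccard similarity, meaning set overlap, over comma-separated author names.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_ratio(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def jaccard_similarity(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def author_set(authors: str) -> set:
    """Split a comma-separated author string into normalized names."""
    return {name.strip().lower() for name in authors.split(",") if name.strip()}


def is_duplicate(left: dict, right: dict) -> bool:
    """Apply all three conditions of the example strategy (AND-ed together)."""
    title_ok = levenshtein_ratio(left["title"].lower(), right["title"].lower()) >= 0.9
    authors_ok = jaccard_similarity(
        author_set(left["authors"]), author_set(right["authors"])
    ) >= 0.4
    year_ok = left["year"] == right["year"]
    return title_ok and authors_ok and year_ok


# Made-up records for illustration only.
paper_a = {"title": "Efficient Query Processing in Databases",
           "authors": "Ann Lee, Bob Ray", "year": 1999}
paper_b = {"title": "Efficient query processing in database",
           "authors": "Ann Lee, Bob Ray, Cy Dow", "year": 1999}
print(is_duplicate(paper_a, paper_b))
```

Comparing every pair of records this way grows quadratically with dataset size, which is one reason Entity Resolution is computationally expensive.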

In-Depth Examples

For more in-depth workflows and explanations of the methodology, see the notebooks folder.

Install

Install the latest version of ResolvER:

$ pip install entity-resolution

Research

ResolvER is (most notably) inspired by the below publications:

License

Released under the standard MIT license (see LICENSE.txt):

Copyright (c) 2021 ResolvER Developers
Carl Best
