Lightweight, pandas-native probabilistic record deduplication and entity resolution

record-linker

License: MIT

record-linker is a lightweight, pandas-native Python library for probabilistic record deduplication and entity resolution. It needs no training data and little configuration, and it works out of the box. Point it at a messy CSV of missing persons, voter rolls, or beneficiary records and get back a clean DataFrame annotated with cluster IDs and match confidence scores.


What It Does

record-linker finds duplicate records in real-world datasets where names are misspelled, dates are formatted differently, and addresses are abbreviated. It uses configurable blocking rules to avoid comparing every pair of records, fuzzy comparators to measure field-level similarity, and connected-components clustering to group likely duplicates. The result is confidence-scored and auditable.
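
The three stages named above (blocking, fuzzy comparison, connected-components clustering) can be illustrated with a minimal standard-library sketch. This is not record-linker's API: the `dedupe` and `similarity` functions, the sample records, the choice of `city` as the blocking key, and the 0.85 threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample data; ids are the row identifiers.
records = [
    {"id": 0, "name": "Jon Smith", "city": "Nairobi"},
    {"id": 1, "name": "John Smith", "city": "Nairobi"},
    {"id": 2, "name": "Amina Yusuf", "city": "Mombasa"},
    {"id": 3, "name": "Amina Yusif", "city": "Mombasa"},
    {"id": 4, "name": "Peter Otieno", "city": "Nairobi"},
]


def similarity(a, b):
    """Fuzzy comparator: case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find(parent, x):
    """Union-find root lookup with path compression."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def dedupe(records, threshold=0.85):
    # Blocking: only records that share a city are ever compared.
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["city"].lower()].append(rec)

    parent = {rec["id"]: rec["id"] for rec in records}
    scores = {}
    for block in blocks.values():
        for a, b in combinations(block, 2):
            score = similarity(a["name"], b["name"])
            if score >= threshold:
                # Keep the pairwise score so each link can be audited.
                scores[(a["id"], b["id"])] = score
                root_a, root_b = find(parent, a["id"]), find(parent, b["id"])
                # Merge the two components; the smaller id becomes the root.
                parent[max(root_a, root_b)] = min(root_a, root_b)

    # Connected components: each record's cluster id is its root.
    clusters = {rid: find(parent, rid) for rid in parent}
    return clusters, scores


clusters, scores = dedupe(records)
for rec in records:
    print(rec["id"], rec["name"], "-> cluster", clusters[rec["id"]])
```

Blocking keeps the number of comparisons proportional to the block sizes rather than to every possible pair, and storing the pairwise scores alongside the cluster IDs is what makes each grouping traceable back to the evidence behind it.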

Motivating example: A humanitarian NGO receives beneficiary lists from three field offices. Each office uses different name spellings and date formats. record-linker deduplicates the merged list in seconds, flagging 847 duplicate registrations out of 12,000 records.


Installation


Download files

Source Distribution

record_linker-0.1.0.tar.gz (17.2 kB)

Uploaded Source

Built Distribution

record_linker-0.1.0-py3-none-any.whl (20.4 kB)

Uploaded Python 3

File details

Details for the file record_linker-0.1.0.tar.gz.

File metadata

  • Download URL: record_linker-0.1.0.tar.gz
  • Upload date:
  • Size: 17.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.2.1 CPython/3.13.12 Darwin/24.6.0

File hashes

Hashes for record_linker-0.1.0.tar.gz
Algorithm Hash digest
SHA256 35c6addf6b736a604395fb1f2bca8e69850c35e30031e46550890dfe91f3cef2
MD5 43d0befc781a7e5aaf353653ccf98eae
BLAKE2b-256 5d2e22c76cc51ff7fca71f7aabcb0166adb33d9d4933f7173965331ac9d55e39

File details

Details for the file record_linker-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: record_linker-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 20.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.2.1 CPython/3.13.12 Darwin/24.6.0

File hashes

Hashes for record_linker-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 0d051e6ec65ee0d85c1d6e4cf623849c5d2f60daa90f5b7cf5dfdf14d6266a67
MD5 8529295d25aaadf2796ef57181c14d55
BLAKE2b-256 b9d9b41bd24fee3aa32ed6513321df14902a0dfc9599a034266e64299abd6e96
