Set-aware record linkage with structure-preserving joins

setjoin

Record linkage that keeps groups together. Match persons while preserving household membership, students while respecting school assignments, or any hierarchical data where group integrity matters.

The Problem

Standard record linkage matches individuals optimally but ignores group structure. When household members should stay together, Hungarian matching might send them to different target households because it maximizes individual scores.

Example: Two source households, each with 2 members:

  • Household 1: Alice (age 35), Bob (age 10)
  • Household 2: Carol (age 45), Dan (age 42)

Target records (shuffled, unknown household IDs):

  • Record A: age 36, Record B: age 11, Record C: age 44, Record D: age 43

Suppose A and C are in target household X, while B and D are in target household Y.

Person-level Hungarian might match Alice→A, Bob→B, Carol→C, Dan→D. This maximizes individual match quality but splits both source families across target households: Alice and Carol land in household X while Bob and Dan land in household Y.

Structure-aware matching first assigns household→household, then matches within. Result: all source household members map to the same target household.
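
The split can be reproduced in a few lines. The sketch below is illustrative only: it scores pairs by negative absolute age difference (an assumption, since the example does not fix a scoring rule) and uses exhaustive search over permutations as a stand-in for the Hungarian algorithm, which finds the same optimum on inputs this small.

```python
from itertools import permutations

source = {"Alice": 35, "Bob": 10, "Carol": 45, "Dan": 42}
source_household = {"Alice": 1, "Bob": 1, "Carol": 2, "Dan": 2}
target = {"A": 36, "B": 11, "C": 44, "D": 43}
target_household = {"A": "X", "C": "X", "B": "Y", "D": "Y"}

def score(person, record):
    # Assumed scoring rule: closer ages score higher.
    return -abs(source[person] - target[record])

people = list(source)
# Person-level optimum: try every 1-to-1 assignment and keep the best total.
best = max(
    permutations(target),
    key=lambda perm: sum(score(p, r) for p, r in zip(people, perm)),
)
person_level = dict(zip(people, best))

# A source household is split if its members land in more than one target household.
landed = {}
for person, record in person_level.items():
    landed.setdefault(source_household[person], set()).add(target_household[record])
split = sorted(h for h, targets in landed.items() if len(targets) > 1)

print(person_level)  # {'Alice': 'A', 'Bob': 'B', 'Carol': 'C', 'Dan': 'D'}
print(split)         # [1, 2]
```

Every person gets the closest age match, yet both source households end up spread over X and Y.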

Why It Matters

In simulations with realistic ambiguity:

Method                      Group Coherence   Person Accuracy   Downstream Bias
Hungarian (person-level)    12%               32%               3.7
Structure-aware             52%               48%               2.3

In these simulations, structure-aware matching raises group coherence from 12% to 52% (roughly 4x) and person-level accuracy from 32% to 48%, while downstream bias falls from 3.7 to 2.3. When groups are preserved, downstream analyses (treatment effects, household income, etc.) are less biased.

When to Use

Use setjoin when your records have known group structure that must be preserved:

  • Household surveys → Admin records: All family members must link to the same admin household
  • Student rosters → Test scores: Students in class 3A must all map to the same target class
  • Employee lists → Payroll: Workers at Plant X must link to the same payroll entity

Don't use setjoin if:

  • Records are truly independent (no group structure)
  • You want to discover groups (use clustering instead)

How It Works

Structure-aware matching uses two-level assignment:

  1. Compute group scores: For each (source_group, target_group) pair, solve the optimal within-group assignment using the Hungarian algorithm. The group score is the sum of matched record scores.

  2. Assign groups: Apply the Hungarian algorithm to the group score matrix to find the best overall group-to-group pairing.

  3. Extract matches: From matched groups, use the within-group assignments.

This guarantees that all records in a source group map to a single target group, while maximizing total match quality among the assignments that keep groups intact.
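
The three steps can be sketched in plain Python. This is a minimal illustration, not setjoin's implementation: `best_assignment` and `two_level_match` are hypothetical helpers, exhaustive search over permutations stands in for the Hungarian algorithm at both levels, and the scores are assumed to be negative absolute age differences taken from the household example in The Problem.

```python
from itertools import permutations

def best_assignment(scores, rows, cols):
    """Best 1-to-1 assignment of rows to cols by exhaustive search.

    Returns (total_score, [(row, col), ...]). Assumes len(rows) <= len(cols).
    """
    best_total, best_pairs = float("-inf"), []
    for perm in permutations(cols, len(rows)):
        total = sum(scores[r][c] for r, c in zip(rows, perm))
        if total > best_total:
            best_total, best_pairs = total, list(zip(rows, perm))
    return best_total, best_pairs

def two_level_match(scores, source_groups, target_groups):
    """Assumes there are at least as many target groups as source groups."""
    # Step 1: score every (source_group, target_group) pair by its best
    # within-group assignment.
    within = {
        (sg, tg): best_assignment(scores, s_rows, t_cols)
        for sg, s_rows in source_groups.items()
        for tg, t_cols in target_groups.items()
    }
    # Step 2: choose the group-to-group pairing with the highest summed score.
    sgs, tgs = list(source_groups), list(target_groups)
    best_perm = max(
        permutations(tgs, len(sgs)),
        key=lambda perm: sum(within[sg, tg][0] for sg, tg in zip(sgs, perm)),
    )
    group_map = dict(zip(sgs, best_perm))
    # Step 3: read the record matches off the chosen group pairs.
    matches = sorted(pair for key in group_map.items() for pair in within[key][1])
    return matches, group_map

source_ages = [35, 10, 45, 42]   # Alice, Bob, Carol, Dan
target_ages = [36, 11, 44, 43]   # Records A, B, C, D
scores = [[-abs(s - t) for t in target_ages] for s in source_ages]

matches, group_map = two_level_match(
    scores,
    source_groups={1: [0, 1], 2: [2, 3]},
    target_groups={"X": [0, 2], "Y": [1, 3]},
)
print(group_map)  # {1: 'Y', 2: 'X'}
print(matches)    # [(0, 3), (1, 1), (2, 2), (3, 0)]
```

Household 1 goes to Y as a unit and household 2 to X. The total score (-16) is lower than the unconstrained person-level optimum (-4); that difference is the price of keeping groups together.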

Installation

pip install setjoin

Quick Start

import numpy as np
from setjoin import match, HierarchySpec

# Score matrix: how well does each source record match each target?
scores = np.array([
    [10.0, 2.0, 1.0, 1.0],  # Person 0 scores highest with target 0
    [9.0, 10.0, 1.0, 1.0],  # Person 1 scores high with targets 0 and 1
    [1.0, 1.0, 10.0, 2.0],  # Person 2 scores highest with target 2
    [1.0, 1.0, 9.0, 10.0],  # Person 3 scores high with targets 2 and 3
])

# Define household structure: persons 0,1 are in household 0; persons 2,3 in household 1
hierarchy = HierarchySpec(
    source_groups={0: [0, 1], 1: [2, 3]},
    target_groups={0: [0, 1], 1: [2, 3]},
)

# Match while keeping households together
result = match(scores, method="structure_aware", hierarchy=hierarchy)
print(result.matches)  # [(0, 0), (1, 1), (2, 2), (3, 3)]
print(result.group_assignments)  # {0: 0, 1: 1} - household mappings

Examples

Basic Matching (No Hierarchy)

import numpy as np
from setjoin import hungarian_match, greedy_match

scores = np.array([
    [10.0, 1.0, 1.0],
    [1.0, 10.0, 1.0],
    [1.0, 1.0, 10.0],
])

# Optimal global assignment
result = hungarian_match(scores)
print(result.matches)  # [(0, 0), (1, 1), (2, 2)]
print(result.total_score)  # 30.0

# Fast greedy alternative
result = greedy_match(scores)
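
To see why greedy is only a heuristic, here is a minimal sketch of the "highest scores first" idea. `greedy_pairs` is a hypothetical helper written for illustration; setjoin's own `greedy_match` may differ in tie-breaking and return type.

```python
def greedy_pairs(scores):
    """Repeatedly take the highest remaining score whose row and column are unused."""
    cells = sorted(
        ((s, i, j) for i, row in enumerate(scores) for j, s in enumerate(row)),
        key=lambda cell: -cell[0],
    )
    used_rows, used_cols, pairs = set(), set(), []
    for s, i, j in cells:
        if i not in used_rows and j not in used_cols:
            pairs.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return sorted(pairs)

scores = [
    [10.0, 9.0],
    [9.0, 1.0],
]
pairs = greedy_pairs(scores)
total = sum(scores[i][j] for i, j in pairs)
print(pairs, total)  # [(0, 0), (1, 1)] 11.0
```

Taking the 10.0 first forces the 1.0 afterwards, for a total of 11.0. The optimal assignment, (0, 1) and (1, 0), totals 18.0.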

Building Scores from DataFrames

import pandas as pd
from setjoin import Scorer, FieldConfig

source = pd.DataFrame({"age": [25, 30, 35], "income": [50000, 60000, 70000]})
target = pd.DataFrame({"age": [26, 31, 34], "income": [51000, 59000, 72000]})

scorer = Scorer({
    "age": FieldConfig(weight=1.0, comparator="abs_diff"),
    "income": FieldConfig(weight=0.001, comparator="abs_diff"),
})
scores = scorer.score(source, target)
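
As a rough mental model of what such a score matrix contains, the sketch below computes a weighted sum of negative absolute differences by hand. Summing the per-field terms is an assumption made for illustration, and `abs_diff_scores` is a hypothetical helper, not part of setjoin.

```python
def abs_diff_scores(source_rows, target_rows, weights):
    """score[i][j] = -(sum over fields of weight * |source value - target value|)."""
    return [
        [
            -sum(w * abs(s[field] - t[field]) for field, w in weights.items())
            for t in target_rows
        ]
        for s in source_rows
    ]

source_rows = [
    {"age": 25, "income": 50000},
    {"age": 30, "income": 60000},
    {"age": 35, "income": 70000},
]
target_rows = [
    {"age": 26, "income": 51000},
    {"age": 31, "income": 59000},
    {"age": 34, "income": 72000},
]
weights = {"age": 1.0, "income": 0.001}

scores = abs_diff_scores(source_rows, target_rows, weights)
for row in scores:
    print([round(value, 2) for value in row])
```

With these weights a one-year age gap counts the same as a 1,000-unit income gap, and each source row scores highest (closest to zero) against the target row in the same position.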

Structure-Aware Matching (Groups)

import pandas as pd
from setjoin import match, HierarchySpec, Scorer, FieldConfig

# Survey data with household IDs
survey = pd.DataFrame({
    "household_id": [1, 1, 2, 2],
    "age": [35, 10, 45, 42],
    "income": [50000, 0, 60000, 58000],
})

# Admin records with household IDs
admin = pd.DataFrame({
    "household_id": [101, 101, 102, 102],
    "age": [36, 11, 44, 43],
    "income": [51000, 0, 59000, 57000],
})

# Build score matrix (higher = better match, abs_diff returns negative distances)
scorer = Scorer({
    "age": FieldConfig(weight=1.0, comparator="abs_diff"),
    "income": FieldConfig(weight=0.0001, comparator="abs_diff"),
})
scores = scorer.score(survey, admin)

# Define hierarchy from dataframes
hierarchy = HierarchySpec.from_dataframe(
    survey, admin,
    source_group_col="household_id",
    target_group_col="household_id",
)

# Match: all members of survey household 1 -> same admin household
result = match(scores, method="structure_aware", hierarchy=hierarchy)
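
For orientation, the group mappings passed to HierarchySpec in the Quick Start are dictionaries from group ID to row positions. The sketch below builds that shape from a column of IDs. It assumes `from_dataframe` produces something equivalent; `groups_from_column` is a hypothetical helper.

```python
from collections import defaultdict

def groups_from_column(ids):
    """Map each group ID to the list of row positions that carry it."""
    groups = defaultdict(list)
    for position, group_id in enumerate(ids):
        groups[group_id].append(position)
    return dict(groups)

source_groups = groups_from_column([1, 1, 2, 2])          # survey household_id
target_groups = groups_from_column([101, 101, 102, 102])  # admin household_id
print(source_groups)  # {1: [0, 1], 2: [2, 3]}
print(target_groups)  # {101: [0, 1], 102: [2, 3]}
```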

Soft Matching (Uncertainty)

import numpy as np
from setjoin import soft_match

scores = np.array([
    [10.0, 9.0],
    [9.0, 10.0],
])

# Get probabilistic weights instead of hard assignments
weights = soft_match(scores, regularization=0.5)
print(weights.matrix)  # Soft assignment probabilities
print(weights.to_hard())  # Convert to hard matches when needed
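
Entropy-regularized transport is commonly solved with Sinkhorn iterations. The sketch below is a generic version for a square score matrix, under the assumption that scores are divided by the regularization strength before exponentiation. The `sinkhorn` helper is hypothetical, and how `soft_match` parameterizes the problem internally is not documented here.

```python
import math

def sinkhorn(scores, regularization, iterations=200):
    """Soft assignment matrix whose rows and columns each sum to about 1."""
    n = len(scores)
    top = max(max(row) for row in scores)
    # Gibbs kernel; subtracting the max keeps exp() from overflowing.
    kernel = [[math.exp((s - top) / regularization) for s in row] for row in scores]
    u, v = [1.0] * n, [1.0] * n
    for _ in range(iterations):
        # Alternately rescale rows and columns toward unit sums.
        u = [1.0 / sum(kernel[i][j] * v[j] for j in range(n)) for i in range(n)]
        v = [1.0 / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(n)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(n)] for i in range(n)]

scores = [
    [10.0, 9.0],
    [9.0, 10.0],
]
plan = sinkhorn(scores, regularization=0.5)
for row in plan:
    print([round(p, 3) for p in row])  # [0.881, 0.119] then [0.119, 0.881]

# Hard matches: take the most probable target for each source row.
hard = [(i, max(range(len(row)), key=row.__getitem__)) for i, row in enumerate(plan)]
print(hard)  # [(0, 0), (1, 1)]
```

A larger regularization value flattens the probabilities toward 0.5 each; a smaller one pushes them toward the hard assignment.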

Calibration to Known Marginals

import numpy as np
import pandas as pd
from setjoin import calibrated_match, CalibrationSpec

scores = np.eye(100) * 10  # 100 records
source_df = pd.DataFrame({"region": ["north"] * 60 + ["south"] * 40})

# Target: 50/50 split, not the 60/40 in source
calibration = CalibrationSpec(
    margins={"region": {"north": 0.5, "south": 0.5}}
)

result = calibrated_match(scores, source_df, calibration)
print(result.weights)  # Calibration weights for each match
print(result.calibration_achieved)  # Achieved proportions
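
The weighting arithmetic behind this example is simple when there is a single margin: each record's weight is its category's target share divided by the observed share. The sketch below shows only that step. `poststratify` is a hypothetical helper, and raking over several margins would repeat this adjustment margin by margin until the weights stabilize.

```python
from collections import Counter

def poststratify(categories, target_shares):
    """Weight each record so that weighted category shares equal target_shares."""
    counts = Counter(categories)
    n = len(categories)
    return [target_shares[c] / (counts[c] / n) for c in categories]

regions = ["north"] * 60 + ["south"] * 40
weights = poststratify(regions, {"north": 0.5, "south": 0.5})

north_weight = weights[0]    # 0.5 / 0.6, about 0.833
south_weight = weights[-1]   # 0.5 / 0.4 = 1.25
north_share = sum(w for w, r in zip(weights, regions) if r == "north") / sum(weights)
print(round(north_weight, 3), round(south_weight, 3), round(north_share, 3))
```

Over-represented north records are weighted down and south records up, so the weighted sample hits the 50/50 target.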

API Overview

Function                  Purpose
match()                   Main entry point - routes to greedy, hungarian, or structure_aware
hungarian_match()         Optimal 1-to-1 assignment maximizing total score
greedy_match()            Fast heuristic picking highest scores first
structure_aware_match()   Optimal assignment preserving group structure
soft_match()              Probabilistic weights via entropy-regularized transport
calibrated_match()        Match + rake weights to hit target marginals
Scorer                    Build score matrices from DataFrames with configurable comparators
HierarchySpec             Define group structure for structure-aware matching
CalibrationSpec           Define target marginal distributions

License

MIT
