# hsds-record-matcher

Reusable Dagster component for incremental entity resolution on HSDS (Human Services Data Specification) records.

Install:

```shell
pip install hsds-record-matcher
```
## What this package does
Community 211 directories and social-services platforms often ingest the same organizations and services from multiple data partners. Over time those records diverge — different spellings, phone formats, partial addresses — making it hard to know which rows describe the same real-world entity.
`hsds-record-matcher` provides a single Dagster component, `EntityResolutionComponent`, that runs an incremental seven-stage entity-resolution pipeline on your HSDS data:
| Stage | What happens |
|---|---|
| Clean entities | Normalize contact fields, compute content hashes, detect adds/changes/removals since the last run |
| Generate candidates | Block on overlap signals (email, phone, domain, taxonomy, location) to produce candidate pairs |
| Score candidates | Weighted combination of deterministic overlap signals, NLP fuzzy name/description matching, and optional ML scoring |
| Apply mitigation | Carry forward stable pairs, retire pairs for removed entities, detect pair identity continuity |
| Cluster pairs | Greedy correlation clustering groups high-confidence duplicate pairs into clusters |
| Materialize review queue | Pairs that score above the maybe threshold but below duplicate are surfaced for human review |
| Prepare artifacts | Package all stage outputs into a typed bundle ready for downstream persistence |
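The clustering stage can be pictured as a greedy union-find pass over high-confidence pairs. Below is a minimal, pure-Python sketch of that idea; it is illustrative only (the package's actual correlation-clustering implementation, tie-breaking, and threshold handling may differ):

```python
# Greedy clustering of duplicate pairs via union-find (illustrative sketch).

def cluster_pairs(pairs, threshold=0.82):
    """Group entity IDs into clusters from (id_a, id_b, score) pairs.

    Pairs at or above `threshold` are treated as duplicate edges;
    connected components become clusters.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Process strongest edges first so clusters grow greedily from
    # the highest-confidence evidence.
    for id_a, id_b, score in sorted(pairs, key=lambda p: -p[2]):
        if score >= threshold:
            union(id_a, id_b)

    clusters = {}
    for entity in parent:
        clusters.setdefault(find(entity), set()).add(entity)
    return [members for members in clusters.values() if len(members) > 1]


pairs = [
    ("org_1", "org_2", 0.91),
    ("org_2", "org_3", 0.85),
    ("org_4", "org_5", 0.70),  # below threshold: left unclustered
]
print(cluster_pairs(pairs))  # one cluster: {org_1, org_2, org_3}
```

The default threshold of 0.82 matches the organization `scoring.duplicate_threshold` listed under Tuning.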
The component is storage-agnostic: it accepts and emits Polars DataFrames, so you wire it into whatever persistence layer your Dagster project uses.
## Prerequisites

- Python 3.10–3.13
- A Dagster project managed with the `dg` CLI
- Four upstream Dagster assets that supply denormalized HSDS entity data (see Input assets)
## Quickstart

1. Discover the component:

   ```shell
   dg list components --package hsds_entity_resolution
   ```

2. Add it to your component YAML:

   ```yaml
   type: hsds_entity_resolution.dagster.components.EntityResolutionComponent
   attributes:
     team_id: my_team
     scope_id: regional
     entity_type: organization
   ```

That's enough to wire the component into your Dagster asset graph. The four required upstream assets default to the keys `organization_entities`, `service_entities`, `previous_entity_index`, and `previous_pair_state_index`; override them with the `*_asset_key` attributes if your project uses different names.
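For example, if your project materializes the upstream data under different asset keys, a configuration along these lines would remap them (the key names `warehouse_org_entities` and `er_entity_cache` here are hypothetical placeholders, not defaults):

```yaml
type: hsds_entity_resolution.dagster.components.EntityResolutionComponent
attributes:
  team_id: my_team
  scope_id: regional
  entity_type: organization
  organization_entities_asset_key: warehouse_org_entities
  previous_entity_index_asset_key: er_entity_cache
```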
## Input assets

The component depends on four upstream assets. Each must yield a Polars DataFrame (or be coercible to one).

### organization_entities and service_entities

Denormalized HSDS entity records. Required columns:
| Column | Type | Description |
|---|---|---|
| `entity_id` | `str` | Stable unique identifier for this record |
| `source_schema` | `str` | Tenant or source identifier (e.g. `il211_regional`) |
| `name` | `str` | Entity name used for NLP scoring |
| `description` | `str` | Entity description used for NLP scoring |
| `emails` | `list[str]` | Normalized email addresses |
| `phones` | `list[str]` | Normalized phone numbers |
| `websites` | `list[str]` | Normalized domain names |
| `locations` | `list[dict]` | Structured address/location objects |
| `taxonomies` | `list[dict]` | AIRS taxonomy code objects |
| `identifiers` | `list[dict]` | External identifier objects |
| `embedding_vector` | `list[float]` | Pre-computed text embedding (e.g. 384-dim) |
Optional passthrough columns (`display_name`, `display_description`, `alternate_name`, `short_description`, `application_process`, `fees_description`, `eligibility_description`, `resource_writer_name`, `assured_date`, `assurer_email`, `original_id`, `organization_original_id`, `organization_name`, `organization_id`, `services_rollup`) are forwarded unchanged to the normalized entity cache and persistence artifacts.
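As a quick sanity check before wiring real data in, the required schema above can be expressed as a small validation helper. This helper is hypothetical (not part of the package) and checks plain dict records; a real check against a Polars DataFrame would inspect `df.schema` instead:

```python
# Required columns for organization_entities / service_entities records,
# per the table above, with their Python-level shapes.
REQUIRED_COLUMNS = {
    "entity_id": str,
    "source_schema": str,
    "name": str,
    "description": str,
    "emails": list,
    "phones": list,
    "websites": list,
    "locations": list,
    "taxonomies": list,
    "identifiers": list,
    "embedding_vector": list,
}

def missing_columns(record: dict) -> list[str]:
    """Return required column names absent from (or mistyped in) a record."""
    return [
        col for col, typ in REQUIRED_COLUMNS.items()
        if col not in record or not isinstance(record[col], typ)
    ]

record = {
    "entity_id": "org-001",
    "source_schema": "il211_regional",
    "name": "Example Food Pantry",
    "description": "Weekly food distribution.",
    "emails": ["info@example.org"],
    "phones": ["+13125550100"],
    "websites": ["example.org"],
    "locations": [{"city": "Chicago"}],
    "taxonomies": [{"code": "BD-1800"}],
    "identifiers": [],
    "embedding_vector": [0.0] * 384,
}
print(missing_columns(record))  # → []
```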
### previous_entity_index

The `normalized_organization` or `normalized_service` output from the previous run of this component. Pass an empty `pl.DataFrame()` on the first run.
### previous_pair_state_index

The `pair_state_index` output from the previous run of this component. Pass an empty `pl.DataFrame()` on the first run. The component uses this to carry forward stable pair scores and detect pair identity continuity across incremental runs.
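The incremental behaviour these inputs enable can be sketched as hash-based delta detection, as described in the Clean entities stage. The sketch below is illustrative only; the fields hashed and the hashing scheme used by the package are internal:

```python
import hashlib
import json

def content_hash(record: dict) -> str:
    """Stable hash of a record's content (field choice here is hypothetical)."""
    payload = json.dumps(record, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def detect_delta(current: dict, previous_hashes: dict) -> dict:
    """Classify entity IDs as added / changed / removed / unchanged.

    current:          {entity_id: record}
    previous_hashes:  {entity_id: content_hash} from the last run
    """
    added, changed, unchanged = [], [], []
    for entity_id, record in current.items():
        if entity_id not in previous_hashes:
            added.append(entity_id)
        elif previous_hashes[entity_id] != content_hash(record):
            changed.append(entity_id)
        else:
            unchanged.append(entity_id)
    removed = [e for e in previous_hashes if e not in current]
    return {"added": added, "changed": changed,
            "removed": removed, "unchanged": unchanged}

prev = {"org_1": content_hash({"name": "Food Pantry"}),
        "org_2": content_hash({"name": "Shelter"})}
curr = {"org_1": {"name": "Food Pantry"},        # unchanged
        "org_2": {"name": "Shelter & Housing"},  # changed
        "org_3": {"name": "New Clinic"}}         # added
print(detect_delta(curr, prev))
# → {'added': ['org_3'], 'changed': ['org_2'], 'removed': [], 'unchanged': ['org_1']}
```

Only added and changed entities need re-blocking and re-scoring on a given run, which is what keeps incremental runs cheap.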
## Component attributes

| Attribute | Type | Default | Description |
|---|---|---|---|
| `team_id` | `str` | `"hsds"` | Team identifier written into run metadata |
| `scope_id` | `str` | `"default"` | Deployment or region identifier |
| `entity_type` | `"organization"` \| `"service"` | `"organization"` | Which entity type this instance processes |
| `policy_version` | `str` | `"hsds-er-v1"` | Scoring policy version tag |
| `model_version` | `str` | `"embedding-only-v1"` | ML model version tag |
| `explicit_backfill` | `bool` | `False` | Force a full re-run even when no entity changes are detected |
| `organization_entities_asset_key` | `str` | `"organization_entities"` | Asset key for upstream org entities |
| `service_entities_asset_key` | `str` | `"service_entities"` | Asset key for upstream service entities |
| `previous_entity_index_asset_key` | `str` | `"previous_entity_index"` | Asset key for previous-run entity cache |
| `previous_pair_state_index_asset_key` | `str` | `"previous_pair_state_index"` | Asset key for previous-run pair state |
| `output_asset_prefix` | `list[str]` | `[]` | Prefix prepended to all output asset keys |
| `constants_overrides` | `dict` | `{}` | Deep-merge override for scoring constants (see Tuning) |
## Output assets

The component emits fifteen assets. Setting `output_asset_prefix: ["my_prefix"]` namespaces all keys under `my_prefix/`.

| Asset key | Description |
|---|---|
| `normalized_organization` | Cleaned, normalized organization records with content hashes |
| `normalized_service` | Cleaned, normalized service records with content hashes |
| `entity_delta_summary` | Single-row summary of added / changed / removed entity counts |
| `removed_entity_ids` | Entity IDs that were present last run but are absent this run |
| `candidate_pairs` | Raw blocked candidate pairs before scoring |
| `scored_pairs` | Candidate pairs with composite scores and tier labels (`duplicate`, `maybe`, `non_duplicate`) |
| `pair_reasons` | Per-feature evidence rows for every scored pair |
| `mitigation_events` | Pairs reassigned due to entity changes between runs |
| `removed_pair_ids` | Pair IDs retired this run (entity removed or score dropped) |
| `pair_id_remap` | Maps old pair IDs to new ones when pair identity shifts |
| `clusters` | Entity groups identified as duplicates via correlation clustering |
| `cluster_pairs` | Edges within each cluster |
| `pair_state_index` | Full pair state; feed this back as `previous_pair_state_index` next run |
| `review_queue_items` | Borderline pairs above the maybe threshold, ready for human review |
| `run_summary` | Single-row run metrics (candidate count, duplicate count, cluster count, etc.) |
| `persistence_artifact_bundle` | Dict containing all stage outputs packaged for bulk persistence |
## Tuning scoring constants

The default thresholds are calibrated for HSDS data. Override any constant via `constants_overrides` without subclassing:

```yaml
type: hsds_entity_resolution.dagster.components.EntityResolutionComponent
attributes:
  team_id: my_team
  scope_id: regional
  entity_type: organization
  constants_overrides:
    scoring:
      duplicate_threshold: 0.85
      maybe_threshold: 0.70
      nlp:
        fuzzy_threshold: 0.90
```

Key thresholds:
| Constant | Default (org) | Default (service) | Description |
|---|---|---|---|
| `scoring.duplicate_threshold` | 0.82 | 0.70 | Minimum score to auto-cluster as duplicate |
| `scoring.maybe_threshold` | 0.68 | 0.62 | Minimum score to send to review queue |
| `scoring.deterministic_section_weight` | 0.45 | 0.40 | Weight of overlap-signal section |
| `scoring.nlp_section_weight` | 0.35 | 0.40 | Weight of NLP fuzzy-match section |
| `scoring.ml_section_weight` | 0.20 | 0.20 | Weight of ML section (disabled by default) |
| `scoring.nlp.fuzzy_threshold` | 0.88 | 0.86 | Minimum name similarity to count as NLP match |
| `blocking.similarity_threshold` | 0.75 | 0.75 | Minimum embedding cosine similarity for blocking |
| `blocking.max_candidates_per_entity` | 50 | 125 | Maximum candidate pairs per entity |
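A back-of-the-envelope view of how the section weights and tier thresholds interact, assuming a simple weighted sum of per-section scores (the package's actual scoring formula, and how weights are renormalized when the ML section is disabled, are not specified here):

```python
# Illustrative composite scoring using the organization defaults above.
WEIGHTS = {"deterministic": 0.45, "nlp": 0.35, "ml": 0.20}
DUPLICATE_THRESHOLD = 0.82
MAYBE_THRESHOLD = 0.68

def composite_score(section_scores: dict) -> float:
    """Weighted sum of per-section scores, each in [0, 1]."""
    return sum(WEIGHTS[s] * section_scores.get(s, 0.0) for s in WEIGHTS)

def tier(score: float) -> str:
    if score >= DUPLICATE_THRESHOLD:
        return "duplicate"
    if score >= MAYBE_THRESHOLD:
        return "maybe"
    return "non_duplicate"

# Strong deterministic overlap plus a good fuzzy-name match, ML section absent:
score = composite_score({"deterministic": 1.0, "nlp": 0.95})
print(round(score, 4), tier(score))  # → 0.7825 maybe
```

Treat this purely as an illustration of the weighted-sum idea; consult the package source for the authoritative formula before relying on specific numbers.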
## Multiple instances

Deploy one component instance per entity type and scope:

```yaml
# organizations
type: hsds_entity_resolution.dagster.components.EntityResolutionComponent
attributes:
  team_id: my_team
  scope_id: regional
  entity_type: organization
  output_asset_prefix: ["org_er"]
---
# services
type: hsds_entity_resolution.dagster.components.EntityResolutionComponent
attributes:
  team_id: my_team
  scope_id: regional
  entity_type: service
  output_asset_prefix: ["svc_er"]
```