
hsds-record-matcher

Reusable Dagster component for incremental entity resolution on HSDS (Human Services Data Specification) records.

Install:

```shell
pip install hsds-record-matcher
```

What this package does

Community 211 directories and social-services platforms often ingest the same organizations and services from multiple data partners. Over time those records diverge — different spellings, phone formats, partial addresses — making it hard to know which rows describe the same real-world entity.

hsds-record-matcher provides a single Dagster component, EntityResolutionComponent, that runs an incremental seven-stage entity-resolution pipeline on your HSDS data:

| Stage | What happens |
| --- | --- |
| Clean entities | Normalize contact fields, compute content hashes, detect adds/changes/removals since the last run |
| Generate candidates | Block on overlap signals (email, phone, domain, taxonomy, location) to produce candidate pairs |
| Score candidates | Weighted combination of deterministic overlap signals, NLP fuzzy name/description matching, and optional ML scoring |
| Apply mitigation | Carry forward stable pairs, retire pairs for removed entities, detect pair identity continuity |
| Cluster pairs | Greedy correlation clustering groups high-confidence duplicate pairs into clusters |
| Materialize review queue | Pairs that score above the maybe threshold but below duplicate are surfaced for human review |
| Prepare artifacts | Package all stage outputs into a typed bundle ready for downstream persistence |

The component is storage-agnostic: it accepts and emits Polars DataFrames, so you wire it into whatever persistence layer your Dagster project uses.


Prerequisites

  • Python 3.10–3.13
  • A Dagster project managed with the dg CLI
  • Four upstream Dagster assets that supply denormalized HSDS entity data (see Input assets)

Quickstart

1. Discover the component

```shell
dg list components --package hsds_entity_resolution
```

2. Add to your component YAML

```yaml
type: hsds_entity_resolution.dagster.components.EntityResolutionComponent
attributes:
  team_id: my_team
  scope_id: regional
  entity_type: organization
```

That's enough to wire the component into your Dagster asset graph. The four required upstream assets default to the keys organization_entities, service_entities, previous_entity_index, and previous_pair_state_index — override them with the *_asset_key attributes if your project uses different names.
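For example, a sketch of overriding two of the upstream keys (the right-hand names are placeholders for whatever your project actually calls these assets):

```yaml
type: hsds_entity_resolution.dagster.components.EntityResolutionComponent
attributes:
  team_id: my_team
  scope_id: regional
  entity_type: organization
  organization_entities_asset_key: raw_org_entities   # hypothetical upstream key
  previous_entity_index_asset_key: er_entity_cache    # hypothetical upstream key
```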


Input assets

The component depends on four upstream assets. Each must yield a Polars DataFrame (or be coercible to one).

organization_entities and service_entities

Denormalized HSDS entity records. Required columns:

| Column | Type | Description |
| --- | --- | --- |
| entity_id | str | Stable unique identifier for this record |
| source_schema | str | Tenant or source identifier (e.g. il211_regional) |
| name | str | Entity name used for NLP scoring |
| description | str | Entity description used for NLP scoring |
| emails | list[str] | Normalized email addresses |
| phones | list[str] | Normalized phone numbers |
| websites | list[str] | Normalized domain names |
| locations | list[dict] | Structured address/location objects |
| taxonomies | list[dict] | AIRS taxonomy code objects |
| identifiers | list[dict] | External identifier objects |
| embedding_vector | list[float] | Pre-computed text embedding (e.g. 384-dim) |

Optional passthrough columns (display_name, display_description, alternate_name, short_description, application_process, fees_description, eligibility_description, resource_writer_name, assured_date, assurer_email, original_id, organization_original_id, organization_name, organization_id, services_rollup) are forwarded unchanged to the normalized entity cache and persistence artifacts.

previous_entity_index

The normalized_organization or normalized_service output from the previous run of this component. Pass an empty pl.DataFrame() on first run.

previous_pair_state_index

The pair_state_index output from the previous run of this component. Pass an empty pl.DataFrame() on first run. The component uses this to carry forward stable pair scores and detect pair identity continuity across incremental runs.
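One way to picture pair identity continuity (this is an illustrative sketch, not the package's actual ID scheme): if a pair ID is derived deterministically from the sorted entity IDs, the same two entities always map to the same pair across runs, so carried-forward scores can be matched up.

```python
import hashlib


# Hypothetical stable pair ID: sorting the two entity IDs first makes
# the ID order-independent, so (a, b) and (b, a) collide on purpose.
def pair_id(a: str, b: str) -> str:
    lo, hi = sorted((a, b))
    return hashlib.sha256(f"{lo}|{hi}".encode()).hexdigest()[:16]
```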


Component attributes

| Attribute | Type | Default | Description |
| --- | --- | --- | --- |
| team_id | str | "hsds" | Team identifier written into run metadata |
| scope_id | str | "default" | Deployment or region identifier |
| entity_type | "organization" \| "service" | "organization" | Which entity type this instance processes |
| policy_version | str | "hsds-er-v1" | Scoring policy version tag |
| model_version | str | "embedding-only-v1" | ML model version tag |
| explicit_backfill | bool | False | Force a full re-run even when no entity changes are detected |
| organization_entities_asset_key | str | "organization_entities" | Asset key for upstream org entities |
| service_entities_asset_key | str | "service_entities" | Asset key for upstream service entities |
| previous_entity_index_asset_key | str | "previous_entity_index" | Asset key for previous-run entity cache |
| previous_pair_state_index_asset_key | str | "previous_pair_state_index" | Asset key for previous-run pair state |
| output_asset_prefix | list[str] | [] | Prefix prepended to all output asset keys |
| constants_overrides | dict | {} | Deep-merge override for scoring constants (see Tuning) |

Output assets

The component emits sixteen assets. Using output_asset_prefix: ["my_prefix"] namespaces all keys under my_prefix/.

| Asset key | Description |
| --- | --- |
| normalized_organization | Cleaned, normalized organization records with content hashes |
| normalized_service | Cleaned, normalized service records with content hashes |
| entity_delta_summary | Single-row summary of added / changed / removed entity counts |
| removed_entity_ids | Entity IDs that were present last run but are absent this run |
| candidate_pairs | Raw blocked candidate pairs before scoring |
| scored_pairs | Candidate pairs with composite scores and tier labels (duplicate, maybe, non_duplicate) |
| pair_reasons | Per-feature evidence rows for every scored pair |
| mitigation_events | Pairs reassigned due to entity changes between runs |
| removed_pair_ids | Pair IDs retired this run (entity removed or score dropped) |
| pair_id_remap | Maps old pair IDs to new ones when pair identity shifts |
| clusters | Entity groups identified as duplicates via correlation clustering |
| cluster_pairs | Edges within each cluster |
| pair_state_index | Full pair state — feed this back as previous_pair_state_index next run |
| review_queue_items | Borderline pairs above the maybe threshold, ready for human review |
| run_summary | Single-row run metrics (candidate count, duplicate count, cluster count, etc.) |
| persistence_artifact_bundle | Dict containing all stage outputs packaged for bulk persistence |

Tuning scoring constants

The default thresholds are calibrated for HSDS data. Override any constant via constants_overrides without subclassing:

```yaml
type: hsds_entity_resolution.dagster.components.EntityResolutionComponent
attributes:
  team_id: my_team
  scope_id: regional
  entity_type: organization
  constants_overrides:
    scoring:
      duplicate_threshold: 0.85
      maybe_threshold: 0.70
      nlp:
        fuzzy_threshold: 0.90
```
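Because overrides are deep-merged into the defaults, nested keys you don't mention keep their default values. A small sketch of those merge semantics (illustrative, not the package's internal code):

```python
# Deep merge: override leaves replace default leaves, but sibling
# keys in the base survive untouched at every nesting level.
def deep_merge(base: dict, override: dict) -> dict:
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


defaults = {"scoring": {"duplicate_threshold": 0.82, "maybe_threshold": 0.68}}
overrides = {"scoring": {"duplicate_threshold": 0.85}}
merged = deep_merge(defaults, overrides)
# duplicate_threshold is overridden; maybe_threshold keeps its default
```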

Key thresholds:

| Constant | Default (org) | Default (service) | Description |
| --- | --- | --- | --- |
| scoring.duplicate_threshold | 0.82 | 0.70 | Minimum score to auto-cluster as duplicate |
| scoring.maybe_threshold | 0.68 | 0.62 | Minimum score to send to review queue |
| scoring.deterministic_section_weight | 0.45 | 0.40 | Weight of overlap-signal section |
| scoring.nlp_section_weight | 0.35 | 0.40 | Weight of NLP fuzzy-match section |
| scoring.ml_section_weight | 0.20 | 0.20 | Weight of ML section (disabled by default) |
| scoring.nlp.fuzzy_threshold | 0.88 | 0.86 | Minimum name similarity to count as NLP match |
| blocking.similarity_threshold | 0.75 | 0.75 | Minimum embedding cosine similarity for blocking |
| blocking.max_candidates_per_entity | 50 | 125 | Maximum candidate pairs per entity |
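To see how the thresholds interact with the section weights, here is a simplified tiering sketch using the documented organization defaults. The weighted sum mirrors the section weights above but is an illustration, not the package's actual scorer:

```python
# Organization defaults from the table above.
DUPLICATE_THRESHOLD = 0.82
MAYBE_THRESHOLD = 0.68


# Composite score = weighted sum of the three scoring sections,
# then bucketed into the tier labels used by scored_pairs.
def tier(deterministic: float, nlp: float, ml: float = 0.0) -> str:
    score = 0.45 * deterministic + 0.35 * nlp + 0.20 * ml
    if score >= DUPLICATE_THRESHOLD:
        return "duplicate"
    if score >= MAYBE_THRESHOLD:
        return "maybe"
    return "non_duplicate"
```

With ML scoring disabled (the default), the maximum reachable score is 0.45 + 0.35 = 0.80, which is why the duplicate threshold matters relative to which sections are active.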

Multiple instances

Deploy one component instance per entity type and scope:

```yaml
# organizations
type: hsds_entity_resolution.dagster.components.EntityResolutionComponent
attributes:
  team_id: my_team
  scope_id: regional
  entity_type: organization
  output_asset_prefix: ["org_er"]

---

# services
type: hsds_entity_resolution.dagster.components.EntityResolutionComponent
attributes:
  team_id: my_team
  scope_id: regional
  entity_type: service
  output_asset_prefix: ["svc_er"]
```

Links

Published as hsds_record_matcher 1.1.0 from the 211-Connect/hsds-entity-resolution repository via Trusted Publishing.