# hsds_entity_resolution

Reusable Dagster component package for HSDS entity resolution workflows.

`hsds_entity_resolution` helps community organizations deduplicate HSDS data and orchestrate
continual checks that support long-running community data sharing partnerships.
## Project goals

- Improve entity matching quality across partner-provided HSDS datasets
- Reduce duplicate records that block trusted cross-organization coordination
- Run repeatable validation and quality checks as data pipelines evolve
- Support sustainable, long-term community data sharing operations
## Tooling

- Dagster (`dagster`, `dg`): pipeline orchestration, definitions, and local development UI
- dbt (Snowflake adapter): SQL management for source denormalization and incremental persistence
- dagster-dbt: Dagster integration that invokes dbt staging and mart phases inside jobs
- Pydantic v2: typed data models and validation for HSDS entities and pipeline I/O
- Ruff: Python formatting and linting for fast local feedback
- Pyright: static type checking for `src/` and `tests/`
- Codacy CLI (`.codacy/cli.sh`): static analysis and security scanning (Pylint, Semgrep, Lizard, Trivy)
- uv: dependency and virtual environment management
## Component Package Layout

Reusable Dagster components live in:

```
src/hsds_entity_resolution/dagster/components/
```

Core library code should live outside the Dagster adapter layer:

```
src/hsds_entity_resolution/core/
src/hsds_entity_resolution/types/
src/hsds_entity_resolution/config/
```

The canonical public component entry point is
`hsds_entity_resolution.dagster.components.EntityResolutionComponent`. This module is
exported through the Dagster registry entry-point group `dagster_dg_cli.registry_modules`.
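For reference, registering a module under that entry-point group in `pyproject.toml` could look like the sketch below. The group name comes from this README; the left-hand entry name and the exact module value are assumptions based on this project's import path, not copied from its actual `pyproject.toml`:

```toml
# Hypothetical entry-point registration -- verify against the package's
# real pyproject.toml before relying on it.
[project.entry-points."dagster_dg_cli.registry_modules"]
hsds_entity_resolution = "hsds_entity_resolution.dagster.components"
```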
## Getting started

### Install dependencies

Ensure uv is installed following the official documentation, then run:

```
uv sync
```
### Run the project

This repo has two entry points depending on what you are working on:

| Goal | Command |
|---|---|
| Run the IL211 pipeline (jobs, schedules, Snowflake) | `uv run dagster dev -m consumer.definitions` |
| Develop the EntityResolutionComponent library | `dg dev` |

For pipeline development and debugging, always use:

```
uv run dagster dev -m consumer.definitions
```
Then open http://localhost:3000 and go to Deployment → consumer.definitions → Jobs to find:

- `entity_resolution__il211_regional__organization`
- `entity_resolution__il211_regional__service`

Use the Launchpad tab on either job to configure a run (e.g. restrict to one
`source_schema` for faster local testing) and launch it manually.
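As one illustration, a Launchpad run config restricting a run to a single schema might look like the YAML below. Only the `source_schema` knob is mentioned in this README; the surrounding op name and key structure are assumptions, so copy the real shape from the Launchpad's scaffolded config instead:

```yaml
# Hypothetical run config -- op name and nesting are illustrative only.
ops:
  resolve_entities:
    config:
      source_schema: IL211_REGIONAL
```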
`dg dev` loads the reusable EntityResolutionComponent package — it intentionally
has no jobs or assets of its own and is only useful when working on the component
library itself.
## dbt project (consumer/dbt/)

The pipeline uses a dbt project to manage all complex SQL in one place. It runs in two phases inside every Dagster job:

| Phase | dbt select | What it does |
|---|---|---|
| Staging (before Python ER) | `--select staging` | Materializes `stg_service_denormalized` and `stg_organization_denormalized` tables in `DEDUPLICATION.ER_STAGING` from raw HSDS tables in `NORSE_STAGING` |
| Marts (after Python ER stages artifacts) | `--select marts` | Incremental merge models upsert artifact staging rows into the final output tables in `DEDUPLICATION.ER_RUNTIME` |
### Do you need to run any dbt commands before startup?

No. `dagster dev -m consumer.definitions` starts cleanly — the `DbtCliResource`
is just a resource handle at startup and triggers no dbt execution. dbt parses
and compiles the project automatically when each job phase runs.

No external dbt packages are used, so `dbt deps` is never required. `dbt build`
is not used; Dagster controls execution order via the phased job structure.
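The phased ordering described above can be reduced to a tiny sketch. The function and log strings below are illustrative stand-ins, not this project's actual job code; the point is only that Dagster, not dbt, fixes the staging → Python ER → marts sequence:

```python
# Minimal sketch of the phased job structure: three steps in a fixed order.
# Names and log strings are placeholders, not the real job's API.
def run_entity_resolution_job(log: list[str]) -> list[str]:
    """Run the three phases in the order the Dagster job enforces."""
    log.append("dbt run --select staging")   # phase 1: staging models
    log.append("python entity resolution")   # phase 2: Python ER stages artifacts
    log.append("dbt run --select marts")     # phase 3: incremental merge marts
    return log

phases = run_entity_resolution_job([])
```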
### Useful sanity check during development

After editing the dbt project, validate syntax and macro references without hitting Snowflake:

```
cd consumer/dbt
uv run dbt parse --profiles-dir .
```

This confirms all Jinja loops compile, macro calls are valid, and `sources.yml`
references are consistent.
### Required environment variables for dbt

| Variable | Default | Purpose |
|---|---|---|
| `SNOWFLAKE_ACCOUNT` | — | Snowflake account identifier |
| `SNOWFLAKE_USERNAME` | — | Snowflake username |
| `SNOWFLAKE_PASSWORD` | — | Snowflake password (or use `SNOWFLAKE_PRIVATE_KEY_PATH`) |
| `SNOWFLAKE_ROLE` | `SYSADMIN` | Snowflake role |
| `SNOWFLAKE_WAREHOUSE` | — | Snowflake virtual warehouse |
| `ER_TARGET_DATABASE` | `DEDUPLICATION` | Database for runtime and reconciliation tables |
| `ER_RUNTIME_SCHEMA` | `ER_RUNTIME` | Schema for mart output tables |
| `ER_INCREMENTAL_STATE_SCHEMA` | `ER_INCREMENTAL_STATE` | Schema for incremental state tables |
| `ER_STAGING_DATABASE` | `DEDUPLICATION` | Database for persistent staging tables |
| `ER_STAGING_SCHEMA` | `ER_STAGING` | Schema for persistent staging tables |
| `ER_HSDS_DATABASE` | `NORSE_STAGING` | Source HSDS database |
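The defaults column above can be expressed as a small resolver. The variable names and default values come straight from the table; `resolve_dbt_env` itself is a hypothetical helper for illustration, not part of this package's API:

```python
# Sketch: apply the documented defaults over explicitly set env vars.
# resolve_dbt_env is illustrative only; names/defaults mirror the table above.
DEFAULTS = {
    "SNOWFLAKE_ROLE": "SYSADMIN",
    "ER_TARGET_DATABASE": "DEDUPLICATION",
    "ER_RUNTIME_SCHEMA": "ER_RUNTIME",
    "ER_INCREMENTAL_STATE_SCHEMA": "ER_INCREMENTAL_STATE",
    "ER_STAGING_DATABASE": "DEDUPLICATION",
    "ER_STAGING_SCHEMA": "ER_STAGING",
    "ER_HSDS_DATABASE": "NORSE_STAGING",
}
REQUIRED = ["SNOWFLAKE_ACCOUNT", "SNOWFLAKE_USERNAME", "SNOWFLAKE_WAREHOUSE"]

def resolve_dbt_env(env: dict[str, str]) -> dict[str, str]:
    """Merge explicit settings over the documented defaults, flagging gaps."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError(f"missing required Snowflake settings: {missing}")
    overrides = {k: v for k, v in env.items() if k in DEFAULTS or k in REQUIRED}
    return {**DEFAULTS, **overrides}
```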
### dbt project structure

```
consumer/dbt/
  dbt_project.yml   — project config, model materialization defaults
  profiles.yml      — Snowflake connection (env-var based; DbtCliResource overrides in prod)
  models/
    sources.yml     — er_staging source definitions with not_null/unique tests
    schema.yml      — schema tests for staging and mart models
    staging/
      stg_service_denormalized.sql       — multi-tenant UNION over target_schemas
      stg_organization_denormalized.sql  — multi-tenant UNION over target_schemas
    marts/
      denormalized_service_cache.sql
      denormalized_organization_cache.sql
      deduplication_run.sql
      duplicate_pairs.sql
      duplicate_pair_scores.sql
      duplicate_reasons.sql
      mitigated_pairs.sql
      duplicate_clusters.sql
      duplicate_cluster_pairs.sql
  macros/
    taxonomy_rollup.sql          — ARRAY_AGG of taxonomy objects for service or org
    location_rollup_service.sql  — SAL → LOCATION → ADDRESS for services
    location_rollup_org.sql      — LOCATION.ORGANIZATION_ID → ADDRESS for orgs
    phone_rollup_service.sql     — 3-path phone UNION for services
    phone_rollup_org.sql         — 4-path phone UNION for organizations
    service_rollup.sql           — org's services with nested taxonomy codes
    service_contact_rollup.sql   — service-level email/website rollup to org
```
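The staging models UNION the same query shape across every tenant schema. A rough illustration of that pattern, written in Python rather than the project's actual dbt Jinja, with made-up schema and column names:

```python
# Illustrative sketch of the multi-tenant UNION used by the staging models.
# The real models do this in dbt Jinja; schema/column names are placeholders.
def build_multi_tenant_union(target_schemas: list[str]) -> str:
    """Build one UNION ALL query selecting the same shape from each schema."""
    selects = [
        f"SELECT '{schema}' AS source_schema, id, name FROM {schema}.SERVICE"
        for schema in target_schemas
    ]
    return "\nUNION ALL\n".join(selects)

sql = build_multi_tenant_union(["IL211_A", "IL211_B", "IL211_C"])
```

Tagging each branch with its `source_schema` keeps rows traceable to their tenant after the UNION.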
## Contributing

See CONTRIBUTING.md for pull request requirements, quality checks, and review expectations.
## Additional Docs

### Using This In Another Dagster Repo

- Publish or install this package (for example: `pip install hsds-record-matcher`).
- Confirm discovery in the target environment: `dg list components --package hsds_entity_resolution`
- Use the component key in YAML:

```yaml
type: hsds_entity_resolution.dagster.components.EntityResolutionComponent
attributes: {}
```
## Publishing

This package is set up to publish to PyPI from GitHub Actions via Trusted Publishing.

### PyPI Trusted Publisher settings

For the pending or normal PyPI publisher, use:

- PyPI project name: `hsds-record-matcher`
- Owner: `211-Connect`
- Repository name: `hsds-entity-resolution`
- Workflow name: `publish.yml`
- Environment name: `pypi`

The repository name field should be only the repository name, not `owner/repo`.

The distribution name on PyPI is independent from the import path in Python:

- Install name: `hsds-record-matcher`
- Import path: `hsds_entity_resolution`
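To see why both names must be documented, note that even the conventional dash-to-underscore normalization of the distribution name does not produce the import path here. This is a generic illustration of that convention, not a lookup of this package's real metadata:

```python
# The PyPI distribution name and the Python import path are independent.
# Normalizing dashes to underscores (a common convention) still yields a
# different string than the actual import path, so neither can be derived
# from the other.
dist_name = "hsds-record-matcher"
import_path = "hsds_entity_resolution"

normalized = dist_name.replace("-", "_")  # "hsds_record_matcher"
assert normalized != import_path
```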
### Release flow

1. Update `version` in `pyproject.toml`.
2. Merge or push that change to `main`.
3. GitHub Actions will build the wheel and sdist, validate them with `twine check`, and publish to PyPI through the `pypi` environment if the version changed.

You can also run the publish workflow manually with `workflow_dispatch`.
## File details

Details for the file `hsds_record_matcher-1.0.0.tar.gz`.

### File metadata

- Download URL: hsds_record_matcher-1.0.0.tar.gz
- Size: 600.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `679c47e043b05137018bfd8f2adfbee244988ece5fc61d64032e79fca19cbaab` |
| MD5 | `18fde70459cd106379e5a1dff5c0e395` |
| BLAKE2b-256 | `9f8f1ee8dbe81239555abd5c09aa7982ad7879386c27842fff30c2cffed30acd` |
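A downloaded artifact can be checked against the SHA256 digest above with the standard library. The file path in the commented lines is a placeholder for wherever you saved the sdist:

```python
import hashlib

# Verify a downloaded file against a published SHA256 digest.
def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA256 of the given bytes."""
    return hashlib.sha256(data).hexdigest()

expected = "679c47e043b05137018bfd8f2adfbee244988ece5fc61d64032e79fca19cbaab"
# with open("hsds_record_matcher-1.0.0.tar.gz", "rb") as f:
#     assert sha256_hex(f.read()) == expected
```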
### Provenance

The following attestation bundles were made for `hsds_record_matcher-1.0.0.tar.gz`:

Publisher: `publish.yml` on 211-Connect/hsds-entity-resolution

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: hsds_record_matcher-1.0.0.tar.gz
- Subject digest: `679c47e043b05137018bfd8f2adfbee244988ece5fc61d64032e79fca19cbaab`
- Sigstore transparency entry: 1364453826
- Permalink: 211-Connect/hsds-entity-resolution@f5bc910977bc9782b67748cf0f20df73e9af0ef4
- Branch / Tag: refs/heads/main
- Owner: https://github.com/211-Connect
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@f5bc910977bc9782b67748cf0f20df73e9af0ef4
- Trigger Event: push
## File details

Details for the file `hsds_record_matcher-1.0.0-py3-none-any.whl`.

### File metadata

- Download URL: hsds_record_matcher-1.0.0-py3-none-any.whl
- Size: 552.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `a04bd8678b24680f98daf4bbee5aaa22e739a9e92c665918c8363b24427cc9f8` |
| MD5 | `c10f8f160905aebb6e78dab1c165b936` |
| BLAKE2b-256 | `1473eb09c495e5aa5beab1c7983611c5e9c3d2e9443db9575ac5a661d3d2effb` |
### Provenance

The following attestation bundles were made for `hsds_record_matcher-1.0.0-py3-none-any.whl`:

Publisher: `publish.yml` on 211-Connect/hsds-entity-resolution

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: hsds_record_matcher-1.0.0-py3-none-any.whl
- Subject digest: `a04bd8678b24680f98daf4bbee5aaa22e739a9e92c665918c8363b24427cc9f8`
- Sigstore transparency entry: 1364453910
- Permalink: 211-Connect/hsds-entity-resolution@f5bc910977bc9782b67748cf0f20df73e9af0ef4
- Branch / Tag: refs/heads/main
- Owner: https://github.com/211-Connect
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@f5bc910977bc9782b67748cf0f20df73e9af0ef4
- Trigger Event: push