hsds_entity_resolution

Reusable Dagster component package for HSDS entity resolution workflows.

hsds_entity_resolution helps community organizations deduplicate HSDS data and orchestrate the recurring validation checks that long-running community data sharing partnerships depend on.

Project goals

  • Improve entity matching quality across partner-provided HSDS datasets
  • Reduce duplicate records that block trusted cross-organization coordination
  • Run repeatable validation and quality checks as data pipelines evolve
  • Support sustainable, long-term community data sharing operations

Tooling

  • Dagster (dagster, dg): pipeline orchestration, definitions, and local development UI
  • dbt (Snowflake adapter): SQL management for source denormalization and incremental persistence
  • dagster-dbt: Dagster integration that invokes dbt staging and mart phases inside jobs
  • Pydantic v2: typed data models and validation for HSDS entities and pipeline I/O (see the model sketch after this list)
  • Ruff: Python formatting and linting for fast local feedback
  • Pyright: static type checking for src/ and tests/
  • Codacy CLI (.codacy/cli.sh): static analysis and security scanning (Pylint, Semgrep, Lizard, Trivy)
  • uv: dependency and virtual environment management
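
To give a feel for the Pydantic layer, here is a minimal sketch of a typed HSDS entity model. The field names are illustrative, not the package's actual schema:

from pydantic import BaseModel


class Organization(BaseModel):
    """Hypothetical HSDS organization record used for matching (illustrative fields)."""

    id: str
    name: str
    alternate_name: str | None = None
    email: str | None = None
    website: str | None = None


record = Organization(id="org-1", name="Example Community Services")
print(record.model_dump_json())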

Component Package Layout

Reusable Dagster components live in:

  • src/hsds_entity_resolution/dagster/components/

Core library code should live outside the Dagster adapter layer:

  • src/hsds_entity_resolution/core/
  • src/hsds_entity_resolution/types/
  • src/hsds_entity_resolution/config/

The canonical public component entry point is:

  • hsds_entity_resolution.dagster.components.EntityResolutionComponent

This module is exported through the Dagster registry entry-point group:

  • dagster_dg_cli.registry_modules
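
To verify the registration from Python, here is a small sketch using only the standard library (it assumes this package is installed in the current environment):

from importlib.metadata import entry_points

# List every module registered under the dg registry entry-point group.
for ep in entry_points(group="dagster_dg_cli.registry_modules"):
    print(ep.name, "->", ep.value)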

Getting started

Install dependencies

Ensure uv is installed following the official documentation, then run:

uv sync

Run the project

This repo has two entry points depending on what you are working on:

  • Run the IL211 pipeline (jobs, schedules, Snowflake): uv run dagster dev -m consumer.definitions
  • Develop the EntityResolutionComponent library: dg dev

For pipeline development and debugging, always use:

uv run dagster dev -m consumer.definitions

Then open http://localhost:3000 and go to Deployment → consumer.definitions → Jobs to find:

  • entity_resolution__il211_regional__organization
  • entity_resolution__il211_regional__service

Use the Launchpad tab on either job to configure a run (e.g. restrict to one source_schema for faster local testing) and launch it manually.
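
As an alternative to the Launchpad, here is a hedged sketch of launching one of these jobs in-process. It assumes consumer.definitions exposes a Definitions object named defs (the usual convention), and the run_config shape is illustrative:

from consumer.definitions import defs  # assumption: module exposes `defs`

job = defs.get_job_def("entity_resolution__il211_regional__organization")

# Pass run configuration here, e.g. to restrict to one source_schema for
# faster local testing; the exact config schema depends on the job's ops.
result = job.execute_in_process(run_config={})
assert result.success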

dg dev loads the reusable EntityResolutionComponent package — it intentionally has no jobs or assets of its own and is only useful when working on the component library itself.

dbt project (consumer/dbt/)

The pipeline uses a dbt project to manage all complex SQL in one place. It runs in two phases inside every Dagster job:

  • Staging (runs before Python ER), selected with --select staging: materializes the stg_service_denormalized and stg_organization_denormalized tables in DEDUPLICATION.ER_STAGING from the raw HSDS tables in NORSE_STAGING.
  • Marts (runs after Python ER stages its artifacts), selected with --select marts: incremental merge models upsert the artifact staging rows into the final output tables in DEDUPLICATION.ER_RUNTIME.
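
A rough sketch of that two-phase pattern using dagster-dbt's DbtCliResource, with the asset and job wiring omitted (paths and ordering follow the list above):

from dagster_dbt import DbtCliResource

dbt = DbtCliResource(project_dir="consumer/dbt")

# Phase 1: build the denormalized staging tables before Python ER runs.
dbt.cli(["run", "--select", "staging"]).wait()

# ... Python entity resolution runs and stages its artifacts ...

# Phase 2: incremental merge of artifact rows into the mart tables.
dbt.cli(["run", "--select", "marts"]).wait()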

Do you need to run any dbt commands before startup?

No. dagster dev -m consumer.definitions starts cleanly — the DbtCliResource is just a resource handle at startup and triggers no dbt execution. dbt parses and compiles the project automatically when each job phase runs.

No external dbt packages are used, so dbt deps is never required. dbt build is not used; Dagster controls execution order via the phased job structure.

Useful sanity check during development

After editing the dbt project, validate syntax and macro references without hitting Snowflake:

cd consumer/dbt
uv run dbt parse --profiles-dir .

This confirms all Jinja loops compile, macro calls are valid, and sources.yml references are consistent.

Required environment variables for dbt

  • SNOWFLAKE_ACCOUNT: Snowflake account identifier (required)
  • SNOWFLAKE_USERNAME: Snowflake username (required)
  • SNOWFLAKE_PASSWORD: Snowflake password (required unless SNOWFLAKE_PRIVATE_KEY_PATH is set)
  • SNOWFLAKE_ROLE: Snowflake role (default: SYSADMIN)
  • SNOWFLAKE_WAREHOUSE: Snowflake virtual warehouse (required)
  • ER_TARGET_DATABASE: database for runtime and reconciliation tables (default: DEDUPLICATION)
  • ER_RUNTIME_SCHEMA: schema for mart output tables (default: ER_RUNTIME)
  • ER_INCREMENTAL_STATE_SCHEMA: schema for incremental state tables (default: ER_INCREMENTAL_STATE)
  • ER_STAGING_DATABASE: database for persistent staging tables (default: DEDUPLICATION)
  • ER_STAGING_SCHEMA: schema for persistent staging tables (default: ER_STAGING)
  • ER_HSDS_DATABASE: source HSDS database (default: NORSE_STAGING)
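
A small sketch for checking the non-defaulted variables before a run (variable names taken from the list above):

import os

required = ["SNOWFLAKE_ACCOUNT", "SNOWFLAKE_USERNAME", "SNOWFLAKE_WAREHOUSE"]
missing = [name for name in required if not os.environ.get(name)]

# Password and key-pair auth are alternatives; one of the two must be set.
if not (os.environ.get("SNOWFLAKE_PASSWORD") or os.environ.get("SNOWFLAKE_PRIVATE_KEY_PATH")):
    missing.append("SNOWFLAKE_PASSWORD or SNOWFLAKE_PRIVATE_KEY_PATH")

if missing:
    raise SystemExit(f"Missing required environment variables: {missing}")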

dbt project structure

consumer/dbt/
  dbt_project.yml          — project config, model materialization defaults
  profiles.yml             — Snowflake connection (env-var based; DbtCliResource overrides in prod)
  models/
    sources.yml            — er_staging source definitions with not_null/unique tests
    schema.yml             — schema tests for staging and mart models
    staging/
      stg_service_denormalized.sql       — multi-tenant UNION over target_schemas
      stg_organization_denormalized.sql  — multi-tenant UNION over target_schemas
    marts/
      denormalized_service_cache.sql
      denormalized_organization_cache.sql
      deduplication_run.sql
      duplicate_pairs.sql
      duplicate_pair_scores.sql
      duplicate_reasons.sql
      mitigated_pairs.sql
      duplicate_clusters.sql
      duplicate_cluster_pairs.sql
  macros/
    taxonomy_rollup.sql         — ARRAY_AGG of taxonomy objects for service or org
    location_rollup_service.sql — SAL → LOCATION → ADDRESS for services
    location_rollup_org.sql     — LOCATION.ORGANIZATION_ID → ADDRESS for orgs
    phone_rollup_service.sql    — 3-path phone UNION for services
    phone_rollup_org.sql        — 4-path phone UNION for organizations
    service_rollup.sql          — org's services with nested taxonomy codes
    service_contact_rollup.sql  — service-level email/website rollup to org

Contributing

See CONTRIBUTING.md for pull request requirements, quality checks, and review expectations.

Additional Docs

Using This In Another Dagster Repo

  1. Publish or install this package (for example: pip install hsds-record-matcher).
  2. Confirm discovery in the target environment:

     dg list components --package hsds_entity_resolution

  3. Use the component key in YAML:

     type: hsds_entity_resolution.dagster.components.EntityResolutionComponent
     attributes: {}

Publishing

This package is set up to publish to PyPI from GitHub Actions via Trusted Publishing.

PyPI Trusted Publisher settings

Whether you are adding a pending publisher (before the first release) or a normal publisher on an existing project, use:

  • PyPI project name: hsds-record-matcher
  • Owner: 211-Connect
  • Repository name: hsds-entity-resolution
  • Workflow name: publish.yml
  • Environment name: pypi

The repository name field should be only the repository name, not owner/repo.

The distribution name on PyPI is independent from the import path in Python:

  • Install name: hsds-record-matcher
  • Import path: hsds_entity_resolution
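
In practice:

# Installed as the distribution name:
#   pip install hsds-record-matcher
# Imported under the package name:
import hsds_entity_resolution
from hsds_entity_resolution.dagster.components import EntityResolutionComponent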

Release flow

  1. Update version in pyproject.toml.
  2. Merge or push that change to main.
  3. GitHub Actions will build the wheel and sdist, validate them with twine check, and publish to PyPI through the pypi environment if the version changed.

You can also run the publish workflow manually with workflow_dispatch.
