hsds_entity_resolution

Reusable Dagster component package for HSDS entity resolution workflows.

hsds_entity_resolution helps community organizations deduplicate HSDS data and orchestrate the recurring validation checks that long-running community data sharing partnerships depend on.

Project goals

  • Improve entity matching quality across partner-provided HSDS datasets
  • Reduce duplicate records that block trusted cross-organization coordination
  • Run repeatable validation and quality checks as data pipelines evolve
  • Support sustainable, long-term community data sharing operations

Tooling

  • Dagster (dagster, dg): pipeline orchestration, definitions, and local development UI
  • dbt (Snowflake adapter): SQL management for source denormalization and incremental persistence
  • dagster-dbt: Dagster integration that invokes dbt staging and mart phases inside jobs
  • Pydantic v2: typed data models and validation for HSDS entities and pipeline I/O (see the model sketch after this list)
  • Ruff: Python formatting and linting for fast local feedback
  • Pyright: static type checking for src/ and tests/
  • Codacy CLI (.codacy/cli.sh): static analysis and security scanning (Pylint, Semgrep, Lizard, Trivy)
  • uv: dependency and virtual environment management
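
To give a feel for the Pydantic layer, here is a minimal sketch of a typed HSDS entity model. The field names are illustrative, not the package's actual schema:

from pydantic import BaseModel


class Organization(BaseModel):
    """Hypothetical HSDS organization record used for matching (illustrative fields)."""

    id: str
    name: str
    alternate_name: str | None = None
    email: str | None = None
    website: str | None = None


record = Organization(id="org-1", name="Example Community Services")
print(record.model_dump_json())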

Component Package Layout

Reusable Dagster components live in:

  • src/hsds_entity_resolution/dagster/components/

Core library code should live outside the Dagster adapter layer:

  • src/hsds_entity_resolution/core/
  • src/hsds_entity_resolution/types/
  • src/hsds_entity_resolution/config/

The canonical public component entry point is:

  • hsds_entity_resolution.dagster.components.EntityResolutionComponent

This module is exported through the Dagster registry entry-point group:

  • dagster_dg_cli.registry_modules
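
To verify the registration from Python, here is a small sketch using only the standard library (it assumes this package is installed in the current environment):

from importlib.metadata import entry_points

# List every module registered under the dg registry entry-point group.
for ep in entry_points(group="dagster_dg_cli.registry_modules"):
    print(ep.name, "->", ep.value)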

Getting started

Install dependencies

Ensure uv is installed following the official documentation, then run:

uv sync

Run the project

This repo has two entry points depending on what you are working on:

  • Run the IL211 pipeline (jobs, schedules, Snowflake): uv run dagster dev -m consumer.definitions
  • Develop the EntityResolutionComponent library: dg dev

For pipeline development and debugging, always use:

uv run dagster dev -m consumer.definitions

Then open http://localhost:3000 and go to Deployment → consumer.definitions → Jobs to find:

  • entity_resolution__il211_regional__organization
  • entity_resolution__il211_regional__service

Use the Launchpad tab on either job to configure a run (e.g. restrict to one source_schema for faster local testing) and launch it manually.
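
As an alternative to the Launchpad, here is a hedged sketch of launching one of these jobs in-process. It assumes consumer.definitions exposes a Definitions object named defs (the usual convention), and the run_config shape is illustrative:

from consumer.definitions import defs  # assumption: module exposes `defs`

job = defs.get_job_def("entity_resolution__il211_regional__organization")

# Pass run configuration here, e.g. to restrict to one source_schema for
# faster local testing; the exact config schema depends on the job's ops.
result = job.execute_in_process(run_config={})
assert result.success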

dg dev loads the reusable EntityResolutionComponent package — it intentionally has no jobs or assets of its own and is only useful when working on the component library itself.

dbt project (consumer/dbt/)

The pipeline uses a dbt project to manage all complex SQL in one place. It runs in two phases inside every Dagster job:

  • Staging (runs before Python ER), selected with --select staging: materializes the stg_service_denormalized and stg_organization_denormalized tables in DEDUPLICATION.ER_STAGING from the raw HSDS tables in NORSE_STAGING.
  • Marts (runs after Python ER stages its artifacts), selected with --select marts: incremental merge models upsert the artifact staging rows into the final output tables in DEDUPLICATION.ER_RUNTIME.
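
A rough sketch of that two-phase pattern using dagster-dbt's DbtCliResource, with the asset and job wiring omitted (paths and ordering follow the list above):

from dagster_dbt import DbtCliResource

dbt = DbtCliResource(project_dir="consumer/dbt")

# Phase 1: build the denormalized staging tables before Python ER runs.
dbt.cli(["run", "--select", "staging"]).wait()

# ... Python entity resolution runs and stages its artifacts ...

# Phase 2: incremental merge of artifact rows into the mart tables.
dbt.cli(["run", "--select", "marts"]).wait()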

Do you need to run any dbt commands before startup?

No. dagster dev -m consumer.definitions starts cleanly — the DbtCliResource is just a resource handle at startup and triggers no dbt execution. dbt parses and compiles the project automatically when each job phase runs.

No external dbt packages are used, so dbt deps is never required. dbt build is not used; Dagster controls execution order via the phased job structure.

Useful sanity check during development

After editing the dbt project, validate syntax and macro references without hitting Snowflake:

cd consumer/dbt
uv run dbt parse --profiles-dir .

This confirms all Jinja loops compile, macro calls are valid, and sources.yml references are consistent.

Required environment variables for dbt

  • SNOWFLAKE_ACCOUNT: Snowflake account identifier (required)
  • SNOWFLAKE_USERNAME: Snowflake username (required)
  • SNOWFLAKE_PASSWORD: Snowflake password (required unless SNOWFLAKE_PRIVATE_KEY_PATH is set)
  • SNOWFLAKE_ROLE: Snowflake role (default: SYSADMIN)
  • SNOWFLAKE_WAREHOUSE: Snowflake virtual warehouse (required)
  • ER_TARGET_DATABASE: database for runtime and reconciliation tables (default: DEDUPLICATION)
  • ER_RUNTIME_SCHEMA: schema for mart output tables (default: ER_RUNTIME)
  • ER_INCREMENTAL_STATE_SCHEMA: schema for incremental state tables (default: ER_INCREMENTAL_STATE)
  • ER_STAGING_DATABASE: database for persistent staging tables (default: DEDUPLICATION)
  • ER_STAGING_SCHEMA: schema for persistent staging tables (default: ER_STAGING)
  • ER_HSDS_DATABASE: source HSDS database (default: NORSE_STAGING)
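
A small sketch for checking the non-defaulted variables before a run (variable names taken from the list above):

import os

required = ["SNOWFLAKE_ACCOUNT", "SNOWFLAKE_USERNAME", "SNOWFLAKE_WAREHOUSE"]
missing = [name for name in required if not os.environ.get(name)]

# Password and key-pair auth are alternatives; one of the two must be set.
if not (os.environ.get("SNOWFLAKE_PASSWORD") or os.environ.get("SNOWFLAKE_PRIVATE_KEY_PATH")):
    missing.append("SNOWFLAKE_PASSWORD or SNOWFLAKE_PRIVATE_KEY_PATH")

if missing:
    raise SystemExit(f"Missing required environment variables: {missing}")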

dbt project structure

consumer/dbt/
  dbt_project.yml          — project config, model materialization defaults
  profiles.yml             — Snowflake connection (env-var based; DbtCliResource overrides in prod)
  models/
    sources.yml            — er_staging source definitions with not_null/unique tests
    schema.yml             — schema tests for staging and mart models
    staging/
      stg_service_denormalized.sql       — multi-tenant UNION over target_schemas
      stg_organization_denormalized.sql  — multi-tenant UNION over target_schemas
    marts/
      denormalized_service_cache.sql
      denormalized_organization_cache.sql
      deduplication_run.sql
      duplicate_pairs.sql
      duplicate_pair_scores.sql
      duplicate_reasons.sql
      mitigated_pairs.sql
      duplicate_clusters.sql
      duplicate_cluster_pairs.sql
  macros/
    taxonomy_rollup.sql         — ARRAY_AGG of taxonomy objects for service or org
    location_rollup_service.sql — SAL → LOCATION → ADDRESS for services
    location_rollup_org.sql     — LOCATION.ORGANIZATION_ID → ADDRESS for orgs
    phone_rollup_service.sql    — 3-path phone UNION for services
    phone_rollup_org.sql        — 4-path phone UNION for organizations
    service_rollup.sql          — org's services with nested taxonomy codes
    service_contact_rollup.sql  — service-level email/website rollup to org

Contributing

See CONTRIBUTING.md for pull request requirements, quality checks, and review expectations.

Additional Docs

Using This In Another Dagster Repo

  1. Publish or install this package (for example: pip install hsds-record-matcher).
  2. Confirm discovery in the target environment:

     dg list components --package hsds_entity_resolution

  3. Use the component key in YAML:

     type: hsds_entity_resolution.dagster.components.EntityResolutionComponent
     attributes: {}

Publishing

This package is set up to publish to PyPI from GitHub Actions via Trusted Publishing.

PyPI Trusted Publisher settings

Whether you are adding a pending publisher (before the first release) or a normal publisher on an existing project, use:

  • PyPI project name: hsds-record-matcher
  • Owner: 211-Connect
  • Repository name: hsds-entity-resolution
  • Workflow name: publish.yml
  • Environment name: pypi

The repository name field should be only the repository name, not owner/repo.

The distribution name on PyPI is independent from the import path in Python:

  • Install name: hsds-record-matcher
  • Import path: hsds_entity_resolution
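
In practice:

# Installed as the distribution name:
#   pip install hsds-record-matcher
# Imported under the package name:
import hsds_entity_resolution
from hsds_entity_resolution.dagster.components import EntityResolutionComponent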

Release flow

  1. Update version in pyproject.toml.
  2. Merge or push that change to main.
  3. GitHub Actions will build the wheel and sdist, validate them with twine check, and publish to PyPI through the pypi environment if the version changed.

You can also run the publish workflow manually with workflow_dispatch.
