Chicago open data used to validate CEP SNFEI.

civic-data-identity-us-il

This repository hosts raw and processed Illinois datasets used for validating entity identity, canonicalization, and adapter behavior in the Civic Interconnect project.

The primary dataset is the City of Chicago Contracts Dataset, which provides 182,000+ procurement records used to test:

  • SNFEI identity stability across messy vendor names
  • EFS v1 name normalization
  • Adapter correctness for procurement verticals
  • Cross-record entity consolidation (e.g., same vendor appearing many times)
  • Address normalization (Chicago-specific conventions)
  • Exchange construction (buyer → seller → contract mapping)
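
As an illustration of the first four checks, here is a minimal sketch of name normalization feeding a hash-based identity key. It is not the actual EFS v1 or SNFEI algorithm: the suffix map, the function names (`normalize_name`, `identity_key`, `consolidate`), and the sample vendor names are all hypothetical and chosen only to show the idea of messy spellings converging on one key.

```python
import hashlib
import re
import unicodedata
from collections import defaultdict

# Hypothetical suffix map for illustration; the real EFS v1 rules are not shown here.
LEGAL_SUFFIXES = {
    "incorporated": "inc",
    "corporation": "corp",
    "company": "co",
    "limited": "ltd",
}


def normalize_name(raw: str) -> str:
    """Lowercase, strip accents and punctuation, and canonicalize legal suffixes."""
    text = unicodedata.normalize("NFKD", raw)
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = [LEGAL_SUFFIXES.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)


def identity_key(raw: str) -> str:
    """Hash the normalized name so equal normalized forms share one key."""
    return hashlib.sha256(normalize_name(raw).encode("utf-8")).hexdigest()


def consolidate(names):
    """Group raw name spellings by their identity key."""
    groups = defaultdict(list)
    for name in names:
        groups[identity_key(name)].append(name)
    return dict(groups)


# Made-up spellings of one vendor, as might appear across many contract rows.
variants = ["ACME Supply Co.", "Acme Supply Company", "acme  supply, co"]
groups = consolidate(variants)
```

In this sketch all three spellings fall into a single group. A stability test over a real fixture would assert the same property for every vendor known to be one entity.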

This repository contains both the full raw dataset and curated subsets designed as identity test fixtures.


Repository

data/raw/

Unmodified raw datasets retrieved directly from official public sources. Contains large files (up to ~50 MB).
These files are kept out of the main Civic Interconnect repository to avoid bloating it.

data/identity/

Curated, size-limited datasets (~5k–20k rows) used for:

  • testing SNFEI convergence
  • evaluating string normalization
  • training adapters on realistic noise patterns

These subsets are suitable for inclusion as examples in the main CEP spec repo.

docs/provenance/

Contains PROV-YAML metadata files describing dataset lineage, publishers, and retrieval activities. These files follow W3C PROV-DM conventions.
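
The schema of these files is not reproduced here, so the following is only a rough sketch of the PROV-DM core model (entities, activities, agents, and the relations between them) as a Python dictionary, with a small consistency check. The `ex:` identifiers, the list-based relation layout, and the `dangling_references` helper are assumptions made for illustration and may differ from the repository's actual YAML layout.

```python
# Simplified, hypothetical provenance record using PROV-DM terms.
record = {
    "entity": {"ex:chicago_contracts_raw": {"prov:type": "dataset"}},
    "activity": {"ex:retrieval": {}},
    "agent": {"ex:city_of_chicago": {"prov:type": "prov:Organization"}},
    "wasGeneratedBy": [
        {"prov:entity": "ex:chicago_contracts_raw", "prov:activity": "ex:retrieval"}
    ],
    "wasAttributedTo": [
        {"prov:entity": "ex:chicago_contracts_raw", "prov:agent": "ex:city_of_chicago"}
    ],
}

# For each relation, which section each role must point into.
RELATION_ROLES = {
    "wasGeneratedBy": {"prov:entity": "entity", "prov:activity": "activity"},
    "wasAttributedTo": {"prov:entity": "entity", "prov:agent": "agent"},
    "wasAssociatedWith": {"prov:activity": "activity", "prov:agent": "agent"},
}


def dangling_references(rec):
    """Return (relation, role, target) for every link to an undeclared id."""
    missing = []
    for relation, roles in RELATION_ROLES.items():
        for link in rec.get(relation, []):
            for role, section in roles.items():
                target = link.get(role)
                if target not in rec.get(section, {}):
                    missing.append((relation, role, target))
    return missing


problems = dangling_references(record)
```

A check like this can catch lineage files in which a relation names a dataset or publisher that was never declared.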

scripts/

Utility scripts for extracting and shaping subsets from raw data. The provided Python tool generates deterministic, identity-rich samples.
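
The script itself is not shown here. The sketch below illustrates one way a deterministic, identity-rich sample could be drawn: keep only vendors that appear under more than one name spelling, then order them by a seeded hash rather than a random generator so the same input always yields the same subset. The column names (`vendor_id`, `vendor_name`), the function names, and the demo rows are hypothetical.

```python
import hashlib
from collections import defaultdict


def stable_rank(value: str, seed: str) -> str:
    """Seeded hash used as a reproducible sort key."""
    return hashlib.sha256(f"{seed}:{value}".encode("utf-8")).hexdigest()


def identity_rich_sample(rows, key_field, name_field, max_rows, seed="v1"):
    """Select whole vendor groups with more than one spelling, in a stable order."""
    by_key = defaultdict(list)
    for row in rows:
        by_key[row[key_field]].append(row)

    # "Identity-rich" here means the same key appears under several distinct names.
    rich = {
        key: group
        for key, group in by_key.items()
        if len({r[name_field] for r in group}) > 1
    }

    sample = []
    for key in sorted(rich, key=lambda k: stable_rank(k, seed)):
        # Sort rows inside a group so input order does not affect the result.
        group = sorted(rich[key], key=lambda r: r[name_field])
        if len(sample) + len(group) > max_rows:
            break  # stop at the first group that would exceed the size limit
        sample.extend(group)
    return sample


# Made-up rows standing in for the raw contracts data.
rows = [
    {"vendor_id": "1", "vendor_name": "ACME SUPPLY CO"},
    {"vendor_id": "1", "vendor_name": "Acme Supply Company"},
    {"vendor_id": "2", "vendor_name": "Lakeside Paving"},
    {"vendor_id": "3", "vendor_name": "Loop Electric Inc"},
    {"vendor_id": "3", "vendor_name": "LOOP ELECTRIC, INC."},
]
sample = identity_rich_sample(rows, "vendor_id", "vendor_name", max_rows=10)
```

Changing `seed` reorders the vendor groups, which gives a different but equally reproducible subset when `max_rows` is smaller than the pool of eligible rows.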


Data Source

City of Chicago – Contracts Dataset

Full provenance is provided in docs/provenance/chicago_contracts.prov.yaml.


Citation

If you use this repository, please cite both:

  1. This repository (see CITATION.cff)
  2. The original City of Chicago dataset (automatically included in references)

Relationship to civic-interconnect

This repository serves as a data companion to the main specification and implementation in:

https://github.com/civic-interconnect/civic-interconnect

Only smaller derived files (e.g., 5k–20k row identity samples) are copied into the main repository under:

examples/identity/us_il_chicago/

The separation keeps CEP maintainable and free of large artifacts while preserving full reproducibility.


License

Raw public datasets retain their original license (Public Domain for Chicago Open Data).
All derived outputs, scripts, and documentation in this repository are licensed under the MIT License unless otherwise noted.
