Queryable open data catalog engine for DCAT-AP, StatDCAT-AP, CKAN, and SDMX

pinax

Note: This library has been written extensively with AI assistance (Claude Code). Users who are not comfortable with AI-generated code should take that into account before adopting it.

pinax is a Python library for building, managing, and querying statistical metadata catalogs. It is named after the "bibliographic work composed by Callimachus (310/305–240 BCE) that is popularly considered to be the first library catalog in the West" (Pinax).

Inspired by DCAT and the broader landscape of metadata standards (SDMX, DDI), pinax provides a general-purpose engine for catalog storage, discovery, and retrieval — designed to be embedded in ETL pipelines, data platforms, and analytical tooling rather than used as a standalone application. It pairs naturally with domain-specific libraries like sdmxlib for standards-aware workflows.

import pinax as pk

What it does

pinax materializes remote catalog and structural metadata into a local DuckDB database and exposes a fluent Python API for discovery queries — filtering by publisher, theme, dimension, code, free text, or provenance lineage.

The catalog is a materialized graph. Rather than federated queries across separate REST endpoints, ingest pulls the full structural model into one database. SQL JOINs are the graph traversal. No network round-trips at query time.
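
To make "SQL JOINs are the graph traversal" concrete, here is a minimal sketch. It uses the standard-library sqlite3 module as a stand-in for DuckDB, and the three tables and their columns are hypothetical, simplified for illustration; they are not pinax's actual schema.

```python
import sqlite3

# Hypothetical, simplified tables; pinax's real schema is not shown in this README.
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE dataset   (id TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE dimension (dataset_id TEXT, name TEXT, codelist_id TEXT);
    CREATE TABLE code      (codelist_id TEXT, code_id TEXT, label TEXT);

    INSERT INTO dataset VALUES ('NAMA_10_GDP', 'GDP and main components'),
                               ('LFS_CA', 'Labour force, Canada');
    INSERT INTO dimension VALUES ('NAMA_10_GDP', 'geo', 'CL_GEO_EU'),
                                 ('LFS_CA', 'geo', 'CL_GEO_CA');
    INSERT INTO code VALUES ('CL_GEO_EU', 'DE', 'Germany'),
                            ('CL_GEO_EU', 'FR', 'France'),
                            ('CL_GEO_CA', 'ON', 'Ontario');
    """
)


def datasets_with_code(dim_name: str, code_id: str) -> list[str]:
    """Walk dataset -> dimension -> code with two JOINs, all in one local query."""
    rows = con.execute(
        """
        SELECT DISTINCT d.id
        FROM dataset d
        JOIN dimension dim ON dim.dataset_id = d.id
        JOIN code c        ON c.codelist_id = dim.codelist_id
        WHERE dim.name = ? AND c.code_id = ?
        ORDER BY d.id
        """,
        (dim_name, code_id),
    ).fetchall()
    return [row[0] for row in rows]


print(datasets_with_code("geo", "DE"))  # → ['NAMA_10_GDP']
```

Because every edge of the graph lives in the same database, the lookup needs no HTTP calls, which is the property the paragraph above describes.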

Installation

Requires Python 3.13+. Managed with uv.

uv add pinaxlib
# or: pip install pinaxlib

The distribution is published on PyPI as pinaxlib; the import name remains pinax.

Quick start

import pinax as pk
import pinax.query as q
import sdmxlib as sl

# Creates my_catalog/bundle.duckdb and my_catalog/parquet/
with pk.Catalog("my_catalog").open() as store:
    with sl.RestRegistry(sl.Provider.ESTAT) as reg:
        reg.get(sl.Dataflow, agency="ESTAT", id="NAMA_10_GDP").resolve()
        pk.ingest_sdmx(store, reg.registry)

    # Catalog query — no network needed
    results = (
        store.query(pk.AggregateDataset)
        .filter(q.has_code("geo", "DE"))
        .search("GDP")
        .with_facets("themes", "frequency")
        .execute()
    )
    print(results.total, "datasets found")
    print(results.facets["themes"])

Five dataset kinds

pinax uses a discriminated union of dataset types aligned with DCAT-AP and StatDCAT-AP:

# Generic datasets (CKAN, open portals)
pk.OpenDataset(identifier="co2-2024", title=i18n("CO2 Emissions 2024"), ...)

# Statistical tables with SDMX structure (dimensions, codelists)
pk.AggregateDataset(identifier="ESTAT:UNE_RT_M(1.0)", ..., sdmx_dataflow_urn="urn:...")

# Survey microdata with variable-level metadata
pk.MicrodataDataset(identifier="lfs-2023", ..., variables=[...])

# Spatial datasets with bounding box and CRS
pk.GeospatialDataset(identifier="boundaries-2024", ..., crs="EPSG:4326")

# Articles, reports, and analytical publications
pk.PublicationDataset(identifier="pub-71-607-x", ..., doi="10.25318/...", authors=[...])

All five share the same store and query API. Use pk.BaseDataset as the bound for generic code; pk.Dataset is a type alias for the full union.

Source connectors

SDMX (Eurostat, OECD, BIS, ABS, ...)

import sdmxlib as sl
import pinax as pk

with pk.Catalog("my_catalog").open() as store:
    with sl.RestRegistry(sl.Provider.ESTAT) as reg:
        for df_id in ["NAMA_10_GDP", "UNE_RT_M", "PRIC_HPI_IDX"]:
            reg.get(sl.Dataflow, agency="ESTAT", id=df_id).resolve()
        pk.ingest_sdmx(store, reg.registry)

        # Stream observation data to Parquet, attach as Distribution
        pk.ingest_data(store, reg, "ESTAT:NAMA_10_GDP(latest)", measure_dim="na_item")

Statistics Canada

import pinax as pk
from pinax.sources.statcan import WDSClient, NDMClient

with pk.Catalog("statcan").open() as store:
    with WDSClient() as wds:
        pk.ingest_statcan_table(store, wds, "14100287")   # Labour Force Survey

    with NDMClient() as ndm:
        pk.ingest_statcan_publications(store, ndm, product_type="82", limit=200)

CKAN (open.canada.ca, data.gov, ...)

import pinax as pk
from pinax.sources.ckan import CkanClient

with pk.Catalog("open_canada").open() as store:
    with CkanClient("https://open.canada.ca/data") as client:
        pk.ingest_ckan(store, client, organization="statcan", rows=500)

Scope-based graph traversal

pinax exposes a lazy, scope-based API for navigating the catalog graph. Navigation builds scope objects without executing SQL; only terminal methods (.collect(), .count()) hit the database.

import pinax as pk

store = pk.Catalog("my_catalog").open()

# Navigate themes — no SQL until .collect()
concepts = store.themes["statcan"].collect()          # ItemList[Concept]
concept = store.themes["statcan"]["13"].collect()     # Concept

# Cross-entity navigation — .datasets returns a lazy QueryBuilder
datasets = store.themes["statcan"]["13"].datasets.collect()

# Enrich with sub-traversal expressions (like Polars' pl.col())
store.themes["statcan"].enrich(
    n=pk.each("datasets").count(),
    has_data=pk.each("datasets").exists(),
).collect()

# Codelist navigation — pk.urn builds URN strings for you
codes = store.codelist(pk.urn.codelist("SDMX", "CL_GEO")).collect()    # ItemList[Code]
code = store.codelist(pk.urn.codelist("SDMX", "CL_GEO"))["ON"].collect()  # Code

# CodelistsScope — parallel to ThemesScope, supports enrich
# Enriched output includes labels resolved via sdmx.localized_text
store.codelists.lang("en").enrich(n=pk.each("datasets").count()).collect()
# → [{"urn": "...", "label": "Geography", "n": 5}, ...]

# Filter codelists by label text (SQL-level, case-insensitive)
store.codelists.filter(text_contains="geo").enrich(n=pk.each("datasets").count()).collect()

store.codelist(urn).label("en")   # quick name lookup

# Enriched per-code output includes code name labels
store.codelist(urn).lang("en").enrich(n=pk.each("datasets").count()).collect()
# → [{"code_id": "ON", "label": "Ontario", "n": 5}, ...]

# Batch label resolution — single SQL query for many codes
store.codelist(geo_urn).batch_labels(["CA", "US", "DE"])
# → {"CA": "Canada", "US": "United States", "DE": "Germany"}

# Across multiple codelists at once
store.codelists.batch_labels([(geo_urn, "CA"), (freq_urn, "A")], lang="en")
# → {(geo_urn, "CA"): "Canada", (freq_urn, "A"): "Annual"}

# Dimension traversal
dims = store.dimensions(ds).collect()              # ItemList[DimensionInfo]
codelist = store.dimensions(ds)["GEO"].codelist    # CodelistScope (lazy)

Scope classes: ConceptSchemesScope, ConceptSchemeScope, ThemesScope, SchemeScope, ConceptScope, CodelistsScope, CodelistScope, CodeScope, DimensionsScope, DimensionScope.

Expression types: pk.each("edge") creates a context-free sub-traversal expression. Reusable across .enrich(), .filter(), and .sort_by().
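
The lazy behaviour described above (navigation records intent, terminal methods do the work) can be illustrated with a small stand-alone sketch. LazyScope is a hypothetical class backed by a nested dict rather than SQL; it is not how pinax implements its scopes, only an instance of the same pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyScope:
    """Hypothetical scope: records a path, resolves nothing until a terminal call."""
    data: dict
    path: tuple[str, ...] = ()

    def __getitem__(self, key: str) -> "LazyScope":
        # Navigation only extends the recorded path; no lookup happens here.
        return LazyScope(self.data, self.path + (key,))

    def collect(self):
        # Terminal method: resolve the recorded path against the backing data.
        node = self.data
        for key in self.path:
            node = node[key]
        return node

    def count(self) -> int:
        # Terminal method: number of children at the recorded path.
        return len(self.collect())


themes = LazyScope({"statcan": {"13": "Labour", "36": "Economy"}})

scope = themes["statcan"]["13"]   # nothing resolved yet
print(scope.path)                 # → ('statcan', '13')
print(scope.collect())            # → Labour
print(themes["statcan"].count())  # → 2
```

A consequence of this design is that a bad key is only reported at the terminal call: themes["missing"] succeeds, and themes["missing"].collect() raises KeyError.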

Query API

Structured filters

import pinax as pk
import pinax.query as q

# Field filters (keyword arguments)
store.query(pk.AggregateDataset).filter(publisher="ESTAT", status="current").all()

# Composable filter objects
store.query(pk.AggregateDataset).filter(
    q.has_code("geo", "DE"),          # datasets with GEO=DE in their codelist
    q.has_dimensions(["geo", "freq"]), # datasets with both GEO and FREQ dimensions
).all()

# Distribution and service filters
store.query(pk.OpenDataset).filter(
    q.distribution(format="CSV"),
    q.has_service(endpoint_url="https://..."),
).all()

Cross-entity queries

# Agents that publish datasets with code GEO=CA
store.query(pk.Agent).filter(
    q.publishes(q.has_code("geo", "CA"), kind="aggregate")
).all()

# Data services serving aggregate datasets
store.query(pk.DataService).filter(
    q.serves(kind="aggregate")
).all()

MAP column filters

# Spatial coverage filter
store.query(pk.BaseDataset).filter(q.has_spatial("Canada")).all()

# Keyword filter
store.query(pk.BaseDataset).filter(q.has_keyword("employment")).all()

# Title/description search (case-insensitive ILIKE)
store.query(pk.BaseDataset).filter(q.title_contains("GDP")).all()
store.query(pk.BaseDataset).filter(q.description_contains("quarterly")).all()

# Sort by multilingual title
store.query(pk.BaseDataset).sort_by("title", lang="en").all()
store.query(pk.BaseDataset).sort_by("title", lang="en", desc=True).all()

Selective relationship loading

By default, querying a list of datasets loads all relationships (~17 queries). Use .include() to declare exactly which relationships to batch-load — the rest are set to an UNLOADED sentinel:

# Load only publisher, themes, and keywords: 3 queries instead of ~17
results = (
    store.query(pk.BaseDataset)
    .filter(status="published")
    .include("publisher", "themes", "keywords")
    .sort_by("issued", desc=True)
    .limit(20)
    .all()
)

# Explicit full hydration — useful to make the cost visible at the call site
results = store.query(pk.BaseDataset).filter(...).full().all()

# get() always loads everything — no include() needed
ds = store.get(pk.AggregateDataset, "ESTAT:NAMA_10_GDP(1.0)")

Unloaded fields raise pk.NotLoadedError on access. Use pk.is_unloaded(value) to check before accessing:

ds = store.query(pk.BaseDataset).include("publisher").first()
ds.publisher.name    # OK
ds.themes[0]         # raises NotLoadedError: 'themes' was not loaded

if not pk.is_unloaded(ds.themes):
    print(ds.themes)

Valid relationship names: publisher, contact_point, frequency, licence, themes, subject, dataset_type, keywords, spatial_coverage, distributions, conforms_to, quality_annotations, provenance, dimension_names, variables, feature_types, authors.
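
The sentinel pattern behind this behaviour can be sketched in a few lines. Everything below (NotLoaded, _Unloaded, and this is_unloaded) is a hypothetical illustration written for this README, not pinax's implementation of pk.NotLoadedError or pk.is_unloaded.

```python
class NotLoaded(Exception):
    """Raised when a field that was never fetched is used."""


class _Unloaded:
    """Placeholder stored in fields that were not requested."""

    def __init__(self, field_name: str) -> None:
        self._field_name = field_name

    def _fail(self, *args, **kwargs):
        raise NotLoaded(f"{self._field_name!r} was not loaded")

    # Indexing, iterating, and len() all route to the same error.
    __getitem__ = __iter__ = __len__ = _fail

    def __getattr__(self, name: str):
        # Only called for attributes that do not exist on the placeholder.
        self._fail()

    def __repr__(self) -> str:
        return f"<unloaded {self._field_name}>"


def is_unloaded(value: object) -> bool:
    """Safe check that never touches the placeholder's contents."""
    return isinstance(value, _Unloaded)


record = {"publisher": "ESTAT", "themes": _Unloaded("themes")}

print(is_unloaded(record["publisher"]))  # → False
print(is_unloaded(record["themes"]))     # → True
try:
    record["themes"][0]
except NotLoaded as exc:
    print(exc)                           # → 'themes' was not loaded
```

Failing loudly on access, rather than returning an empty list, keeps "not fetched" distinguishable from "fetched and empty".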

Lightweight projections

When you need just a few columns (e.g. autocomplete), projection modifiers bypass full object reconstruction — a single SQL query:

# Row projection — .select() + .rows() returns Row objects (dict subclass)
rows = (
    store.query(pk.BaseDataset)
    .filter(q.title_contains("GDP"))
    .sort_by("title", lang="en")
    .limit(3)
    .select("identifier", "title", lang="en")
    .rows()
)
# → [Row({"identifier": "GDP", "title": "GDP Growth"}), ...]
rows[0].identifier   # attribute-style access
rows[0]["title"]     # dict-style access — both work

# Flat value projection — .scalars() + .values() returns bare values
ids = store.query(pk.BaseDataset).filter(status="current").scalars("identifier").values()
# → ["EXR", "M1", "UNEMP", ...]

# Existence check — no object reconstruction
if store.query(pk.AggregateDataset).filter(q.has_code("geo", "DE")).exists():
    ...
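
The dual access style shown for Row (attribute and key) is easy to picture as a dict subclass. The class below is a minimal sketch of that idea under the assumption that missing keys should surface as AttributeError; pinax's own Row may differ in detail.

```python
class Row(dict):
    """Sketch of a dict subclass whose keys are also readable as attributes."""

    def __getattr__(self, name: str):
        # Called only when normal attribute lookup fails, so dict methods
        # such as .keys() and .items() keep working unchanged.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


row = Row(identifier="GDP", title="GDP Growth")
print(row.identifier)   # → GDP
print(row["title"])     # → GDP Growth
```

One caveat of this approach: a column whose name collides with a dict method (for example "keys") is only reachable with the bracket syntax.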

Full-text search and facets

# BM25 search across titles, descriptions, keywords, themes, and dimensions
results = store.search("unemployment", limit=20)

# Combined search and filter
results = (
    store.query(pk.AggregateDataset)
    .filter(publisher="ESTAT")
    .search("labour force")
    .with_facets("themes", "frequency")
    .execute()
)
print(results.facets["themes"])   # {"Labour": 18, "Economy": 6, ...}

# Aggregation counts
counts = store.facets("publisher", "themes", "frequency", "status")
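
As a rough mental model of what a facet result contains, the sketch below counts field values over an in-memory list of records. facet_counts and the sample records are hypothetical; pinax computes its facets in SQL, and this example only shows the shape of the output.

```python
from collections import Counter


def facet_counts(records: list[dict], *fields: str) -> dict[str, dict[str, int]]:
    """Count how many records carry each value of each requested field."""
    counts = {name: Counter() for name in fields}
    for record in records:
        for name in fields:
            value = record.get(name)
            # Multi-valued fields (lists) contribute one count per value.
            values = value if isinstance(value, list) else [value]
            counts[name].update(v for v in values if v is not None)
    return {name: dict(counter) for name, counter in counts.items()}


records = [
    {"publisher": "ESTAT", "themes": ["Labour", "Economy"]},
    {"publisher": "ESTAT", "themes": ["Labour"]},
    {"publisher": "StatCan", "themes": []},
]
print(facet_counts(records, "publisher", "themes"))
# → {'publisher': {'ESTAT': 2, 'StatCan': 1}, 'themes': {'Labour': 2, 'Economy': 1}}
```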

Lineage and provenance (PROV-O)

# Record derivation relationships between datasets
store.add_lineage(
    "lfs-microdata",
    "14100287",
    "aggregated_from",
    activity_type="aggregation",
    activity_label="LFS monthly tabulation",
    confidence="asserted",
)

# Transitive upstream/downstream traversal — returns QueryBuilder for chaining
ancestors = store.dataset("14100287").upstream(depth=5).collect()
dependents = store.dataset("CL_GEO").downstream(relationship="uses_classification").collect()

# Chain additional filters after traversal
current = store.dataset("14100287").upstream().filter(status="current").collect()

# Inspect lineage records
rows = store.dataset("14100287").lineage_records(role="target", relationship="aggregated_from")
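
Transitive traversal with a depth limit maps naturally onto a recursive SQL query. The sketch below uses the standard-library sqlite3 module as a stand-in for DuckDB; the lineage table layout, the relationship names other than aggregated_from, and the extra dataset identifiers are assumptions made up for the example, not pinax's schema or data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    -- Hypothetical edge table: each row says "target was derived from source".
    CREATE TABLE lineage (source TEXT, target TEXT, relationship TEXT);
    INSERT INTO lineage VALUES
      ('lfs-survey',    'lfs-microdata',    'collected_from'),
      ('lfs-microdata', '14100287',         'aggregated_from'),
      ('14100287',      'labour-dashboard', 'derived_from');
    """
)


def upstream(dataset_id: str, depth: int = 5) -> list[tuple[str, int]]:
    """Return (ancestor_id, hops) pairs, nearest first, up to `depth` hops."""
    return con.execute(
        """
        WITH RECURSIVE up(id, hops) AS (
            SELECT source, 1 FROM lineage WHERE target = ?
            UNION
            SELECT l.source, up.hops + 1
            FROM lineage l JOIN up ON l.target = up.id
            WHERE up.hops < ?
        )
        SELECT id, MIN(hops) FROM up GROUP BY id ORDER BY MIN(hops), id
        """,
        (dataset_id, depth),
    ).fetchall()


print(upstream("14100287"))           # → [('lfs-microdata', 1), ('lfs-survey', 2)]
print(upstream("14100287", depth=1))  # → [('lfs-microdata', 1)]
```

The hop counter bounds the recursion, so the query terminates even if the lineage table contains a cycle. Downstream traversal is the same query with source and target swapped.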

Architecture

Three layers share one DuckDB database:

Discovery layer  — Catalog            (dataset, agent, concept_scheme, concept, distribution, lineage)
Structural layer — sdmxlib tables     (dataflows, dsd_components, codes, codelists)
Observation layer — Polars / Parquet  (actual time-series data)

pinax owns the discovery layer. sdmxlib owns the structural layer. Both write to the same DuckDB connection — queries JOIN freely across both. Parquet files live alongside the database and are referenced via DCAT Distribution records.

Development

just test              # unit tests
just test-integration  # integration tests (no network)
just test-live         # live tests against real SDMX endpoints
just lint              # ruff check
just typecheck         # basedpyright
just docs              # local docs server
