Queryable open data catalog engine for DCAT-AP, StatDCAT-AP, CKAN, and SDMX

pinax

Note: This library has been written extensively with AI assistance (Claude Code). Users who are not comfortable with AI-generated code should take that into account before adopting it.

pinax is a Python library for building, managing, and querying statistical metadata catalogs. It is named after the Pinakes (singular: pinax), a "bibliographic work composed by Callimachus (310/305–240 BCE) that is popularly considered to be the first library catalog in the West".

Inspired by DCAT and the broader landscape of metadata standards (SDMX, DDI), pinax provides a general-purpose engine for catalog storage, discovery, and retrieval — designed to be embedded in ETL pipelines, data platforms, and analytical tooling rather than used as a standalone application. It pairs naturally with domain-specific libraries like sdmxlib for standards-aware workflows.

import pinax as pk

What it does

pinax materializes remote catalog and structural metadata into a local DuckDB database and exposes a fluent Python API for discovery queries — filtering by publisher, theme, dimension, code, free text, or provenance lineage.

The catalog is a materialized graph. Rather than federated queries across separate REST endpoints, ingest pulls the full structural model into one database. SQL JOINs are the graph traversal. No network round-trips at query time.
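To make "SQL JOINs are the graph traversal" concrete, here is a minimal sketch of the idea. It is not pinax's schema: the standard-library sqlite3 module stands in for DuckDB, and every table name, column name, and sample row is made up for illustration.

```python
import sqlite3

# sqlite3 stands in for DuckDB; the schema and rows below are hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dataset   (id TEXT PRIMARY KEY);
    CREATE TABLE dimension (dataset_id TEXT, name TEXT, codelist TEXT);
    CREATE TABLE code      (codelist TEXT, code_id TEXT);

    INSERT INTO dataset   VALUES ('NAMA_10_GDP'), ('14100287');
    INSERT INTO dimension VALUES ('NAMA_10_GDP', 'geo', 'CL_GEO_EU'),
                                 ('14100287',    'geo', 'CL_GEO_CA');
    INSERT INTO code      VALUES ('CL_GEO_EU', 'DE'), ('CL_GEO_EU', 'FR'),
                                 ('CL_GEO_CA', 'ON');
""")


def datasets_with_code(dim: str, code_id: str) -> list[str]:
    """Walk dataset -> dimension -> code with two JOINs, all in one local query."""
    rows = con.execute(
        """
        SELECT DISTINCT d.id
        FROM dataset d
        JOIN dimension dm ON dm.dataset_id = d.id
        JOIN code c       ON c.codelist = dm.codelist
        WHERE dm.name = ? AND c.code_id = ?
        ORDER BY d.id
        """,
        (dim, code_id),
    ).fetchall()
    return [row[0] for row in rows]


print(datasets_with_code("geo", "DE"))
```

Because every edge of the graph is a foreign-key-style column in the same database, a question such as "which datasets have DE in their geo codelist" needs no calls to a remote registry once ingest has finished.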

Installation

Requires Python 3.13+. Managed with uv.

uv add pinaxlib
# or: pip install pinaxlib

The distribution is published on PyPI as pinaxlib; the import name remains pinax.

Quick start

import pinax as pk
import pinax.query as q
import sdmxlib as sl

# Creates my_catalog/bundle.duckdb and my_catalog/parquet/
with pk.Catalog("my_catalog").open() as store:
    with sl.RestRegistry(sl.Provider.ESTAT) as reg:
        reg.get(sl.Dataflow, agency="ESTAT", id="NAMA_10_GDP").resolve()
        pk.ingest_sdmx(store, reg.registry)

    # Catalog query — no network needed
    results = (
        store.query(pk.AggregateDataset)
        .filter(q.has_code("geo", "DE"))
        .search("GDP")
        .with_facets("themes", "frequency")
        .execute()
    )
    print(results.total, "datasets found")
    print(results.facets["themes"])

Five dataset kinds

pinax uses a discriminated union of dataset types aligned with DCAT-AP and StatDCAT-AP:

# Generic datasets (CKAN, open portals)
pk.OpenDataset(identifier="co2-2024", title=i18n("CO2 Emissions 2024"), ...)

# Statistical tables with SDMX structure (dimensions, codelists)
pk.AggregateDataset(identifier="ESTAT:UNE_RT_M(1.0)", ..., sdmx_dataflow_urn="urn:...")

# Survey microdata with variable-level metadata
pk.MicrodataDataset(identifier="lfs-2023", ..., variables=[...])

# Spatial datasets with bounding box and CRS
pk.GeospatialDataset(identifier="boundaries-2024", ..., crs="EPSG:4326")

# Articles, reports, and analytical publications
pk.PublicationDataset(identifier="pub-71-607-x", ..., doi="10.25318/...", authors=[...])

All five share the same store and query API. Use pk.BaseDataset as the bound for generic code; pk.Dataset is a type alias for the full union.
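For readers unfamiliar with the pattern, the following is a rough sketch of a discriminated union in plain Python. The Sketch* classes, the kind field, and from_record are hypothetical stand-ins, not pinax's actual models; only three of the five kinds are shown to keep it short.

```python
from dataclasses import dataclass
from typing import Literal, Union


@dataclass
class SketchOpen:
    identifier: str
    kind: Literal["open"] = "open"


@dataclass
class SketchAggregate:
    identifier: str
    sdmx_dataflow_urn: str = ""
    kind: Literal["aggregate"] = "aggregate"


@dataclass
class SketchGeospatial:
    identifier: str
    crs: str = "EPSG:4326"
    kind: Literal["geospatial"] = "geospatial"


# The union type: code that accepts "any dataset" is annotated with this alias.
SketchDataset = Union[SketchOpen, SketchAggregate, SketchGeospatial]

_BY_KIND = {
    "open": SketchOpen,
    "aggregate": SketchAggregate,
    "geospatial": SketchGeospatial,
}


def from_record(record: dict) -> SketchDataset:
    """Pick the concrete class from the 'kind' discriminator, then build it."""
    fields = dict(record)
    cls = _BY_KIND[fields.pop("kind")]
    return cls(**fields)


def describe(ds: SketchDataset) -> str:
    """Branch on the discriminator rather than on isinstance checks."""
    if ds.kind == "aggregate":
        return f"{ds.identifier}: statistical table"
    if ds.kind == "geospatial":
        return f"{ds.identifier}: spatial dataset ({ds.crs})"
    return f"{ds.identifier}: generic dataset"


ds = from_record({"kind": "geospatial", "identifier": "boundaries-2024", "crs": "EPSG:3857"})
print(describe(ds))
```

The single kind value is what lets one table or one JSON payload hold all variants while still round-tripping to the right class.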

Source connectors

SDMX (Eurostat, OECD, BIS, ABS, ...)

import sdmxlib as sl
import pinax as pk

with pk.Catalog("my_catalog").open() as store:
    with sl.RestRegistry(sl.Provider.ESTAT) as reg:
        for df_id in ["NAMA_10_GDP", "UNE_RT_M", "PRIC_HPI_IDX"]:
            reg.get(sl.Dataflow, agency="ESTAT", id=df_id).resolve()
        pk.ingest_sdmx(store, reg.registry)

        # Stream observation data to Parquet, attach as Distribution
        pk.ingest_data(store, reg, "ESTAT:NAMA_10_GDP(latest)", measure_dim="na_item")

Statistics Canada

import pinax as pk
from pinax.sources.statcan import WDSClient, NDMClient

with pk.Catalog("statcan").open() as store:
    with WDSClient() as wds:
        pk.ingest_statcan_table(store, wds, "14100287")   # Labour Force Survey

    with NDMClient() as ndm:
        pk.ingest_statcan_publications(store, ndm, product_type="82", limit=200)

CKAN (open.canada.ca, data.gov, ...)

import pinax as pk
from pinax.sources.ckan import CkanClient

with pk.Catalog("open_canada").open() as store:
    with CkanClient("https://open.canada.ca/data") as client:
        pk.ingest_ckan(store, client, organization="statcan", rows=500)

Scope-based graph traversal

pinax exposes a lazy, scope-based API for navigating the catalog graph. Navigation builds scope objects without executing SQL; only terminal methods (.collect(), .count()) hit the database.

import pinax as pk

store = pk.Catalog("my_catalog").open()

# Navigate themes — no SQL until .collect()
concepts = store.themes["statcan"].collect()          # ItemList[Concept]
concept = store.themes["statcan"]["13"].collect()     # Concept

# Cross-entity navigation — .datasets returns a lazy QueryBuilder
datasets = store.themes["statcan"]["13"].datasets.collect()

# Enrich with sub-traversal expressions (like Polars' pl.col())
store.themes["statcan"].enrich(
    n=pk.each("datasets").count(),
    has_data=pk.each("datasets").exists(),
).collect()

# Codelist navigation — pk.urn builds URN strings for you
codes = store.codelist(pk.urn.codelist("SDMX", "CL_GEO")).collect()    # ItemList[Code]
code = store.codelist(pk.urn.codelist("SDMX", "CL_GEO"))["ON"].collect()  # Code

# CodelistsScope — parallel to ThemesScope, supports enrich
# Enriched output includes labels resolved via sdmx.localized_text
store.codelists.lang("en").enrich(n=pk.each("datasets").count()).collect()
# → [{"urn": "...", "label": "Geography", "n": 5}, ...]

# Filter codelists by label text (SQL-level, case-insensitive)
store.codelists.filter(text_contains="geo").enrich(n=pk.each("datasets").count()).collect()

urn = pk.urn.codelist("SDMX", "CL_GEO")
store.codelist(urn).label("en")   # quick name lookup

# Enriched per-code output includes code name labels
store.codelist(urn).lang("en").enrich(n=pk.each("datasets").count()).collect()
# → [{"code_id": "ON", "label": "Ontario", "n": 5}, ...]

# Batch label resolution — single SQL query for many codes
geo_urn = pk.urn.codelist("SDMX", "CL_GEO")
store.codelist(geo_urn).batch_labels(["CA", "US", "DE"])
# → {"CA": "Canada", "US": "United States", "DE": "Germany"}

# Across multiple codelists at once
freq_urn = pk.urn.codelist("SDMX", "CL_FREQ")   # assumes a CL_FREQ codelist has been ingested
store.codelists.batch_labels([(geo_urn, "CA"), (freq_urn, "A")], lang="en")
# → {(geo_urn, "CA"): "Canada", (freq_urn, "A"): "Annual"}

# Dimension traversal
dims = store.dimensions(ds).collect()              # ItemList[DimensionInfo]
codelist = store.dimensions(ds)["GEO"].codelist    # CodelistScope (lazy)

Scope classes: ConceptSchemesScope, ConceptSchemeScope, ThemesScope, SchemeScope, ConceptScope, CodelistsScope, CodelistScope, CodeScope, DimensionsScope, DimensionScope.

Expression types: pk.each("edge") creates a context-free sub-traversal expression that can be reused across .enrich(), .filter(), and .sort_by().
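The "navigation is lazy, terminals execute" behaviour described in this section can be sketched in a few lines. This is an illustration of the general pattern only: LazyScope and its where() method are hypothetical, sqlite3 stands in for DuckDB, and the trace callback is there just to show when SQL actually runs.

```python
import sqlite3


class LazyScope:
    """Accumulates conditions; SQL runs only in collect() or count()."""

    def __init__(self, con, table, clauses=(), params=()):
        self._con, self._table = con, table
        self._clauses, self._params = tuple(clauses), tuple(params)

    def where(self, column, value):
        # Returns a new scope and executes nothing. Column names are
        # interpolated, so they must come from trusted code, not user input.
        return LazyScope(
            self._con,
            self._table,
            self._clauses + (f"{column} = ?",),
            self._params + (value,),
        )

    def _sql(self, select):
        where = " AND ".join(self._clauses) or "1 = 1"
        return f"SELECT {select} FROM {self._table} WHERE {where}"

    def collect(self):
        return self._con.execute(self._sql("*") + " ORDER BY id", self._params).fetchall()

    def count(self):
        return self._con.execute(self._sql("COUNT(*)"), self._params).fetchone()[0]


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE concept (id TEXT, scheme TEXT)")
con.executemany(
    "INSERT INTO concept VALUES (?, ?)",
    [("13", "statcan"), ("14", "statcan"), ("E1", "estat")],
)

executed = []                            # every statement sqlite3 runs from here on
con.set_trace_callback(executed.append)

scope = LazyScope(con, "concept").where("scheme", "statcan")
narrowed = scope.where("id", "13")
statements_before_terminal = len(executed)   # still 0: building scopes ran no SQL

total = scope.count()
concepts = narrowed.collect()
print(statements_before_terminal, total, concepts)
```

Each navigation step returns a new immutable scope, so intermediate scopes can be stored and reused without worrying about hidden queries or shared state.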

Query API

Structured filters

import pinax as pk
import pinax.query as q

# Field filters (keyword arguments)
store.query(pk.AggregateDataset).filter(publisher="ESTAT", status="current").all()

# Composable filter objects
store.query(pk.AggregateDataset).filter(
    q.has_code("geo", "DE"),          # datasets with GEO=DE in their codelist
    q.has_dimensions(["geo", "freq"]), # datasets with both GEO and FREQ dimensions
).all()

# Distribution and service filters
store.query(pk.OpenDataset).filter(
    q.distribution(format="CSV"),
    q.has_service(endpoint_url="https://..."),
).all()

Cross-entity queries

# Agents that publish datasets with code GEO=CA
store.query(pk.Agent).filter(
    q.publishes(q.has_code("geo", "CA"), kind="aggregate")
).all()

# Data services serving aggregate datasets
store.query(pk.DataService).filter(
    q.serves(kind="aggregate")
).all()

MAP column filters

# Spatial coverage filter
store.query(pk.BaseDataset).filter(q.has_spatial("Canada")).all()

# Keyword filter
store.query(pk.BaseDataset).filter(q.has_keyword("employment")).all()

# Title/description search (case-insensitive ILIKE)
store.query(pk.BaseDataset).filter(q.title_contains("GDP")).all()
store.query(pk.BaseDataset).filter(q.description_contains("quarterly")).all()

# Sort by multilingual title
store.query(pk.BaseDataset).sort_by("title", lang="en").all()
store.query(pk.BaseDataset).sort_by("title", lang="en", desc=True).all()

Selective relationship loading

By default, querying a list of datasets loads all relationships (~17 queries). Use .include() to declare exactly which relationships to batch-load — the rest are set to an UNLOADED sentinel:

# Only load publisher, themes, and keywords: 3 queries instead of ~17
results = (
    store.query(pk.BaseDataset)
    .filter(status="published")
    .include("publisher", "themes", "keywords")
    .sort_by("issued", desc=True)
    .limit(20)
    .all()
)

# Explicit full hydration — useful to make the cost visible at the call site
results = store.query(pk.BaseDataset).filter(...).full().all()

# get() always loads everything — no include() needed
ds = store.get(pk.AggregateDataset, "ESTAT:NAMA_10_GDP(1.0)")

Unloaded fields raise pk.NotLoadedError on access. Use pk.is_unloaded(value) to check before accessing:

ds = store.query(pk.BaseDataset).include("publisher").first()
ds.publisher.name    # OK
ds.themes[0]         # raises NotLoadedError: 'themes' was not loaded

if not pk.is_unloaded(ds.themes):
    print(ds.themes)

Valid relationship names: publisher, contact_point, frequency, licence, themes, subject, dataset_type, keywords, spatial_coverage, distributions, conforms_to, quality_annotations, provenance, dimension_names, variables, feature_types, authors.
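One way such a sentinel could work is sketched below, as an assumption about the general technique rather than a description of pinax internals. Unloaded, SketchNotLoadedError, hydrate, and the shortened RELATIONSHIPS tuple are all hypothetical names.

```python
from types import SimpleNamespace

RELATIONSHIPS = ("publisher", "themes", "keywords")   # shortened for the sketch


class SketchNotLoadedError(Exception):
    """Raised when a relationship that was not requested is used."""


class Unloaded:
    """Placeholder stored in a field that was skipped during loading."""

    def __init__(self, field):
        self._field = field

    def _fail(self, *args, **kwargs):
        raise SketchNotLoadedError(f"{self._field!r} was not loaded")

    # Indexing or iterating the placeholder fails loudly instead of
    # silently behaving like an empty list.
    __getitem__ = _fail
    __iter__ = _fail

    def __getattr__(self, name):
        if name.startswith("__"):        # leave Python's own protocol lookups alone
            raise AttributeError(name)
        self._fail()


def is_unloaded(value) -> bool:
    return isinstance(value, Unloaded)


def hydrate(record: dict, include: set) -> SimpleNamespace:
    """Keep the requested relationships and mask the rest with the sentinel."""
    fields = dict(record)
    for name in RELATIONSHIPS:
        if name not in include:
            fields[name] = Unloaded(name)
    return SimpleNamespace(**fields)


ds = hydrate(
    {"identifier": "demo", "publisher": "ESTAT", "themes": ["Economy"], "keywords": ["gdp"]},
    include={"publisher"},
)
print(ds.publisher, is_unloaded(ds.themes))
```

A distinct sentinel, as opposed to None or an empty list, keeps "not loaded" distinguishable from "loaded and empty", which is the failure mode selective loading would otherwise hide.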

Lightweight projections

When you need just a few columns (e.g. autocomplete), projection modifiers bypass full object reconstruction — a single SQL query:

# Row projection — .select() + .rows() returns Row objects (dict subclass)
rows = (
    store.query(pk.BaseDataset)
    .filter(q.title_contains("GDP"))
    .sort_by("title", lang="en")
    .limit(3)
    .select("identifier", "title", lang="en")
    .rows()
)
# → [Row({"identifier": "GDP", "title": "GDP Growth"}), ...]
rows[0].identifier   # attribute-style access
rows[0]["title"]     # dict-style access — both work

# Flat value projection — .scalars() + .values() returns bare values
ids = store.query(pk.BaseDataset).filter(status="current").scalars("identifier").values()
# → ["EXR", "M1", "UNEMP", ...]

# Existence check — no object reconstruction
if store.query(pk.AggregateDataset).filter(q.has_code("geo", "DE")).exists():
    ...
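The "dict subclass with attribute access" shape of a projected row is easy to picture with a small sketch. SketchRow and select_rows are hypothetical, sqlite3 stands in for DuckDB, and the table contents are invented; the point is only that a projection reads the requested columns and nothing else.

```python
import sqlite3


class SketchRow(dict):
    """A dict whose keys can also be read as attributes."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


def select_rows(con, columns, limit):
    # Column names are interpolated, so they must come from trusted code.
    cur = con.execute(
        f"SELECT {', '.join(columns)} FROM dataset ORDER BY identifier LIMIT ?",
        (limit,),
    )
    names = [desc[0] for desc in cur.description]
    return [SketchRow(zip(names, values)) for values in cur]


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dataset (identifier TEXT, title TEXT, description TEXT)")
con.executemany(
    "INSERT INTO dataset VALUES (?, ?, ?)",
    [
        ("UNEMP", "Unemployment rate", "long text"),
        ("EXR", "Exchange rates", "long text"),
        ("M1", "Money supply", "long text"),
    ],
)

rows = select_rows(con, ["identifier", "title"], limit=2)
print(rows[0].identifier, rows[0]["title"])
```

Raising AttributeError for unknown keys keeps hasattr() and getattr() with a default working as callers expect.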

Full-text search and facets

# BM25 search across titles, descriptions, keywords, themes, and dimensions
results = store.search("unemployment", limit=20)

# Combined search and filter
results = (
    store.query(pk.AggregateDataset)
    .filter(publisher="ESTAT")
    .search("labour force")
    .with_facets("themes", "frequency")
    .execute()
)
print(results.facets["themes"])   # {"Labour": 18, "Economy": 6, ...}

# Aggregation counts
counts = store.facets("publisher", "themes", "frequency", "status")
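Facet counts are value tallies over the matching result set. The sketch below shows the idea in plain Python with collections.Counter; facet_counts and the sample records are hypothetical, and a real engine would more likely compute this in SQL with GROUP BY.

```python
from collections import Counter


def facet_counts(records, *fields):
    """Count how often each value of each field occurs across records."""
    facets = {field: Counter() for field in fields}
    for record in records:
        for field in fields:
            value = record.get(field)
            # Multi-valued fields (e.g. themes) contribute one count per value.
            values = value if isinstance(value, (list, tuple, set)) else [value]
            facets[field].update(v for v in values if v is not None)
    return {field: dict(counter) for field, counter in facets.items()}


matches = [
    {"id": "a", "themes": ["Labour", "Economy"], "frequency": "M"},
    {"id": "b", "themes": ["Labour"], "frequency": "A"},
    {"id": "c", "themes": [], "frequency": "M"},
]

print(facet_counts(matches, "themes", "frequency"))
```

Note that counts for a multi-valued facet can sum to more than the number of matching datasets, since one dataset may carry several themes.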

Lineage and provenance (PROV-O)

# Record derivation relationships between datasets
store.add_lineage(
    "lfs-microdata",
    "14100287",
    "aggregated_from",
    activity_type="aggregation",
    activity_label="LFS monthly tabulation",
    confidence="asserted",
)

# Transitive upstream/downstream traversal — returns QueryBuilder for chaining
ancestors = store.dataset("14100287").upstream(depth=5).collect()
dependents = store.dataset("CL_GEO").downstream(relationship="uses_classification").collect()

# Chain additional filters after traversal
current = store.dataset("14100287").upstream().filter(status="current").collect()

# Inspect lineage records
rows = store.dataset("14100287").lineage_records(role="target", relationship="aggregated_from")
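Transitive upstream traversal with a depth limit maps naturally onto a recursive common table expression. The sketch below assumes a simple (source, target, relationship) edge table, which is a guess at the shape rather than pinax's real schema; sqlite3 stands in for DuckDB, and the dataset identifiers other than the two used above are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE lineage (source TEXT, target TEXT, relationship TEXT)")
con.executemany(
    "INSERT INTO lineage VALUES (?, ?, ?)",
    [
        ("lfs-microdata", "14100287", "aggregated_from"),
        ("lfs-frame", "lfs-microdata", "sampled_from"),     # invented edge
        ("census-2021", "lfs-frame", "derived_from"),       # invented edge
    ],
)


def upstream(con, dataset_id, depth=5):
    """Return (ancestor_id, distance) pairs up to `depth` hops away."""
    return con.execute(
        """
        WITH RECURSIVE up(id, level) AS (
            SELECT source, 1 FROM lineage WHERE target = ?
            UNION
            SELECT l.source, up.level + 1
            FROM lineage l
            JOIN up ON l.target = up.id
            WHERE up.level < ?
        )
        SELECT id, MIN(level) AS distance
        FROM up
        GROUP BY id
        ORDER BY distance, id
        """,
        (dataset_id, depth),
    ).fetchall()


print(upstream(con, "14100287"))
print(upstream(con, "14100287", depth=1))
```

The level bound in the recursive step is what guarantees termination even if the lineage table contains a cycle; downstream traversal is the same query with source and target swapped.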

Architecture

Three layers share one DuckDB database:

Discovery layer  — Catalog            (dataset, agent, concept_scheme, concept, distribution, lineage)
Structural layer — sdmxlib tables     (dataflows, dsd_components, codes, codelists)
Observation layer — Polars / Parquet  (actual time-series data)

pinax owns the discovery layer. sdmxlib owns the structural layer. Both write to the same DuckDB connection — queries JOIN freely across both. Parquet files live alongside the database and are referenced via DCAT Distribution records.

Development

just test              # unit tests
just test-integration  # integration tests (no network)
just test-live         # live tests against real SDMX endpoints
just lint              # ruff check
just typecheck         # basedpyright
just docs              # local docs server
