Queryable open data catalog engine for DCAT-AP, StatDCAT-AP, CKAN, and SDMX

pinax

Note: This library has been written extensively with AI assistance (Claude Code). Users who are not comfortable with AI-generated code should take that into account before adopting it.

pinax is a Python library for building, managing, and querying statistical metadata catalogs. The name refers to the Pinax, the "bibliographic work composed by Callimachus (310/305–240 BCE) that is popularly considered to be the first library catalog in the West".

Inspired by DCAT and the broader landscape of metadata standards (SDMX, DDI), pinax provides a general-purpose engine for catalog storage, discovery, and retrieval — designed to be embedded in ETL pipelines, data platforms, and analytical tooling rather than used as a standalone application. It pairs naturally with domain-specific libraries like sdmxlib for standards-aware workflows.

import pinax as pk

What it does

pinax materializes remote catalog and structural metadata into a local DuckDB database and exposes a fluent Python API for discovery queries — filtering by publisher, theme, dimension, code, free text, or provenance lineage.

The catalog is a materialized graph. Rather than federated queries across separate REST endpoints, ingest pulls the full structural model into one database. SQL JOINs are the graph traversal. No network round-trips at query time.
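
To make the "SQL JOINs are the graph traversal" idea concrete, here is a minimal sketch. It uses the standard-library sqlite3 module as a stand-in for DuckDB, and the table and column names (dataset, dataset_dimension, code) are hypothetical, not pinax's actual schema.

```python
import sqlite3

# In-memory SQLite stands in for the local DuckDB file; the schema is illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dataset (id TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE dataset_dimension (dataset_id TEXT, dimension TEXT, codelist TEXT);
    CREATE TABLE code (codelist TEXT, code_id TEXT);
    INSERT INTO dataset VALUES ('NAMA_10_GDP', 'GDP and main components'),
                               ('LOCAL_ONLY', 'Regional table');
    INSERT INTO dataset_dimension VALUES ('NAMA_10_GDP', 'geo', 'CL_GEO_EU'),
                                         ('LOCAL_ONLY', 'geo', 'CL_GEO_CA');
    INSERT INTO code VALUES ('CL_GEO_EU', 'DE'), ('CL_GEO_EU', 'FR'),
                            ('CL_GEO_CA', 'ON');
""")


def datasets_with_code(dimension: str, code_id: str) -> list[str]:
    """Walk dataset -> dimension -> codelist -> code with two JOINs."""
    rows = conn.execute(
        """
        SELECT DISTINCT d.id
        FROM dataset AS d
        JOIN dataset_dimension AS dd ON dd.dataset_id = d.id
        JOIN code AS c ON c.codelist = dd.codelist
        WHERE dd.dimension = ? AND c.code_id = ?
        ORDER BY d.id
        """,
        (dimension, code_id),
    ).fetchall()
    return [row[0] for row in rows]


print(datasets_with_code("geo", "DE"))
```

Because every edge of the graph is a local table, the lookup is a single query against one database file and needs no HTTP request.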

Installation

Requires Python 3.13+. Managed with uv.

uv add pinaxlib
# or: pip install pinaxlib

The distribution is published on PyPI as pinaxlib; the import name remains pinax.

Quick start

import pinax as pk
import pinax.query as q
import sdmxlib as sl

# Creates my_catalog/bundle.duckdb and my_catalog/parquet/
with pk.Catalog("my_catalog").open() as store:
    with sl.RestRegistry(sl.Provider.ESTAT) as reg:
        reg.get(sl.Dataflow, agency="ESTAT", id="NAMA_10_GDP").resolve()
        pk.ingest_sdmx(store, reg.registry)

    # Catalog query — no network needed
    results = (
        store.query(pk.AggregateDataset)
        .filter(q.has_code("geo", "DE"))
        .search("GDP")
        .with_facets("themes", "frequency")
        .execute()
    )
    print(results.total, "datasets found")
    print(results.facets["themes"])

Five dataset kinds

pinax uses a discriminated union of dataset types aligned with DCAT-AP and StatDCAT-AP:

# Generic datasets (CKAN, open portals)
pk.OpenDataset(identifier="co2-2024", title=i18n("CO2 Emissions 2024"), ...)

# Statistical tables with SDMX structure (dimensions, codelists)
pk.AggregateDataset(identifier="ESTAT:UNE_RT_M(1.0)", ..., sdmx_dataflow_urn="urn:...")

# Survey microdata with variable-level metadata
pk.MicrodataDataset(identifier="lfs-2023", ..., variables=[...])

# Spatial datasets with bounding box and CRS
pk.GeospatialDataset(identifier="boundaries-2024", ..., crs="EPSG:4326")

# Articles, reports, and analytical publications
pk.PublicationDataset(identifier="pub-71-607-x", ..., doi="10.25318/...", authors=[...])

All five share the same store and query API. Use pk.BaseDataset as the bound for generic code; pk.Dataset is a type alias for the full union.
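
As a rough illustration of the discriminated-union pattern described above, the sketch below uses plain dataclasses. The class bodies, the kind field, and the helper functions are assumptions made for the example; pinax's real models may be defined differently.

```python
from dataclasses import dataclass
from typing import Literal, TypeVar, Union


@dataclass
class BaseDataset:
    identifier: str
    title: str


@dataclass
class OpenDataset(BaseDataset):
    kind: Literal["open"] = "open"


@dataclass
class AggregateDataset(BaseDataset):
    sdmx_dataflow_urn: str = ""
    kind: Literal["aggregate"] = "aggregate"


# Alias for the full union; the `kind` field acts as the discriminator.
Dataset = Union[OpenDataset, AggregateDataset]

# Generic helpers bound to the base class accept any member of the union
# and preserve the concrete type for the caller.
D = TypeVar("D", bound=BaseDataset)


def first_by_title(items: list[D]) -> D:
    return min(items, key=lambda ds: ds.title)


def describe(ds: Dataset) -> str:
    # Branching on the discriminator lets a type checker narrow the union.
    if ds.kind == "aggregate":
        return f"{ds.identifier}: SDMX-backed table"
    return f"{ds.identifier}: generic dataset"
```

The bound TypeVar plays the role the text assigns to pk.BaseDataset, and the Union alias plays the role of pk.Dataset.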

Source connectors

SDMX (Eurostat, OECD, BIS, ABS, ...)

import sdmxlib as sl
import pinax as pk

with pk.Catalog("my_catalog").open() as store:
    with sl.RestRegistry(sl.Provider.ESTAT) as reg:
        for df_id in ["NAMA_10_GDP", "UNE_RT_M", "PRIC_HPI_IDX"]:
            reg.get(sl.Dataflow, agency="ESTAT", id=df_id).resolve()
        pk.ingest_sdmx(store, reg.registry)

        # Stream observation data to Parquet, attach as Distribution
        pk.ingest_data(store, reg, "ESTAT:NAMA_10_GDP(latest)", measure_dim="na_item")

Statistics Canada

import pinax as pk
from pinax.sources.statcan import WDSClient, NDMClient

with pk.Catalog("statcan").open() as store:
    with WDSClient() as wds:
        pk.ingest_statcan_table(store, wds, "14100287")   # Labour Force Survey

    with NDMClient() as ndm:
        pk.ingest_statcan_publications(store, ndm, product_type="82", limit=200)

CKAN (open.canada.ca, data.gov, ...)

import pinax as pk
from pinax.sources.ckan import CkanClient

with pk.Catalog("open_canada").open() as store:
    with CkanClient("https://open.canada.ca/data") as client:
        pk.ingest_ckan(store, client, organization="statcan", rows=500)

Scope-based graph traversal

pinax exposes a lazy, scope-based API for navigating the catalog graph. Navigation builds scope objects without executing SQL; only terminal methods (.collect(), .count()) hit the database.

import pinax as pk

store = pk.Catalog("my_catalog").open()

# Navigate themes — no SQL until .collect()
concepts = store.themes["statcan"].collect()          # ItemList[Concept]
concept = store.themes["statcan"]["13"].collect()     # Concept

# Cross-entity navigation — .datasets returns a lazy QueryBuilder
datasets = store.themes["statcan"]["13"].datasets.collect()

# Enrich with sub-traversal expressions (like Polars' pl.col())
store.themes["statcan"].enrich(
    n=pk.each("datasets").count(),
    has_data=pk.each("datasets").exists(),
).collect()

# Codelist navigation — pk.urn builds URN strings for you
codes = store.codelist(pk.urn.codelist("SDMX", "CL_GEO")).collect()    # ItemList[Code]
code = store.codelist(pk.urn.codelist("SDMX", "CL_GEO"))["ON"].collect()  # Code

# CodelistsScope — parallel to ThemesScope, supports enrich
# Enriched output includes labels resolved via sdmx.localized_text
store.codelists.lang("en").enrich(n=pk.each("datasets").count()).collect()
# → [{"urn": "...", "label": "Geography", "n": 5}, ...]

# Filter codelists by label text (SQL-level, case-insensitive)
store.codelists.filter(text_contains="geo").enrich(n=pk.each("datasets").count()).collect()

urn = pk.urn.codelist("SDMX", "CL_GEO")
store.codelist(urn).label("en")   # quick name lookup

# Enriched per-code output includes code name labels
store.codelist(urn).lang("en").enrich(n=pk.each("datasets").count()).collect()
# → [{"code_id": "ON", "label": "Ontario", "n": 5}, ...]

# Batch label resolution — single SQL query for many codes
geo_urn = pk.urn.codelist("SDMX", "CL_GEO")
store.codelist(geo_urn).batch_labels(["CA", "US", "DE"])
# → {"CA": "Canada", "US": "United States", "DE": "Germany"}

# Across multiple codelists at once
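freq_urn = pk.urn.codelist("SDMX", "CL_FREQ")   # assumes a CL_FREQ codelist was ingested
store.codelists.batch_labels([(pk.urn.codelist("SDMX", "CL_GEO"), "CA"), (freq_urn, "A")], lang="en")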
# → {(geo_urn, "CA"): "Canada", (freq_urn, "A"): "Annual"}

# Dimension traversal
dims = store.dimensions(ds).collect()              # ItemList[DimensionInfo]
codelist = store.dimensions(ds)["GEO"].codelist    # CodelistScope (lazy)

Scope classes: ConceptSchemesScope, ConceptSchemeScope, ThemesScope, SchemeScope, ConceptScope, CodelistsScope, CodelistScope, CodeScope, DimensionsScope, DimensionScope.

Expression types: pk.each("edge") creates a context-free sub-traversal expression. Reusable across .enrich(), .filter(), and .sort_by().
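
The lazy behaviour described in this section (navigation builds scope objects, terminal methods run SQL) can be sketched as follows. ThemeScope here is a hypothetical toy class backed by sqlite3, written only to show the pattern; it is not one of pinax's scope classes and the concept table is invented.

```python
import sqlite3


class ThemeScope:
    """Toy lazy scope: records a path, runs SQL only in terminal methods."""

    def __init__(self, conn, scheme=None, executed=None):
        self._conn = conn
        self._scheme = scheme
        # Shared log of SQL statements, so laziness can be observed.
        self.executed = executed if executed is not None else []

    def __getitem__(self, scheme):
        # Navigation returns a new scope and does not touch the database.
        return ThemeScope(self._conn, scheme, self.executed)

    def _run(self, sql, params):
        self.executed.append(sql)
        return self._conn.execute(sql, params).fetchall()

    def collect(self):
        rows = self._run(
            "SELECT id FROM concept WHERE scheme = ? ORDER BY id", (self._scheme,)
        )
        return [row[0] for row in rows]

    def count(self):
        rows = self._run(
            "SELECT COUNT(*) FROM concept WHERE scheme = ?", (self._scheme,)
        )
        return rows[0][0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE concept (scheme TEXT, id TEXT)")
conn.executemany(
    "INSERT INTO concept VALUES (?, ?)",
    [("statcan", "13"), ("statcan", "14"), ("other", "1")],
)

themes = ThemeScope(conn)
scope = themes["statcan"]                 # builds a scope, no SQL yet
queries_before = len(themes.executed)
concepts = scope.collect()                # SQL runs here
```

Keeping navigation free of side effects means a scope can be built once, passed around, and only pay for a query when a terminal method is called.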

Query API

Structured filters

import pinax as pk
import pinax.query as q

# Field filters (keyword arguments)
store.query(pk.AggregateDataset).filter(publisher="ESTAT", status="current").all()

# Composable filter objects
store.query(pk.AggregateDataset).filter(
    q.has_code("geo", "DE"),          # datasets with GEO=DE in their codelist
    q.has_dimensions(["geo", "freq"]), # datasets with both GEO and FREQ dimensions
).all()

# Distribution and service filters
store.query(pk.OpenDataset).filter(
    q.distribution(format="CSV"),
    q.has_service(endpoint_url="https://..."),
).all()
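
One plausible way to think about composable filter objects is as small values that each compile to a parameterized SQL fragment, which the query builder then joins with AND. The sketch below is a guess at that pattern, not pinax's implementation: FieldEquals, HasKeyword, run_query, and the two tables are all hypothetical, and sqlite3 stands in for DuckDB.

```python
import sqlite3
from dataclasses import dataclass

# Column names are checked against a fixed allow-list, never taken from user input.
ALLOWED_COLUMNS = {"publisher", "status"}


@dataclass(frozen=True)
class FieldEquals:
    column: str
    value: str

    def to_sql(self) -> tuple[str, list[str]]:
        if self.column not in ALLOWED_COLUMNS:
            raise ValueError(f"unknown column: {self.column}")
        return f"d.{self.column} = ?", [self.value]


@dataclass(frozen=True)
class HasKeyword:
    keyword: str

    def to_sql(self) -> tuple[str, list[str]]:
        return (
            "EXISTS (SELECT 1 FROM keyword AS k "
            "WHERE k.dataset_id = d.id AND k.value = ?)",
            [self.keyword],
        )


def run_query(conn, *filters) -> list[str]:
    """AND together the fragments of every filter and run one SELECT."""
    clauses, params = [], []
    for f in filters:
        sql, values = f.to_sql()
        clauses.append(sql)
        params.extend(values)
    where = " AND ".join(clauses) if clauses else "1 = 1"
    rows = conn.execute(
        f"SELECT d.id FROM dataset AS d WHERE {where} ORDER BY d.id", params
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dataset (id TEXT, publisher TEXT, status TEXT);
    CREATE TABLE keyword (dataset_id TEXT, value TEXT);
    INSERT INTO dataset VALUES ('A', 'ESTAT', 'current'), ('B', 'ESTAT', 'archived'),
                               ('C', 'OECD', 'current');
    INSERT INTO keyword VALUES ('A', 'employment'), ('C', 'employment'), ('B', 'gdp');
""")
matches = run_query(conn, FieldEquals("publisher", "ESTAT"), HasKeyword("employment"))
```

Values always travel as bound parameters, so filters can be combined freely without string-escaping concerns.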

Cross-entity queries

# Agents that publish datasets with code GEO=CA
store.query(pk.Agent).filter(
    q.publishes(q.has_code("geo", "CA"), kind="aggregate")
).all()

# Data services serving aggregate datasets
store.query(pk.DataService).filter(
    q.serves(kind="aggregate")
).all()

MAP column filters

# Spatial coverage filter
store.query(pk.BaseDataset).filter(q.has_spatial("Canada")).all()

# Keyword filter
store.query(pk.BaseDataset).filter(q.has_keyword("employment")).all()

# Title/description search (case-insensitive ILIKE)
store.query(pk.BaseDataset).filter(q.title_contains("GDP")).all()
store.query(pk.BaseDataset).filter(q.description_contains("quarterly")).all()

# Sort by multilingual title
store.query(pk.BaseDataset).sort_by("title", lang="en").all()
store.query(pk.BaseDataset).sort_by("title", lang="en", desc=True).all()

Selective relationship loading

By default, querying a list of datasets loads all relationships (~17 queries). Use .include() to declare exactly which relationships to batch-load — the rest are set to an UNLOADED sentinel:

# Load only publisher, themes, and keywords (3 queries instead of ~17)
results = (
    store.query(pk.BaseDataset)
    .filter(status="published")
    .include("publisher", "themes", "keywords")
    .sort_by("issued", desc=True)
    .limit(20)
    .all()
)

# Explicit full hydration — useful to make the cost visible at the call site
results = store.query(pk.BaseDataset).filter(...).full().all()

# get() always loads everything — no include() needed
ds = store.get(pk.AggregateDataset, "ESTAT:NAMA_10_GDP(1.0)")

Unloaded fields raise pk.NotLoadedError on access. Use pk.is_unloaded(value) to check before accessing:

ds = store.query(pk.BaseDataset).include("publisher").first()
ds.publisher.name    # OK
ds.themes[0]         # raises NotLoadedError: 'themes' was not loaded

if not pk.is_unloaded(ds.themes):
    print(ds.themes)

Valid relationship names: publisher, contact_point, frequency, licence, themes, subject, dataset_type, keywords, spatial_coverage, distributions, conforms_to, quality_annotations, provenance, dimension_names, variables, feature_types, authors.
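
The sentinel behaviour described above can be sketched in a few lines. This is an illustrative stand-in only: the _Unloaded class and the SimpleNamespace dataset are assumptions for the example, while the names NotLoadedError and is_unloaded mirror the ones mentioned in the text.

```python
from types import SimpleNamespace


class NotLoadedError(RuntimeError):
    """Raised when an unloaded relationship is used."""


class _Unloaded:
    """Placeholder stored in place of a relationship that was not requested."""

    def __init__(self, field: str):
        self._field = field

    def _fail(self, *args, **kwargs):
        raise NotLoadedError(f"{self._field!r} was not loaded")

    # Indexing or iterating the placeholder fails loudly.
    __getitem__ = __iter__ = _fail

    def __getattr__(self, name):
        # Only reached for attributes that do not exist on the placeholder.
        self._fail()

    def __repr__(self) -> str:
        return f"<UNLOADED {self._field}>"


def is_unloaded(value) -> bool:
    # A type check is safe: it never reads through the placeholder.
    return isinstance(value, _Unloaded)


# A dataset where only `publisher` was loaded.
ds = SimpleNamespace(publisher="ESTAT", themes=_Unloaded("themes"))

if not is_unloaded(ds.themes):
    print(ds.themes)
```

Failing at the point of use, rather than silently returning an empty list, makes a missing .include() call easy to spot.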

Lightweight projections

When you need just a few columns (e.g. autocomplete), projection modifiers bypass full object reconstruction — a single SQL query:

# Row projection — .select() + .rows() returns Row objects (dict subclass)
rows = (
    store.query(pk.BaseDataset)
    .filter(q.title_contains("GDP"))
    .sort_by("title", lang="en")
    .limit(3)
    .select("identifier", "title", lang="en")
    .rows()
)
# → [Row({"identifier": "GDP", "title": "GDP Growth"}), ...]
rows[0].identifier   # attribute-style access
rows[0]["title"]     # dict-style access — both work

# Flat value projection — .scalars() + .values() returns bare values
ids = store.query(pk.BaseDataset).filter(status="current").scalars("identifier").values()
# → ["EXR", "M1", "UNEMP", ...]

# Existence check — no object reconstruction
if store.query(pk.AggregateDataset).filter(q.has_code("geo", "DE")).exists():
    ...
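
For readers curious how a dict subclass can offer both access styles, here is a minimal sketch. It is an assumption about the general technique, not a copy of pinax's Row class.

```python
class Row(dict):
    """Dict subclass that also exposes its keys as attributes."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so dict methods still work.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


row = Row({"identifier": "GDP", "title": "GDP Growth"})
print(row.identifier, row["title"])
```

One caveat of this technique: a column whose name matches a dict method (for example items or keys) is reachable only through the dict-style form.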

Full-text search and facets

# BM25 search across titles, descriptions, keywords, themes, and dimensions
results = store.search("unemployment", limit=20)

# Combined search and filter
results = (
    store.query(pk.AggregateDataset)
    .filter(publisher="ESTAT")
    .search("labour force")
    .with_facets("themes", "frequency")
    .execute()
)
print(results.facets["themes"])   # {"Labour": 18, "Economy": 6, ...}

# Aggregation counts
counts = store.facets("publisher", "themes", "frequency", "status")
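
Facet counts like the ones printed above are, conceptually, value tallies over the matching datasets. The sketch below shows that idea in plain Python over hypothetical dictionaries; pinax presumably computes the same thing in SQL, and the facet_counts helper is not part of its API.

```python
from collections import Counter


def facet_counts(datasets: list[dict], *fields: str) -> dict[str, dict[str, int]]:
    """Tally how many datasets carry each value of each requested field."""
    facets = {field: Counter() for field in fields}
    for ds in datasets:
        for field in fields:
            value = ds.get(field)
            # Multi-valued fields (lists) contribute one count per value.
            values = value if isinstance(value, list) else [value]
            facets[field].update(v for v in values if v is not None)
    return {field: dict(counter) for field, counter in facets.items()}


matching = [
    {"id": "A", "themes": ["Labour", "Economy"], "frequency": "M"},
    {"id": "B", "themes": ["Labour"], "frequency": "A"},
    {"id": "C", "themes": [], "frequency": None},
]
facets = facet_counts(matching, "themes", "frequency")
```

Computing facets over the filtered result set, rather than the whole catalog, is what lets the counts shrink as filters are added.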

Lineage and provenance (PROV-O)

# Record derivation relationships between datasets
store.add_lineage(
    "lfs-microdata",
    "14100287",
    "aggregated_from",
    activity_type="aggregation",
    activity_label="LFS monthly tabulation",
    confidence="asserted",
)

# Transitive upstream/downstream traversal — returns QueryBuilder for chaining
ancestors = store.dataset("14100287").upstream(depth=5).collect()
dependents = store.dataset("CL_GEO").downstream(relationship="uses_classification").collect()

# Chain additional filters after traversal
current = store.dataset("14100287").upstream().filter(status="current").collect()

# Inspect lineage records
rows = store.dataset("14100287").lineage_records(role="target", relationship="aggregated_from")
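
A depth-limited upstream traversal of this kind can be expressed as a recursive SQL query. The sketch below uses sqlite3 in place of DuckDB and a hypothetical three-column lineage table. It assumes the first argument of add_lineage is the source and the second the target, and the relationship names other than aggregated_from are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineage (source TEXT, target TEXT, relationship TEXT)")
conn.executemany(
    "INSERT INTO lineage VALUES (?, ?, ?)",
    [
        ("lfs-frame", "lfs-microdata", "sampled_from"),
        ("lfs-microdata", "14100287", "aggregated_from"),
        ("14100287", "labour-dashboard", "derived_from"),
    ],
)


def upstream(dataset_id: str, depth: int = 5) -> list[str]:
    """Return ancestors of dataset_id, nearest first, up to `depth` hops away."""
    rows = conn.execute(
        """
        WITH RECURSIVE walk(id, level) AS (
            SELECT source, 1 FROM lineage WHERE target = ?
            UNION
            SELECT l.source, w.level + 1
            FROM lineage AS l JOIN walk AS w ON l.target = w.id
            WHERE w.level < ?
        )
        SELECT id, MIN(level) AS level FROM walk GROUP BY id ORDER BY level, id
        """,
        (dataset_id, depth),
    ).fetchall()
    return [row[0] for row in rows]


print(upstream("14100287"))
```

The depth bound also guarantees termination if the lineage table ever contains a cycle.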

Architecture

Three layers share one DuckDB database:

Discovery layer  — Catalog            (dataset, agent, concept_scheme, concept, distribution, lineage)
Structural layer — sdmxlib tables     (dataflows, dsd_components, codes, codelists)
Observation layer — Polars / Parquet  (actual time-series data)

pinax owns the discovery layer. sdmxlib owns the structural layer. Both write to the same DuckDB connection — queries JOIN freely across both. Parquet files live alongside the database and are referenced via DCAT Distribution records.

Development

just test              # unit tests
just test-integration  # integration tests (no network)
just test-live         # live tests against real SDMX endpoints
just lint              # ruff check
just typecheck         # basedpyright
just docs              # local docs server

Release history

5.11.0

May 12, 2026

5.10.1

May 11, 2026

5.10.0

May 11, 2026

5.9.0

May 11, 2026

5.8.1 (this version)

May 10, 2026

5.8.0

May 10, 2026

5.7.1

May 8, 2026

5.7.0

May 4, 2026

Download files

Source Distribution

pinaxlib-5.8.1.tar.gz (82.9 kB)

Uploaded May 10, 2026 Source

Built Distribution

pinaxlib-5.8.1-py3-none-any.whl (93.3 kB)

Uploaded May 10, 2026 Python 3

File details

Details for the file pinaxlib-5.8.1.tar.gz.

File metadata

Download URL: pinaxlib-5.8.1.tar.gz
Upload date: May 10, 2026
Size: 82.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.12 {"installer":{"name":"uv","version":"0.11.12","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for pinaxlib-5.8.1.tar.gz
Algorithm	Hash digest
SHA256	`0b883a9368a8d17c0e4d889b1e13a1a4dfce78b4029362f58f25c0b642739917`
MD5	`f439f943abba6bf49a52d2a9094e7f73`
BLAKE2b-256	`6684eefbdca3663a80f6416c821a07c2cb31453f86fe9bac62ea81e8dca37787`

File details

Details for the file pinaxlib-5.8.1-py3-none-any.whl.

File metadata

Download URL: pinaxlib-5.8.1-py3-none-any.whl
Upload date: May 10, 2026
Size: 93.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.12 {"installer":{"name":"uv","version":"0.11.12","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for pinaxlib-5.8.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2830efd87ee2ebcd25801996a5c712ba17097989e97249468dada49e02bd3d08`
MD5	`3fd90d46ed98367f5b3ae17d83275627`
BLAKE2b-256	`0af578ac03145e67941bf7771609df16cbeab9bd5069bc3bdb04acc12504dd3a`
