Queryable open data catalog engine for DCAT-AP, StatDCAT-AP, CKAN, and SDMX
pinax
Note: This library has been written extensively with AI assistance (Claude Code). Users who are not comfortable with AI-generated code should take that into account before adopting it.
pinax is a Python library for building, managing, and querying statistical metadata catalogs. It is named after the "bibliographic work composed by Callimachus (310/305–240 BCE) that is popularly considered to be the first library catalog in the West" (Pinax).
Inspired by DCAT and the broader landscape of metadata standards (SDMX, DDI), pinax provides a general-purpose engine for catalog storage, discovery, and retrieval — designed to be embedded in ETL pipelines, data platforms, and analytical tooling rather than used as a standalone application. It pairs naturally with domain-specific libraries like sdmxlib for standards-aware workflows.
import pinax as pk
What it does
pinax materializes remote catalog and structural metadata into a local DuckDB database and exposes a fluent Python API for discovery queries — filtering by publisher, theme, dimension, code, free text, or provenance lineage.
The catalog is a materialized graph. Rather than federated queries across separate REST endpoints, ingest pulls the full structural model into one database. SQL JOINs are the graph traversal. No network round-trips at query time.
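To make "SQL JOINs are the graph traversal" concrete, here is a minimal sketch. It assumes a made-up three-table schema (`dataset`, `dimension`, `code`) with made-up rows, and it uses the standard-library `sqlite3` module as a stand-in for DuckDB so that it runs without extra dependencies. None of the table or function names below come from pinax.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only (not pinax's tables).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dataset (id TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE dimension (dataset_id TEXT, name TEXT, codelist TEXT);
    CREATE TABLE code (codelist TEXT, code_id TEXT);

    INSERT INTO dataset VALUES ('NAMA_10_GDP', 'GDP table'),
                               ('T14100287', 'Labour table');
    INSERT INTO dimension VALUES ('NAMA_10_GDP', 'geo', 'CL_GEO_EU'),
                                 ('T14100287', 'geo', 'CL_GEO_CA');
    INSERT INTO code VALUES ('CL_GEO_EU', 'DE'), ('CL_GEO_EU', 'FR'),
                            ('CL_GEO_CA', 'ON');
    """
)


def datasets_with_code(dim: str, code_id: str) -> list[str]:
    """Walk dataset -> dimension -> code with two JOINs, all in the local file."""
    rows = conn.execute(
        """
        SELECT DISTINCT d.id
        FROM dataset AS d
        JOIN dimension AS dim ON dim.dataset_id = d.id
        JOIN code AS c ON c.codelist = dim.codelist
        WHERE dim.name = ? AND c.code_id = ?
        ORDER BY d.id
        """,
        (dim, code_id),
    ).fetchall()
    return [row[0] for row in rows]


print(datasets_with_code("geo", "DE"))  # ['NAMA_10_GDP']
```

Because every edge of the graph is a foreign-key-style column in the same database, a question such as "which datasets allow GEO=DE" is answered by one local query instead of a chain of REST calls.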
Installation
Requires Python 3.13+. Managed with uv.
uv add pinaxlib
# or: pip install pinaxlib
The distribution is published on PyPI as pinaxlib; the import name remains pinax.
Quick start
import pinax as pk
import pinax.query as q
import sdmxlib as sl
# Creates my_catalog/bundle.duckdb and my_catalog/parquet/
with pk.Catalog("my_catalog").open() as store:
    with sl.RestRegistry(sl.Provider.ESTAT) as reg:
        reg.get(sl.Dataflow, agency="ESTAT", id="NAMA_10_GDP").resolve()
        pk.ingest_sdmx(store, reg.registry)
    # Catalog query: no network needed
    results = (
        store.query(pk.AggregateDataset)
        .filter(q.has_code("geo", "DE"))
        .search("GDP")
        .with_facets("themes", "frequency")
        .execute()
    )
    print(results.total, "datasets found")
    print(results.facets["themes"])
Five dataset kinds
pinax uses a discriminated union of dataset types aligned with DCAT-AP and StatDCAT-AP:
# Generic datasets (CKAN, open portals)
pk.OpenDataset(identifier="co2-2024", title=i18n("CO2 Emissions 2024"), ...)
# Statistical tables with SDMX structure (dimensions, codelists)
pk.AggregateDataset(identifier="ESTAT:UNE_RT_M(1.0)", ..., sdmx_dataflow_urn="urn:...")
# Survey microdata with variable-level metadata
pk.MicrodataDataset(identifier="lfs-2023", ..., variables=[...])
# Spatial datasets with bounding box and CRS
pk.GeospatialDataset(identifier="boundaries-2024", ..., crs="EPSG:4326")
# Articles, reports, and analytical publications
pk.PublicationDataset(identifier="pub-71-607-x", ..., doi="10.25318/...", authors=[...])
All five share the same store and query API. Use pk.BaseDataset as the bound
for generic code; pk.Dataset is a type alias for the full union.
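The "shared base plus union alias" arrangement can be sketched with plain dataclasses. The classes below (`BaseRecord`, `AggregateRecord`, `GeospatialRecord`) and the `kind` field are hypothetical stand-ins chosen for this illustration. They are not pinax's models, and how pinax actually discriminates between kinds is an assumption here.

```python
from dataclasses import dataclass
from typing import Literal, Optional, TypeVar, Union


@dataclass
class BaseRecord:
    """Fields every kind shares (hypothetical, far smaller than a real model)."""
    identifier: str
    title: str


@dataclass
class AggregateRecord(BaseRecord):
    sdmx_dataflow_urn: str = ""
    kind: Literal["aggregate"] = "aggregate"  # discriminator tag


@dataclass
class GeospatialRecord(BaseRecord):
    crs: str = "EPSG:4326"
    kind: Literal["geospatial"] = "geospatial"  # discriminator tag


# Alias for the full union, analogous in spirit to pk.Dataset.
Record = Union[AggregateRecord, GeospatialRecord]

# Bound type variable for generic helpers, analogous in spirit to pk.BaseDataset.
T = TypeVar("T", bound=BaseRecord)


def find(records: list[T], identifier: str) -> Optional[T]:
    """Generic code only relies on the shared base fields."""
    return next((r for r in records if r.identifier == identifier), None)


def describe(record: Record) -> str:
    """Kind-specific code branches on the discriminator."""
    if record.kind == "aggregate":
        return f"{record.identifier}: aggregate, dataflow={record.sdmx_dataflow_urn}"
    return f"{record.identifier}: geospatial, crs={record.crs}"


records = [GeospatialRecord("boundaries-2024", "Boundaries")]
print(describe(find(records, "boundaries-2024")))
```

The bound keeps helpers like `find` usable with any kind, while the union alias lets a type checker confirm that `describe` handles every member.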
Source connectors
SDMX (Eurostat, OECD, BIS, ABS, ...)
import sdmxlib as sl
import pinax as pk
with pk.Catalog("my_catalog").open() as store:
    with sl.RestRegistry(sl.Provider.ESTAT) as reg:
        for df_id in ["NAMA_10_GDP", "UNE_RT_M", "PRIC_HPI_IDX"]:
            reg.get(sl.Dataflow, agency="ESTAT", id=df_id).resolve()
        pk.ingest_sdmx(store, reg.registry)
        # Stream observation data to Parquet, attach as Distribution
        pk.ingest_data(store, reg, "ESTAT:NAMA_10_GDP(latest)", measure_dim="na_item")
Statistics Canada
import pinax as pk
from pinax.sources.statcan import WDSClient, NDMClient
with pk.Catalog("statcan").open() as store:
    with WDSClient() as wds:
        pk.ingest_statcan_table(store, wds, "14100287")  # Labour Force Survey
    with NDMClient() as ndm:
        pk.ingest_statcan_publications(store, ndm, product_type="82", limit=200)
CKAN (open.canada.ca, data.gov, ...)
import pinax as pk
from pinax.sources.ckan import CkanClient
with pk.Catalog("open_canada").open() as store:
    with CkanClient("https://open.canada.ca/data") as client:
        pk.ingest_ckan(store, client, organization="statcan", rows=500)
Scope-based graph traversal
pinax exposes a lazy, scope-based API for navigating the catalog graph.
Navigation builds scope objects without executing SQL; only terminal methods
(.collect(), .count()) hit the database.
import pinax as pk
store = pk.Catalog("my_catalog").open()
# Navigate themes — no SQL until .collect()
concepts = store.themes["statcan"].collect() # ItemList[Concept]
concept = store.themes["statcan"]["13"].collect() # Concept
# Cross-entity navigation — .datasets returns a lazy QueryBuilder
datasets = store.themes["statcan"]["13"].datasets.collect()
# Enrich with sub-traversal expressions (like Polars' pl.col())
store.themes["statcan"].enrich(
n=pk.each("datasets").count(),
has_data=pk.each("datasets").exists(),
).collect()
# Codelist navigation — pk.urn builds URN strings for you
codes = store.codelist(pk.urn.codelist("SDMX", "CL_GEO")).collect() # ItemList[Code]
code = store.codelist(pk.urn.codelist("SDMX", "CL_GEO"))["ON"].collect() # Code
# CodelistsScope — parallel to ThemesScope, supports enrich
# Enriched output includes labels resolved via sdmx.localized_text
store.codelists.lang("en").enrich(n=pk.each("datasets").count()).collect()
# → [{"urn": "...", "label": "Geography", "n": 5}, ...]
# Filter codelists by label text (SQL-level, case-insensitive)
store.codelists.filter(text_contains="geo").enrich(n=pk.each("datasets").count()).collect()
store.codelist(urn).label("en") # quick name lookup
# Enriched per-code output includes code name labels
store.codelist(urn).lang("en").enrich(n=pk.each("datasets").count()).collect()
# → [{"code_id": "ON", "label": "Ontario", "n": 5}, ...]
# Batch label resolution — single SQL query for many codes
store.codelist(geo_urn).batch_labels(["CA", "US", "DE"])
# → {"CA": "Canada", "US": "United States", "DE": "Germany"}
# Across multiple codelists at once
store.codelists.batch_labels([(geo_urn, "CA"), (freq_urn, "A")], lang="en")
# → {(geo_urn, "CA"): "Canada", (freq_urn, "A"): "Annual"}
# Dimension traversal
dims = store.dimensions(ds).collect() # ItemList[DimensionInfo]
codelist = store.dimensions(ds)["GEO"].codelist # CodelistScope (lazy)
Scope classes: ConceptSchemesScope, ConceptSchemeScope, ThemesScope,
SchemeScope, ConceptScope, CodelistsScope, CodelistScope, CodeScope,
DimensionsScope, DimensionScope.
Expression types: pk.each("edge") creates a context-free sub-traversal
expression. Reusable across .enrich(), .filter(), and .sort_by().
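The lazy behaviour described in this section (navigation only builds scope objects, terminal methods run SQL) can be illustrated with a toy scope class. This is a sketch under stated assumptions: `ConceptsScope`, the `concept` table, its rows, and the `executed` log are all invented for the example, and `sqlite3` stands in for DuckDB. It does not reproduce pinax's scope classes.

```python
import sqlite3
from dataclasses import dataclass

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE concept (scheme TEXT, concept_id TEXT, label TEXT);
    INSERT INTO concept VALUES ('statcan', '13', 'Health'),
                               ('statcan', '14', 'Labour'),
                               ('estat', 'econ', 'Economy');
    """
)

executed: list[str] = []  # every statement actually sent to the database


@dataclass(frozen=True)
class ConceptsScope:
    """Immutable description of a query: it holds conditions and runs nothing."""
    conditions: tuple[tuple[str, str], ...] = ()

    def __getitem__(self, key: str) -> "ConceptsScope":
        # First subscript narrows by scheme, later ones by concept id.
        column = "scheme" if not self.conditions else "concept_id"
        return ConceptsScope(self.conditions + ((column, key),))

    def _run(self, select: str, suffix: str = "") -> list[tuple]:
        where = " AND ".join(f"{col} = ?" for col, _ in self.conditions) or "1 = 1"
        sql = f"SELECT {select} FROM concept WHERE {where}{suffix}"
        executed.append(sql)
        return conn.execute(sql, [value for _, value in self.conditions]).fetchall()

    def collect(self) -> list[str]:
        """Terminal method: the first point at which SQL is executed."""
        return [row[0] for row in self._run("label", " ORDER BY concept_id")]

    def count(self) -> int:
        """Terminal method: counts matching rows without fetching them."""
        return self._run("COUNT(*)")[0][0]


themes = ConceptsScope()
scope = themes["statcan"]["13"]  # navigation only, nothing executed yet
print(len(executed))             # 0
print(scope.collect())           # ['Health']
print(themes["statcan"].count()) # 2
print(len(executed))             # 2
```

Keeping scopes immutable means an intermediate scope such as `themes["statcan"]` can be reused for several terminal calls without one call affecting another.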
Query API
Structured filters
import pinax.query as q
# Field filters (keyword arguments)
store.query(pk.AggregateDataset).filter(publisher="ESTAT", status="current").all()
# Composable filter objects
store.query(pk.AggregateDataset).filter(
q.has_code("geo", "DE"), # datasets with GEO=DE in their codelist
q.has_dimensions(["geo", "freq"]), # datasets with both GEO and FREQ dimensions
).all()
# Distribution and service filters
store.query(pk.OpenDataset).filter(
q.distribution(format="CSV"),
q.has_service(endpoint_url="https://..."),
).all()
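One common way to make filters composable is to have each filter carry a SQL fragment plus its bound parameters, and to join fragments with AND. The sketch below follows that approach. The `Filter` dataclass, the helper names, the column allowlist, and the sample rows are hypothetical, `sqlite3` stands in for DuckDB, and nothing here is taken from pinax's internals.

```python
import sqlite3
from dataclasses import dataclass

ALLOWED_COLUMNS = {"publisher", "status"}  # hypothetical allowlist of filterable fields


@dataclass(frozen=True)
class Filter:
    """A WHERE-clause fragment together with its bound parameters."""
    sql: str
    params: tuple = ()


def field_equals(column: str, value: str) -> Filter:
    # Column names cannot be bound as parameters, so validate them explicitly.
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unknown filter column: {column}")
    return Filter(f"{column} = ?", (value,))


def title_contains(text: str) -> Filter:
    # Case-insensitive substring match; % and _ in text act as LIKE wildcards.
    return Filter("LOWER(title) LIKE ?", (f"%{text.lower()}%",))


def all_of(*filters: Filter) -> Filter:
    """Combine fragments with AND, keeping parameters in matching order."""
    if not filters:
        return Filter("1 = 1")
    sql = " AND ".join(f"({f.sql})" for f in filters)
    params = tuple(p for f in filters for p in f.params)
    return Filter(sql, params)


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dataset (id TEXT, title TEXT, publisher TEXT, status TEXT);
    INSERT INTO dataset VALUES
        ('a', 'GDP by region', 'ESTAT', 'current'),
        ('b', 'GDP archive', 'ESTAT', 'deprecated'),
        ('c', 'House prices', 'ESTAT', 'current');
    """
)


def run(combined: Filter) -> list[str]:
    sql = f"SELECT id FROM dataset WHERE {combined.sql} ORDER BY id"
    return [row[0] for row in conn.execute(sql, combined.params)]


where = all_of(
    field_equals("publisher", "ESTAT"),
    field_equals("status", "current"),
    title_contains("gdp"),
)
print(run(where))  # ['a']
```

Values always travel as bound parameters, so user input never ends up inside the SQL text itself.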
Cross-entity queries
# Agents that publish datasets with code GEO=CA
store.query(pk.Agent).filter(
q.publishes(q.has_code("geo", "CA"), kind="aggregate")
).all()
# Data services serving aggregate datasets
store.query(pk.DataService).filter(
q.serves(kind="aggregate")
).all()
MAP column filters
# Spatial coverage filter
store.query(pk.BaseDataset).filter(q.has_spatial("Canada")).all()
# Keyword filter
store.query(pk.BaseDataset).filter(q.has_keyword("employment")).all()
# Title/description search (case-insensitive ILIKE)
store.query(pk.BaseDataset).filter(q.title_contains("GDP")).all()
store.query(pk.BaseDataset).filter(q.description_contains("quarterly")).all()
# Sort by multilingual title
store.query(pk.BaseDataset).sort_by("title", lang="en").all()
store.query(pk.BaseDataset).sort_by("title", lang="en", desc=True).all()
Selective relationship loading
By default, querying a list of datasets loads all relationships (~17 queries).
Use .include() to declare exactly which relationships to batch-load — the rest
are set to an UNLOADED sentinel:
# Only load publisher, themes, and keywords: 3 queries instead of ~17
results = (
    store.query(pk.BaseDataset)
    .filter(status="published")
    .include("publisher", "themes", "keywords")
    .sort_by("issued", desc=True)
    .limit(20)
    .all()
)
# Explicit full hydration — useful to make the cost visible at the call site
results = store.query(pk.BaseDataset).filter(...).full().all()
# get() always loads everything — no include() needed
ds = store.get(pk.AggregateDataset, "ESTAT:NAMA_10_GDP(1.0)")
Unloaded fields raise pk.NotLoadedError on access. Use pk.is_unloaded(value)
to check before accessing:
ds = store.query(pk.BaseDataset).include("publisher").first()
ds.publisher.name # OK
ds.themes[0] # raises NotLoadedError: 'themes' was not loaded
if not pk.is_unloaded(ds.themes):
    print(ds.themes)
Valid relationship names: publisher, contact_point, frequency, licence,
themes, subject, dataset_type, keywords, spatial_coverage,
distributions, conforms_to, quality_annotations, provenance,
dimension_names, variables, feature_types, authors.
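A sentinel that fails loudly on use can be sketched in a few lines. The version below is an assumption about how such a mechanism could work, not pinax's implementation: the `Unloaded` class, the `load` helper, and the two-relationship record are invented for the example, and only the names `NotLoadedError` and `is_unloaded` echo the text above.

```python
from types import SimpleNamespace


class NotLoadedError(Exception):
    """Raised when code touches a relationship that was never fetched."""


class Unloaded:
    """Placeholder stored in place of a relationship that was skipped."""

    def __init__(self, field: str) -> None:
        self._field = field

    def _fail(self, *args: object) -> None:
        raise NotLoadedError(f"'{self._field}' was not loaded")

    # Indexing, iterating, and len() all fail loudly.
    __getitem__ = __iter__ = __len__ = _fail

    def __getattr__(self, name: str) -> None:
        # Leave private and dunder lookups alone so copy and pickle behave.
        if name.startswith("_"):
            raise AttributeError(name)
        self._fail()

    def __repr__(self) -> str:
        return f"<UNLOADED {self._field}>"


def is_unloaded(value: object) -> bool:
    return isinstance(value, Unloaded)


RELATIONSHIPS = ("publisher", "themes")  # hypothetical subset


def load(row: dict, include: tuple[str, ...]) -> SimpleNamespace:
    """Build a record, replacing every relationship not in include with the sentinel."""
    values = {"identifier": row["identifier"]}
    for name in RELATIONSHIPS:
        values[name] = row[name] if name in include else Unloaded(name)
    return SimpleNamespace(**values)


row = {"identifier": "x", "publisher": "ESTAT", "themes": ["Economy"]}
ds = load(row, include=("publisher",))
print(ds.publisher)           # ESTAT
print(is_unloaded(ds.themes)) # True
try:
    ds.themes[0]
except NotLoadedError as exc:
    print(exc)                # 'themes' was not loaded
```

Raising on access instead of returning an empty list avoids the silent bug where "not fetched" is mistaken for "has no themes".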
Lightweight projections
When you need just a few columns (e.g. autocomplete), projection modifiers bypass full object reconstruction — a single SQL query:
# Row projection — .select() + .rows() returns Row objects (dict subclass)
rows = (
store.query(pk.BaseDataset)
.filter(q.title_contains("GDP"))
.sort_by("title", lang="en")
.limit(3)
.select("identifier", "title", lang="en")
.rows()
)
# → [Row({"identifier": "GDP", "title": "GDP Growth"}), ...]
rows[0].identifier # attribute-style access
rows[0]["title"] # dict-style access — both work
# Flat value projection — .scalars() + .values() returns bare values
ids = store.query(pk.BaseDataset).filter(status="current").scalars("identifier").values()
# → ["EXR", "M1", "UNEMP", ...]
# Existence check — no object reconstruction
if store.query(pk.AggregateDataset).filter(q.has_code("geo", "DE")).exists():
    ...
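A dict subclass that also answers attribute access, as the Row objects above are described, takes only a few lines. The sketch below is a generic version of that pattern. The `Row` class and `project` helper here are written for illustration and are not pinax's own code.

```python
class Row(dict):
    """Dict that also exposes its keys as attributes."""

    def __getattr__(self, name: str):
        # Called only when normal attribute lookup fails, so dict methods
        # such as .keys() and .items() keep working.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


def project(records: list[dict], *columns: str) -> list[Row]:
    """Keep only the requested columns, without building full objects."""
    return [Row({column: record[column] for column in columns}) for record in records]


records = [{"identifier": "GDP", "title": "GDP Growth", "status": "current"}]
rows = project(records, "identifier", "title")
print(rows[0].identifier)  # GDP
print(rows[0]["title"])    # GDP Growth
print(dict(rows[0]))       # {'identifier': 'GDP', 'title': 'GDP Growth'}
```

One caveat of this pattern: a column whose name matches a dict method (for example `items`) is reachable only through `row["items"]`, because normal attribute lookup finds the method first.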
Full-text search and facets
# BM25 search across titles, descriptions, keywords, themes, and dimensions
results = store.search("unemployment", limit=20)
# Combined search and filter
results = (
store.query(pk.AggregateDataset)
.filter(publisher="ESTAT")
.search("labour force")
.with_facets("themes", "frequency")
.execute()
)
print(results.facets["themes"]) # {"Labour": 18, "Economy": 6, ...}
# Aggregation counts
counts = store.facets("publisher", "themes", "frequency", "status")
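Facet counts are, conceptually, a tally of field values across the matching datasets. The sketch below shows that idea over plain dictionaries. The `facet_counts` helper and the sample records are made up for the example, and pinax presumably computes its facets in SQL rather than in Python.

```python
from collections import Counter


def facet_counts(records: list[dict], *fields: str) -> dict[str, dict[str, int]]:
    """Tally how many records carry each value of each requested field."""
    facets: dict[str, dict[str, int]] = {}
    for field in fields:
        counter: Counter = Counter()
        for record in records:
            value = record.get(field)
            # Multi-valued fields (lists) contribute one count per value.
            values = value if isinstance(value, list) else [value]
            counter.update(v for v in values if v is not None)
        facets[field] = dict(counter.most_common())
    return facets


# Made-up records for illustration.
records = [
    {"id": "a", "themes": ["Labour", "Economy"], "frequency": "M"},
    {"id": "b", "themes": ["Labour"], "frequency": "A"},
    {"id": "c", "themes": [], "frequency": "M"},
]
facets = facet_counts(records, "themes", "frequency")
print(facets["themes"])     # {'Labour': 2, 'Economy': 1}
print(facets["frequency"])  # {'M': 2, 'A': 1}
```

A dataset with several themes counts once under each of them, so facet totals can exceed the number of matching datasets.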
Lineage and provenance (PROV-O)
# Record derivation relationships between datasets
store.add_lineage(
"lfs-microdata",
"14100287",
"aggregated_from",
activity_type="aggregation",
activity_label="LFS monthly tabulation",
confidence="asserted",
)
# Transitive upstream/downstream traversal — returns QueryBuilder for chaining
ancestors = store.dataset("14100287").upstream(depth=5).collect()
dependents = store.dataset("CL_GEO").downstream(relationship="uses_classification").collect()
# Chain additional filters after traversal
current = store.dataset("14100287").upstream().filter(status="current").collect()
# Inspect lineage records
rows = store.dataset("14100287").lineage_records(role="target", relationship="aggregated_from")
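Transitive upstream traversal with a depth limit maps naturally onto a recursive common table expression. The sketch below assumes a hypothetical `lineage(source, target, relationship)` table in which each row means "target was derived from source", matching the argument order used in the `add_lineage` call above. The rows are invented and `sqlite3` stands in for DuckDB.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE lineage (source TEXT, target TEXT, relationship TEXT);
    INSERT INTO lineage VALUES
        ('survey-frame', 'lfs-microdata', 'derived_from'),
        ('lfs-microdata', '14100287', 'aggregated_from'),
        ('14100287', 'dashboard', 'derived_from');
    """
)


def upstream(dataset_id: str, depth: int = 5) -> list[tuple[str, int]]:
    """Return (ancestor, hops) pairs reachable within depth hops, nearest first."""
    rows = conn.execute(
        """
        WITH RECURSIVE walk(id, hops) AS (
            -- Direct parents of the starting dataset.
            SELECT source, 1 FROM lineage WHERE target = ?
            UNION
            -- Parents of anything already found, until the depth limit.
            SELECT l.source, w.hops + 1
            FROM lineage AS l
            JOIN walk AS w ON l.target = w.id
            WHERE w.hops < ?
        )
        SELECT id, MIN(hops) FROM walk GROUP BY id ORDER BY MIN(hops), id
        """,
        (dataset_id, depth),
    ).fetchall()
    return [(row[0], row[1]) for row in rows]


print(upstream("14100287"))           # [('lfs-microdata', 1), ('survey-frame', 2)]
print(upstream("14100287", depth=1))  # [('lfs-microdata', 1)]
```

The depth bound also guarantees termination if the lineage table ever contains a cycle, since the hop counter stops the recursion.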
Architecture
Three layers share one DuckDB database:
Discovery layer — Catalog (dataset, agent, concept_scheme, concept, distribution, lineage)
Structural layer — sdmxlib tables (dataflows, dsd_components, codes, codelists)
Observation layer — Polars / Parquet (actual time-series data)
pinax owns the discovery layer. sdmxlib owns the structural layer. Both write
to the same DuckDB connection — queries JOIN freely across both. Parquet files live
alongside the database and are referenced via DCAT Distribution records.
Development
just test # unit tests
just test-integration # integration tests (no network)
just test-live # live tests against real SDMX endpoints
just lint # ruff check
just typecheck # basedpyright
just docs # local docs server