A library for loading datasets and models whose metadata is provided in the DCAT-AP format.

DCAT-AP Hub

dcat-ap-hub is a Python library for working with datasets and pretrained models described using DCAT-AP metadata. It is built around a practical workflow: resolve the metadata, download the referenced artifacts, and load datasets or models through a single interface. Metadata parsing currently supports JSON-LD, obtained from direct URLs, via HTTP content negotiation, or from local files.
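
To make the content negotiation case concrete, the sketch below builds an HTTP request that asks a catalogue server for a JSON-LD representation through the Accept header. It uses only the standard library and is not dcat-ap-hub's internal code: the helper name build_jsonld_request is hypothetical, and whether a given server honours the requested media type is an assumption.

```python
from urllib.request import Request

# Registered media type for JSON-LD. Whether a catalogue server actually
# returns it on request is server-specific (assumption in this sketch).
JSONLD_MEDIA_TYPE = "application/ld+json"


def build_jsonld_request(url: str) -> Request:
    """Return a GET request that asks the server for JSON-LD.

    Hypothetical helper for illustration; not part of dcat-ap-hub.
    """
    return Request(url, headers={"Accept": JSONLD_MEDIA_TYPE}, method="GET")


if __name__ == "__main__":
    # No network access happens here; the request is only constructed.
    req = build_jsonld_request(
        "https://ki-daten.hlrs.de/de/dataset/https-piveau-io-set-data-predictive-maintenance-ttl"
    )
    print(req.get_method(), req.full_url)
    print("Accept:", req.get_header("Accept"))
```

Passing such a request to urllib.request.urlopen would perform the actual fetch. With Dataset.from_url(...) this step is presumably handled by the library itself.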

Typical Workflow

  1. Retrieve dataset metadata in DCAT-AP from:

    • remote JSON-LD URLs (Dataset.from_url(...))
    • local metadata files (Dataset.from_file(...))
    • local directories that contain metadata files (Dataset.from_directory(...))
  2. Download files referenced by distributions and related resources (dcat:downloadURL) into a local dataset directory.

  3. Load files or models for use in code:

    • Load files as a lazy FileCollection with built-in loaders for common formats such as CSV, Excel, JSON, Parquet, images, PDF, text, HTML/XML, and NumPy arrays.
    • Load pretrained models through Hugging Face, ONNX, or sklearn-style model scripts.
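
The idea behind a lazy file collection can be sketched as follows: file paths are indexed up front, and a file is parsed only when it is first accessed. This is an illustration of the concept, not the library's FileCollection. The class name LazyFiles, the loader table, and the fallback to plain text are all assumptions, and only CSV and JSON are covered.

```python
import csv
import json
from collections.abc import Iterator, Mapping
from pathlib import Path
from typing import Any, Callable, Union


def _load_csv(path: Path) -> list:
    # Parse a CSV file into a list of row dictionaries.
    with path.open(newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


def _load_json(path: Path) -> Any:
    return json.loads(path.read_text(encoding="utf-8"))


def _load_text(path: Path) -> str:
    return path.read_text(encoding="utf-8")


# Hypothetical loader table keyed by file suffix.
LOADERS: dict[str, Callable[[Path], Any]] = {
    ".csv": _load_csv,
    ".json": _load_json,
}


class LazyFiles(Mapping):
    """Map file names to parsed contents, parsing each file on first access."""

    def __init__(self, directory: Union[str, Path]) -> None:
        # Only the paths are collected here; no file is opened yet.
        self._paths = {
            p.name: p for p in sorted(Path(directory).iterdir()) if p.is_file()
        }
        self._cache: dict[str, Any] = {}

    def __getitem__(self, name: str) -> Any:
        if name not in self._cache:
            path = self._paths[name]  # raises KeyError for unknown names
            loader = LOADERS.get(path.suffix.lower(), _load_text)
            self._cache[name] = loader(path)
        return self._cache[name]

    def __iter__(self) -> Iterator[str]:
        return iter(self._paths)

    def __len__(self) -> int:
        return len(self._paths)
```

Indexing the collection, for example LazyFiles("./data")["train.csv"] with a hypothetical file name, triggers the parse for that one file and caches the result, so large directories cost nothing until their files are used.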

Benchmarking With Catalogues

Optionally, a processor script can be attached to a dataset as a related resource. The script is detected automatically and applied to transform the raw files. Because benchmarking requires each dataset to provide a fixed train-test split, and processor scripts can generate such a split, multi-dataset benchmarks can be defined as DCAT-AP catalogues.
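
As a rough illustration of what a fixed train-test split involves, the sketch below shuffles row indices with a dedicated seeded generator so that repeated runs yield the same split. The interface that dcat-ap-hub expects from a processor script is not described here, so the function name fixed_split, its parameters, and the default seed are hypothetical.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def fixed_split(
    rows: Sequence[T], test_fraction: float = 0.2, seed: int = 42
) -> tuple:
    """Split rows into (train, test) lists, reproducibly for a given seed.

    Hypothetical helper; not the processor interface of dcat-ap-hub.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    indices = list(range(len(rows)))
    # A dedicated seeded generator keeps the shuffle independent of the
    # global random state, so repeated runs on the same Python version
    # select the same test rows.
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx = set(indices[:n_test])
    # Preserve the original row order within each part.
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test
```

Fixing the seed inside the script, rather than leaving it to the caller, is what would let every user of a catalogue evaluate on the same held-out rows.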

Requirements for Metadata

  • Each dataset metadata record must include a dcat:Dataset entry.
  • Entries with @type set to mls:Model are treated as models.
  • The role of a distribution (dcat:Distribution) or related resource (rdfs:Resource) can be declared through dct:conformsTo and/or dct:format, for example to specify a model type or to mark a processor.
  • The dcat:downloadURL field identifies the files to be downloaded.
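
The requirements above can be checked with a few lines of standard-library Python. The sketch below assumes a compacted JSON-LD record with an @graph array and prefixed keys such as dcat:downloadURL. Real records may use full IRIs or a different layout, and the record shown is a made-up placeholder, not output from an actual catalogue. The function names are hypothetical and not part of dcat-ap-hub.

```python
from typing import Any


def _as_list(value: Any) -> list:
    # JSON-LD allows a single value or a list in most positions.
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def entries_of_type(graph: list, type_name: str) -> list:
    """Return all entries whose @type includes type_name."""
    return [e for e in graph if type_name in _as_list(e.get("@type"))]


def download_urls(graph: list) -> list:
    """Collect every dcat:downloadURL value found in the graph."""
    urls = []
    for entry in graph:
        for value in _as_list(entry.get("dcat:downloadURL")):
            # The value may be a plain string or an object with @id / @value.
            if isinstance(value, dict):
                value = value.get("@id") or value.get("@value")
            if value:
                urls.append(value)
    return urls


def summarize(record: dict) -> dict:
    """Validate the record and report whether it describes a model."""
    graph = _as_list(record.get("@graph"))
    if not entries_of_type(graph, "dcat:Dataset"):
        raise ValueError("record has no dcat:Dataset entry")
    return {
        "is_model": bool(entries_of_type(graph, "mls:Model")),
        "download_urls": download_urls(graph),
    }


# Hypothetical record; identifiers and URLs are placeholders.
RECORD = {
    "@graph": [
        {
            "@id": "ex:ds",
            "@type": ["dcat:Dataset", "mls:Model"],
            "dcat:distribution": {"@id": "ex:dist"},
        },
        {
            "@id": "ex:dist",
            "@type": "dcat:Distribution",
            "dcat:downloadURL": {"@id": "https://example.org/model.onnx"},
        },
    ]
}

if __name__ == "__main__":
    print(summarize(RECORD))
```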

How To Install

# Base install (datasets, processing)
pip install dcat-ap-hub

# Install with ONNX model loading support
pip install "dcat-ap-hub[onnx]"

# Install with Hugging Face model loading support
pip install "dcat-ap-hub[huggingface]"

Example of Loading a Dataset

from dcat_ap_hub import Dataset

url = "https://ki-daten.hlrs.de/de/dataset/https-piveau-io-set-data-predictive-maintenance-ttl"

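# Resolve and parse the DCAT-AP metadata behind the URL
ds = Dataset.from_url(url)
# Download the files referenced by dcat:downloadURL into ./data
files = ds.download(data_dir="./data")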

Example of Loading a Hugging Face Model

from dcat_ap_hub import Dataset

url = "https://ki-daten.hlrs.de/de/model/prajjwal1-bert-tiny"

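# Resolve the model's DCAT-AP metadata and download its files
ds = Dataset.from_url(url)
files = ds.download(data_dir="./data")
# For Hugging Face models, load_model returns the model, its processor, and metadata
model, processor, metadata = ds.load_model(model_dir="./models")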

Example of Loading a scikit-learn Model

from dcat_ap_hub import Dataset

url = "https://ki-daten.hlrs.de/de/model/https-piveau-io-set-data-pre-trained-transformer"

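# Resolve the model's DCAT-AP metadata and download its files
ds = Dataset.from_url(url)
files = ds.download(data_dir="./data")
# For sklearn-style models, load_model returns the model object only
model = ds.load_model(model_dir="./models")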

Example of Processing a Dataset When a Processor Is Available

from dcat_ap_hub import Dataset

url = "https://ki-daten.hlrs.de/de/dataset/https-piveau-io-set-data-predictive-maintenance-ttl"

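# Resolve the dataset's DCAT-AP metadata and download its files
ds = Dataset.from_url(url)
files = ds.download(data_dir="./data")
# Apply the attached processor script to the raw files, using ./processed for the results
processed = ds.process(processed_dir="./processed")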

Funding

This project was developed using resources from the HammerHAI project, an EU co-funded AI Factory initiative operated by the High-Performance Computing Center Stuttgart and supported by the European Commission as well as German federal and state ministries. It is funded by the European High Performance Computing Joint Undertaking under Grant Agreement No. 101234027.
