

BONSAI dataio

The BONSAI dataio Python package is a part of the Getting The Data Right project.

The dataio package is designed to facilitate the management of data resources through easy-to-use APIs for reading, writing, and updating data in various formats, with a focus on maintaining comprehensive metadata about each data resource.

Read the BONSAI data architecture for background information on table types, foreign keys, and related concepts.

Installation

To install the dataio package as a standalone package, run:

pip install bonsai_dataio

Or to install a specific version:

pip install git+ssh://git@gitlab.com/bonsamurais/bonsai/util/dataio@<version>

dataio uses HDF5 stores for storing matrices. You may need to install HDF5 system-wide before you can install dataio. On macOS you can do this with Homebrew: brew install hdf5.

To install dataio as dependency of another package, add it to the field install_requires in setup.cfg:

util_dataio @ git+ssh://git@gitlab.com/bonsamurais/bonsai/util/dataio@<version>

You can find the list of versions here.

Key Features

  • Resource Management: Manage your data resources with a structured CSV repository that supports adding, updating, and listing data resources, or use the API to make calls to the PostgreSQL database.
  • Data Validation: Validate data against predefined schemas before it is saved to ensure integrity.
  • Data Retrieval and Storage: Easily retrieve and store dataframes directly from/to specific tasks and data resources.

Usage

Setting Up the Environment

Before using the dataio package, set the BONSAI_HOME environment variable to point to your project's home directory where data resources will be managed:

import os
from pathlib import Path

os.environ["BONSAI_HOME"] = str(Path("path/to/your/data").absolute())

If you don't want to set this variable, you must provide an absolute path when setting up your resource file, and ensure that all locations inside that resource file are absolute as well.

The exception is when you interact with the data through the online API; in that case you don't need to set the environment variable either.

Creating a Resource Repository

Instantiate a resource repository to manage your data resources:

from dataio.resources import ResourceRepository

When creating your resource repository you can choose between local storage and the database APIs via the storage_method parameter. To use local CSV file storage:

repo = ResourceRepository(db_path=Path("path/to/your/data"), storage_method="local")

If you want to access the database, you need to supply either a username and password combination or an authenticated access token. A walkthrough of how to become a user and get a token can be found here: https://lca.aau.dk/api/docs/. The API automatically caches the latest three calls at "./data_cache/"; these defaults can be changed with the MAX_CACHE_FILES and cache_dir parameters respectively.

repo = ResourceRepository(
    storage_method="api",
    API_token="some_token",
    username="some_username",
    password="some_password",
    cache_dir="some_cache_dir",
    MAX_CACHE_FILES="3",
)

Adding a New Resource

Add new resources to the repository:

from dataio.schemas.bonsai_api import DataResource
from datetime import date

resource = DataResource(
    name="new_resource",
    schema_name="Schema",
    location="relative/or/absolute/path/to/the/resource.csv",
    task_name="task1",
    stage="collect",
    data_flow_direction="input",
    data_version="1.0.1",
    code_version="2.0.3",
    comment="Initial test comment",
    last_update=date.today(),
    created_by="Test Person",
    dag_run_id="12345",
    api_endpoint=None,
)

repo.add_to_resource_list(resource)

Not all fields need to be set. The schema name needs to correspond to one of the schema names defined in dataio.schemas.

The locations of resources in the local ResourceRepository are all relative to the location of the resources.csv file! This is the path provided when initializing the repository.
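As a sketch of how this convention works, relative locations are resolved against the repository's base directory while absolute locations are used as-is (resolve_location is a hypothetical helper for illustration, not part of the dataio API):

```python
from pathlib import Path

def resolve_location(db_path: Path, location: str) -> Path:
    """Resolve a resource location against the resources.csv directory.

    Hypothetical helper illustrating the convention described above;
    not part of the dataio API.
    """
    loc = Path(location)
    # Absolute locations are used as-is; relative ones are joined
    # onto the repository's base directory.
    return loc if loc.is_absolute() else db_path / loc

base = Path("/data/bonsai")
print(resolve_location(base, "collect/flows.csv"))  # /data/bonsai/collect/flows.csv
print(resolve_location(base, "/abs/flows.csv"))     # /abs/flows.csv
```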

When adding resources to the database, a UUID in the id column is added automatically; the database uses it for foreign keys. If not specified by the user, a default api_endpoint is assigned based on the schema. You can find all the endpoints associated with a schema through schema.get_api_endpoint.

Updating an Existing Resource

Update an existing resource in your repository:

resource.created_by = "New Name"
repo.update_resource_list(resource)

Retrieving Resource Information

Retrieve specific resource information using filters:

result = repo.get_resource_info(name="new_resource")
print(result)

Writing and Reading Data

You can store and read data using different file formats. How data is stored depends on the file extension used in the location field. The location field is always relative to the resources.csv file; please don't put absolute paths there.

The last_update field is set automatically by dataio. Please don't overwrite this field.

Currently the following data formats are supported:

  • dictionaries: .json, .yaml
  • tabular data: .parquet, .xlsx, .xls, .csv, .pkl
  • matrices: .hdf5, .h5

Note: matrices need to use a MatrixModel schema.
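The extension-based dispatch described above can be sketched as follows. This is a simplified illustration of the idea, not dataio's actual implementation, and covers only two of the listed formats:

```python
import csv
import json
from pathlib import Path

def write_json(path: Path, data: dict) -> None:
    path.write_text(json.dumps(data))

def write_csv(path: Path, data: dict) -> None:
    # Minimal single-row CSV writer, for illustration only.
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(data))
        writer.writeheader()
        writer.writerow(data)

# Map file suffixes to writer callables; dataio's real dispatch
# also covers .parquet, .xlsx, .hdf5, etc.
WRITERS = {".json": write_json, ".csv": write_csv}

def write_resource(path: Path, data: dict) -> None:
    try:
        WRITERS[path.suffix](path, data)
    except KeyError:
        raise ValueError(f"Unsupported format: {path.suffix}")
```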

Write data to a local resource and then read it back to verify:

import pandas as pd

data_to_add = pd.DataFrame({
    "flow_code": ["FC100"],
    "description": ["Emission from transportation"],
    "unit_reference": ["unit"],
    "region_code": ["US"],
    "value": [123.45],
    "unit_emission": ["tonnes CO2eq"],
})

repo.write_dataframe_for_task(
    name="new_resource",
    data=data_to_add,
    task_name="footprint_calculation",
    stage="calculation",
    location="calculation/footprints/{version}/footprints.csv",
    schema_name="Footprint",
    data_flow_direction="output",
    data_version="1.0",
    code_version="1.1",
    comment="Newly added data for emissions",
    created_by="Test Person",
    dag_run_id="run200",
    api_endpoint=None,
)
# Read the data back
retrieved_data = repo.get_dataframe_for_task("new_resource")
print(retrieved_data)
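Note the {version} placeholder in the location above. Assuming dataio substitutes the resource's data_version into it via standard string formatting (an assumption based on the field names, not confirmed by this documentation), the resolved path would look like this:

```python
# {version} placeholder substitution, sketched with plain str.format
location = "calculation/footprints/{version}/footprints.csv"
data_version = "1.0"
resolved = location.format(version=data_version)
print(resolved)  # calculation/footprints/1.0/footprints.csv
```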

When loading data from the API, an api_endpoint is needed. It is added automatically when a resource is added to the database and designates exactly which API endpoint to use.

Loading data into Bonsai classification


The package contains several tools to convert data into a Bonsai-aligned format.

Certain data fields need to be mapped to Bonsai classification using correspondence tables provided by the classifications package. This is done by convert_dataframe_to_bonsai_classification(). The mapping follows this logic:

  • one-to-one correspondence: directly use the corresponding code.
  • many-to-one correspondence: sum the values of the data entry.
  • one-to-many correspondence: create a composite type of Bonsai code (e.g. 'ai_10|ai_12').
  • many-to-many correspondence: these fields are ignored and left as-is. They need to be dealt with by hand after loading the data.
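The first three rules can be sketched with a small pandas example. This is a simplified illustration of the mapping logic, not the convert_dataframe_to_bonsai_classification implementation, and the source codes and correspondence table are made up:

```python
import pandas as pd

# Hypothetical source data keyed by a non-Bonsai activity code.
data = pd.DataFrame({
    "src_code": ["A1", "A2", "A3", "A4"],
    "value": [10.0, 5.0, 7.0, 3.0],
})

# Hypothetical correspondence table: A1 -> one code (one-to-one),
# A2 and A3 -> the same code (many-to-one), A4 -> two codes (one-to-many).
corr = {
    "A1": ["ai_01"],
    "A2": ["ai_02"],
    "A3": ["ai_02"],
    "A4": ["ai_10", "ai_12"],
}

# One-to-many targets become a composite code like 'ai_10|ai_12';
# many-to-one entries map to the same code and are summed afterwards.
data["bonsai_code"] = data["src_code"].map(lambda c: "|".join(corr[c]))
result = data.groupby("bonsai_code", as_index=False)["value"].sum()
print(result)
```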

Configuration files

dataio has three config files located in src/config:

  • global.config.yaml: config for system variables
  • data_attributes.config.yaml: config for variables related to the data
  • airflow_attributes.config.json: config related to airflow

Read the readme.md in src/config for more explanation of the values in airflow_attributes.config.json.

To use a local YAML config file (other than the one for airflow), point at it via the config_path argument of Config():

config_file_path = "_path_to_config_yaml_"
task_name = "current_task_name"
Config(task_name, config_path=config_file_path) 

Testing

To ensure everything is working as expected, run the provided test suite:

pytest tests -vv

or

tox

This will run through a series of automated tests, verifying the functionality of adding, updating, and retrieving data resources, as well as reading and writing data based on resource descriptions.

Contributions

Contributions to the dataio package are welcome. Please ensure to follow the coding standards and write tests for new features. Submit pull requests to our repository for review.
