Dataset API Client
This is the API client of the Open Innovation MLOps Platform - Dataset.
Welcome to the documentation for the Open Innovation Dataset API Client! This guide provides an overview of the Dataset API client library, its installation, usage, and available methods. To get started, install the library using pip. Once installed, import the library into your Python project, then initialize the API client by providing the API server's hostname and an access token.
Installation and Setup
To install the Open Innovation Dataset API Client, follow these steps:
- Install the required dependencies by running the following command:
pip install oip-dataset-client
- Import the DatasetClient class into your Python project:
from oip_dataset_client.dataset import DatasetClient
Initialization
To set up the Dataset API Client, you'll need to connect to the Dataset Server:
api_host = "http://192.168.1.35:8000" # host of the server
api_key = "72e3f81c-8c75-4f88-9358-d36a3a50ef36" # api-key of the user
workspace_name = "default_workspace" # workspace name
DatasetClient.connect(api_host=api_host, api_key=api_key, workspace_name=workspace_name)
Parameters
- api_host (str, required): The hostname of the Dataset API server.
- api_key (str, required): Your API authentication token.
- workspace_name (str, required): Your workspace name.
You can obtain both the API server's hostname and your API key through our MLOps Web UI.
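If you prefer not to hardcode credentials, you can read them from the environment before connecting. A minimal sketch, assuming our own variable names (OIP_API_HOST, OIP_API_KEY, OIP_WORKSPACE), which are not part of the client:

```python
import os

def load_client_config(env=os.environ):
    """Read connection settings from the environment, with safe defaults.

    The variable names used here are our own convention, not the client's.
    """
    return {
        "api_host": env.get("OIP_API_HOST", "http://localhost:8000"),
        "api_key": env.get("OIP_API_KEY", ""),
        "workspace_name": env.get("OIP_WORKSPACE", "default_workspace"),
    }
```

The resulting dict can then be unpacked into the connect call, e.g. DatasetClient.connect(**load_client_config()).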
DatasetClient Methods
The DatasetClient class serves as the central component of the oip_dataset_client. It acts as the core interface through which the client interacts with and manages datasets throughout their lifecycle, including creating, adding files, uploading, downloading, and finalizing. The upcoming section will thoroughly cover the essential methods in this class.
create
Create a new dataset. Datasets serve as organized collections of data files. You have the flexibility to specify one or multiple parent datasets, allowing the newly created dataset to inherit all the files associated with its parent(s). You can also define a version for your dataset; this versioning enhances traceability and makes it easier to manage different iterations of your datasets. Furthermore, you can specify whether this dataset serves as the main version among others, and apply descriptive tags to categorize and describe the dataset's content.
my_dataset = DatasetClient.create(
    name="my_dataset",
    parent_datasets=[
        "8fb30519-8326-4f38-aa53-83ef35b65e6a",
        "a2c0c7b1-bb5f-49a1-8a47-2e1679a726bb",
    ],
    version="2.0.1",
    is_main=True,
    tags=["testing", "CSV", "NASA"],
    description="a CSV testing dataset from NASA",
)
Parameters
- name (str, required): Name of the new dataset.
- parent_datasets (list[str], optional): A list of parent datasets; the new dataset inherits all the files from each of its parents.
- version (str, optional): Version of the new dataset. If no version is specified at creation, the dataset's first version defaults to 1.0.0; for subsequent versions, the highest existing semantic version is automatically incremented.
- is_main (bool, optional): True if the new dataset is the main version.
- tags (list[str], optional): Descriptive tags that categorize datasets by keywords detailing their subject matter, domain, or specific topics, for better identification and organization.
- description (str, optional): A brief description of the new dataset.
Returns
- None: Does not return any results.
Raises
- ValueError: If the name is empty or None.
- ValueError: If any of the parent_datasets is not completed.
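The default-versioning rule described above can be sketched as follows. This is an illustration only; we assume the patch component of the highest existing version is incremented, which may differ from the client's actual increment rule:

```python
def next_default_version(existing_versions):
    """Sketch of automatic version assignment for a new dataset."""
    # The first version of a dataset defaults to 1.0.0
    if not existing_versions:
        return "1.0.0"
    # Find the highest semantic version by comparing numeric tuples
    as_tuple = lambda v: tuple(int(p) for p in v.split("."))
    major, minor, patch = as_tuple(max(existing_versions, key=as_tuple))
    # Assumed behaviour: bump the patch component of the highest version
    return f"{major}.{minor}.{patch + 1}"
```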
add_files
You can add one or multiple files to the newly created dataset. When adding files, a thorough check is performed to ensure data integrity: the system checks whether each file already exists in the dataset, either because it has already been uploaded or because it exists in one of the dataset's parents. If a file is not present, or exists with different content, it is promptly added; files that are found to be identical are simply ignored. This streamlined process ensures efficient management of your dataset's contents.
my_dataset.add_files(path="absolute_path_to_the_files")
Parameters
- path (str, required): Path to the files to add to the dataset.
- wildcard (Union[str, List[str]], optional): Add a selective set of files using wildcard matching, given as a single string or a list of wildcard patterns.
- local_base_folder (str, optional): Files will be located based on their relative path from local_base_folder.
- dataset_path (str, optional): Relative path in the dataset where the folder/files should be located.
- recursive (bool, optional): If True, match all wildcard files recursively. Defaults to True.
- max_workers (int, optional): The number of threads used to add the files. Defaults to the number of logical cores.
Returns
- int: The number of files that have been added.
- int: The number of files that have been modified.
Raises
- Exception: If the dataset is in a final state (completed, aborted, or failed).
- ValueError: If the specified path to the files does not exist.
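A rough sketch of how wildcard plus recursive selection can behave, using Python's standard fnmatch module. This illustrates the matching semantics only and is not the client's actual implementation:

```python
import fnmatch
import os

def select_files(base_folder, wildcard=None, recursive=True):
    """Return relative paths under base_folder matching the wildcard(s)."""
    if isinstance(wildcard, str):
        wildcard = [wildcard]
    selected = []
    for root, _dirs, files in os.walk(base_folder):
        # Non-recursive mode only considers the top-level folder
        if not recursive and root != base_folder:
            continue
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), base_folder)
            if wildcard is None or any(fnmatch.fnmatch(rel, w) for w in wildcard):
                selected.append(rel)
    return sorted(selected)
```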
remove_files
You have the flexibility to remove one or multiple files from the dataset. This feature is particularly valuable when you need to eliminate specific files from the parent datasets.
my_dataset.remove_files(path="relative_path_to_the_files")
Parameters
dataset_path
(str, required): path to the files to remove. The path is always relative to the dataset (e.gfolder/file.bin
).recursive
(bool, optional): If True, match all wildcard files recursively. Defaults to True.
Returns
- int: The number of files that have been removed.
Raises
- Exception: If the dataset is in a final state (completed, aborted, or failed).
upload
After adding or removing files, you can proceed to the upload step, where you upload all the files to a storage provider. It's important to note that only files that haven't been uploaded yet are included in this process. In other words, this operation covers the direct files of the dataset, excluding the parent files, as those are already uploaded.
my_dataset.upload()
Returns
- None: Does not return any results.
Raises
- Exception: If the dataset is in a final state (completed, aborted, or failed).
- Exception: If the upload failed.
finalize
Once all the files in the dataset have been successfully uploaded, you can proceed to finalize the dataset. It's important to note that once a dataset is finalized, no further operations can be performed on it.
# if the files are not uploaded yet,
# we can use auto_upload=True, i.e. my_dataset.finalize(auto_upload=True)
my_dataset.finalize()
Parameters
- auto_upload (bool, optional): Automatically upload the dataset if it is not uploaded yet. Defaults to False.
Returns
- None: Does not return any results.
Raises
- Exception: If there is a pending upload.
- Exception: If the dataset's status is not valid for finalization.
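Putting the lifecycle rules together, the operations above can be thought of as gated by a small state check. This is our own sketch of the rules described in this guide, not the client's internal model:

```python
# Final states after which no further operations are accepted
FINAL_STATES = {"completed", "aborted", "failed"}

def can_modify(status):
    """Sketch: add_files, remove_files, upload and finalize are rejected
    once a dataset has reached a final state."""
    return status not in FINAL_STATES
```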
get
Get a specific Dataset. If multiple datasets are found, the dataset with the highest semantic version is returned.
my_dataset = DatasetClient.get(dataset_name="my_dataset")
Parameters
- dataset_id (str, optional): Requested dataset ID.
- dataset_name (str, optional): Requested dataset name.
- dataset_version (str, optional): Requested version of the dataset.
- only_completed (bool, optional): Return only a completed dataset.
- auto_create (bool, optional): If the search result is empty and the filter is based on dataset_name, create a new dataset.
Returns
- Dataset: A Dataset object.
Raises
- ValueError: If the selection criteria are not met (an ID or name was not provided correctly).
- ValueError: If the query result is empty, i.e. no dataset matching the provided selection criteria could be found.
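The "highest semantic version" rule implies per-component numeric comparison. A quick illustration of why plain string ordering would get this wrong:

```python
def semver_key(version):
    # "1.10.0" -> (1, 10, 0): compare components numerically, not lexically
    return tuple(int(part) for part in version.split("."))

versions = ["1.2.0", "1.10.0", "1.9.3"]
latest = max(versions, key=semver_key)  # numeric comparison picks 1.10.0
wrong = max(versions)                   # lexical comparison would pick 1.9.3
```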
get_local_copy (internal dataset)
After finalizing the dataset, you have the option to download a local copy of it for further use. This local copy includes all the files of the dataset, including the parent dataset files, all conveniently placed in a single folder. If the dataset version is not explicitly specified in the download parameters, the dataset with the highest semantic version will be used.
my_dataset = DatasetClient.get(dataset_name="my_dataset")
my_dataset.get_local_copy()
Returns
- None: Does not return any results.
Raises
- Exception: If the dataset is in a final state (completed, aborted, or failed).
- Exception: If we are unable to unzip a compressed file.
- Exception: If we encounter a failure while attempting to copy a file from a source folder to a target folder.
get_local_copy (external dataset)
To include external files in your dataset (files that are present neither locally nor in our internal datasets), you should download them first. Once downloaded, you can create your dataset and add the files you've obtained.
from oip_dataset_client.dataset import DatasetClient
from oip_dataset_client.StorageManager import StorageManager

cifar_path = "https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
local_path = StorageManager.get_local_copy(remote_url=cifar_path)
my_dataset = DatasetClient.create(name="my_dataset")
# add files
my_dataset.add_files(path=local_path)
my_dataset.upload()
# if the files are not uploaded yet,
# we can use auto_upload=True, i.e. my_dataset.finalize(auto_upload=True)
my_dataset.finalize()
Parameters
- remote_url (str, required): Dataset URL.
- target_folder (str, optional): The local directory where the dataset will be downloaded.
- extract_archive (bool, optional): If True and the file is compressed, extract it. Defaults to True.
Returns
- str: Path to the downloaded file.
Raises
- Exception: If we encounter a failure while attempting to download the requested file.
- Exception: If we are unable to unzip a compressed file.
Hashes for oip_dataset_client-0.0.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | b2414fa5d291ae909ab690c7ae035bc9cb8aab7d07c31429a1bf8976a25837e2
MD5 | 7fdcbbf74faea91f4fba1b0b65417a47
BLAKE2b-256 | fff5d12d01401af036275b6c88903e3f3453b2cd9bc96c0693180015b3e35332