
This is the API client of Open Innovation MLOps Platform - Dataset

Project description

Dataset API Client

Welcome to the documentation for the Open Innovation Dataset API Client! This guide provides an overview of the Dataset API client library, its installation, usage, and available methods. To get started, install the library using pip. Once installed, import the library into your Python project. Initialize the API client by providing the API server's hostname and an access token.

Installation and Setup

To install the Open Innovation Dataset API Client, follow these steps:

  1. Install the required dependencies by running the following command:

    pip install oip-dataset-client
    
  2. Import the DatasetClient class into your Python project:

    from oip_dataset_client.dataset import DatasetClient
    

Initialization

To set up the Dataset API Client, you'll need to connect to the Dataset Server:

api_host = "http://192.168.1.35:8000" # host of the server
api_key = "72e3f81c-8c75-4f88-9358-d36a3a50ef36" # api-key of the user
workspace_name = "default_workspace" # workspace name

DatasetClient.connect(api_host=api_host, api_key=api_key, workspace_name=workspace_name)

Parameters

  • api_host (str, required): The hostname of the Dataset API server.
  • api_key (str, required): Your API authentication key.
  • workspace_name (str, required): Your workspace name.

You can obtain both the Dataset API server host and your API key through our MLOps Web UI.

DatasetClient Methods

The DatasetClient class is the central component of the oip_dataset_client. It is the core interface through which the client interacts with and manages datasets throughout their lifecycle: creating them, adding files, uploading, downloading, and finalizing. The following sections cover the essential methods of this class.
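
Before going method by method, here is a minimal end-to-end sketch of that lifecycle. It assumes the client is already connected as shown in Initialization above; the dataset name and file path are placeholders.

# Create a dataset, add local files, upload them, and finalize it.
my_dataset = DatasetClient.create(name="my_dataset")
my_dataset.add_files(path="/absolute/path/to/files")
my_dataset.upload()
my_dataset.finalize()

# Later, fetch the dataset and download a local copy of its files.
fetched = DatasetClient.get(dataset_name="my_dataset")
fetched.get_local_copy()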

create

Create a new dataset. Datasets are organized collections of data files. You can specify one or multiple parent datasets, so that the newly created dataset inherits all the files associated with its parent(s). You can also define a version for the dataset; versioning improves traceability and makes it easier to manage different iterations of your datasets. Furthermore, you can mark the dataset as the main version among others and apply descriptive tags to categorize and detail its content.

my_dataset = DatasetClient.create(
    name="my_dataset",
    parent_datasets=[
        "8fb30519-8326-4f38-aa53-83ef35b65e6a",
        "a2c0c7b1-bb5f-49a1-8a47-2e1679a726bb",
    ],
    version="2.0.1",
    is_main=True,
    tags=["testing", "CSV", "NASA"],
    description="a CSV testing dataset from NASA",
)

Parameters

  • name (str, required): Name of the new dataset.
  • parent_datasets (list[str], optional): A list of parent datasets; the new dataset is extended with all the files from each parent.
  • version (str, optional): Version of the new dataset. If no version is specified during creation, the default version is 1.0.0 for the dataset's first version; for subsequent versions, the highest existing semantic version is automatically incremented (see the sketch at the end of this section).
  • is_main (bool, optional): True if the new dataset is the main version.
  • tags (list[str], optional): Descriptive tags that categorize the dataset by keywords (subject matter, domain, or specific topics) for easier identification and organization.
  • description (str, optional): A brief description of the new dataset.

Returns

  • Dataset: The newly created dataset object.

Raises

  • ValueError: If the name is empty or None.
  • ValueError: If any of the parent_datasets is not completed.
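
To illustrate the default versioning behavior described above, here is a short sketch; the dataset name is a placeholder, and the version strings assume the auto-increment rule documented for the version parameter.

# No version specified: the first version of "cars" defaults to 1.0.0.
cars_v1 = DatasetClient.create(name="cars")

# An explicit version can also be set and flagged as the main version;
# omitting it here would auto-increment the highest existing version instead.
cars_v2 = DatasetClient.create(name="cars", version="2.0.0", is_main=True)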

add_files

You can add one or multiple files to the newly created dataset. When adding files, a thorough check is performed to ensure data integrity: if a file has already been uploaded, or already exists with identical content in the dataset or one of its parents, it is simply ignored. If the file is not present, or exists with different content, it is added. This streamlined process ensures efficient management of your dataset's contents.

my_dataset.add_files(path="absolute_path_to_the_files")
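
Selective adds are also possible. The sketch below combines the optional wildcard, dataset_path, and recursive parameters documented in the list that follows; paths and patterns are placeholders.

# Add only CSV files, matched recursively under the given path,
# and place them under data/raw/ inside the dataset.
my_dataset.add_files(
    path="/absolute/path/to/local_folder",
    wildcard=["*.csv"],
    dataset_path="data/raw",
    recursive=True,
)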

Parameters

  • path (str, required): Path to the files we want to add to our dataset.
  • wildcard (Union[str, List[str]], optional): Add a selective set of files by using wildcard matching, which can be a single string or a list of wildcard patterns.
  • local_base_folder (str, optional): Files will be located based on their relative path from local_base_folder.
  • dataset_path (str, optional): Relative path within the dataset where the folder/files should be located.
  • recursive (bool, optional): If True, match all wildcard files recursively. Defaults to True.
  • max_workers (int, optional): The number of threads used to add the files. Defaults to the number of logical cores.

Returns

  • int: The number of files that have been added.
  • int: The number of files that have been modified.

Raises

  • Exception: If the dataset is in a final state, which includes completed, aborted, or failed.
  • ValueError: If the specified path to the files does not exist.

remove_files

You have the flexibility to remove one or multiple files from the dataset. This feature is particularly valuable when you need to eliminate specific files from the parent datasets.

my_dataset.remove_files(dataset_path="relative_path_to_the_files")

Parameters

  • dataset_path (str, required): Path to the files to remove. The path is always relative to the dataset (e.g. folder/file.bin).
  • recursive (bool, optional): If True, match all wildcard files recursively. Defaults to True.

Returns

  • int: The number of files that have been removed.

Raises

  • Exception: If the dataset is in a final state, which includes completed, aborted, or failed.

upload

After adding or removing files, you can proceed to the upload step, where you upload all the files to a storage provider. It's important to note that only files that haven't been uploaded yet are included in this process. In other words, this operation covers the direct files of the dataset, excluding the parent files, as those are already uploaded.

my_dataset.upload()

Returns

  • none: Does not return any results.

Raises

  • Exception: If the dataset is in a final state, which includes completed, aborted, or failed.
  • Exception: If the upload failed.

finalize

Once all the files in the dataset have been successfully uploaded, you can proceed to finalize the dataset. It's important to note that once a dataset is finalized, no further operations can be performed on it.

# if the files are not uploaded yet, we can pass auto_upload=True,
# i.e. my_dataset.finalize(auto_upload=True)
my_dataset.finalize()

Parameters

  • auto_upload (bool, optional): Automatically upload dataset if not uploaded yet. Defaults to False.

Returns

  • none: Does not return any results.

Raises

  • Exception: If there is a pending upload.
  • Exception: If the dataset's status is not valid for finalization.

get

Get a specific Dataset. If multiple datasets are found, the dataset with the highest semantic version is returned.

my_dataset = DatasetClient.get(dataset_name="my_dataset")
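
If several versions of the dataset exist, the optional parameters documented below let you pin a specific version or restrict the lookup to completed datasets; a short sketch, with the version string as a placeholder:

# Fetch a specific, completed version instead of the highest semantic version.
my_dataset_v1 = DatasetClient.get(
    dataset_name="my_dataset",
    dataset_version="1.0.0",
    only_completed=True,
)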

Parameters

  • dataset_id (str, optional): Requested dataset ID.
  • dataset_name (str, optional): Requested dataset name.
  • dataset_version (str, optional): Requested version of the Dataset.
  • only_completed (bool, optional): Return only the completed dataset.
  • auto_create (bool, optional): If the search result is empty and the filter is based on the dataset_name, create a new dataset.

Returns

  • Dataset: Returns a Dataset object.

Raises

  • ValueError: If the selection criteria are not met (no valid dataset_id or dataset_name was provided).
  • ValueError: If the query result is empty, i.e. no dataset matching the provided selection criteria could be found.

get_local_copy (internal dataset)

After finalizing the dataset, you have the option to download a local copy of it for further use. This local copy includes all the files of the dataset, including the parent dataset files, all conveniently placed in a single folder. If the dataset version is not explicitly specified in the download parameters, the dataset with the highest semantic version will be used.

my_dataset = DatasetClient.get(dataset_name="my_dataset")
my_dataset.get_local_copy()

Returns

  • none: Does not return any results.

Raises

  • Exception: If the dataset is not in a completed state.
  • Exception: If we are unable to unzip a compressed file.
  • Exception: If we encounter a failure while attempting to copy a file from a source folder to a target folder.

get_local_copy (external dataset)

To include external files in your dataset (files that aren't present either locally or in our internal datasets), you should download them initially. Once downloaded, you can proceed to create your dataset and add the files you've obtained.

from oip_dataset_client.dataset import DatasetClient
from oip_dataset_client.StorageManager import StorageManager

cifar_path = "https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
local_path = StorageManager.get_local_copy(remote_url=cifar_path)
my_dataset = DatasetClient.create(name="my_dataset")
# add files
my_dataset.add_files(path=local_path)
my_dataset.upload()
# if the files are not uploaded yet, we can pass auto_upload=True,
# i.e. my_dataset.finalize(auto_upload=True)
my_dataset.finalize()
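
The download step itself accepts the optional target_folder and extract_archive parameters documented below. A minimal sketch, reusing cifar_path from the example above; the target folder is a placeholder.

# Download the archive into a specific folder without extracting it.
archive_path = StorageManager.get_local_copy(
    remote_url=cifar_path,
    target_folder="/absolute/path/to/downloads",
    extract_archive=False,
)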

Parameters

  • remote_url (str, required): URL of the remote file/dataset to download.
  • target_folder (str, optional): The local directory where the dataset will be downloaded.
  • extract_archive (bool, optional): If true, and the file is compressed, proceed to extract it. Defaults to True.

Returns

  • str: path to the downloaded file.

Raises

  • Exception: If we encounter a failure while attempting to download the requested file.
  • Exception: If we are unable to unzip a compressed file.
