Command line interface for all things Cleanlab Studio
cleanlab-cli
Command line interface for all things Cleanlab Studio.
The CLI currently supports generating dataset schemas, uploading datasets to Cleanlab Studio, and downloading cleansets from Cleanlab Studio.
Installation
You can install the Cleanlab Studio CLI from PyPI with:
pip install cleanlab-cli
If you already have the CLI installed and wish to upgrade to the latest version, run:
pip install --upgrade cleanlab-cli
Workflow
Uploading datasets to Cleanlab Studio is a two-step process.
- The CLI generates a schema describing the dataset's data types and feature types, which you then verify.
- Based on this schema, the dataset is parsed and uploaded to Cleanlab Studio.
Upload a dataset
To upload a dataset without first generating a schema (i.e. Cleanlab will suggest one for you):
cleanlab dataset upload -f [dataset filepath]
You will be asked to "Specify your dataset modality (text, tabular):".
- Enter text to find only label errors based on a single column of text in your dataset.
- Enter tabular to find data and label issues based on any subset of the column features.
To upload a dataset with a schema:
cleanlab dataset upload -f [dataset filepath] -s [schema filepath]
To resume uploading a dataset whose upload was interrupted:
cleanlab dataset upload -f [dataset filepath] --id [dataset ID]
A dataset ID is generated and printed to the terminal the first time the dataset is uploaded. It can also be accessed by visiting https://app.cleanlab.ai/datasets and selecting 'Resume' for the relevant dataset.
Generate dataset schema
To generate a dataset schema (prior to uploading your dataset):
cleanlab dataset schema generate -f [dataset filepath]
- For Id column:, enter the name of the column in your dataset that contains the ID of each row.
- For Modality (text, tabular):, enter text to find only label errors based on a single column of text; otherwise enter tabular to find data and label issues based on any subset of the column features.
To validate an existing schema, i.e. check that it is complete, well-formatted, and has data types with sensible feature types:
cleanlab dataset schema validate -s [schema filepath]
You may then wish to inspect the generated schema to check that the fields and metadata are correct.
Download clean labels
To download clean labels (i.e. labels that have been fixed through the Cleanlab Studio interface):
cleanlab cleanset download --id [cleanset ID]
To download clean labels and combine them with your local dataset:
cleanlab cleanset download --id [cleanset ID] -f [dataset filepath]
Commands
cleanlab login
authenticates you
Authenticates you when uploading datasets to Cleanlab Studio. Pass in your API key using --key [API key]. Your API key can be accessed at https://app.cleanlab.ai/upload.
cleanlab dataset schema generate
generates dataset schemas
Generates a schema based on your dataset. Specify your target dataset with --filepath [dataset filepath]. You will be prompted to save the generated schema JSON and to specify a save location. The save location can also be given using --output [output filepath].
cleanlab dataset schema validate
validates a schema JSON file
Validates a schema JSON file, checking that a schema is complete, well-formatted, and has data types with sensible feature types. Specify your target schema with --schema [schema filepath].
You may also validate an existing schema with respect to a dataset (-d [dataset filepath]). This runs all of the checks above plus an additional check that all fields in the schema are present in the dataset.
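The dataset-aware check can be pictured with a short Python sketch. This is not the CLI's implementation: missing_fields is a hypothetical helper, petal_count is a made-up field, and the schema layout is assumed to follow the example in the Schema section.

```python
import csv
import io


def missing_fields(schema, dataset_columns):
    """Return schema field names that do not appear among the dataset columns."""
    columns = set(dataset_columns)
    return [name for name in schema["fields"] if name not in columns]


# Hypothetical schema and dataset header, for illustration only.
schema = {
    "metadata": {"id_column": "flower_id", "modality": "tabular", "name": "flowers.csv"},
    "fields": {
        "flower_id": {"data_type": "string", "feature_type": "identifier"},
        "width": {"data_type": "integer", "feature_type": "numeric"},
        "petal_count": {"data_type": "integer", "feature_type": "numeric"},
    },
}
header = next(csv.reader(io.StringIO("flower_id,width,length,color,species\n")))

# petal_count is declared in the schema but absent from the dataset header.
print(missing_fields(schema, header))
```

An empty list would mean every schema field has a matching dataset column.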
cleanlab dataset upload
uploads your dataset
Uploads your dataset to Cleanlab Studio. Specify your target dataset with --filepath [dataset filepath]. You will be prompted for further details about the dataset's modality and ID column. These may be supplied to the command with --modality [modality] and --id-column [name of ID column]. You may also specify a custom dataset name with --name [custom dataset name].
After uploading your dataset, you will be prompted to save the list of dataset issues (if any) encountered during the upload process. These issues include missing IDs, duplicate IDs, missing values, and values whose types do not match the schema. You may specify the save location with --output [output filepath].
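To make two of these issue types concrete, here is a minimal Python sketch that finds missing and duplicate IDs in a list of rows before uploading. It only illustrates the idea and is not how the CLI detects issues; find_id_issues and the sample rows are hypothetical.

```python
from collections import Counter


def find_id_issues(rows, id_column):
    """Return (indices of rows with a missing ID, IDs that occur more than once)."""
    missing = [i for i, row in enumerate(rows) if row.get(id_column) in (None, "")]
    counts = Counter(
        row[id_column] for row in rows if row.get(id_column) not in (None, "")
    )
    duplicates = sorted(id_ for id_, n in counts.items() if n > 1)
    return missing, duplicates


# Hypothetical rows: one repeated ID and one empty ID.
rows = [
    {"flower_id": "flower_01", "species": "rose"},
    {"flower_id": "flower_02", "species": "lily"},
    {"flower_id": "flower_02", "species": "tulip"},
    {"flower_id": "", "species": "iris"},
]
missing, duplicates = find_id_issues(rows, "flower_id")
print(missing)     # [3]
print(duplicates)  # ['flower_02']
```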
cleanlab cleanset download
downloads Cleanlab columns from your cleanset
Cleansets are initialized through the Cleanlab Studio interface. In a cleanset, users can inspect their dataset and verify their labels. Clean labels are the labels after this set of manual fixes has been applied.
This command downloads the clean labels and saves them locally as a .csv, .xls/.xlsx, or .json file, with columns id and clean_label. Include --filepath [dataset filepath] to combine the clean labels with the original dataset as a new column clean_label, which will be written to --output [output filepath]. Include the --all flag to include all Cleanlab columns (issue, label quality, suggested label, clean label) instead of only the clean label column.
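Conceptually, combining clean labels with a dataset is a join on the ID column. The Python sketch below shows that join on two small in-memory CSVs. It is an illustration under assumptions, not the CLI's code: combine_clean_labels is a hypothetical helper, the sample labels are made up, and filling unmatched rows with an empty string is a choice made only for this sketch.

```python
import csv
import io


def combine_clean_labels(dataset_rows, clean_rows, id_column):
    """Add a clean_label value to each dataset row, matched on the ID column."""
    clean_by_id = {row["id"]: row["clean_label"] for row in clean_rows}
    combined = []
    for row in dataset_rows:
        new_row = dict(row)
        # Rows without a downloaded label get an empty string (sketch assumption).
        new_row["clean_label"] = clean_by_id.get(row[id_column], "")
        combined.append(new_row)
    return combined


# Hypothetical local dataset and downloaded clean labels.
dataset_csv = (
    "review_id,review,sentiment\n"
    "review_1,The sales rep was fantastic!,positive\n"
    "review_2,He was a bit wishy-washy.,negative\n"
)
clean_csv = "id,clean_label\nreview_1,positive\nreview_2,neutral\n"

dataset_rows = list(csv.DictReader(io.StringIO(dataset_csv)))
clean_rows = list(csv.DictReader(io.StringIO(clean_csv)))
combined = combine_clean_labels(dataset_rows, clean_rows, "review_id")

for row in combined:
    print(row["review_id"], row["sentiment"], row["clean_label"])
```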
Dataset format
Cleanlab currently only supports text and tabular dataset modalities.
(If your dataset contains both text and tabular data, treat it as tabular.)
The accepted dataset file types are: .csv, .json, and .xls/.xlsx.
Below are some examples of how to format your dataset depending on modality and file type.
Every dataset must have an ID column (i.e. a column containing identifiers that uniquely identify each row) and a label column (for the prediction task).
Apart from the reserved column name clean_label, you are free to name the columns in your dataset any way you want.
Tabular
.csv, .xls/.xlsx
flower_id | width | length | color | species |
---|---|---|---|---|
flower_01 | 4 | 3 | red | rose |
flower_02 | 7 | 2 | white | lily |
.json
{
"rows": [
{
"flower_id": "flower_01",
"width": 4,
"length": 3,
"color": "red",
"species": "rose"
},
{
"flower_id": "flower_02",
"width": 7,
"length": 2,
"color": "white",
"species": "lily"
}
]
}
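If your data starts out as CSV, the "rows" layout above can be produced with the Python standard library. The following is a minimal sketch rather than anything shipped with the CLI: csv_to_rows_json is a hypothetical helper, and converting whole-number strings to integers is an assumption made so the output matches the example.

```python
import csv
import io
import json


def csv_to_rows_json(csv_text):
    """Convert CSV text into the {"rows": [...]} layout shown above."""

    def parse(value):
        # Turn whole-number strings into integers; leave everything else as text.
        try:
            return int(value)
        except ValueError:
            return value

    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [{key: parse(value) for key, value in row.items()} for row in reader]
    return json.dumps({"rows": rows}, indent=2)


csv_text = (
    "flower_id,width,length,color,species\n"
    "flower_01,4,3,red,rose\n"
    "flower_02,7,2,white,lily\n"
)
print(csv_to_rows_json(csv_text))
```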
Text
.csv, .xls/.xlsx
review_id | review | sentiment |
---|---|---|
review_1 | The sales rep was fantastic! | positive |
review_2 | He was a bit wishy-washy. | negative |
.json
{
"rows": [
{
"review_id": "review_1",
"review": "The sales rep was fantastic!",
"label": "positive"
},
{
"review_id": "review_2",
"review": "He was a bit wishy-washy.",
"label": "negative"
}
]
}
Schema
To specify the column types in your dataset, create a JSON file named schema.json. We recommend using cleanlab dataset schema generate to generate an initial schema and editing from there.
Your schema file should be formatted as follows:
{
"metadata": {
"id_column": "tweet_id",
"modality": "text",
"name": "Tweets.csv"
},
"fields": {
"tweet_id": {
"data_type": "string",
"feature_type": "identifier"
},
"sentiment": {
"data_type": "string",
"feature_type": "categorical"
},
"sentiment_confidence": {
"data_type": "float",
"feature_type": "numeric"
},
"retweet_count": {
"data_type": "integer",
"feature_type": "numeric"
},
"text": {
"data_type": "string",
"feature_type": "text"
},
"is_retweet": {
"data_type": "boolean",
"feature_type": "boolean"
},
"tweet_created": {
"data_type": "string",
"feature_type": "datetime"
}
},
"version": "0.1.12"
}
This is the schema of a hypothetical dataset Tweets.csv that contains tweets, where the column tweet_id contains a unique identifier for each record. Each column in the dataset is specified under fields with its data type and feature type.
Data types and Feature types
Data type refers to the type of the field's values: string, integer, float, or boolean.
Note that the integer type is strict, meaning floats will be rejected. In contrast, the float type is lenient, meaning integers are accepted. Users should select the float type if the field may include float values. Note too that integers can have categorical and identifier feature types, whereas floats cannot.
For booleans, the accepted values are: true/false, t/f, yes/no, and 1/0.
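The strict-integer and lenient-float rules can be sketched in Python as follows. This is an illustration of the rules as described here, not the CLI's parser: matches_data_type is a hypothetical helper, and treating boolean values case-insensitively is an assumption.

```python
BOOLEAN_VALUES = {"true", "false", "t", "f", "yes", "no", "1", "0"}


def matches_data_type(value, data_type):
    """Check a raw string value against one of the four data types."""
    text = value.strip()
    if data_type == "string":
        return True
    if data_type == "boolean":
        # Case-insensitive matching is an assumption of this sketch.
        return text.lower() in BOOLEAN_VALUES
    if data_type == "integer":
        # Strict: anything that is not a whole number is rejected.
        try:
            int(text)
            return True
        except ValueError:
            return False
    if data_type == "float":
        # Lenient: whole numbers are accepted too.
        try:
            float(text)
            return True
        except ValueError:
            return False
    raise ValueError(f"unknown data type: {data_type}")


print(matches_data_type("7", "integer"))    # True
print(matches_data_type("7.5", "integer"))  # False
print(matches_data_type("7", "float"))      # True
print(matches_data_type("yes", "boolean"))  # True
```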
Feature type refers to the secondary type of the field, relating to how it is used in a machine learning model, such as whether it is:
- a categorical value
- a numeric value
- a datetime value
- a boolean value
- text
- an identifier — a string / integer that identifies some entity
Some feature types can only correspond to specific data types. The possible feature types for each data type are shown below:
Data type | Feature type |
---|---|
string | text, categorical, datetime, identifier |
integer | categorical, datetime, identifier, numeric |
float | datetime, numeric |
boolean | boolean |
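The table above translates directly into a lookup that can be used to sanity-check a schema by hand. The Python sketch below encodes the table as a dictionary; invalid_fields is a hypothetical helper, and the sample schema is deliberately wrong in one place to show a failure.

```python
# Allowed feature types per data type, copied from the table above.
ALLOWED_FEATURE_TYPES = {
    "string": {"text", "categorical", "datetime", "identifier"},
    "integer": {"categorical", "datetime", "identifier", "numeric"},
    "float": {"datetime", "numeric"},
    "boolean": {"boolean"},
}


def invalid_fields(schema):
    """Return names of fields whose feature type is not allowed for their data type."""
    bad = []
    for name, spec in schema["fields"].items():
        allowed = ALLOWED_FEATURE_TYPES.get(spec["data_type"], set())
        if spec["feature_type"] not in allowed:
            bad.append(name)
    return bad


# Hypothetical schema fragment; a float field cannot be categorical.
schema = {
    "fields": {
        "tweet_id": {"data_type": "string", "feature_type": "identifier"},
        "sentiment_confidence": {"data_type": "float", "feature_type": "categorical"},
        "retweet_count": {"data_type": "integer", "feature_type": "numeric"},
    }
}
print(invalid_fields(schema))  # ['sentiment_confidence']
```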
The datetime type should be used for datetime strings, e.g. "2015-02-24 11:35:52 -0800", and Unix timestamps (which will be integers or floats). Datetime values must be parsable by pandas.to_datetime().
version indicates the version of the Cleanlab CLI package used to generate the schema. The current Cleanlab schema version is 0.1.14.