Client interface for all things Cleanlab Studio

cleanlab-studio

Command line and Python library interface to Cleanlab Studio. Analyze datasets and produce cleansets (cleaned datasets) with Cleanlab Studio in a single line of code!

Installation
Quickstart
Advanced Usage
Documentation/Tutorials

Installation

You can install the Cleanlab Studio client from PyPI with:

pip install cleanlab-studio

If you already have the client installed and wish to upgrade to the latest version, run:

pip install --upgrade cleanlab-studio

Quickstart

Python API -- run Cleanlab Studio from Python or Jupyter Notebook

You can find your API key at https://app.cleanlab.ai/account.

from cleanlab_studio import Studio

# create your Cleanlab Studio API client with your API key, found here: https://app.cleanlab.ai/account
studio = Studio(<your api key>)

# upload your dataset via a filepath, Pandas DataFrame, or PySpark DataFrame!
dataset_id: str = studio.upload_dataset(<your dataset>, <your dataset name>)

# navigate to Cleanlab Studio, create a project, and improve your labels

# download your cleanset or apply corrections to your local Pandas or PySpark dataset!
# you can find your cleanset ID by clicking on the Export Cleanset button in your project
cleanset = studio.download_cleanlab_columns(<your cleanset id>)
corrected_dataset = studio.apply_corrections(<your dataset>, <your cleanset id>)
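
The quickstart above passes the API key directly to Studio. One way to keep the key out of notebooks and scripts is to read it from an environment variable first. The sketch below is only an illustration: the variable name CLEANLAB_STUDIO_API_KEY and the helpers load_api_key and make_studio are hypothetical and not part of the library.

```python
import os


def load_api_key(env_var: str = "CLEANLAB_STUDIO_API_KEY") -> str:
    """Read the API key from an environment variable (the variable name is an assumption)."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(
            f"Set {env_var} to the API key shown at https://app.cleanlab.ai/account"
        )
    return api_key


def make_studio():
    """Create the client as in the quickstart, without a hard-coded key."""
    # Imported lazily so this module loads even before `pip install cleanlab-studio`.
    from cleanlab_studio import Studio

    return Studio(load_api_key())
```

Calling make_studio() then returns a client that can be used exactly like the studio object in the quickstart.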

CLI -- run Cleanlab Studio from your command line

1. If this is your first time using the Cleanlab CLI, authenticate with cleanlab login. You can find your API key at https://app.cleanlab.ai/account.
2. Upload your dataset (image, text, or tabular) using cleanlab dataset upload.
3. Create a project in Cleanlab Studio.
4. Improve your dataset in Cleanlab Studio (e.g., correct some labels).
5. Download your cleanset with cleanlab cleanset download.

Dataset Structure

Cleanlab Studio supports the following upload types:

Text/Tabular
- CSV
- JSON
- XLS/XLSX
- Pandas DataFrame (Python library only)
- PySpark DataFrame (Python library only)
- more to come!
Image
- CSV (external media)
- JSON (external media)
- XLS/XLSX (external media)
- Pandas DataFrame (external media) (Python library only)
- PySpark DataFrame (external media) (Python library only)
- Simple ZIP upload
- Metadata ZIP upload
- more to come!

For details on how to structure your dataset, open the tutorial at https://app.cleanlab.ai/upload.

Advanced Usage

Schema

Python API

All schema information will be inferred by default when uploading a dataset through the Python API. We provide some options to override the inferred schema if necessary:

- To override the dataset modality, supply a modality kwarg to studio.upload_dataset(). Supported modalities include "text", "tabular", and "image".
- To override the ID column, supply an id_column kwarg to studio.upload_dataset().
- To override column types in your dataset, supply a schema_overrides kwarg to studio.upload_dataset() in the following format:

{
  <name_of_column_to_override>: {
    "data_type": <desired_data_type>,
    "feature_type": <desired_feature_type>,
  },
  ...
}
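
As a concrete illustration of the format above, the following sketch builds a schema_overrides dictionary for two columns. The helper make_override is hypothetical (it is not part of the library), and the column names are borrowed from the hypothetical Tweets.csv dataset used later in this README. The upload call itself is left as a comment because it needs an API key and network access.

```python
from typing import Dict


def make_override(data_type: str, feature_type: str) -> Dict[str, str]:
    """Build one entry in the schema_overrides format (hypothetical helper)."""
    return {"data_type": data_type, "feature_type": feature_type}


# Override only the columns whose inferred types are wrong; the rest stay inferred.
schema_overrides = {
    "retweet_count": make_override("integer", "numeric"),
    "sentiment": make_override("string", "categorical"),
}

# With a `studio` client created as in the quickstart, the upload would look like:
# dataset_id = studio.upload_dataset(
#     "Tweets.csv",
#     "Tweets",
#     modality="text",
#     id_column="tweet_id",
#     schema_overrides=schema_overrides,
# )
print(schema_overrides)
```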

CLI

To specify the column types in your dataset, create a JSON file named schema.json. If you would like to edit an inferred schema (rather than starting from scratch), follow these steps:

1. Start a dataset upload using: cleanlab dataset upload
2. Once schema generation is complete, you'll be asked whether you'd like to use our inferred schema. Enter n to decline.
3. You'll then be asked whether you'd like to save the inferred schema. Enter y to accept, then enter the filename you'd like to save to (schema.json by default).
4. Edit the schema file as needed.
5. Start the upload again using: cleanlab dataset upload --schema_path [path to schema file]

Your schema file should be formatted as follows:

{
  "metadata": {
    "id_column": "tweet_id",
    "modality": "text",
    "name": "Tweets.csv"
  },
  "fields": {
    "tweet_id": {
      "data_type": "string",
      "feature_type": "identifier"
    },
    "sentiment": {
      "data_type": "string",
      "feature_type": "categorical"
    },
    "sentiment_confidence": {
      "data_type": "float",
      "feature_type": "numeric"
    },
    "retweet_count": {
      "data_type": "integer",
      "feature_type": "numeric"
    },
    "text": {
      "data_type": "string",
      "feature_type": "text"
    },
    "is_reply": {
      "data_type": "boolean",
      "feature_type": "boolean"
    },
    "tweet_created": {
      "data_type": "string",
      "feature_type": "datetime"
    }
  },
  "version": "0.1.12"
}

This is the schema of a hypothetical dataset Tweets.csv that contains tweets, where the column tweet_id contains a unique identifier for each record. Each column in the dataset is specified under fields with its data type and feature type.
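
If you prefer to edit a saved schema file from a script rather than by hand, a minimal sketch using only the standard library could look like this. The function set_field_type is a hypothetical helper, and it assumes the file follows the layout shown above (a top-level "fields" object keyed by column name).

```python
import json
from pathlib import Path


def set_field_type(schema_path: str, column: str, data_type: str, feature_type: str) -> dict:
    """Rewrite one column's types in a saved schema file and return the updated schema."""
    path = Path(schema_path)
    schema = json.loads(path.read_text())
    if column not in schema["fields"]:
        raise KeyError(f"column {column!r} is not in the schema")
    schema["fields"][column] = {"data_type": data_type, "feature_type": feature_type}
    # Write the file back in place so it can be passed to --schema_path.
    path.write_text(json.dumps(schema, indent=2))
    return schema
```

For example, set_field_type("schema.json", "retweet_count", "float", "numeric") would switch retweet_count to the float type before you rerun cleanlab dataset upload with --schema_path.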

Data types and Feature types

Data type refers to the type of the field's values: string, integer, float, or boolean.

Note that the integer type is partially strict: floats that are equal to integers (e.g., 1.0 and 2.0) are accepted, but floats like 0.8 and 1.5 are not. In contrast, the float type is lenient and also accepts integers. Select the float type if the field may include float values. Note too that integers can have categorical and identifier feature types, whereas floats cannot.
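
The rule above can be sketched as two small checks. This is an illustration of the described behavior, not the validation code Cleanlab Studio actually runs. Rejecting Python bool values here is an assumption, made because boolean is a separate data type in the schema.

```python
def fits_integer_type(value) -> bool:
    """True for ints and for floats equal to an integer (1.0, 2.0); False for 0.8 or 1.5."""
    if isinstance(value, bool):
        return False  # assumption: booleans belong to the boolean data type
    if isinstance(value, int):
        return True
    if isinstance(value, float):
        return value.is_integer()
    return False


def fits_float_type(value) -> bool:
    """The float type is lenient: both ints and floats are accepted."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


for sample in (1, 2.0, 0.8, 1.5):
    print(sample, fits_integer_type(sample), fits_float_type(sample))
```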

For booleans, the accepted values are: true/false, t/f, yes/no, 1/0, 1.0/0.0.
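
If you want to normalize a boolean column before uploading, the accepted values listed above map naturally onto a lookup. The sketch below is hypothetical pre-processing code, and it assumes matching is case-insensitive and ignores surrounding whitespace, which this README does not state.

```python
TRUE_VALUES = {"true", "t", "yes", "1", "1.0"}
FALSE_VALUES = {"false", "f", "no", "0", "0.0"}


def parse_boolean(raw) -> bool:
    """Map one of the accepted boolean spellings to a Python bool."""
    token = str(raw).strip().lower()  # assumption: case and whitespace are ignored
    if token in TRUE_VALUES:
        return True
    if token in FALSE_VALUES:
        return False
    raise ValueError(f"{raw!r} is not one of the accepted boolean values")


print([parse_boolean(v) for v in ("true", "f", "yes", 0, 1.0)])
```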

Feature type refers to the secondary type of the field, relating to how it is used in a machine learning model, such as whether it is:

- a categorical value
- a numeric value
- a datetime value
- a boolean value
- text
- an identifier: a string or integer that identifies some entity
- a filepath value (only valid for image datasets)

Some feature types can only correspond to specific data types. The possible feature types for each data type are shown below:

Data type	Feature type
string	text, categorical, datetime, identifier, filepath
integer	categorical, datetime, identifier, numeric
float	datetime, numeric
boolean	boolean
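
The table above can be encoded as a lookup and used to check a schema's fields before uploading. This is a minimal sketch: check_fields is a hypothetical helper, and it assumes the fields layout from the schema file example (each column maps to a data_type and a feature_type).

```python
ALLOWED_FEATURE_TYPES = {
    "string": {"text", "categorical", "datetime", "identifier", "filepath"},
    "integer": {"categorical", "datetime", "identifier", "numeric"},
    "float": {"datetime", "numeric"},
    "boolean": {"boolean"},
}


def check_fields(fields: dict) -> list:
    """Return one message per field whose feature type does not fit its data type."""
    problems = []
    for name, spec in fields.items():
        allowed = ALLOWED_FEATURE_TYPES.get(spec["data_type"])
        if allowed is None:
            problems.append(f"{name}: unknown data type {spec['data_type']!r}")
        elif spec["feature_type"] not in allowed:
            problems.append(
                f"{name}: feature type {spec['feature_type']!r} "
                f"is not valid for data type {spec['data_type']!r}"
            )
    return problems


fields = {
    "sentiment_confidence": {"data_type": "float", "feature_type": "numeric"},
    "retweet_count": {"data_type": "float", "feature_type": "categorical"},
}
print(check_fields(fields))
```

Here retweet_count is reported because, per the table, floats cannot take the categorical feature type.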

The datetime type should be used for datetime strings, e.g. "2015-02-24 11:35:52 -0800", and Unix timestamps (which will be integers or floats). Datetime values must be parseable by polars.from_epoch for integers/floats or polars.Expr.str.strptime for strings.
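
To sanity-check datetime values locally, the two accepted forms can be tried with Python's standard library. Note that this is a stand-in for illustration only: Cleanlab Studio parses with the polars functions named above, whose format handling may differ from datetime.strptime. The format string below is an assumption chosen to match the example value.

```python
from datetime import datetime, timezone


def parse_datetime_string(text: str, fmt: str = "%Y-%m-%d %H:%M:%S %z") -> datetime:
    """Parse a datetime string such as "2015-02-24 11:35:52 -0800"."""
    return datetime.strptime(text, fmt)


def parse_unix_timestamp(seconds: float) -> datetime:
    """Interpret an integer or float as seconds since the Unix epoch (UTC)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


from_string = parse_datetime_string("2015-02-24 11:35:52 -0800")
from_epoch = parse_unix_timestamp(from_string.timestamp())

# Both forms describe the same instant, so the comparison holds.
print(from_string.isoformat())
print(from_epoch == from_string)
```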

version indicates the version of the Cleanlab CLI package used to generate the schema.

Other Resources

Documentation -- tutorials on how to use Cleanlab Studio for specific applications
Blog -- example applications, feature announcements, concept explanations, benchmarks
Slack Community -- ask questions, request features, discuss Data-Centric AI with others
Need professional help or want a demo? Reach out via email: team@cleanlab.ai

cleanlab-studio 1.1.6
