
Lightly Purple is a lightweight, fast, and easy-to-use data exploration tool for data scientists and engineers.


The open-source tool for curating datasets



🚀 Aloha!

We at Lightly created an open-source tool that supercharges your data curation workflows by enabling you to explore datasets, analyze data quality, and improve your machine learning pipelines more efficiently than ever before. Join us on this adventure of building better datasets.

💻 Installation

Please use Python 3.8 or higher with a virtual environment (venv).

The library is not OS-dependent and should work on Windows, Linux, and macOS.

# Create a virtual environment
# On Linux/macOS:
python3 -m venv venv
source venv/bin/activate

# On Windows:
python -m venv venv
.\venv\Scripts\activate

# Install library
pip install lightly-purple

Quickstart

Download an example dataset, then run a quickstart script to load it and launch the app.

YOLO8 dataset example

Here is a quick example using the YOLO8 dataset:

The YOLO format details:
dataset/
├── train/
│   ├── images/
│   │   ├── image1.jpg
│   │   ├── image2.jpg
│   │   └── ...
│   └── labels/
│       ├── image1.txt
│       ├── image2.txt
│       └── ...
├── valid/  (optional)
│   ├── images/
│   │   └── ...
│   └── labels/
│       └── ...
└── data.yaml

Each label file should contain YOLO format annotations (one per line):

<class> <x_center> <y_center> <width> <height>

All coordinates are normalized between 0 and 1.
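To make the annotation format concrete, here is a small plain-Python sketch (independent of lightly_purple) that parses one label line and converts the normalized box back to pixel coordinates:

```python
def yolo_to_pixels(line, img_w, img_h):
    """Convert one YOLO label line to (class_id, x_min, y_min, x_max, y_max) in pixels."""
    cls, xc, yc, w, h = line.split()
    xc, yc, w, h = (float(v) for v in (xc, yc, w, h))
    # The line stores the box center and size, normalized by image dimensions.
    x_min = (xc - w / 2) * img_w
    y_min = (yc - h / 2) * img_h
    x_max = (xc + w / 2) * img_w
    y_max = (yc + h / 2) * img_h
    return int(cls), x_min, y_min, x_max, y_max

# A box centered in a 640x480 image, covering half of each dimension:
print(yolo_to_pixels("0 0.5 0.5 0.5 0.5", 640, 480))
# (0, 160.0, 120.0, 480.0, 360.0)
```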

On Linux/macOS:

# Download and extract dataset
export DATASET_PATH=$(pwd)/example-dataset && \
    bash <(curl -sL https://raw.githubusercontent.com/lightly-ai/gists/refs/heads/main/fetch-dataset.sh) \
        "https://universe.roboflow.com/ds/nToYP9Q1ix?key=pnjUGTjjba" \
        $DATASET_PATH

# Download example script
curl -sL https://raw.githubusercontent.com/lightly-ai/gists/refs/heads/main/example-yolo8.py > example.py

# Run the example script
python example.py

On Windows:

# Download and extract dataset
$DATASET_PATH = "$(Get-Location)\example-dataset"
[System.Environment]::SetEnvironmentVariable("DATASET_PATH", $DATASET_PATH, "Process")
Invoke-WebRequest -Uri "https://raw.githubusercontent.com/lightly-ai/gists/refs/heads/main/fetch-dataset.ps1" -OutFile "fetch-dataset.ps1"
.\fetch-dataset.ps1 "https://universe.roboflow.com/ds/nToYP9Q1ix?key=pnjUGTjjba" "$DATASET_PATH"

# Download example script
Invoke-WebRequest -Uri "https://raw.githubusercontent.com/lightly-ai/gists/refs/heads/main/example-yolo8.py" -OutFile "example.py"

# Run the example script
python.exe example.py
Quickstart commands explanation
  1. Setting up the dataset path:
  export DATASET_PATH=$(pwd)/example-dataset

This creates an environment variable DATASET_PATH pointing to an 'example-dataset' folder in your current directory.

  2. Downloading and extracting the dataset:
  bash <(curl -sL https://raw.githubusercontent.com/lightly-ai/gists/refs/heads/main/fetch-dataset.sh)
  • Downloads a shell script that handles dataset fetching
  • The script downloads a YOLO-format dataset from Roboflow
  • Automatically extracts the dataset to your specified DATASET_PATH
  3. Getting the example code:
  curl -sL https://raw.githubusercontent.com/lightly-ai/gists/refs/heads/main/example-yolo8.py > example.py

Downloads a Python script that demonstrates how to:

  • Load the YOLO dataset
  • Process the images and annotations
  • Launch the Lightly Purple UI for exploration
  4. Running the example:
  python example.py

Executes the downloaded script, which will:

  • Initialize the dataset processor
  • Load and analyze your data
  • Start a local server
  • Open the UI in your default web browser
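The example scripts read DATASET_PATH from the environment; in your own code the same resolution can be sketched with only the standard library (the fallback folder name here is an assumption):

```python
import os
from pathlib import Path

# Resolve the dataset location from the environment, falling back to
# the example-dataset folder the quickstart commands create.
dataset_path = Path(os.environ.get("DATASET_PATH", "example-dataset"))
data_yaml = dataset_path / "data.yaml"  # the YOLO dataset descriptor
print(data_yaml)
```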

Example explanation

Let's break down the example.py script to explore the dataset:

# We import the DatasetLoader class from the lightly_purple module
from lightly_purple import DatasetLoader

# Create a DatasetLoader instance
loader = DatasetLoader()

# Point to the YAML file describing the dataset
# and select the images subfolder to load ("train" here).
loader.from_yolo(
    "dataset/data.yaml",
    "train",
)

# We start the UI application
loader.launch()

COCO dataset example

Here is an example using the COCO dataset:

The COCO format details:
dataset/
├── train/                     # Image files used to train
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ...
└── _annotations.coco.json     # Single JSON file containing all annotations

COCO uses a single JSON file containing all annotations. The format consists of three main components:

  • Images: Defines metadata for each image in the dataset.
  • Categories: Defines the object classes.
  • Annotations: Defines object instances.
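A minimal sketch of such a file, built in Python for illustration (all values here are hypothetical; bbox follows COCO's [x, y, width, height] pixel convention):

```python
import json

# A minimal, hypothetical _annotations.coco.json with one image,
# one category, and one bounding-box annotation.
coco = {
    "images": [{"id": 1, "file_name": "image1.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 1, "name": "cat"}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [100, 50, 200, 150],  # [x, y, width, height] in pixels
         "area": 30000, "iscrowd": 0}
    ],
}

# The three main components described above are all present:
assert {"images", "categories", "annotations"} <= coco.keys()
print(json.dumps(coco, indent=2))
```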

On Linux/macOS:

# Download and extract dataset
export DATASET_PATH=$(pwd)/example-dataset/train && \
    bash <(curl -sL https://raw.githubusercontent.com/lightly-ai/gists/refs/heads/main/fetch-dataset.sh) \
        "https://universe.roboflow.com/ds/XU8JobBB7x?key=rpuS7P1Du4" \
        $DATASET_PATH

# Download example script
curl -sL https://raw.githubusercontent.com/lightly-ai/gists/refs/heads/main/example-coco.py > example.py

# Run the example script
python example.py

On Windows:

# Download and extract dataset

Invoke-WebRequest -Uri "https://raw.githubusercontent.com/lightly-ai/gists/refs/heads/main/fetch-dataset.ps1" -OutFile "fetch-dataset.ps1"
.\fetch-dataset.ps1 "https://universe.roboflow.com/ds/XU8JobBB7x?key=rpuS7P1Du4" "$(Get-Location)\example-dataset"

# Download example script
Invoke-WebRequest -Uri "https://raw.githubusercontent.com/lightly-ai/gists/refs/heads/main/example-coco.py" -OutFile "example.py"

$DATASET_PATH = "$(Get-Location)\example-dataset\train"
[System.Environment]::SetEnvironmentVariable("DATASET_PATH", $DATASET_PATH, "Process")
# Run the example script
python.exe example.py

Example explanation

Let's break down the example-coco.py script to explore the dataset:

from lightly_purple import DatasetLoader

# Create a DatasetLoader instance
loader = DatasetLoader()

# Point to the annotations JSON file and the input images folder.
# The dataset is processed here so it is available to the UI application.
loader.from_coco_instance_segmentations(
    "dataset/_annotations.coco.json",
    "dataset/train",
)

# Start the UI application
loader.launch()

๐Ÿ” How it works

Here is what happens under the hood:

The library runs a self-contained environment that processes your data and makes it available to the UI application.

  • Dataset Loader: The Python module responsible for processing the dataset.

    • Processes the given dataset.
    • Stores it in the persistent data storage layer.
    • Handles various data formats and annotation types.
  • Data Storage Layer: Stores information about the dataset.

    • After the dataset is processed, its information is stored in a persistent database.
    • We use a DuckDB database as the persistent storage layer; you will see a purple.db file after the dataset is processed.
  • Backend API: A Python web server that serves the dataset to the UI application.

    • Reads from the persistent data storage layer to serve the dataset to the UI application.
    • Manages user interactions with the data.
  • UI Application: A responsive web interface.

    • Runs on your local machine on port 8001 and is available at http://localhost:8001/.
    • Opens automatically after the dataset is processed.
    • Consumes the local API endpoints.
    • Visualizes your dataset and analysis results.
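As a quick sanity check of the pieces above, you can verify the two observable artifacts (a stdlib-only sketch; the purple.db file name and port 8001 come from this section):

```python
import os
import socket

def purple_status(db_path="purple.db", host="localhost", port=8001):
    """Check whether the DuckDB file exists and the UI server is listening."""
    db_exists = os.path.exists(db_path)
    try:
        # Try to open a TCP connection to the local UI/API server.
        with socket.create_connection((host, port), timeout=0.5):
            ui_up = True
    except OSError:
        ui_up = False
    return {"db_exists": db_exists, "ui_up": ui_up}

print(purple_status())
```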

📦 Dataset Formats

Our library supports the following dataset formats:

  • YOLO8
  • COCO object detection
  • COCO binary mask instance segmentation

📚 FAQ

Are the datasets persistent?

Yes, dataset information is persistent and stored in the db file, which you will see after the dataset is processed. If you rerun the loader, it creates a new dataset entry for the same data, leaving the previous entry untouched.

Can I change the database path?

Not yet. The database is stored in the working directory by default.

Can I launch in another Python script or do I have to do it in the same script?

Only one script can run at a time, because the db file is locked for the duration of the script.

Can I change the API backend port?

Currently, the API always runs on port 8001, and this cannot be changed yet.

Can I process datasets that do not have annotations?

No, only datasets with annotations are supported for now.

What dataset annotations are supported?

Bounding boxes are supported ✅

Instance segmentation is supported ✅

Custom metadata is NOT yet supported ❌
