
LightlyStudio is a lightweight, fast, and easy-to-use data exploration tool for data scientists and engineers.


The open-source tool for curating datasets



🚀 Welcome to LightlyStudio!

We at Lightly created LightlyStudio, an open-source tool designed to supercharge your data curation workflows for computer vision datasets. Explore your data, visualize annotations and crops, tag samples, and export curated lists to improve your machine learning pipelines. And much more!

LightlyStudio runs entirely locally on your machine, keeping your data private. It consists of a Python library for indexing your data and a web-based UI for visualization and curation.

✨ Core Workflow

Using LightlyStudio typically involves these steps:

  1. Index Your Dataset: Run a Python script using the lightly_studio library to process your local dataset (images and annotations) and save metadata into a local lightly_studio.db file.
  2. Launch the UI: The script then starts a local web server.
  3. Explore & Curate: Use the UI to visualize images, annotations, and object crops. Filter and search your data (experimental text search available). Apply tags to interesting samples (e.g., "mislabeled", "review").
  4. Export Curated Data: Export information (like filenames) for your tagged samples from the UI to use downstream.
  5. Stop the Server: Close the terminal running the script (Ctrl+C) when done.

LightlyStudio Sample Grid View
Visualize your dataset samples with annotations in the grid view.

LightlyStudio Annotation Crop View
Switch to the annotation view to inspect individual object crops easily.

LightlyStudio Sample Detail View
Inspect individual samples in detail, viewing all annotations and metadata.

🎯 Features

  • Local Web GUI: Explore and curate your dataset in your browser. Works completely offline, your data never leaves your machine.
  • Flexible Input Formats: Load your image dataset from a folder, or with annotations in popular formats such as COCO or YOLO.
  • Metadata: Attach your custom metadata to every sample.
  • Tags: Mark subsets of your dataset for later use.
  • Embeddings: Run similarity search queries on your data.
  • Selection: Run advanced selection algorithms to tag a subset of your data.

💻 Installation

Ensure you have Python 3.8 or higher. We strongly recommend using a virtual environment.

The library is OS-independent and works on Windows, Linux, and macOS.

# 1. Create and activate a virtual environment (Recommended)
# On Linux/macOS:
python3 -m venv venv
source venv/bin/activate

# On Windows:
python -m venv venv
.\venv\Scripts\activate

# 2. Install LightlyStudio
pip install lightly_studio

Quickstart

Clone the example dataset repository and run a quickstart script to load the dataset and launch the app.

YOLO Object Detection

To run an example using a YOLO dataset, clone the example repository and run the example script below:

git clone https://github.com/lightly-ai/dataset_examples dataset_examples

example_yolo.py script to explore the dataset:

from pathlib import Path

import lightly_studio as ls

data_yaml_path = Path(__file__).resolve().parent / "data.yaml"

# Create a dataset and add the samples from the yolo format
dataset = ls.Dataset.create()
dataset.add_samples_from_yolo(
    data_yaml=data_yaml_path,
    input_split="test",
)

# Start the UI application on port 8001.
ls.start_gui()

The YOLO format details:
road_signs_yolo/
├── train/
│   ├── images/
│   │   ├── image1.jpg
│   │   ├── image2.jpg
│   │   └── ...
│   └── labels/
│       ├── image1.txt
│       ├── image2.txt
│       └── ...
├── valid/  (optional)
│   ├── images/
│   │   └── ...
│   └── labels/
│       └── ...
└── data.yaml

Each label file should contain YOLO format annotations (one per line):

<class> <x_center> <y_center> <width> <height>

Where coordinates are normalized between 0 and 1.
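For illustration, such a label line can be parsed and denormalized to pixel coordinates in plain Python (a sketch; parse_yolo_line is a hypothetical helper, not part of lightly_studio):

```python
def parse_yolo_line(line: str, img_w: int, img_h: int) -> dict:
    """Parse one YOLO label line into pixel-space values."""
    # Fields: <class> <x_center> <y_center> <width> <height>,
    # with all coordinates normalized to [0, 1].
    cls, x_c, y_c, w, h = line.split()
    return {
        "class_id": int(cls),
        "x_center": float(x_c) * img_w,
        "y_center": float(y_c) * img_h,
        "width": float(w) * img_w,
        "height": float(h) * img_h,
    }

# A box centered in a 640x480 image, covering a quarter of each dimension:
box = parse_yolo_line("0 0.5 0.5 0.25 0.25", img_w=640, img_h=480)
# box["x_center"] == 320.0, box["width"] == 160.0
```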

COCO Instance Segmentation

To run an instance segmentation example using a COCO dataset, clone the example repository and run the example script below:

git clone https://github.com/lightly-ai/dataset_examples dataset_examples

example_coco.py script to explore the dataset:

from pathlib import Path

import lightly_studio as ls

current_dir = Path(__file__).resolve().parent

# Create a dataset and add the samples from the coco format
dataset = ls.Dataset.create()
dataset.add_samples_from_coco(
    annotations_json=current_dir / "instances_train2017.json",
    images_path=current_dir / "images",
    annotation_type=ls.AnnotationType.INSTANCE_SEGMENTATION,
)

# Start the UI application on port 8001.
ls.start_gui()

The COCO format details:
coco_subset_128_images/
├── images/
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ...
└── instances_train2017.json        # Single JSON file containing all annotations

COCO uses a single JSON file containing all annotations. The format consists of three main components:

  • Images: Defines metadata for each image in the dataset.
  • Categories: Defines the object classes.
  • Annotations: Defines object instances.
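A minimal sketch of that structure, built as a Python dict (illustrative values only, not taken from the example dataset):

```python
import json

# Minimal COCO-style annotation file with the three components:
# images, categories, and annotations.
coco = {
    "images": [
        {"id": 1, "file_name": "image1.jpg", "width": 640, "height": 480},
    ],
    "categories": [
        {"id": 1, "name": "road_sign"},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,              # references images[].id
            "category_id": 1,           # references categories[].id
            "bbox": [100, 50, 80, 80],  # [x, y, width, height] in pixels
        },
    ],
}

# The whole structure serializes to a single JSON file.
text = json.dumps(coco)
assert set(json.loads(text)) == {"images", "categories", "annotations"}
```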

๐Ÿ” How It Works

  1. Your Python script uses the lightly_studio Dataset.
  2. The dataset.add_samples_from_<source> call reads your images and annotations, calculates embeddings, and saves metadata to a local lightly_studio.db file (using DuckDB).
  3. lightly_studio.start_gui() starts a local Backend API server.
  4. This server reads from lightly_studio.db and serves data to the UI Application running in your browser (http://localhost:8001).
  5. Images are streamed directly from your disk for display in the UI.

🎯 Python Interface

Dataset

Load Images From A Folder

import lightly_studio as ls

dataset = ls.Dataset.create()
dataset.add_samples_from_path(path="/path/to/image_dataset")

ls.start_gui()

Load Images With Annotations

The Dataset currently supports:

  • YOLOv8 Object Detection: Reads .yaml file. Supports bounding boxes.
  • COCO Object Detection: Reads .json annotations. Supports bounding boxes.
  • COCO Instance Segmentation: Reads .json annotations. Supports instance masks in RLE (Run-Length Encoding) format.
# Load a dataset in YOLO format
import lightly_studio as ls

dataset = ls.Dataset.create()
dataset.add_samples_from_yolo(
    data_yaml="my_yolo_dataset/data.yaml",
    input_split="val",
)

ls.start_gui()
# Load an object detection/instance segmentation dataset in COCO format
import lightly_studio as ls

dataset = ls.Dataset.create()
dataset.add_samples_from_coco(
    annotations_json="my_coco_dataset/detections_train.json",
    images_path="my_coco_dataset/images",
    # If using instance segmentation, uncomment the next line.
    # annotation_type=ls.AnnotationType.INSTANCE_SEGMENTATION,
)

ls.start_gui()

Load an Existing Dataset

It is also possible to load an existing dataset by

import lightly_studio as ls

dataset = ls.Dataset.load_or_create()

This will load the dataset if it already exists in the .db file; otherwise it will create a new one.

Samples

The dataset consists of samples; every sample corresponds to an image. Dataset samples can be fetched and accessed as follows. For a full list of attributes, see sample.

# Get all dataset samples
samples = list(dataset)

# Access sample attributes
s = samples[0]
s.sample_id        # Sample ID
s.file_name        # Image file name
s.file_path_abs    # Full image file path
s.tags             # The list of sample tags
s.metadata["key"]  # dict-like access for metadata

# Set sample attributes
s.tags = {"tag1", "tag2"}
s.metadata["key"] = 123

# Adding/removing tags
s.add_tag("some_tag")
s.remove_tag("some_tag")

...

Dataset Query

You can efficiently fetch filtered dataset samples with a DatasetQuery() object. To get a query for an existing dataset:

query = dataset.query()

A query is defined by setting its match, order_by, and slice; any of these that is not required can be skipped.

When the query is used to fetch samples, the order of execution is:

  1. match
  2. order_by
  3. slice
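In plain-Python terms, this order corresponds to filtering first, then sorting, then slicing (an analogy on toy data, not the library's implementation):

```python
samples = [
    {"file_name": "a", "width": 30},
    {"file_name": "b", "width": 10},
    {"file_name": "c", "width": 20},
]

# 1. match: keep only samples wider than 15 pixels
matched = [s for s in samples if s["width"] > 15]
# 2. order_by: sort by width, descending
ordered = sorted(matched, key=lambda s: s["width"], reverse=True)
# 3. slice: offset=0, limit=1 -> the single widest matching sample
result = ordered[0:1]
# result == [{"file_name": "a", "width": 30}]
```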

Example Query Usage

from lightly_studio.core.dataset_query.boolean_expression import OR
from lightly_studio.core.dataset_query.order_by import OrderByField
from lightly_studio.core.dataset_query.sample_field import SampleField

query = dataset.match(
        OR(
            SampleField.file_name == "a",
            SampleField.file_name == "b",
        )
    ).order_by(
        OrderByField(SampleField.width).desc()
    ).slice(offset=10, limit=10)

query.add_tag("query_result")

Advanced Example:
from lightly_studio.core.dataset_query.boolean_expression import AND, OR, NOT
from lightly_studio.core.dataset_query.order_by import OrderByField
from lightly_studio.core.dataset_query.sample_field import SampleField

query = dataset.match(
    OR(
        SampleField.file_name == "a",
        SampleField.file_name == "b",
        AND(
            SampleField.width > 10,
            SampleField.width < 20,
            NOT(SampleField.tags.contains("dog")),
        ),
    )
).order_by(
    OrderByField(SampleField.width).desc()
).slice(offset=10, limit=10)

query.add_tag("query_result")

for sample in query:
    print(sample.tags)

Define the Query: match

The filtering for a query can be set by:

query = query.match(expression)

To create an expression for filtering on certain sample fields, the SampleField.<field_name> <operator> <value> syntax can be used. Available field names can be seen in SampleField.

SampleField Examples:
from lightly_studio.core.dataset_query.sample_field import SampleField

# Ordinal fields: <, <=, >, >=, ==, !=

expr = SampleField.height >= 10            # All samples with images at least 10 pixels tall
expr = SampleField.width == 10             # All samples with images that are exactly 10 pixels wide
expr = SampleField.created_at > datetime   # All samples created after datetime (actual datetime object)

# String fields: ==, !=
expr = SampleField.file_name == "some"     # All samples with "some" as file name
expr = SampleField.file_path_abs != "other" # All samples whose file path is not "other"

# Tags: contains()
expr = SampleField.tags.contains("dog")    # All samples that contain the tag "dog"

# Assign any of the previous expressions to a query:
query = query.match(expr)

Filters on individual fields can be flexibly combined into more complex match expressions using the boolean operators AND, OR, and NOT, which can be nested arbitrarily.

Boolean Examples:
from lightly_studio.core.dataset_query.boolean_expression import AND, OR, NOT
from lightly_studio.core.dataset_query.sample_field import SampleField

# All samples with images that are between 10 and 20 pixels wide
expr = AND(
    SampleField.width > 10,
    SampleField.width < 20
)

# All samples with file names that are either "a" or "b"
expr = OR(
    SampleField.file_name == "a",
    SampleField.file_name == "b"
)

# All samples which do not contain a tag "dog"
expr = NOT(SampleField.tags.contains("dog"))

# All samples for a nested expression
expr = OR(
    SampleField.file_name == "a",
    SampleField.file_name == "b",
    AND(
        SampleField.width > 10,
        SampleField.width < 20,
        NOT(
            SampleField.tags.contains("dog")
        ),
    ),
)

# Assign any of the previous expressions to a query:
query = query.match(expr)
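Conceptually, these operators compose per-sample predicates the same way plain Python boolean logic does (a rough analogy over a sample dict, not the library's internals):

```python
# Per-field predicates as plain functions over a sample dict:
def is_a_or_b(s):
    return s["file_name"] in ("a", "b")

def mid_width(s):
    return 10 < s["width"] < 20

def no_dog_tag(s):
    return "dog" not in s["tags"]

# OR(file_name == "a", file_name == "b",
#    AND(width > 10, width < 20, NOT(tags.contains("dog"))))
def matches(s):
    return is_a_or_b(s) or (mid_width(s) and no_dog_tag(s))

sample = {"file_name": "c", "width": 15, "tags": {"cat"}}
# matches(sample) is True via the nested AND branch
```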

Define the Query: order_by

Setting the sorting of a query can be done by

query = query.order_by(expression)

The order expression can be defined by OrderByField(SampleField.<field_name>).<order_direction>().

OrderByField Examples:
from lightly_studio.core.dataset_query.order_by import OrderByField
from lightly_studio.core.dataset_query.sample_field import SampleField

# Sort the query by the width of the image in ascending order
expr = OrderByField(SampleField.width)
expr = OrderByField(SampleField.width).asc()

# Sort the query by the height of the image in descending order
expr = OrderByField(SampleField.file_name).desc()
# Sort the query by the file name in descending order

# Assign any of the previous expressions to a query:
query = query.order_by(expr)

Define the Query: slice

Setting the slicing of a query can be done by:

query = query.slice(offset, limit)
# OR
query = query[offset:stop]

Both are different syntax for the same operation.

Slice Examples:
# Slice 2:5
query = query.slice(offset=2, limit=3)
query = query[2:5]

# Slice :5
query = query.slice(limit=5)
query = query[:5]

# Slice 5:
query = query.slice(offset=5)
query = query[5:]
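The mapping between the two syntaxes is offset = start and limit = stop - start; a small helper makes the conversion explicit (slice_to_offset_limit is a hypothetical name, pure Python):

```python
def slice_to_offset_limit(start, stop):
    """Convert [start:stop] slicing into (offset, limit) form.

    start=None means offset 0; stop=None means no limit.
    """
    offset = start or 0
    limit = None if stop is None else stop - offset
    return offset, limit

# The three examples above:
# query[2:5]  -> slice(offset=2, limit=3)
# query[:5]   -> slice(offset=0, limit=5)
# query[5:]   -> slice(offset=5, limit=None)
```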

Access the Samples

To access the filtered samples, two options are available: iterating over the query object or calling the to_list() method.

Iterating over the query:

query = dataset.query().match(match_expression).order_by(order_by_expression).slice(offset,limit)

samples = []
for sample in query:
    samples.append(sample)

Get all samples as list:

query = dataset.query().match(match_expression).order_by(order_by_expression).slice(offset,limit)

samples = query.to_list()

In some use cases, one might want to assign a tag to the samples that are the result of a query:

query.add_tag("tag_name")

Examples

Add Custom Metadata

Attach values to custom fields for every sample.

import lightly_studio as ls

# Load your dataset
dataset = ls.Dataset.create()
dataset.add_samples_from_path(path="/path/to/image_dataset")

# Attach metadata
for sample in dataset:
    sample.metadata["my_metadata"] = f"Example metadata field for {sample.file_name}"
    sample.metadata["my_dict"] = {"my_int_key": 10, "my_bool_key": True}

# View metadata in GUI
ls.start_gui()

Tags

You can easily mark subsets of your data with tags.

import lightly_studio as ls

# Load your dataset
dataset = ls.Dataset.create()
dataset.add_samples_from_path(path="/path/to/image_dataset")

# Tag the first 10 samples:
query = dataset.query()[:10]
query.add_tag("some_tag")

Find existing tags and tagged samples as follows.

import lightly_studio as ls

# Load your dataset
dataset = ls.Dataset.create()
dataset.add_samples_from_path(path="/path/to/image_dataset")

# Get all samples that contain the tag "dog"
query = dataset.query().match(SampleField.tags.contains("dog"))
samples = query.to_list()

Selection

As a premium feature, LightlyStudio offers advanced methods for subselecting dataset samples.

Prerequisites: The selection functionality requires a valid LightlyStudio license key. Set the LIGHTLY_STUDIO_LICENSE_KEY environment variable before using selection features:

export LIGHTLY_STUDIO_LICENSE_KEY="license_key_here"

Alternatively, set it inside your Python script:

import os
os.environ["LIGHTLY_STUDIO_LICENSE_KEY"] = "license_key_here"

The selection can be configured directly from a DatasetQuery. The example below showcases a simple case of selecting diverse samples.

import lightly_studio as ls

# Load your dataset
dataset = ls.Dataset.load_or_create()
dataset.add_samples_from_path(path="/path/to/image_dataset")

# Select a diverse subset of 10 samples.
dataset.query().selection().diverse(
    n_samples_to_select=10,
    selection_result_tag_name="diverse_selection",
)

ls.start_gui()

The selected sample paths can be exported via the GUI, or by a script:

import lightly_studio as ls
from lightly_studio.core.dataset_query.sample_field import SampleField

dataset = ls.Dataset.load("my-dataset")
selected_samples = (
    dataset.match(SampleField.tags.contains("diverse_selection")).to_list()
)

with open("export.txt", "w") as f:
    for sample in selected_samples:
        f.write(f"{sample.file_path_abs}\n")

📚 FAQ

Are the datasets persistent?

Yes, dataset information is persistent and stored in the db file; you can inspect it after the dataset has been processed. If you rerun the loader, it creates a new dataset entry for the same data, leaving the previous dataset information untouched.

Can I change the database path?

Not yet. The database is stored in the working directory by default.

Can I launch in another Python script or do I have to do it in the same script?

Only one script can use the database at a time, because the db file is locked for the duration of the script.

Can I change the API backend port?

Yes. To change the port, set the LIGHTLY_STUDIO_PORT environment variable to your preferred value. If the chosen port is unavailable at runtime, it will fall back to a random port.
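For example, to serve the UI on port 9000 instead (using the example_yolo.py script from the quickstart; the variable must be set before the script starts the server):

```shell
# Set the port, then launch as usual.
export LIGHTLY_STUDIO_PORT=9000
python example_yolo.py
```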
