Goldener - Make your data even more valuable

These details have been verified by PyPI

Maintainers

Yann-CV

These details have not been verified by PyPI

Development Status
- 4 - Beta
Intended Audience
- Developers
- Science/Research
Programming Language
- Python :: 3.13
Topic
- Scientific/Engineering :: Artificial Intelligence

Project description

A Python library orchestrating data during the full life cycle of machine learning pipelines.

Overview | Principles | Features | Installation | Contribute

Overview

Goldener is an open-source Python library (Apache 2 licence) designed to manage the orchestration of data (sampling, splitting) during the full life cycle of machine learning (ML) pipelines.

In the era of artificial intelligence (AI), data is the new gold. Collecting it is a first step, but creating value from it is the real challenge. Goldener is designed to help make the most of the available data. It provides tools to orchestrate data during the full life cycle of machine learning pipelines, from the training phase to the monitoring phase.

Goldener makes the right data available at the right time, which helps optimize the performance of any ML pipeline while minimizing the costs (time, performance, computing resources) of data sampling and labeling.

When it is time to annotate data, Goldener finds the most representative subset to annotate. During annotation, it can help define annotation guidelines by spotting specific cases, and it can also run annotation quality checks. Once enough data is annotated, Goldener can split it into multiple sets (train, validation, test) that reproduce the variability of the task. During the training phase, Goldener can balance the data efficiently to optimize training time and model performance. Finally, when the model is deployed, Goldener can find the most informative data to monitor the model performance and detect any drift in the data distribution.

Key design principles

Goldener is designed to process large datasets efficiently. It is built on the assumption that AI life cycles are usually iterative and incremental. Its design principles are:

Progressive batch processing: Each task can be stopped and restarted on demand (or after a failure). Results that were already computed are not recomputed.
Multipurpose embeddings: The same embeddings are reused across the different tasks (selection, splitting, monitoring, etc.).
Modality-agnostic: The same tool works for any data modality (text, image, video, tabular, etc.), and even for multimodal data.
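
The progressive batch processing principle can be illustrated with a small, library-independent sketch. It is not Goldener's implementation: the `process_in_batches` function, the JSON checkpoint file, and the `compute` callback are hypothetical names chosen for illustration. Results are written after each batch, so a restarted run skips whatever was already computed.

```python
import json
from pathlib import Path
from typing import Callable


def process_in_batches(
    items: list[str],
    compute: Callable[[str], float],
    checkpoint: Path,
    batch_size: int = 2,
) -> dict[str, float]:
    """Compute one result per item, persisting progress after every batch."""
    # Reload results from a previous (possibly interrupted) run, if any.
    results: dict[str, float] = (
        json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    )
    # Only items without a stored result still need to be processed.
    pending = [item for item in items if item not in results]
    for start in range(0, len(pending), batch_size):
        for item in pending[start : start + batch_size]:
            results[item] = compute(item)
        # Persist after each batch so a stop or failure loses at most one batch.
        checkpoint.write_text(json.dumps(results))
    return results
```

Calling the function a second time with the same checkpoint file invokes `compute` only for items that were not processed before.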

The following principles are not applied yet but are planned for the next iterations:

Distributed first: Any task can be distributed across multiple machines.
On demand access to pipelines: All processing pipelines are serializable. They are stored and available whenever a new request is made.

Example of features

Sampling among unannotated data

Goldener can find the most representative data subset to annotate. It extracts semantic knowledge about the data from embeddings computed with pre-trained models and stores it. It then leverages this knowledge to select the most representative subset, which can be annotated in order to train or monitor a model.

from goldener import (
    GoldSelector,
    GoldDescriptor,
    GoldTorchEmbeddingTool,
    GoldTorchEmbeddingToolConfig,
    TensorVectorizer,
)

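# my_model, my_layers and dataset are placeholders that you provide.
# The descriptor computes embeddings from the chosen layers of the model.
gd = GoldDescriptor(
    table_path="my_table_for_description",
    embedder=GoldTorchEmbeddingTool(
        GoldTorchEmbeddingToolConfig(
            model=my_model,
            layers=my_layers,
        )
    ),
    vectorizer=TensorVectorizer(),
)

# The selector picks samples based on the description.
gs = GoldSelector(
    table_path="my_table_for_selection", selection_key="selection"
)

# Describe the dataset, select 100 samples labeled "to_annotate",
# then read back the indices of the selected samples.
description = gd.describe_in_table(dataset)
selection_table = gs.select_in_table(description, 100, "to_annotate")
selected = GoldSelector.get_selection_indices(selection_table, "to_annotate", "selection")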
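
How a representative subset can be derived from embeddings may be easier to see on a toy example. The sketch below uses greedy farthest-point (k-center) selection in pure Python. This is a stand-in technique for illustration only: the text above does not say which algorithm `GoldSelector` uses, and `select_representatives` is a hypothetical helper.

```python
import math


def select_representatives(
    embeddings: list[list[float]], count: int
) -> list[int]:
    """Greedy farthest-point selection: return indices of `count` spread-out points."""
    if not 0 < count <= len(embeddings):
        raise ValueError("count must be between 1 and the number of embeddings")
    selected = [0]  # arbitrary but deterministic starting point
    chosen = {0}
    # Distance from every point to its closest already-selected point.
    nearest = [math.dist(point, embeddings[0]) for point in embeddings]
    while len(selected) < count:
        # Pick the unselected point that the current selection covers worst.
        candidate = max(
            (i for i in range(len(embeddings)) if i not in chosen),
            key=nearest.__getitem__,
        )
        selected.append(candidate)
        chosen.add(candidate)
        nearest = [
            min(current, math.dist(point, embeddings[candidate]))
            for current, point in zip(nearest, embeddings)
        ]
    return selected
```

Each new pick is the point farthest from everything already selected, so near-duplicates are chosen last and the subset spreads over the embedding space.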

Splitting annotated data in train and validation sets

Goldener can split data between the train and validation sets while ensuring that the training set covers most of the different situations of the task. Based on a description of the samples (embeddings), the most different or unique elements are kept for the training set, while the least informative ones go to the validation set.

from goldener import (
    GoldSet,
    GoldSplitter,
    GoldDescriptor,
    GoldSelector,
)

gd = GoldDescriptor(...) # reuse the descriptor used for smart sampling
gselector = GoldSelector(...)
gs = GoldSplitter(
    sets=[GoldSet("train", 0.7), GoldSet("val", 0.3)],
    descriptor=gd,
    selector=gselector,
)

split_table = gs.split_in_table(dataset)
splits = gs.get_split_indices(
    split_table, selection_key="selected", idx_key="idx"
)
train_indices = splits["train"]
val_indices = splits["val"]
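
The idea of keeping the most distinctive samples for training can be sketched in a few lines. This is not Goldener's algorithm: as a simplifying assumption, each sample is scored by the distance to its nearest neighbour, and the highest-scoring fraction goes to the training set. `split_by_novelty` is a hypothetical name.

```python
import math


def split_by_novelty(
    embeddings: list[list[float]], train_ratio: float
) -> dict[str, list[int]]:
    """Send the most isolated samples to "train" and the rest to "val"."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    if len(embeddings) < 2:
        raise ValueError("at least two samples are needed")

    def novelty(index: int) -> float:
        # Distance to the closest other sample: large means few similar samples.
        return min(
            math.dist(embeddings[index], other)
            for j, other in enumerate(embeddings)
            if j != index
        )

    # Rank samples from most to least novel.
    ranked = sorted(range(len(embeddings)), key=novelty, reverse=True)
    train_size = round(train_ratio * len(embeddings))
    return {
        "train": sorted(ranked[:train_size]),
        "val": sorted(ranked[train_size:]),
    }
```

Samples that sit very close to another sample carry little extra information, so they end up in the validation set under this scoring.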

Clustering data to define annotation guidelines

The data often contains multiple "modes" (e.g. different types of images, different types of text, etc.). Goldener can cluster the data to find these modes. The resulting clusters can then be used to define annotation guidelines for each cluster.

from goldener import (
    GoldClusterizer,
    GoldSKLearnClusteringTool,
    GoldDescriptor,
    GoldTorchEmbeddingTool,
    GoldTorchEmbeddingToolConfig,
    TensorVectorizer,
)
from sklearn.cluster import KMeans

gd = GoldDescriptor(...) # reuse the descriptor used for smart sampling
gcluster = GoldClusterizer(
    table_path="my_table_for_clusterization",
    clustering_tool=GoldSKLearnClusteringTool(KMeans(n_clusters=10)),
    cluster_key="cluster",
)

description = gd.describe_in_table(dataset)
clustered_table = gcluster.clusterize_in_table(description)

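for cluster_id in range(10):  # one iteration per KMeans cluster (n_clusters=10)
    # NOTE: get_cluster_indices is not among the imports above; it is presumably
    # exposed by goldener (check the API reference for its exact location).
    cluster_indices = get_cluster_indices(clustered_table, "cluster", cluster_id)

    # sample a few items and use them to define annotation guidelines for this cluster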
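
The last step, drawing a few samples per cluster for review, could look like the following sketch. It works on a plain list of cluster labels rather than on a Goldener table, whose format is not described here; `sample_per_cluster` is a hypothetical helper.

```python
import random
from collections import defaultdict


def sample_per_cluster(
    labels: list[int], per_cluster: int, seed: int = 0
) -> dict[int, list[int]]:
    """Return up to `per_cluster` randomly chosen sample indices for each cluster."""
    members: dict[int, list[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        members[label].append(index)
    rng = random.Random(seed)  # fixed seed keeps the review set reproducible
    return {
        label: sorted(rng.sample(indices, min(per_cluster, len(indices))))
        for label, indices in sorted(members.items())
    }
```

The returned indices can then be shown to annotators, cluster by cluster, when drafting the guidelines.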

Installation

Installing Goldener is as simple as running the following command:

pip install goldener

Contribute

We welcome contributions to Goldener! Here's how you can help:

Getting Started

Fork the repository
Clone your fork
Install the dependencies
Create your branch and make your proposals
Push to your fork and create a pull request
The PR will be automatically tested by GitHub Actions
A maintainer will review your PR and may request changes
Once approved, your PR will be merged

Development

To set up the development environment:

Install uv if you haven't already:

curl -LsSf https://astral.sh/uv/install.sh | sh

Create and activate a virtual environment (optional but recommended):

uv venv
source .venv/bin/activate  # On Unix/macOS

Install development dependencies:

uv sync --all-extras  # Install all dependencies including development dependencies

Run tests:

uv run pytest .

Run type checking with mypy:

uv run mypy .

Run linting with ruff:

# Run all checks
uv run ruff check .

# Format code
uv run ruff format .

Set up pre-commit hooks:

# Install git hooks
uv run pre-commit install

# Run pre-commit on all files
uv run pre-commit run --all-files

The pre-commit hooks will automatically run:

mypy for type checking
ruff for linting and formatting
pytest for tests

whenever you make a commit.

Release Process

To release a new version of the goldener package:

Create a new branch for the release: git checkout -b release-vX.Y.Z
Update the version vX.Y.Z in pyproject.toml
Run uv sync to update the lock file with the new version
Commit the changes with a message like release vX.Y.Z
Merge the branch into main
Trigger a new release on GitHub with the tag vX.Y.Z

Release history

Version	Release date
6.6.0	May 10, 2026
6.5.0	May 6, 2026
6.4.0	May 4, 2026
6.3.1	Apr 15, 2026
6.3.0	Mar 15, 2026
6.2.0 (this version)	Mar 11, 2026
6.1.0	Mar 10, 2026
6.0.0	Mar 8, 2026
5.0.0	Mar 4, 2026
4.4.1	Feb 25, 2026
4.4.0	Feb 24, 2026
4.3.0	Feb 24, 2026
4.2.0	Feb 23, 2026
4.1.0	Feb 23, 2026
4.0.0	Feb 23, 2026
3.2.0	Feb 17, 2026
3.1.0	Feb 13, 2026
3.0.0	Feb 9, 2026
2.4.0	Feb 7, 2026
2.3.0	Jan 13, 2026
2.2.0	Dec 24, 2025
2.1.0	Dec 19, 2025
2.0.0	Dec 19, 2025
1.1.1	Dec 14, 2025
1.1.0	Dec 12, 2025
1.0.0	Dec 11, 2025
0.1.1	Nov 12, 2025
0.1.0	Oct 29, 2025

Download files

Source Distribution

goldener-6.2.0.tar.gz (92.8 kB)

Uploaded Mar 11, 2026 Source

Built Distribution

goldener-6.2.0-py3-none-any.whl (71.7 kB)

Uploaded Mar 11, 2026 Python 3

File details

Details for the file goldener-6.2.0.tar.gz.

File metadata

Download URL: goldener-6.2.0.tar.gz
Upload date: Mar 11, 2026
Size: 92.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for goldener-6.2.0.tar.gz
Algorithm	Hash digest
SHA256	`89013ed37c0f3c9f740c7515ea3e49752348e2a4f257ecc07243d5d5ba6f750e`
MD5	`e491dc00daa503d8b4fade8197d356ee`
BLAKE2b-256	`6219d18c664433c990f7c0f0ad61c8efb2081eb6d712e58e364df3a92d44bec9`
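
The published digests can be used to check a downloaded file. A minimal sketch using only the standard library; `sha256_of` is a hypothetical helper, and the path passed to it is wherever the archive was saved.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large archives are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For the source distribution, `sha256_of(Path("goldener-6.2.0.tar.gz"))` should equal the SHA256 value listed in the table above.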

Provenance

The following attestation bundles were made for goldener-6.2.0.tar.gz:

Publisher: release.yaml on goldener-data/goldener

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: goldener-6.2.0.tar.gz
- Subject digest: 89013ed37c0f3c9f740c7515ea3e49752348e2a4f257ecc07243d5d5ba6f750e
- Sigstore transparency entry: 1086386744
- Sigstore integration time: Mar 11, 2026
Source repository:
- Permalink: goldener-data/goldener@b8f2dae8997b6308f8582a259d15b42ca9b40cb9
- Branch / Tag: refs/tags/v6.2.0
- Owner: https://github.com/goldener-data
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@b8f2dae8997b6308f8582a259d15b42ca9b40cb9
- Trigger Event: release

File details

Details for the file goldener-6.2.0-py3-none-any.whl.

File metadata

Download URL: goldener-6.2.0-py3-none-any.whl
Upload date: Mar 11, 2026
Size: 71.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for goldener-6.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`68ef00c474b5a8ba4c13affd94ebeedc5ac056b936c77ac97ffaedbb10d29154`
MD5	`9565475d5129c13263c4877bf8efc141`
BLAKE2b-256	`f1999994987cbd494f400cb65a5f1c368ceeb27711dfdadb9881770522f01bea`

Provenance

The following attestation bundles were made for goldener-6.2.0-py3-none-any.whl:

Publisher: release.yaml on goldener-data/goldener

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: goldener-6.2.0-py3-none-any.whl
- Subject digest: 68ef00c474b5a8ba4c13affd94ebeedc5ac056b936c77ac97ffaedbb10d29154
- Sigstore transparency entry: 1086386808
- Sigstore integration time: Mar 11, 2026
Source repository:
- Permalink: goldener-data/goldener@b8f2dae8997b6308f8582a259d15b42ca9b40cb9
- Branch / Tag: refs/tags/v6.2.0
- Owner: https://github.com/goldener-data
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@b8f2dae8997b6308f8582a259d15b42ca9b40cb9
- Trigger Event: release
