Goldener - Make your data even more valuable

These details have been verified by PyPI

Maintainers

Yann-CV

These details have not been verified by PyPI

Development Status
- 4 - Beta
Intended Audience
- Developers
- Science/Research
Programming Language
- Python :: 3.13
Topic
- Scientific/Engineering :: Artificial Intelligence

Project description

goldener

Goldener - Make your data even more valuable

Goldener is an open-source Python library (Apache 2.0 license) that orchestrates data sampling, splitting, and labeling throughout the full life cycle of machine learning (ML) pipelines.

In the era of artificial intelligence (AI), data is the new gold. Being able to collect it is a good start, but using it blindly is costly: annotation, storage, and training all have a price. Goldener aims to reduce these costs by optimizing:

Data sampling and splitting: Sample the right data, ensuring just enough representativeness for the task.
Data labeling: Ensure high-quality data at scale with optimized human-in-the-loop processes.

Sampling and labeling in AI life cycle

Successful machine learning pipelines are all about data. Throughout the AI life cycle, access to data that fully models the target task is key. Both training and test data are therefore crucial to the success of an ML pipeline, and both are continuously updated to maintain the pipeline's performance for as long as it is in use:

Training: The training data determines the model's ability to succeed at its task. Training on data that is not representative enough leaves the model unable to learn the task adequately. At the same time, using too much data slows down the training process. Once the model is deployed, selecting new training data efficiently is also crucial to address data drift.
Test: The test data drives the design and validation of the pipelines under test. Testing on data that is not representative enough leads to bad design decisions, and hence to poor performance in production. Once the model is deployed, selecting new test data is also crucial to monitor the model's performance and ensure that it succeeds at its task.

All the sampled data must also be labeled before it can be used in the ML lifecycle; with unsupervised learning, this applies at least to the test and monitoring sets. Labeling data is costly and time-consuming, especially for large datasets. Most of the time, it is an iterative process involving human labelers, potentially assisted by AI tools. Efficient data sampling before labeling reduces the time and cost of obtaining new labeled data. Meanwhile, depending on the task and target, poor-quality labels can:

Lead to a model unable to learn the task adequately (wrong labels pushing it in the wrong direction)
Lead to wrong design decisions during the model validation phase (wrong test data leading to wrong conclusions)

The goal of Goldener is to provide the orchestration that ensures access to high-quality, sufficiently representative data during the whole life cycle of ML pipelines. With Goldener, users get the right data at the right time, which supports the best performance of the ML pipelines while minimizing the costs of data sampling and labeling.

Figure: Sampling and labeling during the AI lifecycle

Goldener for sampling and labeling orchestration

Like a gardener making the most of good soil, Goldener aims to make the most of your gold (data) and make it even more valuable. Goldener mainly provides a set of tools to help you with:

Gold prospection: Sample and split the most valuable gold (data)
- Sample among raw data: Spot the data needed to train a pipeline for a task, or spot the weaknesses of a deployed pipeline, while minimizing the need for labeling.
- Split labeled data: Ensure enough representativeness in both the training and test sets while optimizing the effectiveness and duration of training.
Gold refining: Ensure the gold (data) quality
- Assist in the labeling process: Make human labeling faster with smart labeling tools (for instance, creating image segmentation masks from a single click).
- Label data automatically: Propose labels for raw data based on foundation models or existing labeled data.
- Curate newly labeled data: Identify potential labeling mistakes so that human labelers can converge toward high-quality labeled data.
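As an illustration of the "Split labeled data" item above, here is a minimal sketch of a label-stratified train/test split using only the Python standard library. It is not Goldener's implementation: the function name `stratified_split` and its parameters are hypothetical, and stratifying on labels is a simplification of splitting on extracted features.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices), keeping every label in both sets.

    Hypothetical helper for illustration only; not part of the goldener API.
    A label with a single item ends up in the test set only.
    """
    rng = random.Random(seed)

    # Group item indices by label.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train, test = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # At least one test item per label, but keep one for training if possible.
        n_test = max(1, round(len(indices) * test_fraction))
        if len(indices) > 1:
            n_test = min(n_test, len(indices) - 1)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)
```

For example, `stratified_split(["cat"] * 8 + ["dog"] * 4)` puts two "cat" items and one "dog" item in the test set, so both classes are represented on each side of the split.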

Key design principles

Goldener is designed to process large datasets efficiently and effectively. It is built on the assumption that most AI lifecycles are iterative and incremental. Its design principles are:

Progressive batch processing: Each task can be stopped and restarted on demand (or after a failure). Results that have already been computed are not recomputed.
Distributed first: Any task can be distributed across multiple machines.
On-demand access to pipelines: All processing pipelines are serializable. They are stored and available whenever a new request is made.
Multipurpose embeddings: Whenever possible, the same embeddings are reused for the different prospection and refining actions on the same data.

To orchestrate both the sampling and labeling of data in Goldener, the same data moves from step to step during the AI lifecycle. In addition, the information gathered along the cycle is leveraged to make the next round of sampling and labeling more efficient. All the data is therefore cached behind the scenes and accessible at any time.
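The progressive batch processing and caching ideas can be pictured with a small, generic sketch. It assumes a JSON file as the cache and string keys for the items; the function `process_in_batches` and its parameters are hypothetical and do not describe how Goldener stores its results.

```python
import json
from pathlib import Path


def process_in_batches(items, compute, cache_path, batch_size=2):
    """Apply `compute` to each value of `items`, skipping keys already cached.

    Hypothetical helper for illustration only; not part of the goldener API.
    `items` maps string keys to inputs, and results must be JSON-serializable.
    """
    cache_path = Path(cache_path)
    # Reload whatever a previous (possibly interrupted) run already computed.
    results = json.loads(cache_path.read_text()) if cache_path.exists() else {}

    pending = [key for key in items if key not in results]
    for start in range(0, len(pending), batch_size):
        for key in pending[start:start + batch_size]:
            results[key] = compute(items[key])
        # Persist after every batch so a restart resumes from here.
        cache_path.write_text(json.dumps(results))
    return results
```

Calling the function a second time with the same cache file only computes the keys that are missing, which is the behavior the first design principle describes.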

Current focus

Goldener is a work in progress and is currently in the early stages of development. The current focus is on releasing and validating the first feature, data splitting. For now, the features therefore cannot be run in a distributed workflow. We hope to get to it soon.

Main features

GoldFeatureExtractor: Extract embeddings/features of the data from different layers. Features from multiple layers can be fused to get richer representations.
GoldDescriptor: Extract the features/embeddings of a full dataset and store them locally.
GoldSelector: Select a subset of a dataset based on the features extracted from a model. The selection is optimized to ensure representativeness while minimizing redundancy.
GoldSplitter: Split a dataset into multiple splits based on the distribution of the features extracted from a model. The splits are optimized to ensure representativeness while minimizing redundancy.
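To give a feel for what selecting a representative, non-redundant subset from embeddings can mean, here is a minimal sketch of greedy farthest-point (k-center) selection in pure Python. This is a generic technique chosen for illustration, not a description of the algorithm GoldSelector uses; the function name and signature are hypothetical.

```python
import math


def farthest_point_selection(embeddings, k, start=0):
    """Greedily pick k indices so that each new point is far from those chosen.

    Hypothetical helper for illustration only; not part of the goldener API.
    `embeddings` is a sequence of equal-length numeric vectors.
    """
    if not 0 < k <= len(embeddings):
        raise ValueError("k must be between 1 and the number of embeddings")

    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(point, embeddings[start]) for point in embeddings]

    while len(selected) < k:
        # The point farthest from the current selection adds the most coverage.
        next_index = max(range(len(embeddings)), key=min_dist.__getitem__)
        selected.append(next_index)
        for i, point in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], math.dist(point, embeddings[next_index]))
    return selected
```

Near-duplicate points have a small distance to an already selected point, so they are picked last; this is one simple way to trade redundancy for coverage.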

Installation

pip install goldener
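After installing, one way to confirm which version is present in the current environment is to query the package metadata with the standard library. The helper name `installed_version` is hypothetical; `importlib.metadata` is part of Python 3.8 and later.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="goldener"):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_version())
```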

Contributing

We welcome contributions to Goldener! Here's how you can help:

Getting Started

Fork the repository
Clone your fork
Install the dependencies
Create your branch and make your proposals
Push to your fork and create a pull request
The PR will be automatically tested by GitHub Actions
A maintainer will review your PR and may request changes
Once approved, your PR will be merged

Development

To set up the development environment:

Install uv if you haven't already:

curl -LsSf https://astral.sh/uv/install.sh | sh

Create and activate a virtual environment (optional but recommended):

uv venv
source .venv/bin/activate  # On Unix/macOS

Install development dependencies:

uv sync --all-extras  # Install all dependencies including development dependencies

Run tests:

uv run pytest .

Run type checking with mypy:

uv run mypy .

Run linting with ruff:

# Run all checks
uv run ruff check .

# Format code
uv run ruff format .

Set up pre-commit hooks:

# Install git hooks
uv run pre-commit install

# Run pre-commit on all files
uv run pre-commit run --all-files

The pre-commit hooks will automatically run:

mypy for type checking
ruff for linting and formatting
pytest for tests

whenever you make a commit.

Release Process

To release a new version of the goldener package:

Create a new branch for the release: git checkout -b release-vX.Y.Z
Update the version vX.Y.Z in pyproject.toml
Commit the changes with a message like release vX.Y.Z
Merge the branch into main
Trigger a new release on GitHub with the tag vX.Y.Z

Release history

6.6.0

May 10, 2026

6.5.0

May 6, 2026

6.4.0

May 4, 2026

6.3.1

Apr 15, 2026

6.3.0

Mar 15, 2026

6.2.0

Mar 11, 2026

6.1.0

Mar 10, 2026

6.0.0

Mar 8, 2026

5.0.0

Mar 4, 2026

4.4.1

Feb 25, 2026

4.4.0

Feb 24, 2026

4.3.0

Feb 24, 2026

4.2.0

Feb 23, 2026

4.1.0

Feb 23, 2026

4.0.0

Feb 23, 2026

3.2.0

Feb 17, 2026

3.1.0

Feb 13, 2026

3.0.0

Feb 9, 2026

2.4.0

Feb 7, 2026

2.3.0

Jan 13, 2026

2.2.0

Dec 24, 2025

2.1.0

Dec 19, 2025

2.0.0

Dec 19, 2025

1.1.1

Dec 14, 2025

This version

1.1.0

Dec 12, 2025

1.0.0

Dec 11, 2025

0.1.1

Nov 12, 2025

0.1.0

Oct 29, 2025

Download files

Source Distribution

goldener-1.1.0.tar.gz (54.2 kB view details)

Uploaded Dec 12, 2025 Source

Built Distribution

goldener-1.1.0-py3-none-any.whl (43.8 kB view details)

Uploaded Dec 12, 2025 Python 3

File details

Details for the file goldener-1.1.0.tar.gz.

File metadata

Download URL: goldener-1.1.0.tar.gz
Upload date: Dec 12, 2025
Size: 54.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for goldener-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`eaae14ef05558e1022543780cec92cc066dabcf68fa9c62b8f0f01a06969070b`
MD5	`317d401f8421e5ac4e2753d7bcda4661`
BLAKE2b-256	`79ce8b9e67e05848b31ea6dddcd5c43e33cc6510d74f8ff3cdb302396cb2f52f`
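The SHA256 digests listed for each file can be compared against a local download. The sketch below uses only `hashlib` from the standard library; the helper name `sha256_of_file` is hypothetical, and the file path in the usage note assumes the archive sits in the current directory.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hexadecimal SHA256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For the source distribution, the check would be `sha256_of_file("goldener-1.1.0.tar.gz") == "eaae14ef05558e1022543780cec92cc066dabcf68fa9c62b8f0f01a06969070b"`.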

Provenance

The following attestation bundles were made for goldener-1.1.0.tar.gz:

Publisher: release.yaml on goldener-data/goldener

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: goldener-1.1.0.tar.gz
- Subject digest: eaae14ef05558e1022543780cec92cc066dabcf68fa9c62b8f0f01a06969070b
- Sigstore transparency entry: 762518912
- Sigstore integration time: Dec 12, 2025
Source repository:
- Permalink: goldener-data/goldener@bfa56ba85b12a57cefa9321f4a5a736e942ec924
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/goldener-data
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@bfa56ba85b12a57cefa9321f4a5a736e942ec924
- Trigger Event: release

File details

Details for the file goldener-1.1.0-py3-none-any.whl.

File metadata

Download URL: goldener-1.1.0-py3-none-any.whl
Upload date: Dec 12, 2025
Size: 43.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for goldener-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`51e97605d1508daeae3949c51d6f40ee8769b4b98952cc32f320238b4c09032e`
MD5	`86be30eab70806fa52d94f061d9b4420`
BLAKE2b-256	`8ae4112243cd519272639b30400d7b4eb5ef11f406a634524be6c8bde1440a7d`

Provenance

The following attestation bundles were made for goldener-1.1.0-py3-none-any.whl:

Publisher: release.yaml on goldener-data/goldener

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: goldener-1.1.0-py3-none-any.whl
- Subject digest: 51e97605d1508daeae3949c51d6f40ee8769b4b98952cc32f320238b4c09032e
- Sigstore transparency entry: 762518913
- Sigstore integration time: Dec 12, 2025
Source repository:
- Permalink: goldener-data/goldener@bfa56ba85b12a57cefa9321f4a5a736e942ec924
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/goldener-data
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@bfa56ba85b12a57cefa9321f4a5a736e942ec924
- Trigger Event: release
