A package for managing and processing datasets for fairness research in machine learning

Project description

FairML Datasets

A comprehensive Python package for loading, processing, and working with datasets used in fair classification.

Overview

The dataset preprocessing pipeline supported by the package.

FairML Datasets provides tools and interfaces to download, load, transform, and analyze the datasets in the FairGround corpus. It handles sensitive attributes and facilitates fairness-aware machine learning experiments. The package supports the full data processing pipeline from downloading data all the way to splitting data for ML training.

Key Features

📦 Loading: Easily download, load and prepare any of the 44 supported datasets in the corpus.
🗂️ Collections: Conveniently use any of our prespecified collections, which have been developed to maximize diversity in algorithmic performance.
🔄 Multi-Dataset Support: Easily evaluate your algorithm on one scenario, five, or forty using a simple loop.
⚙️ Processing: Automatically apply dataset (pre)processing with configurable choices and defaults available.
📊 Metadata Generation: Automatically calculate rich metadata features for datasets.
💻 Command-line Interface: Access common operations without writing code.
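
The multi-dataset loop mentioned above could look roughly like the sketch below. It uses only the calls shown in the Quick Start (Dataset.from_id, load, transform, train_test_val_split, sensitive_columns). The helper name evaluate_on_datasets, the evaluate callback and the dataset_factory parameter are hypothetical and not part of the package.

```python
def evaluate_on_datasets(dataset_ids, evaluate, dataset_factory=None):
    """Run `evaluate` on the splits of every dataset ID and collect the results.

    `evaluate` is a user-supplied callable (hypothetical, not part of the package)
    that receives (df_train, df_test, df_val, sensitive_columns).
    """
    if dataset_factory is None:
        # Imported lazily so the helper can be exercised with a stand-in factory.
        from fairml_datasets import Dataset

        dataset_factory = Dataset.from_id

    results = {}
    for dataset_id in dataset_ids:
        dataset = dataset_factory(dataset_id)
        df = dataset.load()
        df_transformed, _info = dataset.transform(df)
        # Same return order as in the Quick Start: train, test, validation.
        df_train, df_test, df_val = dataset.train_test_val_split(df_transformed)
        results[dataset_id] = evaluate(
            df_train, df_test, df_val, dataset.sensitive_columns
        )
    return results
```

For example, evaluate_on_datasets(["folktables_acsincome_small"], my_evaluate) would run a user-defined my_evaluate function on that single dataset; adding more IDs to the list extends the same loop to further scenarios.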

Installation

pip install fairml-datasets

Or using uv:

uv add fairml-datasets

Quick Start

from fairml_datasets import Dataset

# Access a specific dataset by ID directly
dataset = Dataset.from_id("folktables_acsincome_small")

# Load the dataset
df = dataset.load()

# Check sensitive attributes
print(f"Sensitive columns: {dataset.sensitive_columns}")

# Transform the dataset
df_transformed, info = dataset.transform(df)

# Create train/test/validation split
df_train, df_test, df_val = dataset.train_test_val_split(df_transformed)

Curious which datasets are available? Check out the Datasets Overview in the sidebar to see the full list.

Command-line Usage

The package provides a command-line interface for common operations:

# Generate and export metadata
python -m fairml_datasets metadata

# Export datasets in various processing stages
python -m fairml_datasets export-datasets --stage prepared

# Export dataset citations in BibTeX format
python -m fairml_datasets export-citations
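
If these commands need to run from a Python script (for example, in an automated pipeline), one option is to call them through subprocess. This is a minimal sketch: the helper names build_cli_command and run_cli are hypothetical, and only the subcommands and flags shown above are assumed to exist.

```python
import subprocess
import sys


def build_cli_command(*args):
    """Build the argv list for `python -m fairml_datasets <args>`."""
    # sys.executable keeps the call inside the current Python environment.
    return [sys.executable, "-m", "fairml_datasets", *args]


def run_cli(*args):
    """Run a CLI subcommand; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_cli_command(*args), check=True)


# Usage, mirroring the shell commands above (requires the package to be installed):
#   run_cli("metadata")
#   run_cli("export-datasets", "--stage", "prepared")
#   run_cli("export-citations")
```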

Development

Development dependencies are managed via uv. For information on how to install uv, refer to the official installation instructions.

To install all dependencies, run:

uv sync --dev

Formatting

We use ruff to format and lint the code. You can autoformat and lint by running:

ruff check . --fix && ruff format .

Tests

Tests are located in the tests/ directory. You can run all tests using pytest:

uv run pytest

License

Due to restrictions in some of the third-party code we include, this work is licensed under two licenses.

The primary license of this work is the Creative Commons Attribution 4.0 International License (CC BY 4.0). This license applies to all assets generated by the authors of this work. It does NOT apply to the generate_synthetic_data.py script, which is instead licensed under the GNU GPLv3.

The second license, which applies to the complete repository, is the more restrictive GNU General Public License version 3 (GNU GPLv3).

Please note that this licensing information only refers to the code, annotations and generated metadata. Individual datasets which are loaded and exported by this package may have different licenses. Please refer to individual datasets and their sources for dataset-level information.

Project details

Release history

0.2.5: Dec 17, 2025
0.2.4: Dec 17, 2025
0.2.3: Dec 17, 2025
0.2.2: Dec 10, 2025
0.2.1: Dec 10, 2025 (this version)
0.2.0: Dec 5, 2025
0.1.4: Oct 8, 2025
0.1.3: Oct 8, 2025
0.1.2: Oct 8, 2025
0.1.1: Oct 7, 2025
0.1.0: Oct 7, 2025

Download files

Source Distribution

fairml_datasets-0.2.1.tar.gz (104.0 kB view details)

Uploaded Dec 10, 2025 Source

Built Distribution

fairml_datasets-0.2.1-py3-none-any.whl (105.4 kB view details)

Uploaded Dec 10, 2025 Python 3

File details

Details for the file fairml_datasets-0.2.1.tar.gz.

File metadata

Download URL: fairml_datasets-0.2.1.tar.gz
Upload date: Dec 10, 2025
Size: 104.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fairml_datasets-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`81c472f35bd2c8a3003eefd0dbbd9db43aed1b8e3f39f7d56efbee484817bff3`
MD5	`57c4b518d758a1ddf11328c780069fea`
BLAKE2b-256	`6a9c0d8ce396afacc6fbe17d7347424ec515dc0e8730598bfe4d1523d51722ea`
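
To check a downloaded file against the published SHA256 digest, a small script using Python's standard hashlib module is enough. This is a sketch under the assumption that the file has already been downloaded locally; the function names sha256_of_file and matches_published_hash are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops once read() returns empty bytes (end of file).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Calling matches_published_hash("fairml_datasets-0.2.1.tar.gz", "81c472f35bd2c8a3003eefd0dbbd9db43aed1b8e3f39f7d56efbee484817bff3") should return True for an intact copy of the source distribution. pip can also enforce digests at install time through its hash-checking mode (--require-hashes with a requirements file).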

Provenance

The following attestation bundles were made for fairml_datasets-0.2.1.tar.gz:

Publisher: publish.yml on reliable-ai/fairground

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fairml_datasets-0.2.1.tar.gz
- Subject digest: 81c472f35bd2c8a3003eefd0dbbd9db43aed1b8e3f39f7d56efbee484817bff3
- Sigstore transparency entry: 757545656
- Sigstore integration time: Dec 10, 2025
Source repository:
- Permalink: reliable-ai/fairground@e96eda98bbfe9ead9951abef392cf1b6fabae307
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/reliable-ai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@e96eda98bbfe9ead9951abef392cf1b6fabae307
- Trigger Event: release

File details

Details for the file fairml_datasets-0.2.1-py3-none-any.whl.

File metadata

Download URL: fairml_datasets-0.2.1-py3-none-any.whl
Upload date: Dec 10, 2025
Size: 105.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fairml_datasets-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8b96a2840f735125e4891f50845aea85f3b35e67ee344e4fe837266e8f1a4e23`
MD5	`bd294bf1d0d3b673336049e8b9b2196f`
BLAKE2b-256	`ec458ce62fbe3ce0fdecb9b970dfe3cd02d7446ce7f1f1a763f43a3653bd262f`

Provenance

The following attestation bundles were made for fairml_datasets-0.2.1-py3-none-any.whl:

Publisher: publish.yml on reliable-ai/fairground

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fairml_datasets-0.2.1-py3-none-any.whl
- Subject digest: 8b96a2840f735125e4891f50845aea85f3b35e67ee344e4fe837266e8f1a4e23
- Sigstore transparency entry: 757545661
- Sigstore integration time: Dec 10, 2025
Source repository:
- Permalink: reliable-ai/fairground@e96eda98bbfe9ead9951abef392cf1b6fabae307
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/reliable-ai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@e96eda98bbfe9ead9951abef392cf1b6fabae307
- Trigger Event: release
