A library to create random balanced samples through constraint optimization

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3 :: Only
Topic
- Scientific/Engineering

Project description

sample-dataset

sample-dataset is a small library for generating balanced, constraint-driven samples from tabular data, using:

pandas for data handling
Google OR-Tools CP-SAT for constraint optimization

It allows you to divide your dataset into buckets (e.g., train/test × attribute combinations) while respecting arbitrary minimum size requirements provided by the user.

This is useful for:

Train/test splits with structural constraints
Balanced dataset construction
Controlled sampling for linguistic, NLP, or behavioral datasets
Any application where "random sampling" must satisfy non-trivial rules

Features

Constraint-based sampling using OR-Tools
Flexible bucket definitions via a separate minima dataframe
Supports arbitrary bucket-defining columns (e.g., split, feature_a, feature_b, split1, split2, ...)
Automatically infers which dataset rows are eligible for which buckets
Guarantees minimum bucket sizes
Supports multiple randomized feasible solutions
Simple API (assign_buckets, assign_buckets_multiple)
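Minimum bucket sizes can only be guaranteed when the minima are satisfiable in the first place. The sketch below is not part of sample-dataset: quick_feasibility_issues is a hypothetical helper, and the plain-dict inputs are an assumed representation. It runs two necessary (not sufficient) checks before any solver is involved.

```python
def quick_feasibility_issues(eligible, minima):
    """Return reasons why the minima cannot be met (hypothetical helper).

    eligible: {row_id: set of bucket labels the row may be assigned to}
    minima:   {bucket label: minimum number of rows}
    An empty result does not prove feasibility; these are necessary checks only.
    """
    issues = []

    # Each row fills exactly one bucket, so the minima cannot exceed the row count.
    total_needed = sum(minima.values())
    if total_needed > len(eligible):
        issues.append(
            f"minima sum to {total_needed} but only {len(eligible)} rows exist"
        )

    # Each bucket needs at least as many eligible rows as its minimum.
    for bucket, needed in minima.items():
        available = sum(1 for buckets in eligible.values() if bucket in buckets)
        if available < needed:
            issues.append(
                f"bucket {bucket!r} needs {needed} rows but only {available} are eligible"
            )
    return issues


# Two rows, both eligible for either bucket.
eligible = {1: {"train|a|yes", "test|a|yes"}, 2: {"train|a|yes", "test|a|yes"}}

too_strict = quick_feasibility_issues(eligible, {"train|a|yes": 150, "test|a|yes": 50})
fine = quick_feasibility_issues(eligible, {"train|a|yes": 1, "test|a|yes": 1})
```

Running a check like this first gives a readable explanation when minima are larger than the data can support, instead of a bare "infeasible" status from the solver.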

Installation

pip install sample-dataset

Quick Start

Import your data

import pandas as pd
from sample_dataset import assign_buckets

Your dataset (df) might look like:

df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "feature_a": ["a", "a", "su", "su"],
    "feature_b": ["yes", "no", "yes", "no"],
    "context": ["...", "...", "...", "..."],
})

Define the bucket structure + minima

df_minima = pd.DataFrame({
    "split": ["train", "test", "train", "test"],
    "feature_a": ["a", "a", "su", "su"],
    "feature_b": ["yes", "yes", "no", "no"],
    "min_required": [150, 50, 150, 50],
})

Each row represents a bucket. All columns except min_required define the bucket identity.
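To make the bucket identity concrete, here is a minimal pandas sketch that derives the bucket-defining columns and lists the buckets each row could match. It rests on two assumptions that are not confirmed by the library: a row is eligible when it agrees with a bucket on every bucket column that also exists in the dataset, and bucket labels are the column values joined with "|". The minima are toy values.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "feature_a": ["a", "a", "su", "su"],
    "feature_b": ["yes", "no", "yes", "no"],
})

df_minima = pd.DataFrame({
    "split": ["train", "test", "train", "test"],
    "feature_a": ["a", "a", "su", "su"],
    "feature_b": ["yes", "yes", "no", "no"],
    "min_required": [1, 0, 1, 0],  # toy minima for illustration
})

# Every column except min_required identifies a bucket.
bucket_cols = [c for c in df_minima.columns if c != "min_required"]

# Assumption: rows are matched on the bucket columns that the dataset also has.
shared_cols = [c for c in bucket_cols if c in df.columns]

# Assumed label format, mirroring the "train|a|yes" style.
df_minima["bucket"] = df_minima[bucket_cols].astype(str).agg("|".join, axis=1)

# Inner merge keeps only (row, bucket) pairs that agree on the shared columns.
eligible = df.merge(df_minima[shared_cols + ["bucket"]], on=shared_cols, how="inner")
eligible_by_id = eligible.groupby("ID")["bucket"].apply(list).to_dict()
```

Under these assumptions, IDs 1 and 4 are each eligible for a train and a test bucket, while IDs 2 and 3 match no bucket in this toy data. How the library treats rows without a matching bucket is not stated here.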

Assign buckets

df_out = assign_buckets(df, df_minima, verbose=True)
print(df_out.head())

Output:

   ID feature_a feature_b   context      bucket
0   1        a       yes       ...   train|a|yes
1   2        a        no       ...    test|a|no
2   3       su       yes       ...  train|su|yes
3   4       su        no       ...   test|su|no
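Once an assignment exists, it can be checked against the minima with plain pandas. The sketch below uses a hand-written stand-in for df_out rather than a real call to assign_buckets, and it assumes that the bucket column holds the bucket-defining values joined with "|", as in the output above.

```python
import pandas as pd

# Hand-written stand-in for an assignment result; only "bucket" matters here.
df_out = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "bucket": ["train|a|yes", "train|a|yes", "test|a|yes",
               "train|su|no", "test|su|no"],
})

df_minima = pd.DataFrame({
    "split": ["train", "test", "train", "test"],
    "feature_a": ["a", "a", "su", "su"],
    "feature_b": ["yes", "yes", "no", "no"],
    "min_required": [2, 1, 1, 1],  # toy minima for illustration
})

# Rebuild each bucket's label from its identifying columns (assumed format).
bucket_cols = [c for c in df_minima.columns if c != "min_required"]
labels = df_minima[bucket_cols].astype(str).agg("|".join, axis=1)

# Count assigned rows per bucket; buckets that received no rows count as 0.
counts = df_out["bucket"].value_counts()
report = pd.DataFrame({
    "bucket": labels,
    "min_required": df_minima["min_required"],
    "assigned": labels.map(counts).fillna(0).astype(int),
})
report["ok"] = report["assigned"] >= report["min_required"]
```

A report like this makes it easy to spot a bucket that fell short, for example after editing df_minima by hand.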

Multiple randomized balanced samples

To generate N different feasible assignments, use:

from sample_dataset import assign_buckets_multiple

df_samples = assign_buckets_multiple(df, df_minima, n_samples=3)
print(df_samples.head())

You’ll get:

bucket_0      bucket_1      bucket_2
train|a|yes   test|a|yes    test|a|yes
...
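To see how different the samples really are, one option is to measure the fraction of rows whose bucket changes between each pair of samples. The sketch below assumes a wide layout with columns named bucket_0, bucket_1, and so on, as in the output above, and uses a hand-written stand-in frame instead of a real library result.

```python
from itertools import combinations

import pandas as pd

# Hand-written stand-in for a wide multi-sample result (one column per sample).
df_samples = pd.DataFrame({
    "bucket_0": ["train|a|yes", "test|a|yes", "train|su|no", "test|su|no"],
    "bucket_1": ["test|a|yes", "train|a|yes", "train|su|no", "test|su|no"],
    "bucket_2": ["test|a|yes", "train|a|yes", "test|su|no", "train|su|no"],
})

sample_cols = [c for c in df_samples.columns if c.startswith("bucket_")]

# For each pair of samples, the share of rows assigned to different buckets.
disagreement = {
    (a, b): float((df_samples[a] != df_samples[b]).mean())
    for a, b in combinations(sample_cols, 2)
}
```

A value near 0 means two samples are almost identical, and a value near 1 means almost every row moved to a different bucket.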

How it works

The library:

Interprets df_minima as the full set of buckets
Matches rows to buckets when all key columns (key_cols) match
Enforces minimum bucket sizes
Forces each row to belong to exactly one bucket
Uses a randomized objective to obtain diverse feasible assignments
Solves the resulting model with OR-Tools’ CP-SAT engine

Requirements

Python ≥ 3.9
pandas
numpy
ortools

pip installs the three package dependencies (pandas, numpy, ortools) automatically.

Release history

0.3.0 (this version): Dec 15, 2025
0.2.0: Dec 15, 2025
0.1.1: Dec 13, 2025

Download files

Source Distribution

sample_dataset-0.3.0.tar.gz (33.5 kB)

Uploaded Dec 15, 2025 Source

Built Distribution

sample_dataset-0.3.0-py3-none-any.whl (31.9 kB)

Uploaded Dec 15, 2025 Python 3

File details

Details for the file sample_dataset-0.3.0.tar.gz.

File metadata

Download URL: sample_dataset-0.3.0.tar.gz
Upload date: Dec 15, 2025
Size: 33.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for sample_dataset-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`37f42994d2855d4116bc1221a3423cb5aa5af738919725d3cf95dfb2d291b0aa`
MD5	`43f0eef2b7e7281972e8b3520d8104e9`
BLAKE2b-256	`d45154c2b386e1145ea9c0ebd1791e198351124548ad5ec3314eeb1c793d6589`

File details

Details for the file sample_dataset-0.3.0-py3-none-any.whl.

File metadata

Download URL: sample_dataset-0.3.0-py3-none-any.whl
Upload date: Dec 15, 2025
Size: 31.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for sample_dataset-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ddc44a0141a74bdc4c581499dae5a45b1ddc9dd2a321a232e17ab4531d4598d9`
MD5	`749b766446722991acf5b0ba5b5855e7`
BLAKE2b-256	`d46b3db71fa31f014681073c9a8ff5d4d0728c80df94960cebf498b600b60a35`
