A library to create random balanced samples through constraint optimization
sample-dataset
sample-dataset is a small library for generating balanced, constraint-driven samples from tabular data, using:
- pandas for data handling
- Google OR-Tools CP-SAT for constraint optimization
It allows you to divide your dataset into buckets (e.g., train/test × attribute combinations) while respecting arbitrary minimum size requirements provided by the user.
This is useful for:
- Train/test splits with structural constraints
- Balanced dataset construction
- Controlled sampling for linguistic, NLP, or behavioral datasets
- Any application where "random sampling" must satisfy non-trivial rules
Features
- Constraint-based sampling using OR-Tools
- Flexible bucket definitions via a separate minima dataframe
- Supports arbitrary bucket-defining columns (e.g., split, feature_a, feature_b, split1, split2, ...)
- Automatically infers which dataset rows are eligible for which buckets
- Guarantees minimum bucket sizes
- Supports multiple randomized feasible solutions
- Simple API (assign_buckets, assign_buckets_multiple)
Installation
pip install sample-dataset
Quick Start
Import the library
import pandas as pd
from sample_dataset import assign_buckets
Your dataset (df) might look like:
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "feature_a": ["a", "a", "su", "su"],
    "feature_b": ["yes", "no", "yes", "no"],
    "context": ["...", "...", "...", "..."],
})
Define the bucket structure + minima
df_minima = pd.DataFrame({
    "split": ["train", "test", "train", "test"],
    "feature_a": ["a", "a", "su", "su"],
    "feature_b": ["yes", "yes", "no", "no"],
    "min_required": [150, 50, 150, 50],
})
Each row represents a bucket. All columns except min_required define the bucket identity.
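To make the bucket-identity rule concrete, here is a minimal pandas sketch. It is not the library's code: the names bucket_cols, match_cols, and eligible are illustrative, the "|"-joined label simply mirrors the bucket strings shown in this README, and the assumption that eligibility is decided on the columns shared by df and df_minima is ours. The minima are shrunk to 1 so the toy data stays small.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "feature_a": ["a", "a", "su", "su"],
    "feature_b": ["yes", "no", "yes", "no"],
})
df_minima = pd.DataFrame({
    "split": ["train", "test", "train", "test"],
    "feature_a": ["a", "a", "su", "su"],
    "feature_b": ["yes", "yes", "no", "no"],
    "min_required": [1, 1, 1, 1],  # toy minima, for illustration only
})

# Bucket identity: every column of df_minima except min_required.
bucket_cols = [c for c in df_minima.columns if c != "min_required"]

# Assumption: rows are matched to buckets on the columns both frames share.
match_cols = [c for c in bucket_cols if c in df.columns]

# One label per bucket, e.g. "train|a|yes" (format taken from the README output).
labels = df_minima[bucket_cols].astype(str).agg("|".join, axis=1)

# Count how many dataset rows could go into each bucket.
pairs = df_minima.assign(bucket=labels).merge(df, on=match_cols, how="left")
eligible = pairs.groupby("bucket")["ID"].count()
print(eligible)
```

A bucket whose eligible count is below its min_required can never be filled, so a count like this is a cheap sanity check before running the solver.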
Assign buckets
df_out = assign_buckets(df, df_minima, verbose=True)
print(df_out.head())
Example output (format only; a real run needs enough rows to meet every min_required):
   ID feature_a feature_b context        bucket
0   1         a       yes     ...   train|a|yes
1   2         a        no     ...     test|a|no
2   3        su       yes     ...  train|su|yes
3   4        su        no     ...    test|su|no
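Once an assignment is returned, it can be worth confirming that every minimum was met. The sketch below assumes only what the example output shows: a bucket column holding "|"-joined labels. The helper check_minima is hypothetical and not part of sample-dataset, and the data is a toy example.

```python
import pandas as pd

def check_minima(df_out, df_minima, sep="|"):
    """Compare per-bucket counts with the required minima (hypothetical helper)."""
    bucket_cols = [c for c in df_minima.columns if c != "min_required"]
    report = df_minima.copy()
    # Rebuild the label used in the bucket column, e.g. "train|a".
    report["bucket"] = report[bucket_cols].astype(str).agg(sep.join, axis=1)
    counts = df_out["bucket"].value_counts()
    # Buckets that received no rows get a count of 0.
    report["assigned"] = report["bucket"].map(counts).fillna(0).astype(int)
    report["ok"] = report["assigned"] >= report["min_required"]
    return report[["bucket", "min_required", "assigned", "ok"]]

# Toy assignment and minima, for illustration only.
toy_out = pd.DataFrame({
    "ID": [1, 2, 3],
    "bucket": ["train|a", "train|a", "test|a"],
})
toy_minima = pd.DataFrame({
    "split": ["train", "test"],
    "feature_a": ["a", "a"],
    "min_required": [2, 2],
})
report = check_minima(toy_out, toy_minima)
print(report)
```

In this toy case the train bucket meets its minimum and the test bucket does not, which is the kind of shortfall the report is meant to surface.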
Multiple randomized balanced samples
To generate N different feasible assignments, use:
from sample_dataset import assign_buckets_multiple
df_samples = assign_buckets_multiple(df, df_minima, n_samples=3)
print(df_samples.head())
You’ll get:
bucket_0     bucket_1    bucket_2
train|a|yes  test|a|yes  test|a|yes
...
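To see how different the samples are from one another, one option is to measure how often each pair of samples puts a row in the same bucket. This sketch assumes the wide layout shown above (one bucket_N column per sample); the frame below is hand-written toy data, not real library output.

```python
import pandas as pd

# Toy stand-in for the wide result: one column per sample, one row per item.
df_samples = pd.DataFrame({
    "bucket_0": ["train|a|yes", "test|a|yes", "train|su|no"],
    "bucket_1": ["test|a|yes", "test|a|yes", "train|su|no"],
    "bucket_2": ["test|a|yes", "train|a|yes", "train|su|no"],
})

sample_cols = [c for c in df_samples.columns if c.startswith("bucket_")]

# Fraction of rows on which each pair of samples agrees (1.0 on the diagonal).
agreement = pd.DataFrame({
    a: {b: (df_samples[a] == df_samples[b]).mean() for b in sample_cols}
    for a in sample_cols
})
print(agreement.round(2))
```

Values close to 1.0 mean two samples are nearly identical; lower values indicate more diverse assignments.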
How it works
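- Interprets df_minima as the full set of buckets.
- Marks a row as eligible for a bucket when its values match the bucket's on the key columns (the columns of df_minima other than min_required that also appear in the dataset).
- Enforces the minimum size of every bucket.
- Forces each row to belong to exactly one bucket.
- Uses a randomized objective so that repeated solves yield diverse feasible assignments.
- Solves the resulting model with OR-Tools' CP-SAT solver.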
Requirements
- Python ≥ 3.9
- pandas
- numpy
- ortools
These are installed automatically.
Links
- Source code: https://github.com/LaboratorioSperimentale/sample-dataset
- Issues: https://github.com/LaboratorioSperimentale/sample-dataset/issues
- PyPI: https://pypi.org/project/sample-dataset/
File details
Details for the file sample_dataset-0.2.0.tar.gz.
File metadata
- Download URL: sample_dataset-0.2.0.tar.gz
- Upload date:
- Size: 33.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 653c003c2fb6dd019960bc7b9e17fc4d16e99432e34fc36a497c9a454acf7424 |
| MD5 | 99a4a90839f1b5419ca42a234a47551b |
| BLAKE2b-256 | a6f483f64aa76381503669191f8d324efe262aee8378b1e818e68f6d42f2e541 |
File details
Details for the file sample_dataset-0.2.0-py3-none-any.whl.
File metadata
- Download URL: sample_dataset-0.2.0-py3-none-any.whl
- Upload date:
- Size: 31.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6b271c47dc6a1199f7115e6cec7e32ebf895126d6e5827ad9a858e639e43008a |
| MD5 | 2a3393a9b9040d6b850182a61f619227 |
| BLAKE2b-256 | a937d4f90f2a897c46239f826a0d0fce8a91481639e9eef0b5da972e3a0aa26d |