A library to create random balanced samples through constraint optimization

Project description

sample-dataset

sample-dataset is a small library for generating balanced, constraint-driven samples from tabular data, using:

  • pandas for data handling
  • Google OR-Tools CP-SAT for constraint optimization

It divides your dataset into buckets (e.g., train/test × attribute combinations) while guaranteeing a user-specified minimum size for each bucket.

This is useful for:

  • Train/test splits with structural constraints
  • Balanced dataset construction
  • Controlled sampling for linguistic, NLP, or behavioral datasets
  • Any application where "random sampling" must satisfy non-trivial rules

Features

  • Constraint-based sampling using OR-Tools
  • Flexible bucket definitions via a separate minima dataframe
  • Supports arbitrary bucket-defining columns (e.g., split, feature_a, feature_b, split1, split2, ...)
  • Automatically infers which dataset rows are eligible for which buckets
  • Guarantees minimum bucket sizes
  • Supports multiple randomized feasible solutions
  • Simple API (assign_buckets, assign_buckets_multiple)

Installation

pip install sample-dataset

Quick Start

Import the library

import pandas as pd
from sample_dataset import assign_buckets

Your dataset (df) might look like:

df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "feature_a": ["a", "a", "su", "su"],
    "feature_b": ["yes", "no", "yes", "no"],
    "context": ["...", "...", "...", "..."],
})

Define the bucket structure + minima

df_minima = pd.DataFrame({
    "split": ["train", "test", "train", "test"],
    "feature_a": ["a", "a", "su", "su"],
    "feature_b": ["yes", "yes", "no", "no"],
    "min_required": [150, 50, 150, 50],
})

Each row represents a bucket. All columns except min_required define the bucket identity.
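
As a minimal sketch of that rule, the snippet below derives the bucket-defining columns from df_minima and builds one label per bucket. The "|"-joined label format is an assumption read off the example output in this README, and key_cols and labels are illustrative names; none of this calls the library itself.

```python
import pandas as pd

df_minima = pd.DataFrame({
    "split": ["train", "test", "train", "test"],
    "feature_a": ["a", "a", "su", "su"],
    "feature_b": ["yes", "yes", "no", "no"],
    "min_required": [150, 50, 150, 50],
})

# Every column except min_required identifies the bucket.
key_cols = [c for c in df_minima.columns if c != "min_required"]

# Assumption: a bucket label is the key values joined with "|".
labels = df_minima[key_cols].astype(str).apply(lambda row: "|".join(row), axis=1)

print(key_cols)
print(labels.tolist())
```

Adding another bucket-defining column to df_minima (for example a second split column) would simply extend key_cols and therefore each label.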

Assign buckets

df_out = assign_buckets(df, df_minima, verbose=True)
print(df_out.head())

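Output (illustrative; with minima of 150 and 50, a real run needs far more rows than the four shown above):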

   ID feature_a feature_b   context      bucket
0   1        a       yes       ...   train|a|yes
1   2        a        no       ...    test|a|no
2   3       su       yes       ...  train|su|yes
3   4       su        no       ...   test|su|no
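
One way to confirm that an assignment meets the minima is to count rows per bucket and compare the counts with min_required. The sketch below does this on hand-written, hypothetical data (a small df_out with a bucket column and scaled-down minima); it assumes the "|"-joined label format shown above and does not call the library.

```python
import pandas as pd

# Hypothetical assignment result: only the "bucket" column matters here.
df_out = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "bucket": ["train|a|yes", "train|a|yes", "test|a|yes",
               "train|a|yes", "test|a|yes"],
})

# Hypothetical minima, scaled down to fit the toy data.
df_minima = pd.DataFrame({
    "split": ["train", "test"],
    "feature_a": ["a", "a"],
    "feature_b": ["yes", "yes"],
    "min_required": [3, 1],
})

key_cols = [c for c in df_minima.columns if c != "min_required"]

check = df_minima.copy()
# Rebuild each bucket's label so it can be matched against df_out["bucket"].
check["bucket"] = check[key_cols].astype(str).apply(lambda r: "|".join(r), axis=1)

# Count assigned rows per bucket; buckets that received no rows count as 0.
counts = df_out["bucket"].value_counts()
check["assigned"] = check["bucket"].map(counts).fillna(0).astype(int)
check["ok"] = check["assigned"] >= check["min_required"]

print(check[["bucket", "min_required", "assigned", "ok"]])
```

A False in the ok column would point to a bucket that fell short of its minimum.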

Multiple randomized balanced samples

To generate N different feasible assignments, use:

from sample_dataset import assign_buckets_multiple

df_samples = assign_buckets_multiple(df, df_minima, n_samples=3)
print(df_samples.head())

You’ll get:

bucket_0      bucket_1      bucket_2
train|a|yes   test|a|yes    test|a|yes
...
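
To see how much the samples actually differ, you can compare the bucket columns pairwise. This is a rough sketch on a hand-written, hypothetical frame; it assumes one bucket_<i> column per sample, as in the excerpt above, and does not call the library.

```python
from itertools import combinations

import pandas as pd

# Hypothetical wide result: one bucket_<i> column per sample.
df_samples = pd.DataFrame({
    "bucket_0": ["train|a|yes", "test|a|yes", "train|a|yes"],
    "bucket_1": ["test|a|yes", "train|a|yes", "train|a|yes"],
    "bucket_2": ["test|a|yes", "train|a|yes", "train|a|yes"],
})

sample_cols = [c for c in df_samples.columns if c.startswith("bucket_")]

# For each pair of samples, count the rows assigned to different buckets.
diffs = {
    (a, b): int((df_samples[a] != df_samples[b]).sum())
    for a, b in combinations(sample_cols, 2)
}

for pair, n_different in diffs.items():
    print(pair, n_different)
```

A count of 0 for a pair means the two samples are identical assignments.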

How it works

  • Treats df_minima as the full set of buckets
  • Marks a row as eligible for a bucket when all key_cols match
  • Enforces the minimum size of every bucket
  • Assigns each row to exactly one bucket
  • Uses a randomized objective to obtain diverse feasible assignments
  • Solves the resulting model with OR-Tools’ CP-SAT engine
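
The constraints listed above can be illustrated without a solver. The sketch below is not the library's implementation: it swaps CP-SAT for brute-force enumeration using only the standard library, which is workable only for a handful of rows because the search space grows exponentially. The eligible and minima dictionaries and the feasible_assignments helper are hypothetical.

```python
import itertools
import random

# Hypothetical toy data: each row ID maps to the buckets it is eligible for.
eligible = {
    1: ["train|a|yes", "test|a|yes"],
    2: ["train|a|yes", "test|a|yes"],
    3: ["train|a|yes", "test|a|yes"],
    4: ["train|su|no", "test|su|no"],
    5: ["train|su|no", "test|su|no"],
}
# Hypothetical minimum size per bucket.
minima = {"train|a|yes": 2, "test|a|yes": 1, "train|su|no": 1, "test|su|no": 1}


def feasible_assignments(eligible, minima):
    """Yield every assignment with one bucket per row that meets all minima."""
    rows = list(eligible)
    # Picking one eligible bucket per row enforces "exactly one bucket".
    for choice in itertools.product(*(eligible[r] for r in rows)):
        counts = {bucket: 0 for bucket in minima}
        for bucket in choice:
            counts[bucket] += 1
        # Keep the assignment only if every bucket reaches its minimum.
        if all(counts[b] >= m for b, m in minima.items()):
            yield dict(zip(rows, choice))


solutions = list(feasible_assignments(eligible, minima))

# Choosing among feasible assignments at random stands in for the
# randomized objective; a fixed seed keeps the choice reproducible.
rng = random.Random(0)
picked = rng.choice(solutions)

print(len(solutions), "feasible assignments")
print(picked)
```

A constraint solver reaches the same kind of answer without enumerating every combination, which is what makes realistic dataset sizes tractable.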

Requirements

  • Python ≥ 3.9
  • pandas
  • numpy
  • ortools

pip installs pandas, numpy, and ortools automatically as dependencies.

Links

  • Source code: [https://github.com/LaboratorioSperimentale/sample-dataset]
  • Issues: [https://github.com/LaboratorioSperimentale/sample-dataset/issues]
  • PyPI: [https://pypi.org/project/sample-dataset/]
