A library to create random balanced samples through constraint optimization

sample-dataset

sample-dataset is a small library for generating balanced, constraint-driven samples from tabular data, using:

  • pandas for data handling
  • Google OR-Tools CP-SAT for constraint optimization

It allows you to divide your dataset into buckets (e.g., train/test × attribute combinations) while respecting arbitrary minimum size requirements provided by the user.

This is useful for:

  • Train/test splits with structural constraints
  • Balanced dataset construction
  • Controlled sampling for linguistic, NLP, or behavioral datasets
  • Any application where "random sampling" must satisfy non-trivial rules

Features

  • Constraint-based sampling using OR-Tools
  • Flexible bucket definitions via a separate minima dataframe
  • Supports arbitrary bucket-defining columns (e.g., split, feature_a, feature_b, split1, split2, ...)
  • Automatically infers which dataset rows are eligible for which buckets
  • Guarantees minimum bucket sizes
  • Supports multiple randomized feasible solutions
  • Simple API (assign_buckets, assign_buckets_multiple)

Installation

pip install sample-dataset

Quick Start

Import the library

import pandas as pd
from sample_dataset import assign_buckets

Your dataset (df) might look like this (four rows for brevity; a real dataset needs enough rows to meet the minima defined below):

df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "feature_a": ["a", "a", "su", "su"],
    "feature_b": ["yes", "no", "yes", "no"],
    "context": ["...", "...", "...", "..."],
})

Define the bucket structure + minima

df_minima = pd.DataFrame({
    "split": ["train", "test", "train", "test"],
    "feature_a": ["a", "a", "su", "su"],
    "feature_b": ["yes", "yes", "no", "no"],
    "min_required": [150, 50, 150, 50],
})

Each row represents a bucket. All columns except min_required define the bucket identity.
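
As a rough illustration of what "bucket identity" means, the sketch below derives the key columns from df_minima and joins them into labels of the form split|feature_a|feature_b, as seen in the example output. The helper name bucket_labels is hypothetical and not part of the library's API, and the library may build its labels differently.

```python
import pandas as pd

df_minima = pd.DataFrame({
    "split": ["train", "test", "train", "test"],
    "feature_a": ["a", "a", "su", "su"],
    "feature_b": ["yes", "yes", "no", "no"],
    "min_required": [150, 50, 150, 50],
})


def bucket_labels(minima: pd.DataFrame, size_col: str = "min_required") -> pd.Series:
    """Hypothetical helper: join every column except size_col into one label."""
    # All columns other than the size column define the bucket identity.
    key_cols = [c for c in minima.columns if c != size_col]
    # Cast to str so non-string key values can be joined as well.
    return minima[key_cols].astype(str).agg("|".join, axis=1)


labels = bucket_labels(df_minima)
print(labels.tolist())
```

Adding a column to df_minima (for example a second split dimension) would extend the label automatically, since the key columns are inferred rather than hard-coded.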

Assign buckets

df_out = assign_buckets(df, df_minima, verbose=True)
print(df_out.head())

Output (illustrative; bucket labels join the key columns with "|"):

   ID feature_a feature_b   context      bucket
0   1        a       yes       ...   train|a|yes
1   2        a        no       ...    test|a|no
2   3       su       yes       ...  train|su|yes
3   4       su        no       ...   test|su|no
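
One way to sanity-check a result is to count the rows per bucket and compare the counts with min_required. The sketch below assumes the output carries a bucket column with "|"-joined labels like the ones shown above. The df_out used here is a hand-made stand-in rather than real library output, and check_minima is a hypothetical helper, not a library function.

```python
import pandas as pd


def check_minima(df_out, df_minima, size_col="min_required"):
    """Hypothetical helper: report assigned rows per bucket against the minimum."""
    key_cols = [c for c in df_minima.columns if c != size_col]
    report = df_minima.copy()
    # Rebuild the label for each bucket (assumed "|"-joined key columns).
    report["bucket"] = report[key_cols].astype(str).agg("|".join, axis=1)
    # Count assigned rows per label; buckets that received no rows get 0.
    counts = df_out["bucket"].value_counts()
    report["assigned"] = report["bucket"].map(counts).fillna(0).astype(int)
    report["ok"] = report["assigned"] >= report[size_col]
    return report[["bucket", size_col, "assigned", "ok"]]


# Toy stand-ins, small enough to check by eye.
toy_minima = pd.DataFrame({
    "split": ["train", "test"],
    "feature_a": ["a", "a"],
    "min_required": [2, 1],
})
toy_out = pd.DataFrame({
    "ID": [1, 2, 3],
    "feature_a": ["a", "a", "a"],
    "bucket": ["train|a", "train|a", "test|a"],
})

report = check_minima(toy_out, toy_minima)
print(report)
```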

Multiple randomized balanced samples

To generate N different feasible assignments, use:

from sample_dataset import assign_buckets_multiple

df_samples = assign_buckets_multiple(df, df_minima, n_samples=3)
print(df_samples.head())

You’ll get one bucket column per sample:

bucket_0      bucket_1      bucket_2
train|a|yes   test|a|yes    test|a|yes
...
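
To get a feel for how much the samples differ from each other, the bucket_k columns can be compared pairwise. The sketch below assumes the wide layout shown above (one bucket_k column per sample). The df_samples here is a hand-made stand-in, and pairwise_disagreement is a hypothetical helper, not part of the library.

```python
import itertools

import pandas as pd

# Hand-made stand-in for the wide output: one column per sample.
df_samples = pd.DataFrame({
    "bucket_0": ["train|a|yes", "test|a|yes", "train|a|yes"],
    "bucket_1": ["test|a|yes", "train|a|yes", "train|a|yes"],
    "bucket_2": ["train|a|yes", "train|a|yes", "test|a|yes"],
})


def pairwise_disagreement(samples: pd.DataFrame) -> dict:
    """Hypothetical helper: share of rows assigned differently, per sample pair."""
    cols = [c for c in samples.columns if c.startswith("bucket_")]
    return {
        (a, b): float((samples[a] != samples[b]).mean())
        for a, b in itertools.combinations(cols, 2)
    }


disagreement = pairwise_disagreement(df_samples)
for pair, share in disagreement.items():
    print(pair, round(share, 2))
```

A share close to 0 for every pair would suggest the minima leave little freedom, so the samples are nearly identical.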

How it works

  • Interprets df_minima as the full set of buckets
  • Matches rows to buckets when all key columns (key_cols) match
  • Enforces the minimum size of every bucket
  • Forces each row to belong to exactly one bucket
  • Uses a randomized objective to obtain diverse feasible assignments
  • Solves the resulting model with OR-Tools’ CP-SAT engine

Requirements

  • Python ≥ 3.9
  • pandas
  • numpy
  • ortools

The three packages (pandas, numpy, ortools) are installed automatically as dependencies when you install sample-dataset with pip.

Links

  • Source code: https://github.com/LaboratorioSperimentale/sample-dataset
  • Issues: https://github.com/LaboratorioSperimentale/sample-dataset/issues
  • PyPI: https://pypi.org/project/sample-dataset/
