A library to create random balanced samples through constraint optimization

sample-dataset

sample-dataset is a small library for generating balanced, constraint-driven samples from tabular data, using:

  • pandas for data handling
  • Google OR-Tools CP-SAT for constraint optimization

It allows you to divide your dataset into buckets (e.g., train/test × attribute combinations) while respecting arbitrary minimum size requirements provided by the user.

This is useful for:

  • Train/test splits with structural constraints
  • Balanced dataset construction
  • Controlled sampling for linguistic, NLP, or behavioral datasets
  • Any application where "random sampling" must satisfy non-trivial rules

Features

  • Constraint-based sampling using OR-Tools
  • Flexible bucket definitions via a separate minima dataframe
  • Supports arbitrary bucket-defining columns (e.g., split, feature_a, feature_b, split1, split2, ...)
  • Automatically infers which dataset rows are eligible for which buckets
  • Guarantees minimum bucket sizes
  • Supports multiple randomized feasible solutions
  • Simple API (assign_buckets, assign_buckets_multiple)

Installation

pip install sample-dataset

Quick Start

Import the library

import pandas as pd
from sample_dataset import assign_buckets

Your dataset (df) might look like:

df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "feature_a": ["a", "a", "su", "su"],
    "feature_b": ["yes", "no", "yes", "no"],
    "context": ["...", "...", "...", "..."],
})

Define the bucket structure + minima

df_minima = pd.DataFrame({
    "split": ["train", "test", "train", "test"],
    "feature_a": ["a", "a", "su", "su"],
    "feature_b": ["yes", "yes", "no", "no"],
    "min_required": [150, 50, 150, 50],
})

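Each row represents a bucket. All columns except min_required define the bucket identity; min_required is the minimum number of dataset rows that must be assigned to that bucket. The four-row df above only illustrates the expected columns: minima of this size can be met only by a dataset with enough eligible rows.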
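To make the notion of bucket identity concrete, here is a minimal sketch that derives one label per bucket from df_minima using plain pandas. The "|" separator and the column order are assumptions taken from the example output in this README, not a documented guarantee of the library.

```python
import pandas as pd

df_minima = pd.DataFrame({
    "split": ["train", "test", "train", "test"],
    "feature_a": ["a", "a", "su", "su"],
    "feature_b": ["yes", "yes", "no", "no"],
    "min_required": [150, 50, 150, 50],
})

# Every column except min_required contributes to the bucket identity.
key_cols = [c for c in df_minima.columns if c != "min_required"]

# Join the key values with "|" (separator assumed from the example output).
labels = df_minima[key_cols].astype(str).agg("|".join, axis=1)

# Two rows with the same key values would describe the same bucket twice.
has_duplicates = not labels.is_unique

print(labels.tolist())
```

A check like has_duplicates can be worth running before solving, since duplicated rows in df_minima would make the minima ambiguous.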

Assign buckets

df_out = assign_buckets(df, df_minima, verbose=True)
print(df_out.head())

Output (illustrative):

   ID feature_a feature_b   context      bucket
0   1        a       yes       ...   train|a|yes
1   2        a        no       ...    test|a|no
2   3       su       yes       ...  train|su|yes
3   4       su        no       ...   test|su|no
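Once an assignment is available, it can be useful to confirm that every bucket reached its minimum. The sketch below does this with plain pandas on a small hypothetical result; the df_out and df_minima values are invented for illustration, and the "|"-joined label format is assumed from the output shown above.

```python
import pandas as pd

# Hypothetical assignment result: one bucket label per row.
df_out = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "bucket": ["train|a|yes", "train|a|yes", "test|a|yes",
               "train|su|no", "train|su|no", "test|su|no"],
})

# Hypothetical minima, small enough for six rows.
df_minima = pd.DataFrame({
    "split": ["train", "test", "train", "test"],
    "feature_a": ["a", "a", "su", "su"],
    "feature_b": ["yes", "yes", "no", "no"],
    "min_required": [2, 1, 2, 1],
})

# Rebuild the bucket labels from the bucket-defining columns.
key_cols = [c for c in df_minima.columns if c != "min_required"]
report = df_minima.assign(
    bucket=df_minima[key_cols].astype(str).agg("|".join, axis=1)
)

# Count assigned rows per bucket; buckets that received no rows count as 0.
counts = df_out["bucket"].value_counts()
report["assigned"] = report["bucket"].map(counts).fillna(0).astype(int)
report["ok"] = report["assigned"] >= report["min_required"]

print(report[["bucket", "min_required", "assigned", "ok"]])
```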

Multiple randomized balanced samples

To generate N different feasible assignments, use:

from sample_dataset import assign_buckets_multiple_wide

df_samples = assign_buckets_multiple_wide(df, df_minima, n_samples=3)
print(df_samples.head())

You’ll get:

bucket_0      bucket_1      bucket_2
train|a|yes   test|a|yes    test|a|yes
...
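A quick way to see how much the samples differ is to count, per row, how many distinct buckets it received. The sketch below assumes the wide layout shown above (columns named bucket_0, bucket_1, ...) and uses a hypothetical three-row frame rather than real library output.

```python
import pandas as pd

# Hypothetical wide-format result: one column per sample.
df_samples = pd.DataFrame({
    "bucket_0": ["train|a|yes", "test|a|yes", "train|su|no"],
    "bucket_1": ["test|a|yes", "test|a|yes", "train|su|no"],
    "bucket_2": ["test|a|yes", "train|a|yes", "train|su|no"],
})

bucket_cols = [c for c in df_samples.columns if c.startswith("bucket_")]

# Number of distinct buckets each row received across the samples.
n_distinct = df_samples[bucket_cols].nunique(axis=1)

# Share of rows whose assignment differs between at least two samples.
share_varying = (n_distinct > 1).mean()

print(n_distinct.tolist(), share_varying)
```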

How it works

  • Interprets df_minima as the full set of buckets
  • Matches rows to buckets when all key_cols match
  • Enforces minimum bucket sizes
  • Forces each row to belong to exactly one bucket
  • Uses a randomized objective to obtain diverse feasible assignments
  • Solves the resulting model with OR-Tools’ CP-SAT engine

Requirements

  • Python ≥ 3.9
  • pandas
  • numpy
  • ortools

The three packages are installed automatically as dependencies by pip.

Links

  • Source code: [https://github.com/LaboratorioSperimentale/sample-dataset]
  • Issues: [https://github.com/LaboratorioSperimentale/sample-dataset/issues]
  • PyPI: [https://pypi.org/project/sample-dataset/]
