A Python library for consistent preprocessing of tabular data with automatic type inference, caching, and stratified splitting

Table Toolkit (tabkit)

A Python library for consistent preprocessing of tabular data. It handles column type inference, missing-value imputation, feature binning, stratified splitting and sampling, and more, all driven by configuration. I made this toolkit because I needed a way to preprocess and cache datasets reliably and reproducibly.

Installation

pip install git+https://github.com/inwonakng/tabkit.git@main

This package has been tested only with Python 3.10 and above.

Quick Start

from tabkit import TableProcessor

# Define your dataset and processing configs as plain dicts
dataset_config = {
    "dataset_name": "my_dataset",
    "data_source": "disk",
    "file_path": "path/to/your/data.csv",
    "file_type": "csv",
    "label_col": "target"
}

processor_config = {
    "task_kind": "classification",  # or "regression"
    "n_splits": 5,
    "random_state": 42
}

# Create processor
processor = TableProcessor(
    dataset_config=dataset_config,
    config=processor_config
)

# Prepare data (this caches results for future runs)
processor.prepare()

# Get splits
X_train, y_train = processor.get_split("train")
X_val, y_val = processor.get_split("val")
X_test, y_test = processor.get_split("test")

# Or get the raw dataframe
df = processor.get("raw_df")

For a complete example, see examples/basic_usage.py.

Features

Automatic type inference: Detects categorical, continuous, binary, and datetime columns
Flexible preprocessing pipelines: Chain transforms like imputation, encoding, scaling, discretization
Smart caching: Preprocessed data is cached under a hash of the config, so repeated runs with the same config (for example, workers in distributed training) reuse the cached result
Stratified splitting: Automatically handles stratified train/val/test splits
Reproducible: The same config always produces the same results
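To make the config-hash idea concrete, here is a minimal sketch of how a cache key could be derived from two config dicts. This is not tabkit's actual implementation: the function name config_hash and the choice of JSON plus SHA-256 are assumptions made for illustration.

```python
import hashlib
import json


def config_hash(dataset_config: dict, processor_config: dict) -> str:
    """Return a stable hex digest for a pair of config dicts (illustrative only)."""
    # sort_keys makes the serialization independent of dict insertion order,
    # so two configs with the same content map to the same key.
    payload = json.dumps(
        {"dataset": dataset_config, "processor": processor_config},
        sort_keys=True,
        default=str,  # fall back to str() for values JSON cannot encode
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same content, different key order: identical digest.
a = config_hash({"dataset_name": "my_dataset", "label_col": "target"},
                {"n_splits": 5, "random_state": 42})
b = config_hash({"label_col": "target", "dataset_name": "my_dataset"},
                {"random_state": 42, "n_splits": 5})
# One changed value: different digest, hence a different cache entry.
c = config_hash({"dataset_name": "my_dataset", "label_col": "target"},
                {"n_splits": 5, "random_state": 0})
```

Under this scheme, any change to a config value yields a new key, which is one way the "same config, same results" property could be enforced.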

Configuration Options

All configuration is done via plain Python dictionaries. See defaults in src/tabkit/data/table_processor.py.

Dataset Config

{
    "dataset_name": str,       # Name for your dataset
    "data_source": str,        # "disk", "openml", "uci"
    "file_path": str,          # Path to your data file (for disk source)
    "file_type": str,          # "csv" or "parquet" (for disk source)
    "label_col": str,          # Name of the target column
    # ... see DEFAULT_DATASET_CONFIG for all options
}

Processor Config

{
    "task_kind": str,          # "classification" or "regression"
    "random_state": int,       # Random seed (default: 0)
    "pipeline": list,          # Preprocessing pipeline (uses defaults if None)
    "exclude_columns": list,   # Column names to exclude
    "exclude_labels": list,    # Label values to exclude

    # Splitting configuration - see next section for details
    "test_ratio": float,       # e.g., 0.2 for 20% test (ratio mode)
    "val_ratio": float,        # e.g., 0.1 for 10% validation (ratio mode)
    "n_splits": int,           # Number of folds (K-fold mode, default: 10)
    "split_idx": int,          # Which fold for test (K-fold mode)
    # ... see DEFAULT_TABLE_PROCESSOR_CONFIG for all options
}

Data Splitting Modes

Tabkit supports two distinct approaches for splitting your data into train/validation/test sets. Choose based on your use case:

Mode 1: Ratio-Based Splitting (Quick & Simple)

When to use:

You want a simple percentage-based split (e.g., 70/15/15)
You're doing quick prototyping or one-off experiments
You don't need full dataset coverage

How it works:

Performs a single random stratified split based on specified ratios
Fast and intuitive
Different random seeds give different splits, but nothing guarantees that every sample eventually lands in the test set

Example:

config = {
    "test_ratio": 0.2,    # 20% test
    "val_ratio": 0.1,     # 10% validation
    "random_state": 42    # seed for the split; the remaining 70% is training
}
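As a rough illustration of what a single stratified ratio split does, the sketch below splits row indices per class using only the standard library. It is not tabkit's code; the function stratified_ratio_split and its rounding behavior are assumptions for this example.

```python
import random
from collections import defaultdict


def stratified_ratio_split(labels, test_ratio, val_ratio, random_state=0):
    """Split row indices into train/val/test, keeping class proportions roughly equal."""
    rng = random.Random(random_state)
    # Group row indices by their label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Take the requested share of each class for test and validation.
        n_test = round(len(members) * test_ratio)
        n_val = round(len(members) * val_ratio)
        test.extend(members[:n_test])
        val.extend(members[n_test:n_test + n_val])
        train.extend(members[n_test + n_val:])
    return train, val, test


# 70 rows of class "a" and 30 rows of class "b".
labels = ["a"] * 70 + ["b"] * 30
train, val, test = stratified_ratio_split(
    labels, test_ratio=0.2, val_ratio=0.1, random_state=42
)
```

With these inputs the three parts hold 70, 10, and 20 rows, and each part keeps the 70/30 class balance. Changing random_state changes which rows land where, but not the sizes.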

Mode 2: K-Fold Based Splitting (Robust & Reproducible)

When to use:

You need robust cross-validation
You want to ensure every sample appears in the test set across multiple runs
You're benchmarking models or doing comprehensive evaluation

How it works:

Uses K-fold cross-validation for systematic data splitting
By varying split_idx from 0 to n_splits-1, every sample appears in the test set exactly once
Provides systematic coverage of your entire dataset
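Default: 10 splits give a 10% test set; the remaining 90% is then divided into 9 sub-splits, one of which becomes validation (about 11% of the remainder, or roughly 10% of the full dataset), leaving roughly 80% of the full dataset for training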
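The split sizes can be worked out with exact fractions. The helper below is only an arithmetic sketch and assumes that one outer fold is the test set and one of the n_splits - 1 sub-splits of the remainder is the validation set; kfold_fractions is a hypothetical name, not part of tabkit.

```python
from fractions import Fraction


def kfold_fractions(n_splits: int):
    """Return (test, val_of_remainder, val, train) as exact fractions.

    test, val, and train are measured against the full dataset;
    val_of_remainder is measured against the non-test portion only.
    """
    test = Fraction(1, n_splits)
    remainder = 1 - test
    val_of_remainder = Fraction(1, n_splits - 1)
    val = remainder * val_of_remainder
    train = remainder - val
    return test, val_of_remainder, val, train


test_frac, val_of_rest, val_frac, train_frac = kfold_fractions(10)
```

For n_splits = 10 this gives 1/10 test, 1/9 of the remainder (about 11%) for validation, which is 1/10 of the full dataset, and 4/5 for training.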

Example:

# Run 1: Use fold 0 as test
config = {"n_splits": 5, "split_idx": 0}  # 20% test, rest train+val

# Run 2: Use fold 1 as test
config = {"n_splits": 5, "split_idx": 1}  # Different 20% test

# ... Run 3-5 to cover all data in test set
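The coverage property can be checked with a small standard-library sketch. It assigns test folds without stratification, which is a simplification, and kfold_test_indices is a hypothetical helper rather than tabkit's API.

```python
import random


def kfold_test_indices(n_samples, n_splits, split_idx, random_state=0):
    """Return the sorted test indices for one fold of a shuffled K-fold split."""
    rng = random.Random(random_state)
    order = list(range(n_samples))
    rng.shuffle(order)
    # Fold k takes every n_splits-th element of the shuffled order, starting at k.
    # The same random_state must be used for every split_idx so folds do not overlap.
    return sorted(order[split_idx::n_splits])


n_samples, n_splits = 23, 5
seen = []
for split_idx in range(n_splits):
    seen.extend(kfold_test_indices(n_samples, n_splits, split_idx, random_state=42))
```

After looping over all values of split_idx, seen contains every index from 0 to 22 exactly once, which is the systematic coverage that ratio mode does not provide.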

Which Mode is Used?

Priority: If both test_ratio and val_ratio are set, ratio-based splitting is used. Otherwise, K-fold splitting is used.

# This uses RATIO mode
config = {"test_ratio": 0.2, "val_ratio": 0.1}

# This uses K-FOLD mode
config = {"n_splits": 10, "split_idx": 0}

# This also uses K-FOLD mode (ratios are None by default)
config = {}  # Uses all defaults
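The priority rule above can be written as a small function. This is a sketch of the stated rule only, with a hypothetical name (splitting_mode), and it assumes that an unset ratio is represented as None or a missing key.

```python
def splitting_mode(config: dict) -> str:
    """Return "ratio" only when both ratios are set, otherwise "kfold"."""
    test_ratio = config.get("test_ratio")
    val_ratio = config.get("val_ratio")
    if test_ratio is not None and val_ratio is not None:
        return "ratio"
    return "kfold"


mode_ratio = splitting_mode({"test_ratio": 0.2, "val_ratio": 0.1})
mode_kfold = splitting_mode({"n_splits": 10, "split_idx": 0})
mode_default = splitting_mode({})
# Only one ratio set: by the rule as stated, this falls back to K-fold.
mode_partial = splitting_mode({"test_ratio": 0.2})
```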


Release history

2025.11.9 (Nov 9, 2025)
2025.10.22.post1 (Oct 23, 2025)
2025.10.22 (Oct 23, 2025)
2025.10.15.post2 (Oct 15, 2025)
2025.10.15.post1 (Oct 15, 2025)
2025.10.15 (Oct 15, 2025)
2025.10.15a0, pre-release (Oct 15, 2025)
0.1.2 (Oct 15, 2025)
0.1.1, this version (Oct 14, 2025)

Download files

Source Distribution

table_toolkit-0.1.1.tar.gz (47.6 kB)

Uploaded Oct 14, 2025 Source

Built Distribution

table_toolkit-0.1.1-py3-none-any.whl (27.1 kB)

Uploaded Oct 14, 2025 Python 3

File details

Details for the file table_toolkit-0.1.1.tar.gz.

File metadata

Download URL: table_toolkit-0.1.1.tar.gz
Upload date: Oct 14, 2025
Size: 47.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.18

File hashes

Hashes for table_toolkit-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`127b1fb13fcb98d8aaee45841b4bc19a9542504edea777ec2adc3ca1137d306a`
MD5	`bf89cca2e4c93e9510c2f3bc762d5210`
BLAKE2b-256	`e783419379bcf2346f065c189b62e0f98cd26ddc6b5d79cc86ecae215249147a`

File details

Details for the file table_toolkit-0.1.1-py3-none-any.whl.

File metadata

Download URL: table_toolkit-0.1.1-py3-none-any.whl
Upload date: Oct 14, 2025
Size: 27.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.18

File hashes

Hashes for table_toolkit-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3e6286215e4ea13f9b5437b14a2f0ad01447c595e6faa387c66a9a17447d0c72`
MD5	`2fe8e332f9b2fd37da6181543f15572a`
BLAKE2b-256	`a151d7b1b8de4bee7fe4fcbf0a2402a9d20601c57edecba1decd154206f2282b`

table-toolkit 0.1.1
