Table Toolkit (tabkit)

A Python library for consistent preprocessing of tabular data. It handles column type inference, missing value imputation, feature binning, stratified splitting/sampling, and more, all in a configuration-driven manner. I made this toolkit because I needed a way to preprocess and cache datasets reliably and reproducibly.

Installation

Stable release via PyPI:

pip install table-toolkit

Or install the latest development version directly from GitHub:

pip install git+https://github.com/inwonakng/tabkit.git@main

This package has been tested only with Python 3.10 and above.

Quick Start

from tabkit import TableProcessor, DatasetConfig, TableProcessorConfig

# Define your dataset and processing configs
dataset_config = DatasetConfig(
    dataset_name="my_dataset",
    data_source="disk",
    file_path="path/to/your/data.csv",
    file_type="csv",
    label_col="target"
)

processor_config = TableProcessorConfig(
    task_kind="classification",  # or "regression"
    n_splits=5,
    random_state=42
)

# Create processor
processor = TableProcessor(
    dataset_config=dataset_config,
    config=processor_config
)

# Prepare data (this caches results for future runs)
processor.prepare()

# Get splits
X_train, y_train = processor.get_split("train")
X_val, y_val = processor.get_split("val")
X_test, y_test = processor.get_split("test")

# Or get the raw dataframe
df = processor.get("raw_df")

Note: You can also use plain dictionaries instead of config classes - both work identically! See Configuration Options below.

For more examples, see examples/basic_usage.py.

Features

  • Automatic type inference: Detects categorical, continuous, binary, and datetime columns
  • Flexible preprocessing pipelines: Chain transforms like imputation, encoding, scaling, discretization
  • Smart caching: Preprocessed data is cached under a hash of the config - perfect for distributed training (see the sketch after this list)
  • Stratified splitting: Automatically handles stratified train/val/test splits
  • Reproducible: Same config always produces same results
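
The caching bullet deserves a word on mechanics. Tabkit's actual cache layout isn't documented here, so the following is only a minimal sketch of the general idea - derive a stable key by hashing a canonical form of the config - with made-up names (CACHE_DIR, load_or_preprocess are hypothetical, not tabkit's API):

import hashlib
import json
from pathlib import Path

import pandas as pd

CACHE_DIR = Path(".cache")  # hypothetical cache location, not tabkit's actual path

def config_hash(config: dict) -> str:
    # Stable key: hash the canonical (sorted-keys) JSON form of the config.
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

def load_or_preprocess(config: dict, preprocess) -> pd.DataFrame:
    # Reuse the cached result if this exact config was processed before.
    path = CACHE_DIR / f"{config_hash(config)}.parquet"
    if path.exists():
        return pd.read_parquet(path)
    df = preprocess(config)  # the expensive preprocessing runs only once
    CACHE_DIR.mkdir(exist_ok=True)
    df.to_parquet(path)
    return df

Because the key depends only on the config, any number of distributed workers given the same config resolve to the same cache entry.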

Configuration Options

Tabkit provides type-safe configuration classes with IDE autocomplete and inline documentation. You can also use plain dictionaries if you prefer - both approaches work identically.

Using Config Classes (Recommended)

from tabkit import DatasetConfig, TableProcessorConfig

# Dataset configuration with type hints and autocomplete
dataset_config = DatasetConfig(
    dataset_name="my_dataset",
    data_source="disk",      # "disk", "openml", "uci", "automm"
    file_path="data.csv",
    file_type="csv",         # "csv" or "parquet"
    label_col="target"
)

# Processor configuration
processor_config = TableProcessorConfig(
    task_kind="classification",  # or "regression"
    random_state=42,
    pipeline=[...],              # Custom pipeline (optional)
    exclude_columns=["id"],      # Columns to exclude (optional)

    # Splitting configuration - see next section
    test_ratio=0.2,              # For ratio-based splitting
    val_ratio=0.1,               # For ratio-based splitting
    # OR
    n_splits=10,                 # For K-fold splitting
    split_idx=0                  # For K-fold splitting
)

For detailed documentation on all available options, see the docstrings in DatasetConfig and TableProcessorConfig, or check the config source.
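
Since the config classes carry their documentation in docstrings, a quick way to browse all options is from an interactive session:

from tabkit import DatasetConfig, TableProcessorConfig

# Print the docstring and field documentation of each config class
help(DatasetConfig)
help(TableProcessorConfig)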

Using Plain Dictionaries (Also Supported)

# Same functionality, dictionary-based
dataset_config = {
    "dataset_name": "my_dataset",
    "data_source": "disk",
    "file_path": "data.csv",
    "file_type": "csv",
    "label_col": "target"
}

processor_config = {
    "task_kind": "classification",
    "test_ratio": 0.2,
    "val_ratio": 0.1,
    "random_state": 42
}
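
These dictionaries drop into TableProcessor exactly where the config classes would go:

from tabkit import TableProcessor

processor = TableProcessor(
    dataset_config=dataset_config,  # the dicts defined above
    config=processor_config,
)
processor.prepare()
X_train, y_train = processor.get_split("train")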

Data Splitting Modes

Tabkit supports two distinct approaches for splitting your data into train/validation/test sets. Choose based on your use case:

Mode 1: Ratio-Based Splitting (Quick & Simple)

When to use:

  • You want a simple percentage-based split (e.g., 70/15/15)
  • You're doing quick prototyping or one-off experiments
  • You don't need full dataset coverage

How it works:

  • Performs a single random stratified split based on specified ratios
  • Fast and intuitive
  • Different random seeds give different splits, but no systematic coverage

Example:

from tabkit import TableProcessorConfig

config = TableProcessorConfig(
    test_ratio=0.2,       # 20% test
    val_ratio=0.1,        # 10% validation
    random_state=42
)
# The remaining 70% goes to training.

Mode 2: K-Fold Based Splitting (Robust & Reproducible)

When to use:

  • You need robust cross-validation
  • You want to ensure every sample appears in the test set across multiple runs
  • You're benchmarking models or doing comprehensive evaluation

How it works:

  • Uses K-fold cross-validation for systematic data splitting
  • By varying split_idx from 0 to n_splits-1, every sample appears in the test set exactly once
  • Provides systematic coverage of your entire dataset
  • Default: 10 splits = 10% test, then 9 sub-splits on the training portion = 10% val, 80% train overall

Example:

from tabkit import TableProcessorConfig

# Run 1: Use fold 0 as test
config = TableProcessorConfig(n_splits=5, split_idx=0)  # 20% test, rest train+val

# Run 2: Use fold 1 as test
config = TableProcessorConfig(n_splits=5, split_idx=1)  # Different 20% test

# ... Runs 3-5 to cover all data in the test set
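
Putting that together, a full cross-validation sweep just loops split_idx over every fold. A minimal sketch using the Quick Start API (dataset_config is assumed to be defined as in Quick Start):

from tabkit import TableProcessor, TableProcessorConfig

n_splits = 5
for split_idx in range(n_splits):
    config = TableProcessorConfig(n_splits=n_splits, split_idx=split_idx)
    processor = TableProcessor(dataset_config=dataset_config, config=config)
    processor.prepare()
    X_test, y_test = processor.get_split("test")
    # ... train and evaluate here; across the 5 runs, every sample
    # appears in the test set exactly once.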

Which Mode is Used?

Priority: If both test_ratio and val_ratio are set, ratio-based splitting is used. Otherwise, K-fold splitting is used.

# This uses RATIO mode
config = {"test_ratio": 0.2, "val_ratio": 0.1}

# This uses K-FOLD mode
config = {"n_splits": 10, "split_idx": 0}

# This also uses K-FOLD mode (ratios are None by default)
config = {}  # Uses all defaults
