Framework for synthetic tabular data generation, evaluation, and artifact-based pipelines.

Katabatic

A comprehensive framework for synthetic tabular data generation using state-of-the-art machine learning models including GANBLR and GReaT (Generation of Realistic Tabular data).

🚀 Features

Multiple Generative Models: Support for GANBLR (GAN-based Bayesian Learning Rules) and GReaT (transformer-based generation)
Automated Pipeline: End-to-end training, generation, and evaluation workflows
TSTR Evaluation: Train on Synthetic, Test on Real data evaluation methodology
Data Preprocessing: Automated tabular preprocessing (discretization and encoding)
Cross-Validation Support: Robust model validation capabilities
Extensible Architecture: Easy to add new models and evaluation metrics

🔧 Prerequisites

System Requirements

Operating System: macOS, Linux, or Windows
Python: 3.11.x (strictly required due to TensorFlow compatibility)
Memory: Minimum 8GB RAM (16GB+ recommended for large datasets)
GPU: NVIDIA GPU with CUDA support (optional but recommended for GReaT model)
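
The Python requirement above can be checked from a script before anything is installed. This is a minimal sketch; the helper name `is_supported_python` is illustrative and not part of Katabatic.

```python
import sys


def is_supported_python(version_info=None) -> bool:
    """Return True when the interpreter is a Python 3.11.x release."""
    if version_info is None:
        version_info = sys.version_info
    major, minor = version_info[0], version_info[1]
    return (major, minor) == (3, 11)


if __name__ == "__main__":
    status = "supported" if is_supported_python() else "unsupported"
    print(f"Python {sys.version.split()[0]} is {status} for Katabatic")
```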

Required Tools

1. Python Version Management with pyenv

macOS (via Homebrew):

# Install Homebrew if not already installed
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install pyenv
brew install pyenv

# Add to shell profile
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.zshrc
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.zshrc
echo 'eval "$(pyenv init -)"' >> ~/.zshrc

# Restart shell or source profile
source ~/.zshrc

Linux (Ubuntu/Debian):

# Install dependencies
sudo apt update
sudo apt install -y make build-essential libssl-dev zlib1g-dev \
libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev \
libncursesw5-dev xz-utils tk-dev libffi-dev liblzma-dev python3-openssl git

# Install pyenv
curl https://pyenv.run | bash

# Add to shell profile
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc

# Restart shell
exec "$SHELL"

2. Install Python 3.11

# Install Python 3.11 using pyenv
pyenv install 3.11.9
pyenv global 3.11.9

# Verify installation
python --version  # Should output: Python 3.11.9

3. Package Management with Poetry

# Install Poetry
curl -sSL https://install.python-poetry.org | python3 -

# Add Poetry to PATH (add to your shell profile)
export PATH="$HOME/.local/bin:$PATH"

# Verify installation
poetry --version

📦 Installation

1. Clone the Repository

git clone https://github.com/datascience-works/Katabatic.git
cd Katabatic

2. Set Python Version

# Set local Python version for this project
pyenv local 3.11.9

3. Install Dependencies

Install matrix (PyPI / Poetry extras):

Use case	Command
Core only	`pip install katabatic` or `poetry install`
GANBLR (supported)	`pip install "katabatic[ganblr]"` or `poetry install -E ganblr`
GReaT (supported)	`pip install "katabatic[great]"` or `poetry install -E great`
TSTR + XGBoost	`pip install "katabatic[eval]"` or `poetry install -E eval`
Development	`poetry install --with dev`
All optional deps	`pip install "katabatic[all]"`

Experimental models (tabsyn, tabddpm, pategan, ctgan, etc.) are documented in docs/EXPERIMENTAL_MODELS.md.

# Minimal install (core + dev tools for contributors)
poetry install --with dev

# Supported models for local work
poetry install --with dev -E ganblr -E great -E eval

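# Activate the virtual environment. Poetry 2.x no longer ships `poetry shell`;
# there, use `eval $(poetry env activate)` or install the poetry-plugin-shell plugin.
poetry shell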

4. GPU Support (Optional)

If you have an NVIDIA GPU and want to use it for GReaT model training:

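# Install CUDA 11.8 builds of PyTorch into the Poetry environment
# (`poetry add` has no --index-url option, so install through pip)
poetry run pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118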
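
Once a CUDA build of PyTorch is installed, a quick check shows whether the GPU is actually visible. This is a small sketch that only relies on `torch.cuda.is_available()`; the helper name `describe_device` is illustrative, and the import is guarded so the script also runs where PyTorch is absent.

```python
def describe_device() -> str:
    """Report which device PyTorch would use, without failing if torch is absent."""
    try:
        import torch  # optional dependency, installed for GReaT training
    except ImportError:
        return "torch not installed"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(describe_device())
```

If this prints `cpu` on a machine with an NVIDIA GPU, the installed wheel is probably a CPU-only build or the driver does not match the CUDA version of the wheel.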

5. Verify Installation

# Core import
python -c "import katabatic; print(katabatic.__version__)"

# After installing extras, e.g. poetry install -E ganblr -E great
python -c "from katabatic.models.registry import ModelRegistry; print(ModelRegistry.get_supported_models())"
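
When an import fails after installing extras, it helps to see which optional dependency is missing. The sketch below uses only the standard library. The mapping from extras to module names is an assumption based on the dependencies this README mentions (TensorFlow, PyTorch, XGBoost); the authoritative list is in pyproject.toml.

```python
from importlib.util import find_spec

# Assumed mapping from extras to importable modules; adjust to the real pyproject.toml.
ASSUMED_EXTRA_MODULES = {
    "ganblr": ["tensorflow"],
    "great": ["torch", "transformers"],
    "eval": ["xgboost"],
}


def missing_modules(module_names):
    """Return the top-level modules from module_names that cannot be found."""
    return [name for name in module_names if find_spec(name) is None]


for extra, modules in ASSUMED_EXTRA_MODULES.items():
    missing = missing_modules(modules)
    print(f"{extra}: {'ok' if not missing else 'missing ' + ', '.join(missing)}")
```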

🚀 Quick Start

Artifact pipeline (recommended)

Versioned datasets, models, and evaluations under artifacts/. See GANBLR_FLOW.md for details.

from katabatic.artifacts import LocalArtifactStore
from katabatic.models.ganblr.models import GANBLR
from katabatic.pipeline.train_test_split.pipeline import TrainTestSplitPipeline
from katabatic.utils.preprocess import preprocess_tabular

preprocess_tabular("raw_data/car.csv", "preprocessed_data/car.csv")

store = LocalArtifactStore("artifacts")
pipeline = TrainTestSplitPipeline(model=GANBLR())
results = pipeline.run(
    input_csv="preprocessed_data/car.csv",
    dataset_name="car",
    artifact_store=store,
    model_name="ganblr",
)
# results["model_ref"], results["evaluation_refs"] — TSTR metrics on disk

CLI:

katabatic register-dataset car preprocessed_data/car.csv --check-model ganblr

Legacy directory layout

from katabatic.models.ganblr.models import GANBLR
from katabatic.pipeline.train_test_split.pipeline import TrainTestSplitPipeline
from katabatic.utils.preprocess import preprocess_tabular

preprocess_tabular("raw_data/car.csv", "preprocessed_data/car.csv")
pipeline = TrainTestSplitPipeline(model=GANBLR())
pipeline.run(input_csv="preprocessed_data/car.csv", output_dir="sample_data/car")

Pipelines call Model.train(); GANBLR also exposes fit(x, y) for direct training.
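
The relationship between a pipeline-facing `train()` and a direct `fit(x, y)` can be pictured with a toy model. Everything in this sketch is hypothetical: `SketchModel`, `BootstrapModel`, and the argument lists are illustrations only and do not reflect Katabatic's actual base class or signatures.

```python
import random
from abc import ABC, abstractmethod


class SketchModel(ABC):
    """Hypothetical interface: a pipeline only needs train() and sample()."""

    @abstractmethod
    def train(self, rows, labels): ...

    @abstractmethod
    def sample(self, size): ...


class BootstrapModel(SketchModel):
    """Toy generator that resamples training rows; stands in for a real model."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._data = []

    def fit(self, x, y):
        # Direct training entry point, analogous to calling fit(x, y) yourself.
        self._data = [tuple(row) + (label,) for row, label in zip(x, y)]
        return self

    def train(self, rows, labels):
        # Pipeline-facing entry point; here it simply delegates to fit().
        return self.fit(rows, labels)

    def sample(self, size):
        return [self._rng.choice(self._data) for _ in range(size)]


model = BootstrapModel()
model.train([(0, 1), (1, 0)], ["a", "b"])
print(model.sample(3))
```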

Jupyter Notebook

For interactive development, launch Jupyter:

# Start Jupyter Lab
poetry run jupyter lab

# Or Jupyter Notebook
poetry run jupyter notebook

See example.ipynb for a complete walkthrough.

📖 Usage

Data Preprocessing

Katabatic requires discrete/categorical data. Use the built-in preprocessing utilities:

from katabatic.utils.preprocess import preprocess_tabular

# Discretize numerical features and encode categorical ones
preprocess_tabular(
    file_path="raw_data/your_dataset.csv",
    output_path="preprocessed_data/your_dataset.csv",
    bins=10,  # Number of bins for numerical discretization
    strategy='uniform'  # 'uniform', 'quantile', or 'kmeans'
)
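
To make the two preprocessing steps concrete, here is a standard-library sketch of equal-width ("uniform") binning and ordinal encoding. It illustrates the idea only; `uniform_bin` and `ordinal_encode` are hypothetical helpers, and `preprocess_tabular` may handle edge cases differently.

```python
def uniform_bin(values, bins=10):
    """Map numeric values to integer bin ids using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # constant column: a single bin
    width = (hi - lo) / bins
    # The maximum value would land in bin `bins`, so clamp it into the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


def ordinal_encode(values):
    """Replace each category with its index in the sorted set of categories."""
    mapping = {cat: idx for idx, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


ages = [18, 30, 42, 66]
print(uniform_bin(ages, bins=4))
print(ordinal_encode(["low", "high", "low", "med"]))
```

Equal-width bins are sensitive to outliers because a single extreme value stretches the range; that is the usual reason to prefer the 'quantile' strategy on skewed columns.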

Training Models

GANBLR Model

from katabatic.models.ganblr.models import GANBLR
import pandas as pd

# Load your data
X = pd.read_csv("path/to/features.csv")
y = pd.read_csv("path/to/labels.csv").values.ravel()

# Initialize and train model
model = GANBLR()
model.fit(X, y, k=2, epochs=100, batch_size=64)

# Generate synthetic data
synthetic_data = model.sample(size=1000)

GReaT Model

from katabatic.models.great.models import GReaT
import pandas as pd

# Load your data
data = pd.read_csv("path/to/your_data.csv")

# Initialize and train model
model = GReaT(
    llm='gpt2',  # Hugging Face model id; 'microsoft/DialoGPT-medium' also works
    epochs=100,
    batch_size=8
)

trainer = model.fit(data)

# Generate synthetic data
synthetic_data = model.sample(
    n_samples=1000,
    temperature=0.7
)

Pipeline Usage

Katabatic provides automated pipelines for complete workflows:

from katabatic.pipeline.train_test_split.pipeline import TrainTestSplitPipeline
from katabatic.models.ganblr.models import GANBLR

# Create pipeline with GANBLR
pipeline = TrainTestSplitPipeline(model=GANBLR)

# Run complete workflow: split preprocessed CSV -> train model -> TSTR evaluation.
# Legacy mode: ``real_test_dir`` defaults to ``output_dir`` (where split_dataset
# writes ``x_test.csv`` / ``y_test.csv``). ``synthetic_dir`` defaults to
# ``synthetic/<basename(output_dir)>/<model_slug>/`` if omitted.
results = pipeline.run(
    input_csv='path/to/preprocessed_data.csv',
    output_dir='output/directory',
)
# Optional overrides:
#   synthetic_dir='...', real_test_dir='...'
# ``results`` is a dict with the keys ``message``, ``output_dir``, ``synthetic_dir``,
# ``real_test_dir``, and ``tstr_results``. The fitted model instance is available
# separately as ``pipeline.last_model``.
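
The default location for synthetic data described in the comments above can be reproduced with a few lines of path handling. This is a sketch of the stated rule, not the pipeline's own code; `default_synthetic_dir` is a hypothetical helper and the slug value `ganblr` is an assumed example.

```python
from pathlib import PurePosixPath


def default_synthetic_dir(output_dir: str, model_slug: str) -> str:
    """Build synthetic/<basename(output_dir)>/<model_slug> as described above."""
    dataset = PurePosixPath(output_dir).name  # trailing slashes are ignored
    return str(PurePosixPath("synthetic") / dataset / model_slug)


print(default_synthetic_dir("sample_data/car", "ganblr"))  # synthetic/car/ganblr
```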

🤖 Models

GANBLR (GAN-based Bayesian Learning Rules)

Type: GAN-based generative model
Best for: Discrete/categorical tabular data
Features:
- k-dependence Bayesian Networks
- Adversarial training
- High-quality discrete data generation
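
In a k-dependence Bayesian network, every feature depends on the class and on at most k other features. The sketch below shows a simplified way to derive such a structure: rank features by mutual information with the class, then let each feature take up to k of the features ranked just before it as parents. This is an illustration, not GANBLR's implementation; the classic KDB algorithm chooses parents by conditional mutual information, and `kdb_parents` is a hypothetical helper.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


def kdb_parents(columns, labels, k=2):
    """Rank features by MI with the class, then give each up to k earlier parents.

    Simplification: real KDB picks parents by conditional mutual information.
    The class label is an implicit parent of every feature.
    """
    order = sorted(columns, key=lambda name: -mutual_information(columns[name], labels))
    return {name: order[max(0, i - k):i] for i, name in enumerate(order)}


features = {"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]}
print(kdb_parents(features, labels=[0, 0, 1, 1], k=1))
```

With k=0 no feature has a feature parent, which corresponds to naive Bayes; larger k captures more feature interactions at the cost of more parameters.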

GReaT (Generation of Realistic Tabular Data)

Type: Transformer-based generative model
Best for: Mixed data types (numerical + categorical)
Features:
- Pre-trained language model fine-tuning
- Conditional generation
- Data imputation capabilities
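
Fine-tuning a language model on a table requires turning each row into text. The GReaT paper serializes a row as "column is value" clauses in a random feature order, and the sketch below illustrates that encoding and its inverse. It is not Katabatic's code: `row_to_text` and `text_to_row` are hypothetical helpers, and the simple parser assumes values contain no ", " or " is ".

```python
import random


def row_to_text(row, rng=None):
    """Serialize a row as 'column is value' clauses, optionally shuffled."""
    items = list(row.items())
    if rng is not None:
        rng.shuffle(items)  # random feature order, as described in the GReaT paper
    return ", ".join(f"{col} is {val}" for col, val in items)


def text_to_row(text):
    """Parse the serialized form back into a dict of strings."""
    row = {}
    for clause in text.split(", "):
        col, _, val = clause.partition(" is ")
        row[col] = val
    return row


record = {"age": 39, "education": "Bachelors", "income": "<=50K"}
sentence = row_to_text(record, rng=random.Random(0))
print(sentence)
print(text_to_row(sentence))
```

Because the feature order is shuffled during training, the model can be prompted with any subset of known columns, which is what makes conditional generation and imputation possible.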

📊 Evaluation

TSTR (Train on Synthetic, Test on Real)

Katabatic includes comprehensive evaluation using the TSTR methodology:

from katabatic.evaluate.tstr.evaluation import TSTREvaluation

# Initialize evaluator
evaluator = TSTREvaluation(
    synthetic_dir="path/to/synthetic/data",
    real_test_dir="path/to/real/test/data"
)

# Run evaluation with multiple ML models
results = evaluator.evaluate()
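
The idea behind TSTR fits in a few lines: fit a classifier on synthetic rows only, then score it on held-out real rows. The sketch below uses a 1-nearest-neighbour classifier with Hamming distance as a stand-in so that it runs with the standard library; it does not use TSTREvaluation or the evaluation models Katabatic supports, and all function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two discrete rows differ."""
    return sum(x != y for x, y in zip(a, b))


def predict_1nn(train_X, train_y, row):
    """Label of the training row closest to `row` (first one wins on ties)."""
    best = min(range(len(train_X)), key=lambda i: hamming(train_X[i], row))
    return train_y[best]


def tstr_accuracy(synth_X, synth_y, real_X, real_y):
    """Train on synthetic data, test on real data, report accuracy."""
    preds = [predict_1nn(synth_X, synth_y, row) for row in real_X]
    return sum(p == t for p, t in zip(preds, real_y)) / len(real_y)


synth_X, synth_y = [(0, 0, 0), (1, 1, 1)], [0, 1]
real_X, real_y = [(0, 0, 1), (1, 1, 0), (1, 0, 1), (0, 0, 0)], [0, 1, 0, 0]
print(tstr_accuracy(synth_X, synth_y, real_X, real_y))  # 0.75
```

A TSTR score close to the score of the same classifier trained on real data suggests the synthetic data preserves the feature-label relationship.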

Supported Evaluation Models:

Logistic Regression
Multi-layer Perceptron (MLP)
Random Forest
XGBoost

Metrics:

Accuracy
F1 Score
AUC-ROC (for binary classification)
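
For reference, the three metrics can be computed by hand as follows. This standard-library sketch covers the binary case and is meant to clarify the definitions; the function names are illustrative, and Katabatic's evaluation may rely on a library implementation instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc_roc(y_true, scores, positive=1):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(accuracy([1, 1, 0, 0], [1, 0, 1, 0]))
print(f1_binary([1, 1, 0, 0], [1, 0, 1, 0]))
print(auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```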

Statistical fidelity (marginal JSD/KLD, DCR) is available via katabatic.evaluate.fidelity.evaluation.StatisticalFidelityEvaluation in artifact pipeline runs.
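
Two of the fidelity measures named above can be sketched directly. Marginal JSD compares the category frequencies of one column in the real and synthetic data; DCR (distance to closest record) measures how near a synthetic row is to any real row. The code below is a standard-library illustration with hypothetical function names; using Hamming distance for DCR is an assumption, and StatisticalFidelityEvaluation may define the metrics differently.

```python
from collections import Counter
from math import log2


def marginal_jsd(real_col, synth_col):
    """Jensen-Shannon divergence (base 2, in [0, 1]) between two columns."""
    p, q = Counter(real_col), Counter(synth_col)
    n_p, n_q = len(real_col), len(synth_col)
    jsd = 0.0
    for cat in set(p) | set(q):
        pi, qi = p[cat] / n_p, q[cat] / n_q
        mi = 0.5 * (pi + qi)  # mixture distribution
        if pi > 0:
            jsd += 0.5 * pi * log2(pi / mi)
        if qi > 0:
            jsd += 0.5 * qi * log2(qi / mi)
    return jsd


def distance_to_closest_record(synth_row, real_rows):
    """Smallest Hamming distance from one synthetic row to any real row."""
    return min(sum(a != b for a, b in zip(synth_row, r)) for r in real_rows)


print(marginal_jsd(["a", "a", "b"], ["a", "b", "b"]))
print(distance_to_closest_record((0, 1, 1), [(0, 0, 0), (0, 1, 0)]))
```

A JSD of 0 means the two marginals are identical, while a DCR of 0 means a synthetic row exactly duplicates a real one, which can be a privacy concern.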

🛠 Development

Recommended VS Code Extensions

# Install recommended extensions
code --install-extension ms-python.python
code --install-extension ms-python.flake8
code --install-extension ms-python.black-formatter
code --install-extension ms-toolsai.jupyter
code --install-extension ms-python.isort

Development Setup

git clone https://github.com/datascience-works/Katabatic.git
cd Katabatic

poetry install --with dev -E ganblr -E eval   # add -E great as needed

poetry check
poetry run ruff check katabatic tests
poetry run pytest                              # fast unit tests
poetry run pytest -m integration               # after installing model extras
poetry run mypy katabatic/                     # optional

Project Structure

Katabatic/
├── katabatic/                 # Installable package (PyPI wheel)
│   ├── models/                # GANBLR, GReaT, experimental generators
│   ├── pipeline/              # TrainTestSplitPipeline, cross-validation
│   ├── evaluate/              # TSTR, statistical fidelity
│   ├── artifacts/             # Versioned store helpers
│   └── utils/                 # preprocess, split_dataset, ...
├── artifacts/                 # Local run outputs (gitignored)
├── docs/                      # EXPERIMENTAL_MODELS.md, etc.
├── examples/                  # Notebooks per model
├── tests/                     # Unit + integration tests
├── GANBLR_FLOW.md             # Artifact pipeline walkthrough
├── pyproject.toml
└── README.md

Building from Source

# Build package
poetry build

# Install locally
pip install dist/katabatic-*.whl

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Workflow

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Code Standards & Style Guide

We maintain high code quality standards to ensure consistency, readability, and maintainability across the codebase.

Python Style Guidelines

PEP 8 Compliance: All code must follow PEP 8 style guidelines
Line Length: Maximum 88 characters (Black's default)
Imports: Use isort for import organization
Type Hints: Add type hints for all public functions and class methods
Docstrings: Include docstrings for all modules, classes, and functions using Google or NumPy style

Code Formatting with autopep8

We use autopep8 as our primary code formatter to ensure consistent code style:

# autopep8 is already part of the dev dependencies; add it only if it is missing
poetry add --group dev autopep8

# Format a single file
poetry run autopep8 --in-place --aggressive --aggressive your_file.py

# Format entire project
poetry run autopep8 --in-place --aggressive --aggressive --recursive .

# Check formatting without making changes
poetry run autopep8 --diff --aggressive --aggressive --recursive .

Recommended autopep8 Configuration

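autopep8 does not pick up a standalone .autopep8 file on its own; it reads its settings from pyproject.toml (or from setup.cfg, tox.ini, .pep8, or .flake8). Add a [tool.autopep8] section to pyproject.toml in the project root:

# pyproject.toml
[tool.autopep8]
max_line_length = 88
ignore = "E203,W503"
aggressive = 2
recursive = true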

Additional Formatting Tools

While autopep8 is our primary formatter, you may also use these complementary tools:

# isort for import sorting
poetry run isort .

# Black as an alternative formatter (if preferred)
poetry run black .

# flake8 for linting
poetry run flake8 katabatic/

# mypy for static type checking
poetry run mypy katabatic/

Pre-commit Hooks

Set up pre-commit hooks to automatically format code before commits:

# Install pre-commit
poetry add --group dev pre-commit

# Create .pre-commit-config.yaml
cat > .pre-commit-config.yaml << EOF
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-added-large-files

  - repo: https://github.com/pre-commit/mirrors-autopep8
    rev: v2.0.2
    hooks:
      - id: autopep8
        args: [--aggressive, --aggressive, --in-place]

  - repo: https://github.com/pycqa/isort
    rev: 5.12.0
    hooks:
      - id: isort
        args: [--profile, black]

  - repo: https://github.com/pycqa/flake8
    rev: 6.0.0
    hooks:
      - id: flake8
        args: [--max-line-length=88, --ignore=E203,W503]
EOF

# Install the hooks
poetry run pre-commit install

VS Code Configuration

Add these settings to your VS Code workspace settings (.vscode/settings.json):

{
  "python.formatting.provider": "autopep8",
  "python.formatting.autopep8Args": [
    "--aggressive",
    "--aggressive",
    "--max-line-length=88"
  ],
  "python.linting.enabled": true,
  "python.linting.flake8Enabled": true,
  "python.linting.flake8Args": ["--max-line-length=88", "--ignore=E203,W503"],
  "editor.formatOnSave": true,
  "editor.codeActionsOnSave": {
    "source.organizeImports": true
  },
  "python.sortImports.args": ["--profile", "black"]
}

Code Quality Checklist

Before submitting code, ensure:

Code is formatted with autopep8: poetry run autopep8 --diff --aggressive --aggressive --recursive .
Imports are sorted: poetry run isort --check-only .
No linting errors: poetry run flake8 katabatic/
Type hints pass checking: poetry run mypy katabatic/
All tests pass: poetry run pytest
Documentation is updated if needed
Commit messages follow conventional commit format

Naming Conventions

Variables and Functions: snake_case
Classes: PascalCase
Constants: UPPER_SNAKE_CASE
Private Methods: _leading_underscore
Modules: lowercase or snake_case

Documentation Standards

Use Google-style docstrings for consistency
Include type information in docstrings when not obvious from type hints
Provide examples for complex functions
Update README and documentation when adding new features

Example Docstring:

def generate_synthetic_data(
    model: BaseModel,
    n_samples: int,
    temperature: float = 0.7
) -> pd.DataFrame:
    """Generate synthetic tabular data using the specified model.

    Args:
        model: Trained generative model instance
        n_samples: Number of synthetic samples to generate
        temperature: Sampling temperature for generation (default: 0.7)

    Returns:
        DataFrame containing synthetic data samples

    Raises:
        ValueError: If model is not trained or n_samples <= 0

    Example:
        >>> model = GANBLR()
        >>> model.fit(X_train, y_train)
        >>> synthetic_data = generate_synthetic_data(model, 1000)
    """

Testing Standards

Write unit tests for new features
Maintain minimum 80% code coverage
Use descriptive test names
Include edge case testing
Mock external dependencies

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

GANBLR: Based on the GAN-based Bayesian Learning Rules methodology
GReaT: Implements Generation of Realistic Tabular data using transformer models
Contributors: Thanks to all contributors who have helped improve this project

📞 Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Email: vikumdabare@gmail.com

🔗 Related Projects

Happy generating! 🎯

Project details

This version

0.1.0a1 pre-release

May 21, 2026

Download files

Source Distribution

katabatic_test-0.1.0a1.tar.gz (252.7 kB)

Uploaded May 21, 2026 Source

Built Distribution

katabatic_test-0.1.0a1-py3-none-any.whl (271.8 kB)

Uploaded May 21, 2026 Python 3

File details

Details for the file katabatic_test-0.1.0a1.tar.gz.

File metadata

Download URL: katabatic_test-0.1.0a1.tar.gz
Upload date: May 21, 2026
Size: 252.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for katabatic_test-0.1.0a1.tar.gz
Algorithm	Hash digest
SHA256	`fdf373b1f4a6b4ecf7edce193c065d649cb3cda3d6da6d9f1af10d631d5a3e88`
MD5	`63b5c8033aa5fecb41f8b90e3a964d7c`
BLAKE2b-256	`06db2f8a455002259f5bc134e6cf9842d42816af892d5b0897519f208e6c367d`

Provenance

The following attestation bundles were made for katabatic_test-0.1.0a1.tar.gz:

Publisher: publish-pypi.yml on datascience-works/Katabatic

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: katabatic_test-0.1.0a1.tar.gz
- Subject digest: fdf373b1f4a6b4ecf7edce193c065d649cb3cda3d6da6d9f1af10d631d5a3e88
- Sigstore transparency entry: 1592673209
- Sigstore integration time: May 21, 2026
Source repository:
- Permalink: datascience-works/Katabatic@efc4ba2bfb64327dfec0bbbd227e9c505fccfb08
- Branch / Tag: refs/heads/main
- Owner: https://github.com/datascience-works
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@efc4ba2bfb64327dfec0bbbd227e9c505fccfb08
- Trigger Event: workflow_dispatch

File details

Details for the file katabatic_test-0.1.0a1-py3-none-any.whl.

File metadata

Download URL: katabatic_test-0.1.0a1-py3-none-any.whl
Upload date: May 21, 2026
Size: 271.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for katabatic_test-0.1.0a1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4c3e599f1bb096f26e55718c4678d26552dbbecca122666f0b400de8393e7283`
MD5	`c8809ecb1081b8ed99898ddbada23dff`
BLAKE2b-256	`1412629e87bc4abed3a0484148de2e34ec21a43598f8f22ac874a242af354328`

Provenance

The following attestation bundles were made for katabatic_test-0.1.0a1-py3-none-any.whl:

Publisher: publish-pypi.yml on datascience-works/Katabatic

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: katabatic_test-0.1.0a1-py3-none-any.whl
- Subject digest: 4c3e599f1bb096f26e55718c4678d26552dbbecca122666f0b400de8393e7283
- Sigstore transparency entry: 1592673227
- Sigstore integration time: May 21, 2026
Source repository:
- Permalink: datascience-works/Katabatic@efc4ba2bfb64327dfec0bbbd227e9c505fccfb08
- Branch / Tag: refs/heads/main
- Owner: https://github.com/datascience-works
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@efc4ba2bfb64327dfec0bbbd227e9c505fccfb08
- Trigger Event: workflow_dispatch

katabatic-test 0.1.0a1
