High-efficiency local/self-hosted ML experiment tracking system

Project description

KohakuBoard

High-performance ML experiment tracking with zero training overhead.

Part of KohakuHub - Self-hosted AI Infrastructure


Quick Start

pip install -e .

from kohakuboard.client import Board

board = Board(name="my-experiment", config={"lr": 0.001, "batch_size": 32})

# Training loop
for epoch in range(10):
    for data, target in train_loader:
        loss = train_step(data, target)

        board.step()  # Once per optimizer step
        board.log(loss=loss.item())  # Non-blocking, <0.1ms
        # Alternative: move board.step() after board.log() for 0-indexed steps

# logs are stored under ./kohakuboard using KohakuVault column stores + SQLite metadata

Join our community: https://discord.gg/xWYrkyvJ2s


Why KohakuBoard?

KohakuBoard's Advantages

  • Zero Training Overhead - Non-blocking logging returns in <0.1ms
  • Local-First - No server required during training, view results instantly
  • High Throughput - 20,000+ metrics/second sustained
  • Rich Data Types - Scalars, images, videos, tables, histograms
  • WebGL Visualization - Handle 100K+ datapoints smoothly
  • Self-Hosted - Your data stays on your infrastructure

Features

Non-Blocking Architecture

A background writer process ensures that training never waits on disk I/O:

Training Script          Background Process
     │                          │
     ├─ board.log(loss=0.5)     │
     │  └─> Queue.put()         │
     │      (<0.1ms return!)    │
     │                          ├─ Queue.get()
     ├─ Continue training...    ├─ Batch write
     │                          └─ Flush to disk

Performance:

  • Log call latency: <0.1ms
  • Throughput: 20,000+ metrics/sec
  • Queue capacity: 50,000 messages
  • Memory overhead: ~100-200 MB
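
The queue-plus-writer pattern in the diagram can be sketched with the standard library alone. This is an illustration of the idea, not KohakuBoard's implementation: `ToyBoard` is a hypothetical stand-in, it uses a thread where the README describes a separate process, and it keeps "written" records in a list instead of on disk.

```python
import queue
import threading


class ToyBoard:
    """Hypothetical stand-in for a non-blocking logger (not the real Board)."""

    def __init__(self, capacity=50_000):
        # Bounded queue, mirroring the 50,000-message capacity quoted above.
        self._queue = queue.Queue(maxsize=capacity)
        self.written = []  # stands in for on-disk storage
        self._writer = threading.Thread(target=self._drain, daemon=True)
        self._writer.start()

    def log(self, **metrics):
        # Enqueue and return immediately; the slow work happens in the writer.
        # put_nowait raises queue.Full instead of blocking when the queue is full.
        self._queue.put_nowait(metrics)

    def _drain(self):
        while True:
            item = self._queue.get()
            if item is None:  # sentinel: stop the writer
                return
            self.written.append(item)

    def finish(self):
        # Ask the writer to stop, then wait until everything queued is handled.
        self._queue.put(None)
        self._writer.join()


board = ToyBoard()
for i in range(1000):
    board.log(loss=1.0 / (i + 1))
board.finish()
```

Because there is one producer and one consumer on a FIFO queue, records reach storage in the order they were logged.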

Rich Data Types

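Unified API for all data types: a single log() call can mix scalars, media, tables, and histograms, and everything lands on the same step (no step inflation):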

from kohakuboard.client import Histogram, Media, Table

board.log(
    loss=0.5,                           # Scalar
    sample_img=Media(image),            # Image
    predictions=Table(results),         # Table
    gradients=Histogram(grads)          # Histogram
)
# All logged at SAME step with 1 queue message!

Supported Types:

  • Scalars - Metrics, learning rates, accuracies
  • Media - Images (PNG/JPG), videos (MP4), audio (WAV)
  • Tables - Structured data with embedded images
  • Histograms - Weight/gradient distributions with compression (99.8% size reduction)

Three-Tier SQLite Storage Architecture

Powered by KohakuVault - A high-performance storage library with dual interfaces over SQLite:

Three Specialized SQLite Implementations:

1. KohakuVault KVault        2. KohakuVault ColumnVault     3. Standard SQLite
   (K-V Store)                   (Columnar Storage)             (Relational)
   ├─ Media blobs                ├─ Metrics                     ├─ Media metadata
   ├─ B+Tree index on K          ├─ Histograms                  ├─ Tables
   ├─ Content-addressable        ├─ Blob-based columnar         └─ Step info
   └─ .cache() for bulk ops      └─ Dynamic chunk growth

Why KohakuVault?

  • Zero dependencies - Single SQLite file, no external services
  • Simple deployment - Just .db files, no infrastructure
  • Dual-interface design - Dict-like for blobs, list-like for sequences
  • High performance - Native speed with Pythonic API
  • Memory efficient - Streaming support, dynamic chunk growth
  • True SWMR - Multiple readers, single writer via SQLite WAL
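
The single-writer, multiple-reader behaviour comes from SQLite's WAL journal mode, which can be observed with Python's built-in `sqlite3` module. The sketch below is a generic SQLite demonstration, not KohakuVault code; the table and file names are made up for illustration.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "metrics.db")

# Writer connection: switch the database file to write-ahead logging.
writer = sqlite3.connect(db_path)
journal_mode = writer.execute("PRAGMA journal_mode=WAL").fetchone()[0]
writer.execute("CREATE TABLE loss (step INTEGER, value REAL)")
writer.execute("INSERT INTO loss VALUES (1, 0.9)")
writer.commit()

# Independent reader connection, as a viewer process would open.
reader = sqlite3.connect(db_path)

# The writer starts a second insert but has not committed yet.
writer.execute("INSERT INTO loss VALUES (2, 0.8)")
# In WAL mode the reader is not blocked and sees the last committed snapshot.
seen_before_commit = reader.execute("SELECT COUNT(*) FROM loss").fetchone()[0]

writer.commit()
seen_after_commit = reader.execute("SELECT COUNT(*) FROM loss").fetchone()[0]

reader.close()
writer.close()
```

The reader never waits on the in-flight write; it simply picks up the new row once the writer commits.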

Why Three Tiers?

  • KVault: Optimized for blob storage with B+Tree index, content-addressable
  • ColumnVault: Optimized for append-heavy time-series with columnar layout
  • Standard SQLite: Optimized for structured metadata with ACID guarantees
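
To make "content-addressable" concrete, here is a minimal sketch of a blob table keyed by the SHA-256 digest of its payload, using plain `sqlite3`. It illustrates the general technique only; the schema and the `put_blob` / `get_blob` helpers are hypothetical and are not KVault's actual API.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite backs the primary key with a B-tree index, so lookups by key are fast.
conn.execute("CREATE TABLE blobs (key TEXT PRIMARY KEY, value BLOB NOT NULL)")


def put_blob(data: bytes) -> str:
    """Store data under its SHA-256 digest; identical payloads share one row."""
    key = hashlib.sha256(data).hexdigest()
    conn.execute(
        "INSERT OR IGNORE INTO blobs (key, value) VALUES (?, ?)", (key, data)
    )
    return key


def get_blob(key: str) -> bytes:
    """Fetch a payload by digest, raising KeyError when it is unknown."""
    row = conn.execute("SELECT value FROM blobs WHERE key = ?", (key,)).fetchone()
    if row is None:
        raise KeyError(key)
    return row[0]


first_key = put_blob(b"fake-png-bytes")
second_key = put_blob(b"fake-png-bytes")  # duplicate payload, no new row
row_count = conn.execute("SELECT COUNT(*) FROM blobs").fetchone()[0]
```

Logging the same image twice therefore costs one stored blob rather than two.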

Advanced Visualization

WebGL-Based Charts powered by Plotly.js:

  • Handle 100K+ datapoints smoothly
  • Configurable smoothing (EMA, MA, Gaussian)
  • X-axis selection (step, global_step, any metric)
  • Multi-metric overlays
  • Dark/light mode
  • Responsive design
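
For readers unfamiliar with the smoothing options, the sketch below shows one common definition of EMA and trailing moving-average smoothing. The exact formulas, parameter names, and defaults used by the KohakuBoard UI are not stated in this README, so treat `ema_smooth`, `moving_average`, and their arguments as assumptions.

```python
def ema_smooth(values, weight=0.6):
    """Exponential moving average: s[0] = x[0], s[t] = w*s[t-1] + (1-w)*x[t]."""
    if not 0.0 <= weight < 1.0:
        raise ValueError("weight must be in [0, 1)")
    smoothed = []
    prev = None
    for x in values:
        prev = x if prev is None else weight * prev + (1.0 - weight) * x
        smoothed.append(prev)
    return smoothed


def moving_average(values, window=3):
    """Trailing mean over the last `window` points (fewer at the start)."""
    if window < 1:
        raise ValueError("window must be >= 1")
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


noisy_loss = [1.0, 0.2, 0.9, 0.3, 0.8]
ema_curve = ema_smooth(noisy_loss, weight=0.5)
ma_curve = moving_average(noisy_loss, window=2)
```

A higher `weight` (or a wider `window`) gives a smoother curve that reacts more slowly to real changes in the metric.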

Rich Viewers:

  • Histogram Navigator - Step-by-step distribution exploration
  • Media Viewer - Image grids, video playback
  • Table Viewer - Structured data with embedded images
  • Dashboard - Customizable metric layouts

Local-First Workflow

# Train locally
python train.py              # Logs to ./kohakuboard/

# View results (no server required!)
kobo open ./kohakuboard --browser

# Optional server for team sharing (requires kohakuboard_server)
kobo-serve --port 48889

No server setup, no configuration, no hassle.


Quick Start

Installation

pip install -e .

Basic Usage

from kohakuboard.client import Board

# Create board - automatically saves on program exit
board = Board(name="my-experiment", config={"lr": 0.001, "batch_size": 32})

# Training loop
for epoch in range(10):
    for batch_idx, (data, target) in enumerate(train_loader):
        loss = train_step(data, target)

        # Increment step once per optimizer step (not per epoch!)
        board.step()

        # Log metrics (non-blocking, returns in <0.1ms)
        board.log(loss=loss.item(), lr=optimizer.param_groups[0]['lr'])

    # Log validation at end of epoch
    val_loss = validate(model, val_loader)
    board.log(**{"val/loss": val_loss})

# That's it! No .finish() needed - auto-cleanup via atexit

View Results

# Local viewer (no server)
kobo open ./kohakuboard --browser

# Or launch the authenticated server (requires kohakuboard_server)
kobo-serve --port 48889
# Drop/copy board folders into the configured data dir to share runs

Complete Example

from kohakuboard.client import Board, Histogram, Table, Media
import torch

# Create board with hyperparameters
board = Board(
    name="cifar10-resnet18",
    config={"lr": 0.001, "batch_size": 128, "epochs": 100, "optimizer": "AdamW"}
)

# Training loop
for epoch in range(100):
    model.train()
    for data, target in train_loader:
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        # Step once per optimizer step
        board.step()

        # Log scalars (non-blocking, <0.1ms)
        board.log(loss=loss.item(), lr=optimizer.param_groups[0]['lr'])

    # Validation
    model.eval()
    val_loss, correct, predictions_table = 0, 0, []

    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(val_loader):
            output = model(data)
            val_loss += criterion(output, target).item()
            pred = output.argmax(dim=1)
            correct += (pred == target).sum().item()

            # Sample predictions for table (first batch only)
            if batch_idx == 0:
                for i in range(min(8, len(data))):
                    predictions_table.append({
                        "image": Media(data[i].cpu().numpy()),
                        "true": class_names[target[i]],
                        "pred": class_names[pred[i]],
                        "correct": "✓" if pred[i] == target[i] else "✗"
                    })

    # Log validation (scalars + table + histograms - all at same step!)
    hist_data = {f"grad/{n}": Histogram(p.grad) for n, p in model.named_parameters() if p.grad is not None}
    board.log(**{
        "val/loss": val_loss / len(val_loader),
        "val/accuracy": correct / len(val_loader.dataset),
        "val/predictions": Table(predictions_table),
        **hist_data
    })

# No .finish() needed - automatic cleanup when script exits

Architecture

Client (Training Script)

Main Process (Training)          Background Writer Process
       │                                   │
       ├─ board.log(loss=0.5)              │
       │  └─> Queue.put()                  │
       │      (returns instantly!)         │
       │                                   ├─ Queue.get()
       │                                   ├─ Process batch
       ├─ Continue training...             ├─ Write to storage
       │                                   └─ Flush to disk

Key Features:

  • Non-blocking: log() returns in <0.1ms
  • Message Queue: 50,000 message capacity
  • Writer Process: Background process drains queue
  • Storage Layer: Three-tier SQLite architecture (KohakuVault KVault + ColumnVault + Standard SQLite)
  • Graceful Shutdown: atexit hooks + signal handlers ensure no data loss
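
The graceful-shutdown idea can be sketched with `atexit`: register an idempotent `close()` that drains whatever is still queued. `ToyWriter` below is a hypothetical illustration, not the library's writer. It also omits the signal handlers the README mentions, which matter because `atexit` callbacks do not run when the process is killed by a signal Python does not handle.

```python
import atexit
import queue


class ToyWriter:
    """Hypothetical writer that drains its queue on interpreter exit."""

    def __init__(self):
        self.pending = queue.Queue()
        self.flushed = []  # stands in for data persisted to disk
        self.closed = False
        atexit.register(self.close)  # runs at normal interpreter exit

    def log(self, **metrics):
        if self.closed:
            raise RuntimeError("writer is closed")
        self.pending.put(metrics)

    def close(self):
        """Drain remaining messages; safe to call more than once."""
        if self.closed:
            return
        while not self.pending.empty():
            self.flushed.append(self.pending.get())
        self.closed = True


writer = ToyWriter()
writer.log(loss=0.5)
writer.log(loss=0.4)
writer.close()  # explicit call, e.g. from finish()
writer.close()  # the later atexit call becomes a no-op
```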

Backend (Visualization Server)

FastAPI Backend (Port 48889)
    ↓ Read-only connections
Board Files (./kohakuboard/)
    ├── {board_id}/
    │   ├── metadata.json
    │   ├── data/           ← SQL/columnar queries here
    │   │   ├── metrics/    ← KohakuVault DB files
    │   │   └── metadata.db ← SQLite database
    │   └── media/
    │       └── *.png, *.mp4
        ↓ REST API
Vue 3 Frontend (WebGL Charts)

Key Features:

  • Zero-copy serving: Reads board files directly (no separate database server)
  • Concurrent reads: Multiple connections supported
  • Fast queries: Columnar storage for metrics
  • Static serving: Media files served directly

Data Model

Directory Structure

./kohakuboard/
└── {board_id_timestamp}/
    ├── metadata.json           # Board info, config, timestamps
    ├── data/                   # Storage backend files
    │   ├── metrics/            # (hybrid) KohakuVault columnar files
    │   │   ├── train__loss.db
    │   │   ├── val__accuracy.db
    │   │   └── ...
    │   ├── metadata.db         # (hybrid) SQLite metadata
    │   └── histograms/
    │       ├── gradients_i32.db  # int32 precision
    │       └── params_u8.db      # uint8 precision (compact)
    ├── media/                  # Content-addressed storage
    │   ├── {name}_{idx}_{step}_{sha256}.png
    │   ├── {name}_{idx}_{step}_{sha256}.mp4
    │   └── {name}_{idx}_{step}_{sha256}.wav
    └── logs/
        ├── output.log          # Captured stdout/stderr
        └── writer.log          # Writer process logs
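
The naming patterns in the tree can be sketched as two small helpers. Both are inferred from the listing above rather than taken from the library: the "/" to "__" mapping is a guess based on train__loss.db, and whether the real media names use the full SHA-256 hex digest or a truncated one is not stated, so `metric_db_name` and `media_file_name` are hypothetical.

```python
import hashlib


def metric_db_name(metric: str) -> str:
    """Map a namespaced metric such as "train/loss" to "train__loss.db"."""
    return metric.replace("/", "__") + ".db"


def media_file_name(name: str, idx: int, step: int, payload: bytes, ext: str) -> str:
    """Build "{name}_{idx}_{step}_{sha256}.{ext}" from the payload's digest."""
    digest = hashlib.sha256(payload).hexdigest()  # full digest: an assumption
    return f"{name}_{idx}_{step}_{digest}.{ext}"


loss_file = metric_db_name("train/loss")
image_file = media_file_name("sample", 0, 10, b"abc", "png")
```

Because the digest depends only on the payload, the same bytes logged under the same name, index, and step always map to the same file name.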

Metadata Schema

{
  "board_id": "20250129_150423_abc123",
  "name": "cifar10-resnet18",
  "config": {
    "lr": 0.001,
    "batch_size": 128,
    "epochs": 100
  },
  "created_at": "2025-01-29T15:04:23",
  "finished_at": "2025-01-29T18:32:45",
  "status": "finished",
  "version": "0.0.1"
}
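
Since metadata.json is plain JSON with ISO 8601 timestamps, it can be inspected without KohakuBoard installed. The sketch below computes a run's wall-clock duration from the example above; treating a missing or empty finished_at as "still running" is an assumption, and `run_duration_seconds` is a hypothetical helper.

```python
import json
from datetime import datetime

metadata_text = """
{
  "board_id": "20250129_150423_abc123",
  "name": "cifar10-resnet18",
  "created_at": "2025-01-29T15:04:23",
  "finished_at": "2025-01-29T18:32:45",
  "status": "finished",
  "version": "0.0.1"
}
"""
metadata = json.loads(metadata_text)


def run_duration_seconds(meta: dict):
    """Seconds between created_at and finished_at, or None if not finished."""
    if not meta.get("finished_at"):
        return None
    start = datetime.fromisoformat(meta["created_at"])
    end = datetime.fromisoformat(meta["finished_at"])
    return (end - start).total_seconds()


duration = run_duration_seconds(metadata)  # 3 h 28 min 22 s for this example
```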

Manual Sync / Remote Sharing

Both the training-side package (kohakuboard) and the optional server (kohakuboard_server) read the exact same directory layout. To move a run between machines:

  1. Copy the entire board folder ({base_dir}/{project}/{board_id}) using cp, rsync, or any file transfer tool.
  2. Drop it into the destination data directory (the folder you pass to kobo open ... or the directory configured via KOHAKU_BOARD_DATA_DIR / --data-dir on kobo-serve).
  3. Restart the viewer or refresh the UI. The new run is immediately available.

No export/import step is required because metrics, metadata, tensors, and media already live in KohakuVault + SQLite files. The legacy kobo sync command still expects a DuckDB export and will fail on modern boards, so use manual copy until the new sync API lands.
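
The manual copy in steps 1 and 2 can also be scripted. The sketch below is a plain `shutil.copytree` wrapper following the {base_dir}/{project}/{board_id} layout described above; `copy_board` is a hypothetical helper, not part of the kobo CLI, and the demo runs entirely in throwaway temp directories.

```python
import shutil
import tempfile
from pathlib import Path


def copy_board(src_base: Path, dst_base: Path, project: str, board_id: str) -> Path:
    """Copy one board folder into another data directory and return the new path."""
    src = src_base / project / board_id
    dst = dst_base / project / board_id
    if not src.is_dir():
        raise FileNotFoundError(src)
    if dst.exists():
        raise FileExistsError(dst)  # never overwrite an existing run
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(src, dst)
    return dst


# Demo: build a fake board folder, then copy it to a fake server data dir.
src_base = Path(tempfile.mkdtemp())
dst_base = Path(tempfile.mkdtemp())
board_dir = src_base / "vision" / "20250129_150423_abc123"
(board_dir / "data").mkdir(parents=True)
(board_dir / "metadata.json").write_text('{"name": "demo"}')

copied = copy_board(src_base, dst_base, "vision", "20250129_150423_abc123")
```

Copy a run only after training has finished or after a flush, so the SQLite files are not captured mid-write.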


CLI Tool

# Open local viewer (no server)
kobo open ./kohakuboard --browser

# Start authenticated server (kohakuboard_server package)
kobo-serve --port 48889 --host 0.0.0.0

# Manual sync (recommended today): copy the entire board folder into the server's data dir
# (kobo sync is still wired to the legacy DuckDB exporter and will error on modern boards)

Configuration

Basic Usage

# All boards use KohakuVault + SQLite (no backend parameter needed)
board = Board(name="my-experiment", project="vision")

Advanced Options

board = Board(
    name="experiment",
    board_id="custom-id",           # Auto-generated if not provided
    config={"lr": 0.001},           # Hyperparameters
    project="custom-project",       # Sub-directory inside base_dir
    base_dir="./my-boards",         # Custom directory
    capture_output=True,            # Capture stdout/stderr
    remote_url="https://board.example.com",  # Optional future sync target
    remote_token="...",             # Token for remote sync (WIP)
    sync_enabled=False,             # Enable when remote endpoints are ready
    memory_mode=False,              # Keep data in RAM (requires sync to persist)
    annotation="debug-run",         # Suffix appended to run directory name
)

Storage Architecture:

  • KohakuVault KVault: Media blobs (K-V table with B+Tree index)
  • KohakuVault ColumnVault: Metrics/histograms (blob-based columnar)
  • Standard SQLite: Metadata (traditional relational tables)

Context Manager

with Board(name="experiment") as board:
    board.log(loss=0.5)
    # Automatic flush() and finish() on exit

API Reference

Board

Board(
    name: str | None = None,
    board_id: str | None = None,
    config: dict | None = None,
    project: str | None = None,
    base_dir: str | Path | None = None,
    capture_output: bool = True,
    remote_url: str | None = None,
    remote_token: str | None = None,
    remote_project: str | None = None,
    sync_enabled: bool = False,
    sync_interval: int = 10,
    memory_mode: bool = False,
    *,
    annotation: str | None = None,
)

Methods:

board.step() - Increment global_step

for batch_idx, batch in enumerate(train_loader):
    loss = train_step(batch)
    board.step()  # Increment ONCE per optimizer step
    board.log(**{"train/loss": loss, "train/lr": scheduler.get_last_lr()[0]})
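
The ordering of step() and log() decides whether the first logged step is 1 or 0. The toy class below illustrates this under the assumption that the counter starts at 0, which is inferred from the README's note about 0-indexed steps. `ToySteps` is hypothetical and only mimics the two method names.

```python
class ToySteps:
    """Hypothetical counter mimicking board.step() / board.log() ordering."""

    def __init__(self):
        self.global_step = 0  # assumed starting value
        self.records = []

    def step(self):
        self.global_step += 1

    def log(self, **metrics):
        # Every log call is tagged with the current global_step.
        self.records.append((self.global_step, metrics))


# step() before log(): the first record lands on step 1.
one_indexed = ToySteps()
for loss in (0.9, 0.8, 0.7):
    one_indexed.step()
    one_indexed.log(loss=loss)

# log() before step(): the first record lands on step 0.
zero_indexed = ToySteps()
for loss in (0.9, 0.8, 0.7):
    zero_indexed.log(loss=loss)
    zero_indexed.step()

one_indexed_steps = [s for s, _ in one_indexed.records]
zero_indexed_steps = [s for s, _ in zero_indexed.records]
```

Several log() calls made between two step() calls share the same step, which is why validation metrics logged at the end of an epoch line up with the last training step.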

board.log(**metrics) - Log data (non-blocking)

board.log(
    loss=0.5,
    accuracy=0.95,
    learning_rate=0.001,            # Scalars
    sample=Media(image_array),      # Images/video/audio
    predictions=Table(rows),        # Tables (optionally with Media)
    grad_norm=Histogram(values),    # Histograms
)

# Namespaces (creates tabs in UI)
board.log(**{
    "train/loss": 0.5,
    "val/accuracy": 0.95
})

# Tensor + KDE payloads (specialized viewers)
board.log(
    attention_tensor=TensorLog(tensor),
    density=KernelDensity(values, grid_size=256),
)

board.flush() - Force flush (blocks until complete)

board.flush()  # Wait for all pending writes

board.finish() - Manual cleanup (auto-called on exit)

board.finish()  # Flush buffers, close connections

Data Types

Media

from kohakuboard.client import Media

# Images
board.log(
    sample_img=Media(image_array),  # numpy, PIL, torch tensor
    prediction=Media(pred_img, caption="Predicted: cat")
)

# Video
board.log(
    training_video=Media("output.mp4", media_type="video")
)

# Audio (if supported)
board.log(
    audio_sample=Media("sample.wav", media_type="audio")
)

Table

from kohakuboard.client import Table

# From list of dicts
results = Table([
    {"name": "Alice", "score": 95, "pass": True},
    {"name": "Bob", "score": 87, "pass": True},
])
board.log(student_results=results)

# Tables with embedded images
predictions = Table([
    {"image": Media(img), "label": "cat", "confidence": 0.95},
    {"image": Media(img2), "label": "dog", "confidence": 0.87},
])
board.log(val_predictions=predictions)

Histogram

from kohakuboard.client import Histogram

# Log gradient distributions
board.log(
    gradients=Histogram(param.grad),
    weights=Histogram(param.data)
)

# Precompute for efficiency (optional)
hist = Histogram(gradients).compute_bins()
board.log(grad_distribution=hist)

# Compact precision (75% size reduction, ~1% accuracy loss)
hist = Histogram(gradients, precision="compact")
board.log(grad_distribution=hist)
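
One plausible reading of the 75% figure is that bin counts go from 4-byte int32 to 1-byte uint8, matching the _i32 and _u8 histogram files in the directory layout. The sketch below shows that arithmetic. It is an assumption about how compact precision works, not the library's code, and `quantize_counts_u8` is a hypothetical helper.

```python
import struct


def quantize_counts_u8(counts):
    """Scale non-negative bin counts into 0..255 so each bin fits in one byte."""
    peak = max(counts, default=0)
    if peak == 0:
        return [0] * len(counts), 0
    return [round(c * 255 / peak) for c in counts], peak


counts = [0, 3, 120, 4000, 9500, 4100, 130, 2]  # made-up histogram bins
quantized, peak = quantize_counts_u8(counts)

# Serialize both variants with fixed-size little-endian fields.
as_i32 = struct.pack(f"<{len(counts)}i", *counts)        # 4 bytes per bin
as_u8 = struct.pack(f"<{len(quantized)}B", *quantized)   # 1 byte per bin

reduction = 1 - len(as_u8) / len(as_i32)
```

The cost is resolution: very small bins (such as the count of 3 here) round down to zero relative to the peak, and the peak itself has to be stored alongside the bytes to recover approximate counts.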

Deployment

Local Mode (Recommended)

# Install
pip install -e .

# Train
python train.py

# View results
kobo open ./kohakuboard --browser

Remote Mode (WIP)

# Run the authenticated server (still stabilizing)
kobo-serve --data-dir /var/kohakuboard --db sqlite:///kohakuboard.db

# Share boards by copying folders into /var/kohakuboard/<project>/
# Restart/reload the server to pick up new runs

See docs/kohakuboard/ for complete deployment guides.


Comparison with Alternatives

Feature          WandB     TensorBoard   MLflow     KohakuBoard
---------------  --------  ------------  ---------  ----------------------------------
Latency          ~10ms     ~1ms          ~5ms       <0.1ms
Throughput       ~1K/sec   ~10K/sec      ~5K/sec    20K+/sec
Offline          ❌ No     ✅ Yes        ✅ Yes     ✅ Yes
File-Based       ❌ No     ✅ Yes        ❌ No      ✅ Yes
Non-Blocking     ❌ No     ❌ No         ❌ No      ✅ Yes
Columnar Reads   ❌ No     ❌ No         ✅ Yes     ✅ Yes (KohakuVault ColumnVault)
WebGL Charts     ❌ No     ❌ No         ❌ No      ✅ Yes
100K+ Points     Slow      Slow          Slow       Fast
Self-Hosted      Limited   ✅ Yes        ✅ Yes     ✅ Yes
Setup            Cloud     Local         Server     None

Documentation


Examples

See examples/ directory:

  • kohakuboard_basic.py - Simple scalar logging
  • kohakuboard_all_media_types.py - Images, videos, tables
  • kohakuboard_cifar_training.py - Complete CIFAR-10 training example
  • kohakuboard_media_in_tables.py - Tables with embedded images
  • kohakuboard_histogram_logging.py - Gradient distribution tracking

Roadmap

✅ Complete

Client Library:

  • Non-blocking logging architecture
  • Rich data types (scalars, media, tables, histograms)
  • Three-tier SQLite architecture (KohakuVault KVault + ColumnVault + Standard SQLite)
  • Graceful shutdown with queue draining
  • Content-addressed media storage

Backend & UI:

  • FastAPI REST API
  • Vue 3 interface with dark/light mode
  • WebGL charts (100K+ points)
  • Histogram navigator
  • Media/table viewers
  • CLI tool (kobo)

🚧 In Progress

  • Remote server mode with authentication
  • Sync protocol for uploading local boards
  • Project management (group related boards)
  • Run comparison UI (side-by-side metrics)
  • Real-time streaming (live updates while training)

📋 Planned

Client Features:

  • PyTorch Lightning integration
  • Keras callback
  • Hugging Face Trainer integration
  • Custom callback system

Backend Features:

  • Multi-board comparison API
  • Advanced filtering (tags, date range)
  • Export to CSV/JSON
  • Aggregation queries

UI Features:

  • Diff viewer (compare runs)
  • Scatter plots (metric vs metric)
  • Custom dashboards
  • Annotations
  • Search and filter

Infrastructure:

  • Docker/Kubernetes deployment
  • Cloud storage backends (S3, GCS)
  • Multi-user authentication

License

KohakuBoard is a multi-component project with different licenses:

  • Client Library (kohakuboard): Apache License 2.0

    • Free for commercial and non-commercial use
    • Permissive license with minimal requirements
  • Web UI (kohaku-board-ui): AGPL-3.0

    • Free to use and modify
    • Source code disclosure required for network services
  • Server (kohakuboard_server): Kohaku Software License 1.0

    • Free for non-commercial use
    • Free for commercial use under revenue/duration limits
    • Commercial licenses available for larger deployments

Commercial Licensing: For commercial licenses or exemptions, contact kohaku@kblueleaf.net

See LICENSE for complete details.


Contributing

KohakuBoard is part of the KohakuHub ecosystem. We welcome contributions!

Before contributing:

Areas we need help:

  • 🎨 Frontend (chart improvements, UI/UX)
  • 🔧 Backend (storage backends, performance)
  • 📊 Client library (framework integrations)
  • 📚 Documentation (tutorials, guides)
  • 🧪 Testing (unit tests, benchmarks)

Support


Acknowledgments

  • KohakuVault - High-performance storage library with dual SQLite interfaces (KVault for blobs, ColumnVault for sequences)
  • Plotly.js - WebGL charts
  • Vue 3 - Modern UI framework
  • FastAPI - Backend framework

Production Ready! Core features are stable and performant. Use in real training workflows and help us improve.
