Skip to main content

Keep It Simple Stupid Tools for Machine Learning

Project description

kissml

Keep It Simple Stupid Tools for Machine Learning

A Python library providing simple, powerful tools for ML workflows with minimal boilerplate.

I made this because:

  • Most data science services are notebook based, but notebooks are difficult to debug
  • Most frameworks (flyte, metaflow) focus on extending to the cloud. This is great, but for local iteration all we really need is reproducible pipeline steps.

Installation

pip install kissml

Steps

The @step decorator provides:

  • execution tracking
  • persistent disk-based caching for your functions
  • post-run execution (i.e., after effects) for the return value -- useful to visualize data or log stats.

Basic Usage

from kissml import step, CacheConfig
import logging

# Simple execution time logging
@step(log_level=logging.INFO)
def process_data(data):
    # Your processing logic here
    return result

# With persistent caching
@step(
    log_level=logging.INFO,
    cache=CacheConfig(version=1)
)
def expensive_computation(data):
    # This will only run once per unique input
    # Subsequent calls return cached results
    return result

Key Features

Execution Time Tracking: Log how long your functions take to run

@step(log_level=logging.INFO)
def train_model(X, y):
    # Logs: "train_model completed in 45.2341 seconds"
    return model

Persistent Disk Caching: Cache results to disk and reuse them across runs

@step(cache=CacheConfig(version=1))
def load_and_preprocess(filepath):
    # Expensive preprocessing runs once
    # Subsequent calls load from cache in milliseconds
    return processed_data

Version-Based Invalidation: Bump the version to invalidate old cache

# Old implementation
@step(cache=CacheConfig(version=1))
def feature_engineering(df):
    return old_features(df)

# Updated implementation - cache automatically invalidated
@step(cache=CacheConfig(version=2))
def feature_engineering(df):
    return new_improved_features(df)

Smart Serialization: Efficient storage for pandas DataFrames and nested collections

import pandas as pd

@step(cache=CacheConfig(version=1))
def analyze_data(df: pd.DataFrame) -> pd.DataFrame:
    # DataFrames cached as Parquet files (requires pyarrow)
    # Much more efficient than pickle
    return processed_df

@step(cache=CacheConfig(version=1))
def complex_pipeline(data) -> dict:
    # Returns dict with DataFrames, lists, etc.
    # Each type uses optimal serialization
    return {
        "results": some_dataframe,
        "metrics": [metric1, metric2],
        "metadata": {"key": "value"}
    }

Cache Configuration

Control cache behavior with CacheConfig:

from kissml import step, CacheConfig, EvictionPolicy

# No eviction (default) - cache grows forever
@step(cache=CacheConfig(version=1, eviction_policy=EvictionPolicy.NONE))
def permanent_cache(x):
    return x

# Least Recently Used - evicts oldest accessed items
@step(cache=CacheConfig(version=1, eviction_policy=EvictionPolicy.LEAST_RECENTLY_USED))
def lru_cache(x):
    return x

# Least Recently Stored - evicts oldest stored items
@step(cache=CacheConfig(version=1, eviction_policy=EvictionPolicy.LEAST_RECENTLY_STORED))
def lrs_cache(x):
    return x

# Least Frequently Used - evicts least accessed items
@step(cache=CacheConfig(version=1, eviction_policy=EvictionPolicy.LEAST_FREQUENTLY_USED))
def lfu_cache(x):
    return x

AfterEffects

AfterEffects allow you to automatically execute side effects (like visualization, logging, or validation) after a step completes, whether the result was cached or freshly computed.

from typing import Annotated
from kissml import step, AfterEffect, CacheConfig
import mlflow

# Define a custom AfterEffect
class HTMLVisualizer(AfterEffect):
    def __init__(self, max_rows=100):
        self.max_rows = max_rows
    
    def __call__(self, result, was_cached, func_name, execution_time):
        # Create HTML preview
        html = result.head(self.max_rows).to_html()
        html = f"<h3>{func_name} - {execution_time:.2f}s {'(cached)' if was_cached else ''}</h3>" + html
        
        # Log to MLflow
        with open(f"{func_name}.html", "w") as f:
            f.write(html)
        mlflow.log_artifact(f"{func_name}.html")

# Use it with type annotations
@step(cache=CacheConfig(version=1))
def load_data() -> Annotated[pd.DataFrame, HTMLVisualizer(max_rows=200)]:
    return pd.read_csv("data.csv")

# Multiple effects run left-to-right
class DatasetLogger(AfterEffect):
    def __call__(self, result, was_cached, func_name, execution_time):
        if not was_cached:  # Only log once
            mlflow.log_metric(f"{func_name}_rows", len(result))

@step(cache=CacheConfig(version=1))
def process() -> Annotated[pd.DataFrame, DatasetLogger(), HTMLVisualizer()]:
    # Both effects run automatically after the function completes
    return load_data()

Error Handling: Control whether AfterEffect failures stop execution:

# Default: errors are logged but don't stop execution
@step(cache=CacheConfig(version=1))
def safe_pipeline() -> Annotated[pd.DataFrame, MyVisualizer()]:
    return data

# Strict mode: effect errors raise exceptions
@step(cache=CacheConfig(version=1), error_on_affect_failure=True)
def strict_pipeline() -> Annotated[pd.DataFrame, MyVisualizer()]:
    return data

Configuration

Configure the cache directory via environment variable or settings:

from kissml import settings
from pathlib import Path

# Set cache directory
settings.cache_directory = Path("/path/to/cache")

# Or use environment variable
# export KISSML_CACHE_DIRECTORY=/path/to/cache

Custom Serialization

Register custom serializers for your types:

from kissml.settings import settings
from kissml.types import Serializer
from typing import Any, BinaryIO

class MyCustomSerializer(Serializer):
    def serialize(self, value: Any, out: BinaryIO) -> None:
        # Your serialization logic
        pass

    def deserialize(self, input: BinaryIO) -> Any:
        # Your deserialization logic
        pass

# Register the serializer
settings.serialize_by_type[MyCustomType] = MyCustomSerializer()

# Register a hash function for cache keys
settings.hash_by_type[MyCustomType] = lambda obj: str(hash(obj))

License

Licensed under CC BY-NC-ND 4.0 (Attribution-NonCommercial-NoDerivatives). This is a non-commercial license - see the LICENSE file for full details.

For commercial use, please contact the author.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

kissml-0.4.8.tar.gz (76.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

kissml-0.4.8-py3-none-any.whl (19.9 kB view details)

Uploaded Python 3

File details

Details for the file kissml-0.4.8.tar.gz.

File metadata

  • Download URL: kissml-0.4.8.tar.gz
  • Upload date:
  • Size: 76.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for kissml-0.4.8.tar.gz
Algorithm Hash digest
SHA256 e405b20f232a2dd10a3a4d01f83f89e8865a0396c5ccaaad322573ca983d1704
MD5 01b850f8dcd7860ae6d584d4251e32b9
BLAKE2b-256 dbfdae08f7e961370c9dd32ce3c2a0e78d0524d885eb65708cc082cd8fff9ea5

See more details on using hashes here.

Provenance

The following attestation bundles were made for kissml-0.4.8.tar.gz:

Publisher: publish.yml on lou-k/kissml

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file kissml-0.4.8-py3-none-any.whl.

File metadata

  • Download URL: kissml-0.4.8-py3-none-any.whl
  • Upload date:
  • Size: 19.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for kissml-0.4.8-py3-none-any.whl
Algorithm Hash digest
SHA256 f366d4a923b4c391c9b217899d02990036ede32235c86e55044c5f312274d736
MD5 19dfd13ba8434eac53b2329f746dd1b0
BLAKE2b-256 88e8928d9f9dd5e97bbc017c052695a76fc63848324725a2d490db34ec89196c

See more details on using hashes here.

Provenance

The following attestation bundles were made for kissml-0.4.8-py3-none-any.whl:

Publisher: publish.yml on lou-k/kissml

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page