# kissml

Keep It Simple Stupid Tools for Machine Learning

A Python library providing simple, powerful tools for ML workflows with minimal boilerplate.
I made this because:

- Most data science services are notebook-based, but notebooks are difficult to debug.
- Most frameworks (Flyte, Metaflow) focus on extending to the cloud. This is great, but for local iteration all we really need is reproducible pipeline steps.
## Installation

```bash
pip install kissml
```
## Steps

The `@step` decorator provides:

- execution tracking
- persistent disk-based caching for your functions
- post-run execution (i.e., after effects) on the return value -- useful for visualizing data or logging stats
### Basic Usage

```python
from kissml import step, CacheConfig
import logging

# Simple execution time logging
@step(log_level=logging.INFO)
def process_data(data):
    # Your processing logic here
    return result

# With persistent caching
@step(
    log_level=logging.INFO,
    cache=CacheConfig(version=1)
)
def expensive_computation(data):
    # This will only run once per unique input
    # Subsequent calls return cached results
    return result
```
## Key Features

**Execution Time Tracking**: Log how long your functions take to run.

```python
@step(log_level=logging.INFO)
def train_model(X, y):
    # Logs: "train_model completed in 45.2341 seconds"
    return model
```
**Persistent Disk Caching**: Cache results to disk and reuse them across runs.

```python
@step(cache=CacheConfig(version=1))
def load_and_preprocess(filepath):
    # Expensive preprocessing runs once
    # Subsequent calls load from cache in milliseconds
    return processed_data
```
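Conceptually, a persistent disk cache like this keys each call on the function, its version, and its arguments, then stores the pickled result on disk. The sketch below illustrates the idea in plain Python; it is not kissml's actual implementation, and the `cached` decorator and cache layout here are illustrative assumptions.

```python
# Illustrative sketch of hash-keyed disk caching (NOT kissml's code):
# arguments are hashed together with the function name and version,
# the result is pickled to disk, and later calls with the same
# arguments load the stored file instead of recomputing.
import hashlib
import pickle
import tempfile
from pathlib import Path

CACHE_DIR = Path(tempfile.mkdtemp())
calls = 0  # counts how many times the wrapped body actually runs

def cached(version):
    def wrap(func):
        def inner(*args):
            key = hashlib.sha256(
                pickle.dumps((func.__name__, version, args))
            ).hexdigest()
            path = CACHE_DIR / f"{key}.pkl"
            if path.exists():
                return pickle.loads(path.read_bytes())  # cache hit
            result = func(*args)
            path.write_bytes(pickle.dumps(result))  # cache store
            return result
        return inner
    return wrap

@cached(version=1)
def double(x):
    global calls
    calls += 1
    return x * 2

double(21)  # computes and stores
double(21)  # loads from disk; the function body does not run again
```

The same mechanism explains why caching survives process restarts: the key and the pickled value live on disk, not in memory.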
**Version-Based Invalidation**: Bump the version to invalidate old cache entries.

```python
# Old implementation
@step(cache=CacheConfig(version=1))
def feature_engineering(df):
    return old_features(df)

# Updated implementation - the old cache is automatically invalidated
@step(cache=CacheConfig(version=2))
def feature_engineering(df):
    return new_improved_features(df)
```
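One plausible mechanism for this (kissml's exact key derivation may differ): the version participates in the cache key, so bumping it produces a key that misses every entry written under the old version.

```python
# Sketch: the version is part of the cache key, so version=2 hashes
# to a different key than version=1 and misses the old entries.
import hashlib
import pickle

def cache_key(func_name, version, args):
    return hashlib.sha256(
        pickle.dumps((func_name, version, args))
    ).hexdigest()

k1 = cache_key("feature_engineering", 1, ("df",))
k2 = cache_key("feature_engineering", 2, ("df",))
assert k1 != k2  # different version -> different key -> cache miss
```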
**Smart Serialization**: Efficient storage for pandas DataFrames and nested collections.

```python
import pandas as pd

@step(cache=CacheConfig(version=1))
def analyze_data(df: pd.DataFrame) -> pd.DataFrame:
    # DataFrames are cached as Parquet files (requires pyarrow),
    # which is much more efficient than pickle
    return processed_df

@step(cache=CacheConfig(version=1))
def complex_pipeline(data) -> dict:
    # Returns a dict with DataFrames, lists, etc.
    # Each member type uses its optimal serialization
    return {
        "results": some_dataframe,
        "metrics": [metric1, metric2],
        "metadata": {"key": "value"},
    }
```
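Per-type serialization boils down to a registry that maps each Python type to a serializer, with a generic fallback for everything else. The sketch below shows the dispatch idea in miniature; the `SERIALIZERS` dict and `serialize` function are illustrative, not kissml's API (kissml's real registry is `settings.serialize_by_type`, shown under Custom Serialization below).

```python
# Sketch of per-type serializer dispatch: registered types get a
# tailored format, everything else falls back to pickle.
import json
import pickle

SERIALIZERS = {
    dict: lambda v: json.dumps(v).encode(),
    list: lambda v: json.dumps(v).encode(),
}

def serialize(value):
    # Fall back to pickle when no serializer is registered for the type
    fn = SERIALIZERS.get(type(value), pickle.dumps)
    return fn(value)

serialize({"key": "value"})  # stored as JSON bytes
serialize(3.5)               # no registered serializer -> pickled
```

Swapping the lambda for a Parquet writer gives the DataFrame behavior described above.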
## Cache Configuration

Control cache behavior with `CacheConfig`:

```python
from kissml import step, CacheConfig, EvictionPolicy

# No eviction (default) - cache grows forever
@step(cache=CacheConfig(version=1, eviction_policy=EvictionPolicy.NONE))
def permanent_cache(x):
    return x

# Least Recently Used - evicts the items accessed longest ago
@step(cache=CacheConfig(version=1, eviction_policy=EvictionPolicy.LEAST_RECENTLY_USED))
def lru_cache(x):
    return x

# Least Recently Stored - evicts the oldest stored items
@step(cache=CacheConfig(version=1, eviction_policy=EvictionPolicy.LEAST_RECENTLY_STORED))
def lrs_cache(x):
    return x

# Least Frequently Used - evicts the least-accessed items
@step(cache=CacheConfig(version=1, eviction_policy=EvictionPolicy.LEAST_FREQUENTLY_USED))
def lfu_cache(x):
    return x
```
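To make the LRU policy concrete, here is a minimal in-memory sketch of least-recently-used eviction (illustrative only; kissml applies its policies to the disk cache, not to an in-process dict):

```python
# LRU eviction with an OrderedDict: each access moves the key to the
# end, and eviction pops from the front (the least recently used key).
from collections import OrderedDict

class LRUCache:
    def __init__(self, max_items):
        self.max_items = max_items
        self.data = OrderedDict()

    def get(self, key):
        self.data.move_to_end(key)  # mark as recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.max_items:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(max_items=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" is now most recently used
cache.put("c", 3)  # evicts "b", the least recently used
```

LEAST_RECENTLY_STORED would skip the `move_to_end` on access, and LEAST_FREQUENTLY_USED would track access counts instead of access order.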
## AfterEffects

AfterEffects let you automatically run side effects (such as visualization, logging, or validation) after a step completes, whether the result was cached or freshly computed.

```python
from typing import Annotated

import mlflow
import pandas as pd

from kissml import step, AfterEffect, CacheConfig

# Define a custom AfterEffect
class HTMLVisualizer(AfterEffect):
    def __init__(self, max_rows=100):
        self.max_rows = max_rows

    def __call__(self, result, was_cached, func_name, execution_time):
        # Create an HTML preview
        html = result.head(self.max_rows).to_html()
        html = f"<h3>{func_name} - {execution_time:.2f}s {'(cached)' if was_cached else ''}</h3>" + html
        # Log to MLflow
        with open(f"{func_name}.html", "w") as f:
            f.write(html)
        mlflow.log_artifact(f"{func_name}.html")

# Use it with type annotations
@step(cache=CacheConfig(version=1))
def load_data() -> Annotated[pd.DataFrame, HTMLVisualizer(max_rows=200)]:
    return pd.read_csv("data.csv")

# Multiple effects run left-to-right
class DatasetLogger(AfterEffect):
    def __call__(self, result, was_cached, func_name, execution_time):
        if not was_cached:  # Only log once
            mlflow.log_metric(f"{func_name}_rows", len(result))

@step(cache=CacheConfig(version=1))
def process() -> Annotated[pd.DataFrame, DatasetLogger(), HTMLVisualizer()]:
    # Both effects run automatically after the function completes
    return load_data()
```
**Error Handling**: Control whether AfterEffect failures stop execution:

```python
# Default: effect errors are logged but don't stop execution
@step(cache=CacheConfig(version=1))
def safe_pipeline() -> Annotated[pd.DataFrame, MyVisualizer()]:
    return data

# Strict mode: effect errors raise exceptions
@step(cache=CacheConfig(version=1), error_on_effect_failure=True)
def strict_pipeline() -> Annotated[pd.DataFrame, MyVisualizer()]:
    return data
```
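Under the hood, effects declared in an `Annotated` return type can be discovered via standard `typing` introspection. The sketch below shows that mechanism in isolation; `run_effects` and `Shout` are illustrative names, and kissml's real `@step` additionally passes cache status and timing to each effect.

```python
# Sketch: pull effect instances out of an Annotated return type and
# fire them, left to right, after the wrapped function returns.
import typing
from typing import Annotated, get_type_hints, get_args

ran = []

class Shout:
    def __call__(self, result):
        ran.append(f"got {result}")

def run_effects(func):
    def inner(*args, **kwargs):
        result = func(*args, **kwargs)
        hints = get_type_hints(func, include_extras=True)
        ret = hints.get("return")
        if ret is not None and typing.get_origin(ret) is Annotated:
            for effect in get_args(ret)[1:]:  # skip the base type
                effect(result)  # fire each declared effect in order
        return result
    return inner

@run_effects
def make_number() -> Annotated[int, Shout()]:
    return 7

value = make_number()  # Shout fires after the return
```

`include_extras=True` is what preserves the `Annotated` metadata; without it, `get_type_hints` strips the effect instances away.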
**Global AfterEffects**: Register an AfterEffect once and have it fire after every `@step` call -- no per-step annotation required. Useful for cross-cutting concerns like logging, persistence, or experiment tracking.

```python
import logging

import pandas as pd

from kissml import settings, step, AfterEffect

class StepTimingLogger(AfterEffect):
    """Log every step's name, runtime, and cache status."""

    def __call__(self, result, was_cached, func_name, execution_time):
        status = "cached" if was_cached else "fresh"
        logging.info(
            f"{func_name} finished in {execution_time:.3f}s ({status})"
        )

# Register once -- fires for every @step call from now on
settings.global_after_effects.append(StepTimingLogger())

@step()
def load_data() -> pd.DataFrame:
    return pd.read_csv("data.csv")  # StepTimingLogger runs after this returns

@step()
def transform(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna()  # StepTimingLogger runs here too
```
Per-step effects (declared in the return annotation) fire first, then global effects. Both honor the error_on_effect_failure flag on the step.
## Configuration

Configure the cache directory via settings or an environment variable:

```python
from pathlib import Path

from kissml import settings

# Set the cache directory
settings.cache_directory = Path("/path/to/cache")
```

```bash
# Or use an environment variable
export KISSML_CACHE_DIRECTORY=/path/to/cache
```
## Custom Serialization

Register custom serializers for your types:

```python
from typing import Any, BinaryIO

from kissml.settings import settings
from kissml.types import Serializer

class MyCustomSerializer(Serializer):
    def serialize(self, value: Any, out: BinaryIO) -> None:
        # Your serialization logic
        ...

    def deserialize(self, stream: BinaryIO) -> Any:
        # Your deserialization logic
        ...

# Register the serializer
settings.serialize_by_type[MyCustomType] = MyCustomSerializer()

# Register a hash function for cache keys
settings.hash_by_type[MyCustomType] = lambda obj: str(hash(obj))
```
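To make the skeleton above concrete, here is a filled-in serializer for a hypothetical `Point` type, writing to and reading from binary streams. The `Point` type and the `struct` packing are assumptions for illustration, not part of kissml.

```python
# Illustrative serializer following the Serializer shape: pack two
# doubles into the output stream, read them back on deserialize.
import io
import struct
from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float

class PointSerializer:
    def serialize(self, value, out):
        # Two little-endian doubles: 16 bytes total
        out.write(struct.pack("<dd", value.x, value.y))

    def deserialize(self, stream):
        x, y = struct.unpack("<dd", stream.read(16))
        return Point(x, y)

ser = PointSerializer()
buf = io.BytesIO()
ser.serialize(Point(1.5, -2.0), buf)
buf.seek(0)
restored = ser.deserialize(buf)
```

In real use you would subclass `kissml.types.Serializer` and register the instance in `settings.serialize_by_type[Point]` as shown above.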
## License

Licensed under CC BY-NC-ND 4.0 (Attribution-NonCommercial-NoDerivatives). This is a non-commercial license -- see the LICENSE file for full details.

For commercial use, please contact the author.