Kubeflow Python SDK to manage ML workloads and to interact with Kubeflow APIs.

Kubeflow SDK

Overview

The Kubeflow SDK is a set of unified Pythonic APIs that let you run any AI workload at any scale – without the need to learn Kubernetes. It provides simple and consistent APIs across the Kubeflow ecosystem, enabling users to focus on building AI applications rather than managing complex infrastructure.

Kubeflow SDK Benefits

  • Unified Experience: Single SDK to interact with multiple Kubeflow projects through consistent Python APIs
  • Simplified AI Workloads: Abstract away Kubernetes complexity and work effortlessly across all Kubeflow projects using familiar Python APIs
  • Built for Scale: Seamlessly scale any AI workload from a local laptop to a large-scale production cluster with thousands of GPUs, using the same APIs
  • Rapid Iteration: Reduced friction between development and production environments
  • Local Development: First-class support for local development without a Kubernetes cluster, requiring only a pip installation

Kubeflow SDK Introduction

The following KubeCon + CloudNativeCon 2025 talk provides an overview of Kubeflow SDK:

Kubeflow SDK

Additionally, check out these demos to deep dive into Kubeflow SDK capabilities:

Get Started

Install Kubeflow SDK

pip install -U kubeflow

Run your first PyTorch distributed job

from kubeflow.trainer import TrainerClient, CustomTrainer, TrainJobTemplate

def get_torch_dist(learning_rate: str, num_epochs: str):
    import os
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="gloo")
    print("PyTorch Distributed Environment")
    print(f"WORLD_SIZE: {dist.get_world_size()}")
    print(f"RANK: {dist.get_rank()}")
    print(f"LOCAL_RANK: {os.environ['LOCAL_RANK']}")

    lr = float(learning_rate)
    epochs = int(num_epochs)
    loss = 1.0 - (lr * 2) - (epochs * 0.01)

    if dist.get_rank() == 0:
        print(f"loss={loss}")

# Create the TrainJob template
template = TrainJobTemplate(
    runtime="torch-distributed",
    trainer=CustomTrainer(
        func=get_torch_dist,
        func_args={"learning_rate": "0.01", "num_epochs": "5"},
        num_nodes=3,
        resources_per_node={"cpu": 2},
    ),
)

# Create the TrainJob
job_id = TrainerClient().train(**template)

# Wait for TrainJob to complete
TrainerClient().wait_for_job_status(job_id)

# Print TrainJob logs
print("\n".join(TrainerClient().get_job_logs(name=job_id)))
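
Because func_args are passed as strings in the example above, the training function converts them before use. The following is a minimal sketch for checking that conversion and the placeholder loss formula on your own machine, without a cluster or PyTorch. The name synthetic_loss is hypothetical and not part of the SDK; it only mirrors the arithmetic inside get_torch_dist.

```python
def synthetic_loss(learning_rate: str, num_epochs: str) -> float:
    """Mirror the placeholder loss formula used in get_torch_dist (hypothetical helper)."""
    lr = float(learning_rate)   # func_args arrive as strings
    epochs = int(num_epochs)
    return 1.0 - (lr * 2) - (epochs * 0.01)


# Same arguments as the TrainJob template above
func_args = {"learning_rate": "0.01", "num_epochs": "5"}
loss = synthetic_loss(**func_args)
print(f"loss={loss:.2f}")  # prints "loss=0.93"
```

Running the arithmetic locally first makes it easier to tell a bug in the function body from a problem with the distributed setup.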

Optimize hyperparameters for your training

from kubeflow.optimizer import OptimizerClient, Search, TrialConfig

# Create OptimizationJob with the same template
optimization_id = OptimizerClient().optimize(
    trial_template=template,
    trial_config=TrialConfig(num_trials=10, parallel_trials=2),
    search_space={
        "learning_rate": Search.loguniform(0.001, 0.1),
        "num_epochs": Search.choice([5, 10, 15]),
    },
)

print(f"OptimizationJob created: {optimization_id}")
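
To build intuition for the search space above, here is a small standard-library sketch of what log-uniform and categorical sampling mean. It is an illustration only, not the optimizer's implementation: the helpers sample_loguniform and sample_choice are hypothetical, and the real search algorithm may choose values differently. The round count also assumes that parallel_trials is the maximum number of trials running at once and that all trials take equal time.

```python
import math
import random


def sample_loguniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_choice(options: list, rng: random.Random):
    """Pick one option uniformly at random."""
    return rng.choice(options)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
num_trials, parallel_trials = 10, 2

# One hyperparameter assignment per trial, drawn from the same ranges as above
trials = [
    {
        "learning_rate": sample_loguniform(0.001, 0.1, rng),
        "num_epochs": sample_choice([5, 10, 15], rng),
    }
    for _ in range(num_trials)
]

# Assumption: with equal-length trials, 10 trials at 2 in parallel need 5 rounds
num_rounds = math.ceil(num_trials / parallel_trials)

for trial in trials:
    print(f"lr={trial['learning_rate']:.5f} epochs={trial['num_epochs']}")
print(f"rounds={num_rounds}")
```

Log-uniform sampling spreads draws evenly across orders of magnitude, so values between 0.001 and 0.01 are as likely as values between 0.01 and 0.1, which suits learning rates.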

Run data processing with Spark Connect

Install Kubeflow Spark support:

pip install 'kubeflow[spark]'

To install the Spark Operator, see the installation guide.

from kubeflow.spark import KubernetesBackendConfig, SparkClient

client = SparkClient(KubernetesBackendConfig(namespace="spark-test"))
spark = client.connect()

df = spark.range(5)
df.show()

You should see the DataFrame:

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+

You can also configure the number of executors and their resources:

spark = client.connect(
    num_executors=5,
    resources_per_executor={"cpu": "5", "memory": "1Gi"},
)

df = spark.range(5)
df.show()
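
Before requesting executors, it can help to estimate what the cluster must provide in total. The sketch below multiplies the per-executor values from the example above. The helpers parse_memory and total_request are hypothetical and not part of the SDK; they handle only whole-number CPU counts and binary memory suffixes, and they ignore the driver and any memory overhead.

```python
# Binary suffixes used by Kubernetes-style memory quantities (subset only)
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '1Gi' to bytes (hypothetical helper)."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain number of bytes


def total_request(num_executors: int, resources_per_executor: dict) -> tuple:
    """Return (total CPUs, total memory in bytes) across all executors."""
    cpus = num_executors * int(resources_per_executor["cpu"])
    memory = num_executors * parse_memory(resources_per_executor["memory"])
    return cpus, memory


cpus, memory_bytes = total_request(5, {"cpu": "5", "memory": "1Gi"})
print(f"CPUs: {cpus}, memory: {memory_bytes // 2**30}Gi")  # prints "CPUs: 25, memory: 5Gi"
```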

Manage models with Model Registry

Install Model Registry support:

pip install 'kubeflow[hub]'

To install the Model Registry server, see the installation guide.

from kubeflow.hub import ModelRegistryClient

client = ModelRegistryClient("https://model-registry.kubeflow.svc.cluster.local", author="Your Name")

# Register a model
model = client.register_model(
    name="my-model",
    uri="s3://bucket/path/to/model",
    version="v1.0.0",
    model_format_name="pytorch",
    model_format_version="2.0",
    version_description="My trained model"
)

# Get a registered model
model = client.get_model("my-model")

# List all models
for model in client.list_models():
    print(f"Model: {model.name}")

# List model versions
for version in client.list_model_versions("my-model"):
    print(f"Version: {version.name}")

You can also initialize the client using different port configurations:

ModelRegistryClient("https://example.org", port=456)  # Explicit port argument
ModelRegistryClient("https://example.org:456")        # Port parsed from base_url
ModelRegistryClient("https://example.org")            # Default port (443 for https, 8080 for http)
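
The three cases above can be read as a simple precedence rule. The following sketch shows one way such resolution could work, using only the standard library. It is not the client's actual implementation: resolve_port is a hypothetical function, the defaults are taken from the comment above, and the rule that an explicit port argument wins over a port in the URL is an assumption.

```python
from typing import Optional
from urllib.parse import urlsplit

# Defaults as stated in the comment above (443 for https, 8080 for http)
DEFAULT_PORTS = {"https": 443, "http": 8080}


def resolve_port(base_url: str, port: Optional[int] = None) -> int:
    """Pick a port: explicit argument, then URL, then scheme default (hypothetical)."""
    if port is not None:
        return port  # assumption: the explicit argument takes precedence
    parts = urlsplit(base_url)
    if parts.port is not None:
        return parts.port  # port parsed from base_url
    return DEFAULT_PORTS[parts.scheme]


print(resolve_port("https://example.org", port=456))  # prints 456
print(resolve_port("https://example.org:456"))        # prints 456
print(resolve_port("https://example.org"))            # prints 443
print(resolve_port("http://example.org"))             # prints 8080
```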

Local Development

The Kubeflow Trainer client supports local development without a Kubernetes cluster.

Available Backends

  • KubernetesBackend (default) - Production training on Kubernetes
  • ContainerBackend - Local development with Docker/Podman isolation
  • LocalProcessBackend - Quick prototyping with Python subprocesses

Quick Start: Install container support with pip install 'kubeflow[docker]' or pip install 'kubeflow[podman]' (the quotes keep shells such as zsh from expanding the brackets).

from kubeflow.trainer import TrainerClient, ContainerBackendConfig, CustomTrainer

# Placeholder training function; replace the body with your own training code
def train_fn():
    print("Training locally")

# Switch to local container execution
client = TrainerClient(backend_config=ContainerBackendConfig())

# Your training runs locally in isolated containers
job_id = client.train(trainer=CustomTrainer(func=train_fn))

Supported Kubeflow Projects

| Project | Status | Version Support | Description |
|---|---|---|---|
| Kubeflow Trainer | Available | v2.0.0+ | Train and fine-tune AI models with various frameworks |
| Kubeflow Katib | Available | v0.19.0+ | Hyperparameter optimization |
| Kubeflow Model Registry | Available | v0.3.0+ | Manage model artifacts, versions and ML artifacts metadata |
| Kubeflow Spark Operator | Available | v2.5.0+ | Manage Spark applications for data processing and feature engineering |
| Kubeflow Pipelines | 🚧 Planned | TBD | Build, run, and track AI workflows |
| Feast | 🚧 Planned | TBD | Feature store for machine learning |

Community

Contributing

Kubeflow SDK is a community project and is still under active development. We welcome contributions! Please see our CONTRIBUTING Guide for details.

