Pulumi EKS ML Infrastructure
An opinionated infrastructure library for building scalable Machine Learning platforms on AWS.
💡 Why This Project?
Building ML infrastructure is complex. It's not just "spinning up a cluster"; it requires stitching together networking, compute, storage, GPU management, ingress, and observability into a cohesive platform.
Traditionally, teams face a choice:
- Monolithic "World" Repos: Everything in one giant Terraform state or Pulumi stack. Safe at first, but terrifying to update as it grows.
- Fragmented Scripts: A collection of disconnected scripts that are hard to replicate across environments (Dev vs Prod).
pulumi_eks_ml solves this by treating infrastructure as a composable library.
- Modular: Instead of one rigid architecture, you get building blocks (VPC, EKS, Karpenter, GPUs) to assemble your specific topology.
- Multi-Region Native: Seamlessly peer VPCs across regions for global inference or disaster recovery.
- ML-Optimized: We've pre-baked the hard stuff—GPU drivers, Karpenter autoscaling for Spot instances, and optimized node pools.
- Environment Parity: Define your topology once in code, then deploy it identically to Dev, Staging, and Prod, varying only the per-stack configuration.
📦 What's Inside?
The repository provides a Python package (pulumi_eks_ml) containing high-level, opinionated components:
🌐 Networking (vpc)
- Hub-and-Spoke Topology: Connect a central "Hub" VPC to multiple regional "Spoke" VPCs automatically.
- Routing: Handles the complex peering routes and security group rules for you.
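To make the hub-and-spoke idea concrete, here is a minimal, library-independent sketch of the bookkeeping such a topology implies: one peering per spoke, and a route in each direction per peering. `VpcSpec` and `plan_hub_and_spoke` are hypothetical names used only for illustration; they are not part of pulumi_eks_ml, and the library's actual implementation may differ.

```python
from dataclasses import dataclass
from ipaddress import ip_network


@dataclass(frozen=True)
class VpcSpec:
    """Hypothetical description of a VPC (not a pulumi_eks_ml type)."""
    name: str
    region: str
    cidr: str


def plan_hub_and_spoke(hub: VpcSpec, spokes: list[VpcSpec]) -> dict:
    """Return the peerings and routes a hub-and-spoke topology needs."""
    vpcs = [hub, *spokes]
    # VPC peering requires non-overlapping CIDR blocks, so check every pair.
    for i, a in enumerate(vpcs):
        for b in vpcs[i + 1:]:
            if ip_network(a.cidr).overlaps(ip_network(b.cidr)):
                raise ValueError(f"{a.name} and {b.name} have overlapping CIDRs")

    # One peering connection between the hub and each spoke.
    peerings = [(hub.name, spoke.name) for spoke in spokes]

    routes = []
    for spoke in spokes:
        link = f"{hub.name}<->{spoke.name}"
        # Each peering needs a route on both sides.
        routes.append({"vpc": hub.name, "destination": spoke.cidr, "via": link})
        routes.append({"vpc": spoke.name, "destination": hub.cidr, "via": link})
    return {"peerings": peerings, "routes": routes}


if __name__ == "__main__":
    hub = VpcSpec("hub", "us-west-2", "10.0.0.0/16")
    spokes = [
        VpcSpec("spoke-a", "us-east-1", "10.1.0.0/16"),
        VpcSpec("spoke-b", "eu-west-1", "10.2.0.0/16"),
    ]
    for route in plan_hub_and_spoke(hub, spokes)["routes"]:
        print(route)
```

Note that VPC peering is not transitive: in this plan the spokes can reach the hub, but not each other, unless extra peerings are added.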
🧠 Compute (eks)
- Secure EKS Clusters: Private endpoints, Fargate control planes, and OIDC identity providers pre-configured.
- Karpenter Autoscaling: The gold standard for ML compute. Automatically provisions GPU/CPU nodes based on pending pod demand, with support for Spot instances to reduce costs.
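For readers unfamiliar with Karpenter, the sketch below builds the kind of NodePool manifest that restricts provisioning to Spot capacity and a list of GPU instance types. It assumes the Karpenter v1 API (`karpenter.sh/v1`) and an existing EC2NodeClass named `default`; the helper `gpu_spot_node_pool`, the GPU taint, and the GPU limit are illustrative assumptions, not what pulumi_eks_ml necessarily generates.

```python
import json


def gpu_spot_node_pool(
    name: str,
    instance_types: list[str],
    node_class: str = "default",  # assumed name of an existing EC2NodeClass
    gpu_limit: int = 8,           # assumed cap on total GPUs for this pool
) -> dict:
    """Build a Karpenter NodePool manifest as a plain dict (illustrative only)."""
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": node_class,
                    },
                    "requirements": [
                        # Only launch Spot capacity.
                        {"key": "karpenter.sh/capacity-type",
                         "operator": "In", "values": ["spot"]},
                        # Only launch the listed instance types.
                        {"key": "node.kubernetes.io/instance-type",
                         "operator": "In", "values": list(instance_types)},
                    ],
                    # Keep non-GPU pods off these nodes unless they tolerate the taint.
                    "taints": [
                        {"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"}
                    ],
                }
            },
            # Stop provisioning once the pool holds this many GPUs.
            "limits": {"nvidia.com/gpu": str(gpu_limit)},
        },
    }


if __name__ == "__main__":
    print(json.dumps(gpu_spot_node_pool("gpu-workload", ["g5.xlarge"]), indent=2))
```

Karpenter watches for pods that cannot be scheduled and launches nodes that satisfy both the pod's requests and the pool's requirements, which is why the pool only declares constraints rather than a fixed node count.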
🧩 Addons (eks_addons)
Ready-to-use integrations that turn a raw cluster into a platform:
- NvidiaDevicePlugin: Enable GPU workloads immediately.
- AlbController: AWS Application Load Balancer management for ingress.
- EbsCsi: AWS EBS CSI driver for block storage.
- EfsCsi: AWS EFS CSI driver for shared file storage (ideal for model weights).
- FluentBit: Ship logs to CloudWatch/S3/ES.
- MetricsServer: Essential for Horizontal Pod Autoscaling (HPA).
- Tailscale: Secure subnet router for private cluster access.
🚀 Applications (eks_apps)
- SkyPilot: Deploy the multi-cloud job orchestration server with one line of code.
🏗 How to Organize Your Infrastructure
We recommend an Independent Project structure. Treat this repo as a dependency (like a library), and build your actual infrastructure in separate project folders.
The Model: Projects & Stacks
- Project: Represents a specific Topology. (e.g., "Training Platform", "Model Serving").
- Stack: Represents an Environment for that topology (e.g., dev, staging, prod).
This ensures that your "Training Platform" is completely isolated from your "Model Serving" project, while your dev training environment runs the same topology code as prod and differs only in stack configuration.
Directory Structure Example
.
├── pulumi_eks_ml/ # 📦 The Shared Library (Infrastructure Code)
├── pyproject.toml
│
├── projects/ # 🚀 Your Live Infrastructure
│ │
│ ├── ml-training-platform/ # PROJECT 1: Heavy GPU training
│ │ ├── __main__.py # Definition: VPC + EKS + GPU Pools
│ │ ├── Pulumi.dev.yaml # Config: Small instances, 1 region
│ │ └── Pulumi.prod.yaml # Config: P4d instances, 3 regions
│ │
│ └── model-inference-api/ # PROJECT 2: High-uptime CPU/Inf1 serving
│ ├── __main__.py # Definition: Multi-region VPC + EKS
│ ├── Pulumi.staging.yaml
│ └── Pulumi.prod.yaml
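With this layout, the set of environments for each project can be read straight off the file names. The helper below is a small sketch, assuming the `projects/` directory shown above and Pulumi's `Pulumi.<stack>.yaml` naming for stack settings files; `discover_stacks` is a hypothetical name and not part of the library.

```python
from pathlib import Path

PREFIX = "Pulumi."
SUFFIX = ".yaml"


def discover_stacks(projects_dir: Path) -> dict[str, list[str]]:
    """Map each project folder to the stack names found in its settings files."""
    stacks: dict[str, list[str]] = {}
    for project in sorted(p for p in projects_dir.iterdir() if p.is_dir()):
        # "Pulumi.dev.yaml" -> "dev"; the project file "Pulumi.yaml" does not match.
        stacks[project.name] = sorted(
            f.name[len(PREFIX):-len(SUFFIX)]
            for f in project.glob("Pulumi.*.yaml")
        )
    return stacks


if __name__ == "__main__":
    root = Path("projects")
    if root.is_dir():
        for project, names in discover_stacks(root).items():
            print(f"{project}: {', '.join(names) or '(no stacks)'}")
```

A check like this can be handy in CI, for example to confirm that every project defines a prod stack before a release.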
🛠 Getting Started
Prerequisites
- Pulumi CLI
- Python 3.12+
- uv (recommended) or pip
1. Install Dependencies
Install the library in your environment:
uv sync --dev
2. Create Your Project
Create a folder for your new infrastructure topology.
mkdir -p projects/my-ml-platform && cd projects/my-ml-platform
# Choose uv as the toolchain when prompted
uv run pulumi new python --name my-ml-platform --force
# Add 'pulumi_eks_ml' (the repository root) as an editable dependency
uv add ../../. --editable
# Remove the requirements.txt and main.py files (unnecessary)
rm requirements.txt main.py
# Source the project's virtual environment (IMPORTANT!)
source .venv/bin/activate
A starter project is available in projects/starter for use as a reference.
3. Initialize Environments
Create stacks for the environments you need to support.
pulumi stack init dev
pulumi stack init prod
4. Write Your Infrastructure Code
In projects/my-ml-platform/__main__.py, import the library and define your platform.
import pulumi
from pulumi_eks_ml import vpc, eks, eks_addons
# 1. Load Environment Config
cfg = pulumi.Config()
instance_type = cfg.require("gpuInstanceType")
env_name = pulumi.get_stack()
# 2. Define Networking
# Creates a VPC isolated to this environment
my_vpc = vpc.Vpc(f"{env_name}-vpc")
# 3. Define Compute
cluster = eks.EKSCluster(
    f"{env_name}-cluster",
    vpc_id=my_vpc.vpc_id,
    subnet_ids=my_vpc.private_subnet_ids,
    # Define Node Pools
    node_pools=[
        eks.NodePoolConfig(
            name="gpu-workload",
            instance_type=instance_type,  # Injected from stack config!
            capacity_type="spot",  # Save money on training
        )
    ],
)
# 3b. Enable Platform Services (install addons)
addon_installations = eks.cluster.EKSClusterAddonInstaller(
    f"{env_name}-addons",
    cluster=cluster,
    addon_types=eks_addons.recommended_addons(),
)
# 4. Export Outputs
pulumi.export("kubeconfig", cluster.kubeconfig)
5. Configure & Deploy
Set the variables for your dev stack and deploy.
# Configure Dev
pulumi stack select dev
pulumi config set aws:region us-west-2
pulumi config set gpuInstanceType g5.xlarge
# Deploy
uv run pulumi up
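It can help to know where these values end up. Pulumi writes them to the stack's settings file (here Pulumi.dev.yaml) under a `config:` key, prefixing bare keys with the project name while leaving already-namespaced keys such as `aws:region` unchanged; `pulumi.Config()` with no arguments then reads from the project namespace. The sketch below reproduces that naming rule in plain Python. The function names are hypothetical, and the project name `my-ml-platform` is taken from the steps above.

```python
import json


def stack_config_key(project: str, key: str) -> str:
    """Namespace a bare key with the project name; leave 'namespace:key' untouched."""
    return key if ":" in key else f"{project}:{key}"


def render_stack_config(project: str, settings: dict[str, str]) -> dict:
    """Return the mapping expected under 'config:' in Pulumi.<stack>.yaml."""
    return {
        "config": {stack_config_key(project, k): v for k, v in settings.items()}
    }


if __name__ == "__main__":
    dev = render_stack_config(
        "my-ml-platform",
        {"aws:region": "us-west-2", "gpuInstanceType": "g5.xlarge"},
    )
    # Printed as JSON for readability; the real file is YAML with the same keys.
    print(json.dumps(dev, indent=2))
```

Repeating the `pulumi config set` commands after `pulumi stack select prod` creates the equivalent Pulumi.prod.yaml, so the two environments share code and differ only in these values.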
🧪 Testing
We treat infrastructure code like software. The library includes tests you can run locally.
# Run Unit Tests (Fast, mocked AWS calls)
uv run pytest -vv tests/unit
# Run Integration Tests (Real provisioning against LocalStack - no need to start LocalStack manually)
uv run pytest -vv tests/integration
📄 License
File details
Details for the file pulumi_eks_ml-0.1.2.tar.gz.
File metadata
- Download URL: pulumi_eks_ml-0.1.2.tar.gz
- Upload date:
- Size: 24.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cd7c3fb0a0e78347caa080c21c18fb9c0ddfe36bf6b78862bb338ba15fd3d2de |
| MD5 | 8238fa39aacbade214574cbc71ab022c |
| BLAKE2b-256 | f16838cecabb37d83e13454a3724397a11e0ffd0f7dc67bd06e5c01d007a36fc |
Provenance
The following attestation bundles were made for pulumi_eks_ml-0.1.2.tar.gz:
Publisher: publish.yml on Roulbac/pulumi-eks-ml
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: pulumi_eks_ml-0.1.2.tar.gz
  - Subject digest: cd7c3fb0a0e78347caa080c21c18fb9c0ddfe36bf6b78862bb338ba15fd3d2de
  - Sigstore transparency entry: 870023631
  - Sigstore integration time:
  - Permalink: Roulbac/pulumi-eks-ml@8abf8c88a164c3411d0d43511b14df193cc3156e
  - Branch / Tag: refs/tags/v0.1.2
  - Owner: https://github.com/Roulbac
  - Access: public
  - Token Issuer: https://token.actions.githubusercontent.com
  - Runner Environment: github-hosted
  - Publication workflow: publish.yml@8abf8c88a164c3411d0d43511b14df193cc3156e
  - Trigger Event: release
File details
Details for the file pulumi_eks_ml-0.1.2-py3-none-any.whl.
File metadata
- Download URL: pulumi_eks_ml-0.1.2-py3-none-any.whl
- Upload date:
- Size: 36.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c8c30326dea57bf3ebef04416d9f66494f9a29713c2ac2f944a43308155f54c9 |
| MD5 | b8311f21b1127298a3371dfcef298e43 |
| BLAKE2b-256 | 599a627bd455408a3b0ecdc17efd470f7ce607bcfab98ba0890341362246f551 |
Provenance
The following attestation bundles were made for pulumi_eks_ml-0.1.2-py3-none-any.whl:
Publisher: publish.yml on Roulbac/pulumi-eks-ml
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: pulumi_eks_ml-0.1.2-py3-none-any.whl
  - Subject digest: c8c30326dea57bf3ebef04416d9f66494f9a29713c2ac2f944a43308155f54c9
  - Sigstore transparency entry: 870023657
  - Sigstore integration time:
  - Permalink: Roulbac/pulumi-eks-ml@8abf8c88a164c3411d0d43511b14df193cc3156e
  - Branch / Tag: refs/tags/v0.1.2
  - Owner: https://github.com/Roulbac
  - Access: public
  - Token Issuer: https://token.actions.githubusercontent.com
  - Runner Environment: github-hosted
  - Publication workflow: publish.yml@8abf8c88a164c3411d0d43511b14df193cc3156e
  - Trigger Event: release