Pulumi EKS ML Infrastructure

An opinionated infrastructure library for building scalable Machine Learning platforms on AWS.


💡 Why This Project?

Building ML infrastructure is complex. It's not just "spinning up a cluster"; it requires stitching together networking, compute, storage, GPU management, ingress, and observability into a cohesive platform.

Traditionally, teams face a choice:

  1. Monolithic "World" Repos: Everything in one giant Terraform state or Pulumi stack. Safe at first, but terrifying to update as it grows.
  2. Fragmented Scripts: A collection of disconnected scripts that are hard to replicate across environments (Dev vs Prod).

pulumi_eks_ml solves this by treating infrastructure as a composable library.

  • Modular: Instead of one rigid architecture, you get building blocks (VPC, EKS, Karpenter, GPUs) to assemble your specific topology.
  • Multi-Region Native: Seamlessly peer VPCs across regions for global inference or disaster recovery.
  • ML-Optimized: We've pre-baked the hard stuff—GPU drivers, Karpenter autoscaling for Spot instances, and optimized node pools.
  • Environment Parity: Define your topology once in code, then deploy it identically to Dev, Staging, and Prod using simple configuration.

📦 What's Inside?

The repository provides a Python package (pulumi_eks_ml) containing high-level, opinionated components:

🌐 Networking (vpc)

  • Hub-and-Spoke Topology: Connect a central "Hub" VPC to multiple regional "Spoke" VPCs automatically.
  • Routing: Handles the complex peering routes and security group rules for you.
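
To make the hub-and-spoke idea concrete, here is a minimal planning sketch in plain Python. It is not the library's implementation: the function name `plan_hub_and_spoke`, the `10.0.0.0/8` supernet, the region names, and the `peer-<region>` labels are all assumptions for illustration. It only shows the two things any such topology needs: non-overlapping CIDRs, and one route in each direction per hub-to-spoke peering.

```python
import ipaddress


def plan_hub_and_spoke(supernet: str, spoke_regions: list[str], new_prefix: int = 16) -> dict:
    """Carve non-overlapping VPC CIDRs from one supernet and list the peering routes.

    Hypothetical helper for illustration; pulumi_eks_ml may plan this differently.
    """
    blocks = ipaddress.ip_network(supernet).subnets(new_prefix=new_prefix)
    hub_cidr = next(blocks)  # first block goes to the hub
    spokes = {region: next(blocks) for region in spoke_regions}

    # Each spoke peers only with the hub, so every peering needs two routes:
    # hub -> spoke CIDR and spoke -> hub CIDR.
    routes = []
    for region, cidr in spokes.items():
        routes.append(("hub", str(cidr), f"peer-{region}"))
        routes.append((region, str(hub_cidr), f"peer-{region}"))

    return {
        "hub": str(hub_cidr),
        "spokes": {region: str(cidr) for region, cidr in spokes.items()},
        "routes": routes,
    }


plan = plan_hub_and_spoke("10.0.0.0/8", ["us-west-2", "eu-west-1"])
for source, destination, via in plan["routes"]:
    print(f"{source}: route {destination} via {via}")
```

Because spokes never peer with each other directly, the number of peering connections grows linearly with the number of regions rather than quadratically.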

🧠 Compute (eks)

  • Secure EKS Clusters: Private endpoints, Fargate control planes, and OIDC identity providers pre-configured.
  • Karpenter Autoscaling: The gold standard for ML compute. Automatically provisions GPU/CPU nodes based on pending pod demand and supports Spot instances to reduce costs.
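
For readers unfamiliar with Karpenter, the sketch below builds the kind of NodePool manifest that restricts provisioning to Spot capacity and a set of instance types. It assumes the upstream Karpenter `karpenter.sh/v1` API and an existing `EC2NodeClass` named `default`; the helper name `gpu_spot_node_pool` and the GPU limit are hypothetical, and the manifests this library actually generates may differ.

```python
import json


def gpu_spot_node_pool(name: str, instance_types: list[str], max_gpus: int = 8) -> dict:
    """Build a Karpenter NodePool manifest as a plain dict (illustrative only)."""
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    # Assumes an EC2NodeClass called "default" already exists.
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": "default",
                    },
                    "requirements": [
                        # Only launch Spot capacity for this pool.
                        {"key": "karpenter.sh/capacity-type", "operator": "In", "values": ["spot"]},
                        # Only launch the listed instance types.
                        {
                            "key": "node.kubernetes.io/instance-type",
                            "operator": "In",
                            "values": list(instance_types),
                        },
                    ],
                }
            },
            # Cap the total GPUs this pool may provision.
            "limits": {"nvidia.com/gpu": str(max_gpus)},
        },
    }


pool = gpu_spot_node_pool("gpu-workload", ["g5.xlarge"])
print(json.dumps(pool, indent=2))
```

Karpenter watches for pods that cannot be scheduled and launches nodes that satisfy both the pod's requests and the pool's requirements, which is why the constraints live on the pool rather than on a fixed-size node group.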

🧩 Addons (eks_addons)

Ready-to-use integrations that turn a raw cluster into a platform:

  • NvidiaDevicePlugin: Enable GPU workloads immediately.
  • AlbController: AWS Application Load Balancer management for ingress.
  • EbsCsi: AWS EBS CSI driver for block storage.
  • EfsCsi: AWS EFS CSI driver for shared file storage (ideal for model weights).
  • FluentBit: Ship logs to CloudWatch/S3/ES.
  • MetricsServer: Essential for Horizontal Pod Autoscaling (HPA).
  • Tailscale: Secure subnet router for private cluster access.
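
Once the NVIDIA device plugin is running, GPUs are exposed to the scheduler as the `nvidia.com/gpu` resource. The sketch below builds a minimal smoke-test Pod manifest that requests one GPU and runs `nvidia-smi`. The helper name and the container image tag are placeholders chosen for illustration, not something this library ships.

```python
import json


def gpu_smoke_test_pod(name: str = "gpu-smoke-test", gpus: int = 1) -> dict:
    """Build a Pod manifest that requests GPUs via the device plugin resource."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "cuda",
                    # Placeholder image; substitute any CUDA-enabled image you trust.
                    "image": "nvidia/cuda:12.4.1-base-ubuntu22.04",
                    "command": ["nvidia-smi"],
                    # GPUs are requested through limits on the extended resource.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


pod = gpu_smoke_test_pod()
print(json.dumps(pod, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML.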

🚀 Applications (eks_apps)

  • SkyPilot: Deploy the multi-cloud job orchestration server with one line of code.

🏗 How to Organize Your Infrastructure

We recommend an Independent Project structure. Treat this repo as a dependency (like a library), and build your actual infrastructure in separate project folders.

The Model: Projects & Stacks

  1. Project: Represents a specific topology (e.g., "Training Platform", "Model Serving").
  2. Stack: Represents an environment for that topology (e.g., dev, staging, prod).

This ensures that your "Training Platform" is completely isolated from your "Web App", but your dev training environment is an exact mirror of prod.
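
The project/stack split can be pictured as one topology function fed by per-stack settings. The following is a plain-Python sketch of that idea, not Pulumi code: `StackSettings`, `SETTINGS`, and `describe_topology` are hypothetical names, and the regions are made-up examples. In a real project the values would live in `Pulumi.<stack>.yaml` and be read with `pulumi.Config()`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StackSettings:
    """Per-environment knobs; the topology code itself never changes."""
    gpu_instance_type: str
    regions: tuple[str, ...]


# Hypothetical values mirroring "small instances, 1 region" vs "P4d, 3 regions".
SETTINGS = {
    "dev": StackSettings("g5.xlarge", ("us-west-2",)),
    "prod": StackSettings("p4d.24xlarge", ("us-west-2", "us-east-1", "eu-west-1")),
}


def describe_topology(stack: str) -> list[str]:
    """Same definition for every stack; only the injected settings differ."""
    settings = SETTINGS[stack]
    return [
        f"{stack}-{region}: vpc + eks + gpu pool ({settings.gpu_instance_type})"
        for region in settings.regions
    ]


for stack_name in SETTINGS:
    for line in describe_topology(stack_name):
        print(line)
```

Keeping the differences between environments in data rather than in code is what lets dev stay a faithful, smaller copy of prod.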

Directory Structure Example

.
├── pulumi_eks_ml/               # 📦 The Shared Library (Infrastructure Code)
├── pyproject.toml
│
├── projects/                    # 🚀 Your Live Infrastructure
│   │
│   ├── ml-training-platform/    # PROJECT 1: Heavy GPU training
│   │   ├── __main__.py          # Definition: VPC + EKS + GPU Pools
│   │   ├── Pulumi.dev.yaml      # Config: Small instances, 1 region
│   │   └── Pulumi.prod.yaml     # Config: P4d instances, 3 regions
│   │
│   └── model-inference-api/     # PROJECT 2: High-uptime CPU/Inf1 serving
│       ├── __main__.py          # Definition: Multi-region VPC + EKS
│       ├── Pulumi.staging.yaml
│       └── Pulumi.prod.yaml

🛠 Getting Started

Prerequisites

1. Install Dependencies

Install the library in your environment:

uv sync --dev

2. Create Your Project

Create a folder for your new infrastructure topology.

mkdir -p projects/my-ml-platform && cd projects/my-ml-platform
# Choose uv as the toolchain
uv run pulumi new python --name my-ml-platform --force
# Add 'pulumi_eks_ml' as an editable dependency
uv add ../../. --editable
# Remove the requirements.txt and main.py files (unnecessary)
rm requirements.txt main.py
# Source the project's virtual environment (IMPORTANT!)
source .venv/bin/activate

A starter project is available in projects/starter for reference.

3. Initialize Environments

Create stacks for the environments you need to support.

pulumi stack init dev
pulumi stack init prod

4. Write Your Infrastructure Code

In projects/my-ml-platform/__main__.py, import the library and define your platform.

import pulumi
from pulumi_eks_ml import vpc, eks, eks_addons

# 1. Load Environment Config
cfg = pulumi.Config()
instance_type = cfg.require("gpuInstanceType")
env_name = pulumi.get_stack()

# 2. Define Networking
# Creates a VPC isolated to this environment
my_vpc = vpc.Vpc(f"{env_name}-vpc")

# 3. Define Compute
cluster = eks.EKSCluster(
    f"{env_name}-cluster",
    vpc_id=my_vpc.vpc_id,
    subnet_ids=my_vpc.private_subnet_ids,
    # Define Node Pools
    node_pools=[
        eks.NodePoolConfig(
            name="gpu-workload",
            instance_type=instance_type,  # Injected from stack config!
            capacity_type="spot",         # Save money on training
        )
    ],
)

# 3b. Enable Platform Services (install addons)
addon_installations = eks.cluster.EKSClusterAddonInstaller(
    f"{env_name}-addons",
    cluster=cluster,
    addon_types=eks_addons.recommended_addons(),
)

# 4. Export Outputs
pulumi.export("kubeconfig", cluster.kubeconfig)

5. Configure & Deploy

Set the variables for your dev stack and deploy.

# Configure Dev
pulumi stack select dev
pulumi config set aws:region us-west-2
pulumi config set gpuInstanceType g5.xlarge

# Deploy
uv run pulumi up
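
When several stacks need the same keys with different values, it can help to generate the configuration commands instead of typing them. The sketch below is one possible approach, assuming only the `pulumi stack select` and `pulumi config set` commands shown above; the helper name `config_commands` and the prod values are hypothetical.

```python
import shlex


def config_commands(stack: str, settings: dict[str, str]) -> list[list[str]]:
    """Return the CLI invocations that select a stack and set its config keys."""
    commands = [["pulumi", "stack", "select", stack]]
    commands += [["pulumi", "config", "set", key, value] for key, value in settings.items()]
    return commands


# Hypothetical per-stack values; the dev values match the example above.
STACKS = {
    "dev": {"aws:region": "us-west-2", "gpuInstanceType": "g5.xlarge"},
    "prod": {"aws:region": "us-west-2", "gpuInstanceType": "p4d.24xlarge"},
}

for stack_name, stack_settings in STACKS.items():
    for command in config_commands(stack_name, stack_settings):
        # Print for review; pass `command` to subprocess.run(command, check=True)
        # from inside the project directory to actually apply it.
        print(shlex.join(command))
```

Printing first and executing second keeps the script safe to run as a dry run before anything touches a real stack.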

🧪 Testing

We treat infrastructure code like software. The library includes tests you can run locally.

# Run Unit Tests (Fast, mocked AWS calls)
uv run pytest -vv tests/unit

# Run Integration Tests (Real provisioning against LocalStack - no need to start LocalStack manually)
uv run pytest -vv tests/integration

📄 License

MIT
