Automatic SageMaker Pipeline Generation from DAG Specifications

Cursus: Automatic SageMaker Pipeline Generation

Python 3.8+ | License: MIT

Transform pipeline graphs into production-ready SageMaker pipelines automatically.

Cursus is a pipeline generation system that builds complete SageMaker pipelines from user-provided pipeline graphs. Define your ML workflow as a graph, and Cursus handles the SageMaker implementation details, dependency resolution, and configuration management for you.

🚀 Quick Start

Installation

# Core installation
pip install cursus

# With ML frameworks (quotes keep shells such as zsh from expanding the brackets)
pip install "cursus[pytorch,xgboost]"

# Full installation with all features
pip install "cursus[all]"

30-Second Example

from cursus.core import compile_dag_to_pipeline
from cursus.api import PipelineDAG
from sagemaker.workflow.pipeline_context import PipelineSession

# Create a simple DAG
dag = PipelineDAG()
dag.add_node("CradleDataLoading_training")
dag.add_node("TabularPreprocessing_training") 
dag.add_node("XGBoostTraining")
dag.add_edge("CradleDataLoading_training", "TabularPreprocessing_training")
dag.add_edge("TabularPreprocessing_training", "XGBoostTraining")

# Set up SageMaker session
pipeline_session = PipelineSession()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

# Compile to SageMaker pipeline automatically
pipeline = compile_dag_to_pipeline(
    dag=dag,
    config_path="config.json",
    sagemaker_session=pipeline_session,
    role=role,
    pipeline_name="fraud-detection"
)
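pipeline.upsert(role_arn=role)  # Create or update the pipeline definition in SageMaker
pipeline.start()                # Start an execution of the pipeline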

Command Line Interface

# Generate a new project
cursus init --template xgboost --name fraud-detection

# Validate your DAG
cursus validate my_dag.py

# Compile to SageMaker pipeline
cursus compile my_dag.py --name my-pipeline --output pipeline.json

✨ Key Features

🎯 Graph-to-Pipeline Automation

  • Input: Simple pipeline graph with step types and connections
  • Output: Complete SageMaker pipeline with all dependencies resolved
  • Magic: Intelligent analysis of graph structure with automatic step builder selection
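
To illustrate what resolving dependencies means at the graph level, the sketch below orders the nodes of the Quick Start graph so that every step comes after its upstream steps. It uses plain Python and Kahn's algorithm. `topological_order` is a hypothetical helper written for this illustration, not part of the Cursus API, and Cursus's own analysis may work differently.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Tuple


def topological_order(nodes: Iterable[str], edges: Iterable[Tuple[str, str]]) -> List[str]:
    """Return nodes so that every edge (src, dst) has src before dst (Kahn's algorithm)."""
    nodes = list(nodes)
    successors: Dict[str, List[str]] = defaultdict(list)
    in_degree: Dict[str, int] = {node: 0 for node in nodes}
    for src, dst in edges:
        if src not in in_degree or dst not in in_degree:
            raise KeyError(f"edge ({src!r}, {dst!r}) references an unknown node")
        successors[src].append(dst)
        in_degree[dst] += 1

    # Start with the steps that have no upstream dependency.
    ready = deque(node for node in nodes if in_degree[node] == 0)
    order: List[str] = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in successors[node]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                ready.append(nxt)

    # Any node left unvisited sits on a cycle.
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle, so it is not a valid DAG")
    return order


# Same nodes and edges as the 30-Second Example.
nodes = ["CradleDataLoading_training", "TabularPreprocessing_training", "XGBoostTraining"]
edges = [
    ("CradleDataLoading_training", "TabularPreprocessing_training"),
    ("TabularPreprocessing_training", "XGBoostTraining"),
]
print(topological_order(nodes, edges))
```

A cycle such as `a -> b -> a` raises `ValueError`, which is the kind of structural problem a DAG validation step would be expected to catch before compilation.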

10x Faster Development

  • Before: 2-4 weeks of manual SageMaker configuration
  • After: 10-30 minutes from graph to working pipeline
  • Result: 95% reduction in development time

🧠 Intelligent Dependency Resolution

  • Automatic step connections and data flow
  • Smart configuration matching and validation
  • Type-safe specifications with compile-time checks
  • Semantic compatibility analysis
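
As a rough picture of how typed, name-aware matching between steps could work, here is a minimal sketch that pairs each required input with the best upstream output. Everything in it is an assumption made for illustration: the `Port` structure, the scoring rule (exact type match plus name similarity from `difflib`), and the 0.5 threshold are hypothetical and do not describe Cursus's actual resolver.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Port:
    """A named, typed input or output of a step (hypothetical structure)."""
    name: str
    data_type: str


def compatibility(output: Port, required: Port) -> float:
    """Score in [0, 1]: zero on a type mismatch, otherwise name similarity."""
    if output.data_type != required.data_type:
        return 0.0
    return SequenceMatcher(None, output.name.lower(), required.name.lower()).ratio()


def resolve_inputs(
    required: List[Port], available: List[Port], threshold: float = 0.5
) -> Dict[str, Optional[str]]:
    """Map each required input to the best-scoring upstream output, or None."""
    resolved: Dict[str, Optional[str]] = {}
    for need in required:
        best = max(available, key=lambda out: compatibility(out, need), default=None)
        if best is not None and compatibility(best, need) >= threshold:
            resolved[need.name] = best.name
        else:
            resolved[need.name] = None  # unresolved: surface this as a validation error
    return resolved


# Hypothetical ports for a preprocessing step feeding a training step.
upstream_outputs = [Port("training_data", "S3Uri"), Port("row_count", "Integer")]
training_inputs = [Port("train_data", "S3Uri"), Port("model_config", "Json")]
print(resolve_inputs(training_inputs, upstream_outputs))
```

Inputs that map to `None` are the ones a resolver would need to report rather than silently wire up, which is where validation before compilation earns its keep.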

🛡️ Production Ready

  • Built-in quality gates and validation
  • Enterprise governance and compliance
  • Comprehensive error handling and debugging
  • 98% complete with 1,650+ lines of complex code eliminated

📊 Proven Results

Based on production deployments across enterprise environments:

| Component          | Code Reduction | Lines Eliminated | Key Benefit                         |
|--------------------|----------------|------------------|-------------------------------------|
| Processing Steps   | 60%            | 400+ lines       | Automatic input/output resolution   |
| Training Steps     | 60%            | 300+ lines       | Intelligent hyperparameter handling |
| Model Steps        | 47%            | 380+ lines       | Streamlined model creation          |
| Registration Steps | 66%            | 330+ lines       | Simplified deployment workflows     |
| Overall System     | ~55%           | 1,650+ lines     | Intelligent automation              |

🏗️ Architecture

Cursus follows a sophisticated layered architecture:

  • 🎯 User Interface: Fluent API and Pipeline DAG for intuitive construction
  • 🧠 Intelligence Layer: Smart proxies with automatic dependency resolution
  • 🏗️ Orchestration: Pipeline assembler and compiler for DAG-to-template conversion
  • 📚 Registry Management: Multi-context coordination with lifecycle management
  • 🔗 Dependency Resolution: Intelligent matching with semantic compatibility
  • 📋 Specification Layer: Comprehensive step definitions with quality gates

📚 Usage Examples

Basic Pipeline

from cursus.core import compile_dag_to_pipeline
from cursus.api import PipelineDAG
from sagemaker.workflow.pipeline_context import PipelineSession

# Create DAG
dag = PipelineDAG()
dag.add_node("CradleDataLoading_training")
dag.add_node("XGBoostTraining")
dag.add_edge("CradleDataLoading_training", "XGBoostTraining")

# Set up SageMaker session
pipeline_session = PipelineSession()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

# Compile to SageMaker pipeline
pipeline = compile_dag_to_pipeline(
    dag=dag,
    config_path="config.json",
    sagemaker_session=pipeline_session,
    role=role,
    pipeline_name="my-ml-pipeline"
)

Advanced Configuration

from cursus.core import compile_dag_to_pipeline, PipelineDAGCompiler
from cursus.api import PipelineDAG
from sagemaker.workflow.pipeline_context import PipelineSession

# Create DAG with more complex workflow
dag = PipelineDAG()
dag.add_node("CradleDataLoading_training")
dag.add_node("TabularPreprocessing_training")
dag.add_node("XGBoostTraining")
dag.add_node("CradleDataLoading_calibration")
dag.add_node("TabularPreprocessing_calibration")
dag.add_node("XGBoostModelEval_calibration")

# Add edges for training flow
dag.add_edge("CradleDataLoading_training", "TabularPreprocessing_training")
dag.add_edge("TabularPreprocessing_training", "XGBoostTraining")

# Add edges for calibration flow
dag.add_edge("CradleDataLoading_calibration", "TabularPreprocessing_calibration")
dag.add_edge("XGBoostTraining", "XGBoostModelEval_calibration")
dag.add_edge("TabularPreprocessing_calibration", "XGBoostModelEval_calibration")

# Set up SageMaker session
pipeline_session = PipelineSession()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

# Compile with validation and reporting
compiler = PipelineDAGCompiler(
    config_path="config.json",
    sagemaker_session=pipeline_session,
    role=role
)

# Validate DAG before compilation
validation = compiler.validate_dag_compatibility(dag)
if validation.is_valid:
    print(f"✅ DAG validation passed! Confidence: {validation.avg_confidence:.2f}")
    
    # Compile with detailed report
    pipeline, report = compiler.compile_with_report(
        dag=dag,
        pipeline_name="advanced-ml-pipeline"
    )
    print(f"📊 Pipeline compiled: {report.summary()}")
else:
    print("❌ DAG validation failed:", validation.config_errors)
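
The node names in this example appear to follow a `<StepType>_<job_type>` pattern, for example `TabularPreprocessing_calibration`, while `XGBoostTraining` carries no suffix. That convention is inferred from the examples here, not from documented behavior, so treat the sketch below as an assumption. `split_node_name` is a hypothetical helper that shows how such names could be grouped into the training and calibration flows.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class NodeName(NamedTuple):
    step_type: str
    job_type: Optional[str]


def split_node_name(name: str) -> NodeName:
    """Split 'StepType_jobtype' at the last underscore; no underscore means no job type."""
    step_type, sep, job_type = name.rpartition("_")
    if not sep:
        return NodeName(step_type=name, job_type=None)
    return NodeName(step_type=step_type, job_type=job_type)


# Node names copied from the Advanced Configuration example.
node_names = [
    "CradleDataLoading_training",
    "TabularPreprocessing_training",
    "XGBoostTraining",
    "CradleDataLoading_calibration",
    "TabularPreprocessing_calibration",
    "XGBoostModelEval_calibration",
]

# Group nodes by their (assumed) job type suffix.
flows: Dict[Optional[str], List[str]] = defaultdict(list)
for node in node_names:
    flows[split_node_name(node).job_type].append(node)

for job_type, members in flows.items():
    print(job_type, members)
```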

Using Pre-built Pipeline Templates

from cursus.pipeline_catalog.pipelines.xgb_training_simple import XGBoostTrainingSimplePipeline
from sagemaker.workflow.pipeline_context import PipelineSession

# Set up SageMaker session
pipeline_session = PipelineSession()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

# Use pre-built pipeline template
pipeline_instance = XGBoostTrainingSimplePipeline(
    config_path="config.json",
    sagemaker_session=pipeline_session,
    execution_role=role,
    enable_mods=False,  # Regular pipeline
    validate=True
)

# Generate the pipeline
pipeline = pipeline_instance.generate_pipeline()

# Deploy to SageMaker
pipeline.upsert(role_arn=role)
print(f"✅ Pipeline '{pipeline.name}' deployed successfully!")

Using the Compiler Class Directly

from cursus.core import PipelineDAGCompiler
from cursus.api import PipelineDAG
from cursus.pipeline_catalog.shared_dags.xgboost import create_xgboost_simple_dag
from sagemaker.workflow.pipeline_context import PipelineSession

# Create DAG using shared DAG definitions
dag = create_xgboost_simple_dag()

# Set up SageMaker session
pipeline_session = PipelineSession()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

# Use compiler for more control
compiler = PipelineDAGCompiler(
    config_path="config.json",
    sagemaker_session=pipeline_session,
    role=role
)

# Preview resolution before compilation
preview = compiler.preview_resolution(dag)
for node, config_type in preview.node_config_map.items():
    confidence = preview.resolution_confidence.get(node, 0.0)
    print(f"   {node} → {config_type} (confidence: {confidence:.2f})")

# Compile the pipeline
pipeline = compiler.compile(dag, pipeline_name="my-pipeline")
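
One way to act on the preview is to stop before compilation when any node resolves with low confidence. The sketch below works on plain dictionaries shaped like the `node_config_map` and `resolution_confidence` attributes used above. The helper name, the 0.8 threshold, and the sample config type names are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def low_confidence_nodes(
    node_config_map: Dict[str, str],
    resolution_confidence: Dict[str, float],
    threshold: float = 0.8,
) -> List[Tuple[str, str, float]]:
    """Return (node, config_type, confidence) for nodes below the threshold, lowest first."""
    flagged = []
    for node, config_type in node_config_map.items():
        # A node with no recorded confidence is treated as unresolved (0.0).
        confidence = resolution_confidence.get(node, 0.0)
        if confidence < threshold:
            flagged.append((node, config_type, confidence))
    return sorted(flagged, key=lambda item: item[2])


# Hypothetical preview data; real values would come from compiler.preview_resolution(dag).
sample_map = {
    "CradleDataLoading_training": "ExampleDataLoadConfig",
    "XGBoostTraining": "ExampleTrainingConfig",
}
sample_confidence = {"CradleDataLoading_training": 0.95, "XGBoostTraining": 0.62}

for node, config_type, confidence in low_confidence_nodes(sample_map, sample_confidence):
    print(f"Review {node}: matched {config_type} at confidence {confidence:.2f}")
```

With a real preview object, the call would presumably take `preview.node_config_map` and `preview.resolution_confidence` as its two arguments.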

🔧 Installation Options

Core Installation

pip install cursus

Includes basic DAG compilation and SageMaker integration.

Framework-Specific

pip install "cursus[pytorch]"    # PyTorch Lightning models
pip install "cursus[xgboost]"    # XGBoost training pipelines
pip install "cursus[nlp]"        # NLP models and processing
pip install "cursus[processing]" # Advanced data processing

Development

pip install "cursus[dev]"        # Development tools
pip install "cursus[docs]"       # Documentation tools
pip install "cursus[all]"        # Everything included
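
After installing, a quick way to confirm what ended up in the environment is to query package metadata from the standard library. This is a minimal sketch: `installed_version` is a hypothetical helper, and the distribution names checked alongside `cursus` are examples only, since which packages each extra pulls in is not stated here.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Example distribution names; adjust to the extras you installed.
for name in ("cursus", "sagemaker", "xgboost"):
    version = installed_version(name)
    print(f"{name}: {version or 'not installed'}")
```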

🎯 Who Should Use Cursus?

Data Scientists & ML Practitioners

  • Focus on model development, not infrastructure complexity
  • Rapid experimentation with 10x faster iteration
  • Business-focused interface eliminates SageMaker expertise requirements

Platform Engineers & ML Engineers

  • 60% less code to maintain and debug
  • Specification-driven architecture prevents common errors
  • Universal patterns enable faster team onboarding

Organizations

  • Accelerated innovation with faster pipeline development
  • Reduced technical debt through clean architecture
  • Built-in governance and compliance frameworks

📖 Documentation

📚 Complete Documentation Hub

Your gateway to all Cursus documentation - start here for comprehensive navigation

Knowledge Management Philosophy

  • Zettelkasten Principles - The knowledge management principles behind our slipbox documentation system, explaining how we organize and connect information for maximum discoverability and organic growth

Core Documentation

  • Developer Guide - Comprehensive guide for developing new pipeline steps and extending Cursus
  • Design Documentation - Detailed architectural documentation and design principles
  • Pipeline Catalog - Comprehensive collection of prebuilt pipeline templates organized by framework and task
  • API Reference - Detailed API documentation including core, api, steps, and other components
  • Examples - Ready-to-use pipeline blueprints and examples

🤝 Contributing

We welcome contributions! See our Developer Guide for comprehensive details.

For architectural insights and design decisions, see the Design Documentation.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Cursus: Making SageMaker pipeline development 10x faster through intelligent automation. 🚀
