# MLOps-Forge
A complete production-ready MLOps framework with built-in distributed training, monitoring, and CI/CD. Deploy ML models to production with confidence using our battle-tested infrastructure. This project implements an end-to-end ML pipeline that follows industry best practices for developing, deploying, and maintaining ML models in production environments at scale.
## Table of Contents
- Features
- Architecture
- Getting Started
- Usage
- CI/CD Pipeline
- Development
- Advanced Usage
- Security
- License
## Features
- Automated Data Pipeline: Robust data validation, cleaning, and feature engineering
- Experiment Tracking: Comprehensive version control for models, datasets, and hyperparameters with MLflow
- Distributed Training: GPU-accelerated training across multiple nodes for large models
- Model Registry: Centralized model storage and versioning with lifecycle management
- Continuous Integration/Deployment: Automated testing, validation, and deployment pipelines
- Model Serving API: Fast and scalable REST API with input validation and automatic documentation
- Model Monitoring: Performance tracking, drift detection, and automated retraining triggers
- A/B Testing: Framework for model experimentation and controlled rollouts
- Infrastructure as Code: Docker containers and Kubernetes configurations for reliable deployments
## Architecture
This system follows a modular microservice architecture with the following components:

```mermaid
graph TD
    %% Main title and styles
    classDef pipeline fill:#f0f6ff,stroke:#3273dc,color:#3273dc,stroke-width:2px
    classDef component fill:#ffffff,stroke:#209cee,color:#209cee,stroke-width:1.5px
    classDef note fill:#fffaeb,stroke:#ffdd57,color:#946c00,stroke-width:1px,stroke-dasharray:5 5
    classDef infra fill:#e3fcf7,stroke:#00d1b2,color:#00d1b2,stroke-width:1.5px,stroke-dasharray:5 5

    %% Infrastructure
    subgraph K8S["Kubernetes Cluster"]
        %% Data Pipeline
        subgraph DP["Data Pipeline"]
            DI[Data Ingestion]:::component
            DV[Data Validation]:::component
            FE[Feature Engineering]:::component
            FSN[Feature Store Integration]:::note
            DI --> DV
            DV --> FE
        end

        %% Model Training
        subgraph MT["Model Training"]
            ET[Experiment Tracking - MLflow]:::component
            DT[Distributed Training]:::component
            ME[Model Evaluation]:::component
            ABN[A/B Testing Framework]:::note
            ET --> DT
            DT --> ME
        end

        %% Model Registry
        subgraph MR["Model Registry"]
            MV[Model Versioning]:::component
            MS[Metadata Storage]:::component
            MCI[CI/CD Integration]:::note
            MV --> MS
        end

        %% API Layer
        subgraph API["API Layer"]
            FA[FastAPI Application]:::component
            PE[Prediction Endpoints]:::component
            HM[Health & Metadata APIs]:::component
            HPA[Horizontal Pod Autoscaling]:::note
            FA --> PE
            FA --> HM
        end

        %% Monitoring
        subgraph MON["Monitoring"]
            PM[Prometheus Metrics]:::component
            GD[Grafana Dashboards]:::component
            DD[Feature-level Drift Detection]:::component
            RT[Automated Retraining Triggers]:::component
            AM[Alert Manager Integration]:::note
            MPT[Model Performance Tracking]:::component
            DQM[Data Quality Monitoring]:::component
            ABT[A/B Testing Analytics]:::component
            LA[Log Aggregation]:::component
            DT2[Distributed Tracing]:::note
            PM --> GD
            PM --> DD
            DD --> RT
            MPT --> DQM
            DQM --> ABT
            ABT --> LA
        end

        %% Component relationships
        DP -->|Training Data| MT
        DP -->|Metadata| MR
        MT -->|Model Artifacts| MR
        MR -->|Latest Model| API
        API -->|Metrics| MON
        MT -->|Performance Metrics| MON
    end

    %% CI/CD Pipeline
    CICD[CI/CD Pipeline: GitHub Actions]:::infra
    CICD -->|Deploy| K8S

    %% Apply classes
    class DP,MT,MR,API,MON pipeline
```
### Component Details

- **Data Pipeline**
  - Data Ingestion: Connectors for various data sources (databases, object storage, streaming)
  - Data Validation: Schema validation, data quality checks, and anomaly detection
  - Feature Engineering: Feature transformation, normalization, and feature store integration
- **Model Training**
  - Experiment Tracking: MLflow integration for tracking parameters, metrics, and artifacts
  - Distributed Training: PyTorch distributed training for efficient model training
  - Model Evaluation: Comprehensive metrics calculation and validation
- **Model Registry**
  - Model Versioning: Storage and versioning of models with metadata
  - Artifact Management: Efficient storage of model artifacts and associated files
  - Deployment Management: Tracking of model deployment status
- **API Layer**
  - FastAPI Application: High-performance API with automatic OpenAPI documentation
  - Prediction Endpoints: RESTful endpoints for model inference
  - Health & Metadata: Endpoints for system health checks and model metadata
- **Monitoring System**
  - Metrics Collection: Prometheus integration for metrics collection
  - Drift Detection: Statistical methods to detect data and concept drift
  - Performance Tracking: Continuous monitoring of model performance metrics
  - Automated Retraining: Triggers for retraining based on drift detection
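At its core, an automated retraining trigger reduces to a simple rule: watch a rolling window of a performance metric and fire when it degrades past a tolerance relative to a baseline. A minimal sketch of that logic — the class name, thresholds, and API here are illustrative, not the project's actual implementation:

```python
from collections import deque


class RetrainingTrigger:
    """Fire when the rolling mean of a metric drops below
    baseline * (1 - tolerance), e.g. accuracy degrading over time."""

    def __init__(self, baseline: float, tolerance: float = 0.05, window: int = 100):
        self.baseline = baseline
        self.tolerance = tolerance
        self.values = deque(maxlen=window)  # rolling window of recent observations

    def observe(self, metric_value: float) -> bool:
        """Record one observation; return True if retraining should trigger."""
        self.values.append(metric_value)
        rolling_mean = sum(self.values) / len(self.values)
        return rolling_mean < self.baseline * (1 - self.tolerance)


trigger = RetrainingTrigger(baseline=0.90, tolerance=0.05, window=50)
print(trigger.observe(0.91))  # healthy: rolling mean 0.91 >= 0.855, no retraining
print(trigger.observe(0.70))  # rolling mean 0.805 < 0.855: retrain
```

A production system would persist the window across restarts and debounce the trigger, but the windowed-threshold comparison is the essential mechanism.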
### System Flow

1. **Development Workflow**:

   ```mermaid
   flowchart LR
       DS[Data Scientist] -->|Develops Model| DEV[Development Environment]
       DEV -->|Commits Code| GIT[Git Repository]
       GIT -->|Triggers| CI[CI/CD Pipeline]
       CI -->|Runs Tests| TEST[Test Suite]
       TEST -->|Validates Model| VAL[Model Validation]
       VAL -->|Performance Testing| PERF[Performance Tests]
       PERF -->|Builds| BUILD[Docker Image]
       BUILD -->|Deploys| DEPLOY[Kubernetes Cluster]
   ```

2. **Production Data Flow**:

   ```mermaid
   flowchart LR
       DATA[Data Sources] -->|Ingestion| PIPE[Data Pipeline]
       PIPE -->|Validated Data| TRAIN[Training Pipeline]
       TRAIN -->|Trained Model| REG[Model Registry]
       REG -->|Latest Model| API[API Service]
       API -->|Predictions| USERS[End Users]
       API -->|Metrics| MON[Monitoring]
       MON -->|Drift Detected| RETRAIN[Retraining Trigger]
       RETRAIN --> TRAIN
   ```
## Getting Started

### Prerequisites
- Python 3.9+ with pip
- Docker and Docker Compose
- Kubernetes cluster (local or cloud-based)
- AWS account (for cloud deployment)
### Installation

```bash
# Clone the repository
git clone https://github.com/TaimoorKhan10/MLOps-Forge.git
cd MLOps-Forge

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -e .
```
### Configuration

1. **Environment Variables**: Create a `.env` file based on the provided `.env.example`:

   ```bash
   # MLflow Configuration
   MLFLOW_TRACKING_URI=http://mlflow:5000
   MLFLOW_S3_ENDPOINT_URL=http://minio:9000

   # AWS Configuration for Deployment
   AWS_ACCESS_KEY_ID=your-access-key
   AWS_SECRET_ACCESS_KEY=your-secret-key
   AWS_REGION=us-west-2

   # Kubernetes Configuration
   K8S_NAMESPACE=mlops-production
   ```

2. **Infrastructure Setup**:

   ```bash
   # For local development with Docker Compose
   docker-compose up -d

   # For Kubernetes deployment
   kubectl apply -f infrastructure/kubernetes/
   ```
## Usage

### Data Pipeline

```python
from mlops_production_system.pipeline import DataPipeline

# Initialize the pipeline
pipeline = DataPipeline(config_path="config/pipeline_config.yaml")

# Run the pipeline
processed_data = pipeline.run(input_data_path="data/raw/training_data.csv")
```
### Model Training

```python
from mlops_production_system.models import ModelTrainer
from mlops_production_system.training import distributed_trainer

# For single-node training
trainer = ModelTrainer(model_config="config/model_config.yaml")
model = trainer.train(X_train, y_train)
metrics = trainer.evaluate(X_test, y_test)

# For distributed training
distributed_trainer.run(
    model_class="mlops_production_system.models.CustomModel",
    data_path="data/processed/training_data.parquet",
    num_nodes=4,
)
```
### Model Deployment

```bash
# Deploy model using CLI
mlops deploy --model-name="my-model" --model-version=1 --environment=production
```

```python
# Or using the Python API
from mlops_production_system.deployment import ModelDeployer

deployer = ModelDeployer()
deployer.deploy(model_name="my-model", model_version=1, environment="production")
```
### Monitoring

```python
from mlops_production_system.monitoring import DriftDetector, PerformanceMonitor

# Monitor for drift
drift_detector = DriftDetector(reference_data="data/reference.parquet")
drift_results = drift_detector.detect(new_data="data/production_data.parquet")

# Monitor model performance
performance_monitor = PerformanceMonitor(model_name="my-model", model_version=1)
performance_metrics = performance_monitor.get_metrics(timeframe="last_24h")
```
## CI/CD Pipeline

The system uses GitHub Actions for its CI/CD pipeline, configured in `.github/workflows/main.yml`. The pipeline includes:

1. **Code Quality**:
   - Linting with flake8
   - Type checking with mypy
   - Security scanning with bandit
2. **Testing**:
   - Unit tests with pytest
   - Integration tests
   - Code coverage reporting
3. **Model Validation**:
   - Performance benchmarking
   - Model quality checks
   - Validation against baseline metrics
4. **Deployment**:
   - Docker image building
   - Image pushing to container registry
   - Kubernetes deployment updates
All secrets and credentials are stored securely in GitHub Secrets and only accessed during workflow execution.
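The stages above map naturally onto chained GitHub Actions jobs. The following is a trimmed sketch of what such a workflow could look like — job names, the Python version, and the registry tag are illustrative assumptions, not the contents of the actual `main.yml`:

```yaml
name: CI/CD
on:
  push:
    branches: [main, develop]

jobs:
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.9"
      - run: pip install flake8 mypy bandit
      - run: flake8 src tests       # linting
      - run: mypy src               # type checking
      - run: bandit -r src          # security scanning

  test:
    needs: quality                  # only runs if quality passes
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -e . && pytest --cov

  deploy:
    needs: test
    if: github.ref == 'refs/heads/main'   # deploy only from main
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t your-registry/mlops-api:${{ github.sha }} .
```

The `needs:` chain enforces the quality → test → deploy ordering, and the `if:` guard keeps feature branches from deploying.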
## Development

### Project Structure

```
MLOps-Production-System/
├── .github/              # GitHub Actions workflows
├── config/               # Configuration files
├── data/                 # Data directories (gitignored)
├── docs/                 # Documentation
├── infrastructure/       # Infrastructure as code
│   ├── docker/           # Docker configurations
│   ├── kubernetes/       # Kubernetes manifests
│   └── terraform/        # Terraform for cloud resources
├── notebooks/            # Jupyter notebooks
├── scripts/              # Utility scripts
├── src/                  # Source code
│   └── mlops_production_system/
│       ├── api/          # FastAPI application
│       ├── models/       # ML models
│       ├── pipeline/     # Data pipeline
│       ├── training/     # Training code
│       ├── monitoring/   # Monitoring tools
│       └── utils/        # Utilities
├── tests/                # Test suite
├── .env.example          # Example environment variables
├── Dockerfile            # Main Dockerfile
├── pyproject.toml        # Project metadata
└── README.md             # This file
```
### Contributing

We follow the GitFlow branching model:

1. Create a feature branch from `develop`: `git checkout -b feature/your-feature`
2. Make your changes and commit: `git commit -m "Add feature"`
3. Push your branch: `git push origin feature/your-feature`
4. Open a Pull Request against the `develop` branch
All PRs must pass CI checks and code review before being merged.
## Advanced Usage

### Distributed Training

The system supports distributed training using PyTorch's DistributedDataParallel for efficient multi-node training:

```yaml
# Example Kubernetes configuration in infrastructure/kubernetes/distributed-training.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  parallelism: 4
  template:
    spec:
      containers:
        - name: trainer
          image: your-registry/mlops-trainer:latest
          resources:
            limits:
              nvidia.com/gpu: 1
          env:
            - name: WORLD_SIZE
              value: "4"
```
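Inside each worker pod, environment variables like `WORLD_SIZE` and `RANK` determine which shard of the dataset a process trains on. A simplified sketch of that sharding logic in plain Python — standing in for what PyTorch's `DistributedSampler` does, not the project's actual trainer:

```python
import os


def shard_indices(num_samples: int, world_size: int, rank: int) -> list:
    """Return the dataset indices assigned to one worker.

    Round-robin assignment: worker `rank` gets every `world_size`-th
    index starting at `rank`, so shards are disjoint and cover the data.
    """
    return list(range(rank, num_samples, world_size))


# In a real pod these come from the environment set by the Job spec above.
world_size = int(os.environ.get("WORLD_SIZE", "4"))
rank = int(os.environ.get("RANK", "0"))

indices = shard_indices(num_samples=10, world_size=world_size, rank=rank)
print(indices)  # e.g. rank 0 of 4 sees indices [0, 4, 8]
```

Each of the 4 parallel pods would see a disjoint quarter of the data; gradient synchronization across pods is then handled by DistributedDataParallel itself.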
### A/B Testing

The A/B testing framework allows comparing multiple models in production:

```python
from mlops_production_system.monitoring import ABTestingFramework

# Set up A/B test between two models
ab_test = ABTestingFramework()
ab_test.create_experiment(
    name="pricing_model_comparison",
    models=["pricing_model_v1", "pricing_model_v2"],
    traffic_split=[0.5, 0.5],
    evaluation_metric="conversion_rate",
)

# Get results
results = ab_test.get_results(experiment_name="pricing_model_comparison")
```
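A traffic split like the 50/50 one above is typically implemented by hashing a stable request key (e.g. a user ID) into [0, 1), so each user consistently lands on the same variant across requests. A hypothetical sketch of such a router — not the framework's actual routing code:

```python
import hashlib


def assign_variant(user_id: str, models: list, traffic_split: list) -> str:
    """Deterministically map a user to one model variant.

    Hash the user ID to a float in [0, 1], then walk the cumulative
    traffic split until the bucket containing that float is found.
    """
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform-ish float in [0, 1]
    cumulative = 0.0
    for model, share in zip(models, traffic_split):
        cumulative += share
        if bucket <= cumulative:
            return model
    return models[-1]  # guard against floating-point rounding


models = ["pricing_model_v1", "pricing_model_v2"]
print(assign_variant("user-42", models, [0.5, 0.5]))
```

Hash-based assignment (rather than random sampling per request) matters for metrics like conversion rate, which require a user to experience a single variant for the duration of the experiment.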
### Drift Detection

Detect data drift to trigger model retraining:

```python
from mlops_production_system.monitoring import DriftDetector

# Initialize with reference data distribution
detector = DriftDetector(
    reference_data="s3://bucket/reference_data.parquet",
    features=["feature1", "feature2", "feature3"],
    drift_method="wasserstein",
    threshold=0.1,
)

# Check for drift in new data
drift_detected, drift_metrics = detector.detect(
    current_data="s3://bucket/production_data.parquet"
)

if drift_detected:
    # Trigger retraining
    from mlops_production_system.training import trigger_retraining

    trigger_retraining(model_name="my-model")
```
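For a one-dimensional feature, the Wasserstein distance used above has a simple closed form when the two samples are the same size: the mean absolute difference between the sorted samples. A self-contained numpy sketch of the threshold check under that equal-size assumption (the project's `DriftDetector` presumably handles the general case):

```python
import numpy as np


def wasserstein_1d(reference: np.ndarray, current: np.ndarray) -> float:
    """1-D Wasserstein distance for equal-sized samples:
    mean absolute difference between the order statistics."""
    assert len(reference) == len(current), "sketch assumes equal sample sizes"
    return float(np.mean(np.abs(np.sort(reference) - np.sort(current))))


rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=1000)
shifted = rng.normal(loc=0.5, scale=1.0, size=1000)  # mean shifted by 0.5

distance = wasserstein_1d(reference, shifted)
drift_detected = distance > 0.1  # same threshold as in the example above
print(round(distance, 2), drift_detected)
```

A mean shift of 0.5 between two unit-variance normals yields a Wasserstein distance near 0.5, comfortably past the 0.1 threshold, whereas two draws from the same distribution would land near zero.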
## Security
This project follows security best practices:
- Secrets management via environment variables and Kubernetes secrets
- Regular dependency scanning for vulnerabilities
- Least privilege principle for all service accounts
- Network policies to restrict pod-to-pod communication
- Encryption of data at rest and in transit
## License
This project is licensed under the MIT License - see the LICENSE file for details.
MLOps-Forge was created to demonstrate end-to-end machine learning operations and follows industry best practices for deploying ML models in production environments. Star us on GitHub if you find this project useful!
## Technologies
- ML Framework: scikit-learn, PyTorch
- Feature Store: feast
- Experiment Tracking: MLflow
- API: FastAPI
- Containerization: Docker
- Orchestration: Kubernetes
- CI/CD: GitHub Actions
- Infrastructure as Code: Terraform
- Monitoring: Prometheus, Grafana
## Installation

### Prerequisites

- Python 3.9+
- Docker and Docker Compose
- Kubernetes (optional for local development)

### Setup

1. Clone the repository

   ```bash
   git clone https://github.com/TaimoorKhan10/MLOps-Production-System.git
   cd MLOps-Production-System
   ```

2. Create a virtual environment and install dependencies

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   pip install -r requirements.txt
   ```

3. Set up environment variables

   ```bash
   cp .env.example .env
   # Edit .env with your configuration
   ```

4. Start the development environment

   ```bash
   docker-compose up -d
   ```
## Demo
Access the demo application at http://localhost:8000 after starting the containers.
The demo includes:
- Model training dashboard
- Real-time inference API
- Performance monitoring
## Documentation

Comprehensive documentation is available in the `/docs` directory.
## Testing

Run the test suite:

```bash
pytest
```
## Deployment

### Local Deployment

```bash
docker-compose up -d
```

### Cloud Deployment (AWS)

```bash
cd infrastructure/terraform
terraform init
terraform apply
```
## Monitoring
Access the monitoring dashboard at http://localhost:3000 after deployment.
## Contributing
Contributions are welcome! Please check out our contribution guidelines.