Self-Healing Kubernetes Platform - AI-powered crash detection and automated remediation for containerized workloads
CrashSense - Self-Healing Kubernetes Platform
AI-Powered Kubernetes Monitoring with Automated Crash Detection & Remediation
Automatically detect and remediate Kubernetes pod crashes, resource exhaustion, and network failures
Install from PyPI • Documentation • Support
What is CrashSense?
CrashSense is a comprehensive self-healing Kubernetes platform that combines AI-powered log analysis with automated remediation for containerized workloads. Originally designed for crash log analysis, it now provides enterprise-grade Kubernetes cluster monitoring, intelligent issue detection, and autonomous healing capabilities.
Key Use Cases
| Use Case | Description |
|---|---|
| Self-Healing K8s | Automatically detect and fix pod crashes, OOMKilled containers, and CrashLoopBackOff |
| Resource Management | Monitor and remediate resource exhaustion (CPU/memory limits) |
| Network Reliability | Detect service endpoint failures and network issues |
| Prometheus Integration | Collect metrics and integrate with Alertmanager for comprehensive monitoring |
| AI-Powered Analysis | Leverage LLMs to analyze crash logs and suggest intelligent fixes |
| Traditional Monitoring | Support for web servers, system logs, and CI/CD pipelines |
Features & Highlights
- Kubernetes Monitoring
- Self-Healing
- Observability
- AI-Powered
Quick Start
Installation
# Install from PyPI with Kubernetes support
pip install crashsense
# Or install from source (development)
git clone https://github.com/AzizBahloul/CrashSense.git
cd CrashSense
pip install -e .
Initial Setup
# Initialize and configure LLM provider
crashsense init
Choose your preferred provider:
- OpenAI GPT (recommended for accuracy)
- Local Ollama (privacy-focused, no API costs)
Kubernetes Setup
Enable Kubernetes monitoring in ~/.crashsense/config.toml:
[kubernetes]
enabled = true
# kubeconfig = "/path/to/kubeconfig" # Optional; omit to use the default ~/.kube/config (TOML has no null value)
namespaces = [] # Monitor all namespaces, or specify: ["production", "staging"]
auto_heal = true
dry_run = false # Set to true for safe testing
max_remediation_actions = 10
[prometheus]
enabled = true
url = "http://localhost:9090"
alertmanager_url = "http://localhost:9093"
metrics_port = 8000
Usage Examples
Kubernetes Monitoring
Check Cluster Health
# View cluster status and health metrics
crashsense k8s status
# Check specific namespaces
crashsense k8s status -n production -n staging
One-Time Scan and Heal
# Detect and fix issues (with confirmation)
crashsense k8s heal
# Dry-run mode (simulate without applying changes)
crashsense k8s heal --dry-run
Continuous Monitoring
# Monitor cluster every 60 seconds
crashsense k8s monitor
# Enable auto-heal mode
crashsense k8s monitor --auto-heal
# Custom interval
crashsense k8s monitor --interval 30 --auto-heal
Pod Log Analysis
# Get pod logs
crashsense k8s logs my-pod -n production
# Analyze logs with AI
crashsense k8s logs my-pod --analyze
# Previous container logs (for crashed pods)
crashsense k8s logs my-pod --previous --analyze
Traditional Log Analysis
# Auto-detect and analyze latest crash log
crashsense
# Analyze specific log file
crashsense analyze /var/log/myapp/error.log
# Interactive TUI mode
crashsense tui
Kubernetes Remediation Capabilities
CrashSense automatically handles common Kubernetes issues:
Pod Crash Issues
- CrashLoopBackOff: Analyzes logs, deletes pods with high restart counts
- ImagePullBackOff: Checks image pull secrets and registry configuration
- OOMKilled: Increases memory limits automatically (50% increase)
- CreateContainerError: Identifies configuration issues
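The OOMKilled remediation above raises the memory limit by 50%. A minimal sketch of that arithmetic, assuming limits are written with Kubernetes binary suffixes (Ki, Mi, Gi); `bump_memory_limit` is a hypothetical helper for illustration, not part of the CrashSense API:

```python
import math
import re

# Binary suffixes accepted by this sketch (a subset of Kubernetes quantities).
_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def bump_memory_limit(limit: str, factor: float = 1.5) -> str:
    """Return `limit` scaled by `factor`, expressed in whole Mi (rounded up)."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi)", limit)
    if match is None:
        raise ValueError(f"unsupported memory quantity: {limit!r}")
    size_bytes = int(match.group(1)) * _UNITS[match.group(2)]
    new_mib = math.ceil(size_bytes * factor / _UNITS["Mi"])
    return f"{new_mib}Mi"


print(bump_memory_limit("512Mi"))  # 768Mi
print(bump_memory_limit("1Gi"))    # 1536Mi
```

Rounding up to whole Mi keeps the new limit at or above the intended 50% increase; a real implementation would also need to handle decimal suffixes such as `M` and `G`.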
Resource Exhaustion
- High Memory: Auto-scales memory limits and enables HPA
- High CPU: Scales deployment replicas
- Quota Exceeded: Recommends quota adjustments
Network Issues
- No Service Endpoints: Verifies pod selectors and labels
- Service Unavailable: Checks pod readiness and restarts if needed
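A Service has no endpoints when its selector matches no ready pod, so verifying selectors and labels comes down to a subset check. The sketch below illustrates that check on plain dictionaries; the function names and the sample pods are hypothetical and do not reflect how CrashSense implements it:

```python
from typing import Dict, List


def selector_matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """A pod is selected when every selector key/value appears in its labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def pods_backing_service(
    selector: Dict[str, str], pods: Dict[str, Dict[str, str]]
) -> List[str]:
    """Return the names of pods whose labels satisfy the Service selector."""
    if not selector:
        # A Service without a selector gets no automatically managed endpoints.
        return []
    return sorted(
        name for name, labels in pods.items() if selector_matches(selector, labels)
    )


# Hypothetical data: the Service selects app=web, but one pod is mislabeled.
pods = {
    "web-1": {"app": "web", "tier": "frontend"},
    "web-2": {"app": "webb", "tier": "frontend"},
}
print(pods_backing_service({"app": "web"}, pods))  # ['web-1']
```

An empty result for a Service that should have backends usually points to a typo in either the selector or the pod template labels.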
Configuration Issues
- Pending Pods: Analyzes scheduling constraints and node resources
- Failed Mounts: Identifies PVC and volume issues
Prometheus & Alertmanager Integration
Expose Metrics
CrashSense exposes Prometheus metrics:
# Metrics available at http://localhost:8000/metrics
Available Metrics:
- crashsense_pod_crashes_total: Total pod crashes detected
- crashsense_remediations_total: Total remediation actions taken
- crashsense_pod_health: Pod health status (0/1)
- crashsense_cluster_health_score: Overall cluster health (0-100)
- crashsense_remediation_duration_seconds: Remediation action duration
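To read these metrics from a script rather than from Prometheus, the endpoint's text output can be parsed directly. The following is a rough sketch using only the standard library; the sample text, its values, and the `namespace`/`pod` label names are assumptions for illustration, since the exact labels CrashSense attaches are not documented here:

```python
import re
from typing import Dict
from urllib.request import urlopen

# name, optional {labels}, value, optional timestamp
_SAMPLE_LINE = re.compile(
    r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)(?:\s+-?\d+)?$"
)


def parse_metrics(text: str) -> Dict[str, float]:
    """Map each sample ('name' or 'name{labels}') to its float value."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        match = _SAMPLE_LINE.match(line)
        if match:
            name, labels, value = match.groups()
            samples[name + (labels or "")] = float(value)
    return samples


def fetch_metrics(url: str = "http://localhost:8000/metrics") -> Dict[str, float]:
    """Scrape the endpoint once and parse the response body."""
    with urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))


# Hypothetical scrape output, used here so the sketch runs without a cluster.
EXAMPLE = """\
# HELP crashsense_cluster_health_score Overall cluster health (0-100)
# TYPE crashsense_cluster_health_score gauge
crashsense_cluster_health_score 87
crashsense_pod_health{namespace="production",pod="api-0"} 1
"""
print(parse_metrics(EXAMPLE))
```

Calling `fetch_metrics()` against a running instance would return the live values; the port matches `metrics_port = 8000` from the configuration.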
Alertmanager Webhook
Configure Alertmanager to trigger CrashSense remediation:
receivers:
  - name: crashsense
    webhook_configs:
      - url: 'http://crashsense:9094/webhook'
        send_resolved: true
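Because `send_resolved` is true, the webhook receives resolved alerts as well as firing ones, and only the latter should trigger remediation. Alertmanager posts a JSON document with an `alerts` list whose entries carry a `status` and a `labels` map. The sketch below filters that list; the alert name and the `namespace`/`pod` labels are hypothetical and depend on your alerting rules:

```python
import json
from typing import Any, Dict, List


def firing_alerts(payload: Dict[str, Any]) -> List[Dict[str, str]]:
    """Return the label sets of alerts that are currently firing."""
    return [
        dict(alert.get("labels", {}))
        for alert in payload.get("alerts", [])
        if alert.get("status") == "firing"
    ]


# Hypothetical webhook body with one firing and one resolved alert.
BODY = """
{
  "version": "4",
  "status": "firing",
  "alerts": [
    {"status": "firing",
     "labels": {"alertname": "PodCrashLooping", "namespace": "production", "pod": "api-0"}},
    {"status": "resolved",
     "labels": {"alertname": "PodCrashLooping", "namespace": "staging", "pod": "worker-1"}}
  ]
}
"""
for labels in firing_alerts(json.loads(BODY)):
    print(labels["namespace"], labels["pod"])  # production api-0
```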
Architecture
┌─────────────────────────────────────────────────┐
│               CrashSense Platform               │
├─────────────────────────────────────────────────┤
│                                                 │
│  ┌──────────────┐      ┌──────────────┐         │
│  │ K8s Monitor  │◄────►│  Prometheus  │         │
│  │              │      │  Collector   │         │
│  └──────┬───────┘      └──────────────┘         │
│         │                                       │
│         ▼                                       │
│  ┌──────────────┐      ┌──────────────┐         │
│  │   Analyzer   │◄────►│ LLM Adapter  │         │
│  │ (AI-Powered) │      │ (GPT/Ollama) │         │
│  └──────┬───────┘      └──────────────┘         │
│         │                                       │
│         ▼                                       │
│  ┌──────────────┐      ┌──────────────┐         │
│  │ Remediation  │◄────►│    Memory    │         │
│  │    Engine    │      │    Store     │         │
│  └──────────────┘      └──────────────┘         │
│                                                 │
├─────────────────────────────────────────────────┤
│            CLI / TUI / API Interface            │
└─────────────────────────────────────────────────┘
          ▲                     ▲
          │                     │
   Kubernetes API         Alertmanager
Safety Features
CrashSense implements multiple safety layers:
- Dry-Run Mode: Test remediation without applying changes
- Action Limits: Maximum actions per cycle (default: 10)
- Confirmation Prompts: Interactive mode requires user approval
- Audit Trail: All actions logged with timestamps and results
- Rollback Support: Failed actions can be reverted
- RBAC Integration: Respects Kubernetes permissions
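Three of the safety layers above (dry-run mode, action limits, and the audit trail) can be pictured as a small guard around each remediation cycle. This is only a sketch under stated assumptions: `SafeRunner`, `AuditEntry`, and `run_cycle` are hypothetical names rather than CrashSense classes, and the sketch assumes that simulated dry-run actions count toward the per-cycle limit:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class AuditEntry:
    timestamp: str
    issue: str
    outcome: str


@dataclass
class SafeRunner:
    max_actions: int = 10   # mirrors max_remediation_actions in the config
    dry_run: bool = True    # simulate by default
    audit: List[AuditEntry] = field(default_factory=list)

    def run_cycle(self, issues: List[str], fix: Callable[[str], None]) -> int:
        """Process issues in order; return how many actions were taken or simulated."""
        handled = 0
        for issue in issues:
            if handled >= self.max_actions:
                self._log(issue, "skipped: action limit reached")
                continue
            if self.dry_run:
                self._log(issue, "dry-run: no change applied")
            else:
                fix(issue)
                self._log(issue, "applied")
            handled += 1
        return handled

    def _log(self, issue: str, outcome: str) -> None:
        now = datetime.now(timezone.utc).isoformat()
        self.audit.append(AuditEntry(now, issue, outcome))


runner = SafeRunner(max_actions=2, dry_run=True)
runner.run_cycle(["pod-a", "pod-b", "pod-c"], fix=print)
for entry in runner.audit:
    print(entry.issue, "->", entry.outcome)
```

With `dry_run=True` the `fix` callback is never invoked, so the cycle can be rehearsed against a production cluster before enabling real changes.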
Requirements
System Requirements
- Python 3.8+
- Kubernetes cluster (1.28+) with kubectl access
- Optional: Prometheus & Alertmanager for metrics
Kubernetes Permissions
CrashSense requires these RBAC permissions:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: crashsense
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "services", "endpoints"]
    verbs: ["get", "list", "watch", "delete"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "patch"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list"]
  - apiGroups: ["metrics.k8s.io"]
    resources: ["pods", "nodes"]
    verbs: ["get", "list"]
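One way to confirm that an account actually holds these permissions is `kubectl auth can-i`. The sketch below only generates the corresponding command lines from the rules in the ClusterRole above, so it runs without a cluster; `RULES` and `can_i_commands` are illustrative names, and executing the commands (for example via `subprocess.run`) is left to the reader:

```python
from typing import Dict, List

# The ClusterRole rules above, transcribed as plain Python data.
RULES: List[Dict[str, object]] = [
    {"group": "", "resources": ["pods", "pods/log", "services", "endpoints"],
     "verbs": ["get", "list", "watch", "delete"]},
    {"group": "apps", "resources": ["deployments", "replicasets"],
     "verbs": ["get", "list", "patch"]},
    {"group": "", "resources": ["nodes"], "verbs": ["get", "list"]},
    {"group": "metrics.k8s.io", "resources": ["pods", "nodes"],
     "verbs": ["get", "list"]},
]


def can_i_commands(rules: List[Dict[str, object]]) -> List[List[str]]:
    """Build one `kubectl auth can-i VERB RESOURCE` command per rule entry."""
    commands: List[List[str]] = []
    for rule in rules:
        group = str(rule["group"])
        for resource in rule["resources"]:
            # "pods/log" is a subresource, passed via --subresource.
            name, _, subresource = resource.partition("/")
            target = f"{name}.{group}" if group else name
            for verb in rule["verbs"]:
                command = ["kubectl", "auth", "can-i", verb, target]
                if subresource:
                    command.append(f"--subresource={subresource}")
                commands.append(command)
    return commands


for command in can_i_commands(RULES)[:4]:
    print(" ".join(command))
```

Each command prints `yes` or `no` when run against a cluster, which makes a missing verb easy to spot before enabling auto-heal.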
Advanced Usage
Custom Remediation Policies
Create custom remediation logic:
from crashsense.core.k8s_monitor import KubernetesMonitor
from crashsense.core.remediation import RemediationEngine

# Initialize the monitor and a remediation engine that applies real changes
monitor = KubernetesMonitor()
engine = RemediationEngine(monitor, dry_run=False)

# Detect issues
crashes = monitor.detect_pod_crashes()

# Apply remediation to each detected crash
for crash in crashes:
    result = engine.remediate_issue(crash)
    print(f"Remediation: {result}")
RAG Document Management
Add Kubernetes documentation for better analysis:
# Add custom documentation
crashsense rag add /path/to/k8s-docs
# Build RAG index
crashsense rag build
# Clear and rebuild
crashsense rag clear
crashsense rag add ./kubernetes-playbooks
Memory Management
View and manage crash analysis history:
# List recent crash analyses
crashsense memory
# Stored in SQLite: ~/.crashsense/memories.db
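Since the history lives in a regular SQLite file, it can also be inspected with Python's standard library. The schema is not documented here, so this sketch assumes nothing about table names and simply lists whatever tables exist; `list_tables` is a hypothetical helper, and the file is opened read-only to avoid interfering with CrashSense:

```python
import sqlite3
from contextlib import closing
from pathlib import Path
from typing import List


def list_tables(db_path: Path) -> List[str]:
    """Return the table names in a SQLite file, opened read-only."""
    uri = db_path.resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    db = Path.home() / ".crashsense" / "memories.db"
    if db.exists():
        print(list_tables(db))
    else:
        print(f"No memory store found at {db}")
```

From there, `PRAGMA table_info(<table>)` shows the columns of any table the listing reveals.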
Integration Examples
CI/CD Pipeline
# GitLab CI example
k8s-health-check:
  stage: post-deploy
  script:
    - pip install crashsense
    - crashsense k8s status || exit 1
    - crashsense k8s heal --dry-run
Monitoring Dashboard
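# Flask webhook receiver
from flask import Flask, request
from crashsense.core.k8s_monitor import KubernetesMonitor
from crashsense.core.remediation import RemediationEngine

app = Flask(__name__)

# Build the engine once at startup, as in the Custom Remediation Policies example
engine = RemediationEngine(KubernetesMonitor(), dry_run=False)

@app.route('/webhook', methods=['POST'])
def alertmanager_webhook():
    alert = request.json
    # Trigger remediation based on alert
    engine.remediate_issue(alert)
    return {'status': 'ok'}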
Documentation
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for details.
License
MIT License - see LICENSE for details.
Acknowledgments
Built with:
- Kubernetes Python Client
- Prometheus Client
- Rich for beautiful terminal output
- OpenAI GPT & Ollama for AI-powered analysis
Made with ❤️ by Mohamed Aziz Bahloul
⭐ Star this repo if you find it useful!
# Analyze specific file
crashsense analyze /var/log/apache2/error.log
# Pipe from STDIN
tail -f /var/log/syslog | crashsense analyze
# Launch interactive TUI
crashsense tui
---
## Screenshots & Workflow
### Startup & Device Detection
*CrashSense initializing and detecting compute resources*
### Crash Log Analysis & Explanation
*AI-powered analysis showing parsed information and remediation steps*
### Summary Table & Command Suggestions
*Actionable summary with safe shell command recommendations*
---
## RAG Documentation (Optional)
CrashSense can leverage your existing documentation for more contextual analysis:
### **Default Knowledge Base**

    kb/                                  # Your custom docs
    src/data/
    ├── crashsense_best_practices.md
    ├── python_exceptions_playbook.md
    ├── web_server_error_patterns.md
    └── linux_permission_paths.md

### **Manage Documentation**
```bash
# Add custom documentation
crashsense rag add /path/to/docs/
# Clear knowledge base
crashsense rag clear
# Rebuild with dry-run preview
crashsense rag build --dry-run
```
Configuration & Security
Configuration File
# ~/.crashsense/config.toml
[llm]
provider = "openai" # or "ollama"
model = "gpt-4"
[security]
safe_mode = true
confirm_commands = true
Environment Variables
export CRASHSENSE_OPENAI_KEY="your-api-key-here"
Security Features
- Command execution requires explicit confirmation
- Built-in safety checks and validation
- Configurable security policies
- Audit trail for executed commands
Troubleshooting
Ollama Setup Issues
# Manual model pull
ollama pull llama3.2:1b
# Start the Ollama daemon if it is not already running
ollama serve
# Verify installation
ollama list
For more help, visit the Ollama Documentation
Support & Donations
If CrashSense has helped streamline your debugging workflow, consider supporting continued development:
| Platform | ID |
|---|---|
| RedotPay | 1951109247 |
| Binance | 1104913076 |
Your support helps keep CrashSense free and continuously improving!
⭐ Star this repo, please!