
Self-Healing Kubernetes Platform - AI-powered crash detection and automated remediation for containerized workloads


🚀 CrashSense - Self-Healing Kubernetes Platform

AI-Powered Kubernetes Monitoring with Automated Crash Detection & Remediation


Automatically detect and remediate Kubernetes pod crashes, resource exhaustion, and network failures

🚀 Install from PyPI • 📖 Documentation • 💬 Support


✨ What is CrashSense?

CrashSense is a comprehensive self-healing Kubernetes platform that combines AI-powered log analysis with automated remediation for containerized workloads. Originally designed for crash log analysis, it now provides enterprise-grade Kubernetes cluster monitoring, intelligent issue detection, and autonomous healing capabilities.

🎯 Key Use Cases

| Use Case | Description |
| --- | --- |
| 🔄 Self-Healing K8s | Automatically detect and fix pod crashes, OOMKilled containers, and CrashLoopBackOff |
| 📊 Resource Management | Monitor and remediate resource exhaustion (CPU/memory limits) |
| 🌐 Network Reliability | Detect service endpoint failures and network issues |
| 📈 Prometheus Integration | Collect metrics and integrate with Alertmanager for comprehensive monitoring |
| 🧠 AI-Powered Analysis | Leverage LLMs to analyze crash logs and suggest intelligent fixes |
| 🖥️ Traditional Monitoring | Support for web servers, system logs, and CI/CD pipelines |

🌟 Features & Highlights

🔍 Kubernetes Monitoring

  • Pod crash detection (CrashLoopBackOff, OOMKilled)
  • Resource exhaustion monitoring
  • Network failure detection
  • Real-time cluster health checks
  • Multi-namespace support

🏥 Self-Healing

  • Automated pod restart/deletion
  • Memory limit auto-scaling
  • Service endpoint remediation
  • Deployment rollout management
  • Configurable dry-run mode

📊 Observability

  • Prometheus metrics exposure
  • Alertmanager integration
  • Custom metric collection
  • Webhook receivers for alerts
  • Historical trend analysis

🧠 AI-Powered

  • GPT/Ollama integration for log analysis
  • Root cause identification
  • Intelligent remediation suggestions
  • RAG over documentation
  • Context-aware fixes

🚀 Quick Start

Installation

# Install from PyPI with Kubernetes support
pip install crashsense

# Or install from source (development)
git clone https://github.com/AzizBahloul/CrashSense.git
cd CrashSense
pip install -e .

Initial Setup

# Initialize and configure LLM provider
crashsense init

Choose your preferred provider:

  • OpenAI GPT (recommended for accuracy)
  • Local Ollama (privacy-focused, no API costs)

Kubernetes Setup

Enable Kubernetes monitoring in ~/.crashsense/config.toml:

[kubernetes]
enabled = true
# kubeconfig = "/path/to/kubeconfig"  # TOML has no null; leave unset to use the default ~/.kube/config
namespaces = []  # Monitor all namespaces, or specify: ["production", "staging"]
auto_heal = true
dry_run = false  # Set to true for safe testing
max_remediation_actions = 10

[prometheus]
enabled = true
url = "http://localhost:9090"
alertmanager_url = "http://localhost:9093"
metrics_port = 8000

💻 Usage Examples

Kubernetes Monitoring

Check Cluster Health

# View cluster status and health metrics
crashsense k8s status

# Check specific namespaces
crashsense k8s status -n production -n staging

One-Time Scan and Heal

# Detect and fix issues (with confirmation)
crashsense k8s heal

# Dry-run mode (simulate without applying changes)
crashsense k8s heal --dry-run

Continuous Monitoring

# Monitor cluster every 60 seconds
crashsense k8s monitor

# Enable auto-heal mode
crashsense k8s monitor --auto-heal

# Custom interval
crashsense k8s monitor --interval 30 --auto-heal
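
The behavior of `monitor` with `--interval` and `--auto-heal` can be pictured as a simple polling loop. The following is a minimal sketch under stated assumptions, not CrashSense's actual implementation: `monitor_loop`, `scan`, and `heal` are hypothetical names, and the `max_cycles` and `sleep` parameters exist only so the loop can be exercised without a cluster.

```python
import time
from typing import Callable, Iterable, List, Optional


def monitor_loop(
    scan: Callable[[], Iterable[str]],
    heal: Callable[[str], None],
    interval: float = 60.0,
    auto_heal: bool = False,
    max_cycles: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll `scan` every `interval` seconds; call `heal` only if auto_heal is on.

    Returns every issue seen, in order. All names here are illustrative.
    """
    seen: List[str] = []
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        for issue in scan():
            seen.append(issue)
            if auto_heal:
                heal(issue)  # without auto-heal, issues are only reported
        cycle += 1
        # Sleep between cycles, but not after the final one
        if max_cycles is None or cycle < max_cycles:
            sleep(interval)
    return seen
```

Passing `sleep` in as a parameter keeps the loop testable: a test can record the requested delays instead of actually waiting.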

Pod Log Analysis

# Get pod logs
crashsense k8s logs my-pod -n production

# Analyze logs with AI
crashsense k8s logs my-pod --analyze

# Previous container logs (for crashed pods)
crashsense k8s logs my-pod --previous --analyze

Traditional Log Analysis

# Auto-detect and analyze latest crash log
crashsense

# Analyze specific log file
crashsense analyze /var/log/myapp/error.log

# Interactive TUI mode
crashsense tui

🔧 Kubernetes Remediation Capabilities

CrashSense automatically handles common Kubernetes issues:

Pod Crash Issues

  • CrashLoopBackOff: Analyzes logs, deletes pods with high restart counts
  • ImagePullBackOff: Checks image pull secrets and registry configuration
  • OOMKilled: Automatically raises memory limits by 50%
  • CreateContainerError: Identifies configuration issues
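
The pod states listed above map onto fields of the standard Kubernetes pod status object. As a rough illustration of how such detection could work, the sketch below classifies a pod dictionary shaped like the output of `kubectl get pod -o json`. The function name `classify_pod` and the `restart_threshold` default of 5 are assumptions made for this example; CrashSense's own detection logic and thresholds may differ.

```python
from typing import Any, Dict, List

# Waiting reasons from the list above that indicate a crash or startup failure
WAITING_REASONS = ("CrashLoopBackOff", "ImagePullBackOff", "CreateContainerError")


def classify_pod(pod: Dict[str, Any], restart_threshold: int = 5) -> List[str]:
    """Return crash findings for one pod object (as from `kubectl get pod -o json`)."""
    findings: List[str] = []
    statuses = pod.get("status", {}).get("containerStatuses") or []
    for cs in statuses:
        name = cs.get("name", "?")
        waiting = (cs.get("state") or {}).get("waiting") or {}
        last_term = (cs.get("lastState") or {}).get("terminated") or {}
        reason = waiting.get("reason")
        if reason in WAITING_REASONS:
            findings.append(f"{name}: {reason}")
        # An OOM kill shows up on the previous (terminated) container state
        if last_term.get("reason") == "OOMKilled":
            findings.append(f"{name}: OOMKilled")
        if cs.get("restartCount", 0) >= restart_threshold:
            findings.append(f"{name}: high restart count ({cs['restartCount']})")
    return findings
```

A container can produce several findings at once, for example when it is in CrashLoopBackOff because its previous run was OOMKilled.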

Resource Exhaustion

  • High Memory: Auto-scales memory limits and enables HPA
  • High CPU: Scales deployment replicas
  • Quota Exceeded: Recommends quota adjustments
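
To make the memory-limit scaling concrete, here is a small worked sketch of raising a Kubernetes memory quantity by 50%, the increase mentioned for OOMKilled containers. The helper names `to_bytes` and `bump_memory_limit` are hypothetical, and rounding the result up to whole MiB is a choice made for this example. Only integer quantities with the common binary and decimal suffixes are handled.

```python
import math
import re

# Binary (power-of-two) and decimal suffixes used by Kubernetes quantities
_FACTORS = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}
_QUANTITY = re.compile(r"^(\d+)(Ki|Mi|Gi|Ti|k|M|G|T)?$")


def to_bytes(quantity: str) -> int:
    """Convert a quantity such as '512Mi' or '1G' to bytes."""
    match = _QUANTITY.match(quantity.strip())
    if match is None:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    number, suffix = match.groups()
    return int(number) * (_FACTORS[suffix] if suffix else 1)


def bump_memory_limit(quantity: str, factor: float = 1.5) -> str:
    """Scale a memory limit by `factor`, rounded up to whole MiB."""
    new_bytes = to_bytes(quantity) * factor
    return f"{math.ceil(new_bytes / 1024**2)}Mi"


print(bump_memory_limit("512Mi"))  # 768Mi
print(bump_memory_limit("1Gi"))    # 1536Mi
```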

Network Issues

  • No Service Endpoints: Verifies pod selectors and labels
  • Service Unavailable: Checks pod readiness and restarts if needed

Configuration Issues

  • Pending Pods: Analyzes scheduling constraints and node resources
  • Failed Mounts: Identifies PVC and volume issues

📊 Prometheus & Alertmanager Integration

Expose Metrics

CrashSense exposes Prometheus metrics:

# Metrics available at http://localhost:8000/metrics

Available Metrics:

  • crashsense_pod_crashes_total - Total pod crashes detected
  • crashsense_remediations_total - Total remediation actions taken
  • crashsense_pod_health - Pod health status (0/1)
  • crashsense_cluster_health_score - Overall cluster health (0-100)
  • crashsense_remediation_duration_seconds - Remediation action duration
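
For a quick check without a full Prometheus setup, the metrics endpoint can be read directly. The sketch below fetches the endpoint and parses the Prometheus text exposition format with the standard library only. It is a simplified parser: it ignores comment lines and timestamps. The sample text and the `namespace` and `pod` label names are made up for illustration, since the labels CrashSense actually attaches are not documented here.

```python
from typing import Dict
from urllib.request import urlopen


def fetch_metrics(url: str = "http://localhost:8000/metrics", timeout: float = 5.0) -> str:
    """Download the raw metrics text from the exporter endpoint."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_samples(text: str) -> Dict[str, float]:
    """Map each sample's `name{labels}` key to its float value."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Labeled sample: the key ends at the closing brace
            key, _, rest = line.rpartition("}")
            key += "}"
        else:
            key, _, rest = line.partition(" ")
        fields = rest.split()
        if fields:
            samples[key] = float(fields[0])  # any trailing timestamp is ignored
    return samples


# Illustrative sample only; real output will differ
SAMPLE = """\
# TYPE crashsense_cluster_health_score gauge
crashsense_cluster_health_score 87
crashsense_pod_health{namespace="production",pod="my-pod"} 1
"""
print(parse_samples(SAMPLE)["crashsense_cluster_health_score"])  # 87.0
```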

Alertmanager Webhook

Configure Alertmanager to trigger CrashSense remediation:

receivers:
  - name: crashsense
    webhook_configs:
      - url: 'http://crashsense:9094/webhook'
        send_resolved: true
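
Because `send_resolved: true` is set, the receiver is also notified when alerts clear, so a handler should act only on alerts that are still firing. The sketch below filters an Alertmanager webhook payload accordingly. The top-level `alerts` list and per-alert `status` and `labels` fields follow Alertmanager's webhook format; the function name `firing_targets` is hypothetical, and the presence of `namespace` and `pod` labels is an assumption that depends on your alerting rules.

```python
from typing import Any, Dict, List, Tuple


def firing_targets(payload: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Return (alertname, namespace, pod) for each alert that is still firing."""
    targets: List[Tuple[str, str, str]] = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved notifications need no remediation
        labels = alert.get("labels", {})
        targets.append(
            (
                labels.get("alertname", ""),
                labels.get("namespace", ""),
                labels.get("pod", ""),
            )
        )
    return targets


# Illustrative payload, trimmed to the fields used above
example = {
    "status": "firing",
    "alerts": [
        {"status": "firing", "labels": {"alertname": "PodCrashLooping", "namespace": "production", "pod": "my-pod"}},
        {"status": "resolved", "labels": {"alertname": "PodCrashLooping", "namespace": "staging", "pod": "old-pod"}},
    ],
}
print(firing_targets(example))  # [('PodCrashLooping', 'production', 'my-pod')]
```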

🏗️ Architecture

┌──────────────────────────────────────────────────┐
│           CrashSense Platform                    │
├──────────────────────────────────────────────────┤
│                                                  │
│  ┌──────────────┐      ┌──────────────┐          │
│  │ K8s Monitor  │◄────►│  Prometheus  │          │
│  │              │      │  Collector   │          │
│  └──────┬───────┘      └──────────────┘          │
│         │                                        │
│         ▼                                        │
│  ┌──────────────┐      ┌──────────────┐          │
│  │   Analyzer   │◄────►│  LLM Adapter │          │
│  │  (AI-Powered)│      │ (GPT/Ollama) │          │
│  └──────┬───────┘      └──────────────┘          │
│         │                                        │
│         ▼                                        │
│  ┌──────────────┐      ┌──────────────┐          │
│  │ Remediation  │◄────►│   Memory     │          │
│  │   Engine     │      │    Store     │          │
│  └──────────────┘      └──────────────┘          │
│                                                  │
├──────────────────────────────────────────────────┤
│         CLI / TUI / API Interface                │
└──────────────────────────────────────────────────┘
         ▲                       ▲
         │                       │
    Kubernetes API         Alertmanager

🛡️ Safety Features

CrashSense implements multiple safety layers:

  1. Dry-Run Mode: Test remediation without applying changes
  2. Action Limits: Maximum actions per cycle (default: 10)
  3. Confirmation Prompts: Interactive mode requires user approval
  4. Audit Trail: All actions logged with timestamps and results
  5. Rollback Support: Failed actions can be reverted
  6. RBAC Integration: Respects Kubernetes permissions
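
The first, second, and fourth layers above (dry-run, action limits, audit trail) combine naturally into a single guard around every remediation action. The following is a minimal sketch of that idea, not CrashSense's internal code: the class `GuardedExecutor` and its fields are hypothetical. In this sketch a dry-run action still counts toward the limit, which is a design choice made here so that a dry run previews exactly what a real run would attempt.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List, Tuple


@dataclass
class GuardedExecutor:
    """Wrap remediation actions with dry-run, an action limit, and an audit log."""

    dry_run: bool = True
    max_actions: int = 10
    audit_log: List[Tuple[str, str, str]] = field(default_factory=list)
    _taken: int = field(default=0, init=False, repr=False)

    def _record(self, status: str, description: str) -> None:
        # Each entry: (UTC timestamp, status, description)
        stamp = datetime.now(timezone.utc).isoformat()
        self.audit_log.append((stamp, status, description))

    def run(self, description: str, action: Callable[[], None]) -> bool:
        """Run `action` if allowed; return True only if it was applied."""
        if self._taken >= self.max_actions:
            self._record("SKIPPED", description)
            return False
        self._taken += 1
        if self.dry_run:
            self._record("DRY-RUN", description)
            return False
        action()
        self._record("APPLIED", description)
        return True
```

A caller would construct one executor per cycle, for example `GuardedExecutor(dry_run=False, max_actions=10)`, and pass each remediation as a zero-argument callable.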

📋 Requirements

System Requirements

  • Python 3.8+
  • Kubernetes cluster (1.28+) with kubectl access
  • Optional: Prometheus & Alertmanager for metrics

Kubernetes Permissions

CrashSense requires these RBAC permissions:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: crashsense
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "services", "endpoints"]
    verbs: ["get", "list", "watch", "delete"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "patch"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list"]
  - apiGroups: ["metrics.k8s.io"]
    resources: ["pods", "nodes"]
    verbs: ["get", "list"]
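
Before enabling auto-heal, it can help to confirm that the account CrashSense runs under really holds these permissions. The sketch below transcribes the ClusterRole rules into a table and generates the matching `kubectl auth can-i` commands. The names `RULES` and `can_i_commands` are hypothetical; the commands are only built here, not executed.

```python
from typing import List, Tuple

# (API group, resources, verbs), transcribed from the ClusterRole above
RULES: List[Tuple[str, List[str], List[str]]] = [
    ("", ["pods", "pods/log", "services", "endpoints"], ["get", "list", "watch", "delete"]),
    ("apps", ["deployments", "replicasets"], ["get", "list", "patch"]),
    ("", ["nodes"], ["get", "list"]),
    ("metrics.k8s.io", ["pods", "nodes"], ["get", "list"]),
]


def can_i_commands(rules: List[Tuple[str, List[str], List[str]]] = RULES) -> List[List[str]]:
    """Build one `kubectl auth can-i` command per (verb, resource) pair."""
    commands: List[List[str]] = []
    for group, resources, verbs in rules:
        for resource in resources:
            # "pods/log" is the "log" subresource of "pods"
            name, _, sub = resource.partition("/")
            target = f"{name}.{group}" if group else name
            for verb in verbs:
                cmd = ["kubectl", "auth", "can-i", verb, target]
                if sub:
                    cmd.append(f"--subresource={sub}")
                commands.append(cmd)
    return commands


for command in can_i_commands()[:3]:
    print(" ".join(command))
```

Each generated command can then be run by hand or through `subprocess.run`, with the printed answer checked for each pair.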

🎓 Advanced Usage

Custom Remediation Policies

Create custom remediation logic:

from crashsense.core.k8s_monitor import KubernetesMonitor
from crashsense.core.remediation import RemediationEngine

# Initialize
monitor = KubernetesMonitor()
engine = RemediationEngine(monitor, dry_run=False)

# Detect issues
crashes = monitor.detect_pod_crashes()

# Apply remediation
for crash in crashes:
    result = engine.remediate_issue(crash)
    print(f"Remediation: {result}")

RAG Document Management

Add Kubernetes documentation for better analysis:

# Add custom documentation
crashsense rag add /path/to/k8s-docs

# Build RAG index
crashsense rag build

# Clear and rebuild
crashsense rag clear
crashsense rag add ./kubernetes-playbooks
crashsense rag build

Memory Management

View and manage crash analysis history:

# List recent crash analyses
crashsense memory

# Stored in SQLite: ~/.crashsense/memories.db

🔌 Integration Examples

CI/CD Pipeline

# GitLab CI example
k8s-health-check:
  stage: post-deploy
  script:
    - pip install crashsense
    - crashsense k8s status || exit 1
    - crashsense k8s heal --dry-run

Monitoring Dashboard

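# Flask webhook receiver
from flask import Flask, request
from crashsense.core.k8s_monitor import KubernetesMonitor
from crashsense.core.remediation import RemediationEngine

app = Flask(__name__)

# Create the engine once at startup (same constructor as in Custom Remediation Policies)
monitor = KubernetesMonitor()
engine = RemediationEngine(monitor, dry_run=False)

@app.route('/webhook', methods=['POST'])
def alertmanager_webhook():
    # Parse the JSON payload posted by Alertmanager
    alert = request.get_json()
    # Trigger remediation based on alert
    engine.remediate_issue(alert)
    return {'status': 'ok'}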

📚 Documentation


🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for details.


📄 License

MIT License - see LICENSE for details.


🙏 Acknowledgments

Built with:


Made with ❤️ by Mohamed Aziz Bahloul

⭐ Star this repo if you find it useful!

# Analyze a specific file
crashsense analyze /var/log/apache2/error.log

# Pipe from STDIN
tail -f /var/log/syslog | crashsense analyze

# Launch the interactive TUI
crashsense tui


---

## 📸 Screenshots & Workflow

### 🔄 Startup & Device Detection
*CrashSense initializing and detecting compute resources*

![Startup & Device Detection](image1.png)

### 🔍 Crash Log Analysis & Explanation
*AI-powered analysis showing parsed information and remediation steps*

![Crash Log Analysis & Explanation](image2.png)

### 📊 Summary Table & Command Suggestions
*Actionable summary with safe shell command recommendations*

![Summary Table & Command Suggestions](image3.png)

---

## 📚 RAG Documentation (Optional)

CrashSense can leverage your existing documentation for more contextual analysis:

### 📁 **Default Knowledge Base**

    kb/                                  # Your custom docs
    src/data/
    ├── crashsense_best_practices.md
    ├── python_exceptions_playbook.md
    ├── web_server_error_patterns.md
    └── linux_permission_paths.md


### 🛠️ **Manage Documentation**

```bash
# Add custom documentation
crashsense rag add /path/to/docs/

# Clear knowledge base
crashsense rag clear

# Rebuild with dry-run preview
crashsense rag build --dry-run
```

⚙️ Configuration & Security

📝 Configuration File

# ~/.crashsense/config.toml
[llm]
provider = "openai"  # or "ollama"
model = "gpt-4"

[security]
safe_mode = true
confirm_commands = true

🔐 Environment Variables

export CRASHSENSE_OPENAI_KEY="your-api-key-here"

🛡️ Security Features

  • ✅ Command execution requires explicit confirmation
  • ✅ Built-in safety checks and validation
  • ✅ Configurable security policies
  • ✅ Audit trail for executed commands

🔧 Troubleshooting

Ollama Setup Issues

# Manual model pull
ollama pull llama3.2:1b

# Start the Ollama daemon if it is not already running
ollama serve

# Verify installation
ollama list

For more help, visit the Ollama Documentation


💝 Support & Donations

If CrashSense has helped streamline your debugging workflow, consider supporting continued development:

| Platform | ID |
| --- | --- |
| 💳 RedotPay | 1951109247 |
| 🟡 Binance | 1104913076 |

Your support helps keep CrashSense free and continuously improving!




โญ Star this repo yar7am book!
