NVIDIA DGX Spark + Cloud Nemotron

Self-healing
Kubernetes copilot

An edge AI agent that detects cluster anomalies on-device and reasons over your entire infrastructure in the cloud — then fixes it before you wake up.

<3s

Detection

87%

Auto-resolved

False escalations

How it works

Edge detect. Cloud reason. Auto-fix.

Raw telemetry stays on-device. Only structured incident payloads reach the cloud for topology-aware remediation planning.

Edge · DGX Spark

Event watcher

Streams kubectl events, pod logs, and resource metrics continuously from the node

Nemotron edge

Classifies anomalies locally — CrashLoopBackOff, OOM kills, scheduling failures, resource pressure

Incident packager

Structures findings into a typed payload with severity, affected resources, and event timeline

→

Cloud · Nemotron

Topology reasoner

Cross-references the incident against Terraform state, node pool config, PDBs, and cluster dependencies

Remediation planner

Generates a ranked action plan — scale pool, adjust PDB, taint/drain, rollback deployment

→

Action layer

⚡

Auto-apply

Safe fixes execute immediately — pod restart, HPA adjustment, node cordon

Escalate to Slack

Risky actions surface as Slack threads with diff preview and one-click approval

Capabilities

Built for platform teams

◉

Continuous watch

Sidecar agent parses every event, log line, and metric on the node. No polling lag, no missed signals.

◈

Terraform-aware

Reads your IaC state to understand what should exist vs what does. No config drift surprises.

◇

Safe-by-default

Actions are tiered: safe ops auto-execute, destructive ops require human approval via Slack.

◎

Sub-3s detection

Edge inference catches anomalies before the next Prometheus scrape interval fires.

◉

Data stays local

Raw logs and metrics never leave the node. Only structured incident summaries reach the cloud.

◈

Incident memory

Learns from past incidents to reduce false positives and refine remediation strategies over time.

Live demo

Watch it heal

Force-delete a pod on stage. Watch the terminal cascade from red to green in under 3 seconds.

infraagent — prod-cluster-01

$ kubectl delete pod payment-svc-7f8b9 --force
■ [edge] CrashLoopBackOff detected → payment-svc-7f8b9 (ns: prod)
■ [edge] Packaging incident: severity=high, restarts=5, oom=false
■ [cloud] Correlating with terraform state... 3 replicas expected
■ [cloud] PDB allows 1 unavailable. Current: 1 unavailable. Safe to act.
■ [action] Auto-applying: kubectl rollout restart deploy/payment-svc
■ [action] ✓ Deployment healthy — 3/3 pods running (resolved in 2.4s)

NVIDIA DGX Spark Nemotron GKE Terraform Kubernetes API Slack API Prometheus Python

Self-healingKubernetes copilot

Self-healing
Kubernetes copilot