Seeded intent and wiki pages

2026-06-22 19:09:24 +02:00
parent b181465564
commit 129a229e38
4 changed files with 496 additions and 21 deletions
--- a/INTENT.md
+++ b/INTENT.md
@@ -0,0 +1,171 @@
+# TeleMcp — Project Intent
+
+> **Mission control for Kubernetes hosts, exposed to LLM agents through MCP.**
+
+TeleMcp is a self-contained observability stack that deploys onto a Linux Kubernetes host and surfaces metrics, logs, and cluster state through a single **Model Context Protocol (MCP)** bridge. The goal is to let an autonomous agent — or a human with an agent — **bootstrap, monitor, triage, and operate** a box without bespoke integrations or constant human supervision.
+
+This document anchors what we are building, why, and what is in scope. When in doubt, prefer the simplest path that gives an agent reliable, read-only situational awareness.
+
+---
+
+## Problem
+
+Operating a Kubernetes host means juggling many signals across many systems: node health, workload status, logs, alerts, certificates, systemd units, and more. Humans use Grafana dashboards, `kubectl`, and ad-hoc PromQL/LogQL. LLM agents need the same information, but through a **standardized, safe interface** — not raw shell access.
+
+TeleMcp closes that gap by:
+
+1. **Collecting** telemetry with proven CNCF/Grafana stack components.
+2. **Deploying** the stack repeatably via Ansible + Helm.
+3. **Bridging** everything to agents through one MCP server with resources, tools, and prompts.
+
+---
+
+## Vision
+
+A single `ansible-playbook` run (or equivalent) turns a bare k3s/kubeadm host into a monitored, agent-ready environment. An LLM agent connects to the MCP bridge and can answer questions like:
+
+- *What is unhealthy right now?*
+- *Which pods are crash-looping and why?*
+- *Is disk or memory pressure building?*
+- *What changed in the cluster since yesterday?*
+
+The agent reasons in **PromQL** and **LogQL** — industry-standard query languages — and calls parameterized tools rather than scraping raw APIs itself.
+
+---
+
+## Design Principles
+
+| Principle | What it means |
+|-----------|---------------|
+| **Read-only by default** | The MCP bridge and its ServiceAccount only `get`/`list`/`watch`. No cluster mutations through this path. |
+| **Standard stack** | Prometheus, Loki, kube-state-metrics, node-exporter — not custom collectors unless necessary. |
+| **MCP as the interface** | One bridge, one contract. Agents do not talk to Prometheus/Loki/K8s APIs directly. |
+| **Deployable in one shot** | Ansible playbook + Helm charts; no manual chart-by-chart assembly. |
+| **Least privilege** | RBAC scoped to observation; NetworkPolicy limits egress; consider mTLS/OIDC for external exposure. |
+| **Agent-first ergonomics** | Pre-built resources (saved queries), tools (parameterized calls), and prompts (triage playbooks). |
+
+---
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│  LLM Agent (MCP client)                                     │
+└──────────────────────────┬──────────────────────────────────┘
+                           │ MCP (resources / tools / prompts)
+┌──────────────────────────▼──────────────────────────────────┐
+│  mcp-telemetry-bridge  (FastAPI, namespace: mcp)            │
+│  Read-only proxy to Prometheus, Loki, Kubernetes API        │
+└──────┬─────────────────┬────────────────────┬───────────────┘
+       │                 │                    │
+┌──────▼──────┐  ┌───────▼───────┐   ┌────────▼────────┐
+│ Prometheus  │  │ Loki          │   │ Kubernetes API  │
+│ Alertmanager│  │ Promtail      │   │ (in-cluster SA) │
+│ Grafana     │  │               │   │                 │
+│ KSM         │  │               │   │                 │
+│ node-export │  │               │   │                 │
+└─────────────┘  └───────────────┘   └─────────────────┘
+       monitoring namespace    logging namespace
+```
+
+**Optional:** OpenTelemetry Collector for OTLP fan-out to Prometheus remote-write and Loki.
+
+**Future:** Host-level DaemonSet sidecar for systemd status, package/cert checks, and other node facts not available through K8s metrics alone.
+
+---
+
+## What We Capture
+
+### Minimum viable (current target)
+
+**Kubernetes**
+- Cluster & node status, conditions, taints
+- Namespaces, Deployments, StatefulSets, DaemonSets, Pods (phase, restarts, images)
+- Services, Events (especially Warning/Error)
+- Resource usage via Prometheus/cAdvisor/kube-state-metrics
+
+**Logs & alerts**
+- Pod and node logs via Loki/Promtail
+- Default alert rules: node not ready, CrashLoopBackOff, API/etcd degradation, job failures
+
+**Bridge surface**
+- Tools: `promql.query`, `loki.query`, `k8s.get`, `k8s.events`, `inventory.snapshot`
+- Resources: saved PromQL/LogQL queries, cluster inventory snapshots
+- Prompts: triage and operational playbooks
+
+### Stretch (explicitly deferred)
+
+- Host OS depth: systemd units, package updates, cert expiry, firewall summary, NTP drift
+- `systemd.status`, `tail.pod_logs` tools
+- Alertmanager API integration for active-alert summaries
+- Full MCP transport (stdio/SSE) to replace the current HTTP schema approximation
+- Multi-cluster federation
+- Write/mutate operations (out of scope unless a separate, gated path is designed)
+
+---
+
+## Repository Layout
+
+| Path | Role |
+|------|------|
+| `ansible/` | Bootstrap: install Helm, deploy all charts |
+| `helm/values/` | Opinionated values for kube-prometheus-stack, Loki, OTel |
+| `helm/mcp-telemetry-bridge/` | Bridge chart: Deployment, RBAC, Service, NetworkPolicy |
+| `mcp-telemetry-bridge/` | FastAPI application implementing the MCP surface |
+| `environments/` | Per-environment overrides and notes |
+| `wiki/` | Extended design notes and blueprint |
+
+---
+
+## Current State (as of initial scaffold)
+
+**Done**
+- Ansible playbook with `k8s_host` + `telemetry_stack` roles
+- Helm values for monitoring, logging, optional OTel collector
+- MCP bridge service with core tools and saved-query resources
+- Read-only ClusterRole/Binding for the bridge ServiceAccount
+- NetworkPolicy skeleton for the bridge
+- Health check and `/mcp/schema` discovery endpoint
+
+**Not yet done / known gaps**
+- Bridge image is a placeholder (`ghcr.io/example/telemcp-bridge`); needs CI build and publish
+- MCP interface is HTTP REST-shaped, not full MCP protocol transport
+- Prompts: only a `Triage-Now` stub exists; `Capacity-Check` and `CrashLoop-Playbook` are missing
+- No Alertmanager integration in the bridge
+- No metrics-server chart (useful for `kubectl top` semantics)
+- No host-level DaemonSet sidecar for systemd/OS signals
+- NetworkPolicy egress may need K8s API (443) allowance
+- Wiki and README are currently aligned to INTENT; they still need updating whenever scope shifts
+
+---
+
+## Success Criteria
+
+We know TeleMcp is working when:
+
+1. `ansible-playbook` brings up the `monitoring`, `logging`, and `mcp` namespaces with healthy pods.
+2. A `curl` request to the bridge's `/mcp/schema` endpoint returns resources, tools, and prompts.
+3. An MCP-capable agent can query PromQL, run LogQL, list cluster objects, and pull an inventory snapshot **without direct API credentials**.
+4. Default alert rules fire on induced failures (node pressure, crash loop) and the agent can triage them via bridge tools.
+5. The entire stack redeploys cleanly on a fresh Ubuntu 24.04 + k3s/kubeadm host.
+
+---
+
+## Non-Goals
+
+- Replacing Grafana or building a custom metrics database
+- Providing arbitrary shell/exec access to the cluster
+- Mutating cluster state (deploy, scale, delete) through the bridge
+- Supporting non-Linux or non-Kubernetes targets in v1
+- Vendor-specific APM (Datadog, New Relic, etc.) — OTel fan-out is the extension point
+
+---
+
+## How to Use This Document
+
+- **Prioritize work** against the "Current State" gaps and "Minimum viable" capture list.
+- **Reject scope creep** that does not serve agent observability or repeatable deployment.
+- **Update this file** when intent shifts — e.g., adding write paths, new environments, or MCP transport changes.
+
+For operational quick-start, see [README.md](README.md).  
+For detailed component rationale, see [wiki/TeleMcpBlueprint.md](wiki/TeleMcpBlueprint.md).
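
The `promql.query` and `loki.query` tools described in INTENT.md are parameterized wrappers around the backends' HTTP query APIs. A minimal sketch of the URL construction such a wrapper might perform is shown here. The service hostnames are assumptions (they depend on the Helm release names and are not stated in this commit), and the function names are hypothetical. The ports (9090, 3100) come from the blueprint's security notes, and the paths are the standard Prometheus instant-query and Loki range-query endpoints.

```python
from urllib.parse import urlencode

# Assumed in-cluster addresses; real hostnames depend on the Helm release names.
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"
LOKI_URL = "http://loki.logging.svc:3100"


def promql_query_url(expr: str, base: str = PROMETHEUS_URL) -> str:
    """Build a GET URL for a Prometheus instant query (/api/v1/query)."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


def loki_query_url(logql: str, limit: int = 100, base: str = LOKI_URL) -> str:
    """Build a GET URL for a Loki range query (/loki/api/v1/query_range)."""
    params = urlencode({"query": logql, "limit": limit})
    return f"{base}/loki/api/v1/query_range?{params}"
```

Encoding the expression with `urlencode` rather than string concatenation keeps agent-supplied PromQL/LogQL from leaking into the URL structure, which fits the read-only proxy role.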
--- a/README.md
+++ b/README.md
@@ -1,55 +1,103 @@
 # TeleMcp

-Telemetry + MCP bridge that auto-deploys on a Linux-based Kubernetes host via **Ansible + Helm**.
-It exposes read-only metrics, logs, and k8s object state through an **MCP server** so an LLM agent can bootstrap, monitor, and operate the host.
+**Mission control for Kubernetes hosts, exposed to LLM agents through MCP.**
+
+TeleMcp deploys a standard observability stack onto a Linux Kubernetes host via **Ansible + Helm**, then surfaces metrics, logs, and cluster state through a read-only **MCP bridge** so an LLM agent can bootstrap, monitor, triage, and operate the box.
+
+> For project goals, scope, and design principles, see **[INTENT.md](INTENT.md)**.

 ## Components
- **kube-prometheus-stack** (Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics)
- **Loki + Promtail** (logs)
- **OpenTelemetry Collector** (optional fan-out)
- **mcp-telemetry-bridge** (FastAPI service exposing MCP resources/tools/prompts)
+
+| Component | Namespace | Role |
+|-----------|-----------|------|
+| **kube-prometheus-stack** | `monitoring` | Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics |
+| **Loki + Promtail** | `logging` | Log aggregation and shipping |
+| **OpenTelemetry Collector** | `observability` | Optional OTLP fan-out to Prometheus and Loki |
+| **mcp-telemetry-bridge** | `mcp` | FastAPI service exposing MCP resources, tools, and prompts |

 ## Quick Start

 ### 0) Prereqs
+
 - Ubuntu 24.04 host with k8s (k3s or kubeadm) reachable and `kubectl` context configured
 - Ansible 2.15+ on your control machine
 - Helm 3 on the host (Ansible role installs if missing)

 ### 1) Run Ansible
+
 ```bash
 cd ansible
 ansible-playbook -i inventories/local.ini playbook.yml
 ```

-### 2) Smoke tests (from any machine with kubectl context)
+### 2) Smoke tests
+
+From any machine with a `kubectl` context:
+
 ```bash
 kubectl get pods -n monitoring
 kubectl get pods -n logging
 kubectl get pods -n mcp
 kubectl port-forward -n mcp svc/mcp-telemetry-bridge 8080:80
 curl http://localhost:8080/mcp/schema | jq .
+curl http://localhost:8080/healthz
 ```

-### 3) Point your LLM Agent
-Configure your agent's MCP client to the service endpoint (ClusterIP/Ingress).
-Use tools:
- `promql.query`
- `loki.query`
- `k8s.get`
- `k8s.events`
- `inventory.snapshot`
+### 3) Point your LLM agent
+
+Point your agent's MCP client at the bridge endpoint (ClusterIP, Ingress, or port-forward).
+
+**Implemented tools:**
+
+| Tool | Description |
+|------|-------------|
+| `promql.query` | Run a PromQL expression against Prometheus |
+| `loki.query` | Run a LogQL query against Loki |
+| `k8s.get` | Fetch Kubernetes objects (pods, nodes, deployments, etc.) |
+| `k8s.events` | List cluster or namespace events |
+| `inventory.snapshot` | JSON snapshot of nodes, namespaces, and workloads |
+
+**Saved resources** (via `/mcp/resource?uri=...`):
+
+- `res://dashboards/top-pods-by-cpu.promql`
+- `res://dashboards/pod-restarts.promql`
+- `res://dashboards/warn-events.logql`
+
+> The bridge currently exposes an HTTP schema approximation (`/mcp/schema`, `/tools/...`). Full MCP transport (stdio/SSE) is planned — see [INTENT.md](INTENT.md).

 ## Repo layout
+
 ```
 tele-mcp/
-  ansible/
+  INTENT.md                 # Project north star — goals, scope, current state
+  ansible/                  # Bootstrap playbook and roles
  helm/
-  mcp-telemetry-bridge/
-  environments/
+    values/                 # Chart values for monitoring, logging, OTel
+    mcp-telemetry-bridge/   # Bridge Helm chart
+  mcp-telemetry-bridge/     # FastAPI bridge application
+  environments/             # Per-environment overrides
+  wiki/                     # Extended project and design docs
 ```

+## Documentation
+
+| Document | Purpose |
+|----------|---------|
+| [INTENT.md](INTENT.md) | Goals, principles, scope, success criteria |
+| [wiki/TeleMcpProject.md](wiki/TeleMcpProject.md) | Project overview and audience |
+| [wiki/TeleMcpBlueprint.md](wiki/TeleMcpBlueprint.md) | Component rationale and bridge design |
+| [environments/dev/README.md](environments/dev/README.md) | Dev environment notes |
+
 ## Security
- MCP bridge ServiceAccount is read-only (RBAC get/list/watch)
- Optional NetworkPolicy limits egress/ingress
- Consider mTLS/OIDC if exposing outside the cluster
+
+- MCP bridge ServiceAccount is read-only (`get` / `list` / `watch` only)
+- NetworkPolicy limits bridge egress to Prometheus and Loki (an allowance for the Kubernetes API on 443 may still be needed)
+- Consider mTLS or OIDC if exposing the bridge outside the cluster
+
+## Current limitations
+
+See [INTENT.md — Current State](INTENT.md#current-state-as-of-initial-scaffold) for the full list. Notable gaps:
+
+- Bridge container image is a placeholder (`ghcr.io/example/telemcp-bridge`)
+- No Alertmanager integration in the bridge yet
+- Host-level signals (systemd, certs, firewall) are deferred to a future DaemonSet sidecar
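
The `/mcp/schema` smoke test from the README quick start can be turned into a scripted check that the five implemented tools are advertised. The sketch below assumes the schema response is JSON with a top-level `tools` list whose entries carry a `name` field. That shape is not shown in this commit, so treat it as a hypothetical, along with the `missing_tools` helper itself.

```python
import json

# Tool names taken from the README's "Implemented tools" table.
EXPECTED_TOOLS = {
    "promql.query",
    "loki.query",
    "k8s.get",
    "k8s.events",
    "inventory.snapshot",
}


def missing_tools(schema_json: str) -> list[str]:
    """Return the expected tool names absent from a schema response body.

    Assumes a top-level "tools" list of objects with a "name" key
    (an assumption about the bridge's schema format).
    """
    schema = json.loads(schema_json)
    advertised = {tool.get("name") for tool in schema.get("tools", [])}
    return sorted(EXPECTED_TOOLS - advertised)
```

Feeding it the body returned by the `curl` command in step 2 would yield an empty list when every tool is present.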
--- a/wiki/TeleMcpBlueprint.md
+++ b/wiki/TeleMcpBlueprint.md
@@ -0,0 +1,183 @@
+# TeleMcp Blueprint
+
+*Building a Kubernetes telemetry MCP bridge*
+
+> **Source:** [Original design conversation](https://chatgpt.com/share/68bdf06d-8c2c-8009-90c5-466f9f531d9a)  
+> **Authority:** Scope and priorities are governed by [INTENT.md](../INTENT.md). This document explains *why* each component exists and *how* the bridge is shaped.
+
+## Overview
+
+This is a blueprint for a telemetry service plus MCP bridge that auto-deploys on a Linux-based Kubernetes host (k3s or standard k8s) via Ansible + Helm, and exposes the metrics, logs, and cluster state an LLM agent needs to bootstrap, monitor, and operate the box.
+
+MCP acts as the standardized "USB-C" between the LLM agent and your telemetry — see the [Model Context Protocol spec](https://modelcontextprotocol.io).
+
+---
+
+## What we capture
+
+### Minimum viable (current target)
+
+**Kubernetes (control + workloads)**
+
+- Cluster and node status, taints, conditions, kubelet health
+- Namespaces, Deployments, StatefulSets, DaemonSets, Pods (phase, restarts, images, age)
+- Services, Events (warning/error)
+- Resource usage per pod/node/namespace via Prometheus, cAdvisor, and kube-state-metrics
+
+**Logs and alerts**
+
+- Pod and node logs via Loki/Promtail
+- Default alert rules: node not ready, API/etcd degradation, CrashLoopBackOff, job failures
+
+**Bridge surface**
+
+- Tools: `promql.query`, `loki.query`, `k8s.get`, `k8s.events`, `inventory.snapshot`
+- Resources: saved PromQL/LogQL queries, cluster inventory snapshots
+- Prompts: triage and operational playbooks (`Triage-Now` implemented; others planned)
+
+### Stretch (deferred)
+
+**Host (Linux / node)**
+
+- CPU, memory, disk, inode, filesystem, network, NIC errors *(partially covered by node-exporter)*
+- Distro/kernel/version, packages/updates
+- Systemd unit status for key services (container runtime, kubelet, nginx, etc.)
+- Certificates (expiry), time sync status (chrony/ntp)
+- Firewall/ports (nftables/ufw summary)
+
+**Additional Kubernetes signals**
+
+- Ingress, Jobs/CronJobs, HPA/VPA
+- Throttling and OOM kill detail beyond default metrics
+
+**Additional bridge capabilities**
+
+- `systemd.status`, `tail.pod_logs` tools
+- Alertmanager API for active-alert summaries
+- Full MCP transport (stdio/SSE) to replace the current HTTP schema approximation
+
+---
+
+## Reference architecture
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│  LLM Agent (MCP client)                                     │
+└──────────────────────────┬──────────────────────────────────┘
+                           │ MCP (resources / tools / prompts)
+┌──────────────────────────▼──────────────────────────────────┐
+│  mcp-telemetry-bridge  (FastAPI, namespace: mcp)            │
+│  Read-only proxy to Prometheus, Loki, Kubernetes API        │
+└──────┬─────────────────┬────────────────────┬───────────────┘
+       │                 │                    │
+┌──────▼──────┐  ┌───────▼───────┐   ┌────────▼────────┐
+│ Prometheus  │  │ Loki          │   │ Kubernetes API  │
+│ Alertmanager│  │ Promtail      │   │ (in-cluster SA) │
+│ Grafana     │  │               │   │                 │
+│ KSM         │  │               │   │                 │
+│ node-export │  │               │   │                 │
+└─────────────┘  └───────────────┘   └─────────────────┘
+```
+
+### On the cluster
+
+| Component | Status | Role |
+|-----------|--------|------|
+| [kube-prometheus-stack](https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack) | **Deployed** | Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics, default rules |
+| [Loki](https://grafana.com/docs/loki/latest/) + [Promtail](https://grafana.com/docs/loki/latest/send-data/promtail/) | **Deployed** | Log aggregation and shipping |
+| [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) | **Deployed (optional)** | OTLP in → Prometheus remote-write / Loki out |
+| [metrics-server](https://github.com/kubernetes-sigs/metrics-server) | Planned | Live resource metrics (`kubectl top` semantics) |
+| Host DaemonSet sidecar | Planned | systemd, cert, and OS-level facts |
+
+We use standard CNCF pieces so agents reason in **PromQL** and **LogQL** and call a single MCP server for answers.
+
+---
+
+## Why these charts?
+
+| Chart | Rationale |
+|-------|-----------|
+| **kube-prometheus-stack** | One Helm install for Prometheus Operator, Alertmanager, Grafana, node-exporter, KSM, dashboards, and alert rules |
+| **Loki + Promtail** | Cheap, scalable log storage without bolting logs into Prometheus |
+| **OTel Collector** | Vendor-agnostic OTLP ingress; fan-out to existing backends without re-architecting |
+
+Ansible copies opinionated values from `helm/values/` and runs `helm upgrade --install` for each chart. See `ansible/roles/telemetry_stack/tasks/main.yml`.
+
+---
+
+## MCP Telemetry Bridge
+
+The bridge (`mcp-telemetry-bridge/`) is the key piece: a small FastAPI service that exposes an HTTP approximation of the MCP surface (resources, tools, prompts).
+
+### Implementation status
+
+| Capability | Status |
+|------------|--------|
+| FastAPI service with health check | Done |
+| `/mcp/schema` discovery endpoint | Done |
+| `promql.query` | Done |
+| `loki.query` | Done |
+| `k8s.get` | Done |
+| `k8s.events` | Done |
+| `inventory.snapshot` | Done |
+| Saved PromQL/LogQL resources | Done (3 queries) |
+| `Triage-Now` prompt | Stub |
+| `Capacity-Check`, `CrashLoop-Playbook` prompts | Planned |
+| `systemd.status` | Planned (requires DaemonSet sidecar) |
+| `tail.pod_logs` | Planned |
+| Alertmanager API | Planned |
+| Full MCP protocol transport | Planned |
+
+### Read-only backends
+
+The bridge talks read-only to:
+
+- **Prometheus HTTP API** — instant and range queries
+- **Loki HTTP API** — LogQL queries
+- **Kubernetes API** — ServiceAccount with RBAC `get`/`list`/`watch`
+- **Alertmanager API** — planned for active-alert summaries
+- **Node sidecar HTTP** — planned for host-level facts
+
+### Tools (target API)
+
+```
+promql.query(expr, range?)
+loki.query(logql, limit?, since?)
+k8s.get(kind, namespace?, name?)
+k8s.events(namespace?, since?)
+inventory.snapshot() → JSON
+systemd.status(unit)          # planned
+```
+
+### Resources
+
+```
+res://dashboards/top-pods-by-cpu.promql    # implemented
+res://dashboards/pod-restarts.promql       # implemented
+res://dashboards/warn-events.logql         # implemented
+res://snapshots/cluster-inventory.json     # planned (dynamic)
+```
+
+### Prompts
+
+```
+Triage-Now           # stub — summarize alerts, top offenders, recent warnings
+Capacity-Check       # planned
+CrashLoop-Playbook   # planned
+```
+
+---
+
+## Security model
+
+- Bridge runs under a dedicated ServiceAccount with a ClusterRole limited to `get`/`list`/`watch`
+- NetworkPolicy restricts egress to Prometheus (9090) and Loki (3100); K8s API (443) allowance may be needed
+- External exposure should use mTLS or OIDC, because the bridge performs no authentication of its own in v1
+
+---
+
+## Related docs
+
+- [INTENT.md](../INTENT.md) — goals, scope, success criteria, known gaps
+- [README.md](../README.md) — quick start and smoke tests
+- [TeleMcpProject.md](TeleMcpProject.md) — project overview and audience
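
The `res://` URIs listed in the blueprint's Resources section follow a regular shape: a category, a name, and an extension that identifies the query language. A minimal sketch of how a client or the bridge might split such a URI is given below. The `parse_resource_uri` helper and the returned keys are hypothetical, and nothing here is taken from the bridge source.

```python
from urllib.parse import urlsplit

# Extension-to-format mapping inferred from the resource list in the blueprint.
FORMATS = {"promql": "PromQL", "logql": "LogQL", "json": "JSON"}


def parse_resource_uri(uri: str) -> dict[str, str]:
    """Split a res:// URI into category, name, and format."""
    parts = urlsplit(uri)
    if parts.scheme != "res":
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    # "top-pods-by-cpu.promql" -> ("top-pods-by-cpu", ".", "promql")
    name, _, ext = parts.path.lstrip("/").rpartition(".")
    if not name or ext not in FORMATS:
        raise ValueError(f"unrecognised resource: {uri!r}")
    return {"category": parts.netloc, "name": name, "format": FORMATS[ext]}
```

Rejecting unknown schemes and extensions up front keeps the resource endpoint from being used to request arbitrary paths.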
--- a/wiki/TeleMcpProject.md
+++ b/wiki/TeleMcpProject.md
@@ -0,0 +1,73 @@
+# TeleMcp Project
+
+*Telemetry for autonomous control*
+
+## What is TeleMcp?
+
+TeleMcp is **mission control for Kubernetes hosts**. It collects health, performance, and alert signals from a Linux k8s cluster and exposes them through a single **Model Context Protocol (MCP)** interface so intelligent assistants can understand what's happening, triage problems, and help keep systems running smoothly — without constant human supervision.
+
+The project name reflects its two halves:
+
+- **Tele** — telemetry: metrics, logs, events, and cluster inventory
+- **MCP** — the standardized bridge between observability backends and LLM agents
+
+## Who is it for?
+
+- **Operators** who want repeatable, one-command observability on a k3s or kubeadm host
+- **LLM agent builders** who need a safe, read-only API for cluster situational awareness
+- **Developers** running local or edge Kubernetes who want agent-assisted monitoring without wiring up bespoke integrations
+
+## What problem does it solve?
+
+Running a Kubernetes host means tracking signals across many systems. Humans reach for Grafana, `kubectl`, and ad-hoc PromQL. Agents need the same information through a **standardized, safe contract** — not raw shell access or scattered API credentials.
+
+TeleMcp solves this in three steps:
+
+1. **Collect** — deploy Prometheus, Loki, and supporting exporters via Helm
+2. **Deploy** — bootstrap everything with a single Ansible playbook
+3. **Bridge** — expose resources, tools, and prompts through `mcp-telemetry-bridge`
+
+## What can an agent do today?
+
+With the current scaffold, an agent connected to the bridge can:
+
+- Query Prometheus with `promql.query`
+- Search logs with `loki.query`
+- Inspect Kubernetes objects with `k8s.get` and `k8s.events`
+- Pull a cluster inventory snapshot with `inventory.snapshot`
+- Use pre-built PromQL/LogQL resources for common triage queries
+
+## What is planned?
+
+Stretch goals — explicitly deferred in v1 — include host-level signals (systemd status, cert expiry, firewall summary), Alertmanager integration, additional prompts (`Capacity-Check`, `CrashLoop-Playbook`), and full MCP protocol transport. See [INTENT.md](../INTENT.md) for the authoritative scope list.
+
+## Design principles
+
+| Principle | Summary |
+|-----------|---------|
+| Read-only by default | No cluster mutations through the bridge |
+| Standard stack | CNCF/Grafana components, not custom collectors |
+| MCP as the interface | One bridge, one contract for agents |
+| Deployable in one shot | Ansible + Helm, no manual assembly |
+| Least privilege | Scoped RBAC and NetworkPolicy |
+
+## Repository map
+
+| Path | Contents |
+|------|----------|
+| [INTENT.md](../INTENT.md) | North star — goals, scope, current state |
+| [README.md](../README.md) | Quick start and operational guide |
+| [TeleMcpBlueprint.md](TeleMcpBlueprint.md) | Architecture and component rationale |
+| `ansible/` | Bootstrap playbook |
+| `helm/` | Chart values and bridge chart |
+| `mcp-telemetry-bridge/` | FastAPI bridge source |
+
+## Success criteria
+
+TeleMcp is working when:
+
+1. `ansible-playbook` brings up healthy pods in `monitoring`, `logging`, and `mcp` namespaces
+2. `/mcp/schema` returns resources, tools, and prompts
+3. An agent can query metrics, logs, and cluster state without direct API credentials
+4. Default alert rules fire on induced failures and the agent can triage them
+5. The stack redeploys cleanly on a fresh Ubuntu 24.04 + k3s/kubeadm host
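
The "Read-only by default" and "Least privilege" principles in TeleMcpProject.md can be checked mechanically against the bridge's ClusterRole. The sketch below assumes the role's `rules` have already been loaded into Python dictionaries in the standard Kubernetes RBAC shape (`apiGroups`, `resources`, `verbs`). The `non_read_only_verbs` helper is hypothetical and is not part of the repository.

```python
READ_ONLY_VERBS = frozenset({"get", "list", "watch"})


def non_read_only_verbs(rules: list[dict]) -> set[str]:
    """Return every verb granted by the rules that is not get/list/watch.

    A wildcard verb ("*") is reported too, since it implies write access.
    """
    granted: set[str] = set()
    for rule in rules:
        granted.update(rule.get("verbs", []))
    return granted - READ_ONLY_VERBS
```

An empty result means the role grants only observation verbs, and anything else (for example `delete` or `*`) would violate the read-only contract.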