# TeleMcp — Project Intent

> **Mission control for Kubernetes hosts, exposed to LLM agents through MCP.**

TeleMcp is a self-contained observability stack that deploys onto a Linux Kubernetes host and surfaces metrics, logs, and cluster state through a single **Model Context Protocol (MCP)** bridge. The goal is to let an autonomous agent — or a human with an agent — **bootstrap, monitor, triage, and operate** a box without bespoke integrations or constant human supervision.

This document anchors what we are building, why, and what is in scope. When in doubt, prefer the simplest path that gives an agent reliable, read-only situational awareness.

---

## Problem

Operating a Kubernetes host means juggling many signals across many systems: node health, workload status, logs, alerts, certificates, systemd units, and more. Humans use Grafana dashboards, `kubectl`, and ad-hoc PromQL/LogQL. LLM agents need the same information, but through a **standardized, safe interface** — not raw shell access.

TeleMcp closes that gap by:

1. **Collecting** telemetry with proven CNCF/Grafana stack components.
2. **Deploying** the stack repeatably via Ansible + Helm.
3. **Bridging** everything to agents through one MCP server with resources, tools, and prompts.

---

## Vision

A single `ansible-playbook` run (or an equivalent one-shot command) turns a bare k3s/kubeadm host into a monitored, agent-ready environment. An LLM agent then connects to the MCP bridge and can answer questions like:

- *What is unhealthy right now?*
- *Which pods are crash-looping and why?*
- *Is disk or memory pressure building?*
- *What changed in the cluster since yesterday?*

The agent reasons in **PromQL** and **LogQL**, the native query languages of Prometheus and Loki, and calls parameterized tools rather than scraping raw APIs itself.
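
A parameterized tool call can be pictured as a thin wrapper that turns validated arguments into a request against the backend's HTTP API. The sketch below is an illustration, not the bridge's actual code. It builds URLs for the Prometheus instant-query endpoint (`/api/v1/query`) and the Loki range-query endpoint (`/loki/api/v1/query_range`). The base URLs and function names are assumptions and do not come from this repository.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical in-cluster base URLs; the real Service names depend on the Helm release.
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"
LOKI_URL = "http://loki.logging.svc:3100"


def promql_instant_url(query: str, time: Optional[float] = None) -> str:
    """Build a Prometheus instant-query URL (GET /api/v1/query)."""
    params = {"query": query}
    if time is not None:
        # Prometheus accepts a Unix timestamp in seconds for `time`.
        params["time"] = f"{time:.3f}"
    return f"{PROMETHEUS_URL}/api/v1/query?{urlencode(params)}"


def logql_range_url(query: str, start_ns: int, end_ns: int, limit: int = 100) -> str:
    """Build a Loki range-query URL (GET /loki/api/v1/query_range)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    if end_ns < start_ns:
        raise ValueError("end_ns must not be earlier than start_ns")
    # Loki accepts nanosecond Unix epochs for `start` and `end`.
    params = {"query": query, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{LOKI_URL}/loki/api/v1/query_range?{urlencode(params)}"
```

Keeping URL construction in one place means the agent only ever supplies a query string and a time window; it never sees backend addresses or credentials.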

---

## Design Principles

| Principle | What it means |
|-----------|---------------|
| **Read-only by default** | The MCP bridge, through its ServiceAccount, can only `get`/`list`/`watch`. No cluster mutations go through this path. |
| **Standard stack** | Prometheus, Loki, kube-state-metrics, node-exporter — not custom collectors unless necessary. |
| **MCP as the interface** | One bridge, one contract. Agents do not talk to Prometheus/Loki/K8s APIs directly. |
| **Deployable in one shot** | Ansible playbook + Helm charts; no manual chart-by-chart assembly. |
| **Least privilege** | RBAC scoped to observation; NetworkPolicy limits egress; consider mTLS/OIDC for external exposure. |
| **Agent-first ergonomics** | Pre-built resources (saved queries), tools (parameterized calls), and prompts (triage playbooks). |
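
The read-only and least-privilege principles can be checked mechanically. As a minimal sketch, the function below scans a ClusterRole (represented as a plain dictionary) and reports any verb other than `get`, `list`, or `watch`. The role name and resource lists are hypothetical examples, not the contents of the bridge chart.

```python
READ_ONLY_VERBS = frozenset({"get", "list", "watch"})


def mutating_verbs(cluster_role: dict) -> set:
    """Return every verb in the role's rules that is not get/list/watch."""
    found = set()
    for rule in cluster_role.get("rules", []):
        # The wildcard verb "*" is not in READ_ONLY_VERBS, so it is flagged too.
        found.update(set(rule.get("verbs", [])) - READ_ONLY_VERBS)
    return found


# Hypothetical role for illustration only.
bridge_role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "ClusterRole",
    "metadata": {"name": "mcp-telemetry-bridge-readonly"},
    "rules": [
        {
            "apiGroups": [""],
            "resources": ["nodes", "namespaces", "pods", "services", "events"],
            "verbs": ["get", "list", "watch"],
        },
        {
            "apiGroups": ["apps"],
            "resources": ["deployments", "statefulsets", "daemonsets"],
            "verbs": ["get", "list", "watch"],
        },
    ],
}

if __name__ == "__main__":
    offending = mutating_verbs(bridge_role)
    print("read-only" if not offending else f"mutating verbs found: {sorted(offending)}")
```

A check like this could run in CI against the rendered Helm template so that a write verb never slips into the bridge's role unnoticed.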

---

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│  LLM Agent (MCP client)                                     │
└──────────────────────────┬──────────────────────────────────┘
                           │ MCP (resources / tools / prompts)
┌──────────────────────────▼──────────────────────────────────┐
│  mcp-telemetry-bridge  (FastAPI, namespace: mcp)            │
│  Read-only proxy to Prometheus, Loki, Kubernetes API        │
└──────┬─────────────────┬────────────────────┬───────────────┘
       │                 │                    │
┌──────▼──────┐  ┌───────▼───────┐   ┌────────▼────────┐
│ Prometheus  │  │ Loki          │   │ Kubernetes API  │
│ Alertmanager│  │ Promtail      │   │ (in-cluster SA) │
│ Grafana     │  │               │   │                 │
│ KSM         │  │               │   │                 │
│ node-export │  │               │   │                 │
└─────────────┘  └───────────────┘   └─────────────────┘
       monitoring namespace    logging namespace
```
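
The fan-out in the diagram amounts to a routing decision per tool call. The sketch below shows one way this could look, routing on the tool-name prefix. The mapping is an assumption, in particular the choice to serve `inventory.*` from the Kubernetes API.

```python
# Tool-name prefix -> backend, following the three boxes in the diagram.
BACKENDS = {
    "promql": "prometheus",
    "loki": "loki",
    "k8s": "kubernetes",
    "inventory": "kubernetes",  # assumption: snapshots are built from the K8s API
}


def backend_for(tool_name: str) -> str:
    """Resolve a dotted tool name such as 'promql.query' to its backend."""
    prefix, sep, action = tool_name.partition(".")
    if not sep or not action or prefix not in BACKENDS:
        # Unknown tools are rejected rather than forwarded anywhere.
        raise KeyError(f"unknown tool: {tool_name!r}")
    return BACKENDS[prefix]
```

Rejecting anything outside the table keeps the bridge's contract closed: a tool that is not listed has no path to any backend.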

**Optional:** OpenTelemetry Collector for OTLP fan-out to Prometheus remote-write and Loki.

**Future:** Host-level DaemonSet sidecar for systemd status, package/cert checks, and other node facts not available through K8s metrics alone.

---

## What We Capture

### Minimum viable (current target)

**Kubernetes**
- Cluster & node status, conditions, taints
- Namespaces, Deployments, StatefulSets, DaemonSets, Pods (phase, restarts, images)
- Services, Events (especially Warning/Error)
- Resource usage via Prometheus (cAdvisor and node-exporter metrics), plus requests/limits and object state from kube-state-metrics

**Logs & alerts**
- Pod and node logs via Loki/Promtail
- Default alert rules: node not ready, CrashLoopBackOff, API/etcd degradation, job failures

**Bridge surface**
- Tools: `promql.query`, `loki.query`, `k8s.get`, `k8s.events`, `inventory.snapshot`
- Resources: saved PromQL/LogQL queries, cluster inventory snapshots
- Prompts: triage and operational playbooks
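
Saved-query resources could be stored as named templates that the bridge fills in before calling Prometheus or Loki. The following is a sketch under stated assumptions: the query names and the `render` helper are hypothetical, the metric names are standard kube-state-metrics series, and the Loki `namespace` label is assumed to be set by the log shipper's configuration.

```python
import re
from string import Template

# string.Template uses $name placeholders, which do not clash with the
# curly braces that PromQL and LogQL use for label selectors.
SAVED_QUERIES = {
    "nodes-not-ready": Template(
        'kube_node_status_condition{condition="Ready",status="true"} == 0'
    ),
    "crashloop-pods": Template(
        'kube_pod_container_status_waiting_reason'
        '{namespace="$namespace",reason="CrashLoopBackOff"} == 1'
    ),
    "error-logs": Template('{namespace="$namespace"} |= "error"'),
}

# Simplification: every parameter must look like an RFC 1123 label
# (the format of a Kubernetes namespace name), which blocks selector injection.
_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?")


def render(name: str, **params: str) -> str:
    """Fill a saved query's placeholders after validating each value."""
    for key, value in params.items():
        if not _LABEL.fullmatch(value):
            raise ValueError(f"invalid value for {key!r}: {value!r}")
    # substitute() raises KeyError if a required placeholder is missing.
    return SAVED_QUERIES[name].substitute(**params)
```

For example, `render("crashloop-pods", namespace="mcp")` yields a PromQL expression scoped to the `mcp` namespace, while a value containing quotes or braces is rejected before it reaches a backend.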

### Stretch (explicitly deferred)

- Host OS depth: systemd units, package updates, cert expiry, firewall summary, NTP drift
- `systemd.status`, `tail.pod_logs` tools
- Alertmanager API integration for active-alert summaries
- Full MCP transport (stdio/SSE) in place of the current HTTP schema approximation
- Multi-cluster federation
- Write/mutate operations (out of scope unless a separate, gated path is designed)

---

## Repository Layout

| Path | Role |
|------|------|
| `ansible/` | Bootstrap: install Helm, deploy all charts |
| `helm/values/` | Opinionated values for kube-prometheus-stack, Loki, OTel |
| `helm/mcp-telemetry-bridge/` | Bridge chart: Deployment, RBAC, Service, NetworkPolicy |
| `mcp-telemetry-bridge/` | FastAPI application implementing the MCP surface |
| `environments/` | Per-environment overrides and notes |
| `wiki/` | Extended design notes and blueprint |

---

## Current State (as of initial scaffold)

**Done**
- Ansible playbook with `k8s_host` + `telemetry_stack` roles
- Helm values for monitoring, logging, optional OTel collector
- MCP bridge service with core tools and saved-query resources
- Read-only ClusterRole/Binding for the bridge ServiceAccount
- NetworkPolicy skeleton for the bridge
- Health check and `/mcp/schema` discovery endpoint

**Not yet done / known gaps**
- Bridge image is a placeholder (`ghcr.io/example/telemcp-bridge`); needs CI build and publish
- MCP interface is REST-shaped HTTP, not a full MCP transport
- Prompts: only `Triage-Now` stub; missing `Capacity-Check`, `CrashLoop-Playbook`
- No Alertmanager integration in the bridge
- No metrics-server chart (useful for `kubectl top` semantics)
- No host-level DaemonSet sidecar for systemd/OS signals
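- NetworkPolicy egress may need to allow the Kubernetes API (Service port 443; the API server itself typically listens on 6443 on k3s/kubeadm)
- Wiki and README are aligned to INTENT as of this scaffold; they need to be kept updated when scope shifts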

---

## Success Criteria

We know TeleMcp is working when:

1. `ansible-playbook` brings up monitoring, logging, and bridge namespaces with healthy pods.
2. A `GET` on the bridge's `/mcp/schema` endpoint (for example with `curl`) returns resources, tools, and prompts.
3. An MCP-capable agent can query PromQL, run LogQL, list cluster objects, and pull an inventory snapshot **without direct API credentials**.
4. Default alert rules fire on induced failures (node pressure, crash loop) and the agent can triage them via bridge tools.
5. The entire stack redeploys cleanly on a fresh Ubuntu 24.04 + k3s/kubeadm host.
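
The second criterion lends itself to an automated smoke check. The sketch below assumes the `/mcp/schema` response is a JSON object with top-level `resources`, `tools`, and `prompts` keys. The real payload shape is not specified in this document, so treat the structure and the sample body as hypothetical.

```python
import json

REQUIRED_SECTIONS = ("resources", "tools", "prompts")


def missing_sections(schema_body: str) -> list:
    """Return the required sections that are absent or empty in the schema."""
    doc = json.loads(schema_body)
    if not isinstance(doc, dict):
        return list(REQUIRED_SECTIONS)
    return [section for section in REQUIRED_SECTIONS if not doc.get(section)]


# Hypothetical response body; only the tool and prompt names come from this document.
sample_body = json.dumps(
    {
        "resources": [{"name": "saved-queries"}],
        "tools": [{"name": "promql.query"}, {"name": "loki.query"}],
        "prompts": [{"name": "Triage-Now"}],
    }
)

if __name__ == "__main__":
    gaps = missing_sections(sample_body)
    print("schema OK" if not gaps else f"missing or empty: {gaps}")
```

Treating an empty list the same as a missing key is deliberate: a schema that advertises no tools fails the criterion as surely as one that omits the section.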

---

## Non-Goals

- Replacing Grafana or building a custom metrics database
- Providing arbitrary shell/exec access to the cluster
- Mutating cluster state (deploy, scale, delete) through the bridge
- Supporting non-Linux or non-Kubernetes targets in v1
- Vendor-specific APM (Datadog, New Relic, etc.) — OTel fan-out is the extension point

---

## How to Use This Document

- **Prioritize work** against the "Current State" gaps and "Minimum viable" capture list.
- **Reject scope creep** that does not serve agent observability or repeatable deployment.
- **Update this file** when intent shifts — e.g., adding write paths, new environments, or MCP transport changes.

For operational quick-start, see [README.md](README.md).
For detailed component rationale, see [wiki/TeleMcpBlueprint.md](wiki/TeleMcpBlueprint.md).