tele-mcp/INTENT.md
2026-06-22 19:09:24 +02:00

TeleMcp — Project Intent

Mission control for Kubernetes hosts, exposed to LLM agents through MCP.

TeleMcp is a self-contained observability stack that deploys onto a Linux Kubernetes host and surfaces metrics, logs, and cluster state through a single Model Context Protocol (MCP) bridge. The goal is to let an autonomous agent — or a human with an agent — bootstrap, monitor, triage, and operate a box without bespoke integrations or constant human supervision.

This document anchors what we are building, why, and what is in scope. When in doubt, prefer the simplest path that gives an agent reliable, read-only situational awareness.


Problem

Operating a Kubernetes host means juggling many signals across many systems: node health, workload status, logs, alerts, certificates, systemd units, and more. Humans use Grafana dashboards, kubectl, and ad-hoc PromQL/LogQL. LLM agents need the same information, but through a standardized, safe interface — not raw shell access.

TeleMcp closes that gap by:

  1. Collecting telemetry with proven CNCF/Grafana stack components.
  2. Deploying the stack repeatably via Ansible + Helm.
  3. Bridging everything to agents through one MCP server with resources, tools, and prompts.

Vision

A single ansible-playbook run (or equivalent) turns a bare k3s/kubeadm host into a monitored, agent-ready environment. An LLM agent connects to the MCP bridge and can answer questions like:

  • What is unhealthy right now?
  • Which pods are crash-looping and why?
  • Is disk or memory pressure building?
  • What changed in the cluster since yesterday?

The agent reasons in PromQL and LogQL — industry-standard query languages — and calls parameterized tools rather than scraping raw APIs itself.
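
A minimal sketch of what a parameterized tool call could look like from the agent's side. The request envelope (`tool`, `arguments`) and the helper `build_tool_call` are assumptions made for illustration, not the bridge's actual contract. The tool names come from the bridge surface listed in this document, and the metric names are the standard kube-state-metrics and node-exporter series.

```python
import json

# Tool names taken from the bridge surface described in this document.
KNOWN_TOOLS = {"promql.query", "loki.query", "k8s.get", "k8s.events", "inventory.snapshot"}


def build_tool_call(tool: str, **arguments) -> str:
    """Return a JSON request body for one tool call (hypothetical envelope)."""
    if tool not in KNOWN_TOOLS:
        raise ValueError(f"unknown tool: {tool}")
    return json.dumps({"tool": tool, "arguments": arguments}, sort_keys=True)


# "Which pods are crash-looping?" expressed as PromQL over kube-state-metrics.
crash_loops = build_tool_call(
    "promql.query",
    query='kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0',
)

# "Is disk pressure building?" expressed as PromQL over node-exporter metrics.
disk_pressure = build_tool_call(
    "promql.query",
    query="node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10",
)
```

Rejecting unknown tool names on the client side keeps the agent inside the published tool set. The bridge would still need to validate requests itself.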


Design Principles

| Principle | What it means |
|---|---|
| Read-only by default | The MCP bridge and its ServiceAccount only get/list/watch. No cluster mutations through this path. |
| Standard stack | Prometheus, Loki, kube-state-metrics, node-exporter; no custom collectors unless necessary. |
| MCP as the interface | One bridge, one contract. Agents do not talk to Prometheus/Loki/K8s APIs directly. |
| Deployable in one shot | Ansible playbook + Helm charts; no manual chart-by-chart assembly. |
| Least privilege | RBAC scoped to observation; NetworkPolicy limits egress; consider mTLS/OIDC for external exposure. |
| Agent-first ergonomics | Pre-built resources (saved queries), tools (parameterized calls), and prompts (triage playbooks). |
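
The read-only principle can be checked mechanically. The sketch below assumes rules shaped like the `rules` list of a Kubernetes ClusterRole. The function name `mutating_verbs` and the sample rules are hypothetical and are not taken from the bridge chart.

```python
# Verbs the "read-only by default" principle permits.
READ_ONLY_VERBS = frozenset({"get", "list", "watch"})


def mutating_verbs(rules: list[dict]) -> set[str]:
    """Return every verb in RBAC-style rules that is not get/list/watch."""
    found: set[str] = set()
    for rule in rules:
        found.update(set(rule.get("verbs", [])) - READ_ONLY_VERBS)
    return found


# Hypothetical rules, shaped like the `rules` list of a ClusterRole manifest.
bridge_rules = [
    {"apiGroups": [""], "resources": ["pods", "nodes", "events"], "verbs": ["get", "list", "watch"]},
    {"apiGroups": ["apps"], "resources": ["deployments"], "verbs": ["get", "list", "watch"]},
]

# An empty result means the role grants observation only.
violations = mutating_verbs(bridge_rules)
```

A wildcard verb (`*`) is reported as a violation too, since it is not one of the three permitted verbs. A check like this could run in CI against the rendered chart.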

Architecture

┌─────────────────────────────────────────────────────────────┐
│  LLM Agent (MCP client)                                     │
└──────────────────────────┬──────────────────────────────────┘
                           │ MCP (resources / tools / prompts)
┌──────────────────────────▼──────────────────────────────────┐
│  mcp-telemetry-bridge  (FastAPI, namespace: mcp)              │
│  Read-only proxy to Prometheus, Loki, Kubernetes API          │
└──────┬─────────────────┬────────────────────┬───────────────┘
       │                 │                    │
┌──────▼──────┐  ┌───────▼───────┐   ┌────────▼────────┐
│ Prometheus  │  │ Loki          │   │ Kubernetes API  │
│ Alertmanager│  │ Promtail      │   │ (in-cluster SA) │
│ Grafana     │  │               │   │                 │
│ KSM         │  │               │   │                 │
│ node-export │  │               │   │                 │
└─────────────┘  └───────────────┘   └─────────────────┘
       monitoring namespace    logging namespace

Optional: OpenTelemetry Collector for OTLP fan-out to Prometheus remote-write and Loki.

Future: Host-level DaemonSet sidecar for systemd status, package/cert checks, and other node facts not available through K8s metrics alone.
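
To make the proxy role concrete, here is a sketch of how the bridge might translate a query into backend requests. The endpoint paths are the documented Prometheus and Loki HTTP API routes. The Service addresses and function names are assumptions, since real Service names depend on the Helm release names.

```python
from urllib.parse import urlencode

# Hypothetical in-cluster Service addresses; real names depend on the Helm releases.
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"
LOKI_URL = "http://loki.logging.svc:3100"


def prometheus_instant_query_url(promql: str) -> str:
    """URL for the Prometheus HTTP API instant-query endpoint."""
    return f"{PROMETHEUS_URL}/api/v1/query?{urlencode({'query': promql})}"


def loki_range_query_url(logql: str, limit: int = 100) -> str:
    """URL for the Loki HTTP API range-query endpoint."""
    params = urlencode({"query": logql, "limit": limit})
    return f"{LOKI_URL}/loki/api/v1/query_range?{params}"
```

Both helpers only build GET URLs, which fits a read-only proxy: nothing in this path writes to either backend.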


What We Capture

Minimum viable (current target)

Kubernetes

  • Cluster & node status, conditions, taints
  • Namespaces, Deployments, StatefulSets, DaemonSets, Pods (phase, restarts, images)
  • Services, Events (especially Warning/Error)
  • Resource usage via Prometheus/cAdvisor/kube-state-metrics

Logs & alerts

  • Pod and node logs via Loki/Promtail
  • Default alert rules: node not ready, CrashLoopBackOff, API/etcd degradation, job failures

Bridge surface

  • Tools: promql.query, loki.query, k8s.get, k8s.events, inventory.snapshot
  • Resources: saved PromQL/LogQL queries, cluster inventory snapshots
  • Prompts: triage and operational playbooks
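
The three parts of the bridge surface could be grouped in a single discovery document. This is a sketch only: the `BridgeSchema` class and the saved-query name are hypothetical, and the real schema returned by the bridge may be shaped differently. The tool names and the Triage-Now prompt are the ones named in this document.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class BridgeSchema:
    """Hypothetical shape of a discovery document; not the bridge's real schema."""

    tools: list[str] = field(default_factory=list)
    resources: list[str] = field(default_factory=list)
    prompts: list[str] = field(default_factory=list)


schema = BridgeSchema(
    tools=["promql.query", "loki.query", "k8s.get", "k8s.events", "inventory.snapshot"],
    resources=["queries/promql/crashloop-pods"],  # illustrative saved-query name
    prompts=["Triage-Now"],
)

# Plain dict, ready to serialize as a JSON response body.
document = asdict(schema)
```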

Stretch (explicitly deferred)

  • Host OS depth: systemd units, package updates, cert expiry, firewall summary, NTP drift
  • Additional tools: systemd.status, tail.pod_logs
  • Alertmanager API integration for active-alert summaries
  • Full MCP transport (stdio/SSE) in place of the current HTTP schema approximation
  • Multi-cluster federation
  • Write/mutate operations (out of scope unless a separate, gated path is designed)

Repository Layout

| Path | Role |
|---|---|
| ansible/ | Bootstrap: install Helm, deploy all charts |
| helm/values/ | Opinionated values for kube-prometheus-stack, Loki, OTel |
| helm/mcp-telemetry-bridge/ | Bridge chart: Deployment, RBAC, Service, NetworkPolicy |
| mcp-telemetry-bridge/ | FastAPI application implementing the MCP surface |
| environments/ | Per-environment overrides and notes |
| wiki/ | Extended design notes and blueprint |

Current State (as of initial scaffold)

Done

  • Ansible playbook with k8s_host + telemetry_stack roles
  • Helm values for monitoring, logging, optional OTel collector
  • MCP bridge service with core tools and saved-query resources
  • Read-only ClusterRole/Binding for the bridge ServiceAccount
  • NetworkPolicy skeleton for the bridge
  • Health check and /mcp/schema discovery endpoint

Not yet done / known gaps

  • Bridge image is a placeholder (ghcr.io/example/telemcp-bridge); needs CI build and publish
  • MCP interface is HTTP REST-shaped, not full MCP protocol transport
  • Prompts: only Triage-Now stub; missing Capacity-Check, CrashLoop-Playbook
  • No Alertmanager integration in the bridge
  • No metrics-server chart (useful for kubectl top semantics)
  • No host-level DaemonSet sidecar for systemd/OS signals
  • NetworkPolicy egress may need K8s API (443) allowance
  • Wiki and README are currently aligned to INTENT; they need updating whenever scope shifts

Success Criteria

We know TeleMcp is working when:

  1. An ansible-playbook run brings up the monitoring, logging, and bridge (mcp) namespaces with healthy pods.
  2. A curl request to the bridge's /mcp/schema endpoint returns resources, tools, and prompts.
  3. An MCP-capable agent can query PromQL, run LogQL, list cluster objects, and pull an inventory snapshot without direct API credentials.
  4. Default alert rules fire on induced failures (node pressure, crash loop) and the agent can triage them via bridge tools.
  5. The entire stack redeploys cleanly on a fresh Ubuntu 24.04 + k3s/kubeadm host.
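
The first criterion lends itself to an automated smoke check. The sketch below assumes pod objects shaped like the `items` of `kubectl get pods -A -o json`. The function name `unhealthy_pods` and the sample pods are hypothetical, and the namespace names follow the architecture diagram.

```python
# Namespaces the playbook is expected to populate, per the architecture diagram.
EXPECTED_NAMESPACES = ("monitoring", "logging", "mcp")


def unhealthy_pods(pods: list[dict]) -> list[str]:
    """Names of pods in the expected namespaces that are not Running or Succeeded."""
    bad = []
    for pod in pods:
        meta = pod["metadata"]
        phase = pod["status"]["phase"]
        if meta["namespace"] in EXPECTED_NAMESPACES and phase not in ("Running", "Succeeded"):
            bad.append(f"{meta['namespace']}/{meta['name']}")
    return bad


# Hypothetical sample; a real check would load this from the Kubernetes API.
sample = [
    {"metadata": {"namespace": "mcp", "name": "bridge-0"}, "status": {"phase": "Running"}},
    {"metadata": {"namespace": "logging", "name": "loki-0"}, "status": {"phase": "Pending"}},
    {"metadata": {"namespace": "default", "name": "other"}, "status": {"phase": "Failed"}},
]
failing = unhealthy_pods(sample)
```

Pod phase is a coarse signal: a pod can be Running while its containers are not ready, so a stricter check would also inspect the pod's Ready condition.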

Non-Goals

  • Replacing Grafana or building a custom metrics database
  • Providing arbitrary shell/exec access to the cluster
  • Mutating cluster state (deploy, scale, delete) through the bridge
  • Supporting non-Linux or non-Kubernetes targets in v1
  • Vendor-specific APM (Datadog, New Relic, etc.) — OTel fan-out is the extension point

How to Use This Document

  • Prioritize work against the "Current State" gaps and "Minimum viable" capture list.
  • Reject scope creep that does not serve agent observability or repeatable deployment.
  • Update this file when intent shifts — e.g., adding write paths, new environments, or MCP transport changes.

For operational quick-start, see README.md.
For detailed component rationale, see wiki/TeleMcpBlueprint.md.