tegwick 129a229e38 Seeded intent and wiki pages

2026-06-22 19:09:24 +02:00


TeleMcp — Project Intent

Mission control for Kubernetes hosts, exposed to LLM agents through MCP.

TeleMcp is a self-contained observability stack that deploys onto a Linux Kubernetes host and surfaces metrics, logs, and cluster state through a single Model Context Protocol (MCP) bridge. The goal is to let an autonomous agent — or a human with an agent — bootstrap, monitor, triage, and operate a box without bespoke integrations or constant human supervision.

This document anchors what we are building, why, and what is in scope. When in doubt, prefer the simplest path that gives an agent reliable, read-only situational awareness.

Problem

Operating a Kubernetes host means juggling many signals across many systems: node health, workload status, logs, alerts, certificates, systemd units, and more. Humans use Grafana dashboards, kubectl, and ad-hoc PromQL/LogQL. LLM agents need the same information, but through a standardized, safe interface — not raw shell access.

TeleMcp closes that gap by:

Collecting telemetry with proven CNCF/Grafana stack components.
Deploying the stack repeatably via Ansible + Helm.
Bridging everything to agents through one MCP server with resources, tools, and prompts.

Vision

A single ansible-playbook run (or equivalent) turns a bare k3s/kubeadm host into a monitored, agent-ready environment. An LLM agent connects to the MCP bridge and can answer questions like:

What is unhealthy right now?
Which pods are crash-looping and why?
Is disk or memory pressure building?
What changed in the cluster since yesterday?

The agent reasons in PromQL and LogQL — industry-standard query languages — and calls parameterized tools rather than scraping raw APIs itself.
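As a rough illustration of what a parameterized tool call could look like behind the bridge, the sketch below builds a request URL for Prometheus's instant-query HTTP endpoint (`/api/v1/query`). The function names, the namespace validation rule, and the base URL used in the usage note are assumptions made for illustration, not the bridge's actual code.

```python
import re
from typing import Optional
from urllib.parse import urlencode

# Kubernetes namespaces are DNS labels; rejecting anything else keeps
# caller-supplied values from escaping the quoted label matcher.
_NAMESPACE = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def crashloop_query(namespace: str) -> str:
    """Hypothetical helper: PromQL for crash-looping containers in one namespace."""
    if not _NAMESPACE.fullmatch(namespace):
        raise ValueError(f"invalid namespace: {namespace!r}")
    # kube-state-metrics reports the waiting reason per container.
    return (
        "kube_pod_container_status_waiting_reason"
        f'{{reason="CrashLoopBackOff",namespace="{namespace}"}} == 1'
    )


def build_promql_url(base_url: str, query: str, time: Optional[str] = None) -> str:
    """Build a GET URL for Prometheus's instant-query endpoint."""
    params = {"query": query}
    if time is not None:
        params["time"] = time  # RFC 3339 timestamp or Unix seconds
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"
```

A tool such as promql.query might then call `build_promql_url("http://prometheus:9090", crashloop_query("mcp"))`, where the hostname and port are placeholders for wherever Prometheus is reachable inside the cluster.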

Design Principles

Principle	What it means
Read-only by default	The MCP bridge and its ServiceAccount only `get`/`list`/`watch`. No cluster mutations through this path.
Standard stack	Prometheus, Loki, kube-state-metrics, node-exporter — not custom collectors unless necessary.
MCP as the interface	One bridge, one contract. Agents do not talk to Prometheus/Loki/K8s APIs directly.
Deployable in one shot	Ansible playbook + Helm charts; no manual chart-by-chart assembly.
Least privilege	RBAC scoped to observation; NetworkPolicy limits egress; consider mTLS/OIDC for external exposure.
Agent-first ergonomics	Pre-built resources (saved queries), tools (parameterized calls), and prompts (triage playbooks).
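The read-only principle lends itself to an automated check. The sketch below assumes the bridge's ClusterRole rules have been loaded into Python dictionaries that mirror the Kubernetes RBAC rule shape (`apiGroups`, `resources`, `verbs`); the helper name `mutating_verbs` is hypothetical.

```python
READ_ONLY_VERBS = frozenset({"get", "list", "watch"})


def mutating_verbs(rules):
    """Return, sorted, every verb in `rules` that is not read-only.

    A wildcard verb ("*") is reported too, because it grants mutations.
    """
    found = set()
    for rule in rules:
        found.update(v for v in rule.get("verbs", []) if v not in READ_ONLY_VERBS)
    return sorted(found)


# Example rules, written out by hand for illustration only.
bridge_rules = [
    {"apiGroups": [""], "resources": ["pods", "nodes", "events"],
     "verbs": ["get", "list", "watch"]},
    {"apiGroups": ["apps"], "resources": ["deployments"],
     "verbs": ["get", "list", "watch"]},
]
print(mutating_verbs(bridge_rules))  # an empty list means the role is read-only
```

A check like this could run in CI against the rendered chart so that a stray `delete` or `patch` verb fails the build before it reaches a cluster.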

Architecture

┌─────────────────────────────────────────────────────────────┐
│  LLM Agent (MCP client)                                     │
└──────────────────────────┬──────────────────────────────────┘
                           │ MCP (resources / tools / prompts)
┌──────────────────────────▼──────────────────────────────────┐
│  mcp-telemetry-bridge  (FastAPI, namespace: mcp)            │
│  Read-only proxy to Prometheus, Loki, Kubernetes API        │
└──────┬─────────────────┬────────────────────┬───────────────┘
       │                 │                    │
┌──────▼──────┐  ┌───────▼───────┐   ┌────────▼────────┐
│ Prometheus  │  │ Loki          │   │ Kubernetes API  │
│ Alertmanager│  │ Promtail      │   │ (in-cluster SA) │
│ Grafana     │  │               │   │                 │
│ KSM         │  │               │   │                 │
│ node-export │  │               │   │                 │
└─────────────┘  └───────────────┘   └─────────────────┘
       monitoring namespace    logging namespace

Optional: OpenTelemetry Collector for OTLP fan-out to Prometheus remote-write and Loki.

Future: Host-level DaemonSet sidecar for systemd status, package/cert checks, and other node facts not available through K8s metrics alone.
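The "one bridge, three backends" shape in the diagram can be pictured as a small routing table. This is a minimal sketch: the tool names are the ones listed under "Bridge surface" in this document, while the backend labels and the assumption that inventory.snapshot is served from the Kubernetes API are illustrative guesses.

```python
# Maps each bridge tool to the backend it would proxy to.
# The backend assignment for inventory.snapshot is an assumption.
BACKENDS = {
    "promql.query": "prometheus",
    "loki.query": "loki",
    "k8s.get": "kubernetes",
    "k8s.events": "kubernetes",
    "inventory.snapshot": "kubernetes",
}


def route(tool: str) -> str:
    """Return the backend label for a tool, rejecting unknown tool names."""
    try:
        return BACKENDS[tool]
    except KeyError:
        raise ValueError(f"unknown tool: {tool!r}") from None
```

Keeping the table closed (unknown names raise instead of falling through) is one way to preserve the single-contract property: an agent can only reach backends through tools the bridge explicitly declares.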

What We Capture

Minimum viable (current target)

Kubernetes

Cluster & node status, conditions, taints
Namespaces, Deployments, StatefulSets, DaemonSets, Pods (phase, restarts, images)
Services, Events (especially Warning/Error)
Resource usage via Prometheus/cAdvisor/kube-state-metrics

Logs & alerts

Pod and node logs via Loki/Promtail
Default alert rules: node not ready, CrashLoopBackOff, API/etcd degradation, job failures
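To make the alert list more concrete, the sketch below pairs three of the named conditions with PromQL expressions over kube-state-metrics series and wraps them in the dictionary shape of a Prometheus alerting rule. The alert names, the `for` duration, and the severity label are assumptions; kube-prometheus-stack ships its own default rules, which may differ, and API/etcd degradation is not covered here.

```python
# Illustrative expressions only; the shipped default rules may differ.
ALERT_EXPRESSIONS = {
    "NodeNotReady": 'kube_node_status_condition{condition="Ready",status="true"} == 0',
    "PodCrashLooping": 'kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1',
    "JobFailed": "kube_job_status_failed > 0",
}


def to_rule(name, expr, for_="5m", severity="warning"):
    """Build one alerting rule in the shape Prometheus rule files use."""
    return {
        "alert": name,
        "expr": expr,
        "for": for_,  # how long the condition must hold before firing
        "labels": {"severity": severity},
    }


# One rule group, ready to be serialized to YAML by a tool of choice.
rule_group = {
    "name": "telemcp-defaults",  # hypothetical group name
    "rules": [to_rule(name, expr) for name, expr in ALERT_EXPRESSIONS.items()],
}
```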

Bridge surface

Tools: promql.query, loki.query, k8s.get, k8s.events, inventory.snapshot
Resources: saved PromQL/LogQL queries, cluster inventory snapshots
Prompts: triage and operational playbooks
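A saved-query resource can be thought of as a named template with validated parameters. The sketch below uses `string.Template` from the standard library; the query names, the label names (`namespace`, `pod`, `instance`), and the allowed-character rule are assumptions and would need to match the labels that Promtail and node-exporter actually produce in a given deployment.

```python
import re
from string import Template

# Hypothetical saved queries: one LogQL, one PromQL.
SAVED_QUERIES = {
    "pod-errors": Template('{namespace="$namespace", pod="$pod"} |= "error"'),
    "node-memory-available": Template(
        'node_memory_MemAvailable_bytes{instance="$instance"}'
    ),
}

# Conservative whitelist so a parameter cannot close the quoted matcher.
_SAFE_VALUE = re.compile(r"[A-Za-z0-9_.:-]+")


def render(name: str, **params: str) -> str:
    """Fill a saved query's parameters after validating each value."""
    for key, value in params.items():
        if not _SAFE_VALUE.fullmatch(value):
            raise ValueError(f"unsafe value for {key}: {value!r}")
    # Template.substitute raises KeyError if a parameter is missing.
    return SAVED_QUERIES[name].substitute(params)
```

For example, `render("pod-errors", namespace="mcp", pod="bridge-0")` produces a LogQL selector that a tool such as loki.query could pass on unchanged.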

Stretch (explicitly deferred)

Host OS depth: systemd units, package updates, cert expiry, firewall summary, NTP drift
Additional tools: systemd.status, tail.pod_logs
Alertmanager API integration for active-alert summaries
Full MCP transport (stdio/SSE) in place of the current HTTP schema approximation
Multi-cluster federation
Write/mutate operations (out of scope unless a separate, gated path is designed)

Repository Layout

Path	Role
`ansible/`	Bootstrap: install Helm, deploy all charts
`helm/values/`	Opinionated values for kube-prometheus-stack, Loki, OTel
`helm/mcp-telemetry-bridge/`	Bridge chart: Deployment, RBAC, Service, NetworkPolicy
`mcp-telemetry-bridge/`	FastAPI application implementing the MCP surface
`environments/`	Per-environment overrides and notes
`wiki/`	Extended design notes and blueprint

Current State (as of initial scaffold)

Done

Ansible playbook with k8s_host + telemetry_stack roles
Helm values for monitoring, logging, optional OTel collector
MCP bridge service with core tools and saved-query resources
Read-only ClusterRole/Binding for the bridge ServiceAccount
NetworkPolicy skeleton for the bridge
Health check and /mcp/schema discovery endpoint

Not yet done / known gaps

Bridge image is a placeholder (ghcr.io/example/telemcp-bridge); needs CI build and publish
MCP interface is HTTP REST-shaped, not full MCP protocol transport
Prompts: only Triage-Now stub; missing Capacity-Check, CrashLoop-Playbook
No Alertmanager integration in the bridge
No metrics-server chart (useful for kubectl top semantics)
No host-level DaemonSet sidecar for systemd/OS signals
NetworkPolicy egress may need an explicit allowance for the Kubernetes API (TCP 443)
Wiki and README are aligned to INTENT for now; they need updating whenever scope shifts

Success Criteria

We know TeleMcp is working when:

A single ansible-playbook run brings up the monitoring, logging, and bridge (mcp) namespaces with healthy pods.
A curl request to the bridge's /mcp/schema endpoint returns resources, tools, and prompts.
An MCP-capable agent can query PromQL, run LogQL, list cluster objects, and pull an inventory snapshot without direct API credentials.
Default alert rules fire on induced failures (node pressure, crash loop) and the agent can triage them via bridge tools.
The entire stack redeploys cleanly on a fresh Ubuntu 24.04 + k3s/kubeadm host.
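The schema criterion above could be turned into a small smoke test. This sketch assumes that /mcp/schema returns a JSON object with top-level `resources`, `tools`, and `prompts` keys; the actual response shape is not specified in this document, so treat the key names as placeholders.

```python
import json

REQUIRED_SECTIONS = ("resources", "tools", "prompts")


def missing_sections(schema_json: str) -> list:
    """Return the required sections that are absent or empty in the schema."""
    schema = json.loads(schema_json)
    return [section for section in REQUIRED_SECTIONS if not schema.get(section)]


# Hand-written sample body standing in for a real /mcp/schema response.
sample = json.dumps({
    "resources": [{"name": "pod-errors"}],
    "tools": [{"name": "promql.query"}],
    "prompts": [],
})
print(missing_sections(sample))  # lists any section that still needs content
```

Treating an empty section as missing is a deliberate choice here: a bridge that advertises no prompts would pass a pure key-presence check while still failing the intent of the criterion.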

Non-Goals

Replacing Grafana or building a custom metrics database
Providing arbitrary shell/exec access to the cluster
Mutating cluster state (deploy, scale, delete) through the bridge
Supporting non-Linux or non-Kubernetes targets in v1
Vendor-specific APM (Datadog, New Relic, etc.) — OTel fan-out is the extension point

How to Use This Document

Prioritize work against the "Current State" gaps and "Minimum viable" capture list.
Reject scope creep that does not serve agent observability or repeatable deployment.
Update this file when intent shifts — e.g., adding write paths, new environments, or MCP transport changes.

For operational quick-start, see README.md.
For detailed component rationale, see wiki/TeleMcpBlueprint.md.
