tele-mcp/wiki/TeleMcpBlueprint.md
2026-06-22 19:09:24 +02:00

TeleMcp Blueprint

Building a Kubernetes telemetry MCP bridge

Source: Original design conversation
Authority: Scope and priorities are governed by INTENT.md. This document explains why each component exists and how the bridge is shaped.

Overview

This blueprint describes a telemetry stack and an MCP bridge that deploy automatically to a Linux-based Kubernetes host (k3s or standard Kubernetes) via Ansible and Helm. The bridge exposes the metrics, logs, and cluster state an LLM agent needs to bootstrap, monitor, and operate the box.

MCP acts as the standardized "USB-C" connector between the LLM agent and your telemetry; see the Model Context Protocol spec.


What we capture

Minimum viable (current target)

Kubernetes (control + workloads)

  • Cluster and node status, taints, conditions, kubelet health
  • Namespaces, Deployments, StatefulSets, DaemonSets, Pods (phase, restarts, images, age)
  • Services, Events (warning/error)
  • Resource usage per pod/node/namespace via Prometheus, cAdvisor, and kube-state-metrics

Logs and alerts

  • Pod and node logs via Loki/Promtail
  • Default alert rules: node not ready, API/etcd degradation, CrashLoopBackOff, job failures

Bridge surface

  • Tools: promql.query, loki.query, k8s.get, k8s.events, inventory.snapshot
  • Resources: saved PromQL/LogQL queries, cluster inventory snapshots
  • Prompts: triage and operational playbooks (Triage-Now stubbed; others planned)

Stretch (deferred)

Host (Linux / node)

  • CPU, memory, disk, inode, filesystem, network, NIC errors (partially covered by node-exporter)
  • Distro/kernel/version, packages/updates
  • Systemd unit status for key services (container runtime, kubelet, nginx, etc.)
  • Certificates (expiry), time sync status (chrony/ntp)
  • Firewall/ports (nftables/ufw summary)

Additional Kubernetes signals

  • Ingress, Jobs/CronJobs, HPA/VPA
  • Throttling and OOM kill detail beyond default metrics

Additional bridge capabilities

  • systemd.status, tail.pod_logs tools
  • Alertmanager API for active-alert summaries
  • Full MCP transport (stdio/SSE), replacing the current HTTP schema approximation

Reference architecture

┌─────────────────────────────────────────────────────────────┐
│  LLM Agent (MCP client)                                     │
└──────────────────────────┬──────────────────────────────────┘
                           │ MCP (resources / tools / prompts)
┌──────────────────────────▼──────────────────────────────────┐
│  mcp-telemetry-bridge  (FastAPI, namespace: mcp)            │
│  Read-only proxy to Prometheus, Loki, Kubernetes API        │
└──────┬─────────────────┬────────────────────┬───────────────┘
       │                 │                    │
┌──────▼──────┐  ┌───────▼───────┐   ┌────────▼────────┐
│ Prometheus  │  │ Loki          │   │ Kubernetes API  │
│ Alertmanager│  │ Promtail      │   │ (in-cluster SA) │
│ Grafana     │  │               │   │                 │
│ KSM         │  │               │   │                 │
│ node-export │  │               │   │                 │
└─────────────┘  └───────────────┘   └─────────────────┘

On the cluster

| Component | Status | Role |
| --- | --- | --- |
| kube-prometheus-stack | Deployed | Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics, default rules |
| Loki + Promtail | Deployed | Log aggregation and shipping |
| OpenTelemetry Collector | Deployed (optional) | OTLP in → Prometheus remote-write / Loki out |
| metrics-server | Planned | Live resource metrics (kubectl top semantics) |
| Host DaemonSet sidecar | Planned | systemd, cert, and OS-level facts |

We use widely adopted open-source components (several of them CNCF projects) so agents can reason in PromQL and LogQL and call a single MCP server for answers.


Why these charts?

| Chart | Rationale |
| --- | --- |
| kube-prometheus-stack | One Helm install for Prometheus Operator, Alertmanager, Grafana, node-exporter, KSM, dashboards, and alert rules |
| Loki + Promtail | Cheap, scalable log storage without bolting logs into Prometheus |
| OTel Collector | Vendor-agnostic OTLP ingress; fan-out to existing backends without re-architecting |

Ansible copies opinionated values from helm/values/ and runs helm upgrade --install for each chart. See ansible/roles/telemetry_stack/tasks/main.yml.
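
As a rough illustration of what each Ansible task ends up running, the sketch below builds the `helm upgrade --install` argument list for a chart. The release names, chart references, values file names, and the `monitoring` namespace are assumptions made for the example; the authoritative values are in the Ansible role.

```python
from pathlib import Path

# Hypothetical (release, chart, values file) triples; the real ones are
# defined in ansible/roles/telemetry_stack/tasks/main.yml.
CHARTS = [
    ("kube-prometheus-stack", "prometheus-community/kube-prometheus-stack", "kube-prometheus-stack.yaml"),
    ("loki", "grafana/loki-stack", "loki.yaml"),
]


def helm_command(release, chart, values_file, namespace="monitoring",
                 values_dir=Path("helm/values")):
    """Return the argv for an idempotent install-or-upgrade of one chart."""
    return [
        "helm", "upgrade", "--install", release, chart,
        "--namespace", namespace, "--create-namespace",
        "--values", (values_dir / values_file).as_posix(),
    ]


if __name__ == "__main__":
    for release, chart, values_file in CHARTS:
        print(" ".join(helm_command(release, chart, values_file)))
```

Because `helm upgrade --install` installs the release when it is missing and upgrades it otherwise, re-running the playbook converges on the same state.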


MCP Telemetry Bridge

The bridge (mcp-telemetry-bridge/) is the key piece: a small FastAPI service that exposes an MCP-shaped surface (resources, tools, prompts) over HTTP. The full MCP protocol transport is still planned.

Implementation status

| Capability | Status |
| --- | --- |
| FastAPI service with health check | Done |
| /mcp/schema discovery endpoint | Done |
| promql.query | Done |
| loki.query | Done |
| k8s.get | Done |
| k8s.events | Done |
| inventory.snapshot | Done |
| Saved PromQL/LogQL resources | Done (3 queries) |
| Triage-Now prompt | Stub |
| Capacity-Check, CrashLoop-Playbook prompts | Planned |
| systemd.status | Planned (requires DaemonSet sidecar) |
| tail.pod_logs | Planned |
| Alertmanager API | Planned |
| Full MCP protocol transport | Planned |
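
To make the discovery endpoint more concrete, here is a minimal sketch of a document that /mcp/schema might return. The tool names, parameters, and resource URIs come from this page; the JSON layout (`tools`, `resources`, `prompts`, `required`, `optional`, `status`) is an assumption for illustration, not the bridge's actual response format.

```python
import json

# Tool names and parameters as listed in this blueprint; "?" parameters
# are treated as optional.
TOOLS = {
    "promql.query": {"required": ["expr"], "optional": ["range"]},
    "loki.query": {"required": ["logql"], "optional": ["limit", "since"]},
    "k8s.get": {"required": ["kind"], "optional": ["namespace", "name"]},
    "k8s.events": {"required": [], "optional": ["namespace", "since"]},
    "inventory.snapshot": {"required": [], "optional": []},
}

RESOURCES = [
    "res://dashboards/top-pods-by-cpu.promql",
    "res://dashboards/pod-restarts.promql",
    "res://dashboards/warn-events.logql",
]

PROMPTS = {"Triage-Now": "stub"}


def build_schema():
    """Assemble a JSON-serializable discovery document (hypothetical shape)."""
    return {
        "tools": [{"name": name, **params} for name, params in TOOLS.items()],
        "resources": list(RESOURCES),
        "prompts": [{"name": n, "status": s} for n, s in PROMPTS.items()],
    }


if __name__ == "__main__":
    print(json.dumps(build_schema(), indent=2))
```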

Read-only backends

The bridge talks read-only to:

  • Prometheus HTTP API — instant and range queries
  • Loki HTTP API — LogQL queries
  • Kubernetes API — ServiceAccount with RBAC get/list/watch
  • Alertmanager API — planned for active-alert summaries
  • Node sidecar HTTP — planned for host-level facts
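
The Prometheus and Loki calls above are plain HTTP GETs, which is what keeps the bridge read-only. The sketch below builds the request URLs for an instant query, a range query, and a Loki range query. The endpoint paths and parameter names are the standard Prometheus and Loki HTTP APIs; the base URLs and function names are assumptions for the example.

```python
from urllib.parse import urlencode

# Assumed in-cluster service addresses; adjust to the actual Service names.
PROM_URL = "http://prometheus.monitoring.svc:9090"
LOKI_URL = "http://loki.monitoring.svc:3100"


def prom_instant_url(expr, base=PROM_URL):
    """Instant query: evaluate expr at the server's current time."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


def prom_range_url(expr, start, end, step, base=PROM_URL):
    """Range query: start/end are Unix seconds, step is seconds or a duration."""
    params = {"query": expr, "start": start, "end": end, "step": step}
    return f"{base}/api/v1/query_range?{urlencode(params)}"


def loki_range_url(logql, start_ns, end_ns, limit=100, base=LOKI_URL):
    """Loki range query: start/end are Unix epoch nanoseconds, newest first."""
    params = {
        "query": logql,
        "start": start_ns,
        "end": end_ns,
        "limit": limit,
        "direction": "backward",
    }
    return f"{base}/loki/api/v1/query_range?{urlencode(params)}"
```

A `since` argument such as the one on loki.query could be turned into `start_ns` by subtracting the window from the current time before calling `loki_range_url`.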

Tools (target API)

promql.query(expr, range?)
loki.query(logql, limit?, since?)
k8s.get(kind, namespace?, name?)
k8s.events(namespace?, since?)
inventory.snapshot() → JSON
systemd.status(unit)          # planned
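
One way k8s.get(kind, namespace?, name?) could map onto the Kubernetes REST API is sketched below. The API paths are the standard ones for these kinds; the `KINDS` table and the `k8s_get_path` helper are hypothetical and only cover the minimum-viable kinds named in this document.

```python
# kind -> (API prefix, plural resource name, namespaced?)
KINDS = {
    "Pod": ("/api/v1", "pods", True),
    "Service": ("/api/v1", "services", True),
    "Namespace": ("/api/v1", "namespaces", False),
    "Node": ("/api/v1", "nodes", False),
    "Deployment": ("/apis/apps/v1", "deployments", True),
    "StatefulSet": ("/apis/apps/v1", "statefulsets", True),
    "DaemonSet": ("/apis/apps/v1", "daemonsets", True),
}


def k8s_get_path(kind, namespace=None, name=None):
    """Return the REST path for a read-only get or list of the given kind."""
    if kind not in KINDS:
        raise ValueError(f"unsupported kind: {kind}")
    prefix, plural, namespaced = KINDS[kind]
    if name and namespaced and not namespace:
        raise ValueError("a namespace is required to fetch a namespaced object by name")
    parts = [prefix]
    if namespaced and namespace:
        parts += ["namespaces", namespace]
    parts.append(plural)
    if name:
        parts.append(name)
    return "/".join(parts)
```

Omitting both optional arguments yields a cluster-wide list (for example `/api/v1/pods`), which only needs the `list` verb.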

Resources

res://dashboards/top-pods-by-cpu.promql    # implemented
res://dashboards/pod-restarts.promql       # implemented
res://dashboards/warn-events.logql         # implemented
res://snapshots/cluster-inventory.json     # planned (dynamic)
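
A resource URI can be resolved to a saved query with a small lookup, sketched here. The three URIs are the implemented ones listed above; the query texts are illustrative guesses at what such saved queries might contain, not the bridge's actual definitions, and `resolve_resource` is a hypothetical helper.

```python
from urllib.parse import urlsplit

# (collection, file name) -> query text. The query bodies are assumptions.
SAVED_QUERIES = {
    ("dashboards", "top-pods-by-cpu.promql"):
        "topk(10, sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m])))",
    ("dashboards", "pod-restarts.promql"):
        "sum by (namespace, pod) (increase(kube_pod_container_status_restarts_total[1h])) > 0",
    ("dashboards", "warn-events.logql"):
        '{namespace=~".+"} |= "Warning"',
}


def resolve_resource(uri):
    """Map a res:// URI to its saved query and query language."""
    parts = urlsplit(uri)
    if parts.scheme != "res":
        raise ValueError(f"not a res:// URI: {uri}")
    key = (parts.netloc, parts.path.lstrip("/"))
    if key not in SAVED_QUERIES:
        raise KeyError(f"unknown resource: {uri}")
    # The file extension doubles as the query language (promql or logql).
    language = key[1].rsplit(".", 1)[-1]
    return {"uri": uri, "language": language, "query": SAVED_QUERIES[key]}
```

The returned `language` tells the caller whether to pass the query to promql.query or loki.query.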

Prompts

Triage-Now           # stub — summarize alerts, top offenders, recent warnings
Capacity-Check       # planned
CrashLoop-Playbook   # planned
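
Since Triage-Now is still a stub, the following is only one possible shape for it: a plan of existing tool calls covering alerts, top offenders, and recent warnings. It assumes firing alerts are read through Prometheus's built-in `ALERTS` series (the Alertmanager API is not yet wired in), and the `triage_now_plan` function, the CPU expression, and the `"15m"` window format are all hypothetical.

```python
def triage_now_plan(namespace=None, since="15m"):
    """Return the ordered tool calls a Triage-Now prompt might ask the agent to make."""
    events_args = {"since": since}
    if namespace:
        events_args["namespace"] = namespace
    return [
        # 1. Alerts currently firing, via Prometheus's synthetic ALERTS series.
        {"tool": "promql.query", "args": {"expr": 'ALERTS{alertstate="firing"}'}},
        # 2. Top offenders by CPU (illustrative expression).
        {"tool": "promql.query", "args": {
            "expr": "topk(5, sum by (namespace, pod) "
                    "(rate(container_cpu_usage_seconds_total[5m])))"}},
        # 3. Recent warning events from the Kubernetes API.
        {"tool": "k8s.events", "args": events_args},
    ]
```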

Security model

  • Bridge runs under a dedicated ServiceAccount with a ClusterRole limited to get/list/watch
  • NetworkPolicy restricts egress to Prometheus (9090) and Loki (3100); an additional allowance for the Kubernetes API (443) may be needed, because the k8s.* tools call it
  • The bridge has no authentication in v1, so any external exposure should sit behind mTLS or OIDC
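
The read-only constraint on the ClusterRole is easy to check mechanically. The sketch below flags any rule whose verbs go beyond get/list/watch; the example manifest (its name and resource list) is an assumption, written as a Python dict in the same shape as the YAML manifest.

```python
READ_ONLY_VERBS = {"get", "list", "watch"}


def non_read_only_rules(cluster_role):
    """Return the rules that grant any verb outside get/list/watch (including '*')."""
    offending = []
    for rule in cluster_role.get("rules", []):
        extra = set(rule.get("verbs", [])) - READ_ONLY_VERBS
        if extra:
            offending.append(rule)
    return offending


# Hypothetical manifest for the bridge's ClusterRole.
EXAMPLE_ROLE = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "ClusterRole",
    "metadata": {"name": "mcp-telemetry-bridge-readonly"},
    "rules": [
        {"apiGroups": [""],
         "resources": ["nodes", "namespaces", "pods", "services", "events"],
         "verbs": ["get", "list", "watch"]},
        {"apiGroups": ["apps"],
         "resources": ["deployments", "statefulsets", "daemonsets"],
         "verbs": ["get", "list", "watch"]},
    ],
}

if __name__ == "__main__":
    print(non_read_only_rules(EXAMPLE_ROLE))
```

A check like this could run in CI against the rendered Helm manifest so that a write verb never slips into the bridge's role unnoticed.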