TeleMcp Blueprint
Building a Kubernetes telemetry MCP bridge
Source: Original design conversation
Authority: Scope and priorities are governed by INTENT.md. This document explains why each component exists and how the bridge is shaped.
Overview
Blueprint for a telemetry service + MCP bridge that auto-deploys on a Linux-based Kubernetes host (k3s or standard k8s) via Ansible + Helm, and exposes everything an LLM agent needs to bootstrap, monitor, and operate the box.
MCP acts as the standardized "USB-C" between the LLM agent and your telemetry — see the Model Context Protocol spec.
What we capture
Minimum viable (current target)
Kubernetes (control + workloads)
- Cluster and node status, taints, conditions, kubelet health
- Namespaces, Deployments, StatefulSets, DaemonSets, Pods (phase, restarts, images, age)
- Services, Events (warning/error)
- Resource usage per pod/node/namespace via Prometheus, cAdvisor, and kube-state-metrics
Logs and alerts
- Pod and node logs via Loki/Promtail
- Default alert rules: node not ready, API/etcd degradation, CrashLoopBackOff, job failures
Bridge surface
- Tools: promql.query, loki.query, k8s.get, k8s.events, inventory.snapshot
- Resources: saved PromQL/LogQL queries, cluster inventory snapshots
- Prompts: triage and operational playbooks (Triage-Now implemented as a stub; others planned)
Stretch (deferred)
Host (Linux / node)
- CPU, memory, disk, inode, filesystem, network, NIC errors (partially covered by node-exporter)
- Distro/kernel/version, packages/updates
- Systemd unit status for key services (container runtime, kubelet, nginx, etc.)
- Certificates (expiry), time sync status (chrony/ntp)
- Firewall/ports (nftables/ufw summary)
Additional Kubernetes signals
- Ingress, Jobs/CronJobs, HPA/VPA
- Throttling and OOM kill detail beyond default metrics
Additional bridge capabilities
- systemd.status, tail.pod_logs tools
- Alertmanager API for active-alert summaries
- Full MCP transport (stdio/SSE) vs. current HTTP schema approximation
Reference architecture
┌─────────────────────────────────────────────────────────────┐
│                   LLM Agent (MCP client)                    │
└──────────────────────────┬──────────────────────────────────┘
                           │ MCP (resources / tools / prompts)
┌──────────────────────────▼──────────────────────────────────┐
│       mcp-telemetry-bridge (FastAPI, namespace: mcp)        │
│     Read-only proxy to Prometheus, Loki, Kubernetes API     │
└──────┬─────────────────┬────────────────────┬───────────────┘
       │                 │                    │
┌──────▼──────┐  ┌───────▼───────┐   ┌────────▼────────┐
│ Prometheus  │  │ Loki          │   │ Kubernetes API  │
│ Alertmanager│  │ Promtail      │   │ (in-cluster SA) │
│ Grafana     │  │               │   │                 │
│ KSM         │  │               │   │                 │
│ node-export │  │               │   │                 │
└─────────────┘  └───────────────┘   └─────────────────┘
On the cluster
| Component | Status | Role |
|---|---|---|
| kube-prometheus-stack | Deployed | Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics, default rules |
| Loki + Promtail | Deployed | Log aggregation and shipping |
| OpenTelemetry Collector | Deployed (optional) | OTLP in → Prometheus remote-write / Loki out |
| metrics-server | Planned | Live resource metrics (kubectl top semantics) |
| Host DaemonSet sidecar | Planned | systemd, cert, and OS-level facts |
We use standard cloud-native open-source components (several of them CNCF projects) so agents reason in PromQL and LogQL and call a single MCP server for answers.
Why these charts?
| Chart | Rationale |
|---|---|
| kube-prometheus-stack | One Helm install for Prometheus Operator, Alertmanager, Grafana, node-exporter, KSM, dashboards, and alert rules |
| Loki + Promtail | Cheap, scalable log storage without bolting logs into Prometheus |
| OTel Collector | Vendor-agnostic OTLP ingress; fan-out to existing backends without re-architecting |
Ansible copies opinionated values from helm/values/ and runs helm upgrade --install for each chart. See ansible/roles/telemetry_stack/tasks/main.yml.
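The per-chart step can be pictured as one idempotent command per release. The sketch below only builds and prints the argument lists; it does not run Helm and is not the Ansible role itself. The release names, the monitoring namespace, and the values file names under helm/values/ are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChartRelease:
    release: str      # Helm release name (assumed)
    chart: str        # chart reference in repo/name form
    namespace: str    # target namespace (assumed)
    values_file: str  # values file under helm/values/ (file name assumed)


def helm_upgrade_install_argv(item: ChartRelease) -> list[str]:
    """Build the argv for an install-or-upgrade of a single chart."""
    return [
        "helm", "upgrade", "--install", item.release, item.chart,
        "--namespace", item.namespace,
        "--create-namespace",
        "--values", item.values_file,
    ]


# Illustrative release list; every name and path here is a placeholder.
RELEASES = [
    ChartRelease(
        "kube-prometheus-stack",
        "prometheus-community/kube-prometheus-stack",
        "monitoring",
        "helm/values/kube-prometheus-stack.yaml",
    ),
    ChartRelease("loki", "grafana/loki", "monitoring", "helm/values/loki.yaml"),
]

for item in RELEASES:
    print(" ".join(helm_upgrade_install_argv(item)))
```

Because helm upgrade --install installs a missing release and upgrades an existing one, re-running the same list should be safe, which is what makes it a natural fit for a repeatable Ansible task.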
MCP Telemetry Bridge
The bridge (mcp-telemetry-bridge/) is the key piece: a small FastAPI service that exposes the MCP surface (resources, tools, prompts), currently as an HTTP schema approximation rather than a full MCP transport.
Implementation status
| Capability | Status |
|---|---|
| FastAPI service with health check | Done |
| /mcp/schema discovery endpoint | Done |
| promql.query | Done |
| loki.query | Done |
| k8s.get | Done |
| k8s.events | Done |
| inventory.snapshot | Done |
| Saved PromQL/LogQL resources | Done (3 queries) |
| Triage-Now prompt | Stub |
| Capacity-Check, CrashLoop-Playbook prompts | Planned |
| systemd.status | Planned (requires DaemonSet sidecar) |
| tail.pod_logs | Planned |
| Alertmanager API | Planned |
| Full MCP protocol transport | Planned |
Read-only backends
The bridge talks read-only to:
- Prometheus HTTP API: instant and range queries
- Loki HTTP API: LogQL queries
- Kubernetes API: ServiceAccount with RBAC get/list/watch
- Alertmanager API: planned for active-alert summaries
- Node sidecar HTTP: planned for host-level facts
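To make the read-only backends concrete, here is a minimal sketch of the GET URLs the bridge might build for Prometheus and Loki. The endpoint paths (/api/v1/query, /api/v1/query_range, /loki/api/v1/query_range) are the documented HTTP APIs of those projects. The in-cluster service URLs and the function names are assumptions, not taken from the bridge code.

```python
from urllib.parse import urlencode

# Assumed in-cluster service URLs; real names depend on the Helm releases.
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"
LOKI_URL = "http://loki.monitoring.svc:3100"


def prometheus_instant_url(expr: str) -> str:
    """Instant query: Prometheus evaluates expr at its current time."""
    return f"{PROMETHEUS_URL}/api/v1/query?{urlencode({'query': expr})}"


def prometheus_range_url(expr: str, start: float, end: float, step: str) -> str:
    """Range query: expr is evaluated at each step between start and end."""
    params = {"query": expr, "start": start, "end": end, "step": step}
    return f"{PROMETHEUS_URL}/api/v1/query_range?{urlencode(params)}"


def loki_range_url(logql: str, limit: int = 100) -> str:
    """LogQL range query; Loki uses its default window if start/end are omitted."""
    params = {"query": logql, "limit": limit}
    return f"{LOKI_URL}/loki/api/v1/query_range?{urlencode(params)}"


print(prometheus_instant_url("up"))
print(loki_range_url('{namespace="mcp"}'))
```

Only GET requests are constructed here, which matches the read-only posture: nothing in this sketch can write to or reconfigure either backend.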
Tools (target API)
promql.query(expr, range?)
loki.query(logql, limit?, since?)
k8s.get(kind, namespace?, name?)
k8s.events(namespace?, since?)
inventory.snapshot() → JSON
systemd.status(unit) # planned
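One way to wire these signatures to handlers is a small name-to-function registry. The sketch below is a hypothetical dispatch layer, not the bridge's actual code: the decorator, the registry, and call_tool are invented for illustration, and the handlers return placeholder dictionaries instead of calling a backend.

```python
from typing import Any, Callable, Optional

# Registry mapping tool names (as listed above) to handler functions.
TOOLS: dict[str, Callable[..., Any]] = {}


def tool(name: str):
    """Decorator that registers a handler under a dotted tool name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register


@tool("promql.query")
def promql_query(expr: str, range: Optional[str] = None) -> dict:
    # Placeholder: a real handler would query the Prometheus HTTP API.
    return {"tool": "promql.query", "expr": expr, "range": range}


@tool("k8s.get")
def k8s_get(kind: str, namespace: Optional[str] = None,
            name: Optional[str] = None) -> dict:
    # Placeholder: a real handler would read from the Kubernetes API.
    return {"tool": "k8s.get", "kind": kind, "namespace": namespace, "name": name}


def call_tool(name: str, arguments: dict) -> Any:
    """Look up a tool by name and invoke it with keyword arguments."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**arguments)


print(call_tool("promql.query", {"expr": "up"}))
print(call_tool("k8s.get", {"kind": "Pod", "namespace": "mcp"}))
```

Passing arguments as keywords means an unexpected parameter fails with a TypeError instead of being silently ignored, which helps when the caller is an LLM agent that may guess at parameter names.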
Resources
res://dashboards/top-pods-by-cpu.promql # implemented
res://dashboards/pod-restarts.promql # implemented
res://dashboards/warn-events.logql # implemented
res://snapshots/cluster-inventory.json # planned (dynamic)
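A saved-query resource can be as simple as a URI-to-text lookup. In this sketch the three URIs come from the list above, but the query bodies are illustrative guesses at what such files might contain, and read_resource is a hypothetical helper. The metric names used (container_cpu_usage_seconds_total from cAdvisor, kube_pod_container_status_restarts_total from kube-state-metrics) are standard.

```python
from urllib.parse import urlparse

# Query text is assumed for illustration; the real saved queries may differ.
SAVED_QUERIES = {
    "res://dashboards/top-pods-by-cpu.promql": (
        "topk(10, sum by (namespace, pod) "
        '(rate(container_cpu_usage_seconds_total{container!=""}[5m])))'
    ),
    "res://dashboards/pod-restarts.promql": (
        "sum by (namespace, pod) "
        "(increase(kube_pod_container_status_restarts_total[1h])) > 0"
    ),
    "res://dashboards/warn-events.logql": '{namespace=~".+"} |= "Warning"',
}


def read_resource(uri: str) -> dict:
    """Return the saved query for a res:// URI, tagged with its language."""
    if urlparse(uri).scheme != "res":
        raise ValueError(f"unsupported scheme in {uri!r}")
    if uri not in SAVED_QUERIES:
        raise KeyError(f"unknown resource: {uri}")
    language = uri.rsplit(".", 1)[-1]  # "promql" or "logql"
    return {"uri": uri, "language": language, "text": SAVED_QUERIES[uri]}


print(read_resource("res://dashboards/pod-restarts.promql"))
```

The file extension doubles as the query language tag, so a client can decide whether to hand the text to promql.query or loki.query.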
Prompts
Triage-Now # stub — summarize alerts, top offenders, recent warnings
Capacity-Check # planned
CrashLoop-Playbook # planned
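Since Triage-Now is described as summarizing alerts, top offenders, and recent warnings, a prompt of that shape could be assembled from three pre-fetched lists. This is a sketch of one possible layout, not the stub's actual behavior: the function name, wording, and limit parameter are assumptions.

```python
def render_triage_now(alerts: list[str], top_offenders: list[str],
                      warnings: list[str], limit: int = 5) -> str:
    """Assemble a triage prompt from three pre-fetched telemetry summaries."""

    def section(title: str, items: list[str]) -> str:
        # Cap each section and show an explicit marker when it is empty.
        shown = items[:limit] or ["(none)"]
        return f"{title}:\n" + "\n".join(f"- {item}" for item in shown)

    return "\n\n".join([
        "You are triaging a Kubernetes cluster. Summarize the situation "
        "and propose next read-only checks.",
        section("Active alerts", alerts),
        section("Top offenders", top_offenders),
        section("Recent warnings", warnings),
    ])


print(render_triage_now(
    alerts=["KubeNodeNotReady node=worker-1"],
    top_offenders=[],
    warnings=["BackOff pod/api-7c9d restarting failed container"],
))
```

Capping each section keeps the prompt bounded on a noisy cluster, and the explicit "(none)" marker lets the agent tell an empty result apart from a missing one.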
Security model
- Bridge runs under a dedicated ServiceAccount with a ClusterRole limited to get/list/watch
- NetworkPolicy restricts egress to Prometheus (9090) and Loki (3100); an additional egress allowance for the Kubernetes API (443) may be needed
- The bridge is not authenticated in v1, so any external exposure should use mTLS or OIDC
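The read-only RBAC constraint is easy to check mechanically. The sketch below expresses a ClusterRole as a Python dictionary and flags any verb outside get/list/watch. The role name and the exact resource list are assumptions based on the objects listed under "What we capture"; the actual manifest in the repository may differ.

```python
READ_ONLY_VERBS = {"get", "list", "watch"}

# Illustrative ClusterRole; the name and resource list are assumed.
CLUSTER_ROLE = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "ClusterRole",
    "metadata": {"name": "mcp-telemetry-bridge-read"},
    "rules": [
        {
            "apiGroups": [""],
            "resources": ["nodes", "namespaces", "pods", "services", "events"],
            "verbs": ["get", "list", "watch"],
        },
        {
            "apiGroups": ["apps"],
            "resources": ["deployments", "statefulsets", "daemonsets"],
            "verbs": ["get", "list", "watch"],
        },
    ],
}


def non_read_only_verbs(role: dict) -> set[str]:
    """Return every verb in the role that is not get, list, or watch."""
    found: set[str] = set()
    for rule in role.get("rules", []):
        found |= set(rule.get("verbs", [])) - READ_ONLY_VERBS
    return found


violations = non_read_only_verbs(CLUSTER_ROLE)
print("read-only" if not violations else f"write verbs present: {sorted(violations)}")
```

A check like this could run in CI against the rendered manifest, so a wildcard verb or an accidental delete is caught before deployment.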
Related docs
- INTENT.md — goals, scope, success criteria, known gaps
- README.md — quick start and smoke tests
- TeleMcpProject.md — project overview and audience