# TeleMcp Blueprint

*Building a Kubernetes telemetry MCP bridge*

> **Source:** [Original design conversation](https://chatgpt.com/share/68bdf06d-8c2c-8009-90c5-466f9f531d9a)
>
> **Authority:** Scope and priorities are governed by [INTENT.md](../INTENT.md). This document explains *why* each component exists and *how* the bridge is shaped.

## Overview

Blueprint for a telemetry service + MCP bridge that auto-deploys on a Linux-based Kubernetes host (k3s or standard k8s) via Ansible + Helm, and exposes everything an LLM agent needs to bootstrap, monitor, and operate the box.

MCP acts as the standardized "USB-C" between the LLM agent and your telemetry; see the [Model Context Protocol spec](https://modelcontextprotocol.io).

---

## What we capture

### Minimum viable (current target)

**Kubernetes (control + workloads)**

- Cluster and node status, taints, conditions, kubelet health
- Namespaces, Deployments, StatefulSets, DaemonSets, Pods (phase, restarts, images, age)
- Services, Events (warning/error)
- Resource usage per pod/node/namespace via Prometheus, cAdvisor, and kube-state-metrics

**Logs and alerts**

- Pod and node logs via Loki/Promtail
- Default alert rules: node not ready, API/etcd degradation, CrashLoopBackOff, job failures

**Bridge surface**

- Tools: `promql.query`, `loki.query`, `k8s.get`, `k8s.events`, `inventory.snapshot`
- Resources: saved PromQL/LogQL queries, cluster inventory snapshots
- Prompts: triage and operational playbooks (`Triage-Now` stubbed; others planned)

### Stretch (deferred)

**Host (Linux / node)**

- CPU, memory, disk, inode, filesystem, network, NIC errors *(partially covered by node-exporter)*
- Distro/kernel/version, packages/updates
- Systemd unit status for key services (container runtime, kubelet, nginx, etc.)
- Certificates (expiry), time sync status (chrony/ntp)
- Firewall/ports (nftables/ufw summary)

**Additional Kubernetes signals**

- Ingress, Jobs/CronJobs, HPA/VPA
- Throttling and OOM kill detail beyond default metrics

**Additional bridge capabilities**

- `systemd.status`, `tail.pod_logs` tools
- Alertmanager API for active-alert summaries
- Full MCP transport (stdio/SSE) vs. current HTTP schema approximation

---

## Reference architecture

```
┌─────────────────────────────────────────────────────────────┐
│  LLM Agent (MCP client)                                     │
└──────────────────────────┬──────────────────────────────────┘
                           │ MCP (resources / tools / prompts)
┌──────────────────────────▼──────────────────────────────────┐
│  mcp-telemetry-bridge (FastAPI, namespace: mcp)             │
│  Read-only proxy to Prometheus, Loki, Kubernetes API        │
└──────┬─────────────────┬────────────────────┬───────────────┘
       │                 │                    │
┌──────▼──────┐  ┌───────▼───────┐   ┌────────▼────────┐
│ Prometheus  │  │ Loki          │   │ Kubernetes API  │
│ Alertmanager│  │ Promtail      │   │ (in-cluster SA) │
│ Grafana     │  │               │   │                 │
│ KSM         │  │               │   │                 │
│ node-export │  │               │   │                 │
└─────────────┘  └───────────────┘   └─────────────────┘
```

### On the cluster

| Component | Status | Role |
|-----------|--------|------|
| [kube-prometheus-stack](https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack) | **Deployed** | Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics, default rules |
| [Loki](https://grafana.com/docs/loki/latest/) + [Promtail](https://grafana.com/docs/loki/latest/send-data/promtail/) | **Deployed** | Log aggregation and shipping |
| [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) | **Deployed (optional)** | OTLP in → Prometheus remote-write / Loki out |
| [metrics-server](https://github.com/kubernetes-sigs/metrics-server) | Planned | Live resource metrics (`kubectl top` semantics) |
| Host DaemonSet sidecar | Planned | systemd, cert, and OS-level facts |

We use standard CNCF pieces so agents reason in **PromQL** and **LogQL** and call a single MCP server for answers.

---

## Why these charts?

| Chart | Rationale |
|-------|-----------|
| **kube-prometheus-stack** | One Helm install for Prometheus Operator, Alertmanager, Grafana, node-exporter, KSM, dashboards, and alert rules |
| **Loki + Promtail** | Cheap, scalable log storage without bolting logs into Prometheus |
| **OTel Collector** | Vendor-agnostic OTLP ingress; fan-out to existing backends without re-architecting |

Ansible copies opinionated values from `helm/values/` and runs `helm upgrade --install` for each chart. See `ansible/roles/telemetry_stack/tasks/main.yml`.

---

## MCP Telemetry Bridge

The bridge (`mcp-telemetry-bridge/`) is the key piece: a small FastAPI service that exposes the MCP surface (resources, tools, prompts) over plain HTTP. Full MCP protocol transport is still planned.

### Implementation status

| Capability | Status |
|------------|--------|
| FastAPI service with health check | Done |
| `/mcp/schema` discovery endpoint | Done |
| `promql.query` | Done |
| `loki.query` | Done |
| `k8s.get` | Done |
| `k8s.events` | Done |
| `inventory.snapshot` | Done |
| Saved PromQL/LogQL resources | Done (3 queries) |
| `Triage-Now` prompt | Stub |
| `Capacity-Check`, `CrashLoop-Playbook` prompts | Planned |
| `systemd.status` | Planned (requires DaemonSet sidecar) |
| `tail.pod_logs` | Planned |
| Alertmanager API | Planned |
| Full MCP protocol transport | Planned |

### Read-only backends

The bridge talks read-only to:

- **Prometheus HTTP API**: instant and range queries
- **Loki HTTP API**: LogQL queries
- **Kubernetes API**: ServiceAccount with RBAC `get`/`list`/`watch`
- **Alertmanager API**: planned for active-alert summaries
- **Node sidecar HTTP**: planned for host-level facts

### Tools (target API)

```
promql.query(expr, range?)
loki.query(logql, limit?, since?)
k8s.get(kind, namespace?, name?)
k8s.events(namespace?, since?)
inventory.snapshot() → JSON
systemd.status(unit)              # planned
```

### Resources

```
res://dashboards/top-pods-by-cpu.promql     # implemented
res://dashboards/pod-restarts.promql        # implemented
res://dashboards/warn-events.logql          # implemented
res://snapshots/cluster-inventory.json      # planned (dynamic)
```

### Prompts

```
Triage-Now           # stub: summarize alerts, top offenders, recent warnings
Capacity-Check       # planned
CrashLoop-Playbook   # planned
```

---

## Security model

- Bridge runs under a dedicated ServiceAccount with a ClusterRole limited to `get`/`list`/`watch`
- NetworkPolicy restricts egress to Prometheus (9090) and Loki (3100); because the bridge also reads from the Kubernetes API, an additional egress allowance for port 443 may be needed
- External exposure should use mTLS or OIDC, because the bridge is not authenticated in v1

---

## Related docs

- [INTENT.md](../INTENT.md): goals, scope, success criteria, known gaps
- [README.md](../README.md): quick start and smoke tests
- [TeleMcpProject.md](TeleMcpProject.md): project overview and audience
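
## Example: mapping `promql.query` onto the Prometheus HTTP API

The `promql.query(expr, range?)` signature listed under Tools covers both instant and range queries. As a minimal sketch of how that one tool could map onto the two Prometheus HTTP API endpoints (`/api/v1/query` and `/api/v1/query_range`), the helper below only builds the request URL and sends nothing. The `PROMETHEUS_URL` value, the `build_promql_url` name, and the `range_seconds` / `step_seconds` parameters are illustrative assumptions and are not taken from the bridge's source.

```python
import time
from typing import Optional
from urllib.parse import urlencode

# Assumed in-cluster address; the real Service name depends on the Helm release.
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"


def build_promql_url(
    expr: str,
    range_seconds: Optional[int] = None,
    step_seconds: int = 30,
    now: Optional[float] = None,
    base_url: str = PROMETHEUS_URL,
) -> str:
    """Return the Prometheus HTTP API URL for an instant or range query."""
    if not expr.strip():
        raise ValueError("expr must be a non-empty PromQL expression")
    end = time.time() if now is None else now
    if range_seconds is None:
        # Instant query: evaluate the expression at a single timestamp.
        path = "/api/v1/query"
        params = {"query": expr, "time": f"{end:.3f}"}
    else:
        if range_seconds <= 0 or step_seconds <= 0:
            raise ValueError("range_seconds and step_seconds must be positive")
        # Range query: evaluate at every step between start and end.
        path = "/api/v1/query_range"
        params = {
            "query": expr,
            "start": f"{end - range_seconds:.3f}",
            "end": f"{end:.3f}",
            "step": str(step_seconds),
        }
    return f"{base_url.rstrip('/')}{path}?{urlencode(params)}"
```

Calling `build_promql_url("up")` yields an instant-query URL, while `build_promql_url("up", range_seconds=600)` yields a range query over the last ten minutes. Keeping URL construction separate from the HTTP call makes the read-only contract easy to check: the helper can only ever produce GET requests against the two query endpoints.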