# TeleMcp Blueprint
*Building a Kubernetes telemetry MCP bridge*
> **Source:** [Original design conversation](https://chatgpt.com/share/68bdf06d-8c2c-8009-90c5-466f9f531d9a)
> **Authority:** Scope and priorities are governed by [INTENT.md](../INTENT.md). This document explains *why* each component exists and *how* the bridge is shaped.
## Overview
Blueprint for a telemetry service + MCP bridge that auto-deploys on a Linux-based Kubernetes host (k3s or standard k8s) via Ansible + Helm, and exposes everything an LLM agent needs to bootstrap, monitor, and operate the box.
MCP acts as the standardized "USB-C" between the LLM agent and your telemetry — see the [Model Context Protocol spec](https://modelcontextprotocol.io).
---
## What we capture
### Minimum viable (current target)
**Kubernetes (control + workloads)**
- Cluster and node status, taints, conditions, kubelet health
- Namespaces, Deployments, StatefulSets, DaemonSets, Pods (phase, restarts, images, age)
- Services, Events (warning/error)
- Resource usage per pod/node/namespace via Prometheus, cAdvisor, and kube-state-metrics
**Logs and alerts**
- Pod and node logs via Loki/Promtail
- Default alert rules: node not ready, API/etcd degradation, CrashLoopBackOff, job failures
**Bridge surface**
- Tools: `promql.query`, `loki.query`, `k8s.get`, `k8s.events`, `inventory.snapshot`
- Resources: saved PromQL/LogQL queries, cluster inventory snapshots
- Prompts: triage and operational playbooks (`Triage-Now` stubbed; others planned)
### Stretch (deferred)
**Host (Linux / node)**
- CPU, memory, disk, inode, filesystem, network, NIC errors *(partially covered by node-exporter)*
- Distro/kernel/version, packages/updates
- Systemd unit status for key services (container runtime, kubelet, nginx, etc.)
- Certificates (expiry), time sync status (chrony/ntp)
- Firewall/ports (nftables/ufw summary)
**Additional Kubernetes signals**
- Ingress, Jobs/CronJobs, HPA/VPA
- Throttling and OOM kill detail beyond default metrics
**Additional bridge capabilities**
- `systemd.status`, `tail.pod_logs` tools
- Alertmanager API for active-alert summaries
- Full MCP transport (stdio/SSE) to replace the current HTTP schema approximation
---
## Reference architecture
```
┌─────────────────────────────────────────────────────────────┐
│ LLM Agent (MCP client) │
└──────────────────────────┬──────────────────────────────────┘
│ MCP (resources / tools / prompts)
┌──────────────────────────▼──────────────────────────────────┐
│ mcp-telemetry-bridge (FastAPI, namespace: mcp) │
│ Read-only proxy to Prometheus, Loki, Kubernetes API │
└──────┬─────────────────┬────────────────────┬───────────────┘
│ │ │
┌──────▼──────┐ ┌───────▼───────┐ ┌────────▼────────┐
│ Prometheus │ │ Loki │ │ Kubernetes API │
│ Alertmanager│ │ Promtail │ │ (in-cluster SA) │
│ Grafana │ │ │ │ │
│ KSM │ │ │ │ │
│ node-export │ │ │ │ │
└─────────────┘ └───────────────┘ └─────────────────┘
```
### On the cluster
| Component | Status | Role |
|-----------|--------|------|
| [kube-prometheus-stack](https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack) | **Deployed** | Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics, default rules |
| [Loki](https://grafana.com/docs/loki/latest/) + [Promtail](https://grafana.com/docs/loki/latest/send-data/promtail/) | **Deployed** | Log aggregation and shipping |
| [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) | **Deployed (optional)** | OTLP in → Prometheus remote-write / Loki out |
| [metrics-server](https://github.com/kubernetes-sigs/metrics-server) | Planned | Live resource metrics (`kubectl top` semantics) |
| Host DaemonSet sidecar | Planned | systemd, cert, and OS-level facts |
We use widely adopted open-source pieces (several of them CNCF projects) so agents reason in **PromQL** and **LogQL** and call a single MCP server for answers.
---
## Why these charts?
| Chart | Rationale |
|-------|-----------|
| **kube-prometheus-stack** | One Helm install for Prometheus Operator, Alertmanager, Grafana, node-exporter, KSM, dashboards, and alert rules |
| **Loki + Promtail** | Cheap, scalable log storage without bolting logs into Prometheus |
| **OTel Collector** | Vendor-agnostic OTLP ingress; fan-out to existing backends without re-architecting |
Ansible copies opinionated values from `helm/values/` and runs `helm upgrade --install` for each chart. See `ansible/roles/telemetry_stack/tasks/main.yml`.
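
As a rough illustration of what each Ansible task ends up running, the sketch below assembles a `helm upgrade --install` command line. The release name, namespace (`monitoring`), and values file name are assumptions for the example; the real ones live in `ansible/roles/telemetry_stack/tasks/main.yml` and `helm/values/`.

```python
from pathlib import PurePosixPath

# Directory named in the text; the file names inside it are assumed here.
VALUES_DIR = PurePosixPath("helm/values")


def helm_upgrade_install(release, chart, namespace, values_file):
    """Return the argv for an idempotent Helm install-or-upgrade."""
    return [
        "helm", "upgrade", "--install", release, chart,
        "--namespace", namespace,
        "--create-namespace",          # create the namespace on first run
        "--values", str(values_file),  # opinionated overrides copied by Ansible
    ]


# Hypothetical release, namespace, and values file, for illustration only.
cmd = helm_upgrade_install(
    release="kube-prometheus-stack",
    chart="prometheus-community/kube-prometheus-stack",
    namespace="monitoring",
    values_file=VALUES_DIR / "kube-prometheus-stack.yaml",
)
print(" ".join(cmd))
```

Because `upgrade --install` installs the release when it is missing and upgrades it otherwise, re-running the playbook is safe.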
---
## MCP Telemetry Bridge
The bridge (`mcp-telemetry-bridge/`) is the key piece: a small FastAPI service that exposes the MCP surface (resources, tools, prompts) over HTTP. Full MCP protocol transport is still planned.
### Implementation status
| Capability | Status |
|------------|--------|
| FastAPI service with health check | Done |
| `/mcp/schema` discovery endpoint | Done |
| `promql.query` | Done |
| `loki.query` | Done |
| `k8s.get` | Done |
| `k8s.events` | Done |
| `inventory.snapshot` | Done |
| Saved PromQL/LogQL resources | Done (3 queries) |
| `Triage-Now` prompt | Stub |
| `Capacity-Check`, `CrashLoop-Playbook` prompts | Planned |
| `systemd.status` | Planned (requires DaemonSet sidecar) |
| `tail.pod_logs` | Planned |
| Alertmanager API | Planned |
| Full MCP protocol transport | Planned |
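
To make the status table concrete, here is a minimal sketch of how a discovery payload for `/mcp/schema` could be derived from it, so that only implemented tools are advertised. The payload shape (`tools` / `planned` keys) is an assumption for illustration; the bridge's actual schema format is not specified in this document.

```python
import json

# Tool names and statuses mirror the table above.
TOOLS = {
    "promql.query": "done",
    "loki.query": "done",
    "k8s.get": "done",
    "k8s.events": "done",
    "inventory.snapshot": "done",
    "systemd.status": "planned",
    "tail.pod_logs": "planned",
}


def build_schema(tools):
    """Split tools into callable ones and ones that are only planned.

    The returned shape is hypothetical, not the bridge's real schema.
    """
    return {
        "tools": sorted(n for n, status in tools.items() if status == "done"),
        "planned": sorted(n for n, status in tools.items() if status != "done"),
    }


schema = build_schema(TOOLS)
print(json.dumps(schema, indent=2))
```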
### Read-only backends
The bridge talks read-only to:
- **Prometheus HTTP API** — instant and range queries
- **Loki HTTP API** — LogQL queries
- **Kubernetes API** — ServiceAccount with RBAC `get`/`list`/`watch`
- **Alertmanager API** — planned for active-alert summaries
- **Node sidecar HTTP** — planned for host-level facts
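
The two query backends are plain HTTP GET APIs, so the read-only proxying can be sketched as URL construction. The endpoint paths (`/api/v1/query`, `/api/v1/query_range`, `/loki/api/v1/query_range`) are the documented Prometheus and Loki routes. The in-cluster service hostnames and function names below are assumptions.

```python
from urllib.parse import urlencode

# Hypothetical in-cluster service URLs; ports match the security model below.
PROM_URL = "http://prometheus.monitoring.svc:9090"
LOKI_URL = "http://loki.monitoring.svc:3100"


def prometheus_url(expr, start=None, end=None, step="60s"):
    """Build an instant query URL, or a range query URL when start/end are given."""
    if start is None or end is None:
        return f"{PROM_URL}/api/v1/query?" + urlencode({"query": expr})
    params = {"query": expr, "start": start, "end": end, "step": step}
    return f"{PROM_URL}/api/v1/query_range?" + urlencode(params)


def loki_url(logql, start_ns, end_ns, limit=100):
    """Build a LogQL range query URL; Loki accepts nanosecond Unix timestamps."""
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{LOKI_URL}/loki/api/v1/query_range?" + urlencode(params)


print(prometheus_url("up"))
print(loki_url('{namespace="mcp"}', 0, 60_000_000_000))
```

Only GET requests are built here, which keeps the proxy read-only by construction.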
### Tools (target API)
```
promql.query(expr, range?)
loki.query(logql, limit?, since?)
k8s.get(kind, namespace?, name?)
k8s.events(namespace?, since?)
inventory.snapshot() → JSON
systemd.status(unit) # planned
```
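
One way `k8s.get(kind, namespace?, name?)` could resolve its arguments is by mapping each kind to a Kubernetes REST path. The sketch below covers the kinds listed under the minimum viable scope. The function name and the choice to reject unknown kinds are assumptions, not the bridge's actual behaviour.

```python
# (API prefix, plural resource name) for each supported kind.
KINDS = {
    "Pod": ("/api/v1", "pods"),
    "Service": ("/api/v1", "services"),
    "Event": ("/api/v1", "events"),
    "Namespace": ("/api/v1", "namespaces"),
    "Node": ("/api/v1", "nodes"),
    "Deployment": ("/apis/apps/v1", "deployments"),
    "StatefulSet": ("/apis/apps/v1", "statefulsets"),
    "DaemonSet": ("/apis/apps/v1", "daemonsets"),
}
CLUSTER_SCOPED = {"Namespace", "Node"}


def k8s_path(kind, namespace=None, name=None):
    """Return the REST path for a get (name given) or list (no name) request."""
    if kind not in KINDS:
        raise ValueError(f"unsupported kind: {kind}")
    namespaced = kind not in CLUSTER_SCOPED
    if name and namespaced and not namespace:
        raise ValueError("a namespace is required to get a namespaced object by name")
    prefix, plural = KINDS[kind]
    path = prefix
    if namespace and namespaced:
        path += f"/namespaces/{namespace}"
    path += f"/{plural}"
    if name:
        path += f"/{name}"
    return path


print(k8s_path("Pod", "mcp", "bridge-0"))
print(k8s_path("Deployment"))
```

An explicit allow-list of kinds keeps the tool surface aligned with the read-only RBAC grant.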
### Resources
```
res://dashboards/top-pods-by-cpu.promql # implemented
res://dashboards/pod-restarts.promql # implemented
res://dashboards/warn-events.logql # implemented
res://snapshots/cluster-inventory.json # planned (dynamic)
```
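
A minimal sketch of how the three saved-query resources could be served from an in-memory registry keyed by URI. The query bodies are illustrative assumptions built from standard cAdvisor and kube-state-metrics metric names. They are not the queries shipped with the bridge.

```python
# URIs come from the list above; the query text is assumed for illustration.
SAVED_QUERIES = {
    "res://dashboards/top-pods-by-cpu.promql":
        "topk(10, sum by (namespace, pod) "
        "(rate(container_cpu_usage_seconds_total[5m])))",
    "res://dashboards/pod-restarts.promql":
        "sum by (namespace, pod) "
        "(increase(kube_pod_container_status_restarts_total[1h])) > 0",
    "res://dashboards/warn-events.logql":
        '{namespace=~".+"} |~ "(?i)warning|error"',
}


def read_resource(uri):
    """Return the saved query plus its language, inferred from the URI suffix."""
    try:
        text = SAVED_QUERIES[uri]
    except KeyError:
        raise KeyError(f"unknown resource: {uri}") from None
    language = uri.rsplit(".", 1)[-1]  # "promql" or "logql"
    return {"uri": uri, "language": language, "text": text}


print(read_resource("res://dashboards/pod-restarts.promql")["language"])
```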
### Prompts
```
Triage-Now # stub — summarize alerts, top offenders, recent warnings
Capacity-Check # planned
CrashLoop-Playbook # planned
```
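
Since `Triage-Now` is still a stub, the following is only a sketch of how its three inputs (alerts, top offenders, recent warnings) might be rendered into prompt text. The function name, wording, and section titles are hypothetical.

```python
def render_triage_prompt(alerts, top_offenders, warnings):
    """Render a plain-text triage prompt from three lists of short strings."""

    def section(title, items):
        # Fall back to an explicit marker so empty sections stay visible.
        body = "\n".join(f"- {item}" for item in items) or "- (none)"
        return f"{title}:\n{body}"

    return "\n\n".join([
        "Summarize the current state of the cluster and propose next steps.",
        section("Active alerts", alerts),
        section("Top offenders", top_offenders),
        section("Recent warnings", warnings),
    ])


print(render_triage_prompt([], ["mcp/bridge-0"], ["BackOff restarting container"]))
```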
---
## Security model
- Bridge runs under a dedicated ServiceAccount with a ClusterRole limited to `get`/`list`/`watch`
- NetworkPolicy restricts egress to Prometheus (9090) and Loki (3100). Because the bridge also calls the Kubernetes API, an additional egress allowance to the API server (443) may be needed
- The bridge has no authentication of its own in v1, so any external exposure should sit behind mTLS or OIDC
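
The read-only guarantee can be checked mechanically. The sketch below models a ClusterRole as a plain dictionary and reports any verb outside `get`/`list`/`watch`. The role name and resource lists are assumptions; only the verb set comes from this document.

```python
READ_ONLY_VERBS = {"get", "list", "watch"}

# Hypothetical ClusterRole; the name and resource lists are illustrative.
cluster_role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "ClusterRole",
    "metadata": {"name": "mcp-telemetry-bridge-readonly"},
    "rules": [
        {
            "apiGroups": [""],
            "resources": ["pods", "services", "events", "namespaces", "nodes"],
            "verbs": ["get", "list", "watch"],
        },
        {
            "apiGroups": ["apps"],
            "resources": ["deployments", "statefulsets", "daemonsets"],
            "verbs": ["get", "list", "watch"],
        },
    ],
}


def non_read_verbs(role):
    """Return every verb in the role that is not get/list/watch (including '*')."""
    found = set()
    for rule in role.get("rules", []):
        found |= set(rule.get("verbs", [])) - READ_ONLY_VERBS
    return found


print(non_read_verbs(cluster_role) or "role is read-only")
```

A check like this could run in CI against the rendered Helm manifests to catch accidental privilege creep.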
---
## Related docs
- [INTENT.md](../INTENT.md) — goals, scope, success criteria, known gaps
- [README.md](../README.md) — quick start and smoke tests
- [TeleMcpProject.md](TeleMcpProject.md) — project overview and audience