# TeleMcp Blueprint

*Building a Kubernetes telemetry MCP bridge*

> **Source:** [Original design conversation](https://chatgpt.com/share/68bdf06d-8c2c-8009-90c5-466f9f531d9a)
> **Authority:** Scope and priorities are governed by [INTENT.md](../INTENT.md). This document explains *why* each component exists and *how* the bridge is shaped.

## Overview

Blueprint for a telemetry service + MCP bridge that auto-deploys on a Linux-based Kubernetes host (k3s or standard k8s) via Ansible + Helm, and exposes everything an LLM agent needs to bootstrap, monitor, and operate the box.

MCP acts as the standardized "USB-C" between the LLM agent and your telemetry; see the [Model Context Protocol spec](https://modelcontextprotocol.io).

---

## What we capture

### Minimum viable (current target)

**Kubernetes (control + workloads)**

- Cluster and node status, taints, conditions, kubelet health
- Namespaces, Deployments, StatefulSets, DaemonSets, Pods (phase, restarts, images, age)
- Services, Events (warning/error)
- Resource usage per pod/node/namespace via Prometheus, cAdvisor, and kube-state-metrics

**Logs and alerts**

- Pod and node logs via Loki/Promtail
- Default alert rules: node not ready, API/etcd degradation, CrashLoopBackOff, job failures

**Bridge surface**

- Tools: `promql.query`, `loki.query`, `k8s.get`, `k8s.events`, `inventory.snapshot`
- Resources: saved PromQL/LogQL queries, cluster inventory snapshots
- Prompts: triage and operational playbooks (`Triage-Now` stubbed; others planned)

### Stretch (deferred)

**Host (Linux / node)**

- CPU, memory, disk, inode, filesystem, network, NIC errors *(partially covered by node-exporter)*
- Distro/kernel/version, packages/updates
- Systemd unit status for key services (container runtime, kubelet, nginx, etc.)
- Certificates (expiry), time sync status (chrony/ntp)
- Firewall/ports (nftables/ufw summary)

**Additional Kubernetes signals**

- Ingress, Jobs/CronJobs, HPA/VPA
- Throttling and OOM kill detail beyond default metrics

**Additional bridge capabilities**

- `systemd.status`, `tail.pod_logs` tools
- Alertmanager API for active-alert summaries
- Full MCP transport (stdio/SSE) in place of the current HTTP schema approximation

---

## Reference architecture

```
┌─────────────────────────────────────────────────────────────┐
│                   LLM Agent (MCP client)                    │
└──────────────────────────┬──────────────────────────────────┘
                           │ MCP (resources / tools / prompts)
┌──────────────────────────▼──────────────────────────────────┐
│  mcp-telemetry-bridge (FastAPI, namespace: mcp)             │
│  Read-only proxy to Prometheus, Loki, Kubernetes API        │
└──────┬─────────────────┬────────────────────┬───────────────┘
       │                 │                    │
┌──────▼──────┐  ┌───────▼───────┐   ┌────────▼────────┐
│ Prometheus  │  │ Loki          │   │ Kubernetes API  │
│ Alertmanager│  │ Promtail      │   │ (in-cluster SA) │
│ Grafana     │  │               │   │                 │
│ KSM         │  │               │   │                 │
│ node-export │  │               │   │                 │
└─────────────┘  └───────────────┘   └─────────────────┘
```

### On the cluster

| Component | Status | Role |
|-----------|--------|------|
| [kube-prometheus-stack](https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack) | **Deployed** | Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics, default rules |
| [Loki](https://grafana.com/docs/loki/latest/) + [Promtail](https://grafana.com/docs/loki/latest/send-data/promtail/) | **Deployed** | Log aggregation and shipping |
| [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) | **Deployed (optional)** | OTLP in → Prometheus remote-write / Loki out |
| [metrics-server](https://github.com/kubernetes-sigs/metrics-server) | Planned | Live resource metrics (`kubectl top` semantics) |
| Host DaemonSet sidecar | Planned | systemd, cert, and OS-level facts |

We use standard cloud-native pieces so agents reason in **PromQL** and **LogQL** and call a single MCP server for answers.

---

## Why these charts?

| Chart | Rationale |
|-------|-----------|
| **kube-prometheus-stack** | One Helm install for Prometheus Operator, Alertmanager, Grafana, node-exporter, KSM, dashboards, and alert rules |
| **Loki + Promtail** | Cheap, scalable log storage without bolting logs into Prometheus |
| **OTel Collector** | Vendor-agnostic OTLP ingress; fan-out to existing backends without re-architecting |

Ansible copies opinionated values from `helm/values/` and runs `helm upgrade --install` for each chart. See `ansible/roles/telemetry_stack/tasks/main.yml`.

---

## MCP Telemetry Bridge

The bridge (`mcp-telemetry-bridge/`) is the key piece: a small FastAPI service that implements the MCP surface (resources, tools, prompts). It currently approximates that surface over plain HTTP; full MCP transport is planned.

### Implementation status

| Capability | Status |
|------------|--------|
| FastAPI service with health check | Done |
| `/mcp/schema` discovery endpoint | Done |
| `promql.query` | Done |
| `loki.query` | Done |
| `k8s.get` | Done |
| `k8s.events` | Done |
| `inventory.snapshot` | Done |
| Saved PromQL/LogQL resources | Done (3 queries) |
| `Triage-Now` prompt | Stub |
| `Capacity-Check`, `CrashLoop-Playbook` prompts | Planned |
| `systemd.status` | Planned (requires DaemonSet sidecar) |
| `tail.pod_logs` | Planned |
| Alertmanager API | Planned |
| Full MCP protocol transport | Planned |

### Read-only backends

The bridge talks read-only to:

- **Prometheus HTTP API**: instant and range queries
- **Loki HTTP API**: LogQL queries
- **Kubernetes API**: ServiceAccount with RBAC `get`/`list`/`watch`
- **Alertmanager API**: planned for active-alert summaries
- **Node sidecar HTTP**: planned for host-level facts
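
As a rough illustration of what read-only access to the first two backends can look like, the sketch below builds GET URLs for the Prometheus and Loki HTTP APIs. The endpoint paths (`/api/v1/query`, `/api/v1/query_range`, `/loki/api/v1/query_range`) are the documented ones; the base URLs, function names, and default limit are assumptions for illustration, not the bridge's actual code.

```python
from urllib.parse import urlencode

# Assumed in-cluster service addresses; the real ones depend on the Helm release names.
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"
LOKI_URL = "http://loki.monitoring.svc:3100"


def prometheus_instant_url(expr, base=PROMETHEUS_URL):
    """GET URL for an instant query, evaluated at the server's current time."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


def prometheus_range_url(expr, start, end, step, base=PROMETHEUS_URL):
    """GET URL for a range query; start/end are Unix seconds, step is in seconds."""
    params = {"query": expr, "start": start, "end": end, "step": step}
    return f"{base}/api/v1/query_range?{urlencode(params)}"


def loki_range_url(logql, limit=100, start_ns=None, end_ns=None, base=LOKI_URL):
    """GET URL for a LogQL range query; start/end are Unix nanoseconds."""
    params = {"query": logql, "limit": limit}
    if start_ns is not None:
        params["start"] = start_ns
    if end_ns is not None:
        params["end"] = end_ns
    return f"{base}/loki/api/v1/query_range?{urlencode(params)}"
```

Keeping every call a GET against a query endpoint is one simple way to hold the read-only line at the HTTP layer; the actual bridge may enforce this differently.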

### Tools (target API)

```
promql.query(expr, range?)
loki.query(logql, limit?, since?)
k8s.get(kind, namespace?, name?)
k8s.events(namespace?, since?)
inventory.snapshot()              → JSON
systemd.status(unit)              # planned
```
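
One way to keep the tool surface consistent is to describe each signature as data and validate calls before they reach a backend. The sketch below mirrors the target API above, reading a `?` suffix as an optional argument; `ToolSpec`, `TOOLS`, and `validate_call` are hypothetical names, not taken from the bridge's source.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class ToolSpec:
    name: str
    required: Tuple[str, ...] = ()
    optional: Tuple[str, ...] = ()
    planned: bool = False  # listed in the target API but not implemented yet


# Mirrors the target API listing; argument names are copied from it.
TOOLS: Dict[str, ToolSpec] = {
    spec.name: spec
    for spec in (
        ToolSpec("promql.query", required=("expr",), optional=("range",)),
        ToolSpec("loki.query", required=("logql",), optional=("limit", "since")),
        ToolSpec("k8s.get", required=("kind",), optional=("namespace", "name")),
        ToolSpec("k8s.events", optional=("namespace", "since")),
        ToolSpec("inventory.snapshot"),
        ToolSpec("systemd.status", required=("unit",), planned=True),
    )
}


def validate_call(tool: str, args: Dict[str, Any]) -> ToolSpec:
    """Return the matching spec, or raise ValueError describing the problem."""
    spec = TOOLS.get(tool)
    if spec is None:
        raise ValueError(f"unknown tool: {tool}")
    if spec.planned:
        raise ValueError(f"tool not implemented yet: {tool}")
    missing = [p for p in spec.required if p not in args]
    if missing:
        raise ValueError(f"{tool}: missing argument(s): {', '.join(missing)}")
    unexpected = sorted(set(args) - set(spec.required) - set(spec.optional))
    if unexpected:
        raise ValueError(f"{tool}: unexpected argument(s): {', '.join(unexpected)}")
    return spec
```

The same table could also drive the `/mcp/schema` discovery response, so the advertised tools and the accepted arguments cannot drift apart.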

### Resources

```
res://dashboards/top-pods-by-cpu.promql   # implemented
res://dashboards/pod-restarts.promql      # implemented
res://dashboards/warn-events.logql        # implemented
res://snapshots/cluster-inventory.json    # planned (dynamic)
```
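
To make the URI convention concrete, here is a minimal sketch of how a `res://` URI might be split into a collection, a name, and a format. The registry dict, the format table, and `describe_resource` are hypothetical; only the four URIs and their statuses come from the listing above.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Hypothetical registry: URI -> status, copied from the resource listing.
RESOURCES = {
    "res://dashboards/top-pods-by-cpu.promql": "implemented",
    "res://dashboards/pod-restarts.promql": "implemented",
    "res://dashboards/warn-events.logql": "implemented",
    "res://snapshots/cluster-inventory.json": "planned",
}

# Assumed mapping from file suffix to content format.
FORMATS = {".promql": "PromQL", ".logql": "LogQL", ".json": "JSON"}


def describe_resource(uri):
    """Split a res:// URI into its parts and look up its status."""
    parts = urlsplit(uri)
    if parts.scheme != "res":
        raise ValueError(f"not a res:// URI: {uri}")
    if uri not in RESOURCES:
        raise KeyError(uri)
    name = parts.path.lstrip("/")
    return {
        "collection": parts.netloc,  # e.g. "dashboards"
        "name": name,                # e.g. "pod-restarts.promql"
        "format": FORMATS.get(PurePosixPath(name).suffix, "unknown"),
        "status": RESOURCES[uri],
    }
```

An agent could use the `format` field to decide whether a saved query belongs in `promql.query` or `loki.query`.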

### Prompts

```
Triage-Now           # stub: summarize alerts, top offenders, recent warnings
Capacity-Check       # planned
CrashLoop-Playbook   # planned
```
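
Since `Triage-Now` is still a stub, the following is only a guess at its shape: a function that renders the three inputs named above (alerts, top offenders, recent warnings) into prompt text. The function name, wording, and `limit` parameter are all hypothetical, and the inputs are assumed to be pre-fetched by the existing tools (alerts would need the planned Alertmanager integration).

```python
from typing import Sequence


def render_triage_now(
    alerts: Sequence[str],
    top_offenders: Sequence[str],
    warnings: Sequence[str],
    limit: int = 5,
) -> str:
    """Render a triage prompt from pre-fetched, already summarized findings."""
    sections = (
        ("Active alerts", alerts),
        ("Top offenders", top_offenders),
        ("Recent warnings", warnings),
    )
    lines = [
        "Triage the cluster using the findings below.",
        "Summarize what is wrong, rank by urgency, and suggest read-only follow-up queries.",
    ]
    for title, items in sections:
        lines.append("")
        lines.append(f"## {title}")
        if not items:
            lines.append("- none reported")
            continue
        # Cap each section so the prompt stays small on a noisy cluster.
        lines.extend(f"- {item}" for item in items[:limit])
        if len(items) > limit:
            lines.append(f"- ... and {len(items) - limit} more")
    return "\n".join(lines)
```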

---

## Security model

- Bridge runs under a dedicated ServiceAccount with a ClusterRole limited to `get`/`list`/`watch`
- NetworkPolicy restricts egress to Prometheus (9090) and Loki (3100); because the `k8s.*` tools call the Kubernetes API, an egress allowance for the API server (443) may also be needed
- The bridge performs no authentication in v1, so any external exposure should sit behind mTLS or OIDC
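
The read-only constraint in the first bullet can be checked mechanically. The sketch below expresses a ClusterRole as a Python dict and flags any verb beyond `get`/`list`/`watch`. The role name and the resource lists are assumptions derived from the "What we capture" section, not the manifest shipped with the bridge; the JSON produced by `to_manifest_json` is a form `kubectl apply -f` accepts.

```python
import json

READ_ONLY_VERBS = {"get", "list", "watch"}

# Hypothetical manifest; name and resource lists are illustrative assumptions.
CLUSTER_ROLE = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "ClusterRole",
    "metadata": {"name": "mcp-telemetry-bridge-read"},
    "rules": [
        {
            "apiGroups": [""],
            "resources": ["nodes", "namespaces", "pods", "services", "events"],
            "verbs": ["get", "list", "watch"],
        },
        {
            "apiGroups": ["apps"],
            "resources": ["deployments", "statefulsets", "daemonsets"],
            "verbs": ["get", "list", "watch"],
        },
    ],
}


def non_read_only_verbs(role):
    """Return every verb in the role that goes beyond get/list/watch."""
    found = set()
    for rule in role.get("rules", []):
        found.update(rule.get("verbs", []))
    # A wildcard verb ("*") is not in the allowlist, so it is flagged too.
    return found - READ_ONLY_VERBS


def to_manifest_json(role):
    """Serialize the role as indented JSON."""
    return json.dumps(role, indent=2)
```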

---

## Related docs

- [INTENT.md](../INTENT.md): goals, scope, success criteria, known gaps
- [README.md](../README.md): quick start and smoke tests
- [TeleMcpProject.md](TeleMcpProject.md): project overview and audience