# TeleMcp Blueprint

*Building a Kubernetes telemetry MCP bridge*

> **Source:** [Original design conversation](https://chatgpt.com/share/68bdf06d-8c2c-8009-90c5-466f9f531d9a)
> **Authority:** Scope and priorities are governed by [INTENT.md](../INTENT.md). This document explains *why* each component exists and *how* the bridge is shaped.

## Overview

Blueprint for a telemetry service + MCP bridge that auto-deploys on a Linux-based Kubernetes host (k3s or standard k8s) via Ansible + Helm, and exposes everything an LLM agent needs to bootstrap, monitor, and operate the box.

MCP acts as the standardized "USB-C" between the LLM agent and your telemetry — see the [Model Context Protocol spec](https://modelcontextprotocol.io).

---

## What we capture

### Minimum viable (current target)

**Kubernetes (control + workloads)**

- Cluster and node status, taints, conditions, kubelet health
- Namespaces, Deployments, StatefulSets, DaemonSets, Pods (phase, restarts, images, age)
- Services, Events (warning/error)
- Resource usage per pod/node/namespace via Prometheus, cAdvisor, and kube-state-metrics

**Logs and alerts**

- Pod and node logs via Loki/Promtail
- Default alert rules: node not ready, API/etcd degradation, CrashLoopBackOff, job failures

**Bridge surface**

- Tools: `promql.query`, `loki.query`, `k8s.get`, `k8s.events`, `inventory.snapshot`
- Resources: saved PromQL/LogQL queries, cluster inventory snapshots
- Prompts: triage and operational playbooks (`Triage-Now` stubbed; others planned)
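
As an illustration of what `inventory.snapshot` could return, here is a minimal sketch that reduces a Kubernetes PodList to phase counts and per-pod restart totals. The function name and output shape are assumptions made for this example, not the bridge's actual contract; the input fields (`status.phase`, `status.containerStatuses[].restartCount`) are standard Pod fields.

```python
from collections import Counter
from typing import Any


def summarize_pods(pod_list: dict[str, Any]) -> dict[str, Any]:
    """Reduce a PodList (e.g. the JSON from GET /api/v1/pods) to a compact summary.

    Hypothetical helper: the output keys are illustrative only.
    """
    phases = Counter()
    restarts: dict[str, int] = {}
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        phases[status.get("phase", "Unknown")] += 1
        # Sum restarts across all containers in the pod.
        total = sum(cs.get("restartCount", 0)
                    for cs in status.get("containerStatuses", []))
        if total:
            restarts[f"{meta.get('namespace')}/{meta.get('name')}"] = total
    return {"pods_by_phase": dict(phases), "restarts": restarts}
```

Keeping the snapshot this small is a deliberate choice in the sketch: an agent can ask for detail with `k8s.get` once the summary points at a suspect pod.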

### Stretch (deferred)

**Host (Linux / node)**

- CPU, memory, disk, inode, filesystem, network, NIC errors *(partially covered by node-exporter)*
- Distro/kernel/version, packages/updates
- Systemd unit status for key services (container runtime, kubelet, nginx, etc.)
- Certificates (expiry), time sync status (chrony/ntp)
- Firewall/ports (nftables/ufw summary)

**Additional Kubernetes signals**

- Ingress, Jobs/CronJobs, HPA/VPA
- Throttling and OOM kill detail beyond default metrics

**Additional bridge capabilities**

- `systemd.status`, `tail.pod_logs` tools
- Alertmanager API for active-alert summaries
- Full MCP transport (stdio/SSE) to replace the current HTTP schema approximation

---

## Reference architecture

```
┌─────────────────────────────────────────────────────────────┐
│  LLM Agent (MCP client)                                     │
└──────────────────────────┬──────────────────────────────────┘
                           │ MCP (resources / tools / prompts)
┌──────────────────────────▼──────────────────────────────────┐
│  mcp-telemetry-bridge  (FastAPI, namespace: mcp)            │
│  Read-only proxy to Prometheus, Loki, Kubernetes API        │
└──────┬─────────────────┬────────────────────┬───────────────┘
       │                 │                    │
┌──────▼──────┐  ┌───────▼───────┐   ┌────────▼────────┐
│ Prometheus  │  │ Loki          │   │ Kubernetes API  │
│ Alertmanager│  │ Promtail      │   │ (in-cluster SA) │
│ Grafana     │  │               │   │                 │
│ KSM         │  │               │   │                 │
│ node-export │  │               │   │                 │
└─────────────┘  └───────────────┘   └─────────────────┘
```

### On the cluster

| Component | Status | Role |
|-----------|--------|------|
| [kube-prometheus-stack](https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack) | **Deployed** | Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics, default rules |
| [Loki](https://grafana.com/docs/loki/latest/) + [Promtail](https://grafana.com/docs/loki/latest/send-data/promtail/) | **Deployed** | Log aggregation and shipping |
| [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) | **Deployed (optional)** | OTLP in → Prometheus remote-write / Loki out |
| [metrics-server](https://github.com/kubernetes-sigs/metrics-server) | Planned | Live resource metrics (`kubectl top` semantics) |
| Host DaemonSet sidecar | Planned | systemd, cert, and OS-level facts |

We use standard open-source components (Prometheus and OpenTelemetry are CNCF projects; Loki is maintained by Grafana Labs) so agents reason in **PromQL** and **LogQL** and call a single MCP server for answers.

---

## Why these charts?

| Chart | Rationale |
|-------|-----------|
| **kube-prometheus-stack** | One Helm install for Prometheus Operator, Alertmanager, Grafana, node-exporter, KSM, dashboards, and alert rules |
| **Loki + Promtail** | Cheap, scalable log storage without bolting logs into Prometheus |
| **OTel Collector** | Vendor-agnostic OTLP ingress; fan-out to existing backends without re-architecting |

Ansible copies opinionated values from `helm/values/` and runs `helm upgrade --install` for each chart. See `ansible/roles/telemetry_stack/tasks/main.yml`.
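
The per-chart step can be pictured as the command line below, built here in Python purely for illustration. This is a sketch, not the contents of the Ansible role: the release names, the `monitoring` namespace, and the values file names are assumptions, while the chart references and the `helm upgrade` flags are the standard ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chart:
    release: str      # Helm release name (hypothetical)
    chart: str        # repo/chart reference
    namespace: str    # target namespace (hypothetical)
    values_file: str  # file under helm/values/ (file name is hypothetical)


CHARTS = [
    Chart("kube-prometheus-stack", "prometheus-community/kube-prometheus-stack",
          "monitoring", "helm/values/kube-prometheus-stack.yaml"),
    Chart("loki", "grafana/loki", "monitoring", "helm/values/loki.yaml"),
    Chart("promtail", "grafana/promtail", "monitoring", "helm/values/promtail.yaml"),
]


def helm_upgrade_install(c: Chart) -> list[str]:
    """Build the argv for an idempotent install-or-upgrade of one chart."""
    return [
        "helm", "upgrade", "--install", c.release, c.chart,
        "--namespace", c.namespace, "--create-namespace",
        "--values", c.values_file,
    ]


if __name__ == "__main__":
    # Print the commands instead of running them.
    for chart in CHARTS:
        print(" ".join(helm_upgrade_install(chart)))
```

Because `helm upgrade --install` installs a release when it is missing and upgrades it otherwise, re-running the playbook is safe.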

---

## MCP Telemetry Bridge

The bridge (`mcp-telemetry-bridge/`) is the key piece: a small FastAPI service that exposes the MCP surface (resources, tools, prompts). Today it does so over plain HTTP with a schema discovery endpoint; full MCP transport is planned.

### Implementation status

| Capability | Status |
|------------|--------|
| FastAPI service with health check | Done |
| `/mcp/schema` discovery endpoint | Done |
| `promql.query` | Done |
| `loki.query` | Done |
| `k8s.get` | Done |
| `k8s.events` | Done |
| `inventory.snapshot` | Done |
| Saved PromQL/LogQL resources | Done (3 queries) |
| `Triage-Now` prompt | Stub |
| `Capacity-Check`, `CrashLoop-Playbook` prompts | Planned |
| `systemd.status` | Planned (requires DaemonSet sidecar) |
| `tail.pod_logs` | Planned |
| Alertmanager API | Planned |
| Full MCP protocol transport | Planned |

### Read-only backends

The bridge talks read-only to:

- **Prometheus HTTP API** — instant and range queries
- **Loki HTTP API** — LogQL queries
- **Kubernetes API** — ServiceAccount with RBAC `get`/`list`/`watch`
- **Alertmanager API** — planned for active-alert summaries
- **Node sidecar HTTP** — planned for host-level facts
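
For the two query backends, the read-only calls are plain HTTP GETs. The sketch below only builds the request URLs, using the documented Prometheus (`/api/v1/query`, `/api/v1/query_range`) and Loki (`/loki/api/v1/query_range`) endpoints. The service hostnames are hypothetical; only the ports come from this document.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical in-cluster service URLs; adjust to the actual release names.
PROM_URL = "http://prometheus.monitoring.svc:9090"
LOKI_URL = "http://loki.monitoring.svc:3100"


def prom_query_url(expr: str, start: Optional[float] = None,
                   end: Optional[float] = None, step: str = "60s") -> str:
    """Instant query when no range is given, range query otherwise."""
    if start is None or end is None:
        return f"{PROM_URL}/api/v1/query?{urlencode({'query': expr})}"
    params = {"query": expr, "start": start, "end": end, "step": step}
    return f"{PROM_URL}/api/v1/query_range?{urlencode(params)}"


def loki_query_url(logql: str, limit: int = 100,
                   start_ns: Optional[int] = None,
                   end_ns: Optional[int] = None) -> str:
    """LogQL range query; start/end are Unix timestamps in nanoseconds."""
    params = {"query": logql, "limit": limit}
    if start_ns is not None:
        params["start"] = start_ns
    if end_ns is not None:
        params["end"] = end_ns
    return f"{LOKI_URL}/loki/api/v1/query_range?{urlencode(params)}"
```

For example, `prom_query_url("up")` targets the instant-query endpoint, while passing `start` and `end` switches to the range endpoint.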

### Tools (target API)

```
promql.query(expr, range?)
loki.query(logql, limit?, since?)
k8s.get(kind, namespace?, name?)
k8s.events(namespace?, since?)
inventory.snapshot() → JSON
systemd.status(unit)          # planned
```
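
One way `k8s.get(kind, namespace?, name?)` could map onto the Kubernetes REST API is an allowlist of kinds, sketched below. The table and the `k8s_path` helper are assumptions for illustration; the paths themselves follow the standard core (`/api/v1`) and `apps/v1` API layout.

```python
from typing import Optional

# kind -> (API prefix, plural resource name, namespaced?)
KINDS = {
    "Pod": ("/api/v1", "pods", True),
    "Service": ("/api/v1", "services", True),
    "Namespace": ("/api/v1", "namespaces", False),
    "Node": ("/api/v1", "nodes", False),
    "Deployment": ("/apis/apps/v1", "deployments", True),
    "StatefulSet": ("/apis/apps/v1", "statefulsets", True),
    "DaemonSet": ("/apis/apps/v1", "daemonsets", True),
}


def k8s_path(kind: str, namespace: Optional[str] = None,
             name: Optional[str] = None) -> str:
    """Build the GET path for a list (no name) or a single object (with name)."""
    if kind not in KINDS:
        raise ValueError(f"unsupported kind: {kind}")
    prefix, plural, namespaced = KINDS[kind]
    if name and namespaced and not namespace:
        raise ValueError(f"{kind} lookups by name need a namespace")
    parts = [prefix]
    if namespaced and namespace:
        parts.append(f"namespaces/{namespace}")
    parts.append(plural)
    if name:
        parts.append(name)
    return "/".join(parts)
```

An allowlist keeps the tool surface narrow: kinds that are not listed (Secrets, for instance) are rejected before any request is made.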

### Resources

```
res://dashboards/top-pods-by-cpu.promql    # implemented
res://dashboards/pod-restarts.promql       # implemented
res://dashboards/warn-events.logql         # implemented
res://snapshots/cluster-inventory.json     # planned (dynamic)
```
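
A `res://` URI can be resolved to a saved query with a simple lookup, sketched below. The query texts are placeholders written for this example, not the three queries shipped with the bridge, and `resolve` is a hypothetical helper; the metric names are standard cAdvisor and kube-state-metrics series.

```python
from urllib.parse import urlsplit

# Placeholder query bodies; the real saved queries may differ.
SAVED_QUERIES = {
    ("dashboards", "top-pods-by-cpu.promql"):
        "topk(5, sum by (namespace, pod) "
        "(rate(container_cpu_usage_seconds_total[5m])))",
    ("dashboards", "pod-restarts.promql"):
        "sum by (namespace, pod) "
        "(increase(kube_pod_container_status_restarts_total[1h]))",
    ("dashboards", "warn-events.logql"):
        '{namespace=~".+"} |= "Warning"',
}


def resolve(uri: str) -> tuple[str, str]:
    """Return (query language, query text) for a res:// URI."""
    parts = urlsplit(uri)
    if parts.scheme != "res":
        raise ValueError(f"not a res:// URI: {uri}")
    key = (parts.netloc, parts.path.lstrip("/"))
    if key not in SAVED_QUERIES:
        raise KeyError(uri)
    # The file extension doubles as the query language tag.
    language = key[1].rsplit(".", 1)[-1]
    return language, SAVED_QUERIES[key]
```

The returned language tells the caller whether to send the text to `promql.query` or `loki.query`.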

### Prompts

```
Triage-Now           # stub — summarize alerts, top offenders, recent warnings
Capacity-Check       # planned
CrashLoop-Playbook   # planned
```
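
Since `Triage-Now` is still a stub, the following is only a sketch of how its three inputs could be assembled into prompt text. The function name, wording, and section titles are assumptions. Until the Alertmanager API is wired in, firing alerts could come from Prometheus itself, for example through the built-in `ALERTS{alertstate="firing"}` series.

```python
def render_triage_now(alerts: list[str], top_offenders: list[str],
                      warnings: list[str]) -> str:
    """Assemble a triage prompt from pre-fetched telemetry (hypothetical layout)."""

    def section(title: str, items: list[str]) -> str:
        body = "\n".join(f"- {item}" for item in items) if items else "- none"
        return f"{title}:\n{body}"

    return "\n\n".join([
        "Summarize the current cluster state and suggest what to check next.",
        section("Active alerts", alerts),
        section("Top offenders", top_offenders),
        section("Recent warnings", warnings),
    ])
```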

---

## Security model

- Bridge runs under a dedicated ServiceAccount with a ClusterRole limited to `get`/`list`/`watch`
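- NetworkPolicy restricts egress to Prometheus (9090) and Loki (3100). `k8s.get` and `k8s.events` also need to reach the Kubernetes API (443), so confirm the policy allows that egress
- The bridge has no authentication in v1; put mTLS or OIDC in front of it before exposing it outside the cluster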
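
The read-only guarantee can be checked mechanically. The sketch below scans a ClusterRole manifest (already loaded into a dict) for any verb outside `get`/`list`/`watch`; `non_read_verbs` is a hypothetical helper, not part of the bridge.

```python
READ_VERBS = frozenset({"get", "list", "watch"})


def non_read_verbs(cluster_role: dict) -> set[str]:
    """Return every verb in the role that is not get/list/watch.

    An empty result means the role is read-only. The wildcard "*" is
    reported too, because it grants write verbs.
    """
    found: set[str] = set()
    for rule in cluster_role.get("rules", []):
        found.update(rule.get("verbs", []))
    return found - READ_VERBS
```

A check like this fits in CI, so a later change that adds `patch` or `delete` to the bridge's role fails before it is deployed.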

---

## Related docs

- [INTENT.md](../INTENT.md) — goals, scope, success criteria, known gaps
- [README.md](../README.md) — quick start and smoke tests
- [TeleMcpProject.md](TeleMcpProject.md) — project overview and audience