tegwick 129a229e38 Seeded intent and wiki pages

2026-06-22 19:09:24 +02:00

7.6 KiB

TeleMcp Blueprint

Building a Kubernetes telemetry MCP bridge

Source: Original design conversation
Authority: Scope and priorities are governed by INTENT.md. This document explains why each component exists and how the bridge is shaped.

Overview

Blueprint for a telemetry service + MCP bridge that auto-deploys on a Linux-based Kubernetes host (k3s or standard k8s) via Ansible + Helm, and exposes everything an LLM agent needs to bootstrap, monitor, and operate the box.

MCP acts as the standardized "USB-C" between the LLM agent and your telemetry — see the Model Context Protocol spec.

What we capture

Minimum viable (current target)

Kubernetes (control + workloads)

Cluster and node status, taints, conditions, kubelet health
Namespaces, Deployments, StatefulSets, DaemonSets, Pods (phase, restarts, images, age)
Services, Events (warning/error)
Resource usage per pod/node/namespace via Prometheus, cAdvisor, and kube-state-metrics

Logs and alerts

Pod and node logs via Loki/Promtail
Default alert rules: node not ready, API/etcd degradation, CrashLoopBackOff, job failures

Bridge surface

Tools: promql.query, loki.query, k8s.get, k8s.events, inventory.snapshot
Resources: saved PromQL/LogQL queries, cluster inventory snapshots
Prompts: triage and operational playbooks (Triage-Now stubbed; others planned)

Stretch (deferred)

Host (Linux / node)

CPU, memory, disk, inode, filesystem, network, NIC errors (partially covered by node-exporter)
Distro/kernel/version, packages/updates
Systemd unit status for key services (container runtime, kubelet, nginx, etc.)
Certificates (expiry), time sync status (chrony/ntp)
Firewall/ports (nftables/ufw summary)

Additional Kubernetes signals

Ingress, Jobs/CronJobs, HPA/VPA
Throttling and OOM kill detail beyond default metrics

Additional bridge capabilities

systemd.status, tail.pod_logs tools
Alertmanager API for active-alert summaries
Full MCP transport (stdio/SSE) in place of the current HTTP schema approximation

Reference architecture

┌─────────────────────────────────────────────────────────────┐
│  LLM Agent (MCP client)                                     │
└──────────────────────────┬──────────────────────────────────┘
                           │ MCP (resources / tools / prompts)
┌──────────────────────────▼──────────────────────────────────┐
│  mcp-telemetry-bridge  (FastAPI, namespace: mcp)              │
│  Read-only proxy to Prometheus, Loki, Kubernetes API          │
└──────┬─────────────────┬────────────────────┬───────────────┘
       │                 │                    │
┌──────▼──────┐  ┌───────▼───────┐   ┌────────▼────────┐
│ Prometheus  │  │ Loki          │   │ Kubernetes API  │
│ Alertmanager│  │ Promtail      │   │ (in-cluster SA) │
│ Grafana     │  │               │   │                 │
│ KSM         │  │               │   │                 │
│ node-export │  │               │   │                 │
└─────────────┘  └───────────────┘   └─────────────────┘

On the cluster

Component	Status	Role
kube-prometheus-stack	Deployed	Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics, default rules
Loki + Promtail	Deployed	Log aggregation and shipping
OpenTelemetry Collector	Deployed (optional)	OTLP in → Prometheus remote-write / Loki out
metrics-server	Planned	Live resource metrics (`kubectl top` semantics)
Host DaemonSet sidecar	Planned	systemd, cert, and OS-level facts

We use standard, widely deployed open-source pieces (several of them CNCF projects) so agents can reason in PromQL and LogQL and call a single MCP server for answers.

Why these charts?

Chart	Rationale
kube-prometheus-stack	One Helm install for Prometheus Operator, Alertmanager, Grafana, node-exporter, KSM, dashboards, and alert rules
Loki + Promtail	Cheap, scalable log storage without bolting logs into Prometheus
OTel Collector	Vendor-agnostic OTLP ingress; fan-out to existing backends without re-architecting

Ansible copies opinionated values from helm/values/ and runs helm upgrade --install for each chart. See ansible/roles/telemetry_stack/tasks/main.yml.
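
To make the deploy step concrete, the command the role runs per chart can be sketched as below. This is only an illustration, not the contents of the Ansible role: the chart references, the `monitoring` namespace, and the `<release>.yaml` values-file naming are assumptions. Only `helm upgrade --install` and the `helm/values/` directory come from the text above.

```python
from pathlib import PurePosixPath

# Hypothetical release -> (chart reference, namespace) mapping. The real
# names live in ansible/roles/telemetry_stack/tasks/main.yml.
CHARTS = {
    "kube-prometheus-stack": ("prometheus-community/kube-prometheus-stack", "monitoring"),
    "loki": ("grafana/loki", "monitoring"),
}


def helm_command(release, values_dir=PurePosixPath("helm/values")):
    """Build the argv for an idempotent install-or-upgrade of one chart."""
    chart, namespace = CHARTS[release]
    return [
        "helm", "upgrade", "--install", release, chart,
        "--namespace", namespace,
        "--create-namespace",
        # Assumed naming convention: one values file per release.
        "--values", str(values_dir / (release + ".yaml")),
    ]


if __name__ == "__main__":
    for name in CHARTS:
        print(" ".join(helm_command(name)))
```

Because `upgrade --install` installs the release when it is missing and upgrades it otherwise, re-running the role converges on the same state.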

MCP Telemetry Bridge

The bridge (mcp-telemetry-bridge/) is the key piece: a small FastAPI service that exposes the MCP surface (resources, tools, prompts) over plain HTTP. The full MCP protocol transport is still planned.

Implementation status

Capability	Status
FastAPI service with health check	Done
`/mcp/schema` discovery endpoint	Done
`promql.query`	Done
`loki.query`	Done
`k8s.get`	Done
`k8s.events`	Done
`inventory.snapshot`	Done
Saved PromQL/LogQL resources	Done (3 queries)
`Triage-Now` prompt	Stub
`Capacity-Check`, `CrashLoop-Playbook` prompts	Planned
`systemd.status`	Planned (requires DaemonSet sidecar)
`tail.pod_logs`	Planned
Alertmanager API	Planned
Full MCP protocol transport	Planned

Read-only backends

The bridge talks read-only to:

Prometheus HTTP API — instant and range queries
Loki HTTP API — LogQL queries
Kubernetes API — ServiceAccount with RBAC get/list/watch
Alertmanager API — planned for active-alert summaries
Node sidecar HTTP — planned for host-level facts
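
As a rough sketch of what "read-only" means for the two query backends, the bridge only needs to issue GET requests against the Prometheus and Loki HTTP APIs. The base URLs and function names below are assumptions for illustration. The endpoint paths (`/api/v1/query`, `/api/v1/query_range`, `/loki/api/v1/query_range`) are the standard ones for those APIs.

```python
from urllib.parse import urlencode

# Assumed in-cluster service addresses; adjust to the actual release names.
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"
LOKI_URL = "http://loki.monitoring.svc:3100"


def prometheus_instant_url(expr):
    """URL for an instant query (the value of expr right now)."""
    return PROMETHEUS_URL + "/api/v1/query?" + urlencode({"query": expr})


def prometheus_range_url(expr, start, end, step="60s"):
    """URL for a range query; start and end are Unix timestamps in seconds."""
    params = {"query": expr, "start": start, "end": end, "step": step}
    return PROMETHEUS_URL + "/api/v1/query_range?" + urlencode(params)


def loki_range_url(logql, limit=100, start_ns=None, end_ns=None):
    """URL for a LogQL range query; start/end are Unix epochs in nanoseconds."""
    params = {"query": logql, "limit": limit}
    if start_ns is not None:
        params["start"] = start_ns
    if end_ns is not None:
        params["end"] = end_ns
    return LOKI_URL + "/loki/api/v1/query_range?" + urlencode(params)


if __name__ == "__main__":
    print(prometheus_instant_url("up"))
    print(loki_range_url('{app="x"}', limit=10))
```

Building URLs with `urlencode` keeps PromQL and LogQL selectors (braces, quotes, equals signs) safely escaped.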

Tools (target API)

promql.query(expr, range?)
loki.query(logql, limit?, since?)
k8s.get(kind, namespace?, name?)
k8s.events(namespace?, since?)
inventory.snapshot() → JSON
systemd.status(unit)          # planned
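
One way to picture the tool surface is a small registry that maps each dotted tool name to a handler and derives a discovery document from the handler signatures. This is a sketch under assumptions, not the bridge's actual code: the decorator, `call_tool`, `schema`, the default `limit`, and the stubbed return values are all hypothetical, and the handlers here only echo their arguments instead of calling a backend.

```python
import inspect
from typing import Any, Callable, Dict, Optional

TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(name: str):
    """Register a handler under its dotted tool name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register


@tool("promql.query")
def promql_query(expr: str, range: Optional[str] = None) -> dict:
    # Parameter names mirror the target API above; "range" shadows the builtin.
    return {"backend": "prometheus", "expr": expr, "range": range}


@tool("loki.query")
def loki_query(logql: str, limit: int = 100, since: Optional[str] = None) -> dict:
    return {"backend": "loki", "logql": logql, "limit": limit, "since": since}


@tool("k8s.get")
def k8s_get(kind: str, namespace: Optional[str] = None, name: Optional[str] = None) -> dict:
    return {"backend": "kubernetes", "kind": kind, "namespace": namespace, "name": name}


@tool("k8s.events")
def k8s_events(namespace: Optional[str] = None, since: Optional[str] = None) -> dict:
    return {"backend": "kubernetes", "kind": "Event", "namespace": namespace, "since": since}


@tool("inventory.snapshot")
def inventory_snapshot() -> dict:
    return {"backend": "kubernetes", "snapshot": True}


def schema() -> dict:
    """List required and optional parameters per tool, for discovery."""
    empty = inspect.Parameter.empty
    out = {}
    for name, fn in TOOLS.items():
        params = inspect.signature(fn).parameters
        out[name] = {
            "required": [p for p, s in params.items() if s.default is empty],
            "optional": [p for p, s in params.items() if s.default is not empty],
        }
    return out


def call_tool(name: str, **arguments):
    """Dispatch a tool call by name; unknown tools raise KeyError."""
    if name not in TOOLS:
        raise KeyError("unknown tool: " + name)
    return TOOLS[name](**arguments)
```

Planned tools such as systemd.status are left out of the registry on purpose, so a discovery response would only advertise what is callable today.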

Resources

res://dashboards/top-pods-by-cpu.promql    # implemented
res://dashboards/pod-restarts.promql       # implemented
res://dashboards/warn-events.logql         # implemented
res://snapshots/cluster-inventory.json     # planned (dynamic)
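
The resource URIs above could be resolved with a lookup like the following sketch. The query bodies are illustrative guesses at what each saved query might contain, not the queries shipped with the bridge, and `resolve_resource` is a hypothetical helper. The metric names used are the standard cAdvisor and kube-state-metrics ones.

```python
RES_PREFIX = "res://"

# Assumed contents for the three implemented saved queries.
SAVED_QUERIES = {
    "dashboards/top-pods-by-cpu.promql":
        "topk(10, sum by (namespace, pod) "
        "(rate(container_cpu_usage_seconds_total[5m])))",
    "dashboards/pod-restarts.promql":
        "sum by (namespace, pod) "
        "(increase(kube_pod_container_status_restarts_total[1h])) > 0",
    "dashboards/warn-events.logql":
        '{namespace=~".+"} |= "Warning"',
}

# The file suffix decides which tool would run the query.
TOOL_FOR_LANGUAGE = {"promql": "promql.query", "logql": "loki.query"}


def resolve_resource(uri):
    """Map a res:// URI to its saved query and the tool that executes it."""
    if not uri.startswith(RES_PREFIX):
        raise ValueError("not a res:// URI: " + uri)
    key = uri[len(RES_PREFIX):]
    if key not in SAVED_QUERIES:
        raise KeyError("unknown or not yet implemented resource: " + uri)
    language = key.rsplit(".", 1)[-1]
    return {
        "uri": uri,
        "language": language,
        "tool": TOOL_FOR_LANGUAGE[language],
        "query": SAVED_QUERIES[key],
    }


if __name__ == "__main__":
    print(resolve_resource("res://dashboards/pod-restarts.promql"))
```

Keeping saved queries as plain text keyed by URI lets an agent read a query, adapt it, and pass it straight to the matching tool.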

Prompts

Triage-Now           # stub — summarize alerts, top offenders, recent warnings
Capacity-Check       # planned
CrashLoop-Playbook   # planned

Security model

Bridge runs under a dedicated ServiceAccount with a ClusterRole limited to get/list/watch
NetworkPolicy restricts egress to Prometheus (9090) and Loki (3100); an additional egress rule for the Kubernetes API (443) may be needed
The bridge performs no authentication in v1, so any external exposure should sit behind mTLS or OIDC
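
The first two points can be sketched as manifests built in Python and dumped as JSON, which `kubectl apply -f` accepts. Treat this as an assumption-laden outline: the object names, the `app` label, and the exact resource list are hypothetical, chosen to match the kinds listed under "What we capture". The `mcp` namespace and the port numbers come from this document.

```python
import json

READ_ONLY_VERBS = ("get", "list", "watch")


def cluster_role(name="mcp-telemetry-bridge-readonly"):
    """ClusterRole granting only read verbs on the kinds the bridge inspects."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},
        "rules": [
            {   # core API group
                "apiGroups": [""],
                "resources": ["nodes", "namespaces", "pods", "services", "events"],
                "verbs": list(READ_ONLY_VERBS),
            },
            {   # workload controllers
                "apiGroups": ["apps"],
                "resources": ["deployments", "statefulsets", "daemonsets"],
                "verbs": list(READ_ONLY_VERBS),
            },
        ],
    }


def egress_policy(namespace="mcp", allow_k8s_api=True):
    """NetworkPolicy limiting bridge egress to the backend ports."""
    ports = [9090, 3100] + ([443] if allow_k8s_api else [])
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "mcp-bridge-egress", "namespace": namespace},
        "spec": {
            # Assumed pod label; match it to the bridge Deployment.
            "podSelector": {"matchLabels": {"app": "mcp-telemetry-bridge"}},
            "policyTypes": ["Egress"],
            "egress": [
                {"ports": [{"protocol": "TCP", "port": p} for p in ports]}
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(cluster_role(), indent=2))
    print(json.dumps(egress_policy(), indent=2))
```

A policy this strict would also block DNS lookups unless port 53 is allowed, which this sketch leaves out.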

INTENT.md — goals, scope, success criteria, known gaps
README.md — quick start and smoke tests
TeleMcpProject.md — project overview and audience
