Compare commits
2 Commits
b181465564
...
f3ca5b9c3a
| Author | SHA1 | Date |
|---|---|---|
| | f3ca5b9c3a | |
| | 129a229e38 | |
20
.claude/rules/agents.md
Normal file
@@ -0,0 +1,20 @@
## Kaizen Agents

Specialized agent personas available on demand via the state-hub MCP.

**Discover:** `list_kaizen_agents()` — returns all agents with name, description, category
**Load:** `get_kaizen_agent("tdd-workflow")` — returns full instructions; read and follow them

Common agents:

| Agent | Category | When to use |
|-------|----------|-------------|
| `tdd-workflow` | testing | Step-by-step TDD workflow for any feature |
| `code-refactoring` | quality | Code quality analysis and safe refactoring |
| `test-maintenance` | testing | Diagnose and fix failing tests |
| `requirements-engineering` | process | Prevent interface/mock mismatches upfront |
| `keepaTodofile` | process | Maintain TODO.md during work |
| `project-management` | process | Track status, determine next steps |
| `datamodel-optimization` | quality | Optimize dataclasses and data structures |

All 17 agents: call `list_kaizen_agents()` for the full list.
8
.claude/rules/architecture.md
Normal file
@@ -0,0 +1,8 @@
## Architecture

<!-- TODO: Describe the key design decisions and component structure.
Key modules, data flows, external integrations, state machines, etc. -->

## Quick Reference

`~/state-hub/mcp_server/TOOLS.md` — MCP tool reference
50
.claude/rules/credential-routing.md
Normal file
@@ -0,0 +1,50 @@
# Credential and access routing

**Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
for inference. Run this check **before** requesting secrets, API keys, SSH access,
login tokens, or database passwords — in any repo, not only `ops-warden`.

ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
other credential need belongs to another subsystem. **Do not** message
`ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.

### Lookup (do this first)

```bash
warden route find "<describe your need>" --json
warden route show <catalog-id> --json
```

Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).

| Agent runtime | How to orient |
| --- | --- |
| **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=tele-mcp` is for coordination, not secret vending |
| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
| **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |

### Quick routing table

| I need… | Owner | ops-warden executes? |
| --- | --- | --- |
| SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** — `warden sign` |
| API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
| Login / OIDC / MFA | key-cape / Keycloak | No — route only |
| Authorization decision | flex-auth | No — route only |
| activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
| SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |

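The quick routing table is effectively a lookup from need to owner. As a minimal sketch only, the same information can be held as data. The dictionary keys and the `route_owner` helper are hypothetical and not part of the `warden` CLI; the catalog behind `warden route find` remains the source of truth.

```python
# Illustrative mirror of the quick routing table.
# Keys and helper name are assumptions made for this sketch.
ROUTES = {
    "ssh-cert": ("ops-warden", True),
    "api-key": ("OpenBao (railiance-platform)", False),
    "db-password": ("OpenBao (railiance-platform)", False),
    "provider-token": ("OpenBao (railiance-platform)", False),
    "login": ("key-cape / Keycloak", False),
    "authorization": ("flex-auth", False),
    "ssh-tunnel": ("ops-bridge", False),
}


def route_owner(need: str) -> tuple[str, bool]:
    """Return (owner, ops_warden_executes) for a known need."""
    try:
        return ROUTES[need]
    except KeyError:
        # Unknown needs should go through the real catalog lookup.
        raise LookupError(
            f"unknown need {need!r}; run `warden route find` instead"
        ) from None
```

Only the SSH certificate row returns `True` for the second element, which matches the rule that ops-warden executes nothing else.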
### Anti-patterns (do not do these)

- `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
- Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
- Pasting secrets into Git, State Hub, workplans, logs, or chat

### Other capabilities (reuse-surface)

Non-credential capabilities are usually discovered through **reuse-surface** federation
(`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
every repo's agent instructions because it is high-frequency, high-risk, and easy to
get wrong.

**Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`
38
.claude/rules/first-session.md
Normal file
@@ -0,0 +1,38 @@
## First Session Protocol

Triggered when `get_domain_summary("infotech")` shows **no workstreams**.
The project is registered but work has not yet been structured.

**Step 1 — Read, don't write**
- `~/the-custodian/canon/projects/infotech/project_charter_v0.1.md` — purpose, scope
- `~/the-custodian/canon/projects/infotech/roadmap_v0.1.md` — planned phases
- Scan repo root: README, directory structure, existing code or docs

**Step 2 — Survey in-progress work**
Look for TODOs, open branches, half-finished files. Note done vs. started but incomplete.

**Step 3 — Propose workstreams to Bernd**
Propose 1–3 workstreams — each a coherent strand, weeks to months, anchored to a
roadmap phase. **Wait for approval before creating.**

**Step 4 — Create workplan file first, then DB record (ADR-001)**
```
workplans/TELE-WP-NNNN-<slug>.md   ← write this first
```
Then register in the hub:
```
create_workstream(topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a", title="...", owner="...", description="...")
create_task(workstream_id="<id>", title="...", priority="high|medium|low")
```

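Step 4 needs a free `TELE-WP-NNNN` number for the new file. The source does not say how numbers are allocated, so the sketch below assumes the next number is one above the highest existing one. `next_workplan_name` is a hypothetical helper, not a State Hub tool.

```python
import re

# Matches active files (TELE-WP-0001-...) and archived ones with a
# YYMMDD completion-date prefix (260622-TELE-WP-0001-...).
WP_PATTERN = re.compile(r"^(?:\d{6}-)?TELE-WP-(\d{4})-")


def next_workplan_name(existing: list[str], slug: str) -> str:
    """Return the filename for the next workplan.

    Assumption: numbering increments from the highest number found.
    """
    numbers = [
        int(match.group(1))
        for name in existing
        if (match := WP_PATTERN.match(name))
    ]
    return f"TELE-WP-{max(numbers, default=0) + 1:04d}-{slug}.md"
```

Passing the combined listing of `workplans/` and `workplans/archived/` avoids reusing the number of an archived plan.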
**Step 5 — Record the setup**
```
add_progress_event(
    summary="First session: structured infotech into N workstreams, M tasks",
    event_type="milestone",
    topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a",
    detail={"workstreams": [...], "tasks_created": M}
)
```

<!-- Delete or archive this file once past first session -->
8
.claude/rules/repo-boundary.md
Normal file
@@ -0,0 +1,8 @@
## Repo boundary

This repo owns **tele-mcp** only. It does not own:

<!-- TODO: List what belongs in adjacent repos, e.g.:
- SSH key management → railiance-infra/
- State hub code → state-hub/
-->
5
.claude/rules/repo-identity.md
Normal file
@@ -0,0 +1,5 @@
**Purpose:** **Mission control for Kubernetes hosts, exposed to LLM agents through MCP.**

**Domain:** infotech
**Repo slug:** tele-mcp
**Topic ID:** cee7bedf-2b48-46ef-8601-006474f2ad7a
85
.claude/rules/session-protocol.md
Normal file
@@ -0,0 +1,85 @@
## Session Protocol

Dev Hub (State Hub API): http://127.0.0.1:8000
MCP server name in `~/.claude.json`: `dev-hub`

**Step 1 — Orient**

Read the offline-safe brief first — it works without a live hub connection:
```bash
cat .custodian-brief.md
```
Then call the MCP tool for richer cross-domain context when MCP tools are exposed:
```
get_domain_summary("infotech")
```
If MCP tools are unavailable in the current agent session, use the REST API:
```bash
curl -s "http://127.0.0.1:8000/state/summary" | python3 -m json.tool
```
If the hub is offline: `cd ~/state-hub && make api`

**Step 2 — Check inbox**
With MCP tools:
```
get_messages(to_agent="tele-mcp", unread_only=True)
```
Mark read with `mark_message_read(message_id)`. Reply or act on coordination
requests before proceeding.

Without MCP tools:
```bash
curl -s "http://127.0.0.1:8000/messages/?to_agent=tele-mcp&unread_only=true" \
  | python3 -m json.tool
curl -s -X PATCH "http://127.0.0.1:8000/messages/<id>/read" \
  -H "Content-Type: application/json" -d '{}'
```

**Step 3 — Scan workplans**
```bash
ls workplans/
```
For each file with `status: ready`, `active`, or `blocked`, note pending
`wait`/`todo`/`progress` tasks.

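The Step 3 filter can be sketched in a few lines. This is an illustration under stated assumptions, not State Hub code: it assumes each workplan starts with a `---` frontmatter block containing a plain `status:` line, and the helper names `frontmatter_status` and `needs_attention` are hypothetical.

```python
from typing import Optional

# Statuses that Step 3 asks the agent to review.
OPEN_WORKPLAN_STATUSES = {"ready", "active", "blocked"}


def frontmatter_status(text: str) -> Optional[str]:
    """Return the `status:` value from a leading `---` frontmatter block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for line in lines[1:]:
        if line.strip() == "---":  # end of frontmatter
            break
        key, sep, value = line.partition(":")
        if sep and key.strip() == "status":
            return value.strip().strip('"')
    return None


def needs_attention(text: str) -> bool:
    """True when the workplan status is ready, active, or blocked."""
    return frontmatter_status(text) in OPEN_WORKPLAN_STATUSES
```

A caller would read each file under `workplans/` and pass its text to `needs_attention`; a real YAML parser would be more robust than this line-based scan.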
**Step 4 — Present brief**

1. **Active workstreams** for `infotech` — title, task counts, blocking decisions
2. **Pending tasks** from `workplans/` + any `[repo:tele-mcp]` hub tasks
3. **Goal guidance** — if `goal_guidance` in summary:
   - `needs_workplan`: surface as top action — *"Repo goal '{title}' has no workplan yet"*
   - `alignment_warnings`: flag if active work is not aligned with current goal
4. **Suggested next action** — highest-priority open item
5. **SBOM status** — flag if `last_sbom_at` is unset for this repo

If no workstreams: follow First Session Protocol (`first-session.md`).

**During work:** `record_decision()` · `add_progress_event()` · `resolve_decision()`

> State Hub is a *read model*. Bootstrap tools (`create_workstream`, `create_task`)
> are First Session Protocol only. Work structure belongs in repo files (ADR-001).

**Session close:**
With MCP tools:
```
add_progress_event(summary="...", topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a", workstream_id="<uuid>")
```
Without MCP tools:
```bash
curl -s -X POST http://127.0.0.1:8000/progress/ \
  -H "Content-Type: application/json" \
  -d '{"topic_id":"cee7bedf-2b48-46ef-8601-006474f2ad7a","workstream_id":"<uuid>","event_type":"note","summary":"what changed","author":"codex"}'
```
If workplan files were modified, ensure the local copy is up to date first:
```bash
git -C <repo_path> pull --ff-only
cd ~/state-hub && make fix-consistency REPO=tele-mcp
```
For repos where implementation runs on a remote machine (e.g. CoulombCore),
use the combined target which pulls before fixing:
```bash
cd ~/state-hub && make fix-consistency-remote REPO=tele-mcp
```
**C-15** (DB task ahead of file) is normal in multi-machine workflows — writeback
will sync the file to match DB. **C-16** (repo behind remote) blocks all writes
until you pull — intentional to prevent clobbering remote progress.
19
.claude/rules/stack-and-commands.md
Normal file
@@ -0,0 +1,19 @@
## Stack

<!-- TODO: Fill in language, frameworks, and key dependencies -->
- **Language:**
- **Key deps:**

## Dev Commands

```bash
# TODO: Fill in the standard commands for this repo

# Install dependencies

# Run tests

# Lint / type check

# Build / package (if applicable)
```
40
.claude/rules/workplan-convention.md
Normal file
@@ -0,0 +1,40 @@
## Workplan Convention (ADR-001)

File location: `workplans/TELE-WP-NNNN-<slug>.md`
ID prefix: `TELE-WP-`

Work items originate as files in this repo **before** being registered in the hub.

Canonical workplan/workstream frontmatter statuses are:
`proposed`, `ready`, `active`, `blocked`, `backlog`, `finished`, `archived`.
Use `proposed` for a newly drafted plan, `ready` after review against current
repo state, and `finished` when implementation is complete. `stalled` and
`needs_review` are derived health labels, not stored statuses.

Closed workplans may be moved to `workplans/archived/` with a completion-date
prefix: `YYMMDD-TELE-WP-NNNN-<slug>.md`. The frontmatter id remains
unchanged; the prefix is only for quick visual reference.

Small opportunistic tasks discovered during another session use **Ad Hoc Tasks**:
`workplans/ADHOC-YYYY-MM-DD.md`, workstream slug `adhoc-YYYY-MM-DD`, and task ids
`ADHOC-YYYY-MM-DD-T01`, `T02`, etc. Use adhocs only for low-risk work completed
directly. Promote anything requiring analysis, design, approval, dependencies, or
multiple planned phases into a normal workplan.

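The two date-based naming rules above (archive prefix and ad hoc ids) are easy to get subtly wrong, for example by mixing `YYMMDD` with `YYYY-MM-DD`. A small sketch, with hypothetical helper names `archived_name` and `adhoc_task_id`:

```python
from datetime import date


def archived_name(filename: str, completed: date) -> str:
    """Prefix a workplan filename with the completion date as YYMMDD."""
    return f"{completed:%y%m%d}-{filename}"


def adhoc_task_id(day: date, n: int) -> str:
    """Build an ad hoc task id such as ADHOC-2026-06-22-T01."""
    return f"ADHOC-{day.isoformat()}-T{n:02d}"
```

Only the filename gains the prefix; the frontmatter `id` inside the archived file stays as it was.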
Ecosystem todos from other agents arrive as `[repo:tele-mcp]` hub tasks —
visible at session start. Pick one up by creating the workplan file, then registering
the workstream.

Task blocks use this shape:

```task
id: TELE-WP-NNNN-T01
status: wait | todo | progress | done | cancel
priority: high | medium | low
state_hub_task_id: "<uuid>"  # written by fix-consistency — do not edit
```

Status progression is `todo` → `progress` → `done`; use `wait` for waiting or
blocked work and `cancel` for stopped work.

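A task block is a short list of `key: value` lines, so it can be checked mechanically before `fix-consistency` runs. The sketch below is an assumption about how such a check might look; `parse_task_block` is a hypothetical helper, and the real parser in state-hub may behave differently.

```python
TASK_STATUSES = {"wait", "todo", "progress", "done", "cancel"}
PRIORITIES = {"high", "medium", "low"}


def parse_task_block(block: str) -> dict[str, str]:
    """Parse the body of a task block and validate status and priority.

    Simplification: everything after `#` is treated as a comment, so
    values containing `#` are not supported by this sketch.
    """
    fields: dict[str, str] = {}
    for raw in block.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a key: value line: {raw!r}")
        fields[key.strip()] = value.strip().strip('"')
    if fields.get("status") not in TASK_STATUSES:
        raise ValueError(f"invalid status: {fields.get('status')!r}")
    if fields.get("priority") not in PRIORITIES:
        raise ValueError(f"invalid priority: {fields.get('priority')!r}")
    return fields
```

Rejecting unknown statuses early catches labels such as `stalled`, which are derived health labels and never stored values.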
<!-- Ralph Loop rules and HEUREKA sequence: ~/.claude/CLAUDE.md — do not duplicate here -->
27
.custodian-brief.md
Normal file
@@ -0,0 +1,27 @@
<!-- custodian-brief: generated by statehub register; fix-consistency may replace this file -->
# Custodian Brief - tele-mcp

**Project:** tele-mcp
**Domain:** infotech
**State Hub:** http://127.0.0.1:8000
**Topic ID:** `cee7bedf-2b48-46ef-8601-006474f2ad7a`

## Open Workplans

### Bootstrap State Hub integration

Workplan file: `workplans/TELE-WP-0001-statehub-bootstrap.md`

Open tasks:
- T01 - Review generated integration files
- T02 - Verify local developer workflow
- T03 - Seed first real workplan

## Session Start

1. Read `INTENT.md`, `SCOPE.md`, and `AGENTS.md`.
2. Check inbox: `GET /messages/?to_agent=tele-mcp&unread_only=true`.
3. Scan `workplans/`.
4. Update task statuses in workplan files as work progresses.

Last generated: 2026-06-22
219
AGENTS.md
Normal file
@@ -0,0 +1,219 @@
# tele-mcp — Agent Instructions

## Repo Identity

**Purpose:** **Mission control for Kubernetes hosts, exposed to LLM agents through MCP.**

**Domain:** infotech
**Repo slug:** tele-mcp
**Topic ID:** `cee7bedf-2b48-46ef-8601-006474f2ad7a`
**Workplan prefix:** `TELE-WP-`

---

## State Hub Integration

The Custodian State Hub tracks work across all domains. Interact via HTTP REST —
there is no MCP server for Codex agents.

| Context | URL |
|---------|-----|
| Local workstation | `http://127.0.0.1:8000` |
| Remote via tunnel | `http://127.0.0.1:18000` |

### Orient at session start

```bash
# Offline brief — works without hub connection
cat .custodian-brief.md

# Active workstreams for this domain
curl -s "http://127.0.0.1:8000/workstreams/?topic_id=cee7bedf-2b48-46ef-8601-006474f2ad7a&status=active" \
  | python3 -m json.tool

# Check inbox
curl -s "http://127.0.0.1:8000/messages/?to_agent=tele-mcp&unread_only=true" \
  | python3 -m json.tool
```

Mark a message read:
```bash
curl -s -X PATCH "http://127.0.0.1:8000/messages/<id>/read" \
  -H "Content-Type: application/json" -d '{}'
```

### Log progress (required at session close)

```bash
curl -s -X POST http://127.0.0.1:8000/progress/ \
  -H "Content-Type: application/json" \
  -d '{
    "summary": "what was done",
    "event_type": "note",
    "author": "codex",
    "workstream_id": "<uuid>",
    "task_id": "<uuid>"
  }'
```

Omit `workstream_id` / `task_id` when not applicable.

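For agents that prefer Python over `curl`, the progress call can be sketched with the standard library. This is an illustration, not an official client: `build_progress_request` is a hypothetical helper, and the field names are taken from the `curl` example in this section. Optional ids are left out of the payload entirely when they are not given.

```python
import json
from typing import Optional
from urllib import request

HUB_URL = "http://127.0.0.1:8000"  # local workstation URL


def build_progress_request(
    summary: str,
    workstream_id: Optional[str] = None,
    task_id: Optional[str] = None,
    author: str = "codex",
    event_type: str = "note",
) -> request.Request:
    """Build (but do not send) the POST /progress/ request."""
    payload = {"summary": summary, "event_type": event_type, "author": author}
    # Omit the optional ids rather than sending null values.
    if workstream_id:
        payload["workstream_id"] = workstream_id
    if task_id:
        payload["task_id"] = task_id
    return request.Request(
        f"{HUB_URL}/progress/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending is then a single call, `request.urlopen(build_progress_request("what was done"))`, which keeps payload construction testable without a running hub.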
### Update task status

```bash
curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
  -H "Content-Type: application/json" \
  -d '{"status": "progress"}'
# values: wait | todo | progress | done | cancel
```

### Flag a task for human review

```bash
curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
  -H "Content-Type: application/json" \
  -d '{"needs_human": true, "intervention_note": "reason"}'
```

---

## Session Protocol

**Start:**
1. `cat .custodian-brief.md` — domain goal and open workstreams (offline-safe)
2. Check inbox: `GET /messages/?to_agent=tele-mcp&unread_only=true`; mark read
3. Scan workplans: `ls workplans/` — note `status: ready`, `active`, or `blocked` files and open tasks
4. Check human-needed tasks: `GET /tasks/?needs_human=true`

**During work:**
- Update task statuses in workplan files as tasks progress
- Record significant decisions via `POST /decisions/`

**Close:**
1. Update workplan file task statuses to reflect progress
2. Log: `POST /progress/` with a summary of what changed
3. Note for the custodian operator: after workplan file changes, run from
   `~/state-hub`:
   ```bash
   make fix-consistency REPO=tele-mcp
   ```
   This syncs task status from files into the hub DB.

---

## Credential and access routing

**Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
for inference. Run this check **before** requesting secrets, API keys, SSH access,
login tokens, or database passwords — in any repo, not only `ops-warden`.

ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
other credential need belongs to another subsystem. **Do not** message
`ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.

### Lookup (do this first)

```bash
warden route find "<describe your need>" --json
warden route show <catalog-id> --json
```

Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).

| Agent runtime | How to orient |
| --- | --- |
| **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=tele-mcp` is for coordination, not secret vending |
| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
| **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |

### Quick routing table

| I need… | Owner | ops-warden executes? |
| --- | --- | --- |
| SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** — `warden sign` |
| API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
| Login / OIDC / MFA | key-cape / Keycloak | No — route only |
| Authorization decision | flex-auth | No — route only |
| activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
| SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |

### Anti-patterns (do not do these)

- `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
- Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
- Pasting secrets into Git, State Hub, workplans, logs, or chat

### Other capabilities (reuse-surface)

Non-credential capabilities are usually discovered through **reuse-surface** federation
(`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
every repo's agent instructions because it is high-frequency, high-risk, and easy to
get wrong.

**Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`

<!-- REPO-AGENTS-EXTENSIONS -->
<!-- Append repo-specific agent instructions below this marker.
The state-hub template sync preserves content after this line. -->

---

## Workplan Convention (ADR-001)

Work items originate as files in this repo — not in the hub. The hub is a
read/cache/index layer that rebuilds from files.

**File location:** `workplans/TELE-WP-NNNN-<slug>.md`

**Archived location:** finished workplans may move to
`workplans/archived/YYMMDD-TELE-WP-NNNN-<slug>.md`. The `YYMMDD` prefix is
the completion/archive date; the frontmatter `id` does not change.

**Ad Hoc Tasks:** small opportunistic fixes discovered during a session use
`workplans/ADHOC-YYYY-MM-DD.md` with task ids `ADHOC-YYYY-MM-DD-T01`, etc. Use
this only for low-risk work completed directly; create a normal workplan for
anything needing analysis, design, approval, dependencies, or multiple phases.

**Frontmatter:**

```yaml
---
id: TELE-WP-NNNN
type: workplan
title: "..."
domain: infotech
repo: tele-mcp
status: proposed | ready | active | blocked | backlog | finished | archived
owner: codex
topic_slug: ...
created: "YYYY-MM-DD"
updated: "YYYY-MM-DD"
state_hub_workstream_id: "<uuid>"  # written by fix-consistency — do not edit
---
```

Use `proposed` for a new draft, `ready` after review against current repo
state, and `finished` after implementation. `stalled` and `needs_review` are
derived health labels, not frontmatter statuses.

**Task block format** (one per `##` section):

```
## Task Title

` ` `task
id: TELE-WP-NNNN-T01
status: wait | todo | progress | done | cancel
priority: high | medium | low
state_hub_task_id: "<uuid>"  # written by fix-consistency — do not edit
` ` `

Task description text.
```

Status progression: `todo` → `progress` → `done`; use `wait` for waiting/blocked work and `cancel` for stopped work.

To create a new workplan:
1. Write the file following the format above
2. Notify the custodian operator to run `make fix-consistency REPO=tele-mcp`
   (or send a message to the hub agent via `POST /messages/`)
12
CLAUDE.md
Normal file
@@ -0,0 +1,12 @@
# tele-mcp — Claude Code Instructions

@SCOPE.md
@.claude/rules/repo-identity.md
@.claude/rules/session-protocol.md
@.claude/rules/first-session.md
@.claude/rules/workplan-convention.md
@.claude/rules/stack-and-commands.md
@.claude/rules/architecture.md
@.claude/rules/repo-boundary.md
@.claude/rules/credential-routing.md
@.claude/rules/agents.md
171
INTENT.md
Normal file
@@ -0,0 +1,171 @@
# TeleMcp — Project Intent

> **Mission control for Kubernetes hosts, exposed to LLM agents through MCP.**

TeleMcp is a self-contained observability stack that deploys onto a Linux Kubernetes host and surfaces metrics, logs, and cluster state through a single **Model Context Protocol (MCP)** bridge. The goal is to let an autonomous agent — or a human with an agent — **bootstrap, monitor, triage, and operate** a box without bespoke integrations or constant human supervision.

This document anchors what we are building, why, and what is in scope. When in doubt, prefer the simplest path that gives an agent reliable, read-only situational awareness.

---

## Problem

Operating a Kubernetes host means juggling many signals across many systems: node health, workload status, logs, alerts, certificates, systemd units, and more. Humans use Grafana dashboards, `kubectl`, and ad-hoc PromQL/LogQL. LLM agents need the same information, but through a **standardized, safe interface** — not raw shell access.

TeleMcp closes that gap by:

1. **Collecting** telemetry with proven CNCF/Grafana stack components.
2. **Deploying** the stack repeatably via Ansible + Helm.
3. **Bridging** everything to agents through one MCP server with resources, tools, and prompts.

---

## Vision

A single `ansible-playbook` (or equivalent) turns a bare k3s/kubeadm host into a monitored, agent-ready environment. An LLM agent connects to the MCP bridge and can answer questions like:

- *What is unhealthy right now?*
- *Which pods are crash-looping and why?*
- *Is disk or memory pressure building?*
- *What changed in the cluster since yesterday?*

The agent reasons in **PromQL** and **LogQL** — industry-standard query languages — and calls parameterized tools rather than scraping raw APIs itself.

---

## Design Principles

| Principle | What it means |
|-----------|---------------|
| **Read-only by default** | The MCP bridge and its ServiceAccount only `get`/`list`/`watch`. No cluster mutations through this path. |
| **Standard stack** | Prometheus, Loki, kube-state-metrics, node-exporter — not custom collectors unless necessary. |
| **MCP as the interface** | One bridge, one contract. Agents do not talk to Prometheus/Loki/K8s APIs directly. |
| **Deployable in one shot** | Ansible playbook + Helm charts; no manual chart-by-chart assembly. |
| **Least privilege** | RBAC scoped to observation; NetworkPolicy limits egress; consider mTLS/OIDC for external exposure. |
| **Agent-first ergonomics** | Pre-built resources (saved queries), tools (parameterized calls), and prompts (triage playbooks). |

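The "read-only by default" principle is checkable: every RBAC rule bound to the bridge ServiceAccount should list only `get`, `list`, and `watch`. A minimal sketch, assuming rules are available as dictionaries shaped like Kubernetes RBAC `PolicyRule` objects (with a `verbs` list); `mutating_verbs` is a hypothetical helper and not part of the bridge.

```python
READ_ONLY_VERBS = frozenset({"get", "list", "watch"})


def mutating_verbs(rules: list[dict]) -> set[str]:
    """Return every verb in the rules that is not get/list/watch.

    The wildcard "*" is reported too, since it grants write access.
    """
    found: set[str] = set()
    for rule in rules:
        found.update(v for v in rule.get("verbs", []) if v not in READ_ONLY_VERBS)
    return found
```

An empty result means the role stays within observation; anything else is a candidate for a CI failure on the Helm chart.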
---

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                   LLM Agent (MCP client)                    │
└──────────────────────────┬──────────────────────────────────┘
                           │ MCP (resources / tools / prompts)
┌──────────────────────────▼──────────────────────────────────┐
│  mcp-telemetry-bridge (FastAPI, namespace: mcp)             │
│  Read-only proxy to Prometheus, Loki, Kubernetes API        │
└──────┬─────────────────┬────────────────────┬───────────────┘
       │                 │                    │
┌──────▼──────┐  ┌───────▼───────┐   ┌────────▼────────┐
│ Prometheus  │  │ Loki          │   │ Kubernetes API  │
│ Alertmanager│  │ Promtail      │   │ (in-cluster SA) │
│ Grafana     │  │               │   │                 │
│ KSM         │  │               │   │                 │
│ node-export │  │               │   │                 │
└─────────────┘  └───────────────┘   └─────────────────┘
 monitoring namespace  logging namespace
```

**Optional:** OpenTelemetry Collector for OTLP fan-out to Prometheus remote-write and Loki.

**Future:** Host-level DaemonSet sidecar for systemd status, package/cert checks, and other node facts not available through K8s metrics alone.

---

## What We Capture

### Minimum viable (current target)

**Kubernetes**
- Cluster & node status, conditions, taints
- Namespaces, Deployments, StatefulSets, DaemonSets, Pods (phase, restarts, images)
- Services, Events (especially Warning/Error)
- Resource usage via Prometheus/cAdvisor/kube-state-metrics

**Logs & alerts**
- Pod and node logs via Loki/Promtail
- Default alert rules: node not ready, CrashLoopBackOff, API/etcd degradation, job failures

**Bridge surface**
- Tools: `promql.query`, `loki.query`, `k8s.get`, `k8s.events`, `inventory.snapshot`
- Resources: saved PromQL/LogQL queries, cluster inventory snapshots
- Prompts: triage and operational playbooks

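The `promql.query` and `loki.query` tools presumably forward to the upstream HTTP APIs. The bridge code is not shown in this diff, so the following is only a sketch of what such forwarding could look like: the in-cluster service addresses are hypothetical, while the paths are the standard Prometheus instant-query endpoint (`/api/v1/query`) and the Loki range-query endpoint (`/loki/api/v1/query_range`).

```python
from urllib.parse import urlencode

# Hypothetical in-cluster addresses; real values depend on the Helm release names.
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"
LOKI_URL = "http://loki.logging.svc:3100"


def promql_query_url(query: str) -> str:
    """URL for a Prometheus instant query."""
    return f"{PROMETHEUS_URL}/api/v1/query?{urlencode({'query': query})}"


def loki_query_url(query: str, limit: int = 100) -> str:
    """URL for a Loki range query with a result limit."""
    params = urlencode({"query": query, "limit": limit})
    return f"{LOKI_URL}/loki/api/v1/query_range?{params}"
```

Building URLs through `urlencode` keeps PromQL and LogQL selectors such as `{namespace="mcp"}` safely escaped, which matters when the query text comes from an agent.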
### Stretch (explicitly deferred)

- Host OS depth: systemd units, package updates, cert expiry, firewall summary, NTP drift
- `systemd.status`, `tail.pod_logs` tools
- Alertmanager API integration for active-alert summaries
- Full MCP transport (stdio/SSE) vs. current HTTP schema approximation
- Multi-cluster federation
- Write/mutate operations (out of scope unless a separate, gated path is designed)

---

## Repository Layout

| Path | Role |
|------|------|
| `ansible/` | Bootstrap: install Helm, deploy all charts |
| `helm/values/` | Opinionated values for kube-prometheus-stack, Loki, OTel |
| `helm/mcp-telemetry-bridge/` | Bridge chart: Deployment, RBAC, Service, NetworkPolicy |
| `mcp-telemetry-bridge/` | FastAPI application implementing the MCP surface |
| `environments/` | Per-environment overrides and notes |
| `wiki/` | Extended design notes and blueprint |

---

## Current State (as of initial scaffold)

**Done**
- Ansible playbook with `k8s_host` + `telemetry_stack` roles
- Helm values for monitoring, logging, optional OTel collector
- MCP bridge service with core tools and saved-query resources
- Read-only ClusterRole/Binding for the bridge ServiceAccount
- NetworkPolicy skeleton for the bridge
- Health check and `/mcp/schema` discovery endpoint

**Not yet done / known gaps**
- Bridge image is a placeholder (`ghcr.io/example/telemcp-bridge`); needs CI build and publish
- MCP interface is HTTP REST-shaped, not full MCP protocol transport
- Prompts: only `Triage-Now` stub; missing `Capacity-Check`, `CrashLoop-Playbook`
- No Alertmanager integration in the bridge
- No metrics-server chart (useful for `kubectl top` semantics)
- No host-level DaemonSet sidecar for systemd/OS signals
- NetworkPolicy egress may need K8s API (443) allowance
- Wiki and README aligned to INTENT; keep them updated when scope shifts

---

## Success Criteria

We know TeleMcp is working when:

1. `ansible-playbook` brings up monitoring, logging, and bridge namespaces with healthy pods.
2. `curl /mcp/schema` returns resources, tools, and prompts.
3. An MCP-capable agent can query PromQL, run LogQL, list cluster objects, and pull an inventory snapshot **without direct API credentials**.
4. Default alert rules fire on induced failures (node pressure, crash loop) and the agent can triage them via bridge tools.
5. The entire stack redeploys cleanly on a fresh Ubuntu 24.04 + k3s/kubeadm host.

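Criterion 2 can be turned into a smoke test. The response format of `/mcp/schema` is not shown here, so this sketch assumes a JSON object with top-level `resources`, `tools`, and `prompts` entries; `missing_schema_sections` is a hypothetical helper.

```python
import json

REQUIRED_SECTIONS = ("resources", "tools", "prompts")


def missing_schema_sections(body: str) -> list[str]:
    """Return the required sections that are absent or empty.

    Assumption: the schema endpoint returns a JSON object keyed by section.
    """
    doc = json.loads(body)
    return [name for name in REQUIRED_SECTIONS if not doc.get(name)]
```

Treating an empty section as missing is deliberate: with only a `Triage-Now` stub today, a prompts list that silently went empty would otherwise pass.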
---

## Non-Goals

- Replacing Grafana or building a custom metrics database
- Providing arbitrary shell/exec access to the cluster
- Mutating cluster state (deploy, scale, delete) through the bridge
- Supporting non-Linux or non-Kubernetes targets in v1
- Vendor-specific APM (Datadog, New Relic, etc.) — OTel fan-out is the extension point

---

## How to Use This Document

- **Prioritize work** against the "Current State" gaps and "Minimum viable" capture list.
- **Reject scope creep** that does not serve agent observability or repeatable deployment.
- **Update this file** when intent shifts — e.g., adding write paths, new environments, or MCP transport changes.

For operational quick-start, see [README.md](README.md).
For detailed component rationale, see [wiki/TeleMcpBlueprint.md](wiki/TeleMcpBlueprint.md).
90
README.md
@@ -1,55 +1,103 @@
|
||||
# TeleMcp
|
||||
|
||||
Telemetry + MCP bridge that auto-deploys on a Linux-based Kubernetes host via **Ansible + Helm**.
|
||||
It exposes read-only metrics, logs, and k8s object state through an **MCP server** so an LLM agent can bootstrap, monitor, and operate the host.
|
||||
**Mission control for Kubernetes hosts, exposed to LLM agents through MCP.**
|
||||
|
||||
TeleMcp deploys a standard observability stack onto a Linux Kubernetes host via **Ansible + Helm**, then surfaces metrics, logs, and cluster state through a read-only **MCP bridge** so an LLM agent can bootstrap, monitor, triage, and operate the box.
|
||||
|
||||
> For project goals, scope, and design principles, see **[INTENT.md](INTENT.md)**.
|
||||
|
||||
## Components
|
||||
- **kube-prometheus-stack** (Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics)
|
||||
- **Loki + Promtail** (logs)
|
||||
- **OpenTelemetry Collector** (optional fan-out)
|
||||
- **mcp-telemetry-bridge** (FastAPI service exposing MCP resources/tools/prompts)
|
||||
|
||||
| Component | Namespace | Role |
|-----------|-----------|------|
| **kube-prometheus-stack** | `monitoring` | Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics |
| **Loki + Promtail** | `logging` | Log aggregation and shipping |
| **OpenTelemetry Collector** | `observability` | Optional OTLP fan-out to Prometheus and Loki |
| **mcp-telemetry-bridge** | `mcp` | FastAPI service exposing MCP resources, tools, and prompts |
|
||||
|
||||
## Quick Start
|
||||
|
||||
### 0) Prereqs
|
||||
|
||||
- Ubuntu 24.04 host with k8s (k3s or kubeadm) reachable and `kubectl` context configured
|
||||
- Ansible 2.15+ on your control machine
|
||||
- Helm 3 on the host (Ansible role installs if missing)
|
||||
|
||||
### 1) Run Ansible
|
||||
|
||||
```bash
cd ansible
ansible-playbook -i inventories/local.ini playbook.yml
```
|
||||
|
||||
### 2) Smoke tests (from any machine with kubectl context)
|
||||
### 2) Smoke tests
|
||||
|
||||
From any machine with a `kubectl` context:
|
||||
|
||||
```bash
kubectl get pods -n monitoring
kubectl get pods -n logging
kubectl get pods -n mcp
kubectl port-forward -n mcp svc/mcp-telemetry-bridge 8080:80
curl http://localhost:8080/mcp/schema | jq .
curl http://localhost:8080/healthz
```
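The schema smoke test can also be asserted rather than eyeballed. This is a rough sketch, assuming the `/mcp/schema` response is a JSON object whose top-level keys are named `resources`, `tools`, and `prompts`. Those key names are inferred from the prose, and `missing_schema_sections` is a hypothetical helper, not part of the bridge.

```python
import json

# Assumption: these are the top-level keys of the /mcp/schema response.
REQUIRED_SECTIONS = ("resources", "tools", "prompts")


def missing_schema_sections(payload: str) -> list[str]:
    """Return the required sections that are absent from a schema document."""
    schema = json.loads(payload)
    return [name for name in REQUIRED_SECTIONS if name not in schema]
```

Feeding it the body returned by the `curl` call above should yield an empty list if the schema is complete under these assumptions.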
|
||||
|
||||
### 3) Point your LLM Agent
|
||||
Configure your agent's MCP client to the service endpoint (ClusterIP/Ingress).
|
||||
Use tools:
|
||||
- `promql.query`
|
||||
- `loki.query`
|
||||
- `k8s.get`
|
||||
- `k8s.events`
|
||||
- `inventory.snapshot`
|
||||
### 3) Point your LLM agent
|
||||
|
||||
Configure your agent's MCP client to the bridge endpoint (ClusterIP, Ingress, or port-forward).
|
||||
|
||||
**Implemented tools:**
|
||||
|
||||
| Tool | Description |
|------|-------------|
| `promql.query` | Run a PromQL expression against Prometheus |
| `loki.query` | Run a LogQL query against Loki |
| `k8s.get` | Fetch Kubernetes objects (pods, nodes, deployments, etc.) |
| `k8s.events` | List cluster or namespace events |
| `inventory.snapshot` | JSON snapshot of nodes, namespaces, and workloads |
|
||||
|
||||
**Saved resources** (via `/mcp/resource?uri=...`):
|
||||
|
||||
- `res://dashboards/top-pods-by-cpu.promql`
|
||||
- `res://dashboards/pod-restarts.promql`
|
||||
- `res://dashboards/warn-events.logql`
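Because the resource URI itself contains `:` and `/`, it should be percent-encoded when passed as the `uri` query parameter. A small sketch of building such a URL, assuming the bridge is reachable at the port-forwarded address from the smoke tests; `resource_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def resource_url(base_url: str, resource_uri: str) -> str:
    """Build the /mcp/resource URL for a saved resource, encoding the URI."""
    return f"{base_url.rstrip('/')}/mcp/resource?{urlencode({'uri': resource_uri})}"


# Example: the pod-restarts saved query behind a local port-forward.
print(resource_url("http://localhost:8080", "res://dashboards/pod-restarts.promql"))
```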
|
||||
|
||||
> The bridge currently exposes an HTTP schema approximation (`/mcp/schema`, `/tools/...`). Full MCP transport (stdio/SSE) is planned — see [INTENT.md](INTENT.md).
|
||||
|
||||
## Repo layout
|
||||
|
||||
```
|
||||
tele-mcp/
|
||||
ansible/
|
||||
INTENT.md # Project north star — goals, scope, current state
|
||||
ansible/ # Bootstrap playbook and roles
|
||||
helm/
|
||||
mcp-telemetry-bridge/
|
||||
environments/
|
||||
values/ # Chart values for monitoring, logging, OTel
|
||||
mcp-telemetry-bridge/ # Bridge Helm chart
|
||||
mcp-telemetry-bridge/ # FastAPI bridge application
|
||||
environments/ # Per-environment overrides
|
||||
wiki/ # Extended project and design docs
|
||||
```
|
||||
|
||||
## Documentation
|
||||
|
||||
| Document | Purpose |
|----------|---------|
| [INTENT.md](INTENT.md) | Goals, principles, scope, success criteria |
| [wiki/TeleMcpProject.md](wiki/TeleMcpProject.md) | Project overview and audience |
| [wiki/TeleMcpBlueprint.md](wiki/TeleMcpBlueprint.md) | Component rationale and bridge design |
| [environments/dev/README.md](environments/dev/README.md) | Dev environment notes |
|
||||
|
||||
## Security
|
||||
- MCP bridge ServiceAccount is read-only (RBAC get/list/watch)
|
||||
- Optional NetworkPolicy limits egress/ingress
|
||||
- Consider mTLS/OIDC if exposing outside the cluster
|
||||
|
||||
- MCP bridge ServiceAccount is read-only (`get` / `list` / `watch` only)
|
||||
- NetworkPolicy limits bridge egress to Prometheus and Loki
|
||||
- Consider mTLS or OIDC if exposing the bridge outside the cluster
|
||||
|
||||
## Current limitations
|
||||
|
||||
See [INTENT.md — Current State](INTENT.md#current-state-as-of-initial-scaffold) for the full list. Notable gaps:
|
||||
|
||||
- Bridge container image is a placeholder (`ghcr.io/example/telemcp-bridge`)
|
||||
- No Alertmanager integration in the bridge yet
|
||||
- Host-level signals (systemd, certs, firewall) are deferred to a future DaemonSet sidecar
|
||||
32
SCOPE.md
Normal file
@@ -0,0 +1,32 @@
|
||||
# SCOPE
|
||||
|
||||
> This file was generated by `statehub register`. Refine it as the repository
|
||||
> boundaries become clearer.
|
||||
|
||||
## One-liner
|
||||
|
||||
**Mission control for Kubernetes hosts, exposed to LLM agents through MCP.**
|
||||
|
||||
## Core Idea
|
||||
|
||||
tele-mcp exists to provide the capability described in INTENT.md.
|
||||
|
||||
## In Scope
|
||||
|
||||
- Maintain the repository's primary implementation.
|
||||
- Keep docs, tests, and operational metadata current.
|
||||
|
||||
## Out of Scope
|
||||
|
||||
- Own unrelated adjacent systems.
|
||||
- Make irreversible operational decisions without human approval.
|
||||
|
||||
## Current State
|
||||
|
||||
- Status: active; implementation and stability should be verified by the repo agent.
|
||||
|
||||
## Getting Oriented
|
||||
|
||||
- Start with: INTENT.md
|
||||
- Agent instructions: AGENTS.md
|
||||
- Workplans: workplans/
|
||||
183
wiki/TeleMcpBlueprint.md
Normal file
@@ -0,0 +1,183 @@
|
||||
# TeleMcp Blueprint
|
||||
|
||||
*Building a Kubernetes telemetry MCP bridge*
|
||||
|
||||
> **Source:** [Original design conversation](https://chatgpt.com/share/68bdf06d-8c2c-8009-90c5-466f9f531d9a)
|
||||
> **Authority:** Scope and priorities are governed by [INTENT.md](../INTENT.md). This document explains *why* each component exists and *how* the bridge is shaped.
|
||||
|
||||
## Overview
|
||||
|
||||
Blueprint for a telemetry service + MCP bridge that auto-deploys on a Linux-based Kubernetes host (k3s or standard k8s) via Ansible + Helm, and exposes everything an LLM agent needs to bootstrap, monitor, and operate the box.
|
||||
|
||||
MCP acts as the standardized "USB-C" between the LLM agent and your telemetry — see the [Model Context Protocol spec](https://modelcontextprotocol.io).
|
||||
|
||||
---
|
||||
|
||||
## What we capture
|
||||
|
||||
### Minimum viable (current target)
|
||||
|
||||
**Kubernetes (control + workloads)**
|
||||
|
||||
- Cluster and node status, taints, conditions, kubelet health
|
||||
- Namespaces, Deployments, StatefulSets, DaemonSets, Pods (phase, restarts, images, age)
|
||||
- Services, Events (warning/error)
|
||||
- Resource usage per pod/node/namespace via Prometheus, cAdvisor, and kube-state-metrics
|
||||
|
||||
**Logs and alerts**
|
||||
|
||||
- Pod and node logs via Loki/Promtail
|
||||
- Default alert rules: node not ready, API/etcd degradation, CrashLoopBackOff, job failures
|
||||
|
||||
**Bridge surface**
|
||||
|
||||
- Tools: `promql.query`, `loki.query`, `k8s.get`, `k8s.events`, `inventory.snapshot`
|
||||
- Resources: saved PromQL/LogQL queries, cluster inventory snapshots
|
||||
- Prompts: triage and operational playbooks (`Triage-Now` implemented; others planned)
|
||||
|
||||
### Stretch (deferred)
|
||||
|
||||
**Host (Linux / node)**
|
||||
|
||||
- CPU, memory, disk, inode, filesystem, network, NIC errors *(partially covered by node-exporter)*
|
||||
- Distro/kernel/version, packages/updates
|
||||
- Systemd unit status for key services (container runtime, kubelet, nginx, etc.)
|
||||
- Certificates (expiry), time sync status (chrony/ntp)
|
||||
- Firewall/ports (nftables/ufw summary)
|
||||
|
||||
**Additional Kubernetes signals**
|
||||
|
||||
- Ingress, Jobs/CronJobs, HPA/VPA
|
||||
- Throttling and OOM kill detail beyond default metrics
|
||||
|
||||
**Additional bridge capabilities**
|
||||
|
||||
- `systemd.status`, `tail.pod_logs` tools
|
||||
- Alertmanager API for active-alert summaries
|
||||
- Full MCP transport (stdio/SSE) vs. current HTTP schema approximation
|
||||
|
||||
---
|
||||
|
||||
## Reference architecture
|
||||
|
||||
```
┌─────────────────────────────────────────────────────────────┐
│  LLM Agent (MCP client)                                     │
└──────────────────────────┬──────────────────────────────────┘
                           │ MCP (resources / tools / prompts)
┌──────────────────────────▼──────────────────────────────────┐
│  mcp-telemetry-bridge (FastAPI, namespace: mcp)             │
│  Read-only proxy to Prometheus, Loki, Kubernetes API        │
└──────┬─────────────────┬────────────────────┬───────────────┘
       │                 │                    │
┌──────▼──────┐  ┌───────▼───────┐   ┌────────▼────────┐
│ Prometheus  │  │ Loki          │   │ Kubernetes API  │
│ Alertmanager│  │ Promtail      │   │ (in-cluster SA) │
│ Grafana     │  │               │   │                 │
│ KSM         │  │               │   │                 │
│ node-export │  │               │   │                 │
└─────────────┘  └───────────────┘   └─────────────────┘
```
|
||||
|
||||
### On the cluster
|
||||
|
||||
| Component | Status | Role |
|-----------|--------|------|
| [kube-prometheus-stack](https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack) | **Deployed** | Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics, default rules |
| [Loki](https://grafana.com/docs/loki/latest/) + [Promtail](https://grafana.com/docs/loki/latest/send-data/promtail/) | **Deployed** | Log aggregation and shipping |
| [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) | **Deployed (optional)** | OTLP in → Prometheus remote-write / Loki out |
| [metrics-server](https://github.com/kubernetes-sigs/metrics-server) | Planned | Live resource metrics (`kubectl top` semantics) |
| Host DaemonSet sidecar | Planned | systemd, cert, and OS-level facts |
|
||||
|
||||
We use standard CNCF pieces so agents reason in **PromQL** and **LogQL** and call a single MCP server for answers.
|
||||
|
||||
---
|
||||
|
||||
## Why these charts?
|
||||
|
||||
| Chart | Rationale |
|-------|-----------|
| **kube-prometheus-stack** | One Helm install for Prometheus Operator, Alertmanager, Grafana, node-exporter, KSM, dashboards, and alert rules |
| **Loki + Promtail** | Cheap, scalable log storage without bolting logs into Prometheus |
| **OTel Collector** | Vendor-agnostic OTLP ingress; fan-out to existing backends without re-architecting |
|
||||
|
||||
Ansible copies opinionated values from `helm/values/` and runs `helm upgrade --install` for each chart. See `ansible/roles/telemetry_stack/tasks/main.yml`.
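For orientation, the per-chart step amounts to assembling one `helm upgrade --install` invocation per chart. The sketch below only builds the argument list and does not run Helm. The release name, chart reference, and values path in the example are illustrative assumptions, and `--create-namespace` is an assumed convenience; the authoritative version is the Ansible task file named above.

```python
def helm_upgrade_install(release: str, chart: str, namespace: str, values_file: str) -> list[str]:
    """Build the argv for an idempotent Helm install or upgrade of one chart."""
    return [
        "helm", "upgrade", "--install", release, chart,
        "--namespace", namespace,
        "--create-namespace",  # assumption: let Helm create the namespace if absent
        "--values", values_file,
    ]


# Hypothetical release name and values path, for illustration only.
argv = helm_upgrade_install(
    "kube-prometheus-stack",
    "prometheus-community/kube-prometheus-stack",
    "monitoring",
    "helm/values/monitoring.yaml",
)
print(" ".join(argv))
```

Using `upgrade --install` rather than a plain `install` is what lets the same playbook be rerun against an existing cluster.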
|
||||
|
||||
---
|
||||
|
||||
## MCP Telemetry Bridge
|
||||
|
||||
The bridge (`mcp-telemetry-bridge/`) is the key piece — a small FastAPI service that implements the MCP surface (resources, tools, prompts).
|
||||
|
||||
### Implementation status
|
||||
|
||||
| Capability | Status |
|------------|--------|
| FastAPI service with health check | Done |
| `/mcp/schema` discovery endpoint | Done |
| `promql.query` | Done |
| `loki.query` | Done |
| `k8s.get` | Done |
| `k8s.events` | Done |
| `inventory.snapshot` | Done |
| Saved PromQL/LogQL resources | Done (3 queries) |
| `Triage-Now` prompt | Stub |
| `Capacity-Check`, `CrashLoop-Playbook` prompts | Planned |
| `systemd.status` | Planned (requires DaemonSet sidecar) |
| `tail.pod_logs` | Planned |
| Alertmanager API | Planned |
| Full MCP protocol transport | Planned |
|
||||
|
||||
### Read-only backends
|
||||
|
||||
The bridge talks read-only to:
|
||||
|
||||
- **Prometheus HTTP API** — instant and range queries
|
||||
- **Loki HTTP API** — LogQL queries
|
||||
- **Kubernetes API** — ServiceAccount with RBAC `get`/`list`/`watch`
|
||||
- **Alertmanager API** — planned for active-alert summaries
|
||||
- **Node sidecar HTTP** — planned for host-level facts
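To make the first two backends concrete, here is a sketch of the read-only URLs such a proxy would request. The paths `/api/v1/query`, `/api/v1/query_range`, and `/loki/api/v1/query_range` are the documented Prometheus and Loki HTTP endpoints. The helper names and the in-cluster hostnames in the example are assumptions, and the actual bridge code may build its requests differently.

```python
from urllib.parse import urlencode


def prometheus_url(base, expr, start=None, end=None, step=None):
    """Return an instant-query URL, or a range-query URL when start is given."""
    if start is None:
        return f"{base}/api/v1/query?{urlencode({'query': expr})}"
    params = {"query": expr, "start": start, "end": end, "step": step}
    return f"{base}/api/v1/query_range?{urlencode(params)}"


def loki_url(base, logql, limit=100):
    """Return a Loki range-query URL for a LogQL expression."""
    return f"{base}/loki/api/v1/query_range?{urlencode({'query': logql, 'limit': limit})}"


# Hypothetical service hostnames; the ports match the security model (9090, 3100).
print(prometheus_url("http://prometheus:9090", "up"))
print(loki_url("http://loki:3100", '{namespace="mcp"}', limit=50))
```

Both calls are plain GET requests, which is what keeps the bridge read-only toward these backends.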
|
||||
|
||||
### Tools (target API)
|
||||
|
||||
```
promql.query(expr, range?)
loki.query(logql, limit?, since?)
k8s.get(kind, namespace?, name?)
k8s.events(namespace?, since?)
inventory.snapshot() → JSON
systemd.status(unit)  # planned
```
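The signatures above can be read as a table of required and optional arguments per tool, with `?` marking the optional ones. A minimal sketch of validating a call against that table follows. `TOOL_PARAMS` and `validate_call` are hypothetical names, the planned `systemd.status` tool is left out, and the real bridge may validate requests differently (for example through FastAPI models).

```python
# (required, optional) argument names per implemented tool, from the target API.
TOOL_PARAMS = {
    "promql.query": (("expr",), ("range",)),
    "loki.query": (("logql",), ("limit", "since")),
    "k8s.get": (("kind",), ("namespace", "name")),
    "k8s.events": ((), ("namespace", "since")),
    "inventory.snapshot": ((), ()),
}


def validate_call(tool: str, args: dict) -> list[str]:
    """Return a list of problems with a tool call; empty means it looks valid."""
    if tool not in TOOL_PARAMS:
        return [f"unknown tool: {tool}"]
    required, optional = TOOL_PARAMS[tool]
    problems = [f"missing argument: {name}" for name in required if name not in args]
    allowed = set(required) | set(optional)
    problems += [f"unexpected argument: {name}" for name in sorted(args) if name not in allowed]
    return problems
```

Rejecting unknown tools and unexpected arguments up front keeps the agent-facing contract narrow, which fits the read-only intent.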
|
||||
|
||||
### Resources
|
||||
|
||||
```
res://dashboards/top-pods-by-cpu.promql   # implemented
res://dashboards/pod-restarts.promql      # implemented
res://dashboards/warn-events.logql        # implemented
res://snapshots/cluster-inventory.json    # planned (dynamic)
```
|
||||
|
||||
### Prompts
|
||||
|
||||
```
Triage-Now           # stub: summarize alerts, top offenders, recent warnings
Capacity-Check       # planned
CrashLoop-Playbook   # planned
```
|
||||
|
||||
---
|
||||
|
||||
## Security model
|
||||
|
||||
- Bridge runs under a dedicated ServiceAccount with a ClusterRole limited to `get`/`list`/`watch`
|
||||
- NetworkPolicy restricts egress to Prometheus (9090) and Loki (3100); egress to the Kubernetes API (443) may also need to be allowed for the `k8s.*` tools to work
|
||||
- External exposure should sit behind mTLS or OIDC, because the bridge itself performs no authentication in v1
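The read-only claim in the first bullet is easy to verify against a rendered manifest. A small sketch, assuming the ClusterRole has already been parsed into a dict with the standard `rules[].verbs` layout; `write_verbs` is a hypothetical helper and is not shipped with the chart.

```python
READ_ONLY_VERBS = {"get", "list", "watch"}


def write_verbs(cluster_role: dict) -> set[str]:
    """Return every verb in the role that goes beyond get/list/watch."""
    found = set()
    for rule in cluster_role.get("rules", []):
        found.update(verb for verb in rule.get("verbs", []) if verb not in READ_ONLY_VERBS)
    return found
```

An empty set means the role grants nothing beyond reads. Note that a wildcard verb (`*`) is reported as a violation, which is the intended outcome.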
|
||||
|
||||
---
|
||||
|
||||
## Related docs
|
||||
|
||||
- [INTENT.md](../INTENT.md) — goals, scope, success criteria, known gaps
|
||||
- [README.md](../README.md) — quick start and smoke tests
|
||||
- [TeleMcpProject.md](TeleMcpProject.md) — project overview and audience
|
||||
73
wiki/TeleMcpProject.md
Normal file
@@ -0,0 +1,73 @@
|
||||
# TeleMcp Project
|
||||
|
||||
*Telemetry for autonomous control*
|
||||
|
||||
## What is TeleMcp?
|
||||
|
||||
TeleMcp is **mission control for Kubernetes hosts**. It collects health, performance, and alert signals from a Linux k8s cluster and exposes them through a single **Model Context Protocol (MCP)** interface so intelligent assistants can understand what's happening, triage problems, and help keep systems running smoothly — without constant human supervision.
|
||||
|
||||
The project name reflects its two halves:
|
||||
|
||||
- **Tele** — telemetry: metrics, logs, events, and cluster inventory
|
||||
- **MCP** — the standardized bridge between observability backends and LLM agents
|
||||
|
||||
## Who is it for?
|
||||
|
||||
- **Operators** who want repeatable, one-command observability on a k3s or kubeadm host
|
||||
- **LLM agent builders** who need a safe, read-only API for cluster situational awareness
|
||||
- **Developers** running local or edge Kubernetes who want agent-assisted monitoring without wiring up bespoke integrations
|
||||
|
||||
## What problem does it solve?
|
||||
|
||||
Running a Kubernetes host means tracking signals across many systems. Humans reach for Grafana, `kubectl`, and ad-hoc PromQL. Agents need the same information through a **standardized, safe contract** — not raw shell access or scattered API credentials.
|
||||
|
||||
TeleMcp solves this in three steps:
|
||||
|
||||
1. **Collect** — deploy Prometheus, Loki, and supporting exporters via Helm
|
||||
2. **Deploy** — bootstrap everything with a single Ansible playbook
|
||||
3. **Bridge** — expose resources, tools, and prompts through `mcp-telemetry-bridge`
|
||||
|
||||
## What can an agent do today?
|
||||
|
||||
With the current scaffold, an agent connected to the bridge can:
|
||||
|
||||
- Query Prometheus with `promql.query`
|
||||
- Search logs with `loki.query`
|
||||
- Inspect Kubernetes objects with `k8s.get` and `k8s.events`
|
||||
- Pull a cluster inventory snapshot with `inventory.snapshot`
|
||||
- Use pre-built PromQL/LogQL resources for common triage queries
|
||||
|
||||
## What is planned?
|
||||
|
||||
Stretch goals — explicitly deferred in v1 — include host-level signals (systemd status, cert expiry, firewall summary), Alertmanager integration, additional prompts (`Capacity-Check`, `CrashLoop-Playbook`), and full MCP protocol transport. See [INTENT.md](../INTENT.md) for the authoritative scope list.
|
||||
|
||||
## Design principles
|
||||
|
||||
| Principle | Summary |
|-----------|---------|
| Read-only by default | No cluster mutations through the bridge |
| Standard stack | CNCF/Grafana components, not custom collectors |
| MCP as the interface | One bridge, one contract for agents |
| Deployable in one shot | Ansible + Helm, no manual assembly |
| Least privilege | Scoped RBAC and NetworkPolicy |
|
||||
|
||||
## Repository map
|
||||
|
||||
| Path | Contents |
|------|----------|
| [INTENT.md](../INTENT.md) | North star: goals, scope, current state |
| [README.md](../README.md) | Quick start and operational guide |
| [TeleMcpBlueprint.md](TeleMcpBlueprint.md) | Architecture and component rationale |
| `ansible/` | Bootstrap playbook |
| `helm/` | Chart values and bridge chart |
| `mcp-telemetry-bridge/` | FastAPI bridge source |
|
||||
|
||||
## Success criteria
|
||||
|
||||
TeleMcp is working when:
|
||||
|
||||
1. `ansible-playbook` brings up healthy pods in `monitoring`, `logging`, and `mcp` namespaces
|
||||
2. `/mcp/schema` returns resources, tools, and prompts
|
||||
3. An agent can query metrics, logs, and cluster state without direct API credentials
|
||||
4. Default alert rules fire on induced failures and the agent can triage them
|
||||
5. The stack redeploys cleanly on a fresh Ubuntu 24.04 + k3s/kubeadm host
|
||||
54
workplans/TELE-WP-0001-statehub-bootstrap.md
Normal file
@@ -0,0 +1,54 @@
|
||||
---
id: TELE-WP-0001
type: workplan
title: "Bootstrap State Hub integration"
domain: infotech
repo: tele-mcp
status: ready
owner: codex
topic_slug: custodian
created: "2026-06-22"
updated: "2026-06-22"
---
|
||||
|
||||
# Bootstrap State Hub integration
|
||||
|
||||
**Mission control for Kubernetes hosts, exposed to LLM agents through MCP.**
|
||||
|
||||
## Review Generated Integration Files
|
||||
|
||||
```task
id: TELE-WP-0001-T01
status: todo
priority: high
```
|
||||
|
||||
Review `INTENT.md`, `SCOPE.md`, `AGENTS.md`, and `.custodian-brief.md`.
|
||||
Replace generated placeholders with repo-specific facts where needed.
|
||||
|
||||
## Verify Local Developer Workflow
|
||||
|
||||
```task
id: TELE-WP-0001-T02
status: todo
priority: high
```
|
||||
|
||||
Identify the repo's install, test, lint, build, and run commands. Add or refine
|
||||
those commands in the agent instructions so future coding sessions can verify
|
||||
changes confidently.
|
||||
|
||||
## Seed First Real Workplan
|
||||
|
||||
```task
id: TELE-WP-0001-T03
status: todo
priority: medium
```
|
||||
|
||||
Create the first implementation workplan for the repository's most important
|
||||
next change. After workplan file updates, run from `~/state-hub`:
|
||||
|
||||
```bash
make fix-consistency REPO=tele-mcp
```
|
||||