railiance-infra/workplans/RAIL-HO-WP-0002-server-spec-and-test-suite.md
tegwick 703c57d91c chore(rename): railiance-hosts → railiance-infra
Update all operational references to reflect the new repo name per
ADR-003 (OAS S1 Infrastructure Substrate). Historical text in ADRs
and state-hub-inbox files preserved as-is. Gitea remote URL updated
locally (Gitea repo rename is a manual step).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 00:34:18 +01:00


id: RAIL-HO-WP-0002
type: workplan
title: Server Specification and Automated Test Suite
domain: railiance
repo: railiance-infra
status: completed
owner: railiance
topic_slug: railiance
state_hub_workstream_id: 8fed53c2-4c39-4471-8bb9-61f58771fe0c
created: 2026-03-09
updated: 2026-03-09
completed: 2026-03-09

Server Specification and Automated Test Suite

Motivation

make status produces raw shell output that requires manual interpretation. There is no machine-readable specification of what a converged Railiance node should look like, and therefore no way to assert automatically whether a server is in the correct state.

This workplan closes that gap by introducing:

  1. A declarative server specification (spec/server-baseline.yaml) — the single source of truth for the target state of every managed node.
  2. A Goss test suite derived from that spec — YAML assertions that map one-to-one to spec items and produce a structured pass/fail report.
  3. make verify — runs the test suite against all hosts and exits non-zero on failure, suitable for CI.
  4. An ADR that formally defines the boundary between railiance-hosts and railiance-bootstrap.

Concept

Separation of concerns

| Repo | Responsibility |
| --- | --- |
| railiance-hosts | What a managed node should look like (spec), how to get it there (Ansible roles), how to verify it got there (Goss tests), inventory, secrets |
| railiance-bootstrap | Upstream Kubernetes/app-layer provisioning that builds on an already-converged base node; does NOT own security baseline |

The railiance-bootstrap Ansible work (harden.yml, bootstrap.yml) is superseded by roles/base and roles/sops_agent in this repo. Going forward, any security or OS-level configuration belongs here. railiance-bootstrap may consume a node that has already been converged by this repo, but must not reconfigure items owned here.

Test framework: Goss

Goss is a Go binary that evaluates YAML test files against the live node. It was chosen because:

  • Tests and spec map one-to-one (Goss YAML IS the assertion)
  • Single binary, no Python/Ruby runtime on target host
  • Fast (all tests run locally on the node in one process, with no SSH round trip per test)
  • Output can be TAP, JSON, or human-readable
  • Deployable via Ansible in a single task

Directory layout

spec/
  server-baseline.yaml        ← authoritative target-state spec (already created)

goss/
  baseline.yaml               ← Goss assertions (derived from spec)
  vars/
    baseline-vars.yaml        ← parameterised values (ports, users, etc.)

ansible/
  playbooks/
    verify.yaml               ← deploy Goss + run tests + fetch results
  roles/
    goss/                     ← role: install binary, copy tests, run, report

Tasks

T01 — Resolve duplicate converge target and fix SSH check

id: T01
status: done
completed: "2026-03-09"
priority: high
state_hub_task_id: "892f8bb8-beff-463a-b47c-ffd9a672d065"
  • Remove redundant converge: ansible-bootstrap alias (caused Makefile warning)
  • Fix the SSH check: sshd -T requires host keys, so it was replaced with grep -iE '^(PermitRootLogin|PasswordAuthentication)' /etc/ssh/sshd_config

Done when: make status completes without warnings and SSH section returns PermitRootLogin no / PasswordAuthentication no.


T02 — Finalise server baseline spec

id: T02
status: done
completed: "2026-03-09"
priority: high
state_hub_task_id: "293d950e-c0b3-4ae2-ac08-dcbf3fe5b114"

Created spec/server-baseline.yaml covering:

  • Firewall rules (UFW, default deny, allowed ports)
  • SSH daemon settings
  • Required services and packages
  • Admin user constraints
  • Security settings (fail2ban jails, HISTCONTROL)

Done when: spec reviewed and agreed — it becomes the contract that roles and tests must satisfy.
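
To make the list above concrete, a spec covering those areas could look roughly like the sketch below. This is an illustration only, not the contents of the actual file: the top-level keys follow the spec sections named in the T03 mapping table, the port and user values are taken from the Goss example in T03, and the key names inside each section as well as the fail2ban jail and HISTCONTROL values are assumptions.

```yaml
# spec/server-baseline.yaml (illustrative sketch; the real schema may differ)
firewall:
  status: active
  default_incoming: deny            # "default deny"
  rules:
    - { port: 22, proto: tcp, action: allow }
    - { port: 6443, proto: tcp, action: allow }
    - { port: 8472, proto: udp, action: allow }

ssh:
  PermitRootLogin: "no"             # quoted so YAML does not read it as a boolean
  PasswordAuthentication: "no"

services:
  - { name: ufw, enabled: true, running: true }
  - { name: fail2ban, enabled: true, running: true }

packages:
  - ufw
  - fail2ban

users:
  admin:
    groups: [sudo]
    shell: /bin/bash

security:
  fail2ban_jails: [sshd]            # assumption: jail name not stated in this workplan
  histcontrol: ignoreboth           # assumption: value not stated in this workplan
```

Keeping the spec keys close to the names Goss uses (package, service, user) would make the one-to-one mapping in T03 easier to audit by eye.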


T03 — Implement Goss test suite

id: T03
status: done
completed: "2026-03-09"
priority: high
state_hub_task_id: "a34a1626-ff38-4925-a957-d94036fbded6"

Create goss/baseline.yaml with Goss assertions that implement every item in spec/server-baseline.yaml. Each spec section maps to a Goss resource type:

| Spec section | Goss resource |
| --- | --- |
| firewall.status | command: ufw status |
| firewall.rules | command: ufw status, stdout contains |
| ssh.* | file: /etc/ssh/sshd_config, contains |
| services | service: blocks |
| packages | package: blocks |
| users | user: + file: /etc/sudoers.d/admin |

Example structure:

# goss/baseline.yaml
package:
  ufw:
    installed: true
  fail2ban:
    installed: true

service:
  ufw:
    enabled: true
    running: true
  fail2ban:
    enabled: true
    running: true

file:
  /etc/ssh/sshd_config:
    exists: true
    contains:
      - /^PermitRootLogin no/
      - /^PasswordAuthentication no/

command:
  ufw status:
    exit-status: 0
    stdout:
      # Goss matches plain strings literally; wrap a pattern in /.../ to use a regex
      - "Status: active"
      - "/22/tcp.*ALLOW/"
      - "/6443/tcp.*ALLOW/"
      - "/8472/udp.*ALLOW/"

user:
  admin:
    exists: true
    groups:
      - sudo
    shell: /bin/bash

Done when: goss validate passes on a freshly converged node.
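
The directory layout also lists goss/vars/baseline-vars.yaml for parameterised values. One way that file could feed the assertions is through Goss's Go-template support, sketched below. The variable names admin_user and allowed_ports are hypothetical, and the two parts separated by --- stand for two separate files.

```yaml
# --- goss/vars/baseline-vars.yaml (hypothetical keys) ---
admin_user: admin
allowed_ports:
  - 22/tcp
  - 6443/tcp
  - 8472/udp
---
# --- excerpt of goss/baseline.yaml, rendered by Goss as a Go template ---
user:
  {{ .Vars.admin_user }}:
    exists: true

command:
  ufw status:
    exit-status: 0
    stdout:
      - "Status: active"
      {{- range .Vars.allowed_ports }}
      - "/{{ . }}.*ALLOW/"
      {{- end }}
```

With this layout the suite would be run as goss -g goss/baseline.yaml --vars goss/vars/baseline-vars.yaml validate, and goss render with the same flags prints the expanded file for inspection. If templating is used, the Ansible role in T04 would also need to copy the vars file and pass --vars.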


T04 — Ansible role and playbook for Goss

id: T04
status: done
completed: "2026-03-09"
priority: high
state_hub_task_id: "c072c45b-f18d-45be-b747-6d219c3f1439"

Create ansible/roles/goss/ with tasks that:

  1. Download the Goss binary (pinned version) to /usr/local/bin/goss
  2. Copy goss/baseline.yaml to /etc/goss/baseline.yaml
  3. Run goss -g /etc/goss/baseline.yaml validate --format tap
  4. Fetch the TAP output back to the control node as reports/goss-<host>-<date>.tap
  5. Fail the play if any test fails (rc != 0)
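
The five steps above could be sketched as role tasks along the following lines. This is a minimal sketch under stated assumptions, not the role as implemented: the goss_version variable, the amd64 architecture, the relative paths from playbook_dir, and the existence of a reports/ directory at the repo root are all assumptions. It also writes the report from the registered stdout on the control node instead of using the fetch module, which is a substitution for step 4.

```yaml
# ansible/roles/goss/tasks/main.yml (hypothetical sketch)
- name: Install pinned Goss binary
  ansible.builtin.get_url:
    # goss_version is an assumed role variable, e.g. set in defaults/main.yml
    url: "https://github.com/goss-org/goss/releases/download/{{ goss_version }}/goss-linux-amd64"
    dest: /usr/local/bin/goss
    mode: "0755"
    # Consider adding `checksum:` so the pinned binary is also verified

- name: Create test directory
  ansible.builtin.file:
    path: /etc/goss
    state: directory
    mode: "0755"

- name: Copy baseline assertions
  ansible.builtin.copy:
    src: "{{ playbook_dir }}/../../goss/baseline.yaml"
    dest: /etc/goss/baseline.yaml
    mode: "0644"

- name: Run Goss (do not fail yet, so the report can be saved first)
  ansible.builtin.command: goss -g /etc/goss/baseline.yaml validate --format tap
  register: goss_result
  changed_when: false
  failed_when: false

- name: Save TAP report on the control node
  ansible.builtin.copy:
    content: "{{ goss_result.stdout }}\n"
    dest: "{{ playbook_dir }}/../../reports/goss-{{ inventory_hostname }}-{{ ansible_date_time.date }}.tap"
  delegate_to: localhost
  become: false

- name: Fail the play if any assertion failed
  ansible.builtin.fail:
    msg: "Goss reported failures on {{ inventory_hostname }}"
  when: goss_result.rc != 0
```

Deferring the failure to a final task means a report is still written for a failing host, which is usually the run where the report matters most. The date in the file name relies on fact gathering being enabled for the play.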

Create ansible/playbooks/verify.yaml:

- hosts: all
  become: true
  roles:
    - role: goss

Done when: ansible-playbook ansible/playbooks/verify.yaml exits 0 on a clean node, non-zero on a deliberately broken one (test with a manual config change).


T05 — Add make verify target

id: T05
status: done
completed: "2026-03-09"
priority: medium
state_hub_task_id: "a8100b8e-aed0-4bb4-a0dc-a6bdf3938b8d"

Add to Makefile:

verify: ## Run Goss test suite against all hosts — exits non-zero on failure
	cd $(ANS_DIR) && ansible-playbook playbooks/verify.yaml -u $(SSH_USER)

Also update make status to print a summary line ("All assertions passed" / "N assertions FAILED") rather than raw shell output.

Done when: make verify exits 0 on a good node, non-zero on a bad one.
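
Because make verify signals failure through its exit code, wiring it into CI needs little more than a job that calls it. The workflow below is a hypothetical sketch assuming Gitea Actions; the file path, trigger, and runner label are assumptions, and the runner would additionally need Ansible installed plus SSH and SOPS credentials for the managed hosts, which are not shown.

```yaml
# .gitea/workflows/verify.yaml (hypothetical; not part of this workplan)
name: verify
on:
  push:
    branches: [main]

jobs:
  verify:
    runs-on: ubuntu-latest        # assumed runner label
    steps:
      - uses: actions/checkout@v4
      - name: Run Goss suite against all hosts
        run: make verify          # a non-zero exit fails the job
```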


T06 — Write ADR: railiance-hosts vs railiance-bootstrap boundary

id: T06
status: done
completed: "2026-03-09"
priority: medium
state_hub_task_id: "c3d98022-638d-4dcb-bdc7-a9501e1b6cd9"

Create docs/adr/ADR-002-repo-boundary-hosts-vs-bootstrap.md documenting:

  • What railiance-hosts owns (OS baseline, security, spec, tests)
  • What railiance-bootstrap owns (Kubernetes/app layer, consumes a converged node)
  • Decision: any item present in spec/server-baseline.yaml must NOT be managed by railiance-bootstrap
  • Migration note: superseded bootstrap.yml / harden.yml in that repo

Done when: ADR written and merged.


References

  • Goss documentation: https://github.com/goss-org/goss
  • Server spec: spec/server-baseline.yaml
  • Bootstrap workplan: workplans/RAIL-HO-WP-0001-hosteurope-bootstrap.md