# CYA Speech Mode Extension

## Working Name

**cya speech mode**

Possible short names:

- `cya talk`
- `cya voice`
- `cya listen`
- `cya pair`
- `cya phone`

---

## Core Idea

`cya` should support a speech interaction mode where the user can use a phone as the microphone, speech-recognition device, and conversational front-end for a `cya` helper session running on a terminal.

This allows users to speak to `cya` even when the active terminal environment has no usable microphone, no audio stack, no GUI, or no convenient speech-recognition capability.

The terminal remains the operational context.
The phone becomes the voice interface.
The LLM interaction remains bound to the currently activated `cya` session.

---

## User Scenario

A user is working in a terminal on a server, VM, SSH session, minimal Linux installation, or development machine.

They want to ask:

> “What does this error mean?”

> “Generate a command to find all Python files importing requests.”

> “Explain the current git status.”

> “Create a commit message for these changes.”

> “Summarize the files in this folder.”

Instead of typing the request, they activate a `cya` helper session in the terminal and connect the phone app.

The phone enters conversation mode and uses its microphone and speech recognition. The recognized text is sent to the active `cya` session. The terminal-side `cya` helper answers in the terminal, and optionally also sends a spoken or textual response back to the phone.

---

## Basic Interaction Flow

### 1. User activates CYA in terminal

Example:

```bash
cya talk
```

or:

```bash
cya --voice
```

The CLI starts a local or remote helper session and displays a pairing instruction.

Example:

```text
CYA voice session started.

Pair with phone:
Open the CYA mobile app and scan this QR code.

Session:
  repo: /home/bernd/projects/example
  mode: voice bridge
  expires: 5 minutes
```

The terminal may show a QR code containing a short-lived pairing token.

---

### 2. User opens the CYA phone app

The app scans the QR code or enters a pairing code.

The phone is now connected to the active `cya` session.

The app shows:

```text
Connected to:
example repo on tinker-base

Listening...
```

---

### 3. User speaks into the phone

The phone captures audio and performs speech recognition.

The recognized text is sent to the active terminal-side `cya` helper.

Example recognized utterance:

```text
Please explain the current git status and suggest a commit message.
```

---

### 4. Terminal-side CYA processes the request

The terminal-side `cya` helper has access to the local context:

* current working directory;
* selected files;
* git status;
* shell environment;
* project memory;
* user preferences;
* configured LLM backend through `llm-connect`;
* memory through `phase-memory`.

The phone does **not** need direct access to the filesystem.

That is important.

The phone provides speech input.
The CLI session owns the local operational context.

---

### 5. Response is returned

The response appears in the terminal.

Optionally, a concise response is also sent back to the phone.

Example terminal response:

```text
Current git status summary:

- 3 modified files
- 1 new file
- no staged changes

Suggested commit message:
feat(cli): add initial voice session pairing flow

Suggested next command:
git add src/voice-session.ts README.md
git commit -m "feat(cli): add initial voice session pairing flow"
```

The phone may show:

```text
I found three modified files and one new file.
Suggested commit message:
feat(cli): add initial voice session pairing flow
```

Optionally, the phone can read that aloud.

---

## Key Design Principle

The phone should be a **voice bridge**, not the primary authority.

The terminal session remains the source of truth for:

* current working directory;
* filesystem access;
* repository context;
* execution permissions;
* local configuration;
* project memory;
* safety confirmations.


The phone handles:

* microphone input;
* speech recognition;
* maybe text-to-speech;
* session pairing;
* conversational convenience.

This avoids turning the phone app into a remote filesystem client or full IDE.

---

## Conceptual Architecture

```text
+------------------+           +----------------------+           +------------------+
|                  |  speech   |                      |   text    |                  |
|    Phone App     +---------->|  CYA Session Broker  +---------->| CYA CLI Session  |
|                  |   text    |                      |  result   |                  |
+--------+---------+<----------+----------+-----------+<----------+--------+---------+
         |                                |                                |
         |                                |                                |
         v                                v                                v
 Microphone / STT                 Pairing / Routing                  Local Context
 Text-to-Speech                   Session Registry                   Filesystem
 Conversation UI                  Authentication                     Git / Shell
                                                                     llm-connect
                                                                     phase-memory
```

---

## Main Components

### 1. CYA CLI Voice Session

The CLI needs a mode that creates an active voice-addressable session.

Responsibilities:

* create session ID;
* generate short-lived pairing token;
* register itself with a broker;
* expose current terminal context;
* receive transcribed user messages;
* process messages through normal `cya` assistance flow;
* return responses;
* enforce safety and confirmation rules.

Possible command:

```bash
cya talk
```

or:

```bash
cya session start --voice
```

---

### 2. CYA Phone App

The phone app provides the user-facing speech interface.

Responsibilities:

* scan QR code or enter pairing code;
* establish secure connection to active session;
* capture microphone input;
* run speech recognition;
* show transcript before sending, depending on mode;
* send text requests to the active `cya` session;
* display responses;
* optionally read responses aloud;
* allow pause, mute, reconnect, and disconnect.

The app does not need to know the details of the user’s filesystem.

---

### 3. Session Broker

A broker is needed to connect the phone to the CLI session.

This could be implemented in different ways.

#### Option A: Local Network Broker

The CLI starts a small local WebSocket server.

The phone connects directly over the LAN.


Good for:

* home networks;
* office networks;
* local development;
* no cloud dependency.

Challenges:

* NAT/firewall issues;
* phone and terminal must be on same network;
* HTTPS/TLS handling;
* service discovery.

Example:

```bash
cya talk --listen 192.168.1.42:47391
```

QR code contains:

```text
cya://pair?host=192.168.1.42&port=47391&token=...
```

---

#### Option B: Relay Broker

A small relay service connects phone and CLI.

Good for:

* SSH sessions;
* cloud servers;
* NAT traversal;
* mobile networks;
* remote development.

The relay does not need to see filesystem data if messages are end-to-end encrypted or if the relay only routes session traffic.

QR code contains:

```text
cya://pair?relay=https://relay.cya.example&session=...&token=...
```

Challenges:

* requires hosted infrastructure;
* introduces trust and privacy questions;
* should be optional, not mandatory.

---

#### Option C: User-Controlled Broker

Advanced users can self-host the broker.

This fits the philosophy of `cya`.

Possible command:

```bash
cya-broker serve
```

Then:

```bash
cya talk --broker https://my-broker.example
```

This preserves the user-controlled infrastructure principle.

---

## Recommended Implementation Path

A practical implementation path could have four stages.

---

## Stage 1: Text Bridge Prototype

Do not start with speech.

Start with a phone-to-terminal text bridge.

Goal:

* CLI starts a session.
* Phone connects.
* User types text on phone.
* Text appears in the active `cya` session.
* CLI processes request and returns response.

This proves:

* pairing;
* routing;
* session ownership;
* broker model;
* authentication;
* terminal-side context handling.

Example:

```bash
cya talk
```

Phone app:

```text
Connected.
Type your request.
```

This avoids early complexity around speech APIs.

---

## Stage 2: Speech-to-Text on Phone

Add microphone and speech recognition.

The phone converts speech to text locally or through platform speech recognition.


Options:

* iOS speech recognition;
* Android speech recognition;
* browser-based Web Speech API for a PWA;
* local speech model on device where feasible;
* cloud speech-to-text if user permits.

The important architectural point:

> `cya` receives text, not raw audio, unless explicitly configured otherwise.

This keeps the CLI helper simple and avoids requiring audio handling on the terminal machine.

---

## Stage 3: Conversation Mode

Add continuous or semi-continuous voice interaction.

Modes:

### Push-to-Talk Mode

Safest and easiest.

User presses and holds a button, speaks, releases, and sends.

Good default.

### Confirm-Before-Send Mode

The app shows recognized text first.

User taps send.

Useful when commands may be risky.

### Continuous Conversation Mode

The app listens continuously and sends utterances automatically.

Useful but riskier.

Should probably be opt-in.

---

## Stage 4: Bidirectional Voice

Add optional text-to-speech responses.

The phone can speak concise answers aloud.

This should be configurable:

```bash
cya talk --phone-speak concise
cya talk --phone-speak off
cya talk --phone-speak full
```

The terminal should still show the full response.

The phone should preferably receive a summarized voice response to avoid reading long command explanations aloud.

---

## Safety Model

Speech mode needs stricter safety defaults than typed CLI use.

Speech recognition can mishear commands.

Therefore:

### Never execute destructive commands directly from speech

For example, if the user says:

> Delete all generated files.

`cya` should produce a preview and require explicit terminal confirmation.

### Require confirmation for execution

Possible pattern:

```text
Suggested command:
rm -rf build/

This may delete files.
Run it? [y/N]
```

The confirmation should happen in the terminal by default, not only on the phone.

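
The confirmation pattern above can be sketched in TypeScript. This is only a minimal sketch under stated assumptions, not `cya`'s actual safety logic: `RISKY_PATTERNS`, `isRisky`, and `buildConfirmation` are hypothetical names, and the pattern list is deliberately small and incomplete.

```typescript
// Hypothetical sketch: none of these names come from an existing cya API.

// Assumption: a short, incomplete list of patterns that hint at file deletion.
const RISKY_PATTERNS: RegExp[] = [
  /\brm\b/,
  /\brmdir\b/,
  /\bgit\s+clean\b/,
];

interface Confirmation {
  requiresConfirmation: boolean;
  channel: "terminal"; // speech-originated commands are confirmed in the terminal
  prompt: string;
}

function isRisky(command: string): boolean {
  return RISKY_PATTERNS.some((pattern) => pattern.test(command));
}

// Build the terminal prompt for a command suggested in response to speech.
// Every such command requires confirmation; risky ones get a warning line.
function buildConfirmation(command: string): Confirmation {
  const lines = ["Suggested command:", command, ""];
  if (isRisky(command)) {
    lines.push("This may delete files.");
  }
  lines.push("Run it? [y/N]");
  return {
    requiresConfirmation: true,
    channel: "terminal",
    prompt: lines.join("\n"),
  };
}
```

Calling `buildConfirmation("rm -rf build/")` yields the prompt shown in the pattern above, while a command such as `git status` gets the same prompt without the warning line. A pattern list like this can only ever be a first filter, which is why confirmation is required for every command and not only for the ones flagged as risky.
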
### Separate “ask”, “suggest”, and “run”

Useful command modes:

```bash
cya talk --suggest-only
cya talk --allow-run
cya talk --no-exec
```

Default should be:

```bash
cya talk --suggest-only
```

### Show recognized transcript

The phone should always make clear what it heard.

For risky requests, it should require confirmation before sending.

---

## Privacy Model

Speech mode touches sensitive areas:

* spoken input;
* local filesystem context;
* repository contents;
* personal memory;
* command history;
* notes.

So the design should make privacy visible and controllable.

The user should be able to see:

* which phone is connected;
* which terminal session it is connected to;
* what context is being sent to the LLM;
* which LLM backend is used;
* what is stored in memory;
* whether speech recognition is local or cloud-based;
* whether a relay broker is involved.

Example session banner:

```text
CYA voice session active

Phone: Bernd's iPhone
Speech recognition: on-device if available, platform fallback allowed
LLM backend: llm-connect: local-openrouter-default
Memory: phase-memory: user + project preferences
Broker: local websocket
Execution: suggest-only
```

---

## Pairing and Authentication

Pairing should be short-lived and explicit.

Possible pairing methods:

### QR Code

Best default.

```bash
cya talk
```

Terminal displays QR code.

Phone scans it.

### Numeric Code

Fallback for terminals that cannot show QR codes.

```text
Pairing code: 482-119
Expires in 5 minutes.
```

### Known Device Trust

After initial pairing, user may mark a phone as trusted.

Even then, each terminal voice session should be explicitly activated.

Trusted device should not mean always-on access.

---

## Session Ownership

Each voice interaction should be bound to a specific active `cya` session.

That session has:

* session ID;
* current working directory;
* user identity;
* host identity;
* start time;
* allowed capabilities;
* selected LLM backend;
* selected memory scope;
* execution policy.

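
The session properties listed above, together with short-lived pairing, could be modeled roughly as follows. This is a sketch under assumptions, not a settled design: `VoiceSession`, `tryPair`, and `acceptsMessage` are hypothetical names, and time is passed in explicitly as epoch milliseconds so the logic stays easy to test.

```typescript
// Hypothetical sketch: field and function names are illustrative assumptions.

type ExecutionPolicy = "suggest-only" | "allow-run" | "no-exec";

interface VoiceSession {
  sessionId: string;
  cwd: string;
  user: string;
  host: string;
  startedAt: number;         // epoch milliseconds
  capabilities: string[];    // assumption: free-form capability names
  llmBackend: string;
  memoryScope: string;
  executionPolicy: ExecutionPolicy;
  pairingCode: string;
  pairingExpiresAt: number;  // epoch milliseconds
  pairedDeviceId?: string;   // set once a phone has paired
}

// The 5-minute expiry used in the pairing examples.
const PAIRING_TTL_MS = 5 * 60 * 1000;

type PairResult = "paired" | "already-paired" | "expired" | "wrong-code";

// Bind exactly one device to the session while the pairing code is valid.
// A real implementation would likely want a constant-time code comparison.
function tryPair(
  session: VoiceSession,
  code: string,
  deviceId: string,
  now: number,
): PairResult {
  if (session.pairedDeviceId !== undefined) return "already-paired";
  if (now >= session.pairingExpiresAt) return "expired";
  if (code !== session.pairingCode) return "wrong-code";
  session.pairedDeviceId = deviceId;
  return "paired";
}

// Accept a message only if it names this session and comes from the paired device.
function acceptsMessage(
  session: VoiceSession,
  msg: { sessionId: string; deviceId: string },
): boolean {
  return (
    msg.sessionId === session.sessionId &&
    session.pairedDeviceId !== undefined &&
    msg.deviceId === session.pairedDeviceId
  );
}
```

With a check like `acceptsMessage` in front of the assistant pipeline, every incoming transcript is matched against exactly one session and one paired device before it is processed.
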

This prevents accidental cross-talk where the phone sends a request to the wrong terminal or wrong repository.

Useful phone display:

```text
Connected to:
tinker-base
/home/bernd/repos/can-you-assist
Mode: suggest-only
```

---

## Possible CLI Commands

### Start voice session

```bash
cya talk
```

### Start voice session for current repository

```bash
cya talk --repo
```

### Start with local broker

```bash
cya talk --broker local
```

### Start with relay broker

```bash
cya talk --broker relay
```

### Start with strict safety

```bash
cya talk --suggest-only
```

### Start with no memory

```bash
cya talk --no-memory
```

### Start with project memory

```bash
cya talk --memory project
```

### List active sessions

```bash
cya sessions
```

### End current session

```bash
cya session stop
```

---

## Example User Experience

Terminal:

```bash
cd ~/repos/can-you-assist
cya talk
```

Terminal output:

```text
CYA voice session started for:
~/repos/can-you-assist

Mode: suggest-only
Pair your phone: scan QR code or enter code 913-442

Waiting for phone...
```

Phone:

```text
Connected to can-you-assist on tinker-base.
Tap and speak.
```

User speaks:

> What would be a good initial module structure for this project?

Terminal:

```text
You asked:
"What would be a good initial module structure for this project?"

Suggested initial structure:

src/
  cli/
    main.ts
    commands/
  session/
    voice-session.ts
    pairing.ts
  context/
    collector.ts
    git.ts
    filesystem.ts
  llm/
    llm-connect-client.ts
  memory/
    phase-memory-client.ts
  safety/
    classifier.ts
    confirmation.ts

Rationale:
...
```

Phone:

```text
I suggested a module structure with CLI, session, context, LLM, memory, and safety layers.
```

---

## Integration With `llm-connect`

Speech mode should not bypass `llm-connect`.

The flow should be:

```text
Phone speech
  → transcript
  → cya CLI session
  → llm-connect
  → selected LLM backend
  → cya response
  → terminal and/or phone
```

This keeps backend selection consistent with normal `cya` behavior.

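
One way to keep that flow honest is to let the phone message carry only the transcript and have the CLI session attach everything else before the request reaches `llm-connect`. The sketch below assumes this split; `PhoneMessage`, `SessionContext`, `LlmRequest`, and `toLlmRequest` are hypothetical names, and the real `llm-connect` request shape is not specified in this document.

```typescript
// Hypothetical sketch: these types do not describe the real llm-connect API.

interface PhoneMessage {
  sessionId: string;
  transcript: string; // the phone contributes text only
}

interface SessionContext {
  sessionId: string;
  cwd: string;
  llmBackend: string;  // chosen by the CLI session configuration
  memoryScope: string;
}

interface LlmRequest {
  backend: string;
  memoryScope: string;
  prompt: string;
}

// Turn a transcript into a request for the CLI-side LLM client.
// Backend and memory scope always come from the session, never from the phone.
function toLlmRequest(msg: PhoneMessage, ctx: SessionContext): LlmRequest {
  if (msg.sessionId !== ctx.sessionId) {
    throw new Error("message is not addressed to this session");
  }
  const prompt = [
    `Working directory: ${ctx.cwd}`,
    `User (spoken): ${msg.transcript.trim()}`,
  ].join("\n");
  return { backend: ctx.llmBackend, memoryScope: ctx.memoryScope, prompt };
}
```

Because `PhoneMessage` has no backend or memory field at all, a speech request cannot select a different backend than a typed request in the same session would use.
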

The phone app should not directly call the LLM unless specifically configured for a future lightweight mode.

---

## Integration With `phase-memory`

Speech mode should use `phase-memory` for:

* preferred speech interaction style;
* trusted phones;
* known devices;
* user voice mode preferences;
* project-specific command conventions;
* common workflows;
* interaction summaries.

Examples of remembered preferences:

```text
User prefers speech responses to be concise.
User wants destructive commands previewed but never executed automatically.
User usually works with git commit messages in conventional commit style.
User prefers explanations before suggested shell commands.
```

Memory should remain inspectable and editable.

Possible commands:

```bash
cya memory show
cya memory edit
cya memory forget voice.devices
```

---

## Minimal Technical Design

A minimal first implementation could use:

### CLI Side

* `cya talk` starts a WebSocket server or connects to a broker.
* It creates a session token.
* It renders a QR code.
* It listens for incoming text messages.
* It routes messages through the normal `cya` assistant pipeline.

### Phone Side

A Progressive Web App may be enough initially.

The PWA can:

* scan QR codes;
* use browser speech recognition where available;
* send text over WebSocket;
* show responses.

This avoids building native iOS and Android apps immediately.

Later, native apps can provide better:

* speech recognition;
* background handling;
* push-to-talk UX;
* device trust;
* text-to-speech;
* secure local storage.

### Broker Side

For the first version, choose one:

#### Local-only prototype

Simpler, private, no cloud.

Good for proof of concept.

#### Minimal relay prototype

More useful for SSH and remote development.

Better real-world fit.

A good architectural compromise:

* implement a broker interface;
* start with local broker;
* allow relay broker later;
* keep protocol stable.

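
The "broker interface first" compromise could look roughly like this. It is a sketch under assumptions: `Broker`, `MessageHandler`, and `InMemoryBroker` are hypothetical names, and the in-memory class only stands in for a real local WebSocket broker or relay, which would implement the same interface over a network transport.

```typescript
// Hypothetical sketch: an interface the CLI could code against, plus a
// trivial in-process implementation useful for tests and early prototypes.

type MessageHandler = (text: string) => void;

interface Broker {
  // Make a CLI session addressable under its session ID.
  register(sessionId: string, onMessage: MessageHandler): void;
  unregister(sessionId: string): void;
  // Route text to a session; returns false if no such session is registered.
  deliver(sessionId: string, text: string): boolean;
}

class InMemoryBroker implements Broker {
  private readonly handlers = new Map<string, MessageHandler>();

  register(sessionId: string, onMessage: MessageHandler): void {
    if (this.handlers.has(sessionId)) {
      throw new Error(`session already registered: ${sessionId}`);
    }
    this.handlers.set(sessionId, onMessage);
  }

  unregister(sessionId: string): void {
    this.handlers.delete(sessionId);
  }

  deliver(sessionId: string, text: string): boolean {
    const handler = this.handlers.get(sessionId);
    if (handler === undefined) return false;
    handler(text);
    return true;
  }
}
```

If the CLI depends only on `Broker`, swapping the local implementation for a relay later should not require changes to the session or assistant code, which is the point of keeping the protocol stable.
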

---

## Protocol Sketch

Phone sends:

```json
{
  "type": "user_message",
  "session_id": "cya_sess_123",
  "message_id": "msg_001",
  "input_mode": "speech",
  "transcript": "Explain the current git status and suggest a commit message.",
  "confidence": 0.91
}
```

CLI responds:

```json
{
  "type": "assistant_response",
  "session_id": "cya_sess_123",
  "message_id": "msg_001",
  "terminal_response": "...full response...",
  "phone_response": "I found three modified files and one new file. Suggested commit message: ...",
  "requires_confirmation": false
}
```

For risky actions:

```json
{
  "type": "assistant_response",
  "session_id": "cya_sess_123",
  "message_id": "msg_002",
  "terminal_response": "Suggested command: rm -rf build/",
  "phone_response": "This may delete files. Please confirm in the terminal.",
  "requires_confirmation": true,
  "confirmation_channel": "terminal"
}
```

---

## Important Product Distinction

This extension should not be framed as:

> “CYA becomes a mobile assistant.”

It should be framed as:

> “The phone becomes a microphone and conversation surface for the active terminal helper.”

That distinction protects the project scope.

The center remains the console.

---

## Updated INTENT Addition

This could be added to the `INTENT.md` under long-term direction or primary use cases:

```markdown
### Speech-Assisted Console Interaction

`cya` should eventually support a speech interaction mode where a phone or other capable device can act as a microphone, speech-recognition frontend, and lightweight conversation surface for an active `cya` CLI session.

This enables voice interaction even when the terminal environment itself has no microphone, audio stack, graphical interface, or speech-recognition capability.

In this mode, the phone does not become the primary execution environment. Instead, it connects to a currently activated `cya` helper session. The CLI session remains responsible for local filesystem context, repository context, memory scope, LLM backend selection, and safety confirmation.

The phone provides convenient speech input and optional spoken output, while `cya` preserves its console-native, user-controlled architecture.
```

---

## My Recommended Direction

I would treat this as a distinct but natural extension:

```text
cya-core      console assistant
cya-voice     speech bridge protocol and session mode
cya-mobile    phone app or PWA
cya-broker    optional pairing and relay service
```

The first implementation should probably be:

```text
cya talk
+ local WebSocket session
+ QR pairing
+ phone PWA
+ push-to-talk speech recognition
+ suggest-only mode
```

That gives you the core magic without overbuilding the system.

The deeper architectural insight is:

> Speech mode should not make the terminal speak.
> It should let a speech-capable companion device speak *to the terminal’s active assistant context*.