
# CYA Speech Mode Extension

## Working Name

**cya speech mode**

Possible short names:

- `cya talk`
- `cya voice`
- `cya listen`
- `cya pair`
- `cya phone`

---

## Core Idea

`cya` should support a speech interaction mode where the user can use a phone as the microphone, speech-recognition device, and conversational front-end for a `cya` helper session running on a terminal.

This allows users to speak to `cya` even when the active terminal environment has no usable microphone, no audio stack, no GUI, or no convenient speech-recognition capability.

The terminal remains the operational context.

The phone becomes the voice interface.

The LLM interaction remains bound to the currently activated `cya` session.

---

## User Scenario

A user is working in a terminal on a server, VM, SSH session, minimal Linux installation, or development machine.

They want to ask:

> “What does this error mean?”
> “Generate a command to find all Python files importing requests.”
> “Explain the current git status.”
> “Create a commit message for these changes.”
> “Summarize the files in this folder.”

Instead of typing the request, they activate a `cya` helper session in the terminal and connect the phone app.

The phone enters conversation mode and uses its microphone and speech recognition.

The recognized text is sent to the active `cya` session.

The terminal-side `cya` helper answers in the terminal, and optionally also sends a spoken or textual response back to the phone.

---

## Basic Interaction Flow

### 1. User activates CYA in terminal

Example:

```bash
cya talk
```

or:

```bash
cya --voice
```

The CLI starts a local or remote helper session and displays a pairing instruction.

Example:

```text
CYA voice session started.

Pair with phone:
  Open the CYA mobile app and scan this QR code.

Session:
  repo: /home/bernd/projects/example
  mode: voice bridge
  expires: 5 minutes
```

The terminal may show a QR code containing a short-lived pairing token.

---

### 2. User opens the CYA phone app

The app scans the QR code or enters a pairing code.

The phone is now connected to the active `cya` session.

The app shows:

```text
Connected to:
example repo on tinker-base

Listening...
```

---

### 3. User speaks into the phone

The phone captures audio and performs speech recognition.

The recognized text is sent to the active terminal-side `cya` helper.

Example recognized utterance:

```text
Please explain the current git status and suggest a commit message.
```

---

### 4. Terminal-side CYA processes the request

The terminal-side `cya` helper has access to the local context:

* current working directory;
* selected files;
* git status;
* shell environment;
* project memory;
* user preferences;
* configured LLM backend through `llm-connect`;
* memory through `phase-memory`.

The phone does **not** need direct access to the filesystem, and that separation is deliberate.

The phone provides speech input.

The CLI session owns the local operational context.

---

### 5. Response is returned

The response appears in the terminal.

Optionally, a concise response is also sent back to the phone.

Example terminal response:

```text
Current git status summary:

- 3 modified files
- 1 new file
- no staged changes

Suggested commit message:

feat(cli): add initial voice session pairing flow

Suggested next command:

git add src/voice-session.ts README.md
git commit -m "feat(cli): add initial voice session pairing flow"
```

The phone may show:

```text
I found three modified files and one new file. Suggested commit message:
feat(cli): add initial voice session pairing flow
```

Optionally, the phone can read that aloud.

---

## Key Design Principle

The phone should be a **voice bridge**, not the primary authority.

The terminal session remains the source of truth for:

* current working directory;
* filesystem access;
* repository context;
* execution permissions;
* local configuration;
* project memory;
* safety confirmations.

The phone handles:

* microphone input;
* speech recognition;
* optional text-to-speech;
* session pairing;
* conversational convenience.

This avoids turning the phone app into a remote filesystem client or full IDE.

---

## Conceptual Architecture

```text
+------------------+          +----------------------+          +------------------+
|                  |  speech  |                      |   text   |                  |
|    Phone App     +--------->|  CYA Session Broker  +--------->| CYA CLI Session  |
|                  |<---------+                      |<---------+                  |
|                  |   text   |                      |  result  |                  |
+--------+---------+          +----------+-----------+          +--------+---------+
         |                               |                               |
         v                               v                               v
  Microphone / STT               Pairing / Routing              Local Context
  Text-to-Speech                 Session Registry               Filesystem
  Conversation UI                Authentication                 Git / Shell
                                                                llm-connect
                                                                phase-memory
```

---

## Main Components

### 1. CYA CLI Voice Session

The CLI needs a mode that creates an active voice-addressable session.

Responsibilities:

* create session ID;
* generate short-lived pairing token;
* register itself with a broker;
* expose current terminal context;
* receive transcribed user messages;
* process messages through normal `cya` assistance flow;
* return responses;
* enforce safety and confirmation rules.

Possible command:

```bash
cya talk
```

or:

```bash
cya session start --voice
```

---

### 2. CYA Phone App

The phone app provides the user-facing speech interface.

Responsibilities:

* scan QR code or enter pairing code;
* establish secure connection to active session;
* capture microphone input;
* run speech recognition;
* show transcript before sending, depending on mode;
* send text requests to the active `cya` session;
* display responses;
* optionally read responses aloud;
* allow pause, mute, reconnect, and disconnect.

The app does not need to know the details of the user’s filesystem.

---

### 3. Session Broker

A broker is needed to connect the phone to the CLI session.

This could be implemented in different ways.

#### Option A: Local Network Broker

The CLI starts a small local WebSocket server.

The phone connects directly over the LAN.

Good for:

* home networks;
* office networks;
* local development;
* no cloud dependency.

Challenges:

* NAT/firewall issues;
* phone and terminal must be on same network;
* HTTPS/TLS handling;
* service discovery.

Example:

```bash
cya talk --listen 192.168.1.42:47391
```

QR code contains:

```text
cya://pair?host=192.168.1.42&port=47391&token=...
```

---

#### Option B: Relay Broker

A small relay service connects phone and CLI.

Good for:

* SSH sessions;
* cloud servers;
* NAT traversal;
* mobile networks;
* remote development.

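The relay does not need to see filesystem data or transcripts if messages are end-to-end encrypted between phone and CLI. A relay that only routes session traffic without end-to-end encryption can still read that traffic, so it has to be trusted.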

QR code contains:

```text
cya://pair?relay=https://relay.cya.example&session=...&token=...
```

Challenges:

* requires hosted infrastructure;
* introduces trust and privacy questions;
* should be optional, not mandatory.
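The two QR payloads shown for Option A and Option B could be parsed on the phone into one typed value. This is a minimal sketch, assuming the `cya://pair?...` forms above are the only two variants. The `PairingTarget` type, the `parsePairingUri` name, and the HTTPS-only rule for relays are illustrative assumptions, not part of any existing `cya` code.

```typescript
// Hypothetical shapes for the two pairing URI forms shown in this section.
type PairingTarget =
  | { kind: "local"; host: string; port: number; token: string }
  | { kind: "relay"; relay: string; session: string; token: string };

// Parse a cya://pair?... URI. Returns null for anything malformed.
function parsePairingUri(uri: string): PairingTarget | null {
  let url: URL;
  try {
    url = new URL(uri);
  } catch {
    return null;
  }
  if (url.protocol !== "cya:" || url.hostname !== "pair") return null;

  const query = url.searchParams;
  const token = query.get("token");
  if (!token) return null;

  // Relay form: relay=<https url>&session=<id>&token=<token>
  const relay = query.get("relay");
  const session = query.get("session");
  if (relay && session) {
    let relayUrl: URL;
    try {
      relayUrl = new URL(relay);
    } catch {
      return null;
    }
    // Assumption: only TLS-protected relays are accepted.
    if (relayUrl.protocol !== "https:") return null;
    return { kind: "relay", relay: relayUrl.origin, session, token };
  }

  // Local form: host=<address>&port=<port>&token=<token>
  const host = query.get("host");
  const port = Number(query.get("port"));
  if (host && Number.isInteger(port) && port >= 1 && port <= 65535) {
    return { kind: "local", host, port, token };
  }
  return null;
}
```

Returning a discriminated union lets the rest of the app switch on `kind` without re-checking which fields are present.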

---

#### Option C: User-Controlled Broker

Advanced users can self-host the broker.

This fits the philosophy of `cya`.

Possible command:

```bash
cya-broker serve
```

Then:

```bash
cya talk --broker https://my-broker.example
```

This preserves the user-controlled infrastructure principle.

---

## Recommended Implementation Path

A practical implementation path could have four stages.

---

## Stage 1: Text Bridge Prototype

Do not start with speech.

Start with a phone-to-terminal text bridge.

Goal:

* CLI starts a session.
* Phone connects.
* User types text on phone.
* Text appears in the active `cya` session.
* CLI processes request and returns response.

This proves:

* pairing;
* routing;
* session ownership;
* broker model;
* authentication;
* terminal-side context handling.

Example:

```bash
cya talk
```

Phone app:

```text
Connected. Type your request.
```

This avoids early complexity around speech APIs.

---

## Stage 2: Speech-to-Text on Phone

Add microphone and speech recognition.

The phone converts speech to text locally or through platform speech recognition.

Options:

* iOS speech recognition;
* Android speech recognition;
* browser-based Web Speech API for a PWA;
* local speech model on device where feasible;
* cloud speech-to-text if user permits.

The important architectural point:

> `cya` receives text, not raw audio, unless explicitly configured otherwise.

This keeps the CLI helper simple and avoids requiring audio handling on the terminal machine.

---

## Stage 3: Conversation Mode

Add continuous or semi-continuous voice interaction.

Modes:

### Push-to-Talk Mode

Safest and easiest.

User presses and holds a button, speaks, releases, and sends.

Good default.

### Confirm-Before-Send Mode

The app shows recognized text first.

User taps send.

Useful when commands may be risky.

### Continuous Conversation Mode

The app listens continuously and sends utterances automatically.

Useful but riskier.

Should probably be opt-in.
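The three modes differ mainly in when a recognized utterance may leave the phone. The sketch below shows one way to express that as a single decision function. The `decideSend` name, the `0.8` confidence threshold, and the rule that low-confidence transcripts always go back to the user are assumptions made for illustration.

```typescript
type ConversationMode = "push-to-talk" | "confirm-before-send" | "continuous";

interface Utterance {
  transcript: string;
  confidence: number; // recognizer confidence in the range 0..1
}

type SendDecision = "send" | "ask-user" | "discard";

// Decide what the phone does with a recognized utterance.
// The default threshold is a placeholder, not a tuned value.
function decideSend(
  mode: ConversationMode,
  utterance: Utterance,
  minConfidence = 0.8,
): SendDecision {
  // Silence or whitespace-only recognition results are dropped.
  if (utterance.transcript.trim().length === 0) return "discard";
  // Confirm-before-send always shows the transcript first.
  if (mode === "confirm-before-send") return "ask-user";
  // Assumption: a low-confidence transcript is never sent unreviewed.
  if (utterance.confidence < minConfidence) return "ask-user";
  return "send";
}
```

With this shape, making continuous mode stricter later only requires raising its threshold, not changing callers.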

---

## Stage 4: Bidirectional Voice

Add optional text-to-speech responses.

The phone can speak concise answers aloud.

This should be configurable:

```bash
cya talk --phone-speak concise
cya talk --phone-speak off
cya talk --phone-speak full
```

The terminal should still show the full response.

The phone should preferably receive a summarized voice response to avoid reading long command explanations aloud.
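A small selection function can capture the `--phone-speak` behaviour described above. This is only a sketch: the `AssistantReply` shape, the `textToSpeak` name, and the first-line fallback with a 200-character cap are hypothetical choices, not defined `cya` behaviour.

```typescript
type PhoneSpeak = "off" | "concise" | "full";

interface AssistantReply {
  terminalResponse: string; // full text, always shown in the terminal
  phoneResponse?: string;   // optional short summary prepared for the phone
}

// Return the text the phone should read aloud, or null to stay silent.
function textToSpeak(
  mode: PhoneSpeak,
  reply: AssistantReply,
  maxChars = 200,
): string | null {
  if (mode === "off") return null;
  if (mode === "full") return reply.terminalResponse;

  // concise: prefer the prepared summary.
  const summary = reply.phoneResponse?.trim();
  if (summary) return summary;

  // Assumed fallback: first non-empty line of the full response, truncated.
  const firstLine = reply.terminalResponse
    .split("\n")
    .map((line) => line.trim())
    .find((line) => line.length > 0);
  if (!firstLine) return null;
  return firstLine.length > maxChars
    ? firstLine.slice(0, maxChars).trimEnd() + "..."
    : firstLine;
}
```

The terminal output is never affected by this function, which matches the rule that the terminal still shows the full response.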

---

## Safety Model

Speech mode needs stricter safety defaults than typed CLI use.

Speech recognition can mishear commands.

Therefore:

### Never execute destructive commands directly from speech

For example, if the user says:

> Delete all generated files.

`cya` should produce a preview and require explicit terminal confirmation.

### Require confirmation for execution

Possible pattern:

```text
Suggested command:

rm -rf build/

This may delete files.
Run it? [y/N]
```

The confirmation should happen in the terminal by default, not only on the phone.
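The confirmation pattern above could be driven by a simple command classifier on the terminal side. The following is a rough sketch with a deliberately small, incomplete pattern list. The names `classifyCommand`, `confirmationPrompt`, and `isConfirmed` are illustrative, and a real classifier would need far more than regular expressions.

```typescript
type Risk = "safe" | "destructive";

// Illustrative and incomplete: patterns that should always trigger a preview.
const DESTRUCTIVE_PATTERNS: RegExp[] = [
  /\brm\b/,                   // file removal, e.g. rm -rf build/
  /\bgit\s+reset\s+--hard\b/, // discards local changes
  /\bgit\s+clean\b/,          // deletes untracked files
  /\bmkfs(\.\w+)?\b/,         // formats a filesystem
  /\bdd\b.*\bof=/,            // raw writes to a file or device
];

function classifyCommand(command: string): Risk {
  return DESTRUCTIVE_PATTERNS.some((pattern) => pattern.test(command))
    ? "destructive"
    : "safe";
}

// Build the terminal prompt for a risky command, or null if none is needed.
function confirmationPrompt(command: string): string | null {
  if (classifyCommand(command) === "safe") return null;
  return `Suggested command:\n\n${command}\n\nThis may delete files.\nRun it? [y/N]`;
}

// Only an explicit "y" or "yes" counts; empty input keeps the default "No".
function isConfirmed(answer: string): boolean {
  return /^y(es)?$/i.test(answer.trim());
}
```

The patterns err on the side of false positives (for example, any command containing the word `rm`), which is the safer failure mode when the input came from speech recognition.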

### Separate “ask”, “suggest”, and “run”

Useful command modes:

```bash
cya talk --suggest-only
cya talk --allow-run
cya talk --no-exec
```

Default should be:

```bash
cya talk --suggest-only
```
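One way the CLI might resolve these flags is sketched below. The only semantics it assumes are the ones stated here: `suggest-only` is the default, and at most one mode may be chosen. The `resolveExecutionPolicy` and `mayExecute` names are hypothetical, and the sketch takes no position on how `--no-exec` and `--suggest-only` differ beyond neither permitting execution.

```typescript
type ExecutionPolicy = "suggest-only" | "allow-run" | "no-exec";

const POLICY_FLAGS = new Map<string, ExecutionPolicy>([
  ["--suggest-only", "suggest-only"],
  ["--allow-run", "allow-run"],
  ["--no-exec", "no-exec"],
]);

// Pick the execution policy from CLI arguments; unrelated arguments are ignored.
function resolveExecutionPolicy(args: string[]): ExecutionPolicy {
  const chosen = new Set<ExecutionPolicy>();
  for (const arg of args) {
    const policy = POLICY_FLAGS.get(arg);
    if (policy !== undefined) chosen.add(policy);
  }
  if (chosen.size > 1) {
    const names = Array.from(chosen).join(", ");
    throw new Error(`Conflicting execution modes: ${names}`);
  }
  const [only] = Array.from(chosen);
  // No flag given: fall back to the documented default.
  return only ?? "suggest-only";
}

// Assumption: only allow-run ever permits running a command,
// and even then the terminal confirmation still applies.
function mayExecute(policy: ExecutionPolicy): boolean {
  return policy === "allow-run";
}
```

Rejecting conflicting flags outright, rather than letting the last one win, avoids a voice session silently running in a more permissive mode than the user intended.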

### Show recognized transcript

The phone should always make clear what it heard.

For risky requests, it should require confirmation before sending.

---

## Privacy Model

Speech mode touches sensitive areas:

* spoken input;
* local filesystem context;
* repository contents;
* personal memory;
* command history;
* notes.

So the design should make privacy visible and controllable.

The user should be able to see:

* which phone is connected;
* which terminal session it is connected to;
* what context is being sent to the LLM;
* which LLM backend is used;
* what is stored in memory;
* whether speech recognition is local or cloud-based;
* whether a relay broker is involved.

Example session banner:

```text
CYA voice session active

Phone:
  Bernd's iPhone

Speech recognition:
  on-device if available, platform fallback allowed

LLM backend:
  llm-connect: local-openrouter-default

Memory:
  phase-memory: user + project preferences

Broker:
  local websocket

Execution:
  suggest-only
```
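To keep the banner honest, it could be rendered from the same session state that the CLI actually uses, rather than from separate display strings. A minimal sketch, assuming a hypothetical `VoiceSessionInfo` record whose fields mirror the banner sections above:

```typescript
// Hypothetical record of the privacy-relevant session settings.
interface VoiceSessionInfo {
  phone: string;
  speechRecognition: string;
  llmBackend: string;
  memory: string;
  broker: string;
  execution: string;
}

// Render the banner in the layout shown above: a label line,
// an indented value line, and a blank line between sections.
function renderBanner(info: VoiceSessionInfo): string {
  const sections: Array<[string, string]> = [
    ["Phone", info.phone],
    ["Speech recognition", info.speechRecognition],
    ["LLM backend", info.llmBackend],
    ["Memory", info.memory],
    ["Broker", info.broker],
    ["Execution", info.execution],
  ];
  const body = sections
    .map(([label, value]) => `${label}:\n  ${value}`)
    .join("\n\n");
  return `CYA voice session active\n\n${body}`;
}
```

Because every field of the record is required, adding a new privacy-relevant setting forces the banner code to be updated along with it.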

---

## Pairing and Authentication

Pairing should be short-lived and explicit.

Possible pairing methods:

### QR Code

Best default.

```bash
cya talk
```

Terminal displays QR code.

Phone scans it.

### Numeric Code

Fallback for terminals that cannot show QR codes.

```text
Pairing code: 482-119
Expires in 5 minutes.
```
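Generating and checking such a code might look like the sketch below. The random source is passed in as a parameter so the example stays self-contained. A real implementation should supply a cryptographically secure generator, and the `createPairingCode` and `redeem` names are assumptions made for illustration.

```typescript
interface PairingCode {
  code: string;      // formatted as NNN-NNN
  expiresAt: number; // milliseconds since the Unix epoch
}

// Must return a uniformly distributed integer in [0, maxExclusive).
// In production this should be backed by a secure random source.
type RandomInt = (maxExclusive: number) => number;

function createPairingCode(
  randomInt: RandomInt,
  now: number,
  ttlMs = 5 * 60 * 1000, // matches the 5-minute expiry shown above
): PairingCode {
  const digits = String(randomInt(1000000)).padStart(6, "0");
  return {
    code: `${digits.slice(0, 3)}-${digits.slice(3)}`,
    expiresAt: now + ttlMs,
  };
}

// Check a code typed on the phone; separators and spaces are ignored.
function redeem(pairing: PairingCode, entered: string, now: number): boolean {
  if (now >= pairing.expiresAt) return false;
  const digitsOnly = (text: string) => text.replace(/\D/g, "");
  return digitsOnly(entered) === digitsOnly(pairing.code);
}
```

A six-digit code is short enough to guess by brute force, so a real pairing flow would also want to limit failed attempts and make each code single-use. Neither is shown here.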

### Known Device Trust

After initial pairing, user may mark a phone as trusted.

Even then, each terminal voice session should be explicitly activated.

Trusted device should not mean always-on access.

---

## Session Ownership

Each voice interaction should be bound to a specific active `cya` session.

That session has:

* session ID;
* current working directory;
* user identity;
* host identity;
* start time;
* allowed capabilities;
* selected LLM backend;
* selected memory scope;
* execution policy.

This prevents accidental cross-talk where the phone sends a request to the wrong terminal or wrong repository.
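The binding check itself can be small. The sketch below assumes a hypothetical `device_id` field on incoming messages (it is not part of the protocol sketched elsewhere in this document) and a trimmed-down `VoiceSession` record containing only the fields needed for routing.

```typescript
// Subset of the session properties listed above, plus pairing state.
interface VoiceSession {
  sessionId: string;
  host: string;
  cwd: string;
  active: boolean;
  pairedDeviceId: string | null; // set once a phone has paired
}

interface IncomingMessage {
  session_id: string;
  device_id: string; // hypothetical field identifying the sending phone
}

type Routing = { accepted: true } | { accepted: false; reason: string };

// Accept a message only if it targets this exact, active, paired session.
function routeToSession(session: VoiceSession, msg: IncomingMessage): Routing {
  if (!session.active) {
    return { accepted: false, reason: "session is not active" };
  }
  if (msg.session_id !== session.sessionId) {
    return { accepted: false, reason: "message addressed to a different session" };
  }
  if (session.pairedDeviceId === null || msg.device_id !== session.pairedDeviceId) {
    return { accepted: false, reason: "device is not paired with this session" };
  }
  return { accepted: true };
}
```

Returning a reason string instead of a bare boolean makes it possible to tell the user on the phone why a request was dropped.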

Useful phone display:

```text
Connected to:
tinker-base
/home/bernd/repos/can-you-assist

Mode:
suggest-only
```

---

## Possible CLI Commands

### Start voice session

```bash
cya talk
```

### Start voice session for current repository

```bash
cya talk --repo
```

### Start with local broker

```bash
cya talk --broker local
```

### Start with relay broker

```bash
cya talk --broker relay
```

### Start with strict safety

```bash
cya talk --suggest-only
```

### Start with no memory

```bash
cya talk --no-memory
```

### Start with project memory

```bash
cya talk --memory project
```

### List active sessions

```bash
cya sessions
```

### End current session

```bash
cya session stop
```

---

## Example User Experience

Terminal:

```bash
cd ~/repos/can-you-assist
cya talk
```

Terminal output:

```text
CYA voice session started for:

  ~/repos/can-you-assist

Mode:
  suggest-only

Pair your phone:
  scan QR code or enter code 913-442

Waiting for phone...
```

Phone:

```text
Connected to can-you-assist on tinker-base.

Tap and speak.
```

User speaks:

> What would be a good initial module structure for this project?

Terminal:

```text
You asked:
"What would be a good initial module structure for this project?"

Suggested initial structure:

src/
  cli/
    main.ts
    commands/
  session/
    voice-session.ts
    pairing.ts
  context/
    collector.ts
    git.ts
    filesystem.ts
  llm/
    llm-connect-client.ts
  memory/
    phase-memory-client.ts
  safety/
    classifier.ts
    confirmation.ts

Rationale:
...
```

Phone:

```text
I suggested a module structure with CLI, session, context, LLM, memory, and safety layers.
```

---

## Integration With `llm-connect`

Speech mode should not bypass `llm-connect`.

The flow should be:

```text
Phone speech
→ transcript
→ cya CLI session
→ llm-connect
→ selected LLM backend
→ cya response
→ terminal and/or phone
```

This keeps backend selection consistent with normal `cya` behavior.

The phone app should not directly call the LLM unless specifically configured for a future lightweight mode.

---

## Integration With `phase-memory`

Speech mode should use `phase-memory` for:

* preferred speech interaction style;
* trusted phones;
* known devices;
* user voice mode preferences;
* project-specific command conventions;
* common workflows;
* interaction summaries.

Examples of remembered preferences:

```text
User prefers speech responses to be concise.
User wants destructive commands previewed but never executed automatically.
User usually works with git commit messages in conventional commit style.
User prefers explanations before suggested shell commands.
```

Memory should remain inspectable and editable.

Possible commands:

```bash
cya memory show
cya memory edit
cya memory forget voice.devices
```

---

## Minimal Technical Design

A minimal first implementation could use:

### CLI Side

* `cya talk` starts a WebSocket server or connects to a broker.
* It creates a session token.
* It renders a QR code.
* It listens for incoming text messages.
* It routes messages through the normal `cya` assistant pipeline.

### Phone Side

A Progressive Web App may be enough initially.

The PWA can:

* scan QR codes;
* use browser speech recognition where available;
* send text over WebSocket;
* show responses.

This avoids building native iOS and Android apps immediately.

Later, native apps can provide better:

* speech recognition;
* background handling;
* push-to-talk UX;
* device trust;
* text-to-speech;
* secure local storage.

### Broker Side

For the first version, choose one:

#### Local-only prototype

Simpler, private, no cloud.

Good for proof of concept.

#### Minimal relay prototype

More useful for SSH and remote development.

Better real-world fit.

A good architectural compromise:

* implement a broker interface;
* start with local broker;
* allow relay broker later;
* keep protocol stable.
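The compromise above amounts to coding against an interface and swapping implementations later. A minimal sketch follows. The `Broker` interface and its two methods are assumptions, not an existing API, and `InMemoryBroker` stands in for the local broker so the idea can be shown without any networking.

```typescript
type MessageHandler = (payload: string) => void;

// Hypothetical broker contract shared by local and relay implementations.
interface Broker {
  // CLI side: make a session reachable. Returns a function that unregisters it.
  register(sessionId: string, onMessage: MessageHandler): () => void;
  // Phone side: deliver a payload. Returns false if the session is unknown.
  send(sessionId: string, payload: string): boolean;
}

// Stand-in for a local broker: routes messages within one process.
class InMemoryBroker implements Broker {
  private readonly sessions = new Map<string, MessageHandler>();

  register(sessionId: string, onMessage: MessageHandler): () => void {
    if (this.sessions.has(sessionId)) {
      throw new Error(`Session already registered: ${sessionId}`);
    }
    this.sessions.set(sessionId, onMessage);
    return () => {
      this.sessions.delete(sessionId);
    };
  }

  send(sessionId: string, payload: string): boolean {
    const handler = this.sessions.get(sessionId);
    if (handler === undefined) return false;
    handler(payload);
    return true;
  }
}
```

A WebSocket-based local broker or a relay client would implement the same two methods, so the `cya talk` session code would not need to change when the transport does.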

---

## Protocol Sketch

Phone sends:

```json
{
  "type": "user_message",
  "session_id": "cya_sess_123",
  "message_id": "msg_001",
  "input_mode": "speech",
  "transcript": "Explain the current git status and suggest a commit message.",
  "confidence": 0.91
}
```

CLI responds:

```json
{
  "type": "assistant_response",
  "session_id": "cya_sess_123",
  "message_id": "msg_001",
  "terminal_response": "...full response...",
  "phone_response": "I found three modified files and one new file. Suggested commit message: ...",
  "requires_confirmation": false
}
```

For risky actions:

```json
{
  "type": "assistant_response",
  "session_id": "cya_sess_123",
  "message_id": "msg_002",
  "terminal_response": "Suggested command: rm -rf build/",
  "phone_response": "This may delete files. Please confirm in the terminal.",
  "requires_confirmation": true,
  "confirmation_channel": "terminal"
}
```
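On the CLI side, incoming frames should be validated before they reach the assistant pipeline. The sketch below types the `user_message` shape shown above and parses it defensively. The `"text"` input mode (for a typed-input bridge) and the optional `confidence` are assumptions layered on top of the sketch, as is the `parseUserMessage` name.

```typescript
interface UserMessage {
  type: "user_message";
  session_id: string;
  message_id: string;
  input_mode: "speech" | "text"; // "text" is an assumed extra mode
  transcript: string;
  confidence?: number;           // assumed optional, range 0..1
}

// Parse and validate a raw JSON frame. Returns null if anything is off.
function parseUserMessage(raw: string): UserMessage | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof value !== "object" || value === null) return null;

  const { type: kind, session_id, message_id, input_mode, transcript, confidence } =
    value as Record<string, unknown>;

  if (kind !== "user_message") return null;
  if (typeof session_id !== "string" || typeof message_id !== "string") return null;
  const mode =
    input_mode === "speech" ? "speech" : input_mode === "text" ? "text" : null;
  if (mode === null) return null;
  if (typeof transcript !== "string" || transcript.trim() === "") return null;

  const message: UserMessage = {
    type: "user_message",
    session_id,
    message_id,
    input_mode: mode,
    transcript,
  };
  if (confidence !== undefined) {
    if (typeof confidence !== "number" || confidence < 0 || confidence > 1) return null;
    message.confidence = confidence;
  }
  return message;
}
```

Unknown extra fields are ignored rather than rejected, which leaves room to extend the protocol without breaking older CLI versions.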

---

## Important Product Distinction

This extension should not be framed as:

> “CYA becomes a mobile assistant.”

It should be framed as:

> “The phone becomes a microphone and conversation surface for the active terminal helper.”

That distinction protects the project scope.

The center remains the console.

---

## Updated INTENT Addition

This could be added to `INTENT.md` under long-term direction or primary use cases:

```markdown
### Speech-Assisted Console Interaction

`cya` should eventually support a speech interaction mode where a phone or other capable device can act as a microphone, speech-recognition frontend, and lightweight conversation surface for an active `cya` CLI session.

This enables voice interaction even when the terminal environment itself has no microphone, audio stack, graphical interface, or speech-recognition capability.

In this mode, the phone does not become the primary execution environment. Instead, it connects to a currently activated `cya` helper session. The CLI session remains responsible for local filesystem context, repository context, memory scope, LLM backend selection, and safety confirmation.

The phone provides convenient speech input and optional spoken output, while `cya` preserves its console-native, user-controlled architecture.
```

---

## My Recommended Direction

I would treat this as a distinct but natural extension:

```text
cya-core
  console assistant

cya-voice
  speech bridge protocol and session mode

cya-mobile
  phone app or PWA

cya-broker
  optional pairing and relay service
```

The first implementation should probably be:

```text
cya talk
+ local WebSocket session
+ QR pairing
+ phone PWA
+ push-to-talk speech recognition
+ suggest-only mode
```

That delivers the core experience without overbuilding the system.

The deeper architectural insight is:

> Speech mode should not make the terminal speak.
> It should let a speech-capable companion device speak *to the terminal’s active assistant context*.