CYA Speech Mode Extension

Working Name

cya speech mode

Possible short names:

  • cya talk
  • cya voice
  • cya listen
  • cya pair
  • cya phone

Core Idea

cya should support a speech interaction mode where the user can use a phone as the microphone, speech-recognition device, and conversational front-end for a cya helper session running on a terminal.

This allows users to speak to cya even when the active terminal environment has no usable microphone, no audio stack, no GUI, or no convenient speech-recognition capability.

The terminal remains the operational context.

The phone becomes the voice interface.

The LLM interaction remains bound to the currently activated cya session.


User Scenario

A user is working in a terminal on a server, VM, SSH session, minimal Linux installation, or development machine.

They want to ask:

“What does this error mean?”
“Generate a command to find all Python files importing requests.”
“Explain the current git status.”
“Create a commit message for these changes.”
“Summarize the files in this folder.”

Instead of typing the request, they activate a cya helper session in the terminal and connect the phone app.

The phone enters conversation mode and uses its microphone and speech recognition.

The recognized text is sent to the active cya session.

The terminal-side cya helper answers in the terminal, and optionally also sends a spoken or textual response back to the phone.


Basic Interaction Flow

1. User activates CYA in terminal

Example:

cya talk

or:

cya --voice

The CLI starts a local or remote helper session and displays pairing instructions.

Example:

CYA voice session started.

Pair with phone:
  Open the CYA mobile app and scan this QR code.

Session:
  repo: /home/bernd/projects/example
  mode: voice bridge
  expires: 5 minutes

The terminal may show a QR code containing a short-lived pairing token.


2. User opens the CYA phone app

The user scans the QR code with the app or types in a pairing code.

The phone is now connected to the active cya session.

The app shows:

Connected to:
example repo on tinker-base

Listening...

3. User speaks into the phone

The phone captures audio and performs speech recognition.

The recognized text is sent to the active terminal-side cya helper.

Example recognized utterance:

Please explain the current git status and suggest a commit message.

4. Terminal-side CYA processes the request

The terminal-side cya helper has access to the local context:

  • current working directory;
  • selected files;
  • git status;
  • shell environment;
  • project memory;
  • user preferences;
  • configured LLM backend through llm-connect;
  • memory through phase-memory.

The phone does not need direct access to the filesystem.

That is important.

The phone provides speech input.

The CLI session owns the local operational context.


5. Response is returned

The response appears in the terminal.

Optionally, a concise response is also sent back to the phone.

Example terminal response:

Current git status summary:

- 3 modified files
- 1 new file
- no staged changes

Suggested commit message:

feat(cli): add initial voice session pairing flow

Suggested next command:

git add src/voice-session.ts README.md
git commit -m "feat(cli): add initial voice session pairing flow"

The phone may show:

I found three modified files and one new file. Suggested commit message:
feat(cli): add initial voice session pairing flow

Optionally, the phone can read that aloud.


Key Design Principle

The phone should be a voice bridge, not the primary authority.

The terminal session remains the source of truth for:

  • current working directory;
  • filesystem access;
  • repository context;
  • execution permissions;
  • local configuration;
  • project memory;
  • safety confirmations.

The phone handles:

  • microphone input;
  • speech recognition;
  • optional text-to-speech;
  • session pairing;
  • conversational convenience.

This avoids turning the phone app into a remote filesystem client or full IDE.


Conceptual Architecture

+------------------+          +----------------------+          +------------------+
|                  |  speech  |                      |   text   |                  |
|    Phone App     +--------->|  CYA Session Broker  +--------->| CYA CLI Session  |
|                  |<---------+                      |<---------+                  |
|                  |   text   |                      |  result  |                  |
+--------+---------+          +----------+-----------+          +--------+---------+
         |                               |                               |
         v                               v                               v
  Microphone / STT               Pairing / Routing                 Local Context
  Text-to-Speech                 Session Registry                  Filesystem
  Conversation UI                Authentication                    Git / Shell
                                                                   llm-connect
                                                                   phase-memory

Main Components

1. CYA CLI Voice Session

The CLI needs a mode that creates an active voice-addressable session.

Responsibilities:

  • create session ID;
  • generate short-lived pairing token;
  • register itself with a broker;
  • expose current terminal context;
  • receive transcribed user messages;
  • process messages through normal cya assistance flow;
  • return responses;
  • enforce safety and confirmation rules.

Possible command:

cya talk

or:

cya session start --voice

2. CYA Phone App

The phone app provides the user-facing speech interface.

Responsibilities:

  • scan QR code or enter pairing code;
  • establish secure connection to active session;
  • capture microphone input;
  • run speech recognition;
  • show transcript before sending, depending on mode;
  • send text requests to the active cya session;
  • display responses;
  • optionally read responses aloud;
  • allow pause, mute, reconnect, and disconnect.

The app does not need to know the details of the user's filesystem.


3. Session Broker

A broker is needed to connect the phone to the CLI session.

This could be implemented in different ways.

Option A: Local Network Broker

The CLI starts a small local WebSocket server.

The phone connects directly over the LAN.

Good for:

  • home networks;
  • office networks;
  • local development;
  • no cloud dependency.

Challenges:

  • NAT/firewall issues;
  • phone and terminal must be on same network;
  • HTTPS/TLS handling;
  • service discovery.

Example:

cya talk --listen 192.168.1.42:47391

QR code contains:

cya://pair?host=192.168.1.42&port=47391&token=...
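
A minimal sketch of how the local pairing link could be built on the CLI side and parsed on the phone side. It assumes that `host`, `port`, and `token` are the only query parameters, as in the example above. The type and function names are hypothetical and not part of any existing cya code.

```typescript
// Hypothetical fields carried by the local-broker pairing link.
interface LocalPairingLink {
  host: string;
  port: number;
  token: string;
}

// CLI side: encode the link that is rendered as a QR code.
function buildPairingUri(link: LocalPairingLink): string {
  const query = new URLSearchParams({
    host: link.host,
    port: String(link.port),
    token: link.token,
  });
  return `cya://pair?${query.toString()}`;
}

// Phone side: decode the scanned link, rejecting anything malformed.
function parsePairingUri(uri: string): LocalPairingLink | null {
  const prefix = "cya://pair?";
  if (!uri.startsWith(prefix)) return null;
  const query = new URLSearchParams(uri.slice(prefix.length));
  const host = query.get("host");
  const token = query.get("token");
  const port = Number(query.get("port"));
  if (!host || !token) return null;
  if (!Number.isInteger(port) || port < 1 || port > 65535) return null;
  return { host, port, token };
}
```

The parser returns `null` instead of throwing, so the app can show a plain "invalid QR code" message for anything that is not a cya pairing link.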

Option B: Relay Broker

A small relay service connects phone and CLI.

Good for:

  • SSH sessions;
  • cloud servers;
  • NAT traversal;
  • mobile networks;
  • remote development.

The relay does not need to read message contents if traffic between phone and CLI is end-to-end encrypted. In that case it only routes opaque session traffic.

QR code contains:

cya://pair?relay=https://relay.cya.example&session=...&token=...

Challenges:

  • requires hosted infrastructure;
  • introduces trust and privacy questions;
  • should be optional, not mandatory.

Option C: User-Controlled Broker

Advanced users can self-host the broker.

This fits the philosophy of cya.

Possible command:

cya-broker serve

Then:

cya talk --broker https://my-broker.example

This preserves the user-controlled infrastructure principle.


A practical implementation path could have four stages.


Stage 1: Text Bridge Prototype

Do not start with speech.

Start with a phone-to-terminal text bridge.

Goal:

  • CLI starts a session.
  • Phone connects.
  • User types text on phone.
  • Text appears in the active cya session.
  • CLI processes request and returns response.

This proves:

  • pairing;
  • routing;
  • session ownership;
  • broker model;
  • authentication;
  • terminal-side context handling.

Example:

cya talk

Phone app:

Connected. Type your request.

This avoids early complexity around speech APIs.


Stage 2: Speech-to-Text on Phone

Add microphone and speech recognition.

The phone converts speech to text locally or through platform speech recognition.

Options:

  • iOS speech recognition;
  • Android speech recognition;
  • browser-based Web Speech API for a PWA;
  • local speech model on device where feasible;
  • cloud speech-to-text if user permits.

The important architectural point:

cya receives text, not raw audio, unless explicitly configured otherwise.

This keeps the CLI helper simple and avoids requiring audio handling on the terminal machine.


Stage 3: Conversation Mode

Add continuous or semi-continuous voice interaction.

Modes:

Push-to-Talk Mode

Safest and easiest.

The user presses and holds a button, speaks, and releases to send.

Good default.

Confirm-Before-Send Mode

The app shows recognized text first.

User taps send.

Useful when commands may be risky.

Continuous Conversation Mode

The app listens continuously and sends utterances automatically.

Useful but riskier.

Should probably be opt-in.
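
The three modes can be reduced to one decision on the phone: send the transcript, ask the user first, or drop it. The sketch below is one possible reading. The confidence threshold for continuous mode is an assumption and not part of the design above, and all names are hypothetical.

```typescript
type ConversationMode = "push-to-talk" | "confirm-before-send" | "continuous";
type SendAction = "send" | "ask-user" | "discard";

// minConfidence is an assumed tuning knob for continuous mode only.
function decideSend(
  mode: ConversationMode,
  transcript: string,
  confidence: number,
  minConfidence: number = 0.8,
): SendAction {
  // Nothing recognized: never send an empty request.
  if (transcript.trim() === "") return "discard";
  // Releasing the button is the explicit send gesture.
  if (mode === "push-to-talk") return "send";
  // The user always reviews the transcript first.
  if (mode === "confirm-before-send") return "ask-user";
  // Continuous mode: send automatically only when recognition looks reliable.
  return confidence >= minConfidence ? "send" : "ask-user";
}
```

Falling back to `ask-user` in continuous mode keeps a misheard utterance from reaching the terminal session unnoticed.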


Stage 4: Bidirectional Voice

Add optional text-to-speech responses.

The phone can speak concise answers aloud.

This should be configurable:

cya talk --phone-speak concise
cya talk --phone-speak off
cya talk --phone-speak full

The terminal should still show the full response.

The phone should preferably receive a summarized voice response to avoid reading long command explanations aloud.
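
A small sketch of how the `--phone-speak` value could select what the phone reads aloud. It assumes the CLI produces both a full terminal response and a shorter phone response. The function names are hypothetical.

```typescript
type PhoneSpeakMode = "off" | "concise" | "full";

// Validate the flag value; unknown values are rejected rather than guessed.
function parsePhoneSpeak(value: string): PhoneSpeakMode | null {
  return value === "off" || value === "concise" || value === "full" ? value : null;
}

// Returns the text to speak, or null when the phone should stay silent.
function selectSpokenText(
  mode: PhoneSpeakMode,
  terminalResponse: string,
  phoneResponse: string,
): string | null {
  if (mode === "off") return null;
  if (mode === "full") return terminalResponse;
  return phoneResponse; // concise
}
```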


Safety Model

Speech mode needs stricter safety defaults than typed CLI use.

Speech recognition can mishear commands.

Therefore:

Never execute destructive commands directly from speech

For example, if the user says:

Delete all generated files.

cya should produce a preview and require explicit terminal confirmation.

Require confirmation for execution

Possible pattern:

Suggested command:

rm -rf build/

This may delete files.
Run it? [y/N]

The confirmation should happen in the terminal by default, not only on the phone.

Separate “ask”, “suggest”, and “run”

Useful command modes:

cya talk --suggest-only
cya talk --allow-run
cya talk --no-exec

Default should be:

cya talk --suggest-only

Show recognized transcript

The phone should always make clear what it heard.

For risky requests, it should require confirmation before sending.
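
The rules above can be sketched as a small gate on the CLI side. This is an illustration under assumptions: the pattern list is deliberately naive and is not a real safety classifier, and treating both `--no-exec` and `--suggest-only` as "never offer to run" is one possible reading of the flags. All names are hypothetical.

```typescript
type ExecutionPolicy = "no-exec" | "suggest-only" | "allow-run";

interface Gate {
  offerRun: boolean;      // may the terminal show "Run it? [y/N]"
  warning: string | null; // extra notice for commands that look destructive
}

// Naive illustration only; a real classifier would need far more care.
const DESTRUCTIVE = /^\s*(sudo\s+)?(rm|rmdir|shred)\b|\bgit\s+clean\b/;

function gateCommand(policy: ExecutionPolicy, command: string): Gate {
  return {
    offerRun: policy === "allow-run",
    warning: DESTRUCTIVE.test(command) ? "This may delete files." : null,
  };
}

// [y/N]: only an explicit yes runs the command; empty input means no.
function isConfirmed(answer: string): boolean {
  return /^(y|yes)$/i.test(answer.trim());
}
```

Even under `allow-run`, nothing runs until `isConfirmed` accepts an answer typed in the terminal.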


Privacy Model

Speech mode touches sensitive areas:

  • spoken input;
  • local filesystem context;
  • repository contents;
  • personal memory;
  • command history;
  • notes.

So the design should make privacy visible and controllable.

The user should be able to see:

  • which phone is connected;
  • which terminal session it is connected to;
  • what context is being sent to the LLM;
  • which LLM backend is used;
  • what is stored in memory;
  • whether speech recognition is local or cloud-based;
  • whether a relay broker is involved.

Example session banner:

CYA voice session active

Phone:
  Bernd's iPhone

Speech recognition:
  on-device if available, platform fallback allowed

LLM backend:
  llm-connect: local-openrouter-default

Memory:
  phase-memory: user + project preferences

Broker:
  local websocket

Execution:
  suggest-only

Pairing and Authentication

Pairing should be short-lived and explicit.

Possible pairing methods:

QR Code

Best default.

cya talk

Terminal displays QR code.

Phone scans it.

Numeric Code

Fallback for terminals that cannot show QR codes.

Pairing code: 482-119
Expires in 5 minutes.
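
A minimal sketch of generating and checking such a code, assuming a six-digit code and the five-minute lifetime from the example. It uses Node's `crypto.randomInt` as the random source. The type and function names are hypothetical.

```typescript
import { randomInt } from "node:crypto";

// Hypothetical shape of a pairing offer shown in the terminal.
interface PairingOffer {
  code: string;      // human-typable code, e.g. "482-119"
  expiresAt: number; // Unix epoch milliseconds
}

const PAIRING_TTL_MS = 5 * 60 * 1000; // "Expires in 5 minutes."

// Create a six-digit code from a cryptographically secure random source.
function createPairingOffer(now: number = Date.now()): PairingOffer {
  const digits = randomInt(0, 1000000).toString().padStart(6, "0");
  return {
    code: `${digits.slice(0, 3)}-${digits.slice(3)}`,
    expiresAt: now + PAIRING_TTL_MS,
  };
}

// Accept the entered code only if it matches and the offer has not expired.
// Separators typed by the user (dash, space) are ignored.
function isPairingValid(offer: PairingOffer, entered: string, now: number = Date.now()): boolean {
  const digitsOnly = (s: string): string => s.replace(/\D/g, "");
  return now < offer.expiresAt && digitsOnly(entered) === digitsOnly(offer.code);
}
```

A six-digit code is short, so a real implementation would probably also limit the number of attempts per session.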

Known Device Trust

After initial pairing, the user may mark a phone as trusted.

Even then, each terminal voice session should be explicitly activated.

A trusted device should not mean always-on access.


Session Ownership

Each voice interaction should be bound to a specific active cya session.

That session has:

  • session ID;
  • current working directory;
  • user identity;
  • host identity;
  • start time;
  • allowed capabilities;
  • selected LLM backend;
  • selected memory scope;
  • execution policy.

This prevents accidental cross-talk where the phone sends a request to the wrong terminal or wrong repository.

Useful phone display:

Connected to:
tinker-base
/home/bernd/repos/can-you-assist

Mode:
suggest-only
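
The session record and the lookup that prevents cross-talk could look roughly like this. Field names mirror the list above. The registry class and the banner helper are hypothetical, and the memory scope values are assumptions.

```typescript
type ExecutionPolicy = "no-exec" | "suggest-only" | "allow-run";

interface VoiceSession {
  sessionId: string;
  cwd: string;
  user: string;
  host: string;
  startedAt: number; // Unix epoch milliseconds
  capabilities: string[];
  llmBackend: string;
  memoryScope: "none" | "project" | "user"; // assumed values
  executionPolicy: ExecutionPolicy;
}

class SessionRegistry {
  private readonly sessions = new Map<string, VoiceSession>();

  start(session: VoiceSession): void {
    this.sessions.set(session.sessionId, session);
  }

  stop(sessionId: string): boolean {
    return this.sessions.delete(sessionId);
  }

  // A message is only routed when its session ID names an active session.
  resolve(sessionId: string): VoiceSession | undefined {
    return this.sessions.get(sessionId);
  }
}

// Text for the phone display, so the user sees which terminal is addressed.
function phoneBanner(s: VoiceSession): string {
  return ["Connected to:", s.host, s.cwd, "", "Mode:", s.executionPolicy].join("\n");
}
```

Because every incoming message must resolve to exactly one record, a request for a stopped or unknown session is dropped instead of landing in another repository.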

Possible CLI Commands

Start voice session

cya talk

Start voice session for current repository

cya talk --repo

Start with local broker

cya talk --broker local

Start with relay broker

cya talk --broker relay

Start with strict safety

cya talk --suggest-only

Start with no memory

cya talk --no-memory

Start with project memory

cya talk --memory project

List active sessions

cya sessions

End current session

cya session stop

Example User Experience

Terminal:

cd ~/repos/can-you-assist
cya talk

Terminal output:

CYA voice session started for:

  ~/repos/can-you-assist

Mode:
  suggest-only

Pair your phone:
  scan QR code or enter code 913-442

Waiting for phone...

Phone:

Connected to can-you-assist on tinker-base.

Tap and speak.

User speaks:

What would be a good initial module structure for this project?

Terminal:

You asked:
"What would be a good initial module structure for this project?"

Suggested initial structure:

src/
  cli/
    main.ts
    commands/
  session/
    voice-session.ts
    pairing.ts
  context/
    collector.ts
    git.ts
    filesystem.ts
  llm/
    llm-connect-client.ts
  memory/
    phase-memory-client.ts
  safety/
    classifier.ts
    confirmation.ts

Rationale:
...

Phone:

I suggested a module structure with CLI, session, context, LLM, memory, and safety layers.

Integration With llm-connect

Speech mode should not bypass llm-connect.

The flow should be:

Phone speech
→ transcript
→ cya CLI session
→ llm-connect
→ selected LLM backend
→ cya response
→ terminal and/or phone

This keeps backend selection consistent with normal cya behavior.

The phone app should not directly call the LLM unless specifically configured for a future lightweight mode.
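
A sketch of that flow on the CLI side. The real llm-connect interface is not described here, so the backend is passed in as a plain function. Every name is hypothetical, and taking the first line as the phone response is a placeholder for real summarization.

```typescript
// Stand-in for an llm-connect client; the actual interface may differ.
type LlmComplete = (prompt: string) => Promise<string>;

interface SessionContext {
  cwd: string;
  gitStatus: string;
}

interface VoiceReply {
  terminal: string; // full response for the terminal
  phone: string;    // short response for the phone
}

async function handleTranscript(
  transcript: string,
  context: SessionContext,
  complete: LlmComplete,
): Promise<VoiceReply> {
  // Local context is added on the CLI side; the phone only supplied the text.
  const prompt = [
    `Working directory: ${context.cwd}`,
    `Git status:\n${context.gitStatus}`,
    `User request (spoken): ${transcript}`,
  ].join("\n\n");
  const terminal = await complete(prompt);
  // Placeholder shortening: the first line only.
  const phone = terminal.split("\n")[0];
  return { terminal, phone };
}
```

Injecting the backend as a function keeps the voice path on the same route as typed requests and makes it easy to test without a live model.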


Integration With phase-memory

Speech mode should use phase-memory for:

  • preferred speech interaction style;
  • trusted phones;
  • known devices;
  • user voice mode preferences;
  • project-specific command conventions;
  • common workflows;
  • interaction summaries.

Examples of remembered preferences:

User prefers speech responses to be concise.
User wants destructive commands previewed but never executed automatically.
User usually writes git commit messages in conventional commit style.
User prefers explanations before suggested shell commands.

Memory should remain inspectable and editable.

Possible commands:

cya memory show
cya memory edit
cya memory forget voice.devices

Minimal Technical Design

A minimal first implementation could use:

CLI Side

  • cya talk starts a WebSocket server or connects to a broker.
  • It creates a session token.
  • It renders a QR code.
  • It listens for incoming text messages.
  • It routes messages through the normal cya assistant pipeline.

Phone Side

A Progressive Web App may be enough initially.

The PWA can:

  • scan QR codes;
  • use browser speech recognition where available;
  • send text over WebSocket;
  • show responses.

This avoids building native iOS and Android apps immediately.

Later, native apps can provide better:

  • speech recognition;
  • background handling;
  • push-to-talk UX;
  • device trust;
  • text-to-speech;
  • secure local storage.

Broker Side

For the first version, choose one:

Local-only prototype

Simpler, private, no cloud.

Good for proof of concept.

Minimal relay prototype

More useful for SSH and remote development.

Better real-world fit.

A good architectural compromise:

  • implement a broker interface;
  • start with local broker;
  • allow relay broker later;
  • keep protocol stable.
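
One way to read "implement a broker interface" is a small contract that both a local and a relay broker can satisfy. The sketch below shows an in-memory local variant with no networking at all. The interface and its method names are hypothetical.

```typescript
type MessageHandler = (transcript: string) => void;

// Hypothetical broker contract; a relay broker would implement the same methods.
interface Broker {
  register(sessionId: string, token: string, onMessage: MessageHandler): void;
  deliver(sessionId: string, token: string, transcript: string): boolean;
  unregister(sessionId: string): void;
}

class LocalBroker implements Broker {
  private readonly sessions = new Map<string, { token: string; onMessage: MessageHandler }>();

  register(sessionId: string, token: string, onMessage: MessageHandler): void {
    this.sessions.set(sessionId, { token, onMessage });
  }

  // Returns false when the session is unknown or the token does not match.
  deliver(sessionId: string, token: string, transcript: string): boolean {
    const entry = this.sessions.get(sessionId);
    if (!entry || entry.token !== token) return false;
    entry.onMessage(transcript);
    return true;
  }

  unregister(sessionId: string): void {
    this.sessions.delete(sessionId);
  }
}
```

The CLI session only ever talks to `Broker`, so swapping the local variant for a relay later would not change the message protocol.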

Protocol Sketch

Phone sends:

{
  "type": "user_message",
  "session_id": "cya_sess_123",
  "message_id": "msg_001",
  "input_mode": "speech",
  "transcript": "Explain the current git status and suggest a commit message.",
  "confidence": 0.91
}

CLI responds:

{
  "type": "assistant_response",
  "session_id": "cya_sess_123",
  "message_id": "msg_001",
  "terminal_response": "...full response...",
  "phone_response": "I found three modified files and one new file. Suggested commit message: ...",
  "requires_confirmation": false
}

For risky actions:

{
  "type": "assistant_response",
  "session_id": "cya_sess_123",
  "message_id": "msg_002",
  "terminal_response": "Suggested command: rm -rf build/",
  "phone_response": "This may delete files. Please confirm in the terminal.",
  "requires_confirmation": true,
  "confirmation_channel": "terminal"
}
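
The messages above translate directly into types, and the CLI can validate incoming JSON before it reaches the assistant pipeline. This is a sketch: the `"text"` input mode is an assumption for a typed-input bridge, treating `confidence` as optional is also an assumption, and the parser name is hypothetical.

```typescript
interface UserMessage {
  type: "user_message";
  session_id: string;
  message_id: string;
  input_mode: "speech" | "text"; // "text" is assumed for typed input
  transcript: string;
  confidence?: number;           // 0..1, treated as optional here
}

interface AssistantResponse {
  type: "assistant_response";
  session_id: string;
  message_id: string;
  terminal_response: string;
  phone_response: string;
  requires_confirmation: boolean;
  confirmation_channel?: "terminal";
}

// Returns a validated message, or null for anything malformed.
function parseUserMessage(raw: string): UserMessage | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const { type, session_id, message_id, input_mode, transcript, confidence } =
    data as Record<string, unknown>;
  if (type !== "user_message") return null;
  if (typeof session_id !== "string" || typeof message_id !== "string") return null;
  if (typeof transcript !== "string" || transcript.trim() === "") return null;
  const mode = input_mode === "speech" ? "speech" : input_mode === "text" ? "text" : null;
  if (mode === null) return null;
  let conf: number | undefined;
  if (confidence !== undefined) {
    if (typeof confidence !== "number" || confidence < 0 || confidence > 1) return null;
    conf = confidence;
  }
  return { type: "user_message", session_id, message_id, input_mode: mode, transcript, confidence: conf };
}
```

Echoing the request's `message_id` in each `AssistantResponse` lets the phone match a reply to the utterance that caused it.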

Important Product Distinction

This extension should not be framed as:

“CYA becomes a mobile assistant.”

It should be framed as:

“The phone becomes a microphone and conversation surface for the active terminal helper.”

That distinction protects the project scope.

The center remains the console.


Updated INTENT Addition

This could be added to INTENT.md under long-term direction or primary use cases:

### Speech-Assisted Console Interaction

`cya` should eventually support a speech interaction mode where a phone or other capable device can act as a microphone, speech-recognition frontend, and lightweight conversation surface for an active `cya` CLI session.

This enables voice interaction even when the terminal environment itself has no microphone, audio stack, graphical interface, or speech-recognition capability.

In this mode, the phone does not become the primary execution environment. Instead, it connects to a currently activated `cya` helper session. The CLI session remains responsible for local filesystem context, repository context, memory scope, LLM backend selection, and safety confirmation.

The phone provides convenient speech input and optional spoken output, while `cya` preserves its console-native, user-controlled architecture.

I would treat this as a distinct but natural extension:

cya-core
  console assistant

cya-voice
  speech bridge protocol and session mode

cya-mobile
  phone app or PWA

cya-broker
  optional pairing and relay service

The first implementation should probably be:

cya talk
+ local WebSocket session
+ QR pairing
+ phone PWA
+ push-to-talk speech recognition
+ suggest-only mode

That gives you the core magic without overbuilding the system.

The deeper architectural insight is:

Speech mode should not make the terminal speak. It should let a speech-capable companion device speak to the terminal's active assistant context.