CYA Speech Mode Extension
Working Name
cya speech mode
Possible short names:
cya talk
cya voice
cya listen
cya pair
cya phone
Core Idea
cya should support a speech interaction mode where the user can use a phone as the microphone, speech-recognition device, and conversational front-end for a cya helper session running on a terminal.
This allows users to speak to cya even when the active terminal environment has no usable microphone, no audio stack, no GUI, or no convenient speech-recognition capability.
The terminal remains the operational context.
The phone becomes the voice interface.
The LLM interaction remains bound to the currently activated cya session.
User Scenario
A user is working in a terminal on a server, VM, SSH session, minimal Linux installation, or development machine.
They want to ask:
“What does this error mean?”
“Generate a command to find all Python files importing requests.”
“Explain the current git status.”
“Create a commit message for these changes.”
“Summarize the files in this folder.”
Instead of typing the request, they activate a cya helper session in the terminal and connect the phone app.
The phone enters conversation mode and uses its microphone and speech recognition.
The recognized text is sent to the active cya session.
The terminal-side cya helper answers in the terminal, and optionally also sends a spoken or textual response back to the phone.
Basic Interaction Flow
1. User activates CYA in terminal
Example:
cya talk
or:
cya --voice
The CLI starts a local or remote helper session and displays a pairing instruction.
Example:
CYA voice session started.
Pair with phone:
Open the CYA mobile app and scan this QR code.
Session:
repo: /home/bernd/projects/example
mode: voice bridge
expires: 5 minutes
The terminal may show a QR code containing a short-lived pairing token.
2. User opens the CYA phone app
The app scans the QR code or enters a pairing code.
The phone is now connected to the active cya session.
The app shows:
Connected to:
example repo on tinker-base
Listening...
3. User speaks into the phone
The phone captures audio and performs speech recognition.
The recognized text is sent to the active terminal-side cya helper.
Example recognized utterance:
Please explain the current git status and suggest a commit message.
4. Terminal-side CYA processes the request
The terminal-side cya helper has access to the local context:
- current working directory;
- selected files;
- git status;
- shell environment;
- project memory;
- user preferences;
- configured LLM backend through llm-connect;
- memory through phase-memory.
The phone does not need direct access to the filesystem.
That is important.
The phone provides speech input.
The CLI session owns the local operational context.
5. Response is returned
The response appears in the terminal.
Optionally, a concise response is also sent back to the phone.
Example terminal response:
Current git status summary:
- 3 modified files
- 1 new file
- no staged changes
Suggested commit message:
feat(cli): add initial voice session pairing flow
Suggested next command:
git add src/voice-session.ts README.md
git commit -m "feat(cli): add initial voice session pairing flow"
The phone may show:
I found three modified files and one new file. Suggested commit message:
feat(cli): add initial voice session pairing flow
Optionally, the phone can read that aloud.
Key Design Principle
The phone should be a voice bridge, not the primary authority.
The terminal session remains the source of truth for:
- current working directory;
- filesystem access;
- repository context;
- execution permissions;
- local configuration;
- project memory;
- safety confirmations.
The phone handles:
- microphone input;
- speech recognition;
- optionally, text-to-speech;
- session pairing;
- conversational convenience.
This avoids turning the phone app into a remote filesystem client or full IDE.
Conceptual Architecture
+------------------+           +----------------------+           +------------------+
|                  |  speech   |                      |   text    |                  |
|    Phone App     +---------->|  CYA Session Broker  +---------->| CYA CLI Session  |
|                  |   text    |                      |  result   |                  |
+--------+---------+<----------+----------+-----------+<----------+--------+---------+
         |                                |                                |
         |                                |                                |
         v                                v                                v
  Microphone / STT                Pairing / Routing                  Local Context
  Text-to-Speech                  Session Registry                   Filesystem
  Conversation UI                 Authentication                     Git / Shell
                                                                     llm-connect
                                                                     phase-memory
Main Components
1. CYA CLI Voice Session
The CLI needs a mode that creates an active voice-addressable session.
Responsibilities:
- create session ID;
- generate short-lived pairing token;
- register itself with a broker;
- expose current terminal context;
- receive transcribed user messages;
- process messages through normal cya assistance flow;
- return responses;
- enforce safety and confirmation rules.
Possible command:
cya talk
or:
cya session start --voice
2. CYA Phone App
The phone app provides the user-facing speech interface.
Responsibilities:
- scan QR code or enter pairing code;
- establish secure connection to active session;
- capture microphone input;
- run speech recognition;
- show transcript before sending, depending on mode;
- send text requests to the active cya session;
- display responses;
- optionally read responses aloud;
- allow pause, mute, reconnect, and disconnect.
The app does not need to know the details of the user’s filesystem.
3. Session Broker
A broker is needed to connect the phone to the CLI session.
This could be implemented in different ways.
Option A: Local Network Broker
The CLI starts a small local WebSocket server.
The phone connects directly over the LAN.
Good for:
- home networks;
- office networks;
- local development;
- no cloud dependency.
Challenges:
- NAT/firewall issues;
- phone and terminal must be on the same network;
- HTTPS/TLS handling;
- service discovery.
Example:
cya talk --listen 192.168.1.42:47391
QR code contains:
cya://pair?host=192.168.1.42&port=47391&token=...
Option B: Relay Broker
A small relay service connects phone and CLI.
Good for:
- SSH sessions;
- cloud servers;
- NAT traversal;
- mobile networks;
- remote development.
If messages are end-to-end encrypted between the phone and the CLI, the relay only routes opaque session traffic and never sees transcripts, responses, or filesystem data.
QR code contains:
cya://pair?relay=https://relay.cya.example&session=...&token=...
Challenges:
- requires hosted infrastructure;
- introduces trust and privacy questions;
- should be optional, not mandatory.
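The two QR payloads shown for Options A and B could be decoded on the phone roughly as follows. This is a minimal sketch, assuming the query parameters stay exactly as in the examples; the PairingTarget type and the parsePairingUri function are hypothetical names, not part of any existing cya code.

```typescript
// Hypothetical result type: field names mirror the query parameters
// used in the example pairing URIs.
type PairingTarget =
  | { kind: "local"; host: string; port: number; token: string }
  | { kind: "relay"; relay: string; sessionId: string; token: string };

// Decodes a cya://pair?... URI into a local or relay pairing target.
function parsePairingUri(uri: string): PairingTarget {
  const url = new URL(uri);
  if (url.protocol !== "cya:") {
    throw new Error(`unexpected scheme: ${url.protocol}`);
  }
  const params = url.searchParams;

  const token = params.get("token");
  if (!token) throw new Error("pairing URI has no token");

  // A relay parameter marks the Option B form.
  const relay = params.get("relay");
  if (relay !== null) {
    const sessionId = params.get("session");
    if (!sessionId) throw new Error("relay pairing URI has no session");
    return { kind: "relay", relay, sessionId, token };
  }

  // Otherwise expect the Option A form with host and port.
  const host = params.get("host");
  const port = Number(params.get("port"));
  if (!host || !Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error("local pairing URI needs a host and a valid port");
  }
  return { kind: "local", host, port, token };
}
```

Rejecting malformed URIs on the phone keeps a bad scan from ever opening a connection.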
Option C: User-Controlled Broker
Advanced users can self-host the broker.
This fits the philosophy of cya.
Possible command:
cya-broker serve
Then:
cya talk --broker https://my-broker.example
This preserves the user-controlled infrastructure principle.
Recommended Implementation Path
A practical implementation path could have four stages.
Stage 1: Text Bridge Prototype
Do not start with speech.
Start with a phone-to-terminal text bridge.
Goal:
- CLI starts a session.
- Phone connects.
- User types text on phone.
- Text appears in the active cya session.
- CLI processes request and returns response.
This proves:
- pairing;
- routing;
- session ownership;
- broker model;
- authentication;
- terminal-side context handling.
Example:
cya talk
Phone app:
Connected. Type your request.
This avoids early complexity around speech APIs.
Stage 2: Speech-to-Text on Phone
Add microphone and speech recognition.
The phone converts speech to text locally or through platform speech recognition.
Options:
- iOS speech recognition;
- Android speech recognition;
- browser-based Web Speech API for a PWA;
- local speech model on device where feasible;
- cloud speech-to-text if user permits.
The important architectural point:
cya receives text, not raw audio, unless explicitly configured otherwise.
This keeps the CLI helper simple and avoids requiring audio handling on the terminal machine.
Stage 3: Conversation Mode
Add continuous or semi-continuous voice interaction.
Modes:
Push-to-Talk Mode
Safest and easiest.
User presses and holds a button, speaks, releases, and sends.
Good default.
Confirm-Before-Send Mode
The app shows recognized text first.
User taps send.
Useful when commands may be risky.
Continuous Conversation Mode
The app listens continuously and sends utterances automatically.
Useful but riskier.
Should probably be opt-in.
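The three modes differ mainly in whether an utterance may leave the phone without an extra tap. A small sketch of that decision follows. The mode names, the Utterance shape, and the 0.8 confidence threshold are illustrative assumptions. The low-confidence fallback in continuous mode is also an assumption, not something the stages above prescribe.

```typescript
type SendMode = "push-to-talk" | "confirm-before-send" | "continuous";

interface Utterance {
  transcript: string;
  confidence: number; // 0..1, as reported by the speech recognizer
}

// Returns true when the app may send the utterance without a further tap.
// minConfidence is an illustrative default, not a tuned value.
function shouldAutoSend(
  mode: SendMode,
  utterance: Utterance,
  minConfidence: number = 0.8
): boolean {
  // Never send empty recognitions, whatever the mode.
  if (utterance.transcript.trim() === "") return false;

  switch (mode) {
    case "push-to-talk":
      // Releasing the button is the user's explicit send action.
      return true;
    case "confirm-before-send":
      // The user must tap send after reading the transcript.
      return false;
    case "continuous":
      // Assumption: low-confidence utterances fall back to manual review.
      return utterance.confidence >= minConfidence;
  }
}
```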
Stage 4: Bidirectional Voice
Add optional text-to-speech responses.
The phone can speak concise answers aloud.
This should be configurable:
cya talk --phone-speak concise
cya talk --phone-speak off
cya talk --phone-speak full
The terminal should still show the full response.
The phone should preferably receive a summarized voice response to avoid reading long command explanations aloud.
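One way to read the three --phone-speak values is as a selector over the two response texts the CLI produces. The sketch below assumes the CLI sends both a full terminal response and a shorter phone response; parsePhoneSpeak and spokenText are hypothetical helper names.

```typescript
type PhoneSpeak = "off" | "concise" | "full";

// Validates the value passed to --phone-speak.
function parsePhoneSpeak(value: string): PhoneSpeak {
  if (value === "off" || value === "concise" || value === "full") {
    return value;
  }
  throw new Error(`unknown --phone-speak value: ${value}`);
}

// Picks the text the phone should read aloud, or null for silence.
// The terminal always shows terminalResponse, whatever the mode.
function spokenText(
  mode: PhoneSpeak,
  terminalResponse: string,
  phoneResponse: string
): string | null {
  switch (mode) {
    case "off":
      return null;
    case "concise":
      return phoneResponse;
    case "full":
      return terminalResponse;
  }
}
```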
Safety Model
Speech mode needs stricter safety defaults than typed CLI use.
Speech recognition can mishear commands.
Therefore:
Never execute destructive commands directly from speech
For example, if the user says:
Delete all generated files.
cya should produce a preview and require explicit terminal confirmation.
Require confirmation for execution
Possible pattern:
Suggested command:
rm -rf build/
This may delete files.
Run it? [y/N]
The confirmation should happen in the terminal by default, not only on the phone.
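The preview-and-confirm rule could start from a simple pattern check on the suggested command. The patterns below are purely illustrative and far from complete. The reviewCommand function and the Suggestion shape are assumptions made for this sketch.

```typescript
// Illustrative patterns only; a real classifier needs a much broader rule set.
const DESTRUCTIVE_PATTERNS: RegExp[] = [
  /^\s*(sudo\s+)?rm\s/,         // file deletion
  /^\s*git\s+reset\s+--hard\b/, // discards local changes
  /^\s*git\s+clean\b/,          // deletes untracked files
  /^\s*(sudo\s+)?dd\s/,         // raw disk writes
];

interface Suggestion {
  command: string;
  requiresConfirmation: boolean;
  // Confirmation is routed to the terminal, never to the phone alone.
  confirmationChannel: "terminal" | null;
}

// Flags commands that match a destructive pattern so the CLI can show
// a preview and ask "Run it? [y/N]" in the terminal.
function reviewCommand(command: string): Suggestion {
  const risky = DESTRUCTIVE_PATTERNS.some((pattern) => pattern.test(command));
  return {
    command,
    requiresConfirmation: risky,
    confirmationChannel: risky ? "terminal" : null,
  };
}
```

A pattern list like this can only catch known shapes, so it fits best as a second line of defense behind the suggest-only default.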
Separate “ask”, “suggest”, and “run”
Useful command modes:
cya talk --suggest-only
cya talk --allow-run
cya talk --no-exec
Default should be:
cya talk --suggest-only
Show recognized transcript
The phone should always make clear what it heard.
For risky requests, it should require confirmation before sending.
Privacy Model
Speech mode touches sensitive areas:
- spoken input;
- local filesystem context;
- repository contents;
- personal memory;
- command history;
- notes.
So the design should make privacy visible and controllable.
The user should be able to see:
- which phone is connected;
- which terminal session it is connected to;
- what context is being sent to the LLM;
- which LLM backend is used;
- what is stored in memory;
- whether speech recognition is local or cloud-based;
- whether a relay broker is involved.
Example session banner:
CYA voice session active
Phone:
Bernd's iPhone
Speech recognition:
on-device if available, platform fallback allowed
LLM backend:
llm-connect: local-openrouter-default
Memory:
phase-memory: user + project preferences
Broker:
local websocket
Execution:
suggest-only
Pairing and Authentication
Pairing should be short-lived and explicit.
Possible pairing methods:
QR Code
Best default.
cya talk
Terminal displays QR code.
Phone scans it.
Numeric Code
Fallback for terminals that cannot show QR codes.
Pairing code: 482-119
Expires in 5 minutes.
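A numeric fallback code with a five-minute lifetime might be handled as in the sketch below. It assumes the caller supplies a number from a cryptographically secure random source and passes the current time explicitly. PairingCode, formatPairingCode, and acceptsPairing are hypothetical names.

```typescript
interface PairingCode {
  code: string;        // formatted like "482-119"
  expiresAtMs: number; // epoch milliseconds
}

const PAIRING_TTL_MS = 5 * 60 * 1000; // five minutes, as in the example

// value must come from a cryptographically secure random source
// and lie in the range 0..999999.
function formatPairingCode(value: number, nowMs: number): PairingCode {
  if (!Number.isInteger(value) || value < 0 || value > 999999) {
    throw new RangeError("pairing value must be an integer in 0..999999");
  }
  const digits = String(value).padStart(6, "0");
  return {
    code: `${digits.slice(0, 3)}-${digits.slice(3)}`,
    expiresAtMs: nowMs + PAIRING_TTL_MS,
  };
}

// Accepts the code the user typed on the phone if it matches and the
// pairing window is still open. Spaces and dashes in the input are ignored.
function acceptsPairing(pairing: PairingCode, entered: string, nowMs: number): boolean {
  const normalized = entered.replace(/\D/g, "");
  const expected = pairing.code.replace("-", "");
  return nowMs < pairing.expiresAtMs && normalized === expected;
}
```

Six digits are easy to guess by brute force, so a real implementation would also want to limit the number of attempts per session.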
Known Device Trust
After initial pairing, user may mark a phone as trusted.
Even then, each terminal voice session should be explicitly activated.
Trusted device should not mean always-on access.
Session Ownership
Each voice interaction should be bound to a specific active cya session.
That session has:
- session ID;
- current working directory;
- user identity;
- host identity;
- start time;
- allowed capabilities;
- selected LLM backend;
- selected memory scope;
- execution policy.
This prevents accidental cross-talk where the phone sends a request to the wrong terminal or wrong repository.
Useful phone display:
Connected to:
tinker-base
/home/bernd/repos/can-you-assist
Mode:
suggest-only
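The session attributes listed above map naturally onto a record type, and the cross-talk guard onto a check that every incoming message names this session and comes from the paired phone. A sketch under those assumptions follows; all type, field, and method names here are hypothetical.

```typescript
type ExecutionPolicy = "suggest-only" | "allow-run" | "no-exec";

interface VoiceSession {
  sessionId: string;
  cwd: string;
  user: string;
  host: string;
  startedAtMs: number;
  capabilities: string[];
  llmBackend: string;
  memoryScope: string;
  executionPolicy: ExecutionPolicy;
}

// Guards one CLI session: only the paired phone may address it,
// and only with this session's ID.
class SessionGate {
  private readonly session: VoiceSession;
  private pairedDeviceId: string | null = null;

  constructor(session: VoiceSession) {
    this.session = session;
  }

  pair(deviceId: string): void {
    if (this.pairedDeviceId !== null && this.pairedDeviceId !== deviceId) {
      throw new Error("session is already paired with another device");
    }
    this.pairedDeviceId = deviceId;
  }

  accepts(message: { session_id: string }, deviceId: string): boolean {
    return (
      this.pairedDeviceId === deviceId &&
      message.session_id === this.session.sessionId
    );
  }
}
```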
Possible CLI Commands
Start voice session
cya talk
Start voice session for current repository
cya talk --repo
Start with local broker
cya talk --broker local
Start with relay broker
cya talk --broker relay
Start with strict safety
cya talk --suggest-only
Start with no memory
cya talk --no-memory
Start with project memory
cya talk --memory project
List active sessions
cya sessions
End current session
cya session stop
Example User Experience
Terminal:
cd ~/repos/can-you-assist
cya talk
Terminal output:
CYA voice session started for:
~/repos/can-you-assist
Mode:
suggest-only
Pair your phone:
scan QR code or enter code 913-442
Waiting for phone...
Phone:
Connected to can-you-assist on tinker-base.
Tap and speak.
User speaks:
What would be a good initial module structure for this project?
Terminal:
You asked:
"What would be a good initial module structure for this project?"
Suggested initial structure:
src/
  cli/
    main.ts
    commands/
  session/
    voice-session.ts
    pairing.ts
  context/
    collector.ts
    git.ts
    filesystem.ts
  llm/
    llm-connect-client.ts
  memory/
    phase-memory-client.ts
  safety/
    classifier.ts
    confirmation.ts
Rationale:
...
Phone:
I suggested a module structure with CLI, session, context, LLM, memory, and safety layers.
Integration With llm-connect
Speech mode should not bypass llm-connect.
The flow should be:
Phone speech
→ transcript
→ cya CLI session
→ llm-connect
→ selected LLM backend
→ cya response
→ terminal and/or phone
This keeps backend selection consistent with normal cya behavior.
The phone app should not directly call the LLM unless specifically configured for a future lightweight mode.
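The flow above can be sketched as a single CLI-side function that adds local context to the transcript and hands the prompt to the backend. The llm-connect API is not described here, so the sketch takes a stand-in completion function as a parameter. SessionContext, CompleteFn, handleTranscript, and the use of the first response line as the phone summary are all assumptions.

```typescript
interface SessionContext {
  cwd: string;
  gitStatus: string;
}

// Stand-in for whatever llm-connect exposes; its real API is not shown here.
type CompleteFn = (prompt: string) => Promise<string>;

interface BridgeReply {
  terminal_response: string;
  phone_response: string;
}

// Runs on the CLI side: the phone only ever supplies the transcript.
async function handleTranscript(
  transcript: string,
  context: SessionContext,
  complete: CompleteFn
): Promise<BridgeReply> {
  const prompt = [
    `Working directory: ${context.cwd}`,
    `Git status:\n${context.gitStatus}`,
    `User request: ${transcript}`,
  ].join("\n\n");

  const full = await complete(prompt);

  // Assumption: the first line of the answer serves as the concise summary.
  const newline = full.indexOf("\n");
  const summary = newline === -1 ? full : full.slice(0, newline);

  return { terminal_response: full, phone_response: summary };
}
```

Passing the backend in as a function keeps backend selection with the CLI session.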
Integration With phase-memory
Speech mode should use phase-memory for:
- preferred speech interaction style;
- trusted phones;
- known devices;
- user voice mode preferences;
- project-specific command conventions;
- common workflows;
- interaction summaries.
Examples of remembered preferences:
User prefers speech responses to be concise.
User wants destructive commands previewed but never executed automatically.
User usually works with git commit messages in conventional commit style.
User prefers explanations before suggested shell commands.
Memory should remain inspectable and editable.
Possible commands:
cya memory show
cya memory edit
cya memory forget voice.devices
Minimal Technical Design
A minimal first implementation could use:
CLI Side
- cya talk starts a WebSocket server or connects to a broker.
- It creates a session token.
- It renders a QR code.
- It listens for incoming text messages.
- It routes messages through the normal cya assistant pipeline.
Phone Side
A Progressive Web App may be enough initially.
The PWA can:
- scan QR codes;
- use browser speech recognition where available;
- send text over WebSocket;
- show responses.
This avoids building native iOS and Android apps immediately.
Later, native apps can provide better:
- speech recognition;
- background handling;
- push-to-talk UX;
- device trust;
- text-to-speech;
- secure local storage.
Broker Side
For the first version, choose one:
Local-only prototype
Simpler, private, no cloud.
Good for proof of concept.
Minimal relay prototype
More useful for SSH and remote development.
Better real-world fit.
A good architectural compromise:
- implement a broker interface;
- start with local broker;
- allow relay broker later;
- keep protocol stable.
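The broker interface mentioned in the compromise above might look like the following sketch. The method names are assumptions, and the in-memory class only stands in for the local broker so the routing contract can be exercised without a network. A relay-backed implementation would satisfy the same interface.

```typescript
type MessageHandler = (payload: string) => void;

// Hypothetical interface: method names are assumptions for illustration.
interface SessionBroker {
  register(sessionId: string, onMessage: MessageHandler): void;
  unregister(sessionId: string): void;
  // Returns false when no session with that ID is registered.
  deliver(sessionId: string, payload: string): boolean;
}

// In-process stand-in for the local broker.
class InMemoryBroker implements SessionBroker {
  private handlers = new Map<string, MessageHandler>();

  register(sessionId: string, onMessage: MessageHandler): void {
    if (this.handlers.has(sessionId)) {
      throw new Error(`session already registered: ${sessionId}`);
    }
    this.handlers.set(sessionId, onMessage);
  }

  unregister(sessionId: string): void {
    this.handlers.delete(sessionId);
  }

  deliver(sessionId: string, payload: string): boolean {
    const handler = this.handlers.get(sessionId);
    if (!handler) return false;
    handler(payload);
    return true;
  }
}
```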
Protocol Sketch
Phone sends:
{
"type": "user_message",
"session_id": "cya_sess_123",
"message_id": "msg_001",
"input_mode": "speech",
"transcript": "Explain the current git status and suggest a commit message.",
"confidence": 0.91
}
CLI responds:
{
"type": "assistant_response",
"session_id": "cya_sess_123",
"message_id": "msg_001",
"terminal_response": "...full response...",
"phone_response": "I found three modified files and one new file. Suggested commit message: ...",
"requires_confirmation": false
}
For risky actions:
{
"type": "assistant_response",
"session_id": "cya_sess_123",
"message_id": "msg_002",
"terminal_response": "Suggested command: rm -rf build/",
"phone_response": "This may delete files. Please confirm in the terminal.",
"requires_confirmation": true,
"confirmation_channel": "terminal"
}
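The JSON frames above translate directly into types, and the CLI should validate anything arriving from the phone before acting on it. The sketch below assumes the field set shown in the examples. The "text" input mode (for the Stage 1 text bridge) and the parseUserMessage helper are assumptions.

```typescript
interface UserMessage {
  type: "user_message";
  session_id: string;
  message_id: string;
  input_mode: "speech" | "text"; // "text" is assumed for the text bridge
  transcript: string;
  confidence?: number;
}

interface AssistantResponse {
  type: "assistant_response";
  session_id: string;
  message_id: string;
  terminal_response: string;
  phone_response: string;
  requires_confirmation: boolean;
  confirmation_channel?: "terminal";
}

// Parses and validates an incoming frame; returns null for anything malformed.
function parseUserMessage(raw: string): UserMessage | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;

  const { type, session_id, message_id, input_mode, transcript, confidence } =
    data as Record<string, unknown>;

  if (type !== "user_message") return null;
  if (typeof session_id !== "string" || typeof message_id !== "string") return null;
  if (input_mode !== "speech" && input_mode !== "text") return null;
  if (typeof transcript !== "string" || transcript.trim() === "") return null;
  if (confidence !== undefined && typeof confidence !== "number") return null;

  return { type, session_id, message_id, input_mode, transcript, confidence };
}
```

Returning null instead of throwing lets the session drop bad frames without interrupting the terminal.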
Important Product Distinction
This extension should not be framed as:
“CYA becomes a mobile assistant.”
It should be framed as:
“The phone becomes a microphone and conversation surface for the active terminal helper.”
That distinction protects the project scope.
The center remains the console.
Updated INTENT Addition
This could be added to INTENT.md under long-term direction or primary use cases:
### Speech-Assisted Console Interaction
`cya` should eventually support a speech interaction mode where a phone or other capable device can act as a microphone, speech-recognition frontend, and lightweight conversation surface for an active `cya` CLI session.
This enables voice interaction even when the terminal environment itself has no microphone, audio stack, graphical interface, or speech-recognition capability.
In this mode, the phone does not become the primary execution environment. Instead, it connects to a currently activated `cya` helper session. The CLI session remains responsible for local filesystem context, repository context, memory scope, LLM backend selection, and safety confirmation.
The phone provides convenient speech input and optional spoken output, while `cya` preserves its console-native, user-controlled architecture.
My Recommended Direction
I would treat this as a distinct but natural extension:
cya-core
  console assistant
cya-voice
  speech bridge protocol and session mode
cya-mobile
  phone app or PWA
cya-broker
  optional pairing and relay service
The first implementation should probably be:
cya talk
+ local WebSocket session
+ QR pairing
+ phone PWA
+ push-to-talk speech recognition
+ suggest-only mode
That gives you the core magic without overbuilding the system.
The deeper architectural insight is:
Speech mode should not make the terminal speak. It should let a speech-capable companion device speak to the terminal’s active assistant context.