Skip to main content Link Menu Expand (external link) Document Search Copy Copied

日本語

SpeechDock — Advanced Features

This page covers features that require API keys from cloud providers. These are optional enhancements — SpeechDock works fully with macOS native STT/TTS without any API keys.

API Key Setup

API Keys Settings
Settings — API Keys tab for configuring cloud provider credentials

To use cloud providers, configure API keys in Settings > API Keys:

Provider Get API Key Environment Variable
OpenAI OpenAI Platform OPENAI_API_KEY
Google Gemini Google AI Studio GEMINI_API_KEY
ElevenLabs ElevenLabs Settings ELEVENLABS_API_KEY
Grok (xAI) xAI Console GROK_API_KEY

API keys are securely stored in macOS Keychain. Alternatively, you can set environment variables for development.

Cloud STT Providers

Cloud providers offer higher accuracy, more language support, and specialized features compared to macOS native STT.

Provider Models Features
OpenAI GPT-4o Transcribe, GPT-4o Mini Transcribe, Whisper High accuracy, 100+ languages
Google Gemini Gemini 2.5 Flash Native Audio, Gemini 2.0 Flash Live Multimodal, fast
ElevenLabs Scribe v2 Realtime Low latency, natural punctuation
Grok Grok 2 xAI’s realtime transcription

Select the provider in Settings > Speech-to-Text.

Cloud TTS Providers

Cloud TTS provides natural-sounding voices with various styles and languages.

Provider Models Voices
OpenAI GPT-4o Mini TTS, TTS-1, TTS-1 HD alloy, echo, fable, onyx, nova, shimmer
Google Gemini Gemini 2.5 Flash TTS, Gemini 2.5 Pro TTS Multilingual voices
ElevenLabs Eleven v3, Eleven Flash v2.5, Eleven Multilingual v2, Eleven Turbo v2.5 Large voice library
Grok Grok 2 Clio, Sage, Charon, Fenrir, Leda

Voice and Model Selection

Each provider offers different voices and models. Select them in:

  • Settings > Text-to-Speech (persistent setting)
  • TTS Panel header (quick switch)

Audio Output Device

Route TTS playback to any audio output device (speakers, headphones, virtual devices). Select in Settings > Text-to-Speech or the TTS panel.

Audio File Transcription

File Transcription
File Transcription — Drag and drop audio files to transcribe

Transcribe pre-recorded audio files. Available with cloud STT providers and macOS native (macOS 26+). Not available with Grok provider.

Provider Formats Max Size Max Duration API
macOS (26+) MP3, WAV, M4A, AAC, AIFF, FLAC, MP4 500 MB No limit SpeechAnalyzer (offline)
OpenAI MP3, WAV, M4A, FLAC, WebM, MP4 25 MB Unlimited Whisper
Gemini MP3, WAV, AAC, OGG, FLAC 20 MB ~10 min generateContent
ElevenLabs MP3, WAV, M4A, OGG, FLAC 25 MB ~2 hours Scribe v2

Note: macOS native file transcription requires macOS 26 or later and processes audio entirely on-device — no API key or internet connection needed.

How to Transcribe

Drag & Drop: Drag an audio file onto the STT panel’s text area.

Menu Bar: Select Transcribe Audio File… from the SpeechDock menu bar.

The STT panel placeholder displays the supported formats and limits for the currently selected provider.

Translation with External Providers

While macOS on-device translation supports ~18 languages, cloud providers offer:

  • 25+ languages (all languages in the language list)
  • Higher translation quality using LLMs
  • Works on macOS 14+ (no macOS 26 requirement)

Translation Providers and Models

Provider Models Notes
macOS (default) System On-device, no API key, macOS 26+
OpenAI GPT-5 Nano (default), GPT-5 Mini, GPT-5.2 Fast, high quality
Gemini Gemini 3 Flash (default), Gemini 3 Pro Fast, multilingual
Grok Grok 3 Fast (default), Grok 3 Mini Fast Fast translation

Switching Translation Provider

  • Settings > Translation: Set the default provider and model
  • Panel: Click the button next to the translation controls for quick switching

Provider Auto-Sync

When you switch STT or TTS providers, the translation provider automatically syncs:

STT/TTS Provider Translation Provider
OpenAI OpenAI
Gemini Gemini
Grok Grok
ElevenLabs / macOS macOS

Subtitle Real-time Translation

When using subtitle mode, you can enable real-time translation that translates speech as you speak. This works with all audio sources (microphone, system audio, app audio).

How It Works

  1. Enable subtitle mode (Ctrl + Option + S)
  2. Click the globe icon (🌐) in the subtitle header to enable translation
  3. Select your target language and translation provider
  4. Start recording — translations appear in real-time

Translation Providers for Subtitles

Provider Debounce Best For
macOS 300ms Fast, local, privacy-focused
OpenAI 800ms High quality, many languages
Gemini 600ms Good balance of speed and quality
Grok 800ms Fast translation

Note: Subtitle translation uses the provider’s default model for optimal performance. This is independent of the model selected in the panel translation settings.

Features

  • Caching — Repeated phrases are translated instantly from cache (up to 200 entries)
  • Context-aware — LLM providers use recent sentences as context for better translations
  • Pause detection — Automatically triggers translation after 1.5 seconds of silence
  • Settings sync — Translation settings sync from the STT panel when subtitle mode starts

Limitations

  • Translation adds some latency compared to transcription-only mode
  • Cloud providers require API keys and internet connection
  • macOS provider requires macOS 26+ and downloaded language packs

Language Selection

Both STT and TTS support language selection with all cloud providers:

  • Auto (default): Automatically detects the spoken/target language
  • Manual: Choose from 25+ supported languages

Available languages: English, Japanese, Chinese, Korean, Spanish, French, German, Italian, Portuguese, Russian, Arabic, Hindi, Dutch, Polish, Turkish, Indonesian, Vietnamese, Thai, Bengali, Gujarati, Kannada, Malayalam, Marathi, Tamil, Telugu.

TTS Speed Control (Save Audio)

When saving audio to a file, speed is controlled differently from real-time playback:

Provider Parameter Range Notes
OpenAI speed 0.25–4.0 TTS-1/TTS-1 HD only
ElevenLabs voice_settings.speed 0.7–1.2 Mapped from app range
Gemini Text instruction N/A Natural language pace instruction
macOS Words per minute 50–500 Based on 175 wpm baseline
Grok Speed parameter not supported

For real-time playback, speed is always controlled locally via audio processing, allowing dynamic adjustment during playback.

Privacy Considerations

When using cloud providers:

  • Audio data is sent to the respective provider’s API for processing
  • Each provider has its own privacy policy and data retention rules
  • For maximum privacy, use macOS native providers (all processing on-device)
  • API keys are stored in macOS Keychain and never shared between providers