Local-first speech-to-text CLI: capture, denoise, transcribe, post-process, clipboard.
Project description
kaiku -- Modular Voice (SST) Pipeline Prototype :)
Record speech, transcribe it, and copy the result to clipboard or a file. Supports cloud and fully local ASR backends, VAD (for background daemons) and toggle (for keyboard shortcuts), noise reduction, speaker diarization (as an ASR backend), and AI post-processing.
Originally forked from the elegant but minimalist speech to clipboard tool asr2clip which is also available from . The word "kaiku" is Finnish for "echo".
Jump to the Related projects section at the end to understand the landscape of ASR related tooling and why this project was developed for the open source community.
TL;DR
Try mock backends first (no API keys, no whisper.cpp build):
pip3 install kaiku
kaiku --download-fixtures # ~1.3 MB: demo WAVs + transcripts; prints config paths
kaiku --generate-config # create config with all backend examples
# paste fixture paths from download output into mock_devices / mock-fwd in config
kaiku --test
kaiku -d mock-jfk -b mock-fwd # both mock audio and mock transcription
Cloud (API) path:
pip3 install kaiku
kaiku --generate-config # create config with all backend examples
kaiku --edit # fill in your API key
kaiku --test # verify
kaiku # record and transcribe
Local offline path — sherpa-onnx with VAD support (model auto-downloads):
pip3 install kaiku[vad]
kaiku --download-model # download SenseVoice model on first use
kaiku --serve & # start local ASR API server
# configure a backend pointing to http://127.0.0.1:8000/v1/ — see Local ASR server below
kaiku --test -b sonnx
kaiku -b sonnx
Local offline path — whisper.cpp (no VAD):
pip3 install kaiku
# build whisper.cpp and download the models you want, then configure it in config
kaiku --generate-config # shows a wcpp backend example
kaiku --test -b wcpp
kaiku -b wcpp
CLI reference
usage: kaiku [-h] [-v] [-q] [-c FILE] [-e] [--generate-config]
[--print-config] [--test] [-x NAME] [--list-devices] [-d DEV]
[-i FILE] [-p NAME] [-b NAME] [-l LANG] [-r] [-C SEC] [-g]
[--serve] [--host HOST] [--port PORT] [--model-dir MODEL_DIR]
[--num-threads NUM_THREADS] [--download-model] [--vad]
[--interval SEC] [--silence-threshold PROB]
[--silence-duration SEC] [-s N] [-P NAME] [-M MODEL] [-o FILE]
[-T NAME] [-z]
Record audio and transcribe to clipboard using ASR API
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-q, --quiet Quiet mode — only output transcription and errors
Setup:
-c FILE, --config FILE
Path to configuration file
-e, --edit Open configuration file in editor (creates default
config if missing)
--generate-config Write config template to
~/.config/kaiku/config.yaml
--print-config Print config template to stdout
--test Test backend connectivity and configured
preprocessors, then exit
-x NAME, --preset NAME
Pipeline preset name (key under 'presets:' in config).
Presets define complete pipelines: ASR backend,
preprocessor, post-processor. Optional if
'default_preset' is set in config; CLI overrides still
work (-b, -p, -P).
Audio:
--list-devices List available audio input devices
-d DEV, --device DEV Audio input device (name, ALSA name, or index).
Overrides config.
-i FILE, --input FILE
Transcribe an existing audio or video file instead of
recording. Supported: wav, mp3, m4a, ogg, flac, aac,
opus, wma, mp4, mov, mkv, webm, avi, flv, mvi
-p NAME, --preprocessor NAME
Audio preprocessor: none, noisereduce, pyrnnoise,
deepfilter. Overrides the preprocessor in the selected
preset.
Transcription:
-b NAME, --backend NAME
ASR backend to use (key under 'asr_backends:' in
config). Overrides the backend in the selected preset.
-l LANG, --language LANG
Language hint for transcription (ISO-639-1, e.g. 'fi',
'en'). Overrides config. Omit to auto-detect.
-r, --robust Robust mode for -i file input: split at silence
boundaries, quality-check chunks, retry failures,
stream output (tail-f friendly).
-C SEC, --chunk-duration SEC
Max chunk duration in seconds for -r/--robust mode
(default: 180)
-g, --toggle Toggle recording: first call starts, second call stops
and transcribes. Designed for keyboard shortcuts.
Local ASR server:
--serve Start the local sherpa-onnx ASR API server
--host HOST Server bind address (default: 127.0.0.1 or
local_asr.host in config)
--port PORT Server bind port (default: 8000)
--model-dir MODEL_DIR
Path to ASR model directory
--num-threads NUM_THREADS
Inference threads (default: 4)
--download-model Download the SenseVoice model and exit
VAD (continuous recording):
--vad Continuous recording with voice activity detection.
Transcribes automatically when silence is detected
after speech. Requires sherpa-onnx: pip3 install
kaiku[vad].
--interval SEC Continuous recording with fixed interval (seconds)
--silence-threshold PROB
VAD speech probability threshold, 0.0-1.0 (default:
0.5)
--silence-duration SEC
Silence duration to trigger transcription (default:
1.5 s)
Diarization:
-s N, --speakers N Speaker count hint for diarization backends (type:
whisperx). Ignored by all other
backends. Selects a diarization backend in your preset
or with -b / --backend; see 'asr_backends:' in config.
If omitted, the backend uses its own default or auto-
detects speaker count.
Post-processing:
-P NAME, --post NAME AI post-processor name (key in 'postprocessors:'
config) or an inline system-prompt string. Requires
'postprocessor_backends:' in config. Overrides the
post-processor in the selected preset.
-M MODEL, --post-model MODEL
AI model used for the post-processing (f. ex. claude-
sonnet-4-6). Overrides the post-processor config for
this run.
Output:
-o FILE, --output FILE
Append transcripts to file
-T NAME, --template NAME
Output template name from 'output_templates:' in
config. Controls what is written to clipboard/-o FILE.
Overrides the template specified in the prompt
definition.
-z, --no-clipboard Do not copy the transcript (or a file path) to the
system clipboard. Stdout and -o output behave as
usual.
Examples:
kaiku --edit # create/open config in editor
kaiku --test # verify backend and preprocessors
kaiku # record, transcribe, copy to clipboard
kaiku --toggle # toggle recording (for keyboard shortcuts)
kaiku --toggle -P solo-restructure # toggle, and produce AI-structured memo
kaiku -i audio.mp3 # transcribe an existing file
kaiku -i m.mp3 -p deepfilter -r # neural denoising + chunked transcription
kaiku -i meeting.m4a -b whisperx -s 3 # speaker diarization, 3-speaker hint
kaiku --serve # start local sherpa-onnx ASR server
kaiku --vad -o meeting.txt # continuous VAD transcription to file
kaiku --interval 60 # fixed-interval continuous recording
See https://github.com/sjjsy/kaiku for full documentation and configuration examples.
Prerequisites
Python 3.8+ and one of:
- Cloud API key (OpenAI Whisper, Groq, SiliconFlow, xinference, or any OpenAI-compatible endpoint)
- Local sherpa-onnx server (
pip3 install kaiku[vad], model auto-downloads) - Local whisper.cpp binary + model file (fully offline, no key needed)
System packages
| Dependency | Purpose | Linux | macOS | Windows |
|---|---|---|---|---|
| ffmpeg | Audio format conversion | apt install ffmpeg |
brew install ffmpeg |
Download |
| PortAudio | Audio recording | apt install libportaudio2 |
brew install portaudio |
Included with sounddevice |
Optional Python extras
| Extra | Install | Purpose |
|---|---|---|
vad |
pip3 install kaiku[vad] |
VAD continuous recording + local sherpa-onnx ASR server |
deepfilter |
pip3 install kaiku[deepfilter] |
DeepFilterNet3 best-quality noise reduction |
noisereduce |
pip3 install kaiku[noisereduce] |
Spectral noise reduction (scipy) |
pyrnnoise |
pip3 install kaiku[pyrnnoise] |
RNNoise GRU noise reduction (scipy) |
enhance |
pip3 install kaiku[enhance] |
All three noise reduction options |
diarize |
pip3 install kaiku[diarize] |
Speaker diarization via WhisperX |
Installation
pip3 install kaiku
# or in an isolated environment
pipx install kaiku
# upgrade
pip3 install --upgrade kaiku
All extras: Noise reduction options + VAD and the local sherpa-onnx ASR server:
pip3 install kaiku[enhance,vad]
Note: Audio preprocessing (-p) is not (yet) applied in VAD/interval continuous mode.
From source
git clone https://github.com/Oaklight/kaiku.git
cd kaiku
pip3 install -e .
Setup
Setup commands manage your configuration file and verify that configured backends and preprocessors are working.
| Flag | Description |
|---|---|
-c FILE |
Path to a specific configuration file |
-e / --edit |
Open config in editor (creates default if missing) |
--generate-config |
Write the annotated config template to ~/.config/kaiku/config.yaml |
--print-config |
Print the config template to stdout |
--test |
Test backend connectivity and preprocessor availability, then exit |
-x NAME / --preset NAME |
Pipeline preset (key under presets:). Optional if default_preset is set in config. |
-q / --quiet |
Suppress informational output; only print the transcript and errors |
Setup commands
kaiku --generate-config # write a fully annotated config with all backend examples
kaiku --edit # create/open config in your default editor
kaiku --print-config # print the annotated template to stdout
kaiku --test # verify backend connectivity and preprocessors
kaiku --test -b wcpp # test a specific backend
Config file
Config file is created at ~/.config/kaiku/config.yaml. Locations searched in order:
./kaiku.conf~/.config/kaiku/config.yaml~/.config/kaiku.conf~/.kaiku.conf
Note: The created config file embeds partially commented-out configuration options along with brief explanations for most features.
See kaiku/kaiku.conf.example in the repo for a complete, current example with all backend and feature documentation.
The config file template is not usable immediately: You must update it based on your setup and needs.
The following sections tackle some of these in more detail where relevant.
Audio
| Flag | Description |
|---|---|
--list-devices |
List available audio input devices with names and indices |
-d DEV / --device DEV |
Audio input device (name, ALSA name, or index). Overrides audio_device in config. |
-p NAME / --preprocessor NAME |
Audio preprocessor: none, noisereduce, pyrnnoise, deepfilter. Overrides config. |
Audio device
kaiku --list-devices # list available input devices with names and indices
kaiku -d "plughw:Snowball" # use a specific ALSA device for this run
System audio routing (recommended)
On Linux, set audio_device to "pulse" or "pipewire" and select the active microphone in pavucontrol or your desktop sound settings.
audio_device: "pulse" # PulseAudio routes to whichever mic is set as default input
Targeting a specific device directly
audio_device: "plughw:Snowball" # ALSA plughw — format conversion included
audio_device: 3 # device index from --list-devices
| Value | System | Notes |
|---|---|---|
"pulse" |
PulseAudio (Linux) | Recommended; configure mic via pavucontrol |
"pipewire" |
PipeWire (Linux) | Recommended on modern Linux |
"plughw:Snowball" |
ALSA (Linux) | Direct USB mic access with format conversion |
"hw:2,0" |
ALSA (Linux) | Raw direct access, card 2 device 0 |
3 |
Any | Device index from --list-devices |
"BlackHole 2ch" |
macOS | Virtual routing device |
Audio preprocessing (noise reduction)
Audio preprocessing enhances a recording before transcription by filtering unwanted signal content. Noise reduction is a key preprocessing technique that removes background sound — café chatter, fan hum, keyboard clicks — while preserving speech intelligibility. Preprocessing is useful in noisy environments or when your ASR backend struggles with poor signal quality, producing errors or hallucinations. kaiku provides three noise reduction libraries, each with different strengths depending on noise type and available compute resources.
Available preprocessors
| Name | Technology | Dependencies | Strengths | Weaknesses |
|---|---|---|---|---|
none |
— | none (default) | No overhead; baseline for clean recordings | No noise reduction; ASR errors in noisy conditions |
noisereduce |
Spectral subtraction | scipy only | Low CPU; scipy-only install; excellent for stationary noise (hum, AC, fans); live recording friendly | Ineffective on crowd/babble noise; may remove speech texture; limited to repeating noise patterns |
pyrnnoise |
Mozilla RNNoise GRU | scipy only | Handles non-stationary noise (crowd, footsteps, babble) better than spectral; learned noise patterns; no special hardware needed | 16→48→16 kHz resampling overhead on 16 kHz audio; slower than noisereduce |
deepfilter |
DeepFilterNet3 neural | torch + Rust wheel | Best overall quality; handles mixed noise types; preserves speech naturalness and dynamics; recommended for important recordings | Highest CPU usage; heaviest dependencies; slowest option; overkill for live dictation |
Choosing noise reduction by noise type and ASR backend
ASR backend noise robustness: Different backends have built-in resilience to noise due to their training data. Whisper models (including whisper.cpp and cloud APIs using whisper-large-v3-turbo like Groq) are trained on diverse YouTube videos containing natural background noise, making them inherently robust to non-stationary sounds. SenseVoice (via sherpa-onnx) is also fairly noise-tolerant. However, all backends still benefit from preprocessing in truly noisy settings (café, open office, street noise).
Preprocessing strategy by scenario:
-
Clean or quiet environment (home office, studio): Skip preprocessing (
none) and let your backend's training handle any minor noise. Saves CPU and latency. -
Steady, repeating background noise (office AC, ceiling fan, electrical hum): Use
noisereduce. Spectral subtraction is highly effective on stationary patterns, adds minimal latency, and works well with any ASR backend. Ideal for live toggle recording. -
Variable, crowd noise (meetings, café, office chatter): Use
pyrnnoise. Its neural GRU approach learns diverse noise patterns better than spectral methods. Note the 16→48→16 kHz resampling overhead on standard 16 kHz recordings — acceptable for file transcription, less ideal for live recording. Pairs well with Whisper-based backends, which are already trained on YouTube's ambient noise. -
Highest quality regardless of noise type (important interviews, archival, transcription service): Use
deepfilter. The deepest neural processing delivers the cleanest output, but requires substantial CPU and time. Reserve this for offline file processing where latency and compute cost can be amortized.
Platform-specific guidance:
- Live toggle recording on CPU-constrained devices (
--toggle): Prefernoisereduceoverpyrnnoiseordeepfilterto minimize latency and system load. - Local
whisper.cppbackend: The backend is fast but already somewhat noise-robust; considernonein quiet settings ornoisereducefor steady background. Avoiddeepfilteron CPU systems. - Cloud API (Groq, OpenAI): Optional preprocessing for mildly noisy audio; use preprocessing aggressively for café-grade noise to maximize transcription quality.
- Speaker diarization (
-b whisperx): Preprocessing before diarization is recommended in noisy settings — speaker separation depends on clear voice boundaries, which noise obscures.
Loudness normalisation
To complete the audio enhancement after noise reduction with any of the three preprocessors, kaiku applies a loudnorm pass (RMS → −20 dBFS, peak ceiling −0.1 dBFS) to ensure the ASR backend receives a consistently strong, unclipped signal.
Installing preprocessors
pip3 install kaiku[noisereduce] # spectral subtraction
pip3 install kaiku[pyrnnoise] # RNNoise GRU
pip3 install kaiku[deepfilter] # DeepFilterNet3
pip3 install kaiku[enhance] # all three
Preprocessor configuration
The preprocessor choice is determined using the first field of each preset spec. Override it for a single run with -p NAME.
Preprocessor usage
kaiku -p deepfilter # denoise live recording with DeepFilterNet
kaiku -p noisereduce # spectral denoising
kaiku -p deepfilter -i talk.mp4 # denoise video file before transcription
kaiku --test # also checks that configured preprocessors are available
Transcription
| Flag | Description |
|---|---|
-b NAME / --backend NAME |
ASR backend to use (key under asr_backends: in config). Overrides the preset's asr_backend. |
-i FILE / --input FILE |
Transcribe an existing audio or video file instead of recording. |
-o FILE / --output FILE |
Append transcripts to file. |
-l LANG / --language LANG |
Language hint (ISO-639-1, e.g. fi, en). Overrides config. Omit to auto-detect. |
-r / --robust |
Robust mode for -i file input: split at silence boundaries, quality-check chunks, retry. |
-C SEC / --chunk-duration SEC |
Max chunk duration in seconds for -r/--robust mode (default: 180). |
-g / --toggle |
Toggle recording: first invocation starts, second stops and transcribes. |
-z / --no-clipboard |
Do not copy transcript (or file path) to the system clipboard; stdout and -o unchanged. |
kaiku # record until Ctrl+C, transcribe, copy to clipboard
kaiku -l fi # Finnish, using local whisper.cpp backend (privacy preset)
kaiku -i audio.mp3 # transcribe an existing audio file
kaiku -i meeting.mp4 # transcribe from a video file (audio extracted automatically)
kaiku -x speed -o transcript.txt # preset + append transcript to a file
Supported input formats
- Audio:
wav,mp3,m4a,ogg,flac,aac,opus,wma - Video:
mp4,mov,mkv,webm,avi,flv,mvi
Requires ffmpeg on PATH for non-WAV input. Video streams are discarded automatically; basic spectral cleaning (highpass 200 Hz + lowpass 3 kHz + loudnorm) is applied during conversion.
Language support
Use -l LANG with an ISO-639-1 code (e.g. fi, fr, de, ja) to force a specific language. Omit to auto-detect.
Whisper models distinguish between high-resource languages (English, Spanish, French, German, Portuguese, Italian, Japanese, Korean, Chinese) — which have abundant training data and near-human accuracy — and lower-resource languages with more variable results. Finnish (fi), for example, is well-handled by Whisper large-v3 and scores around 15% WER in benchmarks, which is usable for most purposes. Rare languages may require a language hint to avoid misdetection.
Backend-specific notes:
- whisper.cpp with
ggml-large-v3-turbooffers the best multilingual accuracy of any local setup and is the recommended choice for non-English. - Groq (
whisper-large-v3-turbo) provides equivalent accuracy over API at very low latency — good for live recording in any supported language. - SenseVoice (via sherpa-onnx) is exceptional for Chinese, Japanese, Korean and emotion/event detection, but its language coverage is narrower than Whisper.
- OpenAI API (
whisper-1) supports the same 99 languages as the Whisper model family. Most cloud backends default to English if no hint is provided.
File and clipboard output
-o FILE appends each transcript to the specified file, with a timestamp header prepended to each entry. The file is created if it does not exist.
Continuous and chunked modes integrate with -o differently:
- In robust/chunked mode (
-r), raw chunk text is appended to-oFILE as each chunk finishes (tail -fshows progress). After all chunks, the file is read back, optionally post-processed, formatted, and replaced with that final text only (no timestamped append). - In VAD mode (
--vad) and interval mode (--interval), each transcribed utterance is written immediately after the silence boundary triggers.
Clipboard size limit: when a transcript exceeds ~4 000 characters and -o FILE is specified, the file path is copied to clipboard instead of the full text.
ASR backends
ASR (Automatic Speech Recognition) converts spoken audio into text. kaiku supports several ASR backends, from cloud APIs to fully offline local inference. Speaker diarization backends (type: whisperx) live here too — they replace the regular transcription step and produce speaker-attributed output.
ASR backend configuration
ASR backends are defined under asr_backends: and referenced by name from presets or overridden per-run with -b NAME. kaiku --generate-config writes a fully annotated config with every supported backend type.
asr_backends:
openai:
type: api
api_base_url: "https://api.openai.com/v1/"
api_key: "YOUR_API_KEY"
model_name: "whisper-1"
groq:
type: api
api_base_url: "https://api.groq.com/openai/v1/"
api_key: "YOUR_GROQ_KEY"
model_name: "whisper-large-v3-turbo"
kaiku -b groq -i audio.wav # use groq backend for this run
kaiku --test -b openai # test a specific backend
Supported backend types
type |
Description | Requires |
|---|---|---|
api |
Any OpenAI-compatible HTTP endpoint (OpenAI, Groq, SiliconFlow, xinference, etc.) | API key or local server |
whisper_cpp |
whisper.cpp binary via subprocess | whisper.cpp build + .bin model file |
whisperx |
WhisperX speaker diarization — ASR + word alignment + speaker attribution in one pass; output: [HH:MM:SS] SPEAKER_NN: text |
pip3 install kaiku[diarize], HF token |
mock |
Fixed-response mock for testing and demos | None — no credentials needed |
mock-fwd |
Duration-proportional transcript mock (forward word order) | None |
mock-bwd |
Duration-proportional transcript mock (reverse word order) | None |
Mock backends for testing
The mock backends return transcripts without making API calls or running external processes. Useful for development, demos, and CI pipelines:
asr_backends:
demo:
type: mock
response: "The quick brown fox jumps over the lazy dog"
latency_ms: 100 # Optional: simulate network delay
mock-fwd:
type: mock-fwd
transcript_path: "test_data/group-2p-2.txt" # source for word pool
kaiku -b demo -i dummy_audio.wav # Returns mock transcript instantly
kaiku --test -b demo # No credentials needed
kaiku -i audio.wav -b mock-fwd # Duration-proportional words from transcript
Local ASR: whisper.cpp vs. sherpa-onnx
Both provide fully offline, no-API-key ASR. Here is how to choose:
| whisper.cpp | sherpa-onnx (via --serve) |
|
|---|---|---|
| What it is | C++ reimplementation of OpenAI Whisper | ONNX-runtime inference with Python bindings |
| ASR models | Whisper family (GGML quantised) | Whisper, SenseVoice, paraformer, zipformer, and more |
| Language coverage | 99 languages — full OpenAI Whisper training set | Varies by model; SenseVoice default covers ~50 languages (strongest for CJK); Whisper models via sherpa-onnx add 99 |
| Multilingual quality | Best local option for European and other non-English languages; large-v3-turbo recommended |
SenseVoice leads for Chinese, Japanese, Korean and handles emotion/event detection; weaker for many European languages |
| Python ML deps | None — single binary + model file | Yes — ONNX runtime and sherpa-onnx Python packages |
| Integration | Subprocess call to external C++ binary | Python-native; exposes a local HTTP API |
| Setup | Build C++ from source; download .bin model manually |
pip3 install kaiku[vad] + kaiku --download-model |
| Model auto-download | No | Yes |
| VAD support | No | Yes (built-in via sherpa-onnx) |
| Dev activity | Mature, stable | Very active (k2-fsa / Next-gen Kaldi team) |
When to choose whisper.cpp: Your primary language is non-English — especially European languages (Finnish, German, French, etc.) where ggml-large-v3-turbo delivers the best local accuracy of any backend. Also the right choice when you want no Python ML package dependencies: the binary and model file are self-contained, with nothing added to your Python environment.
When to choose sherpa-onnx: You are transcribing Chinese, Japanese, or Korean (SenseVoice is the stronger choice there), you want model auto-download and zero C++ build steps, you already installed kaiku[vad] (sherpa-onnx is already present), or you need VAD support or access to models beyond the Whisper family.
See whisper.cpp for build instructions and sherpa-onnx for its model zoo.
Robust long-file transcription
For long recordings, -r/--robust splits at silence boundaries, quality-checks each chunk, retries bad chunks, streams raw chunk text to -o FILE as it goes, then overwrites FILE with the post-processed, template-formatted result:
kaiku -i meeting.mp3 -r # chunked, quality-checked
kaiku -i m.mp3 -rC 60 # 60 s chunks instead of default 180
kaiku -i m.mp3 -ro transcript.txt # tail -f during chunks; final file is formatted output only
kaiku -i m.mp3 -l fi -o t.txt # fully offline, Finnish language
kaiku -i m.mp3 -rP group-restructure -T bare -o t.md # AI meeting memo without the transcript
Long transcripts often exceed the clipboard size limit; using -o FILE is recommended.
Toggle mode
Toggle mode lets you bind a single keyboard shortcut to start and stop recording. The recording runs as a background process; the second invocation stops it, transcribes, and copies to clipboard. A desktop notification is shown on start and finish (requires notify-send on Linux).
kaiku --toggle # first press: start recording in background
kaiku --toggle # second press: stop, transcribe, copy to clipboard
kaiku --toggle # toggle with fully offline transcription
kaiku --toggle -P solo-restructure # toggle → structured personal memo
Example awesome WM keybinding:
awful.key({ modkey }, "r", function()
awful.spawn("kaiku --toggle")
end)
Requirements for toggle mode to work
Toggle mode requires a POSIX system (Linux or macOS). Windows is not supported: the recorder subprocess relies on signal.pause() (POSIX-only) and SIGTERM for graceful stop-and-write; the Windows equivalents would require reimplementing IPC with Win32 named pipes or events.
Recorder backends — Toggle mode needs audio to be captured in the background, and currently the alternative recorder backends are:
| Recorder | Platform | Requirement |
|---|---|---|
sounddevice (default) |
cross-platform | already a dependency |
arecord |
Linux / ALSA only | alsa-utils system package |
From these, sounddevice is used by default.
If you want direct ALSA access, set recorder: arecord in config.
Local ASR server
kaiku can run a local OpenAI-compatible ASR API server backed by sherpa-onnx.
| Flag | Description |
|---|---|
--serve |
Start the local sherpa-onnx ASR API server |
--download-model |
Download the SenseVoice model and exit |
--host HOST |
Server bind address (default: 127.0.0.1) |
--port PORT |
Server bind port (default: 8000) |
--model-dir DIR |
Path to ASR model directory |
--num-threads N |
Inference threads (default: 4) |
pip3 install kaiku[vad]
kaiku --download-model # download SenseVoice model (~1 GB, once)
kaiku --serve # start server at 127.0.0.1:8000
Corresponding config backend:
asr_backends:
sonnx:
type: api
api_base_url: "http://127.0.0.1:8000/v1/"
model_name: "SenseVoiceSmall"
VAD (continuous recording)
VAD (Voice Activity Detection) classifies audio frames as speech or silence, enabling hands-free continuous transcription that triggers automatically at the end of each utterance.
| Flag | Description |
|---|---|
--vad |
Continuous recording with voice activity detection. Transcribes when silence is detected after speech. Requires pip3 install kaiku[vad]. |
--interval SEC |
Continuous recording with fixed interval (seconds). |
--silence-threshold PROB |
Speech probability threshold, 0.0–1.0 (default: 0.5); lower = more sensitive. |
--silence-duration SEC |
How long silence must last to trigger transcription (default: 1.5 s). |
Continuous recording modes
| Mode | Flag | Trigger | Use case |
|---|---|---|---|
| Voice Activity Detection | --vad |
Silence after speech | Meetings, dictation, any unscripted speech |
| Fixed interval | --interval SEC |
Every N seconds | Lectures, podcasts with predictable pauses |
kaiku --vad -o ~/meeting.txt # auto-transcribe when silence is detected
kaiku --interval 60 -o ~/meeting.txt # transcribe every 60 seconds
VAD requires sherpa-onnx:
pip3 install kaiku[vad]
VAD uses the Silero VAD model (~629 KB, downloads automatically on first use). No internet connection required after the first run.
Diarization
Speaker diarization attributes each spoken segment to a speaker label, producing a transcript where every turn is tagged [HH:MM:SS] SPEAKER_NN: text.
Diarization is implemented as an ASR backend (type: whisperx) — it replaces the regular transcription step entirely, so there is no double pass.
Speaker name substitution (SPEAKER_00 → real names) is intentionally left to the calling assistant or post-processor.
| Flag | Description |
|---|---|
-b whisperx |
WhisperX diarization backend (type: whisperx). |
-s N / --speakers N |
Speaker count hint for whisperx. Ignored by other backends. |
Diarization setup
pip3 install kaiku[diarize]
# Accept the pyannote licence at https://huggingface.co/pyannote/speaker-diarization-3.1
# then set your HuggingFace token:
export HF_TOKEN=hf_...
HF_TOKEN is your HuggingFace access token. You must also accept the license for pyannote/speaker-diarization-3.1 on HuggingFace before the model can be downloaded. The token is only needed on first run; the model is cached locally thereafter.
Diarization config
Define the backend among your ASR backends and reference it in a preset or pass -b whisperx:
asr_backends:
whisperx:
type: whisperx
hf_token: "hf_..." # or set HF_TOKEN env var
min_speakers: 2 # optional hint to pyannote
max_speakers: 6 # optional hint to pyannote
Diarization usage
kaiku -i meeting.m4a -b whisperx # diarize: SPEAKER_NN-attributed transcript
kaiku -i meeting.m4a -b whisperx -s 3 # hint: 3 speakers (improves accuracy)
kaiku -i meeting.m4a -b whisperx -P group # diarize + LLM meeting notes
Post-processing (with AI models)
Post-processing refines transcripts by passing them through an artificial intelligence system with custom instructions. Use this to fix transcription errors, improve grammar, condense transcripts, or restructure them into consistently formatted memos with essential information extracted. The feature is especially valuable for frequent dictators (researchers, journalists, managers) and teams with important discussions, decisions, tasks and timelines.
| Flag | Description |
|---|---|
-P NAME / --post NAME |
LLM post-processor name (key in postprocessors: config) or an inline system-prompt string. Overrides the preset's post-processor. |
-M MODEL / --post-model MODEL |
LLM model used for post-processing. Overrides the post-processor config for this run. |
-T NAME / --template NAME |
Output template name from output_templates: in config. Controls what is written to clipboard / -o FILE. |
Mock post-processor for testing
The mock post-processor analyzes prompts and transcripts without making API calls, returning linguistic statistics. Useful for testing post-processing workflows, demos, and CI pipelines without credentials:
postprocessor_backends:
mock:
type: mock
model: Claude-Opus # Required; becomes "Claude-Opus" in the signature
postprocessors:
analyze:
backend: mock
prompt: "Enhance this transcript"
The output shows analysis of both the system prompt and the input transcript:
Prompt analyzed: longest=Enhance, shortest=this, most_frequent=enhance, lines=1, words=3, chars=24
Transcript analyzed: longest=transcription, shortest=a, most_frequent=the, lines=2, words=42, chars=245
*Yours truly, Claude-Opus
*
Each analysis includes:
- longest: longest word in the text
- shortest: shortest word in the text
- most_frequent: most common word (case-insensitive)
- lines: number of lines
- words: total word count
- chars: total character count
Usage:
kaiku -P analyze -i audio.mp3 # Analyze with mock
kaiku --test -P analyze # Test without credentials
Post-processors (prompt templates)
Six post-processor specs are provided in the config template as examples; each is a starting point that requires configuration:
- Setup a
postprocessor_backends:entry (Ollama, Groq, Anthropic API, OpenAI, Claude Code, or any OpenAI-compatible endpoint) - Assign that backend to the prompt (via the
backend:field) or set a default withpostprocessor_urgent/postprocessor_casual - Update the context file list via
context_path:to help the LLM understand context if required; Delete it if no extra context neeeded.
Examples below; see the full configuration section for all available fields.
Personal dictation prompts
| Name | Purpose |
|---|---|
solo-enhance |
Improve quality, fix grammar and word choice while honoring the author's style |
solo-restructure |
Restructure a personal dictation into a structured memo with sections |
solo-private |
Like solo-restructure but defaults to a local offline model to ensure privacy |
Group discussion prompts
| Name | Purpose |
|---|---|
group-enhance |
Improve quality of group transcript while honoring each speaker's style |
group-restructure |
Restructure a group discussion into a meeting memo with summary, decisions, action items |
group-private |
Like group-restructure but defaults to a local offline model to ensure privacy |
Tips:
- To get richer and more accurate memos from meetings and debates, use a diarization backend (
-b whisperx) to attribute segments to speakers before post-processing. - Add context files (see configuration below) to help the model better connect the dots between each individual participant and the contents discussed.
Post-processor usage examples
kaiku --toggle -P solo-enhance # toggle → improved personal transcript
kaiku --toggle -P solo-restructure # toggle → structured personal memo
kaiku -i meeting.m4a -b whisperx -P group-restructure # diarize + meeting memo
kaiku --toggle -P "List action items." # inline system prompt
Supported AI backends
| Backend type | What it covers |
|---|---|
openai_compat |
Ollama (local), Groq, Anthropic API, OpenAI, any OpenAI-compatible endpoint |
claude_code |
Claude Code CLI — uses your CC session/subscription, no per-token billing |
Post-processor backend setup
postprocessor_backends:
ollama:
type: openai_compat
api_base_url: "http://localhost:11434/v1/"
api_key: "ollama"
model: "qwen3:14b"
cc:
type: claude_code
model: "claude-haiku-4-5-20251001"
Post-processor configuration
Each prompt under postprocessors: can have the following fields:
| Field | Type | Description |
|---|---|---|
prompt: |
string | System prompt text. Required unless extends: is used. |
extends: |
string | Name of another prompt to inherit from. Inherits prompt, backend, model, and context_path from parent. |
extra: |
string | Text appended to the inherited prompt (only with extends:). |
backend: |
string | Which postprocessor_backends: entry to use. Overrides inherited backend. |
model: |
string | Override the backend's default model for this prompt. |
template: |
string | Output template name from output_templates: section (default: default). Controls final output format. |
context_path: |
list of strings | File glob patterns to inject as context. Glob patterns are expanded and combined with inherited patterns when using extends:. |
Inheritance behavior: When a prompt uses extends:, it inherits all parent fields. Child fields override parent fields, except context_path: which accumulates (both parent and child patterns are expanded and combined).
Context file formatting: Files specified in context_path: are injected into the LLM prompt with an index and clear delimiters. Each file appears with its name and visual separators, making the context easily scannable for the LLM.
Example with inheritance:
postprocessors:
solo-base:
backend: groq
prompt: |
You are a professional transcript scribe ...
context_path:
- "~/.kaiku/context/personal.md" # Context file to help the LLM "read between the lines"
solo-enhance:
extends: solo-base # inherits backend, prompt
extra: |
Also improve grammar and style ...
solo-private:
extends: solo-enhance # inherits backend, prompt + extra
backend: ollama # override: use local model for privacy
context_path:
- "~/.kaiku/context/private-*.md" # accumulates with parent's context
Note: The solo-base and group-base are not intended to be used directly. Instead, they provide definitions that are shared by other single-speaker and group discussion post-processors, respectively, through inheritance.
Output templates
Output templates control what ends up in clipboard / -o FILE.
The output format depends on your template and potentially your LLM instructions — can be Markdown, ReStructuredText, plain text, or any text-based format.
Two templates are shipped; select with -T NAME:
output_templates:
raw: "{transcript}" # No post processing
bare: "{result}" # Only the AI models output
full: | # Both the AI output and the transcript on a Markdown template
{result}
---
*Transcript from {duration_s:.0f}s recording post-processed at {datetime} with kaiku ({backend}, {prompt_name}, {model})*
## Original transcript
{transcript}
kaiku -i m.mp3 -r -P group -T full # meeting notes + full transcript appended
Available placeholders: {result} {transcript} {date} {datetime} {prompt_name} {model} {backend} {duration_s}
Set postprocessor_urgent / postprocessor_casual to apply a prompt automatically for every recording without passing -P.
Presets
Presets are atomic pipeline definitions — each specifies exactly which preprocessor, ASR backend, and post-processor to use. Pick one preset per run; kaiku handles the rest.
Format: preset_name: [preprocessor, asr_backend, postprocessor, description]
- Preprocessor — audio denoising:
none,noisereduce,pyrnnoise, ordeepfilter - ASR backend — any key from
asr_backends:(including diarization backends liketype: whisperx) - Postprocessor —
noneor any key frompostprocessors: - Description — user-facing label shown by
--list/--test
All four fields are required; this keeps every processing stage explicit.
asr_backends:
groq:
type: api
api_base_url: "https://api.groq.com/openai/v1/"
api_key: "YOUR_GROQ_KEY"
model_name: "whisper-large-v3-turbo"
wcpp:
type: whisper_cpp
binary_path: "/usr/local/bin/whisper-cli"
model_path: "/models/ggml-large-v3-turbo.bin"
presets:
speed: [ none, groq, none, "Fast transcription with minimal processing"]
quality: [ deepfilter, groq, none, "High-accuracy with neural denoising"]
privacy: [ none, wcpp, none, "Fully offline, local transcription"]
balanced: [ noisereduce, groq, none, "Good balance of speed, quality, and cost"]
memo: [ noisereduce, groq, solo-enhance, "Transcription + LLM memo enhancement"]
kaiku --preset speed # record & transcribe, copied to clipboard
kaiku --preset privacy --toggle # toggle mode with fully offline transcription
kaiku --preset quality -i audio.mp3 # transcribe file
kaiku --preset speed -b wcpp # override backend for this run
kaiku --preset memo -P group-enhance # override post-processor for this run
Default preset: set default_preset: speed in config to omit --preset on every invocation. Flags -b, -p, and -P still override individual stages.
default_preset: speed
Troubleshooting
| Problem | Solution |
|---|---|
| Audio not captured | Run kaiku --list-devices and select a working device |
| Clipboard not working | Install xclip (X11) or wl-clipboard (Wayland) |
| API errors | Check your API key and endpoint in config |
| whisper.cpp errors | Run kaiku --test -b wcpp; check binary and model paths |
| Silent audio | Try a different audio device with --device |
| Video/audio format rejected | Ensure ffmpeg is installed (apt install ffmpeg / brew install ffmpeg) |
| Preprocessor not found | Run kaiku --test to see which are available and their install commands |
| Preprocessing too slow | Switch preprocessor_urgent to noisereduce or none in config |
| Post-processor not found | Check postprocessors: in config; name must match exactly |
| Post-processor backend error | Check postprocessor_backends: in config; verify API key and URL |
| Diarization fails | Ensure pip3 install kaiku[diarize], HF_TOKEN is set, pyannote licence accepted, and backend type: whisperx is in asr_backends: |
Run kaiku --test (or kaiku --test -b <name>) to diagnose issues.
Contributing
Fork the repository and submit a pull request. Any improvements or new features are welcome! :)
Testing
Development relies on a small but powerful black-box E2E suite: the real CLI runs as a subprocess and assertions cover exit codes, stdout, stderr, and files — see tests/README.md for the full strategy, log-shape notes, and scenario index.
pytest tests/ -v
Disclaimer: Not all features have been properly tested and only on a legacy Ubuntu 20.04 environment. More testing and hardening will be done by June 2026.
License
GNU Affero General Public License v3.0. See the LICENSE file for details.
Related projects
kaiku operates within a four-stage pipeline:
[Audio capture] → [ASR / transcription] → [Post-processing] → [Output: clipboard / file]
stage 1 stage 2 stage 3 stage 4
The tables below cover the ecosystem at each pipeline stage and compare competing end-user tools. kaiku covers the whole pipeline in a single powerful CLI.
Audio preprocessing (noise reduction)
Audio preprocessing cleans the signal before transcription. kaiku integrates all three libraries below as optional extras (g install kaiku[enhance]); they run in a pipeline with loudness normalisation applied after cleaning.
| Project | Technology | License | Best for | In kaiku |
|---|---|---|---|---|
| noisereduce | Spectral subtraction | MIT | Stationary noise: fans, AC, electrical hum | Yes — pip3 install kaiku[noisereduce] |
| pyrnnoise | Mozilla RNNoise GRU | GPL-3 | Non-stationary noise: crowd, babble, footsteps | Yes — pip3 install kaiku[pyrnnoise] |
| DeepFilterNet | DeepFilterNet3 neural net | MIT | Best quality overall; speech naturalness; medium CPU | Yes — pip3 install kaiku[deepfilter] |
| RNNoise | Xiph GRU | BSD | Original Mozilla RNNoise (C library) | No — pyrnnoise wraps this at the Python layer |
| SpeechBrain enhance | Encoder-decoder neural | Apache-2 | Research-grade speech separation and denoising | No — heavy ML framework dependency; not practical as a live preprocessor |
Voice Activity Detection
VAD classifies audio frames as speech or silence, enabling automatic segment boundaries without user interaction. kaiku uses Silero VAD (bundled in sherpa-onnx) for both the --vad continuous mode and the silence-split inside --robust.
| Project | License | Stars | Notes | In kaiku |
|---|---|---|---|---|
| Silero VAD | MIT | 14k+ | 629 KB model; enterprise-grade; ONNX + PyTorch; auto-downloads | Yes — via sherpa-onnx in --vad and --robust |
| WebRTC VAD | BSD | 1k+ | Google's classic GMM-based VAD; very fast, lower accuracy | No — less accurate than Silero; not integrated |
| pyannote VAD | MIT | 6k+ | Neural VAD embedded in the pyannote diarization pipeline | Indirectly — activated by the whisperx ASR backend |
ASR engines
ASR engines convert audio to text. kaiku is a frontend: it delegates transcription to an ASR backend, supporting two locally-run backends (whisper.cpp, sherpa-onnx) and any OpenAI-compatible HTTP endpoint for cloud or self-hosted services. The engines listed as integrated below are ones kaiku directly calls or supports as backends; the others are libraries or specialized tools that require custom wrappers to use.
| Project | License | Stars | Best for | In kaiku |
|---|---|---|---|---|
| OpenAI Whisper | MIT | 80k+ | Gold standard; 99 languages; most widely reproduced | Via API (whisper-1) or indirectly through sherpa-onnx and whisper.cpp |
| whisper.cpp | MIT | 80k+ | Fully offline; best CPU performance; GGML-quantised models | Yes — type: whisper_cpp backend; subprocess call |
| faster-whisper | MIT | 15k+ | 4× faster than Whisper; identical accuracy; INT8/FP16 via CTranslate2 | Not directly; used internally by WhisperX and Meetily |
| sherpa-onnx | Apache-2 | 4k+ | ONNX inference; multi-model-family; model auto-download; Python-native | Yes — --serve local server; type: api backend |
| WhisperX | BSD | 13k+ | Whisper + word-level timestamps + speaker diarization in one pipeline | Yes — type: whisperx backend (pip3 install kaiku[diarize]) |
| SenseVoice | Apache-2 | 6k+ | Emotion + language event detection; excellent CJK | Via sherpa-onnx default model; also SiliconFlow API |
| Vosk | Apache-2 | 8k+ | Lightweight; 20+ languages; embedded and low-RAM devices | No — lower accuracy than Whisper family |
| NVIDIA Parakeet TDT | Apache-2 | (NeMo) | 3 380× faster than real-time; English only; GPU | No — English-only; GPU-dependent; no multilingual support |
| SpeechBrain | Apache-2 | 9k+ | Research platform; fine-tuning; custom model training | No — research library, not a drop-in backend |
| Coqui STT | MPL-2 | 5k+ | DeepSpeech successor; trainable on custom data | No — lower quality than Whisper; limited community activity |
| Kaldi | Apache-2 | 14k+ | Enterprise/research; highly configurable; steep setup | No — complex; not a practical CLI backend |
Speaker diarization
Speaker diarization labels each segment with a speaker identity ("who said what"). kaiku treats diarization as an ASR backend (type: whisperx); select it with -b whisperx or via a preset. WhisperX handles transcription and speaker attribution in one pass; pyannote.audio is used internally for speaker embedding and clustering.
| Project | License | Stars | Notes | In kaiku |
|---|---|---|---|---|
| pyannote.audio | MIT | 6k+ | De-facto OSS standard; speaker embedding + clustering; requires HF token for model download | Via WhisperX (type: whisperx) |
| WhisperX | BSD | 13k+ | faster-whisper + word alignment + pyannote; all-in-one | Yes — type: whisperx backend (pip3 install kaiku[diarize]) |
| whisper-diarization | MIT | 2k+ | faster-whisper + pyannote script pipeline | No — WhisperX provides equivalent functionality with an active upstream |
| NVIDIA NeMo | Apache-2 | 13k+ | Fastest GPU diarization; English and enterprise focus | No — GPU-heavy; no practical CLI integration path |
Desktop audio capture and transcription tools
These are end-user tools that combine audio capture, ASR, and transcript output — the closest category to kaiku itself.
| Project | Type | Platform | License | Live | File | Toggle | VAD | Offline | Diarize | LLM post | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| kaiku (this) | CLI | Linux, macOS | AGPL-3 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Full pipeline; scriptable; multi-backend; video input |
| Turbo Whisper | GUI | Linux | MIT | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | faster-whisper-large-v3-turbo; global hotkey; no casual mode; PPA install |
| Whispering | GUI/tray | Any | MIT | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | Cross-platform (snap/exe); local or cloud API; minimal UI |
| Superwhisper | GUI | macOS, Windows, iOS | Proprietary | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ | Partial | Premium dictation app; polished UX; no Linux |
| Meetily | Desktop app | macOS, Windows | MIT | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | 11.9k★; Rust backend; Ollama summaries; no Linux |
| Screenpipe | Agent layer | Any | MIT | ✓ (always-on) | ✓ | ✗ | ✓ | ✓ | ✗ | Via MCP | 18.6k★; ambient 24/7 recording; MCP server; not a CLI tool |
Agentic voice-to-text frameworks
For AI agents a streaming architecture is often preferred: continuous listening, local wake-word detection, and real-time VAD. This allows hands-free triggering of agentic actions and conversation that feels natural. In contrast, kaiku serves as a high-precision bridge: It is currently the most complete CLI-native pipeline for when an agent (or developer) needs to process a specific audio file or a manual "push-to-talk" segment with maximum control over all the processing stages. The table below lists some of the most notable choices for AI assistants:
| Project | Primary Tech | Wake Word | VAD | Agent Integration | vs. kaiku |
|---|---|---|---|---|---|
| LiveKit Agents | Python / Rust | Optional | Silero | WebSocket / WebRTC | Full Agent Framework: High-performance streaming for voice-to-voice; kaiku is a CLI tool for text generation. |
| Wyoming Satellite | Python | open-wakeword | Silero | Wyoming Protocol | Smart Home Focus: Designed as a background daemon for Home Assistant; kaiku is a foreground productivity tool. |
| Rhasspy 3 | Modular (C++/Python) | Porcupine / Snowboy | WebRTC / Silero | MQTT / Unix Sockets | Deeply Modular: Can swap every component; kaiku is more integrated and opinionated for CLI users. |
| LocalAI | Go / C++ | Yes (via API) | Yes | OpenAI-compatible API | The Server Hub: Acts as an all-in-one local API server; kaiku acts as a client that can call such servers. |
| Whisper Mic | Python | No | Silero | Stdout / Text stream | Simple Loop: A continuous transcription script; lacks kaiku’s noise reduction, diarization, and post-processing. |
| Leon | Node.js / Python | Yes | Yes | Custom SDK / Web | Full Assistant: Includes skills, memory, and UI; kaiku is a specialized "sensor" for such an assistant. |
SaaS meeting assistants
Commercial cloud services that join calls automatically or process uploaded recordings. Included for context — these require trusting a third party with your audio.
| Service | Platform | Bot-less | Offline | Privacy | Notable | vs kaiku |
|---|---|---|---|---|---|---|
| Fathom | Web (Zoom/Meet/Teams) | No | No | Cloud (US) | Free tier; calendar-integrated; good UX | Cloud-only; no file processing; no Linux CLI |
| Jamie | macOS, Windows | Yes | No | GDPR (EU) | Best Finnish quality; bot-free desktop app | €24+/mo; macOS/Windows only |
| Fireflies.ai | Web | No | No | Cloud | 100 languages; CRM sync; "Ask Fred" AI queries | Cloud; BIPA lawsuit 2025; data outside EU |
| Granola | macOS | Yes | Partial | Cloud for AI | Calendar-integrated; note editor; bot-free | macOS only; AI processing requires cloud |
| Otter.ai | Web, mobile | No | No | Cloud (US) | Real-time captions; widely known | Class-action lawsuit 2025; weak Finnish |
| Soniox | API | — | No | Cloud (US) | Best Finnish WER (10.6%); 56 languages; developer API | API service, not an end-user tool |
| Krisp | App + SDK | Yes | Partial | Cloud for AI | Industry-leading noise suppression + transcription | Proprietary; subscription; not scriptable |
kaiku as an open source contribution
The speech-to-text tool landscape in 2026 has a sharp divide: powerful Python libraries (faster-whisper, WhisperX, pyannote) that require programming to use, and polished end-user apps (Superwhisper, Meetily, Granola) that are macOS/Windows-only or cloud-dependent. Linux users who want local, private, keyboard-shortcut-driven transcription with the full power of the Whisper ecosystem face a gap. Turbo Whisper and Whispering address the simplest dictation case but lack file transcription, noise reduction, robustness for long recordings, and any programmable post-processing. kaiku fills this gap as a single composable CLI that exposes the full four-stage pipeline without requiring the user to write any code.
Beyond Linux, kaiku's value is its scriptability and composability. Every feature — backend, preprocessor, language, diarization, post-processor, output template — is a flag or config key. This makes it naturally callable from shell scripts, Makefiles, cron jobs, and AI coding agents: one invocation covers the full audio → transcript → structured-memo pipeline that would otherwise require stitching together three or four Python libraries. The support for both local (whisper.cpp, sherpa-onnx) and cloud (OpenAI, Groq, SiliconFlow) backends with a unified interface means the same command works offline on a laptop and in a cloud pipeline on a headless server.
The most capable competing open-source project, Meetily, is architecturally similar in ambition — local-first, offline, Whisper-backed, with LLM summaries — but is a GUI-only desktop app for macOS and Windows with no Linux support and no CLI surface. Screenpipe is a different paradigm (always-on ambient capture) rather than a competing tool. This leaves kaiku as currently the most complete open-source, Linux-native, CLI-accessible speech processing pipeline — a category with no direct competition and clear utility for developers, power users, and autonomous AI agents that need to process human speech.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file kaiku-1.2.2.tar.gz.
File metadata
- Download URL: kaiku-1.2.2.tar.gz
- Upload date:
- Size: 171.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.13 {"installer":{"name":"uv","version":"0.11.13","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e234ecbdd4859f2b1bb15be4e558f68d183c9a70897b956f05553dc21f039abf
|
|
| MD5 |
bc84c898676bfb9a1768404047356664
|
|
| BLAKE2b-256 |
bd27b71614deeec514b9979e77be353dc1b2b24d3d190d6f50dc3aa23c6bc073
|
File details
Details for the file kaiku-1.2.2-py3-none-any.whl.
File metadata
- Download URL: kaiku-1.2.2-py3-none-any.whl
- Upload date:
- Size: 144.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.13 {"installer":{"name":"uv","version":"0.11.13","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
fed3128d099300e3d87992f3f8ffe899e8d50285befd82c723d6375ecc07c510
|
|
| MD5 |
4c918f39aab2330eb4ce74721e7fcabc
|
|
| BLAKE2b-256 |
5b89d908ff67ae4b124703d107b6c5b67bde9aa6f56947956091e6e813fa0c70
|