Audio-to-Markdown transcription optimized for AI consumption

These details have not been verified by PyPI

Project description

Audium

🎧 Audio → AI‑optimized Markdown
_{Transcribe MP3/WAV/FLAC into clean, token‑efficient Markdown — ready for any LLM.}

English · Русский · 中文

✨ Why Audium?

Feed audio to an LLM. Get answers. Simple.

But raw transcripts burn tokens on noise: long timestamps, filler words, silent segments, markup that adds nothing.

Audium turns speech into the minimum viable Markdown: every character counts, nothing wasted.

🎯	⚡	🪙	👁️	🌍
5 formats	GPU‑accelerated	Token‑aware	Watch + URL + GUI	~97 languages
compact, minimal, structured, srt, vtt	2–10× real‑time on CUDA	`[MM:SS]` + VAD + filler‑strip	files, URLs, desktop, REST API	tiny to large‑v3 + turbo

📦 Install

Requires ffmpeg: sudo apt install ffmpeg / brew install ffmpeg

Recommended: pipx (isolated, no conflicts)

pipx install audium-md

pipx creates its own virtual environment — works on Ubuntu/Debian without PEP 668 errors. Install pipx first: sudo apt install pipx or python3 -m pip install --user pipx

Alternative: uv tool (fastest)

uv tool install audium-md

Fallback: pip with override

pip install audium-md --break-system-packages

Local development

git clone https://github.com/Tamukj/Audium.git
cd Audium
python3 -m venv .venv
source .venv/bin/activate
pip install -e .

🚀 Quick Start

# Process a folder
audium run ./my-recordings/

# Single file
audium run lecture.mp3

# Watch folder — auto‑transcribe new files
audium watch ./incoming/

# See what you've transcribed
audium list

# Change model
audium config set model large-v3

📝 Formats

compact (default)

# lecture.mp3 (01:23:45)

[00:00] Neural networks learn hierarchical representations
[00:04] Each layer detects increasingly abstract features
[00:08] Early layers find edges and textures
[00:12] Later layers detect objects and scenes

minimal

Neural networks learn hierarchical representations
Each layer detects increasingly abstract features
Early layers find edges and textures
Later layers detect objects and scenes

structured (requires speaker diarization)

# interview.mp3 (00:45:12)

## Alice [00:00-00:30]
Neural networks are a powerful tool. It's important to understand their limitations.

## Bob [00:30-01:15]
I completely agree. Let me walk through an example to make this concrete.

srt (SubRip subtitles)

1
00:00:00,000 --> 00:00:04,000
Neural networks learn hierarchical representations

2
00:00:04,000 --> 00:00:08,000
Each layer detects increasingly abstract features

vtt (WebVTT — HTML5 & browser)

WEBVTT

00:00:00.000 --> 00:00:04.000
Neural networks learn hierarchical representations

00:00:04.000 --> 00:00:08.000
Each layer detects increasingly abstract features

⚙️ Commands

Command	Description
`audium run <path>`	Transcribe audio files or folders
`audium watch <path>`	Watch folder and auto‑process new files
`audium list [dir]`	Show processed transcripts with file sizes
`audium config`	Show current configuration
`audium config set <key> <value>`	Change a setting
`audium config reset`	Reset to factory defaults
`audium config path`	Show config file location
`audium tui [dir]`	Interactive terminal UI with file browser
`audium desktop`	Launch the desktop GUI (Flet)
`audium serve`	Start REST API server (FastAPI)

Common flags for `run` and `watch`

Flag	Default	Description
`-o, --output-dir`	`./transcripts`	Where to save .md files
`-f, --format`	`compact`	`compact` / `minimal` / `structured`
`-r, --recursive`	off	Search subdirectories
`--model`	`small`	`tiny` / `base` / `small` / `medium` / `large-v3`
`--language`	`auto`	Force language code: `ru`, `en`, `zh`, ...
`--strip-fillers`	off	Remove "um", "uh", "like", "мм", "ээ", etc.
`--no-vad`	off	Disable voice activity detection
`--no-progress`	off	Hide the progress bar
`--vocabulary`	—	Custom words to bias Whisper (e.g. `"RAG,LoRA"`)
`--translate <lang>`	—	Translate output to target language (`ru`, `zh`, ...)
`--ask <prompt>`	—	Send transcript to GPT/Claude for summarization
`--workers <N>`	1	Parallel transcription across N threads
`--diarize`	off	Speaker diarization via pyannote (needs HF_TOKEN)
`--chapters`	off	Auto-detect topic changes and add headings
`--api-key`	env	OpenAI-compatible API key for `--ask`

🔧 Configuration

Settings are merged: CLI flags > .audium.yaml (project) > ~/.config/audium/config.yaml > defaults

# Show current config
audium config

# Set a value
audium config set model large-v3
audium config set strip_fillers true
audium config set output_dir ~/Documents/transcripts

# Also works as a shorthand:
echo "audium config model large-v3" → now supported!

# Reset to factory defaults
audium config reset

# Show config file path
audium config path

All Settings

audium config

Output showing current values with accepted options in parentheses:

  beam_size: 5             (integer 1-20)
  compute_type: auto       (auto, float16, int8_float16, int8)
  device: cuda             (cuda, cpu)
  format: compact          (compact, minimal, structured)
  language: auto           (e.g. auto, ru, en, zh, ...)
  min_segment_duration: 0.0  (float, seconds)
  model: small             (tiny, base, small, medium, large-v3, turbo)
  output_dir: ./transcripts  (path)
  recursive: false         (true / false)
  strip_fillers: false     (true / false)
  vad_filter: true         (true / false)

Setting Reference

Key	Default	Description	Options
`model`	`small`	Whisper model size	tiny, base, small, medium, large-v3, turbo
`device`	`auto`	Computation device (auto-detect)	auto, cuda, cpu
`compute_type`	`auto`	Precision for GPU inference	auto, float16, int8_float16, int8
`format`	`compact`	Output Markdown format	compact, minimal, structured
`language`	`auto`	Source language	auto, or any ISO code (ru, en, zh, ...)
`beam_size`	`5`	Beam search width	integer (1-20)
`output_dir`	`./transcripts`	Where .md files are saved	any path
`strip_fillers`	`false`	Remove filler words	true / false
`vad_filter`	`true`	Voice Activity Detection	true / false
`min_segment_duration`	`0.0`	Skip segments shorter than N seconds	float
`recursive`	`false`	Scan subdirectories	true / false

compute_type auto-detection: On GPUs with compute capability ≥ 7.0 (Volta+), float16 is used for best performance. On Pascal GPUs (GTX 10xx), int8_float16 is used. On CPU, int8 is used.

Local config file

Create .audium.yaml in your project root to override defaults per-project:

model: medium
language: ru
format: minimal
output_dir: ./transcripts

First run — model download

On first use, Audium downloads the Whisper model from HuggingFace Hub (~500 MB for small).
The model is cached locally in ~/.cache/huggingface/hub/ — subsequent runs are instant and fully offline.

Why HuggingFace? The models are too large (~500 MB–3 GB) to bundle in a pip package or GitHub repo. They're downloaded once, then cached forever.

Is this legal? Yes. All components are MIT licensed: Whisper (OpenAI), faster-whisper, CTranslate2. Free for personal and commercial use.

🪙 Token Optimization

Audium is built to minimize LLM token cost:

Technique	Savings
`[MM:SS]` instead of `[HH:MM:SS.mmm]`	~30% on timestamps
VAD filtering (skip silence)	15–40% on meeting recordings
Filler‑word stripping	5–10% on conversational speech
`min_segment_duration` threshold	skip noise fragments
One line per segment, no blank lines	~8% vs paragraph output

📊 Model Sizes

Model	Parameters	Speed (GPU)	Best for
tiny	39M	~32× real‑time	Quick drafts, low‑resource
base	74M	~16× real‑time	Dictation, clean audio
small	244M	~6× real‑time	General purpose
medium	769M	~2× real‑time	Accents, noisy audio
large‑v3	1.5B	~1× real‑time	Maximum accuracy

All multilingual models support the same ~97 languages. The size trades accuracy for speed.

🚀 Features

YouTube & URL support

Transcribe any YouTube video, podcast, or audio URL directly:

pip install audium-md[yt]
audium run "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

Works with 1,700+ sites via yt-dlp (YouTube, Twitter/X, Vimeo, TikTok, ...).

AI Chapter detection

Auto-detect topic changes with pure math — no external API, instant:

audium run lecture.mp3 --chapters

Adds a clickable chapter index at the top of the transcript:

## 📑 Chapters

1. `[00:00]` Introduction · Neural · Networks
2. `[15:30]` Gradient · Descent · Optimization
3. `[42:00]` Transformers · Attention · Mechanism

LLM-ready output

Post-process directly with GPT, Claude, or any OpenAI-compatible API:

export OPENAI_API_KEY=sk-...
audium run podcast.mp3 --ask "summarize key takeaways in 3 bullet points"

Output includes both the transcript and the LLM response. Works with any OpenAI-compatible endpoint (--api-base http://localhost:11434/v1 for Ollama).

Translate

Transcribe in one language, output in another:

audium run ru-lecture.mp3 --translate en

Speaker diarization

Identify who said what:

pip install audium-md[diarize]
export HF_TOKEN=hf_...
audium run meeting.mp3 --diarize -f structured

🖥️ Desktop GUI

A modern dark-themed desktop app with drag-and-drop, YouTube URL input, and all settings exposed:

pip install audium-md[desktop]
audium desktop

Platform	Install	Run
Windows	`pip install audium-md[desktop]`	`audium desktop`
Linux	`pip install audium-md[desktop]`	`audium desktop`
macOS	`pip install audium-md[desktop]`	`audium desktop`

Bundle as standalone EXE (Windows users — no Python required):

pip install pyinstaller
python scripts/build_desktop.py
# → dist/Audium/Audium.exe  (~150 MB, self-contained)

Terminal UI

Interactive file browser with live preview and keyboard shortcuts:

pip install audium-md[tui]
audium tui ./recordings/

REST API

Run Audium as a server and integrate with any application:

pip install audium-md[server]
audium serve --port 8080

curl -X POST http://localhost:8080/transcribe \
  -F "file=@meeting.mp3" \
  -F "format=compact"
# → {"content": "# meeting.mp3 (00:15:30)\n[00:00] ...", ...}

OpenAPI docs at http://localhost:8080/docs — try it in the browser.

🖥️ GPU Support

Audium automatically detects your GPU and configures itself:

Hardware	Detection	Backend
NVIDIA (all)	`nvidia-smi`	CUDA — best performance
AMD (ROCm)	`/dev/kfd` + `rocm-smi`	ROCm / HIP
Intel (Arc, Iris)	`xpu-smi` / drm	oneAPI / SYCL
CPU only	fallback	int8 quantized

No manual configuration needed. Run audium run ./audio/ and it just works.

Updating

pip install --upgrade audium-md
# or: pipx upgrade audium-md
# or: uv tool upgrade audium-md

Check your current version:

audium --version

📄 License

MIT — do whatever you want. Attribution appreciated.

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.2.6

Jul 1, 2026

0.2.5

Jul 1, 2026

0.2.4

Jul 1, 2026

0.2.3

Jul 1, 2026

0.2.2

Jul 1, 2026

0.2.1

Jul 1, 2026

This version

0.2.0

Jul 1, 2026

0.1.4

Jul 1, 2026

0.1.3

Jul 1, 2026

0.1.2

Jul 1, 2026

0.1.1

Jul 1, 2026

0.1.0

Jul 1, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

audium_md-0.2.0.tar.gz (44.1 kB view details)

Uploaded Jul 1, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

audium_md-0.2.0-py3-none-any.whl (38.1 kB view details)

Uploaded Jul 1, 2026 Python 3

File details

Details for the file audium_md-0.2.0.tar.gz.

File metadata

Download URL: audium_md-0.2.0.tar.gz
Upload date: Jul 1, 2026
Size: 44.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.21 {"installer":{"name":"uv","version":"0.11.21","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Zorin OS","version":"18","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for audium_md-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`66addde7044ef073fea1624febd7436fcb489d22f694e0acc3f51fcabcf14bb7`
MD5	`607d6cdabff28aba979fbcccab2d95c9`
BLAKE2b-256	`e8f89e4ae0aaefadfd253f816f664e55968e18c40337daefd309e033e5dc35f8`

See more details on using hashes here.

File details

Details for the file audium_md-0.2.0-py3-none-any.whl.

File metadata

Download URL: audium_md-0.2.0-py3-none-any.whl
Upload date: Jul 1, 2026
Size: 38.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.21 {"installer":{"name":"uv","version":"0.11.21","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Zorin OS","version":"18","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for audium_md-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`94ee12428e8c3f2dcaa28b14d47a2f162b0e3e19372e154f07fd2fccefe2c40c`
MD5	`6281ef35a7c4d13349be65b8ff7535b7`
BLAKE2b-256	`40af80f1c25c4d74ca94ae3b00475d54449affac00ec63962cf52ac49e91836e`

See more details on using hashes here.

audium-md 0.2.0

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

Audium

✨ Why Audium?

📦 Install

Recommended: pipx (isolated, no conflicts)

Alternative: uv tool (fastest)

Fallback: pip with override

Local development

🚀 Quick Start

📝 Formats

compact (default)

minimal

structured (requires speaker diarization)

srt (SubRip subtitles)

vtt (WebVTT — HTML5 & browser)

⚙️ Commands

Common flags for run and watch

🔧 Configuration

All Settings

Setting Reference

Local config file

First run — model download

🪙 Token Optimization

📊 Model Sizes

🚀 Features

YouTube & URL support

AI Chapter detection

LLM-ready output

Translate

Speaker diarization

🖥️ Desktop GUI

Terminal UI

REST API

🖥️ GPU Support

Updating

📄 License

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

Common flags for `run` and `watch`