Voice Acoustic Analyzer - Professional audio metrics extraction
Audio Metrics CLI v4
Industrial-Grade Speech Deep Analysis Platform
v4.0 architecture: GPU acceleration, chunked processing, and Pydantic-validated output
🇨🇳 Users in China - read this before first use
One-click model download (recommended)
Windows users: double-click the download_models.bat script
Or run the steps manually (PowerShell):
$env:HF_ENDPOINT = "https://hf-mirror.com"
pip install huggingface-hub openai-whisper -i https://pypi.tuna.tsinghua.edu.cn/simple
cd C:\Users\clawbot\.cache\torch\hub
git clone https://ghproxy.com/https://github.com/snakers4/silero-vad.git silero-vad_master
huggingface-cli download pyannote/speaker-diarization-3.1 --local-dir "C:\Users\clawbot\.cache\huggingface\hub\models--pyannote--speaker-diarization-3.1"
python -c "import whisper; whisper.load_model('base')"
For details, see docs/MODEL_DEPENDENCIES.md
Quick Start
# Install from PyPI (recommended)
pip install audio-metrics-cli
# Full V4 analysis with GPU auto-detection
audio-metrics analyze audio.wav -o result.json
# Specify device manually
audio-metrics analyze audio.wav -d cuda -o result.json
# Long audio (>1h) with custom chunk size
audio-metrics analyze audio.wav -o result.json --chunk-size 900 --show-timings
GPU Acceleration
V4 auto-detects NVIDIA GPUs and runs Whisper + pyannote.audio on CUDA:
audio-metrics analyze audio.wav -d auto # GPU if available, else CPU
audio-metrics analyze audio.wav -d cuda # Force GPU
audio-metrics analyze audio.wav -d cpu # Force CPU
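Device resolution of this kind is usually a thin wrapper around `torch.cuda.is_available()`. The sketch below is a hypothetical illustration of the `-d auto|cuda|cpu` behavior described above; the function name `resolve_device` is invented for illustration and is not the tool's actual API:

```python
def resolve_device(requested: str = "auto") -> str:
    """Map the CLI -d flag to a concrete device string (hypothetical sketch)."""
    try:
        import torch
        cuda_ok = torch.cuda.is_available()
    except ImportError:
        # torch missing entirely: only CPU inference is possible
        cuda_ok = False
    if requested == "auto":
        return "cuda" if cuda_ok else "cpu"
    if requested == "cuda" and not cuda_ok:
        raise RuntimeError("CUDA requested but no GPU is available")
    return requested
```

With `-d auto`, a missing or CPU-only torch install silently degrades to CPU, while `-d cuda` fails loudly, which matches the "force" semantics above.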
Architecture v4
┌──────────────────────────────────────────────────┐
│ CLI Layer (main_cli.py)                          │
│ analyze | analyze-multi | voice-acoustic | serve │
└──────────────────────────────────────────────┬───┘
                                               │
┌──────────────────────────────────────────────┴───┐
│ V4 Pipeline (v4/pipeline.py)                     │
│ DeviceManager → AudioHealth → Chunker → Analyzer │
└──────────────────────────────────────────────┬───┘
                                               │
┌──────────────────────────────────────────────┴───┐
│ Pydantic Schema (v4/schemas.py)                  │
│ V4Result → SegmentModel → SpeakerModel → NER     │
└──────────────────────────────────────────────────┘
Key Features
| Feature | Description |
|---|---|
| GPU Auto-Detection | Automatic CUDA detection for Whisper + pyannote.audio |
| Chunked Processing | Handles 1h+ audio without OOM (1800s chunks, 60s overlap) |
| Word-Level Alignment | Precise timestamp alignment (replaces seg_duration*5 estimation) |
| 30+ Prosody Metrics | Pitch, energy, spectral, voice quality, speech rate per segment |
| Fluency Analysis | Chinese filler word detection (e.g. 嗯/那个) + unnatural pause detection |
| NER | spaCy-based named entity recognition (commercial entities, persons, locations) |
| Topic Segmentation | Semantic topic chapters with Jaccard keyword similarity |
| Sentiment & Key Points | TextBlob/snownlp sentiment scoring, automatic key point detection |
| Pydantic Validation | All outputs validated against strict schema (100% constraint enforcement) |
| tqdm Progress Bars | Real-time feedback on VAD, Diarization, STT, metrics extraction |
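The chunked-processing row above (default 1800 s chunks with 60 s overlap) can be illustrated with a small window generator. This is a hypothetical sketch of the windowing arithmetic only, not the actual `core/chunker.py` implementation:

```python
def chunk_bounds(total_s: float, chunk_s: float = 1800.0, overlap_s: float = 60.0):
    """Yield (start, end) windows covering total_s seconds.

    Consecutive windows share overlap_s seconds so segments cut at a
    chunk boundary can be re-merged without losing words.
    """
    step = chunk_s - overlap_s
    start = 0.0
    while start < total_s:
        yield (start, min(start + chunk_s, total_s))
        if start + chunk_s >= total_s:
            break
        start += step

# A one-hour file with defaults yields 3 overlapping windows:
# (0, 1800), (1740, 3540), (3480, 3600)
bounds = list(chunk_bounds(3600.0))
```

Audio shorter than one chunk produces a single window, so short files pay no chunking overhead.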
CLI Commands
analyze - V4 Full Analysis
Single audio file → V4 pipeline with full feature set.
audio-metrics analyze AUDIO_FILE [OPTIONS]
Options:
-o, --output PATH Output JSON file path
-d, --device [auto|cuda|cpu] Device for inference (default: auto)
-m, --model TEXT Whisper model (tiny/base/small/medium/large)
--num-speakers INTEGER Number of speakers (if known)
--min-speakers INTEGER Minimum number of speakers
--max-speakers INTEGER Maximum number of speakers
--language TEXT Language code (auto-detect if not specified)
--chunk-size INTEGER Chunk size in seconds for long audio (default: 1800)
--no-emotion Skip emotion analysis
--no-progress Disable tqdm progress bars
--show-timings Show step timing information
--show-progress Show progress bars
-f, --format [json|csv|html] Output format (default: json)
--parallel Use parallel processing (batch mode)
--batch PATH Process all audio files in directory
--glob TEXT Glob pattern for batch processing
-j, --workers INTEGER Number of parallel workers
-v, --verbose Verbose output
Examples:
audio-metrics analyze meeting.wav -o result.json
audio-metrics analyze meeting.wav -d cuda -o result.json --show-timings
audio-metrics analyze long_recording.wav --chunk-size 900 --language zh
analyze-multi - Multi-Speaker Conversation
audio-metrics analyze-multi AUDIO_FILE [OPTIONS]
voice-acoustic - Acoustic Features Only
audio-metrics voice-acoustic AUDIO_FILE [OPTIONS]
transcribe - Whisper Transcription Only
audio-metrics transcribe AUDIO_FILE [-o OUTPUT] [-m MODEL] [--language LANG]
compare - Compare Two Audio Files
audio-metrics compare FILE1 FILE2 [--format text|json|markdown]
serve - Start API Server
audio-metrics serve [--host HOST] [-p PORT] [--reload]
V4 Output Schema
All outputs are Pydantic-validated JSON with strict constraints.
Top-Level Structure
{
"meta": {
"version": "4.0.0",
"device_used": "cuda",
"chunked_processing": false,
"analysis_complete": true
},
"audio": { ... },
"speakers": [ ... ],
"segments": [ ... ],
"prosody": { ... },
"fluency": { ... },
"conversation_dynamics": { ... },
"vad": { ... },
"emotion": { ... },
"named_entities": { ... },
"topic_segments": [ ... ],
"transcript_text": "...",
"transcript_language": "zh"
}
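A minimal sketch of how the `meta` block above could be validated with Pydantic v2. Field names follow the JSON sample; the real, stricter models live in `v4/schemas.py`:

```python
from pydantic import BaseModel

class Meta(BaseModel):
    """Illustrative model for the 'meta' object shown above."""
    version: str
    device_used: str
    chunked_processing: bool
    analysis_complete: bool

# model_validate raises ValidationError on missing or mistyped fields,
# which is what "100% constraint enforcement" relies on.
meta = Meta.model_validate({
    "version": "4.0.0",
    "device_used": "cuda",
    "chunked_processing": False,
    "analysis_complete": True,
})
```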
Segment Detail (Core Output Unit)
{
"segment_index": 0,
"start": 0.0,
"end": 15.234,
"duration": 15.234,
"confidence": 0.95,
"speaker": "speaker_0",
"text": "ไปๅคฉๆไปฌ่ฎจ่ฎบไธไธAibee้กน็ฎ็่ฟๅฑๆ
ๅตใไธ่ฑกๅ็้กน็ฎๅทฒ็ป่ฟๅ
ฅ็ฌฌไธๆใ",
"pitch_mean_hz": 175.3,
"energy_mean": 0.0245,
"speech_rate_wpm": 150.2,
"filler_words": { "้ฃไธช": 2, "ๅฏ": 1 },
"sentiment_score": 0.3,
"is_key_point": true,
"named_entities": ["Aibee", "ไธ่ฑกๅ"],
"topic": "project_update"
}
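Per-segment filler counts like the `filler_words` map above can be produced by simple substring counting over a filler lexicon. This is a hypothetical sketch; the real detector lives in `analyzers/filler_detector.py`, and the lexicon below is an assumption:

```python
# Assumed filler lexicon for illustration (common Mandarin fillers).
FILLERS = ["那个", "嗯"]

def count_fillers(text: str) -> dict:
    """Return {filler: count} for fillers that occur in the segment text."""
    return {f: text.count(f) for f in FILLERS if f in text}

counts = count_fillers("那个，我们那个再看一下，嗯。")
```

A production detector would also need context rules, since "那个" is a filler only when it is not acting as an ordinary demonstrative.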
Named Entities
{
"total_entities": 7,
"commercial_entities": ["Aibee", "ไธ่ฑกๅ", "ไธญๆตทๅฐไบง", "ไฟๅฉ", "SKP"],
"persons": ["ๅผ ไธ"],
"organizations": ["Aibee", "ไธญๆตทๅฐไบง"]
}
Topic Segmentation
{
"num_topics": 3,
"topics": [
{ "start": 0.0, "end": 1200.0, "topic_label": "project_update", "keywords": ["้กน็ฎ", "่ฟๅบฆ", "Aibee", "ไธ่ฑกๅ"], "confidence": 0.85 },
{ "start": 1200.0, "end": 2400.0, "topic_label": "planning", "keywords": ["่ฎกๅ", "็ฎๆ ", "็ญ็ฅ"], "confidence": 0.78 }
]
}
See standard_v4_sample.json for full reference.
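Topic segmentation by "Jaccard keyword similarity" compares the keyword sets of adjacent text windows; a similarity below some threshold starts a new chapter. A minimal sketch, where the threshold value and the keyword lists are illustrative assumptions:

```python
def jaccard(a, b) -> float:
    """Jaccard similarity of two keyword lists: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # two empty windows are trivially identical
    return len(sa & sb) / len(sa | sb)

# Adjacent windows sharing few keywords suggest a topic boundary.
prev_keywords = ["项目", "进度", "Aibee", "万象城"]
next_keywords = ["计划", "目标", "策略"]
is_boundary = jaccard(prev_keywords, next_keywords) < 0.2  # assumed threshold
```

Keyword-set similarity is cheap and language-agnostic, at the cost of missing topic shifts that reuse the same vocabulary.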
⚠️ Important: Dependencies
This tool requires pyannote.audio for accurate multi-speaker analysis.
Without pyannote.audio installed, the tool uses a fallback VAD-based method that:
- ❌ Cannot distinguish between different speakers
- ❌ Will show 50/50 speaking time even when one person talks 90% of the time
With pyannote.audio installed:
- ✅ Correctly identifies who spoke when
- ✅ Accurate speaker time statistics
- ✅ Works with any number of speakers
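The fallback behavior described above is the classic optional-dependency pattern: probe the import once and route to the degraded path if it fails. A hypothetical sketch (the function name is invented for illustration):

```python
def pick_diarization_backend() -> str:
    """Return 'pyannote' when pyannote.audio is importable, else fall back."""
    try:
        import pyannote.audio  # noqa: F401  (probe only; unused here)
        return "pyannote"
    except ImportError:
        # VAD-only mode: speech/non-speech boundaries, no speaker identities
        return "vad_fallback"
```

Probing at startup (rather than at first use) lets the CLI warn early that speaker statistics will be unreliable.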
Installation
# CPU-only (faster install, recommended for testing)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install pyannote.audio
# GPU (faster inference, requires CUDA)
pip install torch torchaudio
pip install pyannote.audio
Optional Dependencies for Full V4 Features
# NER + Sentiment (recommended)
pip install audio-metrics-cli[nlp]
# Individual
pip install audio-metrics-cli[ner] # spaCy for named entity recognition
pip install audio-metrics-cli[emotion] # SpeechBrain for emotion analysis
pip install audio-metrics-cli[api] # FastAPI server
Development
# Clone repository
git clone https://github.com/i-whimsy/audio-metrics-cli.git
cd audio-metrics-cli
# Install with dev dependencies
pip install -e ".[dev]"
# Run V4 tests
pytest tests/v4/ -v
# Run all tests
pytest tests/ -v
# Format code
black src/
ruff check src/
Project Structure
audio-metrics-cli/
├── src/audio_metrics/
│   ├── main_cli.py                  # V4 CLI entry point
│   ├── cli/
│   │   ├── __init__.py              # cli/__init__ → main_cli.py
│   │   └── cli.py                   # Legacy v3 CLI (superseded)
│   ├── v4/
│   │   ├── __init__.py
│   │   ├── schemas.py               # Pydantic V4 models
│   │   ├── pipeline.py              # V4 orchestrator
│   │   └── generate_sample.py       # Sample generation
│   ├── analyzers/
│   │   ├── audio_health.py          # Audio validation/normalization
│   │   ├── speech_to_text.py        # Word-level timestamps
│   │   ├── speaker_diarization.py   # GPU device support
│   │   ├── prosody_analyzer.py      # 30+ prosody features
│   │   ├── filler_detector.py       # Filler word detection
│   │   ├── fluency_analyzer.py      # Unnatural pauses
│   │   └── ...
│   ├── nlp/
│   │   ├── ner_analyzer.py          # spaCy NER
│   │   ├── topic_segmenter.py       # Topic segmentation
│   │   ├── sentiment_analyzer.py    # TextBlob + snownlp
│   │   └── ...
│   ├── core/
│   │   ├── device.py                # GPU/CPU detection
│   │   ├── chunker.py               # Long audio chunking
│   │   ├── warnings.py              # Warning suppression
│   │   └── ...
│   ├── conversation/
│   ├── metrics/
│   └── exporters/
├── tests/
│   └── v4/
│       ├── test_schema_validation.py  # 27 schema tests
│       └── test_edge_cases.py         # 17 boundary tests
├── standard_v4_sample.json          # Reference output
├── pyproject.toml
└── README.md
Contributing
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
License
MIT License - see the LICENSE file for details.
Acknowledgments
- OpenAI Whisper - Speech-to-text
- Silero VAD - Voice activity detection
- pyannote - Speaker diarization
- Librosa - Audio analysis
- spaCy - Named entity recognition
- TextBlob - Sentiment analysis
- SnowNLP - Chinese sentiment
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: clawbot@openclaw.ai
Built with ❤️ by OpenClaw Team · v4.0 - Industrial-Grade Speech Deep Analysis
Download files
Source Distribution
Built Distribution
File details
Details for the file audio_metrics_cli-0.4.0.tar.gz.
File metadata
- Download URL: audio_metrics_cli-0.4.0.tar.gz
- Upload date:
- Size: 100.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 88333dd4d7474baf76495f39ffffc9d0972ae41301c5a635a169debf073575fc |
| MD5 | 8d9b430acc9399fc8f25c178d5cc2ed3 |
| BLAKE2b-256 | df44e3153a454a529bce548e7011cfb5f825c2147d4e0e18f48dfb1ebe7932aa |
File details
Details for the file audio_metrics_cli-0.4.0-py3-none-any.whl.
File metadata
- Download URL: audio_metrics_cli-0.4.0-py3-none-any.whl
- Upload date:
- Size: 113.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 498d54b18ae324a7ccfcec292de9842060bc72f95e4a975c59ea473b581cc1b1 |
| MD5 | fba4af1b68197ee700044f46711b51f5 |
| BLAKE2b-256 | 1a3cc40923994085d88b466d340a153952d3f5b892154318e060ec3e0faed2df |