
Voice Acoustic Analyzer - Professional audio metrics extraction


Audio Metrics CLI v4

๐ŸŽ™๏ธ Industrial-Grade Speech Deep Analysis Platform

PyPI version · Python 3.8+ · License: MIT

v4.0 Architecture: GPU acceleration, chunked processing, and Pydantic-validated output for industrial analysis


🇨🇳 China-Region Users: Read Before First Use

🚀 One-Click Model Download (Recommended)

Windows users: double-click the download_models.bat script

Or run the steps manually (PowerShell):

$env:HF_ENDPOINT = "https://hf-mirror.com"
pip install huggingface-hub openai-whisper -i https://pypi.tuna.tsinghua.edu.cn/simple
cd C:\Users\clawbot\.cache\torch\hub
git clone https://ghproxy.com/https://github.com/snakers4/silero-vad.git silero-vad_master
huggingface-cli download pyannote/speaker-diarization-3.1 --local-dir "C:\Users\clawbot\.cache\huggingface\hub\models--pyannote--speaker-diarization-3.1"
python -c "import whisper; whisper.load_model('base')"

Detailed instructions: see docs/MODEL_DEPENDENCIES.md


🚀 Quick Start

# Install from PyPI (recommended)
pip install audio-metrics-cli

# Full V4 analysis with GPU auto-detection
audio-metrics analyze audio.wav -o result.json

# Specify device manually
audio-metrics analyze audio.wav -d cuda -o result.json

# Long audio (>1h) with custom chunk size
audio-metrics analyze audio.wav -o result.json --chunk-size 900 --show-timings

GPU Acceleration

V4 auto-detects NVIDIA GPUs and runs Whisper + pyannote.audio on CUDA:

audio-metrics analyze audio.wav -d auto   # GPU if available, else CPU
audio-metrics analyze audio.wav -d cuda   # Force GPU
audio-metrics analyze audio.wav -d cpu    # Force CPU
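The `-d auto` resolution can be sketched in a few lines. This is a stdlib-only stand-in, not the tool's actual DeviceManager (which presumably queries `torch.cuda.is_available()`); here the presence of `nvidia-smi` on PATH is used as a rough proxy for a usable GPU, and `pick_device` is a hypothetical helper name:

```python
# Stdlib-only sketch of "-d auto" device resolution. pick_device is a
# hypothetical helper; the real DeviceManager likely uses torch directly.
import shutil

def pick_device(requested: str = "auto") -> str:
    """Resolve 'auto' to 'cuda' when a GPU looks available, else 'cpu'."""
    if requested != "auto":
        return requested  # honor an explicit -d cuda / -d cpu
    return "cuda" if shutil.which("nvidia-smi") else "cpu"
```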

๐Ÿ›๏ธ Architecture v4

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚               CLI Layer (main_cli.py)                โ”‚
โ”‚   analyze | analyze-multi | voice-acoustic | serve   โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”˜
                                                   โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚          V4 Pipeline (v4/pipeline.py)               โ”‚
โ”‚  DeviceManager โ†’ AudioHealth โ†’ Chunker โ†’ Analyzer  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”˜
                                                   โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚          Pydantic Schema (v4/schemas.py)             โ”‚
โ”‚  V4Result โ†’ SegmentModel โ†’ SpeakerModel โ†’ NER      โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Key Features

  • GPU Auto-Detection: automatic CUDA detection for Whisper + pyannote.audio
  • Chunked Processing: handles 1h+ audio without OOM (1800 s chunks, 60 s overlap)
  • Word-Level Alignment: precise timestamp alignment (replaces the seg_duration*5 estimate)
  • 30+ Prosody Metrics: pitch, energy, spectral, voice quality, and speech rate per segment
  • Fluency Analysis: filler words (呃/嗯/那个) and unnatural-pause detection
  • NER: spaCy-based named entity recognition (commercial entities, persons, locations)
  • Topic Segmentation: semantic topic chapters scored with Jaccard keyword similarity
  • Sentiment & Key Points: TextBlob/snownlp sentiment scoring, automatic key point detection
  • Pydantic Validation: all outputs validated against a strict schema (100% constraint enforcement)
  • tqdm Progress Bars: real-time feedback on VAD, diarization, STT, and metrics extraction
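The chunked-processing figures above (1800 s chunks, 60 s overlap) imply a simple overlapped-windowing scheme. The sketch below is a hypothetical illustration of how such boundaries could be planned, not the actual Chunker:

```python
# Plan overlapped (start, end) windows over a long recording.
# Defaults match the documented values: 1800 s chunks, 60 s overlap.
def chunk_spans(total_s: float, chunk_s: float = 1800.0, overlap_s: float = 60.0):
    """Yield (start, end) windows; consecutive windows share overlap_s seconds."""
    step = chunk_s - overlap_s
    start = 0.0
    while start < total_s:
        yield (start, min(start + chunk_s, total_s))
        start += step

print(list(chunk_spans(4000.0)))
# [(0.0, 1800.0), (1740.0, 3540.0), (3480.0, 4000.0)]
```

The overlap lets segments cut at a chunk boundary be re-detected in the next window and merged afterwards.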

📖 CLI Commands

analyze - V4 Full Analysis

Single audio file → V4 pipeline with full feature set.

audio-metrics analyze AUDIO_FILE [OPTIONS]

Options:
  -o, --output PATH           Output JSON file path
  -d, --device [auto|cuda|cpu]  Device for inference (default: auto)
  -m, --model TEXT            Whisper model (tiny/base/small/medium/large)
  --num-speakers INTEGER       Number of speakers (if known)
  --min-speakers INTEGER       Minimum number of speakers
  --max-speakers INTEGER       Maximum number of speakers
  --language TEXT              Language code (auto-detect if not specified)
  --chunk-size INTEGER         Chunk size in seconds for long audio (default: 1800)
  --no-emotion                 Skip emotion analysis
  --no-progress                Disable tqdm progress bars
  --show-timings               Show step timing information
  --show-progress              Show progress bars
  -f, --format [json|csv|html]  Output format (default: json)
  --parallel                   Use parallel processing (batch mode)
  --batch PATH                 Process all audio files in directory
  --glob TEXT                  Glob pattern for batch processing
  -j, --workers INTEGER        Number of parallel workers
  -v, --verbose                Verbose output

Examples:
  audio-metrics analyze meeting.wav -o result.json
  audio-metrics analyze meeting.wav -d cuda -o result.json --show-timings
  audio-metrics analyze long_recording.wav --chunk-size 900 --language zh

analyze-multi - Multi-Speaker Conversation

audio-metrics analyze-multi AUDIO_FILE [OPTIONS]

voice-acoustic - Acoustic Features Only

audio-metrics voice-acoustic AUDIO_FILE [OPTIONS]

transcribe - Whisper Transcription Only

audio-metrics transcribe AUDIO_FILE [-o OUTPUT] [-m MODEL] [--language LANG]

compare - Compare Two Audio Files

audio-metrics compare FILE1 FILE2 [--format text|json|markdown]

serve - Start API Server

audio-metrics serve [--host HOST] [-p PORT] [--reload]

📊 V4 Output Schema

All outputs are Pydantic-validated JSON with strict constraints.

Top-Level Structure

{
  "meta": {
    "version": "4.0.0",
    "device_used": "cuda",
    "chunked_processing": false,
    "analysis_complete": true
  },
  "audio": { ... },
  "speakers": [ ... ],
  "segments": [ ... ],
  "prosody": { ... },
  "fluency": { ... },
  "conversation_dynamics": { ... },
  "vad": { ... },
  "emotion": { ... },
  "named_entities": { ... },
  "topic_segments": [ ... ],
  "transcript_text": "...",
  "transcript_language": "zh"
}

Segment Detail (Core Output Unit)

{
  "segment_index": 0,
  "start": 0.0,
  "end": 15.234,
  "duration": 15.234,
  "confidence": 0.95,
  "speaker": "speaker_0",
  "text": "今天我们讨论一下Aibee项目的进展情况。万象城的项目已经进入第三期。",
  "pitch_mean_hz": 175.3,
  "energy_mean": 0.0245,
  "speech_rate_wpm": 150.2,
  "filler_words": { "那个": 2, "嗯": 1 },
  "sentiment_score": 0.3,
  "is_key_point": true,
  "named_entities": ["Aibee", "万象城"],
  "topic": "project_update"
}
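Because every segment carries `speaker` and `duration`, consumers can derive per-speaker statistics directly from the result JSON. A small consumer-side sketch using only the segment fields shown above:

```python
# Aggregate total speaking time per speaker from a V4 result file.
# Uses only stdlib and the documented segment fields (speaker, duration).
import json
from collections import defaultdict

def speaker_talk_time(result_path: str) -> dict:
    with open(result_path, encoding="utf-8") as f:
        result = json.load(f)
    totals = defaultdict(float)
    for seg in result["segments"]:
        totals[seg["speaker"]] += seg["duration"]
    return dict(totals)
```

For example, after `audio-metrics analyze meeting.wav -o result.json`, `speaker_talk_time("result.json")` returns a mapping like `{"speaker_0": 812.4, "speaker_1": 301.7}`.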

Named Entities

{
  "total_entities": 7,
  "commercial_entities": ["Aibee", "万象城", "中海地产", "保利", "SKP"],
  "persons": ["张三"],
  "organizations": ["Aibee", "中海地产"]
}

Topic Segmentation

{
  "num_topics": 3,
  "topics": [
    { "start": 0.0, "end": 1200.0, "topic_label": "project_update", "keywords": ["项目", "进度", "Aibee", "万象城"], "confidence": 0.85 },
    { "start": 1200.0, "end": 2400.0, "topic_label": "planning", "keywords": ["计划", "目标", "策略"], "confidence": 0.78 }
  ]
}
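Topic boundaries are scored with Jaccard keyword similarity, i.e. |A ∩ B| / |A ∪ B| over adjacent windows' keyword sets. A minimal sketch of the metric itself (the segmenter's actual windowing and thresholds are not shown):

```python
# Jaccard similarity between two keyword sets: |A ∩ B| / |A ∪ B|.
def jaccard(a, b):
    if not a and not b:
        return 1.0  # convention: two empty keyword sets count as identical
    return len(a & b) / len(a | b)

# The two sample topics above share no keywords, so a segmenter using this
# score would treat them as distinct topics:
print(jaccard({"项目", "进度", "Aibee", "万象城"}, {"计划", "目标", "策略"}))  # 0.0
```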

See standard_v4_sample.json for full reference.


⚠️ Important: Dependencies

This tool requires pyannote.audio for accurate multi-speaker analysis.

Without pyannote.audio installed, the tool uses a fallback VAD-based method that:

  • โŒ Cannot distinguish between different speakers
  • โŒ Will show 50/50 speaking time even when one person talks 90% of the time

With pyannote.audio installed:

  • ✅ Correctly identifies who spoke when
  • ✅ Accurate speaker time statistics
  • ✅ Works with any number of speakers
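To know up front which path a long job will take, you can probe for the package without importing it. This is a stdlib-only sketch, and `has_pyannote` is a hypothetical helper, not part of the tool:

```python
# Detect whether pyannote.audio is installed (real diarization) or the
# VAD fallback will be used, without importing the heavy package itself.
import importlib.util

def has_pyannote() -> bool:
    try:
        return importlib.util.find_spec("pyannote.audio") is not None
    except ModuleNotFoundError:  # parent package "pyannote" absent entirely
        return False

if not has_pyannote():
    print("pyannote.audio missing: per-speaker statistics will be unreliable")
```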

Installation

# CPU-only (faster install, recommended for testing)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install pyannote.audio

# GPU (faster inference, requires CUDA)
pip install torch torchaudio
pip install pyannote.audio

Optional Dependencies for Full V4 Features

# NER + Sentiment (recommended)
pip install audio-metrics-cli[nlp]

# Individual
pip install audio-metrics-cli[ner]      # spaCy for named entity recognition
pip install audio-metrics-cli[emotion]   # SpeechBrain for emotion analysis
pip install audio-metrics-cli[api]       # FastAPI server

💻 Development

# Clone repository
git clone https://github.com/i-whimsy/audio-metrics-cli.git
cd audio-metrics-cli

# Install with dev dependencies
pip install -e ".[dev]"

# Run V4 tests
pytest tests/v4/ -v

# Run all tests
pytest tests/ -v

# Format code
black src/
ruff check src/

Project Structure

audio-metrics-cli/
├── src/audio_metrics/
│   ├── main_cli.py              # V4 CLI entry point
│   ├── cli/
│   │   ├── __init__.py          # cli/__init__ → main_cli.py
│   │   └── cli.py               # Legacy v3 CLI (superseded)
│   ├── v4/
│   │   ├── __init__.py
│   │   ├── schemas.py           # Pydantic V4 models
│   │   ├── pipeline.py          # V4 orchestrator
│   │   └── generate_sample.py   # Sample generation
│   ├── analyzers/
│   │   ├── audio_health.py      # Audio validation/normalization
│   │   ├── speech_to_text.py    # Word-level timestamps
│   │   ├── speaker_diarization.py  # GPU device support
│   │   ├── prosody_analyzer.py  # 30+ prosody features
│   │   ├── filler_detector.py   # Filler word detection
│   │   ├── fluency_analyzer.py  # Unnatural pauses
│   │   └── ...
│   ├── nlp/
│   │   ├── ner_analyzer.py      # spaCy NER
│   │   ├── topic_segmenter.py   # Topic segmentation
│   │   ├── sentiment_analyzer.py  # TextBlob + snownlp
│   │   └── ...
│   ├── core/
│   │   ├── device.py            # GPU/CPU detection
│   │   ├── chunker.py           # Long audio chunking
│   │   ├── warnings.py          # Warning suppression
│   │   └── ...
│   ├── conversation/
│   ├── metrics/
│   └── exporters/
├── tests/
│   └── v4/
│       ├── test_schema_validation.py  # 27 schema tests
│       └── test_edge_cases.py         # 17 boundary tests
├── standard_v4_sample.json      # Reference output
├── pyproject.toml
└── README.md

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

๐Ÿ“ License

MIT License - see the LICENSE file for details.


🙏 Acknowledgments


📞 Support


Built with ❤️ by OpenClaw Team · v4.0 - Industrial-Grade Deep Speech Analysis



Download files

Download the file for your platform.

Source Distribution

audio_metrics_cli-0.4.0.tar.gz (100.5 kB)


Built Distribution


audio_metrics_cli-0.4.0-py3-none-any.whl (113.4 kB)


File details

Details for the file audio_metrics_cli-0.4.0.tar.gz.

File metadata

  • Download URL: audio_metrics_cli-0.4.0.tar.gz
  • Size: 100.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for audio_metrics_cli-0.4.0.tar.gz:

  • SHA256: 88333dd4d7474baf76495f39ffffc9d0972ae41301c5a635a169debf073575fc
  • MD5: 8d9b430acc9399fc8f25c178d5cc2ed3
  • BLAKE2b-256: df44e3153a454a529bce548e7011cfb5f825c2147d4e0e18f48dfb1ebe7932aa
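To verify a download against the published SHA256 digest (useful when fetching through a mirror), the standard library suffices. The expected value below is the sdist's SHA256 from this page:

```python
# Compute a file's SHA256 incrementally and compare it to the published digest.
import hashlib

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 16), b""):
            h.update(block)
    return h.hexdigest()

expected = "88333dd4d7474baf76495f39ffffc9d0972ae41301c5a635a169debf073575fc"
# After downloading the sdist:
# assert sha256_of("audio_metrics_cli-0.4.0.tar.gz") == expected
```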


File details

Details for the file audio_metrics_cli-0.4.0-py3-none-any.whl.

File hashes

Hashes for audio_metrics_cli-0.4.0-py3-none-any.whl:

  • SHA256: 498d54b18ae324a7ccfcec292de9842060bc72f95e4a975c59ea473b581cc1b1
  • MD5: fba4af1b68197ee700044f46711b51f5
  • BLAKE2b-256: 1a3cc40923994085d88b466d340a153952d3f5b892154318e060ec3e0faed2df

