
Voice Acoustic Analyzer - Professional audio metrics extraction


Audio Metrics CLI v4

๐ŸŽ™๏ธ Industrial-Grade Speech Deep Analysis Platform

PyPI version · Python 3.8+ · License: MIT

v4.0 Architecture: GPU acceleration, chunked processing, and Pydantic-validated output for industrial analysis


🇨🇳 China-Region Users: Read Before First Use

🚀 One-Click Model Download (Recommended)

Windows users: double-click the download_models.bat script

Or run the steps manually (PowerShell):

$env:HF_ENDPOINT = "https://hf-mirror.com"
pip install huggingface-hub openai-whisper -i https://pypi.tuna.tsinghua.edu.cn/simple
cd C:\Users\clawbot\.cache\torch\hub
git clone https://ghproxy.com/https://github.com/snakers4/silero-vad.git silero-vad_master
huggingface-cli download pyannote/speaker-diarization-3.1 --local-dir "C:\Users\clawbot\.cache\huggingface\hub\models--pyannote--speaker-diarization-3.1"
python -c "import whisper; whisper.load_model('base')"

Detailed instructions: see docs/MODEL_DEPENDENCIES.md


🚀 Quick Start

# Install from PyPI (recommended)
pip install audio-metrics-cli

# Full V4 analysis with GPU auto-detection
audio-metrics analyze audio.wav -o result.json

# Specify device manually
audio-metrics analyze audio.wav -d cuda -o result.json

# Long audio (>1h) with custom chunk size
audio-metrics analyze audio.wav -o result.json --chunk-size 900 --show-timings

GPU Acceleration

V4 auto-detects NVIDIA GPUs and runs Whisper + pyannote.audio on CUDA:

audio-metrics analyze audio.wav -d auto   # GPU if available, else CPU
audio-metrics analyze audio.wav -d cuda   # Force GPU
audio-metrics analyze audio.wav -d cpu    # Force CPU
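The `-d auto` resolution can be sketched in a few lines. This is a stdlib-only stand-in, not the tool's actual DeviceManager (which presumably queries `torch.cuda.is_available()`); here the presence of `nvidia-smi` on PATH is used as a rough proxy for a usable GPU, and `pick_device` is a hypothetical helper name:

```python
# Stdlib-only sketch of "-d auto" device resolution. pick_device is a
# hypothetical helper; the real DeviceManager likely uses torch directly.
import shutil

def pick_device(requested: str = "auto") -> str:
    """Resolve 'auto' to 'cuda' when a GPU looks available, else 'cpu'."""
    if requested != "auto":
        return requested  # honor an explicit -d cuda / -d cpu
    return "cuda" if shutil.which("nvidia-smi") else "cpu"
```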

๐Ÿ›๏ธ Architecture v4

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚               CLI Layer (main_cli.py)                โ”‚
โ”‚   analyze | analyze-multi | voice-acoustic | serve   โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”˜
                                                   โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚          V4 Pipeline (v4/pipeline.py)               โ”‚
โ”‚  DeviceManager โ†’ AudioHealth โ†’ Chunker โ†’ Analyzer  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”˜
                                                   โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚          Pydantic Schema (v4/schemas.py)             โ”‚
โ”‚  V4Result โ†’ SegmentModel โ†’ SpeakerModel โ†’ NER      โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Key Features

  • GPU Auto-Detection: automatic CUDA detection for Whisper + pyannote.audio
  • Chunked Processing: handles 1h+ audio without OOM (1800 s chunks, 60 s overlap)
  • Word-Level Alignment: precise timestamp alignment (replaces the seg_duration*5 estimate)
  • 30+ Prosody Metrics: pitch, energy, spectral, voice quality, and speech rate per segment
  • Fluency Analysis: filler words (呃/嗯/那个) and unnatural-pause detection
  • NER: spaCy-based named entity recognition (commercial entities, persons, locations)
  • Topic Segmentation: semantic topic chapters scored with Jaccard keyword similarity
  • Sentiment & Key Points: TextBlob/snownlp sentiment scoring, automatic key point detection
  • Pydantic Validation: all outputs validated against a strict schema (100% constraint enforcement)
  • tqdm Progress Bars: real-time feedback on VAD, diarization, STT, and metrics extraction
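The chunked-processing figures above (1800 s chunks, 60 s overlap) imply a simple overlapped-windowing scheme. The sketch below is a hypothetical illustration of how such boundaries could be planned, not the actual Chunker:

```python
# Plan overlapped (start, end) windows over a long recording.
# Defaults match the documented values: 1800 s chunks, 60 s overlap.
def chunk_spans(total_s: float, chunk_s: float = 1800.0, overlap_s: float = 60.0):
    """Yield (start, end) windows; consecutive windows share overlap_s seconds."""
    step = chunk_s - overlap_s
    start = 0.0
    while start < total_s:
        yield (start, min(start + chunk_s, total_s))
        start += step

print(list(chunk_spans(4000.0)))
# [(0.0, 1800.0), (1740.0, 3540.0), (3480.0, 4000.0)]
```

The overlap lets segments cut at a chunk boundary be re-detected in the next window and merged afterwards.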

📖 CLI Commands

analyze - V4 Full Analysis

Single audio file → V4 pipeline with full feature set.

audio-metrics analyze AUDIO_FILE [OPTIONS]

Options:
  -o, --output PATH           Output JSON file path
  -d, --device [auto|cuda|cpu]  Device for inference (default: auto)
  -m, --model TEXT            Whisper model (tiny/base/small/medium/large)
  --num-speakers INTEGER       Number of speakers (if known)
  --min-speakers INTEGER       Minimum number of speakers
  --max-speakers INTEGER       Maximum number of speakers
  --language TEXT              Language code (auto-detect if not specified)
  --chunk-size INTEGER         Chunk size in seconds for long audio (default: 1800)
  --no-emotion                 Skip emotion analysis
  --no-progress                Disable tqdm progress bars
  --show-timings               Show step timing information
  --show-progress              Show progress bars
  -f, --format [json|csv|html]  Output format (default: json)
  --parallel                   Use parallel processing (batch mode)
  --batch PATH                 Process all audio files in directory
  --glob TEXT                  Glob pattern for batch processing
  -j, --workers INTEGER        Number of parallel workers
  -v, --verbose                Verbose output

Examples:
  audio-metrics analyze meeting.wav -o result.json
  audio-metrics analyze meeting.wav -d cuda -o result.json --show-timings
  audio-metrics analyze long_recording.wav --chunk-size 900 --language zh

analyze-multi - Multi-Speaker Conversation

audio-metrics analyze-multi AUDIO_FILE [OPTIONS]

voice-acoustic - Acoustic Features Only

audio-metrics voice-acoustic AUDIO_FILE [OPTIONS]

transcribe - Whisper Transcription Only

audio-metrics transcribe AUDIO_FILE [-o OUTPUT] [-m MODEL] [--language LANG]

compare - Compare Two Audio Files

audio-metrics compare FILE1 FILE2 [--format text|json|markdown]

serve - Start API Server

audio-metrics serve [--host HOST] [-p PORT] [--reload]

📊 V4 Output Schema

All outputs are Pydantic-validated JSON with strict constraints.

Top-Level Structure

{
  "meta": {
    "version": "4.0.0",
    "device_used": "cuda",
    "chunked_processing": false,
    "analysis_complete": true
  },
  "audio": { ... },
  "speakers": [ ... ],
  "segments": [ ... ],
  "prosody": { ... },
  "fluency": { ... },
  "conversation_dynamics": { ... },
  "vad": { ... },
  "emotion": { ... },
  "named_entities": { ... },
  "topic_segments": [ ... ],
  "transcript_text": "...",
  "transcript_language": "zh"
}

Segment Detail (Core Output Unit)

{
  "segment_index": 0,
  "start": 0.0,
  "end": 15.234,
  "duration": 15.234,
  "confidence": 0.95,
  "speaker": "speaker_0",
  "text": "今天我们讨论一下Aibee项目的进展情况。万象城的项目已经进入第三期。",
  "pitch_mean_hz": 175.3,
  "energy_mean": 0.0245,
  "speech_rate_wpm": 150.2,
  "filler_words": { "那个": 2, "嗯": 1 },
  "sentiment_score": 0.3,
  "is_key_point": true,
  "named_entities": ["Aibee", "万象城"],
  "topic": "project_update"
}
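Because every segment carries `speaker` and `duration`, consumers can derive per-speaker statistics directly from the result JSON. A small consumer-side sketch using only the segment fields shown above:

```python
# Aggregate total speaking time per speaker from a V4 result file.
# Uses only stdlib and the documented segment fields (speaker, duration).
import json
from collections import defaultdict

def speaker_talk_time(result_path: str) -> dict:
    with open(result_path, encoding="utf-8") as f:
        result = json.load(f)
    totals = defaultdict(float)
    for seg in result["segments"]:
        totals[seg["speaker"]] += seg["duration"]
    return dict(totals)
```

For example, after `audio-metrics analyze meeting.wav -o result.json`, `speaker_talk_time("result.json")` returns a mapping like `{"speaker_0": 812.4, "speaker_1": 301.7}`.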

Named Entities

{
  "total_entities": 7,
  "commercial_entities": ["Aibee", "万象城", "中海地产", "保利", "SKP"],
  "persons": ["张三"],
  "organizations": ["Aibee", "中海地产"]
}

Topic Segmentation

{
  "num_topics": 3,
  "topics": [
    { "start": 0.0, "end": 1200.0, "topic_label": "project_update", "keywords": ["项目", "进度", "Aibee", "万象城"], "confidence": 0.85 },
    { "start": 1200.0, "end": 2400.0, "topic_label": "planning", "keywords": ["计划", "目标", "策略"], "confidence": 0.78 }
  ]
}
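Topic boundaries are scored with Jaccard keyword similarity, i.e. |A ∩ B| / |A ∪ B| over adjacent windows' keyword sets. A minimal sketch of the metric itself (the segmenter's actual windowing and thresholds are not shown):

```python
# Jaccard similarity between two keyword sets: |A ∩ B| / |A ∪ B|.
def jaccard(a, b):
    if not a and not b:
        return 1.0  # convention: two empty keyword sets count as identical
    return len(a & b) / len(a | b)

# The two sample topics above share no keywords, so a segmenter using this
# score would treat them as distinct topics:
print(jaccard({"项目", "进度", "Aibee", "万象城"}, {"计划", "目标", "策略"}))  # 0.0
```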

See standard_v4_sample.json for full reference.


⚠️ Important: Dependencies

This tool requires pyannote.audio for accurate multi-speaker analysis.

Without pyannote.audio installed, the tool uses a fallback VAD-based method that:

  • โŒ Cannot distinguish between different speakers
  • โŒ Will show 50/50 speaking time even when one person talks 90% of the time

With pyannote.audio installed:

  • ✅ Correctly identifies who spoke when
  • ✅ Accurate speaker time statistics
  • ✅ Works with any number of speakers
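To know up front which path a long job will take, you can probe for the package without importing it. This is a stdlib-only sketch, and `has_pyannote` is a hypothetical helper, not part of the tool:

```python
# Detect whether pyannote.audio is installed (real diarization) or the
# VAD fallback will be used, without importing the heavy package itself.
import importlib.util

def has_pyannote() -> bool:
    try:
        return importlib.util.find_spec("pyannote.audio") is not None
    except ModuleNotFoundError:  # parent package "pyannote" absent entirely
        return False

if not has_pyannote():
    print("pyannote.audio missing: per-speaker statistics will be unreliable")
```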

Installation

# CPU-only (faster install, recommended for testing)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install pyannote.audio

# GPU (faster inference, requires CUDA)
pip install torch torchaudio
pip install pyannote.audio

Optional Dependencies for Full V4 Features

# NER + Sentiment (recommended)
pip install audio-metrics-cli[nlp]

# Individual
pip install audio-metrics-cli[ner]      # spaCy for named entity recognition
pip install audio-metrics-cli[emotion]   # SpeechBrain for emotion analysis
pip install audio-metrics-cli[api]       # FastAPI server

💻 Development

# Clone repository
git clone https://github.com/i-whimsy/audio-metrics-cli.git
cd audio-metrics-cli

# Install with dev dependencies
pip install -e ".[dev]"

# Run V4 tests
pytest tests/v4/ -v

# Run all tests
pytest tests/ -v

# Format code
black src/
ruff check src/

Project Structure

audio-metrics-cli/
├── src/audio_metrics/
│   ├── main_cli.py              # V4 CLI entry point
│   ├── cli/
│   │   ├── __init__.py          # cli/__init__ → main_cli.py
│   │   └── cli.py               # Legacy v3 CLI (superseded)
│   ├── v4/
│   │   ├── __init__.py
│   │   ├── schemas.py           # Pydantic V4 models
│   │   ├── pipeline.py          # V4 orchestrator
│   │   └── generate_sample.py   # Sample generation
│   ├── analyzers/
│   │   ├── audio_health.py      # Audio validation/normalization
│   │   ├── speech_to_text.py    # Word-level timestamps
│   │   ├── speaker_diarization.py  # GPU device support
│   │   ├── prosody_analyzer.py  # 30+ prosody features
│   │   ├── filler_detector.py   # Filler word detection
│   │   ├── fluency_analyzer.py  # Unnatural pauses
│   │   └── ...
│   ├── nlp/
│   │   ├── ner_analyzer.py      # spaCy NER
│   │   ├── topic_segmenter.py   # Topic segmentation
│   │   ├── sentiment_analyzer.py  # TextBlob + snownlp
│   │   └── ...
│   ├── core/
│   │   ├── device.py            # GPU/CPU detection
│   │   ├── chunker.py           # Long audio chunking
│   │   ├── warnings.py          # Warning suppression
│   │   └── ...
│   ├── conversation/
│   ├── metrics/
│   └── exporters/
├── tests/
│   └── v4/
│       ├── test_schema_validation.py  # 27 schema tests
│       └── test_edge_cases.py         # 17 boundary tests
├── standard_v4_sample.json      # Reference output
├── pyproject.toml
└── README.md

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

๐Ÿ“ License

MIT License - see the LICENSE file for details.


🙏 Acknowledgments


📞 Support


Built with ❤️ by OpenClaw Team · v4.0 - Industrial-Grade Deep Speech Analysis



Download files

Download the file for your platform.

Source Distribution

audio_metrics_cli-0.4.0.tar.gz (100.5 kB)


Built Distribution


audio_metrics_cli-0.4.0-py3-none-any.whl (113.4 kB)


File details

Details for the file audio_metrics_cli-0.4.0.tar.gz.

File metadata

  • Download URL: audio_metrics_cli-0.4.0.tar.gz
  • Size: 100.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for audio_metrics_cli-0.4.0.tar.gz:

  • SHA256: 88333dd4d7474baf76495f39ffffc9d0972ae41301c5a635a169debf073575fc
  • MD5: 8d9b430acc9399fc8f25c178d5cc2ed3
  • BLAKE2b-256: df44e3153a454a529bce548e7011cfb5f825c2147d4e0e18f48dfb1ebe7932aa
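To verify a download against the published SHA256 digest (useful when fetching through a mirror), the standard library suffices. The expected value below is the sdist's SHA256 from this page:

```python
# Compute a file's SHA256 incrementally and compare it to the published digest.
import hashlib

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 16), b""):
            h.update(block)
    return h.hexdigest()

expected = "88333dd4d7474baf76495f39ffffc9d0972ae41301c5a635a169debf073575fc"
# After downloading the sdist:
# assert sha256_of("audio_metrics_cli-0.4.0.tar.gz") == expected
```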


File details

Details for the file audio_metrics_cli-0.4.0-py3-none-any.whl.

File hashes

Hashes for audio_metrics_cli-0.4.0-py3-none-any.whl:

  • SHA256: 498d54b18ae324a7ccfcec292de9842060bc72f95e4a975c59ea473b581cc1b1
  • MD5: fba4af1b68197ee700044f46711b51f5
  • BLAKE2b-256: 1a3cc40923994085d88b466d340a153952d3f5b892154318e060ec3e0faed2df

