ASR Quality Enhancement Layer for Parakeet Multilingual ASR
A production-grade post-processing pipeline for improving Parakeet Multilingual ASR outputs. This system addresses common ASR challenges including low-confidence word detection, numeric sequence reconstruction, domain vocabulary correction, and LLM-based contextual polishing.
Overview
The ASR Enhancement Layer sits between the Parakeet ASR engine and downstream applications, providing:
- Error Detection: Identifies low-confidence spans, anomalies, and incomplete sequences
- Secondary ASR: Re-transcribes problematic segments using Whisper/Riva
- Numeric Reconstruction: Recovers missing digits in phone numbers, OTPs, amounts
- Domain Vocabulary: Applies domain-specific terminology corrections
- LLM Polishing: Fixes grammar and coherence with anti-hallucination safeguards
- Hypothesis Fusion: Combines multiple ASR outputs using weighted scoring
Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│                      ASR QUALITY ENHANCEMENT LAYER                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐               │
│  │   Parakeet   │───▶│    Error     │───▶│    Re-ASR    │               │
│  │  ASR Input   │    │  Detection   │    │  Processing  │               │
│  └──────────────┘    └──────────────┘    └──────────────┘               │
│         │                   │                   │                       │
│         │           ┌───────┴───────┐           │                       │
│         │           │ • Confidence  │           │                       │
│         │           │ • Anomalies   │           │                       │
│         │           │ • Numeric     │           │                       │
│         │           └───────────────┘           │                       │
│         ▼                                       ▼                       │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐               │
│  │   Numeric    │───▶│    Domain    │───▶│  Hypothesis  │               │
│  │  Reconstruct │    │    Vocab     │    │    Fusion    │               │
│  └──────────────┘    └──────────────┘    └──────────────┘               │
│         │                   │                   │                       │
│         │           ┌───────┴───────┐           │                       │
│         │           │ • Lexicons    │           │                       │
│         │           │ • Fuzzy Match │           │                       │
│         │           │ • Phonetic    │           │                       │
│         │           └───────────────┘           │                       │
│         ▼                                       ▼                       │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐               │
│  │     LLM      │───▶│  Validation  │───▶│   Enhanced   │               │
│  │  Polishing   │    │  & Scoring   │    │    Output    │               │
│  └──────────────┘    └──────────────┘    └──────────────┘               │
│         │                   │                                           │
│         │           ┌───────┴───────┐                                   │
│         │           │ • Consistency │                                   │
│         │           │ • Perplexity  │                                   │
│         │           │ • Completeness│                                   │
│         │           └───────────────┘                                   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
Project Structure
asr_enhancer/
├── __init__.py                    # Package exports
├── core.py                        # Main EnhancementPipeline orchestrator
├── detectors/                     # Error detection modules
│   ├── confidence_detector.py     # Low-confidence span detection
│   ├── anomaly_detector.py        # Segmentation/repetition anomalies
│   └── numeric_gap_detector.py    # Incomplete number sequences
├── resynthesis/                   # Secondary ASR processing
│   ├── segment_extractor.py       # Audio segment extraction
│   ├── secondary_asr.py           # ASR backend abstraction
│   ├── whisper_backend.py         # Whisper integration
│   └── riva_backend.py            # NVIDIA Riva integration
├── numeric/                       # Numeric reconstruction
│   ├── pattern_analyzer.py        # Number pattern detection
│   ├── sequence_reconstructor.py  # Digit recovery
│   └── validators.py              # Phone/OTP/card validation
├── vocab/                         # Domain vocabulary
│   ├── lexicon_loader.py          # Lexicon loading
│   ├── term_matcher.py            # Term matching (fuzzy/phonetic)
│   └── corrector.py               # Vocabulary correction
├── llm/                           # LLM integration
│   ├── context_restorer.py        # Main LLM processor
│   ├── prompt_templates.py        # Anti-hallucination prompts
│   └── providers.py               # OpenAI/Ollama/Anthropic
├── fusion/                        # Hypothesis fusion
│   ├── fusion_engine.py           # N-best combination
│   ├── scorers.py                 # Acoustic/LM scoring
│   └── selector.py                # Candidate selection
├── validators/                    # Quality validation
│   ├── consistency_checker.py     # Content consistency
│   ├── perplexity_scorer.py       # Fluency scoring
│   └── completeness_validator.py  # Gap detection
├── utils/                         # Utilities
│   ├── config.py                  # Configuration management
│   ├── logging.py                 # Structured logging
│   ├── audio.py                   # Audio utilities
│   └── text.py                    # Text utilities
├── api/                           # FastAPI service
│   ├── main.py                    # Application entry
│   ├── routes.py                  # API endpoints
│   └── schemas.py                 # Pydantic models
└── cli/                           # Command-line interface
    └── __init__.py                # CLI commands
Quick Start
Installation
# Clone the repository
git clone <repository-url>
cd sound-web
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -e ".[all]"
Basic Usage
import asyncio

from asr_enhancer import EnhancementPipeline
from asr_enhancer.utils import Config


async def main() -> None:
    # Initialize pipeline
    config = Config(
        confidence_threshold=0.7,
        llm_provider="ollama",
        llm_model="llama3.1",
    )
    pipeline = EnhancementPipeline(config)

    # Enhance transcript; enhance() is awaited, so it must run inside a coroutine
    result = await pipeline.enhance(
        transcript="my phone number is nine one two tree four five six seven ate nine",
        word_timestamps=[
            {"word": "my", "start": 0.0, "end": 0.2},
            {"word": "phone", "start": 0.2, "end": 0.5},
            # ... more timestamps
        ],
        # One confidence value per word in the transcript
        word_confidences=[0.95, 0.92, 0.89, 0.98, 0.85, 0.91, 0.88, 0.45, 0.92, 0.87, 0.90, 0.93, 0.38, 0.91],
    )
    print(f"Enhanced: {result.enhanced_transcript}")
    print(f"Confidence improvement: {result.confidence_improvement:.2%}")


asyncio.run(main())
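The sample transcript contains two likely acoustic confusions ("tree" for "three", "ate" for "eight") inside a spoken phone number. As a rough illustration of what numeric reconstruction might do with such input, here is a minimal standalone sketch. The word lists, the `reconstruct_digits` name, and the `min_run` parameter are assumptions made for this example and are not taken from the package.

```python
# Hypothetical lookup tables; not taken from the asr_enhancer source.
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9",
}
# Assumed acoustic confusions, mapped to the digit word they resemble.
CONFUSIONS = {"tree": "three", "ate": "eight", "to": "two", "for": "four"}


def reconstruct_digits(transcript: str, min_run: int = 3) -> str:
    """Replace runs of at least `min_run` spoken digits with a digit string.

    Confusion words are only rewritten inside a long enough run, so an
    isolated "to" or "for" in ordinary prose is left untouched.
    """
    out = []
    run_digits = []  # digits collected for the current run
    run_words = []   # original words, restored if the run is too short

    def flush() -> None:
        if len(run_digits) >= min_run:
            out.append("".join(run_digits))
        else:
            out.extend(run_words)
        run_digits.clear()
        run_words.clear()

    for word in transcript.split():
        canonical = CONFUSIONS.get(word.lower(), word.lower())
        if canonical in DIGIT_WORDS:
            run_digits.append(DIGIT_WORDS[canonical])
            run_words.append(word)
        else:
            flush()
            out.append(word)
    flush()
    return " ".join(out)


print(reconstruct_digits(
    "my phone number is nine one two tree four five six seven ate nine"
))
```

Requiring a minimum run length is one simple way to avoid rewriting ordinary words; a real implementation would presumably also use the per-word confidences.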
API Server
# Start the API server
asr-enhancer serve --host 0.0.0.0 --port 8000
# Or with Docker
docker-compose up -d
CLI Usage
# Enhance a transcript file
asr-enhancer enhance input.json -o output.json --format json
# Analyze without enhancement
asr-enhancer analyze input.json
# Check dependencies
asr-enhancer check
API Endpoints
POST /api/v1/enhance
Enhance a transcript using the full pipeline.
{
  "transcript": "raw transcript text",
  "word_timestamps": [{"word": "...", "start": 0.0, "end": 0.1}],
  "word_confidences": [0.9, 0.8, ...],
  "audio_path": "/path/to/audio.wav",                    // optional
  "domain_lexicon": {"term": ["variant1", "variant2"]}   // optional
}
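As a sketch of how a client might assemble this request body, the snippet below builds (but does not send) a POST request with the standard library. The base URL is an assumption that matches the host and port from the `serve` example, the `build_enhance_request` helper is hypothetical, and the response schema is not documented here, so the sending step is only shown as a comment.

```python
import json
import urllib.request

# Assumed base URL; matches the port used in the `serve` example.
BASE_URL = "http://localhost:8000"


def build_enhance_request(transcript, word_timestamps, word_confidences,
                          audio_path=None, domain_lexicon=None):
    """Build a POST request for /api/v1/enhance without sending it."""
    payload = {
        "transcript": transcript,
        "word_timestamps": word_timestamps,
        "word_confidences": word_confidences,
    }
    # Optional fields are left out entirely when not provided.
    if audio_path is not None:
        payload["audio_path"] = audio_path
    if domain_lexicon is not None:
        payload["domain_lexicon"] = domain_lexicon
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/enhance",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_enhance_request(
    transcript="hello world",
    word_timestamps=[
        {"word": "hello", "start": 0.0, "end": 0.4},
        {"word": "world", "start": 0.4, "end": 0.9},
    ],
    word_confidences=[0.93, 0.61],
)
# To send it against a running server:
#     with urllib.request.urlopen(request) as response:
#         print(json.load(response))
```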
POST /api/v1/analyze
Analyze transcript without enhancement.
GET /api/v1/diagnostics
Get pipeline diagnostics and configuration.
GET /health
Health check endpoint.
Configuration
Configuration can be set via:
- Configuration file (config.json)
- Environment variables
- Code
Key Settings
| Setting | Default | Description |
|---|---|---|
| `confidence_threshold` | 0.7 | Threshold for low-confidence detection |
| `sliding_window_size` | 3 | Window size for confidence smoothing |
| `secondary_asr_backend` | "whisper" | Backend for re-ASR ("whisper", "riva") |
| `llm_provider` | "ollama" | LLM provider ("openai", "ollama", "anthropic") |
| `llm_model` | "llama3.1" | LLM model name |
| `fusion_alpha` | 0.4 | Weight for original ASR confidence |
| `fusion_beta` | 0.35 | Weight for language model score |
| `fusion_gamma` | 0.25 | Weight for acoustic similarity |
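To make the interaction between `confidence_threshold` and `sliding_window_size` more concrete, here is a minimal sketch of sliding-window detection. It assumes a centered moving average and flags a word when either its raw or its smoothed confidence falls below the threshold; the package's actual rule is not documented here, and the function name is hypothetical.

```python
def find_low_confidence_spans(confidences, threshold=0.7, window_size=3):
    """Return (start, end) word-index spans to re-check; `end` is exclusive.

    Assumed rule: a word is flagged when its raw confidence OR its
    centered moving average over `window_size` words is below `threshold`.
    """
    n = len(confidences)
    half = window_size // 2
    flagged = []
    for i in range(n):
        # Window is truncated at the edges of the transcript.
        window = confidences[max(0, i - half):min(n, i + half + 1)]
        smoothed = sum(window) / len(window)
        flagged.append(confidences[i] < threshold or smoothed < threshold)

    # Merge consecutive flagged words into spans.
    spans = []
    start = None
    for i, is_low in enumerate(flagged):
        if is_low and start is None:
            start = i
        elif not is_low and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, n))
    return spans


# Confidences from the Basic Usage example (14 words).
confidences = [0.95, 0.92, 0.89, 0.98, 0.85, 0.91, 0.88,
               0.45, 0.92, 0.87, 0.90, 0.93, 0.38, 0.91]
print(find_low_confidence_spans(confidences))  # [(7, 8), (12, 14)]
```

With these values, smoothing alone would average away the isolated dip at index 7 ("tree"), which is why this sketch also checks the raw value. The final word is flagged only because its truncated window includes the 0.38 neighbor.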
Environment Variables
export ASR_CONFIDENCE_THRESHOLD=0.7
export ASR_LLM_PROVIDER=ollama
export ASR_LLM_MODEL=llama3.1
export ASR_LLM_API_KEY=your-api-key # For OpenAI/Anthropic
export ASR_LOG_LEVEL=INFO
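One plausible way such variables could be mapped onto settings is sketched below. The `ASR_` prefix and the variable names come from the list above; the `load_settings` helper, the type-coercion rule, and the defaults for `llm_api_key` and `log_level` are assumptions for illustration rather than the package's actual loader.

```python
import os

# Defaults for the first three keys come from the Key Settings table;
# the last two defaults are assumed for this sketch.
DEFAULTS = {
    "confidence_threshold": 0.7,
    "llm_provider": "ollama",
    "llm_model": "llama3.1",
    "llm_api_key": None,
    "log_level": "INFO",
}


def load_settings(environ=None, prefix="ASR_"):
    """Overlay ASR_* environment variables on top of the defaults."""
    environ = os.environ if environ is None else environ
    settings = dict(DEFAULTS)
    for key, default in DEFAULTS.items():
        raw = environ.get(prefix + key.upper())
        if raw is None:
            continue  # variable not set: keep the default
        # Environment values are strings; coerce floats back to float.
        settings[key] = float(raw) if isinstance(default, float) else raw
    return settings


settings = load_settings({
    "ASR_CONFIDENCE_THRESHOLD": "0.55",
    "ASR_LLM_PROVIDER": "openai",
})
print(settings["confidence_threshold"], settings["llm_provider"])
```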
Fusion Formula
The hypothesis fusion uses weighted scoring:
$$Score = \alpha \cdot P_{confidence} + \beta \cdot S_{LM} + \gamma \cdot S_{acoustic}$$
Where:
- $\alpha$ = Original ASR confidence weight (default: 0.4)
- $\beta$ = Language model score weight (default: 0.35)
- $\gamma$ = Acoustic similarity weight (default: 0.25)
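The formula can be sketched in a few lines of Python. The `Hypothesis` dataclass, the function names, and the example scores below are made up for illustration; only the weights and the weighted sum itself come from the text above.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    confidence: float  # P_confidence: original ASR confidence
    lm_score: float    # S_LM: language model score
    acoustic: float    # S_acoustic: acoustic similarity


def fusion_score(h, alpha=0.4, beta=0.35, gamma=0.25):
    """Score = alpha * P_confidence + beta * S_LM + gamma * S_acoustic."""
    return alpha * h.confidence + beta * h.lm_score + gamma * h.acoustic


def select_best(hypotheses, **weights):
    """Pick the hypothesis with the highest fused score."""
    return max(hypotheses, key=lambda h: fusion_score(h, **weights))


# Made-up scores for two competing hypotheses of the same segment.
primary = Hypothesis("nine one two tree", 0.80, 0.40, 0.90)
secondary = Hypothesis("nine one two three", 0.70, 0.85, 0.85)

# primary:   0.4*0.80 + 0.35*0.40 + 0.25*0.90 = 0.685
# secondary: 0.4*0.70 + 0.35*0.85 + 0.25*0.85 = 0.790
print(select_best([primary, secondary]).text)  # nine one two three
```

In this example the secondary hypothesis wins despite its lower ASR confidence, because the language model term outweighs the difference.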
Anti-Hallucination Safeguards
The LLM polishing stage includes multiple safeguards:
- Number Preservation: All numeric sequences must appear unchanged
- Overlap Validation: Enhanced text must maintain >50% word overlap with the original transcript
- Grounding Prompts: Explicit instructions to only fix errors, not add content
- Retry Logic: Multiple attempts with validation between each
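The first two safeguards lend themselves to a simple post-check on the LLM output. The following is a minimal sketch under stated assumptions: numbers are compared as ordered digit sequences, overlap is measured as the fraction of distinct original words that survive, and all function names are hypothetical. Grounding prompts and retry logic are not covered.

```python
import re


def extract_numbers(text):
    """All digit sequences (with optional decimal part), in order."""
    return re.findall(r"\d+(?:\.\d+)?", text)


def word_overlap(original, enhanced):
    """Fraction of distinct original words that also occur in the output."""
    original_words = set(re.findall(r"[\w']+", original.lower()))
    enhanced_words = set(re.findall(r"[\w']+", enhanced.lower()))
    if not original_words:
        return 1.0
    return len(original_words & enhanced_words) / len(original_words)


def passes_safeguards(original, enhanced, min_overlap=0.5):
    """Accept the LLM output only if numbers are unchanged and
    word overlap with the original is above `min_overlap`."""
    numbers_kept = extract_numbers(original) == extract_numbers(enhanced)
    return numbers_kept and word_overlap(original, enhanced) > min_overlap


original = "please transfer 2500 rupees to account 91234"
print(passes_safeguards(original, "Please transfer 2500 rupees to account 91234."))
print(passes_safeguards(original, "Please transfer 2600 rupees to account 91234."))
```

A rejected output would typically trigger another attempt or a fallback to the unpolished transcript.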
Testing
# Run all tests
pytest
# Run with coverage
pytest --cov=asr_enhancer --cov-report=html
# Run specific test file
pytest tests/test_detectors.py -v
Docker Deployment
# Build and run with Docker Compose
docker-compose up -d
# View logs
docker-compose logs -f asr-enhancer
# Pull Ollama model (first time)
docker exec asr-enhancer-ollama ollama pull llama3.1
Next Implementation Steps
Phase 1: Core Implementation (Current)
- Project scaffolding
- Module stubs with interfaces
- FastAPI service structure
- CLI tool skeleton
- Docker configuration
Phase 2: Detection & Analysis
- Implement sliding window confidence detection
- Add acoustic anomaly detection
- Build numeric gap pattern matching
- Unit tests for detectors
Phase 3: Secondary ASR
- Whisper backend integration
- Audio segment extraction
- Batch processing support
- Latency optimization
Phase 4: Numeric Reconstruction
- Pattern analyzer for phone/OTP/amounts
- Acoustic confusion correction
- Sequence completion rules
- Validation with Luhn checks
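For reference, the Luhn check mentioned above is a standard checksum used for card numbers. A compact version might look like the following; the function name is an assumption, and how the package wires it into `validators.py` is not specified here.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum.

    Non-digit characters such as spaces or dashes are ignored.
    """
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # True
print(luhn_valid("4111 1111 1111 1112"))  # False
```

A reconstructed card number that fails this check is a strong hint that at least one digit was misrecognized or dropped.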
Phase 5: Domain Vocabulary
- Lexicon file format and loading
- Fuzzy matching implementation
- Phonetic matching (Soundex/Metaphone)
- Case-preserving correction
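As an illustration of the fuzzy-matching step, the sketch below uses the standard library's `difflib.get_close_matches` against a lexicon shaped like the API's `domain_lexicon` field (canonical term mapped to known variants). The lexicon contents, the `correct_terms` name, and the 0.8 cutoff are assumptions; phonetic matching and case preservation are not covered.

```python
import difflib


def correct_terms(transcript, lexicon, cutoff=0.8):
    """Replace words that closely match a lexicon entry with its canonical term.

    `lexicon` maps a canonical term to a list of known variants.
    """
    # Map every known surface form (lowercased) to its canonical term.
    surface_to_term = {}
    for term, variants in lexicon.items():
        surface_to_term[term.lower()] = term
        for variant in variants:
            surface_to_term[variant.lower()] = term
    surfaces = list(surface_to_term)

    corrected = []
    for word in transcript.split():
        match = difflib.get_close_matches(word.lower(), surfaces, n=1, cutoff=cutoff)
        corrected.append(surface_to_term[match[0]] if match else word)
    return " ".join(corrected)


# Hypothetical lexicon for illustration.
lexicon = {"Parakeet": ["paraket"], "Riva": ["reva"]}
print(correct_terms("the parakeat model runs on reva", lexicon))
```

A high cutoff keeps short common words from being pulled toward lexicon entries; lowering it trades precision for recall.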
Phase 6: LLM Integration
- Prompt template refinement
- Multi-provider support testing
- Anti-hallucination validation
- Fallback strategies
Phase 7: Fusion & Validation
- N-best hypothesis fusion
- Language model perplexity scoring
- Consistency validation
- Completeness checks
Phase 8: Production Hardening
- Performance benchmarks
- Memory optimization
- Streaming support
- Monitoring & metrics
- Load testing
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Run tests: pytest
- Run linting: ruff check . && black --check .
- Submit a pull request
License
MIT License - see LICENSE file for details.
Related Projects
- NVIDIA Parakeet - Multilingual ASR
- OpenAI Whisper - General-purpose ASR
- NVIDIA Riva - Streaming ASR platform