
ASR Quality Enhancement Layer for Parakeet Multilingual ASR

Project description

ASR Quality Enhancement Layer

A production-grade post-processing pipeline for improving Parakeet Multilingual ASR outputs. The system addresses common ASR failure modes through low-confidence word detection, numeric sequence reconstruction, domain vocabulary correction, and LLM-based contextual polishing.

🎯 Overview

The ASR Enhancement Layer sits between the Parakeet ASR engine and downstream applications, providing:

  • Error Detection: Identifies low-confidence spans, anomalies, and incomplete sequences
  • Secondary ASR: Re-transcribes problematic segments using Whisper/Riva
  • Numeric Reconstruction: Recovers missing digits in phone numbers, OTPs, amounts
  • Domain Vocabulary: Applies domain-specific terminology corrections
  • LLM Polishing: Fixes grammar and coherence with anti-hallucination safeguards
  • Hypothesis Fusion: Combines multiple ASR outputs using weighted scoring

๐Ÿ“ Architecture

┌──────────────────────────────────────────────────────────────────┐
│                  ASR QUALITY ENHANCEMENT LAYER                   │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐        │
│  │   Parakeet   │───▶│    Error     │───▶│   Re-ASR     │        │
│  │   ASR Input  │    │  Detection   │    │  Processing  │        │
│  └──────────────┘    └──────────────┘    └──────────────┘        │
│         │                   │                   │                │
│         │           ┌───────┴───────┐           │                │
│         │           │ • Confidence  │           │                │
│         │           │ • Anomalies   │           │                │
│         │           │ • Numeric     │           │                │
│         │           └───────────────┘           │                │
│         ▼                                       ▼                │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐        │
│  │   Numeric    │───▶│   Domain     │───▶│  Hypothesis  │        │
│  │ Reconstruct  │    │   Vocab      │    │    Fusion    │        │
│  └──────────────┘    └──────────────┘    └──────────────┘        │
│         │                   │                   │                │
│         │           ┌───────┴───────┐           │                │
│         │           │ • Lexicons    │           │                │
│         │           │ • Fuzzy Match │           │                │
│         │           │ • Phonetic    │           │                │
│         │           └───────────────┘           │                │
│         ▼                                       ▼                │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐        │
│  │     LLM      │───▶│  Validation  │───▶│   Enhanced   │        │
│  │  Polishing   │    │  & Scoring   │    │   Output     │        │
│  └──────────────┘    └──────────────┘    └──────────────┘        │
│         │                   │                                    │
│         │           ┌───────┴───────┐                            │
│         │           │ • Consistency │                            │
│         │           │ • Perplexity  │                            │
│         │           │ • Completeness│                            │
│         │           └───────────────┘                            │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

๐Ÿ“ Project Structure

asr_enhancer/
├── __init__.py              # Package exports
├── core.py                  # Main EnhancementPipeline orchestrator
├── detectors/               # Error detection modules
│   ├── confidence_detector.py  # Low-confidence span detection
│   ├── anomaly_detector.py     # Segmentation/repetition anomalies
│   └── numeric_gap_detector.py # Incomplete number sequences
├── resynthesis/             # Secondary ASR processing
│   ├── segment_extractor.py    # Audio segment extraction
│   ├── secondary_asr.py        # ASR backend abstraction
│   ├── whisper_backend.py      # Whisper integration
│   └── riva_backend.py         # NVIDIA Riva integration
├── numeric/                 # Numeric reconstruction
│   ├── pattern_analyzer.py     # Number pattern detection
│   ├── sequence_reconstructor.py # Digit recovery
│   └── validators.py           # Phone/OTP/card validation
├── vocab/                   # Domain vocabulary
│   ├── lexicon_loader.py       # Lexicon loading
│   ├── term_matcher.py         # Term matching (fuzzy/phonetic)
│   └── corrector.py            # Vocabulary correction
├── llm/                     # LLM integration
│   ├── context_restorer.py     # Main LLM processor
│   ├── prompt_templates.py     # Anti-hallucination prompts
│   └── providers.py            # OpenAI/Ollama/Anthropic
├── fusion/                  # Hypothesis fusion
│   ├── fusion_engine.py        # N-best combination
│   ├── scorers.py              # Acoustic/LM scoring
│   └── selector.py             # Candidate selection
├── validators/              # Quality validation
│   ├── consistency_checker.py  # Content consistency
│   ├── perplexity_scorer.py    # Fluency scoring
│   └── completeness_validator.py # Gap detection
├── utils/                   # Utilities
│   ├── config.py               # Configuration management
│   ├── logging.py              # Structured logging
│   ├── audio.py                # Audio utilities
│   └── text.py                 # Text utilities
├── api/                     # FastAPI service
│   ├── main.py                 # Application entry
│   ├── routes.py               # API endpoints
│   └── schemas.py              # Pydantic models
└── cli/                     # Command-line interface
    └── __init__.py             # CLI commands

🚀 Quick Start

Installation

# Clone the repository
git clone <repository-url>
cd sound-web

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -e ".[all]"

Basic Usage

import asyncio

from asr_enhancer import EnhancementPipeline
from asr_enhancer.utils import Config


async def main():
    # Initialize pipeline
    config = Config(
        confidence_threshold=0.7,
        llm_provider="ollama",
        llm_model="llama3.1",
    )
    pipeline = EnhancementPipeline(config)

    # Enhance transcript (enhance() is a coroutine, so it must be
    # awaited inside an async context)
    result = await pipeline.enhance(
        transcript="my phone number is nine one two tree four five six seven ate nine",
        word_timestamps=[
            {"word": "my", "start": 0.0, "end": 0.2},
            {"word": "phone", "start": 0.2, "end": 0.5},
            # ... more timestamps
        ],
        word_confidences=[0.95, 0.92, 0.89, 0.98, 0.85, 0.91, 0.88, 0.45,
                          0.92, 0.87, 0.90, 0.93, 0.38, 0.91],
    )

    print(f"Enhanced: {result.enhanced_transcript}")
    print(f"Confidence improvement: {result.confidence_improvement:.2%}")


asyncio.run(main())

API Server

# Start the API server
asr-enhancer serve --host 0.0.0.0 --port 8000

# Or with Docker
docker-compose up -d

CLI Usage

# Enhance a transcript file
asr-enhancer enhance input.json -o output.json --format json

# Analyze without enhancement
asr-enhancer analyze input.json

# Check dependencies
asr-enhancer check

🔌 API Endpoints

POST /api/v1/enhance

Enhance a transcript using the full pipeline.

{
  "transcript": "raw transcript text",
  "word_timestamps": [{"word": "...", "start": 0.0, "end": 0.1}],
  "word_confidences": [0.9, 0.8, ...],
  "audio_path": "/path/to/audio.wav",  // optional
  "domain_lexicon": {"term": ["variant1", "variant2"]}  // optional
}
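As a sketch of client-side usage, the payload above can be assembled and sanity-checked before POSTing. The field names come from the schema above; the length check (one timestamp and one confidence per word) is an assumption about what the server expects, and the helper itself is illustrative rather than part of the package:

```python
import json


def build_enhance_request(transcript, word_timestamps, word_confidences,
                          audio_path=None, domain_lexicon=None):
    """Assemble a /api/v1/enhance request body, verifying that the
    per-word metadata lines up with the transcript's word count."""
    words = transcript.split()
    if len(word_timestamps) != len(words) or len(word_confidences) != len(words):
        raise ValueError("timestamps/confidences must match the word count")
    payload = {
        "transcript": transcript,
        "word_timestamps": word_timestamps,
        "word_confidences": word_confidences,
    }
    # Optional fields are only included when provided
    if audio_path:
        payload["audio_path"] = audio_path
    if domain_lexicon:
        payload["domain_lexicon"] = domain_lexicon
    return json.dumps(payload)
```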

POST /api/v1/analyze

Analyze transcript without enhancement.

GET /api/v1/diagnostics

Get pipeline diagnostics and configuration.

GET /health

Health check endpoint.

โš™๏ธ Configuration

Configuration can be set via:

  1. Configuration file (config.json)
  2. Environment variables
  3. Code

Key Settings

Setting               | Default    | Description
----------------------|------------|------------------------------------------------
confidence_threshold  | 0.7        | Threshold for low-confidence detection
sliding_window_size   | 3          | Window size for confidence smoothing
secondary_asr_backend | "whisper"  | Backend for re-ASR ("whisper", "riva")
llm_provider          | "ollama"   | LLM provider ("openai", "ollama", "anthropic")
llm_model             | "llama3.1" | LLM model name
fusion_alpha          | 0.4        | Weight for original ASR confidence
fusion_beta           | 0.35       | Weight for language model score
fusion_gamma          | 0.25       | Weight for acoustic similarity

Environment Variables

export ASR_CONFIDENCE_THRESHOLD=0.7
export ASR_LLM_PROVIDER=ollama
export ASR_LLM_MODEL=llama3.1
export ASR_LLM_API_KEY=your-api-key  # For OpenAI/Anthropic
export ASR_LOG_LEVEL=INFO
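A loader for these variables might look like the sketch below. The variable names and defaults are taken from the tables above; the loader function itself is illustrative, not the package's actual `utils.config` implementation:

```python
import os


def load_config_from_env():
    """Read ASR_* environment variables, falling back to the
    documented defaults when a variable is unset."""
    return {
        "confidence_threshold": float(os.environ.get("ASR_CONFIDENCE_THRESHOLD", "0.7")),
        "llm_provider": os.environ.get("ASR_LLM_PROVIDER", "ollama"),
        "llm_model": os.environ.get("ASR_LLM_MODEL", "llama3.1"),
        "log_level": os.environ.get("ASR_LOG_LEVEL", "INFO"),
    }
```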

📊 Fusion Formula

The hypothesis fusion uses weighted scoring:

$$Score = \alpha \cdot P_{confidence} + \beta \cdot S_{LM} + \gamma \cdot S_{acoustic}$$

Where:

  • $\alpha$ = Original ASR confidence weight (default: 0.4)
  • $\beta$ = Language model score weight (default: 0.35)
  • $\gamma$ = Acoustic similarity weight (default: 0.25)
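In code, the fusion step reduces to a weighted sum over each hypothesis's three scores. The sketch below uses the default weights; the tuple layout for hypotheses is an assumption, not the project's actual data structure:

```python
def fused_score(p_confidence, s_lm, s_acoustic,
                alpha=0.4, beta=0.35, gamma=0.25):
    """Score = alpha * P_confidence + beta * S_LM + gamma * S_acoustic,
    using the documented default weights."""
    return alpha * p_confidence + beta * s_lm + gamma * s_acoustic


def select_best(hypotheses):
    """Pick the hypothesis with the highest fused score.
    Each hypothesis is (text, p_confidence, s_lm, s_acoustic)."""
    return max(hypotheses, key=lambda h: fused_score(h[1], h[2], h[3]))[0]
```

Note that a hypothesis with weaker raw ASR confidence can still win if its language-model and acoustic scores are strong enough.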

๐Ÿ›ก๏ธ Anti-Hallucination Safeguards

The LLM polishing stage includes multiple safeguards:

  1. Number Preservation: All numeric sequences must appear unchanged
  2. Overlap Validation: Enhanced text must maintain >50% word overlap
  3. Grounding Prompts: Explicit instructions to only fix errors, not add content
  4. Retry Logic: Multiple attempts with validation between each
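Safeguards 1 and 2 can be approximated in a few lines of validation. This sketch treats number preservation as an ordered digit-sequence match and overlap as a word-set ratio; both are simplifications of whatever the pipeline's validators actually do:

```python
import re


def digits_preserved(original, enhanced):
    """Safeguard 1: every numeric sequence in the original must appear
    unchanged, and in the same order, in the enhanced text."""
    return re.findall(r"\d+", original) == re.findall(r"\d+", enhanced)


def overlap_ok(original, enhanced, threshold=0.5):
    """Safeguard 2: the enhanced text must retain more than `threshold`
    of the original words (simple set-overlap approximation)."""
    orig = set(original.lower().split())
    enh = set(enhanced.lower().split())
    return len(orig & enh) / max(len(orig), 1) > threshold
```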

🧪 Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=asr_enhancer --cov-report=html

# Run specific test file
pytest tests/test_detectors.py -v

๐Ÿณ Docker Deployment

# Build and run with Docker Compose
docker-compose up -d

# View logs
docker-compose logs -f asr-enhancer

# Pull Ollama model (first time)
docker exec asr-enhancer-ollama ollama pull llama3.1

📈 Next Implementation Steps

Phase 1: Core Implementation (Current)

  • Project scaffolding
  • Module stubs with interfaces
  • FastAPI service structure
  • CLI tool skeleton
  • Docker configuration

Phase 2: Detection & Analysis

  • Implement sliding window confidence detection
  • Add acoustic anomaly detection
  • Build numeric gap pattern matching
  • Unit tests for detectors
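A sliding-window detector of the kind planned here might look like the following sketch, using the documented defaults (threshold 0.7, window size 3). The exact windowing and flagging rules are assumptions, not the project's implementation:

```python
def low_confidence_spans(confidences, threshold=0.7, window=3):
    """Flag word indices whose windowed mean confidence falls below
    the threshold, so a single low-confidence word also implicates
    its neighbors."""
    flagged = []
    for i in range(len(confidences)):
        # Center the window on word i, clipping at both ends
        lo = max(0, i - window // 2)
        hi = min(len(confidences), i + window // 2 + 1)
        if sum(confidences[lo:hi]) / (hi - lo) < threshold:
            flagged.append(i)
    return flagged
```

Smoothing like this is what turns an isolated dip (e.g. the 0.45 on "tree" in the Quick Start example) into a contiguous span worth re-transcribing.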

Phase 3: Secondary ASR

  • Whisper backend integration
  • Audio segment extraction
  • Batch processing support
  • Latency optimization

Phase 4: Numeric Reconstruction

  • Pattern analyzer for phone/OTP/amounts
  • Acoustic confusion correction
  • Sequence completion rules
  • Validation with Luhn checks
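Two of these steps are easy to sketch: correcting acoustic confusions such as "tree" → "three" with a lookup table, and validating card numbers with the Luhn checksum. The confusion table below is illustrative, not the project's actual lexicon:

```python
# Common ASR digit-word confusions (illustrative, not exhaustive)
DIGIT_CONFUSIONS = {"tree": "three", "ate": "eight", "won": "one"}


def fix_digit_words(words):
    """Replace known digit-word confusions, leaving other words alone."""
    return [DIGIT_CONFUSIONS.get(w, w) for w in words]


def luhn_valid(number):
    """Luhn checksum: double every second digit from the right,
    subtract 9 from results over 9, and require a total divisible by 10."""
    digits = [int(d) for d in str(number)][::-1]
    total = sum(digits[0::2])
    for d in digits[1::2]:
        d *= 2
        total += d - 9 if d > 9 else d
    return total % 10 == 0
```

In practice a confusion table like this would only be applied inside spans already identified as numeric, so ordinary uses of "ate" or "won" survive untouched.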

Phase 5: Domain Vocabulary

  • Lexicon file format and loading
  • Fuzzy matching implementation
  • Phonetic matching (Soundex/Metaphone)
  • Case-preserving correction
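For the fuzzy-matching step, Python's standard-library difflib can stand in for the project's matcher. This is a sketch under that substitution, not the package's `term_matcher` implementation:

```python
import difflib


def correct_term(word, lexicon, cutoff=0.8):
    """Snap an ASR word onto the closest domain term when the
    similarity ratio clears the cutoff; otherwise return it as-is."""
    matches = difflib.get_close_matches(word.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word
```

A phonetic pass (Soundex/Metaphone, as planned above) would catch confusions that edit distance misses, such as homophones spelled very differently.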

Phase 6: LLM Integration

  • Prompt template refinement
  • Multi-provider support testing
  • Anti-hallucination validation
  • Fallback strategies

Phase 7: Fusion & Validation

  • N-best hypothesis fusion
  • Language model perplexity scoring
  • Consistency validation
  • Completeness checks

Phase 8: Production Hardening

  • Performance benchmarks
  • Memory optimization
  • Streaming support
  • Monitoring & metrics
  • Load testing

๐Ÿค Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests: pytest
  5. Run linting: ruff check . && black --check .
  6. Submit a pull request

๐Ÿ“ License

MIT License - see LICENSE file for details.

🔗 Related Projects



Download files

Download the file for your platform.

Source Distribution

asr_enhancer-0.2.1.tar.gz (106.8 kB)

Uploaded Source

Built Distribution


asr_enhancer-0.2.1-py3-none-any.whl (118.8 kB)

Uploaded Python 3

File details

Details for the file asr_enhancer-0.2.1.tar.gz.

File metadata

  • Download URL: asr_enhancer-0.2.1.tar.gz
  • Upload date:
  • Size: 106.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for asr_enhancer-0.2.1.tar.gz
Algorithm   | Hash digest
------------|-----------------------------------------------------------------
SHA256      | 8c849b67e99882cb3c7d556ea026975da312de00a843871bd1740731a704dee3
MD5         | f2f83150ef7ca3f509da552250650e2b
BLAKE2b-256 | f1b121076fe91202fa580ed45a57379b24dca1cba6e6f9c2d48661f56d2bf51e


File details

Details for the file asr_enhancer-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: asr_enhancer-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 118.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for asr_enhancer-0.2.1-py3-none-any.whl
Algorithm   | Hash digest
------------|-----------------------------------------------------------------
SHA256      | a3c89355db206c5e210c9df46a6f62957912b5df51ff431e2e31d4ca27850612
MD5         | a1ede3b1fe568408a537e5f96218e71d
BLAKE2b-256 | 86cbd6b0d6e3d9a2091c498749b11fefc78bf6856a23efaa7db79f2aa471ccc3

