Taiwan Mandarin Phonetic Similarity Processor - 台灣國語語音相似性處理系統

Project description

TWGY - Taiwan Mandarin Phonetic Similarity Processor

台灣國語語音相似性處理系統

TWGY is a phonetic similarity processing system optimized for Taiwan Mandarin pronunciation variations. It provides ASR (Automatic Speech Recognition) post-processing and phonetic similarity analysis for Chinese text.

🎯 Key Features

Core Functionality

  • Three-Layer Architecture: L1 consonant filtering → L2 first/last similarity → L3 full phonetic analysis
  • Taiwan Mandarin Optimized: Handles common Taiwan pronunciation variations:
    • 平翹舌不分 (Retroflex/non-retroflex confusion)
    • 前後鼻音不分 (Front/back nasal confusion)
    • 邊鼻音不分 (Lateral/nasal confusion)
  • 170,000+ Word Dictionary: Comprehensive Chinese word coverage
  • High Performance: <250ms processing time with concurrent query support
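The three-layer flow can be illustrated with a toy sketch. This is not TWGY's actual implementation: the confusion tables, the scoring rule, the five-word lexicon, and the `l2_threshold` parameter are all assumptions made for illustration, and tones are ignored.

```python
from typing import Dict, List, Tuple

Syllable = Tuple[str, str]  # (initial, final); tones are ignored in this toy model

# Assumed confusion classes for the variation types listed above.
INITIAL_CLASS = {"zh": "z", "ch": "c", "sh": "s", "n": "l"}  # retroflex and n/l merges
FINAL_CLASS = {"ang": "an", "eng": "en", "ing": "in"}        # front/back nasal merge

# Tiny hand-written lexicon standing in for the real dictionary.
LEXICON: Dict[str, List[Syllable]] = {
    "知道": [("zh", "i"), ("d", "ao")],
    "指導": [("zh", "i"), ("d", "ao")],
    "資料": [("z", "i"), ("l", "iao")],
    "吃飯": [("ch", "i"), ("f", "an")],
    "自動": [("z", "i"), ("d", "ong")],
}


def initial_class(initial: str) -> str:
    return INITIAL_CLASS.get(initial, initial)


def final_class(final: str) -> str:
    return FINAL_CLASS.get(final, final)


def syllable_score(a: Syllable, b: Syllable) -> float:
    """Half a point each for a matching initial class and a matching final class."""
    score = 0.0
    if initial_class(a[0]) == initial_class(b[0]):
        score += 0.5
    if final_class(a[1]) == final_class(b[1]):
        score += 0.5
    return score


def rerank(query: List[Syllable], l2_threshold: float = 0.6) -> List[Tuple[str, float]]:
    # L1: cheap filter on syllable count and the first syllable's initial class.
    l1 = {
        word: syls for word, syls in LEXICON.items()
        if len(syls) == len(query)
        and initial_class(syls[0][0]) == initial_class(query[0][0])
    }
    # L2: compare only the first and last syllables, drop weak matches.
    l2 = {
        word: syls for word, syls in l1.items()
        if (syllable_score(query[0], syls[0])
            + syllable_score(query[-1], syls[-1])) / 2 >= l2_threshold
    }
    # L3: score every syllable and sort by the mean, best first.
    scored = [
        (word, sum(syllable_score(q, s) for q, s in zip(query, syls)) / len(query))
        for word, syls in l2.items()
    ]
    return sorted(scored, key=lambda pair: -pair[1])


# "資道" (zi dao), a typical ASR misrecognition of "知道" (zhi dao).
ranked = rerank([("z", "i"), ("d", "ao")])
print(ranked)  # [('知道', 1.0), ('指導', 1.0), ('自動', 0.75)]
```

Each layer is more expensive than the one before it but sees fewer candidates, which is the usual motivation for a cascade like this. For two-syllable words the L2 and L3 scores coincide; they only diverge on longer words.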

Advanced Features

  • DimSim Integration: Enhanced similarity scoring with deep learning models
  • Batch Processing: Efficient handling of multiple queries
  • Caching System: Optimized performance with intelligent caching
  • Training Data Collection: Automatic data logging for model improvement
  • CLI Interface: Command-line tools for easy usage
  • RESTful API Ready: Can be easily wrapped into web services
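The README does not say how the caching system is implemented. One common approach is memoizing repeated queries, sketched below with the standard library's `functools.lru_cache`. The names `expensive_rerank` and `cached_rerank` are hypothetical stand-ins, not part of the TWGY API.

```python
from functools import lru_cache
from typing import Tuple

CALLS = {"count": 0}  # counts how often the slow path actually runs


def expensive_rerank(query: str) -> Tuple[str, ...]:
    """Hypothetical stand-in for a slow rerank call."""
    CALLS["count"] += 1
    return (query,)


@lru_cache(maxsize=1024)
def cached_rerank(query: str) -> Tuple[str, ...]:
    # Results are returned as tuples so cached values cannot be mutated by callers.
    return expensive_rerank(query)


first = cached_rerank("知道")
second = cached_rerank("知道")  # served from the cache, no second slow call
print(CALLS["count"], cached_rerank.cache_info().hits)  # 1 1
```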

📦 Installation

From PyPI (Recommended)

pip install twgy

From Source

git clone https://github.com/yourusername/twgy
cd twgy
pip install -e .

Development Installation

git clone https://github.com/yourusername/twgy
cd twgy
pip install -e ".[dev]"

Optional Dependencies

# For enhanced features
pip install "twgy[full]"

# For API development
pip install "twgy[api]"

# All features
pip install "twgy[full,api,dev]"

🚀 Quick Start

Basic Usage

from twgy import PhoneticReranker

# Initialize the reranker
reranker = PhoneticReranker()

# Find similar words
result = reranker.rerank("知道")
print(result.candidates[:5])
# Output: ['知道', '指導', '智道', '志道', '制導']

# Check processing details
print(f"Processing time: {result.processing_time_ms:.1f}ms")
print(f"Pipeline: {result.l1_candidates_count} → {result.l2_candidates_count} → {result.l3_candidates_count}")

Convenience Functions

from twgy import quick_rerank, get_similar_words, batch_process

# Quick single query
similar = quick_rerank("知道", max_candidates=5)
print(similar)
# Output: ['知道', '指導', '智道', '志道', '制導']

# Get similarity scores
similar_with_scores = get_similar_words("知道", threshold=0.7)
for item in similar_with_scores[:3]:
    print(f"{item['word']}: {item['similarity']:.2f}")
# Output:
# 指導: 0.85
# 智道: 0.80
# 志道: 0.75

# Batch processing
words = ["知道", "資道", "吃飯"]
results = batch_process(words)
for result in results:
    print(f"{result.query}: {len(result.candidates)} candidates")

Advanced Configuration

from twgy import PhoneticReranker, RerankerConfig

# Custom configuration
config = RerankerConfig(
    l3_top_k=20,                    # Return top 20 candidates
    enable_dimsim=True,             # Enable DimSim reranking
    dimsim_stage="L2",              # Apply DimSim at L2 stage
    dimsim_weight=0.3,              # DimSim score weight
    max_processing_time_ms=500.0,   # Performance timeout
    enable_training_data_logging=True  # Collect training data
)

reranker = PhoneticReranker(config)
result = reranker.rerank("語音辨識")
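The README does not document how `dimsim_weight` enters the final score. A plausible reading is a linear blend of the phonetic score and the DimSim score, sketched below. The function `blend_scores` and the blending formula are assumptions for illustration, not the library's confirmed behavior.

```python
def blend_scores(phonetic: float, dimsim: float, dimsim_weight: float = 0.3) -> float:
    """Linear interpolation between two similarity scores (assumed formula)."""
    if not 0.0 <= dimsim_weight <= 1.0:
        raise ValueError("dimsim_weight must be between 0 and 1")
    return (1.0 - dimsim_weight) * phonetic + dimsim_weight * dimsim


# With weight 0.3, the phonetic score contributes 70% of the result.
combined = blend_scores(phonetic=0.8, dimsim=0.5, dimsim_weight=0.3)
print(f"{combined:.2f}")  # 0.71
```

Under this reading, a weight of 0 disables the DimSim contribution and a weight of 1 ignores the phonetic score entirely.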

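🚀 Quick Start (Source Checkout)

Requirements

  • Python 3.8+
  • MoeDict (萌典) data installed (170,000 words)
  • MPS/CUDA acceleration recommended

Installation and Setup

# Enter the project directory
cd TWGY_V3

# Install dependencies
pip install -r requirements.txt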

Basic Usage

from src.phonetic_reranker import PhoneticReranker

# Initialize the system (loads the 170,000-word dictionary automatically)
reranker = PhoneticReranker()

# ASR error correction
result = reranker.rerank("資道")  # misrecognized input
print(result.candidates[:5])     # ['知道', '自動', '指導', '資料', '指標']
print(f"Processing time: {result.processing_time_ms:.1f}ms")  # Processing time: 142.3ms
print(f"Confidence: {result.confidence_score:.2f}")           # Confidence: 0.78

# Batch-process ASR output
queries = ["資道", "次飯", "醬瓜"]
results = reranker.batch_rerank(queries)
for result in results:
    print(f"{result.query} → {result.candidates[0]}")
    # 資道 → 知道
    # 次飯 → 吃飯
    # 醬瓜 → 將瓜

Advanced Configuration

from src.phonetic_reranker import PhoneticReranker, RerankerConfig

# Custom configuration
config = RerankerConfig(
    l3_top_k=20,                        # Return the top 20 candidates
    enable_training_data_logging=True,  # Enable training data collection
    max_processing_time_ms=200.0        # Processing time limit: 200 ms
)

reranker = PhoneticReranker(config)

# Process a query with data collection enabled
result = reranker.rerank("知道")

# Export the training data when the session ends
session_summary = reranker.finalize_session()
print(f"Collected {session_summary.total_queries} training cases")
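The format of the collected training data is not described here. As a rough illustration of what per-query logging could look like, the sketch below writes one JSON record per line (JSON Lines). The `log_case` helper and the record fields are hypothetical, not TWGY's actual schema.

```python
import io
import json
from typing import List, Optional, TextIO


def log_case(stream: TextIO, query: str, candidates: List[str],
             chosen: Optional[str] = None) -> None:
    """Append one training case as a JSON line (hypothetical record layout)."""
    record = {"query": query, "candidates": candidates, "chosen": chosen}
    # ensure_ascii=False keeps Chinese text readable in the log file.
    stream.write(json.dumps(record, ensure_ascii=False) + "\n")


# An in-memory buffer stands in for a log file opened in append mode.
buffer = io.StringIO()
log_case(buffer, "資道", ["知道", "自動"], chosen="知道")
log_case(buffer, "次飯", ["吃飯"])

records = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(len(records))  # 2
```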

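🧪 Testing and Validation

Running the Full Test Suite

# Core component tests
python test_l1_consonant_filter.py        # L1 consonant (initial) filter tests
python test_l2_first_last_reranker.py     # L2 first/last reranking tests
python test_l3_full_phonetic.py           # L3 full phonetic ranking tests

# Integration tests
python test_l1_l2_integration.py          # L1+L2 integration tests
python test_full_pipeline.py              # Full three-layer pipeline tests

# Main API test
python src/phonetic_reranker.py           # Main API functional test

# Final deployment validation (89.5% pass rate)
python test_final_deployment.py           # Deployment readiness validation

Usage Example

# Full usage example demo
python example_usage.py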

📝 Use Cases

1. ASR Error Correction

# Post-process speech recognition output
asr_errors = ["資道", "次飯", "醬瓜"]
for asr_output in asr_errors:
    result = reranker.rerank(asr_output)
    corrected = result.candidates[0]
    print(f"ASR correction: {asr_output} → {corrected}")
    # ASR correction: 資道 → 知道
    # ASR correction: 次飯 → 吃飯
    # ASR correction: 醬瓜 → 將瓜

2. Phonetically Similar Word Search

# Find phonetically similar words
similar_words = reranker.get_similar_words(
    "知道",
    similarity_threshold=0.6,
    max_results=10
)
for sim_word in similar_words:
    print(f"{sim_word['word']}: {sim_word['similarity']:.2f}")

3. Batch Processing Service

# Efficient batch processing (supports concurrency)
batch_queries = ["資道", "次飯", "醬瓜", "安全"] * 25  # 100 queries
batch_results = reranker.batch_rerank(batch_queries)

# Summarize the batch results
successful = [r for r in batch_results if not r.error]
# Guard against division by zero when every query failed
avg_time = sum(r.processing_time_ms for r in successful) / max(len(successful), 1)
print(f"Batch processing: {len(successful)}/{len(batch_queries)} succeeded")
print(f"Average processing time: {avg_time:.1f}ms")
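How `batch_rerank` handles concurrency internally is not shown. The sketch below illustrates one way to run queries concurrently while capturing per-query errors, using the standard library's `ThreadPoolExecutor`. `BatchItem`, `run_batch`, and `fake_rerank` are hypothetical names, and the stub reranker simply echoes its input.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class BatchItem:
    query: str
    candidates: List[str]
    error: Optional[str] = None  # set instead of raising, so one failure cannot sink the batch


def run_batch(queries: List[str], rerank_fn: Callable[[str], List[str]],
              max_workers: int = 4) -> List[BatchItem]:
    def process(query: str) -> BatchItem:
        try:
            return BatchItem(query, rerank_fn(query))
        except Exception as exc:
            return BatchItem(query, [], error=str(exc))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, regardless of completion order.
        return list(pool.map(process, queries))


def fake_rerank(query: str) -> List[str]:
    """Stub reranker: rejects empty input, otherwise echoes the query."""
    if not query:
        raise ValueError("empty query")
    return [query]


results = run_batch(["資道", "", "次飯"], fake_rerank)
print([r.error for r in results])  # [None, 'empty query', None]
```

Threads help most when the per-query work releases the GIL (I/O or native code); for pure-Python scoring a process pool may be the better fit.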

🔧 Development

Setup Development Environment

git clone https://github.com/yourusername/twgy
cd twgy
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -e ".[dev]"

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=twgy

# Run performance tests
pytest -m performance

# Run specific test categories
pytest -m "not slow"

Code Quality

# Format code
black twgy/

# Check style
flake8 twgy/

# Type checking
mypy twgy/

Building Package

# Build distribution
python -m build

# Install locally
pip install dist/twgy-3.0.0-py3-none-any.whl

📊 Performance Benchmarks

Processing Speed

  • Simple queries (e.g., "知道"): ~50-100ms
  • Medium queries (e.g., "語音辨識"): ~100-200ms
  • Complex queries (e.g., compound terms): ~200-250ms
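Figures like these depend on hardware and dictionary size, so it is worth measuring on your own machine. The sketch below is one way to collect latency statistics; `measure_ms` is a hypothetical helper, and the lambda is a stand-in for a real call such as `reranker.rerank`.

```python
import statistics
import time
from typing import Callable, Dict


def measure_ms(fn: Callable[[str], object], query: str, repeats: int = 20) -> Dict[str, float]:
    """Time repeated calls to fn(query) and summarize the latencies in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(query)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "median_ms": statistics.median(samples),
        # Approximate 95th percentile by index into the sorted samples.
        "p95_ms": samples[int(0.95 * (len(samples) - 1))],
        "max_ms": samples[-1],
    }


# Stand-in workload; replace the lambda with the function you want to benchmark.
stats = measure_ms(lambda q: sorted(q * 1000), "知道", repeats=5)
print({name: round(value, 3) for name, value in stats.items()})
```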

Memory Usage

  • Initial load: ~100MB (dictionary + models)
  • With caches: ~150MB (includes L1/L2/L3 caches)
  • Peak usage: ~200MB (during batch processing)

Accuracy Metrics

  • Exact match in top-5: >95%
  • Phonetically similar in top-10: >90%
  • Handles Taiwan variations: >85%
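The README does not state how these metrics were computed. A top-k hit rate of the kind quoted above could be measured as in the sketch below. `top_k_accuracy` is a hypothetical helper, and the lookup table is made-up test data, not TWGY output.

```python
from typing import Callable, List, Tuple


def top_k_accuracy(cases: List[Tuple[str, str]],
                   rerank_fn: Callable[[str], List[str]], k: int = 5) -> float:
    """Fraction of (query, expected) pairs whose expected word is in the top k candidates."""
    if not cases:
        return 0.0
    hits = sum(1 for query, expected in cases if expected in rerank_fn(query)[:k])
    return hits / len(cases)


# Made-up candidate lists standing in for a real reranker.
FAKE_OUTPUT = {"資道": ["知道", "自動"], "次飯": ["吃飯"], "醬瓜": ["將瓜"]}
cases = [("資道", "知道"), ("次飯", "吃飯"), ("醬瓜", "黃瓜")]

accuracy = top_k_accuracy(cases, lambda q: FAKE_OUTPUT.get(q, []), k=1)
print(f"{accuracy:.2f}")  # 0.67
```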

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Areas for Contribution

  • Performance optimization: Faster algorithms, better caching
  • Accuracy improvement: Better phonetic models, more test cases
  • Language support: Additional Chinese variants, multilingual support
  • Integration: Web APIs, cloud deployment, ML pipeline integration

Development Workflow

  1. Fork the repository
  2. Create a feature branch
  3. Make changes with tests
  4. Run quality checks
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Dictionary Sources: Various open Chinese dictionaries and corpora
  • Research: Based on Taiwan Mandarin phonetic variation studies
  • DimSim: Integration with DimSim similarity models
  • Community: Contributors and users who provided feedback

🔄 Changelog

v3.0.0 (Current)

  • Complete rewrite with three-layer architecture
  • DimSim integration for enhanced accuracy
  • Comprehensive CLI interface
  • Performance optimizations (<250ms processing)
  • Training data collection capabilities
  • Improved Taiwan Mandarin variation handling

v2.x (Legacy)

  • Basic phonetic similarity processing
  • Limited dictionary coverage
  • Single-layer processing

Made with ❤️ for the Chinese NLP community

TWGY v3.0.0 - Empowering Chinese language processing with Taiwan Mandarin phonetic intelligence
