Taiwan Mandarin Phonetic Similarity Processor - 台灣國語語音相似性處理系統

These details have not been verified by PyPI

Project links

Project description

TWGY - Taiwan Mandarin Phonetic Similarity Processor

台灣國語語音相似性處理系統

TWGY is a comprehensive phonetic similarity processing system specifically optimized for Taiwan Mandarin variations. It provides advanced ASR (Automatic Speech Recognition) post-processing capabilities and phonetic similarity analysis for Chinese text.

🎯 Key Features

Core Functionality

Three-Layer Architecture: L1 consonant filtering → L2 first/last similarity → L3 full phonetic analysis
Taiwan Mandarin Optimized: Handles common Taiwan pronunciation variations:
- 平翹舌不分 (Retroflex/non-retroflex confusion)
- 前後鼻音不分 (Front/back nasal confusion)
- 邊鼻音不分 (Lateral/nasal confusion)
170,000+ Word Dictionary: Comprehensive Chinese word coverage
High Performance: <250ms processing time with concurrent query support

Advanced Features

DimSim Integration: Enhanced similarity scoring with deep learning models
Batch Processing: Efficient handling of multiple queries
Caching System: Optimized performance with intelligent caching
Training Data Collection: Automatic data logging for model improvement
CLI Interface: Command-line tools for easy usage
RESTful API Ready: Can be easily wrapped into web services

📦 Installation

From PyPI (Recommended)

pip install twgy

From Source

git clone https://github.com/yourusername/twgy
cd twgy
pip install -e .

Development Installation

git clone https://github.com/yourusername/twgy
cd twgy
pip install -e ".[dev]"

Optional Dependencies

# For enhanced features
pip install "twgy[full]"

# For API development
pip install "twgy[api]"

# All features
pip install "twgy[full,api,dev]"

🚀 Quick Start

Basic Usage

from twgy import PhoneticReranker

# Initialize the reranker
reranker = PhoneticReranker()

# Find similar words
result = reranker.rerank("知道")
print(result.candidates[:5])
# Output: ['知道', '指導', '智道', '志道', '制導']

# Check processing details
print(f"Processing time: {result.processing_time_ms:.1f}ms")
print(f"Pipeline: {result.l1_candidates_count} → {result.l2_candidates_count} → {result.l3_candidates_count}")

Convenience Functions

from twgy import quick_rerank, get_similar_words, batch_process

# Quick single query
similar = quick_rerank("知道", max_candidates=5)
print(similar)
# Output: ['知道', '指導', '智道', '志道', '制導']

# Get similarity scores
similar_with_scores = get_similar_words("知道", threshold=0.7)
for item in similar_with_scores[:3]:
    print(f"{item['word']}: {item['similarity']:.2f}")
# Output:
# 指導: 0.85
# 智道: 0.80
# 志道: 0.75

# Batch processing
words = ["知道", "資道", "吃飯"]
results = batch_process(words)
for result in results:
    print(f"{result.query}: {len(result.candidates)} candidates")

Advanced Configuration

from twgy import PhoneticReranker, RerankerConfig

# Custom configuration
config = RerankerConfig(
    l3_top_k=20,                    # Return top 20 candidates
    enable_dimsim=True,             # Enable DimSim reranking
    dimsim_stage="L2",              # Apply DimSim at L2 stage
    dimsim_weight=0.3,              # DimSim score weight
    max_processing_time_ms=500.0,   # Performance timeout
    enable_training_data_logging=True  # Collect training data
)

reranker = PhoneticReranker(config)
result = reranker.rerank("語音辨識")

🚀 快速開始

環境要求

Python 3.8+
已安裝萌典數據(17萬詞)
推薦使用MPS/CUDA加速

安裝與初始化

# 進入項目目錄
cd TWGY_V3

# 安裝依賴
pip install -r requirements.txt

基礎使用

from src.phonetic_reranker import PhoneticReranker

# 初始化系統(自動載入17萬詞典)
reranker = PhoneticReranker()

# ASR錯誤修正
result = reranker.rerank("資道")  # 輸入錯誤識別
print(result.candidates[:5])     # ['知道', '自動', '指導', '資料', '指標']
print(f"處理時間: {result.processing_time_ms:.1f}ms")  # 處理時間: 142.3ms
print(f"信心度: {result.confidence_score:.2f}")       # 信心度: 0.78

# 批量處理ASR輸出
queries = ["資道", "次飯", "醬瓜"]
results = reranker.batch_rerank(queries)
for result in results:
    print(f"{result.query} → {result.candidates[0]}")
    # 資道 → 知道
    # 次飯 → 吃飯  
    # 醬瓜 → 將瓜

高級配置

from src.phonetic_reranker import PhoneticReranker, RerankerConfig

# 自定義配置
config = RerankerConfig(
    l3_top_k=20,                        # 返回前20個候選
    enable_training_data_logging=True,  # 啟用數據收集
    max_processing_time_ms=200.0        # 處理時間限制200ms
)

reranker = PhoneticReranker(config)

# 啟用數據收集的處理
result = reranker.rerank("知道")

# 會話結束時導出訓練數據
session_summary = reranker.finalize_session()
print(f"收集了 {session_summary.total_queries} 個訓練案例")

🧪 測試與驗證

運行完整測試套件

# 核心組件測試
python test_l1_consonant_filter.py        # L1聲母篩選測試
python test_l2_first_last_reranker.py     # L2首尾重排測試  
python test_l3_full_phonetic.py           # L3完整精排測試

# 整合測試
python test_l1_l2_integration.py          # L1+L2整合測試
python test_full_pipeline.py              # 完整三層測試

# 主API測試
python src/phonetic_reranker.py           # 主API功能測試

# 最終部署驗證(89.5%通過率)
python test_final_deployment.py           # 部署就緒驗證

使用範例

# 完整使用範例演示
python example_usage.py

📝 應用場景

1. ASR錯誤修正

# 語音識別後處理
asr_errors = ["資道", "次飯", "醬瓜"]
for asr_output in asr_errors:
    result = reranker.rerank(asr_output)
    corrected = result.candidates[0]
    print(f"ASR修正: {asr_output} → {corrected}")
    # ASR修正: 資道 → 知道
    # ASR修正: 次飯 → 吃飯
    # ASR修正: 醬瓜 → 將瓜

2. 語音相似詞搜索

# 查找語音相似詞
similar_words = reranker.get_similar_words(
    "知道", 
    similarity_threshold=0.6,
    max_results=10
)
for sim_word in similar_words:
    print(f"{sim_word['word']}: {sim_word['similarity']:.2f}")

3. 批量處理服務

# 高效批量處理(支援並發)
batch_queries = ["資道", "次飯", "醬瓜", "安全"] * 25  # 100個查詢
batch_results = reranker.batch_rerank(batch_queries)

# 統計批量處理結果
successful = [r for r in batch_results if not r.error]
avg_time = sum(r.processing_time_ms for r in successful) / len(successful)
print(f"批量處理: {len(successful)}/{len(batch_queries)} 成功")
print(f"平均處理時間: {avg_time:.1f}ms")

🔧 Development

Setup Development Environment

git clone https://github.com/yourusername/twgy
cd twgy
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -e ".[dev]"

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=twgy

# Run performance tests
pytest -m performance

# Run specific test categories
pytest -m "not slow"

Code Quality

# Format code
black twgy/

# Check style
flake8 twgy/

# Type checking
mypy twgy/

Building Package

# Build distribution
python -m build

# Install locally
pip install dist/twgy-3.0.0-py3-none-any.whl

📊 Performance Benchmarks

Processing Speed

Simple queries (e.g., "知道"): ~50-100ms
Medium queries (e.g., "語音辨識"): ~100-200ms
Complex queries (e.g., compound terms): ~200-250ms

Memory Usage

Initial load: ~100MB (dictionary + models)
With caches: ~150MB (includes L1/L2/L3 caches)
Peak usage: ~200MB (during batch processing)

Accuracy Metrics

Exact match in top-5: >95%
Phonetically similar in top-10: >90%
Handles Taiwan variations: >85%

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Areas for Contribution

Performance optimization: Faster algorithms, better caching
Accuracy improvement: Better phonetic models, more test cases
Language support: Additional Chinese variants, multilingual support
Integration: Web APIs, cloud deployment, ML pipeline integration

Development Workflow

Fork the repository
Create a feature branch
Make changes with tests
Run quality checks
Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📞 Support

Documentation: https://twgy.readthedocs.io/
Issues: GitHub Issues
Discussions: GitHub Discussions
Email: twgy.dev@example.com

🙏 Acknowledgments

Dictionary Sources: Various open Chinese dictionaries and corpora
Research: Based on Taiwan Mandarin phonetic variation studies
DimSim: Integration with DimSim similarity models
Community: Contributors and users who provided feedback

🔄 Changelog

v3.0.0 (Current)

Complete rewrite with three-layer architecture
DimSim integration for enhanced accuracy
Comprehensive CLI interface
Performance optimizations (<250ms processing)
Training data collection capabilities
Improved Taiwan Mandarin variation handling

v2.x (Legacy)

Basic phonetic similarity processing
Limited dictionary coverage
Single-layer processing

Made with ❤️ for the Chinese NLP community

TWGY v3.0.0 - Empowering Chinese language processing with Taiwan Mandarin phonetic intelligence

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

3.0.0

Aug 12, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

twgy-3.0.0.tar.gz (22.4 MB view details)

Uploaded Aug 12, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

twgy-3.0.0-py3-none-any.whl (1.6 MB view details)

Uploaded Aug 12, 2025 Python 3

File details

Details for the file twgy-3.0.0.tar.gz.

File metadata

Download URL: twgy-3.0.0.tar.gz
Upload date: Aug 12, 2025
Size: 22.4 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.13

File hashes

Hashes for twgy-3.0.0.tar.gz
Algorithm	Hash digest
SHA256	`a286cfaf96372bacb2f516247512154315936e25d4ecfb70b92d7500f9fee753`
MD5	`670377d46fc1ca1ed4ebf9be60856ba8`
BLAKE2b-256	`3ddf881e6e7ecb2236e86865b7fe64ff5a6b38f00ca14b41eac28491b3b11f7e`

See more details on using hashes here.

File details

Details for the file twgy-3.0.0-py3-none-any.whl.

File metadata

Download URL: twgy-3.0.0-py3-none-any.whl
Upload date: Aug 12, 2025
Size: 1.6 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.13

File hashes

Hashes for twgy-3.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b96008578bb33ad169c0f20fa97046c774775c8ed18f7d8338db123855f13edc`
MD5	`ac42f46757390349704d4eb3b0c5eec1`
BLAKE2b-256	`e7d1c2f5df3033d1fea1c693e19b1defaf4aac92d476ddbf62c8782ee33a89a6`

See more details on using hashes here.

twgy 3.0.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

TWGY - Taiwan Mandarin Phonetic Similarity Processor

🎯 Key Features

Core Functionality

Advanced Features

📦 Installation

From PyPI (Recommended)

From Source

Development Installation

Optional Dependencies

🚀 Quick Start

Basic Usage

Convenience Functions

Advanced Configuration

🚀 快速開始

環境要求

安裝與初始化

基礎使用

高級配置

🧪 測試與驗證

運行完整測試套件

使用範例

📝 應用場景

1. ASR錯誤修正

2. 語音相似詞搜索

3. 批量處理服務

🔧 Development

Setup Development Environment

Running Tests

Code Quality

Building Package

📊 Performance Benchmarks

Processing Speed

Memory Usage

Accuracy Metrics

🤝 Contributing

Areas for Contribution

Development Workflow

📄 License

📞 Support

🙏 Acknowledgments

🔄 Changelog

v3.0.0 (Current)

v2.x (Legacy)

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes