
A context-aware Simplified to Traditional Chinese converter using BERT

Project description

BertCC

BertCC is a context-aware Chinese text converter that uses BERT (Bidirectional Encoder Representations from Transformers) to convert Simplified Chinese (zh_CN) to Traditional Chinese (zh_TW). Unlike dictionary-based approaches, BertCC uses contextual understanding to resolve ambiguous one-to-many character mappings more accurately.

Features

  • Context-Aware Conversion: Utilizes BERT's contextual understanding to make intelligent conversion choices based on surrounding text

    Input:  他向师生发表演说
    BertCC: 他向師生發表演說  ✓
    OpenCC: 他向師生髮表演說  ✗
    
  • Configurable Processing:

    • Adjustable batch size for optimal performance on different hardware
    • Configurable overlap between batches to maintain context across segments
    • Optional context prefix to guide conversion
  • Unlimited Text Length: Processes text of any length through intelligent batch processing with overlap

  • Hardware Flexibility: Supports both CPU and CUDA processing

  • Detailed Output Option: Can show conversion probabilities and candidate selections when run in verbose mode

Installation

pip install bertcc

Usage

Command Line Interface

# Basic usage
bertcc "要转换的文字" --batch-size 450 --overlap-size 50

# Using context prefix for proper nouns
bertcc "都是上里作的好事" -c "上里一將是人名。"
# Output: 都是上里作的好事  (preserves "上里" as a name instead of converting to "上裡")

# Show detailed conversion process
bertcc "他的头与发皆白" --verbose

Python API

from bertcc.converter import ConversionConfig, ChineseConverter

config = ConversionConfig(
    batch_size=450,
    overlap_size=50,
    context_prefix='上里一將是人名。',  # Context hint for proper nouns
    device="cuda",  # or "cpu"
    model_name="bert-base-chinese"
)

converter = ChineseConverter(config)
result = converter.convert("都是上里作的好事", show_details=True)
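If you would rather not hard-code the device, it can be chosen at runtime. This is a sketch that assumes the remaining parameters fall back to the defaults listed under Configuration Options; torch is already required by the BERT model:

import torch

config = ConversionConfig(
    device="cuda" if torch.cuda.is_available() else "cpu",
)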

Usage Considerations

Input Text Quality

For optimal conversion results, consider these important guidelines:

  1. Pure Chinese Text Preferred:

    Good: 他向师生发表演说
    
    Avoid: <div>他向师生发表演说</div>  // HTML tags may affect context
    
  2. Why This Matters:

    • BERT models are trained primarily on natural Chinese text
    • Non-Chinese elements (HTML, markdown, special characters) can:
      • Disrupt the contextual understanding
      • Lead to incorrect character conversions
      • Break the natural language flow
  3. Best Practices:

    • Clean input text of HTML/XML tags before conversion
    • Remove or minimize non-Chinese characters where possible
    • Keep formatting markup separate from text being converted
    • Use context prefix for proper nouns rather than special markers
  4. Handling Mixed Content: If you must process text with mixed content (a sketch follows this list):

    • Consider splitting the text into Chinese and non-Chinese segments
    • Process Chinese segments separately
    • Reassemble the text after conversion
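
A minimal sketch of that workflow, assuming a converter built as in the Python API example and that convert() returns the converted string; the regex is a simplification that treats CJK Unified Ideographs as the Chinese segments:

import re

CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")

def convert_mixed(text: str, converter) -> str:
    # Convert only the Chinese runs; markup and other characters pass through.
    return CHINESE_RUN.sub(lambda m: converter.convert(m.group(0)), text)

print(convert_mixed("<div>他向师生发表演说</div>", converter))
# <div>他向師生發表演說</div>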

Configuration Options

Parameter       Description                                         Default
batch_size      Number of characters processed in each batch        450
overlap_size    Number of overlapping characters between batches    50
context_prefix  Optional prefix providing additional context        ''
device          Computation device ("cuda" or "cpu")                "cuda" if available
model_name      BERT model to use for conversion                    "bert-base-chinese"

How It Works

BertCC converts text in four stages (a sketch of the masking and prediction steps follows this list):

  1. Text Processing:

    • Splits input text into manageable batches with overlap to maintain context
    • Identifies ambiguous characters that have multiple possible Traditional Chinese representations
  2. Masking and Prediction:

    • Replaces ambiguous characters with BERT's [MASK] token
    • For example, "发" could be "發" (to express) or "髮" (hair)
    • The surrounding context helps BERT understand which meaning is intended
  3. Contextual Decision:

    • BERT model predicts the most likely Traditional Chinese character for each mask
    • Predictions are influenced by:
      • Surrounding text context
      • Optional context prefix (useful for proper nouns)
      • Known character mappings and frequencies
  4. Batch Processing:

    • Processes text in overlapping batches to handle long texts
    • Merges results while maintaining consistency at batch boundaries
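
A minimal sketch of the masking-and-prediction idea from steps 2 and 3, assuming the Hugging Face transformers and torch packages. This illustrates the technique, not BertCC's actual internals, and the one-entry candidate table is a toy stand-in for a full one-to-many character mapping:

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
model.eval()

# Toy one-to-many mapping: one simplified character and its traditional candidates.
CANDIDATES = {"发": ["發", "髮"]}

def convert_char(text: str, index: int) -> str:
    """Pick the most likely traditional form of the ambiguous character at `index`."""
    masked = text[:index] + tokenizer.mask_token + text[index + 1:]
    inputs = tokenizer(masked, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
    with torch.no_grad():
        scores = model(**inputs).logits[0, mask_pos].squeeze(0)
    # Compare only the scores of the known traditional candidates.
    candidates = CANDIDATES[text[index]]
    ids = tokenizer.convert_tokens_to_ids(candidates)
    return max(zip(candidates, ids), key=lambda pair: scores[pair[1]].item())[0]

print(convert_char("他向师生发表演说", 4))  # 發 (to deliver a speech), not 髮 (hair)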

Limitations

  1. Phrase-Level Variations: Cannot handle regional phrase differences between Simplified and Traditional Chinese due to its character-by-character processing architecture. For example:

    CN: 互联网        TW: 網際網路
    CN: 数据库        TW: 資料庫
    CN: 软件          TW: 軟體
    

    BertCC will convert these character by character (e.g., 互联网 → 互聯網) rather than using the regionally appropriate phrase (網際網路), because the model performs character-level masking and prediction, not phrase-level transformation. For applications requiring region-specific terminology, additional post-processing (sketched after this list) or a different approach is needed.

  2. Computational Resources: As a neural network-based solution, BertCC requires more computational resources compared to dictionary-based approaches like OpenCC.

  3. Processing Speed: Due to the contextual analysis, conversion speed is slower than dictionary-based methods, though this is mitigated through batch processing.
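
As a hypothetical example of the post-processing mentioned in the first limitation, a phrase table could be applied to BertCC's character-level output; the table and function below are illustrative, not part of BertCC:

# Hypothetical phrase-level pass applied after character-level conversion.
PHRASE_TABLE = {
    "互聯網": "網際網路",
    "數據庫": "資料庫",
    "軟件": "軟體",
}

def localize_terms(text: str) -> str:
    # Replace longer phrases first so shorter entries cannot split them.
    for cn, tw in sorted(PHRASE_TABLE.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(cn, tw)
    return text

print(localize_terms("互聯網上的數據庫軟件"))  # 網際網路上的資料庫軟體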

Comparison with Other Tools

Unlike traditional conversion tools like OpenCC that rely on character-to-character mapping, BertCC:

  • Considers the entire context when making conversion decisions
  • Handles ambiguous characters more accurately by understanding their usage
  • Provides confidence scores for conversions in verbose mode
  • Maintains contextual consistency across long texts through overlap processing

Contributing

Contributions are welcome! Please feel free to submit pull requests, report issues, or suggest improvements.

License

Apache License 2.0

