
A Python library for conversion between Traditional and Simplified Chinese, inspired by MediaWiki's LanguageConverter.


langconv

langconv is a Python library for conversion between Traditional and Simplified Chinese and potentially more languages, inspired by MediaWiki's LanguageConverter.

Install and usage

This library is guaranteed to work on Python 3.12, and the minimum supported version is Python 3.9. If langconv does not work on Python 3.9 or later, especially on the latest Python release, please open an issue.

To install, use pip or any other package manager you prefer:

$ pip install langconv

This is a minimal working code example:

from langconv.converter import LanguageConverter
from langconv.language.zh import zh_cn, zh_tw  # zh_hk also supported

lc_cn = LanguageConverter.from_language(zh_cn)  # target variant set to zh-cn
lc_tw = LanguageConverter.from_language(zh_tw)  # target variant set to zh-tw

print(lc_cn.convert('人人生而自由,在尊嚴和權利上一律平等。他們賦有理性和良心,並應以兄弟關係的精神相對待。'))
# Expected:          人人生而自由,在尊严和权利上一律平等。他们赋有理性和良心,并应以兄弟关系的精神相对待。
print(lc_tw.convert('人人生而自由,在尊严和权利上一律平等。他们赋有理性和良心,并应以兄弟关系的精神相对待。'))
# Expected:          人人生而自由,在尊嚴和權利上一律平等。他們賦有理性和良心,並應以兄弟關係的精神相對待。

Documentation

Unfortunately, documentation is not available yet. In the meantime, you may look for some examples inside the test folder. Docstrings for functions are also available for your convenience.

Design

langconv is designed to mimic MediaWiki's LanguageConverter.php mechanism as closely as feasible. One big divergence from LanguageConverter is that, to achieve fast conversion, langconv comes with its own implementation of a trie instead of search-replacing strings. This makes conversion faster, although it comes at some cost, e.g. higher memory use.
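To illustrate why a trie helps, here is a minimal sketch of trie-based longest-match conversion. It is not langconv's actual implementation: the names `TrieNode`, `Trie`, `convert` and the toy table are assumptions made up for this example. The point is that a single left-to-right pass can find the longest table entry at each position, instead of running one search-replace per table entry.

```python
class TrieNode:
    """One node of the trie; `value` holds the replacement if a key ends here."""
    __slots__ = ("children", "value")

    def __init__(self):
        self.children = {}
        self.value = None


class Trie:
    """Hypothetical trie built from a {source: target} conversion table."""

    def __init__(self, table):
        self.root = TrieNode()
        for source, target in table.items():
            self.insert(source, target)

    def insert(self, source, target):
        node = self.root
        for ch in source:
            node = node.children.setdefault(ch, TrieNode())
        node.value = target

    def longest_match(self, text, start):
        """Return (length, replacement) of the longest key starting at `start`."""
        node = self.root
        best_len, best_value = 0, None
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.value is not None:
                best_len, best_value = i - start + 1, node.value
        return best_len, best_value


def convert(text, trie):
    """Walk the text once, always preferring the longest matching entry."""
    out = []
    pos = 0
    while pos < len(text):
        length, value = trie.longest_match(text, pos)
        if length:
            out.append(value)
            pos += length
        else:
            out.append(text[pos])  # no entry starts here; copy the character
            pos += 1
    return "".join(out)


# Toy table (an assumption for illustration, not langconv's shipped data).
TOY_TABLE = {"发": "發", "头发": "頭髮", "权": "權", "权利": "權利"}
toy_trie = Trie(TOY_TABLE)
result = convert("头发和权利", toy_trie)
print(result)
```

The memory cost mentioned above shows up here as one dictionary per node; the trade is that each input position is visited only a bounded number of times, regardless of how many entries the table holds.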

langconv ships with its own set of conversion tables for Traditional (including the Taiwan and Hong Kong variants) and Simplified (including the China variant) Chinese conversion. These tables are copied from MediaWiki and are battle-tested through extensive use on wikis, including Chinese Wikipedia and hundreds of other Chinese MediaWiki sites. You can learn more about its licensing here. You may also bring your own table, which should be fairly straightforward to do.

langconv supports MediaWiki's special conversion syntax for more versatile, advanced conversion results. However, the full set of MediaWiki conversion syntax is not available yet. You may file an issue for unsupported syntax.
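For readers unfamiliar with the syntax, MediaWiki marks manual conversion rules inline as `-{zh-cn:...; zh-tw:...}-`, and `-{text}-` protects text from conversion. The sketch below resolves only these two basic forms with a regular expression. It is an independent illustration, not langconv's parser: `resolve_markup` is a hypothetical helper, and the fallback to the first listed variant is a simplification of MediaWiki's real fallback chain.

```python
import re

# Matches the inline form -{ ... }- (non-greedy, no nesting support).
MARKUP = re.compile(r"-\{(.*?)\}-", re.DOTALL)


def resolve_markup(text, variant):
    """Replace each -{...}- span with the text chosen for `variant`.

    Hypothetical helper for illustration only; flags, nesting and
    unidirectional rules from the full MediaWiki syntax are not handled.
    """
    def pick(match):
        body = match.group(1)
        rules = {}
        for part in body.split(";"):
            key, sep, value = part.partition(":")
            if sep:
                rules[key.strip().lower()] = value.strip()
        if not rules:
            return body  # -{text}- : emit the text unchanged
        # Simplification: fall back to the first listed variant.
        return rules.get(variant, next(iter(rules.values())))

    return MARKUP.sub(pick, text)


print(resolve_markup("我用-{zh-cn:博客; zh-tw:部落格}-写作", "zh-tw"))
```

Because this sketch treats any colon inside the braces as a rule separator, protected text that itself contains a colon would be misread; a real parser needs to validate the variant codes.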

Comparison

Currently, the two most commonly used Chinese variant conversion systems are MediaWiki's LanguageConverter, which powers Chinese Wikipedia, and OpenCC (Open Chinese Convert). All conversion libraries strive for one thing: making conversion results more reliable and accurate. However, this task is not easy, because:

  • Simplified Chinese merged multiple Traditional characters into one, so converting them back correctly requires some form of context recognition.
  • Because of the many years of geographical, political and, most importantly, cultural division over the last century, the variants settled on different terms for many things, and those differences have stuck with us since.

If you compare MediaWiki and OpenCC results, honestly, both are good enough for most text. But they still get things wrong. That is why Wikipedia concluded that granular control, including dedicated conversion tables at the topic, page and word level, is necessary for perfect output. This library, by copying MediaWiki's approach, aims to offer perfect conversion results (as long as you are willing to put more work into it). These are some use cases I recommend:

  • Generic Simplified-Traditional conversion
  • Mocking MediaWiki conversion behavior and syntax, without actually running a real MediaWiki (the original use case)
  • Use cases where a perfect conversion result with no compromise is needed, and you don't mind a little bit of manual work to ensure this (e.g. on a company website)
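The idea of layering topic, page and word level tables over a general table can be sketched with `collections.ChainMap`, where earlier maps take priority. This is only an illustration of the concept: the table names, the sample terms and `convert_terms` are assumptions for this example and say nothing about how langconv or MediaWiki store their rules.

```python
from collections import ChainMap

# Hypothetical tables, from most general to most specific.
global_table = {"软件": "軟體", "鼠标": "滑鼠"}
topic_table = {"宏": "巨集"}      # e.g. rules shared by IT-related pages
page_table = {"软件": "軟件"}     # one page overrides the general rule

# Lookups try page_table first, then topic_table, then global_table.
effective = ChainMap(page_table, topic_table, global_table)


def convert_terms(text, table):
    """Naive term replacement, longest source term first.

    Sequential str.replace is enough for this toy input but can re-convert
    already replaced text in general.
    """
    for source in sorted(table, key=len, reverse=True):
        text = text.replace(source, table[source])
    return text


converted = convert_terms("这个软件支持鼠标和宏", effective)
print(converted)
```

The page-level entry wins for 软件 while the other two terms still come from the broader tables, which is the kind of targeted override that manual work on a specific page or site makes possible.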

To-Do

  • Option to opt out of MediaWiki conversion syntax entirely.
  • Performance improvements.
  • Full support for MediaWiki conversion syntax
  • Support for NoteTA group conversion
