Taiwanese Hokkien Transliterator and Tokeniser
Taibun
Taibun provides methods to customise transliteration and to retrieve information about Taiwanese Hokkien pronunciation.
It also includes a word tokeniser for Taiwanese Hokkien.
Install
Taibun can be installed from PyPI:
$ pip install taibun
Usage
Converter
The Converter class transliterates Chinese characters into the chosen transliteration system, using parameters specified by the developer. It works for both Traditional and Simplified characters.
# Constructor
c = Converter(system, dialect, format, delimiter, sandhi, punctuation, convert_non_cjk)
# Transliterate Chinese characters
c.get(input)
System
system: String - system of transliteration.
- Tailo (default) - Tâi-uân Lô-má-jī Phing-im Hong-àn
- POJ - Pe̍h-ōe-jī
- Zhuyin - Taiwanese Phonetic Symbols
- TLPA - Taiwanese Language Phonetic Alphabet
- Pingyim - Bbánlám Uē Pìngyīm Hōng'àn
- Tongiong - Daī-ghî Tōng-iōng Pīng-im
- IPA - International Phonetic Alphabet
text | Tailo | POJ | Zhuyin | TLPA | Pingyim | Tongiong | IPA |
---|---|---|---|---|---|---|---|
台灣 | Tâi-uân | Tâi-oân | ㄉㄞˊ ㄨㄢˊ | Tai5 uan5 | Dáiwán | Tāi-uǎn | Tai²⁵ uan²⁵ |
Dialect
dialect: String - preferred pronunciation, either south (default) or north.
text | south | north |
---|---|---|
五月節 | Gōo-gue̍h-tseh | Gōo-ge̍h-tsueh |
Format
format: String - format in which tones are represented in the converted sentence.
- mark (default) - uses diacritics for each syllable. Not available for TLPA.
- number - adds a number representing the tone at the end of each syllable
- strip - removes any tone marking
text | mark | number | strip |
---|---|---|---|
台灣 | Tâi-uân | Tai5-uan5 | Tai-uan |
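The relationship between the number and strip formats can be sketched without Taibun itself. The helper below is a hypothetical illustration (strip_tone_numbers is not part of the Taibun API) and assumes that tone digits 1-8 directly follow the Latin letters of a syllable, as in the table above.

```python
import re

# A tone digit (1-8) that directly follows a letter and ends the syllable.
TONE_DIGIT = re.compile(r"(?<=[A-Za-z])[1-8]\b")


def strip_tone_numbers(text):
    """Turn number-format text into strip-format text (illustrative helper)."""
    return TONE_DIGIT.sub("", text)


print(strip_tone_numbers("Tai5-uan5"))  # Tai-uan
```

Going in the other direction (from strip to number or mark) is not possible without a dictionary, because the tone information has been discarded.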
Delimiter
delimiter: String - sets the delimiter character placed between the syllables of a word.
The default value depends on the chosen system:
- '-' - for Tailo, POJ, Tongiong
- '' - for Pingyim
- ' ' - for Zhuyin, TLPA, IPA
text | '-' | '' | ' ' |
---|---|---|---|
台灣 | Tâi-uân | Tâiuân | Tâi uân |
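The per-system defaults listed above amount to a simple lookup. The following is a minimal sketch of that idea, not Taibun's internal code; DEFAULT_DELIMITER and join_syllables are hypothetical names used only for illustration.

```python
# Documented default delimiter for each transliteration system.
DEFAULT_DELIMITER = {
    "Tailo": "-", "POJ": "-", "Tongiong": "-",
    "Pingyim": "",
    "Zhuyin": " ", "TLPA": " ", "IPA": " ",
}


def join_syllables(syllables, system="Tailo", delimiter=None):
    """Join the syllables of one word, falling back to the system default."""
    if delimiter is None:
        delimiter = DEFAULT_DELIMITER[system]
    return delimiter.join(syllables)


print(join_syllables(["Tai", "uan"]))                 # Tai-uan
print(join_syllables(["Tai", "uan"], delimiter=""))   # Taiuan
print(join_syllables(["Tai", "uan"], system="TLPA"))  # Tai uan
```

An explicit delimiter always wins over the system default, which mirrors how the delimiter parameter is described above.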
Sandhi
sandhi: String - applies the sandhi rules of Taiwanese Hokkien.
Since it's difficult to encode all sandhi rules, Taibun provides multiple modes for sandhi conversion to allow for customised sandhi handling.
- none - doesn't perform any tone sandhi
- auto - closest approximation to fully correct tone sandhi of Taiwanese, with proper sandhi of pronouns, suffixes, and words with 仔
- exc_last - changes tone for every syllable except for the last one
- incl_last - changes tone for every syllable including the last one
The default value depends on the chosen system:
- auto - for Tongiong
- none - for Tailo, POJ, Zhuyin, TLPA, Pingyim, IPA
text | none | auto | exc_last | incl_last |
---|---|---|---|---|
這是你的手機仔無 | Tse sī lí ê tshiú-ki-á bô | Tse sì li ē tshiu-kī-á bô? | Tsē sì li ē tshiu-kī-a bô | Tsē sì li ē tshiu-kī-a bō |
Sandhi rules also change depending on the dialect chosen.
text | no sandhi | south | north |
---|---|---|---|
台灣 | Tâi-uân | Tāi-uân | Tài-uân |
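To make the exc_last and incl_last modes and the dialect difference concrete, here is a simplified sketch that works on number-format text. It is not Taibun's implementation: apply_sandhi and sandhi_tone are hypothetical helpers, only the regular tone-change rules are encoded, and the auto mode (with its special handling of pronouns, suffixes, and 仔) is deliberately left out.

```python
import re

# Regular tone changes for unchecked syllables (tone 5 depends on dialect).
REGULAR = {"1": "7", "7": "3", "3": "2", "2": "1"}
TONE_5 = {"south": "7", "north": "3"}
SYLLABLE = re.compile(r"([A-Za-z]+)([1-8])")


def sandhi_tone(body, tone, dialect):
    """Return the sandhi tone for one syllable (simplified rule set)."""
    if tone == "5":
        return TONE_5[dialect]
    if tone in ("4", "8"):  # checked tones depend on the final consonant
        if body.endswith("h"):
            return {"4": "2", "8": "3"}[tone]
        return {"4": "8", "8": "4"}[tone]
    return REGULAR.get(tone, tone)


def apply_sandhi(text, mode="exc_last", dialect="south"):
    """Apply tone sandhi to number-format text in the given mode."""
    matches = list(SYLLABLE.finditer(text))
    if mode == "none" or not matches:
        return text
    targets = matches if mode == "incl_last" else matches[:-1]
    chars = list(text)
    for m in targets:
        # Each tone is a single digit, so replacing in place keeps indices valid.
        chars[m.start(2)] = sandhi_tone(m.group(1), m.group(2), dialect)
    return "".join(chars)


print(apply_sandhi("Tai5-uan5"))                   # Tai7-uan5
print(apply_sandhi("Tai5-uan5", dialect="north"))  # Tai3-uan5
print(apply_sandhi("Tai5-uan5", mode="incl_last")) # Tai7-uan7
```

The first two results correspond to the south and north columns of the table above (Tāi-uân and Tài-uân) once the tone numbers are written as diacritics.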
Punctuation
punctuation: String - how punctuation is handled in the converted sentence.
- format (default) - converts Chinese-style punctuation to Latin-style punctuation and capitalises words at the beginning of each sentence.
- none - preserves Chinese-style punctuation and doesn't capitalise words at the beginning of new sentences.
text | format | none |
---|---|---|
這是臺南,簡稱「南」(白話字:Tâi-lâm;注音符號:ㄊㄞˊ ㄋㄢˊ,國語:Táinán)。 | Tse sī Tâi-lâm, kán-tshing "lâm" (Pe̍h-uē-jī: Tâi-lâm; tsù-im hû-hō: ㄊㄞˊ ㄋㄢˊ, kok-gí: Táinán). | tse sī Tâi-lâm,kán-tshing「lâm」(Pe̍h-uē-jī:Tâi-lâm;tsù-im hû-hō:ㄊㄞˊ ㄋㄢˊ,kok-gí:Táinán)。 |
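The format mode can be approximated in a few lines. This is a rough sketch under stated assumptions, not Taibun's own logic: format_punctuation is a hypothetical helper, and the mapping covers only a handful of full-width punctuation marks.

```python
import re

# Small, non-exhaustive mapping from Chinese-style to Latin-style punctuation.
PUNCTUATION = {
    ",": ", ", "。": ". ", "!": "! ", "?": "? ",
    ":": ": ", ";": "; ", "「": '"', "」": '"',
}


def format_punctuation(text):
    """Convert punctuation and capitalise the first word of each sentence."""
    converted = "".join(PUNCTUATION.get(ch, ch) for ch in text).strip()
    # Upper-case the first letter of the text and after sentence-final marks.
    return re.sub(
        r"(^|[.!?]\s+)(\w)",
        lambda m: m.group(1) + m.group(2).upper(),
        converted,
    )


print(format_punctuation("li ho,li ho!li ho?"))  # Li ho, li ho! Li ho?
```

Brackets and other marks would need their own entries in the mapping; they are omitted here to keep the sketch short.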
Convert non-CJK
convert_non_cjk: Boolean - defines whether or not to convert non-Chinese words. Can be used to convert Tailo to another romanisation system.
- True - convert non-Chinese character words
- False (default) - convert only Chinese character words
text | False | True |
---|---|---|
我食pháng | ㆣㄨㄚˋ ㄐㄧㄚㆷ˙ pháng | ㆣㄨㄚˋ ㄐㄧㄚㆷ˙ ㄆㄤˋ |
Tokeniser
The Tokeniser class performs NLTK wordpunct_tokenize-like tokenisation of a Taiwanese Hokkien sentence.
# Constructor
t = Tokeniser()
# Tokenise Taiwanese Hokkien sentence
t.tokenise(input)
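For readers unfamiliar with NLTK's wordpunct_tokenize, its behaviour comes down to the regular expression used below. This standalone sketch (wordpunct_split is a hypothetical name) only separates punctuation from runs of word characters; unlike Taibun's Tokeniser, it does not segment a run of Chinese characters into individual words.

```python
import re


def wordpunct_split(text):
    """Split text into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)


print(wordpunct_split("太空朋友,恁好!"))
# ['太空朋友', ',', '恁好', '!']
```

The Tokeniser output in the Example section splits 太空朋友 further into 太空 and 朋友, which is the part that requires Taibun's word data.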
Other Functions
Handy functions for NLP tasks in Taiwanese Hokkien.
# Convert to Traditional
to_traditional(input)
# Convert to Simplified
to_simplified(input)
# Check if the string is fully composed of Chinese characters
is_cjk(input)
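As an illustration of what a check like is_cjk involves, the sketch below tests every character against the basic CJK Unified Ideographs block (U+4E00 to U+9FFF). It is an assumption-laden stand-in, not Taibun's implementation, which may cover more ranges; looks_cjk is a hypothetical name.

```python
def looks_cjk(text):
    """True if text is non-empty and made only of basic CJK ideographs."""
    return bool(text) and all("\u4e00" <= ch <= "\u9fff" for ch in text)


print(looks_cjk("我食麭"))   # True
print(looks_cjk("我食pan"))  # False
```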
Example
# Converter
from taibun import Converter
## System
c = Converter() # Tailo system default
c.get('先生講,學生恬恬聽。')
>> Sian-sinn kóng, ha̍k-sing tiām-tiām thiann.
c = Converter(system='Zhuyin')
c.get('先生講,學生恬恬聽。')
>> ㄒㄧㄢ ㄒㆪ ㄍㆲˋ, ㄏㄚㆶ˙ ㄒㄧㄥ ㄉㄧㆰ˫ ㄉㄧㆰ˫ ㄊㄧㆩ.
## Dialect
c = Converter() # south dialect default
c.get("我欲用箸食魚")
>> Guá beh īng tī tsia̍h hî
c = Converter(dialect='north')
c.get("我欲用箸食魚")
>> Guá bueh īng tū tsia̍h hû
## Format
c = Converter() # for Tailo, mark by default
c.get("生日快樂")
>> Senn-ji̍t khuài-lo̍k
c = Converter(format='number')
c.get("生日快樂")
>> Senn1-jit8 khuai3-lok8
c = Converter(format='strip')
c.get("生日快樂")
>> Senn-jit khuai-lok
## Delimiter
c = Converter(delimiter='')
c.get("先生講,學生恬恬聽。")
>> Siansinn kóng, ha̍ksing tiāmtiām thiann.
c = Converter(system='Pingyim', delimiter='-')
c.get("先生講,學生恬恬聽。")
>> Siān-snī gǒng, hág-sīng diâm-diâm tinā.
## Sandhi
c = Converter() # for Tailo, sandhi none by default
c.get("這是台灣囡仔")
>> Tse sī Tâi-uân gín-á
c = Converter(sandhi='auto')
c.get("這是台灣囡仔")
>> Tse sì Tāi-uān gin-á
c = Converter(sandhi='exc_last')
c.get("這是台灣囡仔")
>> Tsē sì Tāi-uān gin-á
c = Converter(sandhi='incl_last')
c.get("這是台灣囡仔")
>> Tsē sì Tāi-uān gin-a
## Punctuation
c = Converter() # format punctuation default
c.get("太空朋友,恁好!恁食飽未?")
>> Thài-khong pîng-iú, lín-hó! Lín tsia̍h-pá buē?
c = Converter(punctuation='none')
c.get("太空朋友,恁好!恁食飽未?")
>> thài-khong pîng-iú,lín-hó!lín tsia̍h-pá buē?
## Convert non-CJK
c = Converter(system='Zhuyin') # convert_non_cjk is False by default
c.get("我食pháng")
>> ㆣㄨㄚˋ ㄐㄧㄚㆷ˙ pháng
c = Converter(system='Zhuyin', convert_non_cjk=True)
c.get("我食pháng")
>> ㆣㄨㄚˋ ㄐㄧㄚㆷ˙ ㄆㄤˋ
# Tokeniser
from taibun import Tokeniser
t = Tokeniser()
t.tokenise("太空朋友,恁好!恁食飽未?")
>> ['太空', '朋友', ',', '恁好', '!', '恁', '食飽', '未', '?']
# Other Functions
from taibun import to_traditional, to_simplified, is_cjk
to_traditional("我听无台湾话")
>> 我聽無台灣話
to_simplified("我聽無臺灣話")
>> 我听无台湾话
is_cjk('我食麭')
>> True
is_cjk('我食pháng')
>> False
Data
- Taiwanese-Chinese Online Dictionary (via ChhoeTaigi)
- iTaigi Chinese-Taiwanese Comparison Dictionary (via ChhoeTaigi)
Acknowledgements
Licence
Because Taibun is MIT-licensed, any developer can essentially do whatever they want with it as long as they include the original copyright and licence notice in any copies of the source code. Note that the data used by the package is covered by a different licence.
The data is licensed under CC BY-SA 4.0