
Taiwanese Hokkien Transliterator and Tokeniser



<img src="https://raw.githubusercontent.com/andreihar/taibun/main/readme/logo.png" alt="Logo" width="90" height="80">

Taibun


Taibun provides methods to customise transliteration and to retrieve information about Taiwanese Hokkien pronunciation. It also includes a word tokeniser for Taiwanese Hokkien.



Table of Contents
    <li><a href="#install">Install</a></li>
    
    <li>
    
      <a href="#usage">Usage</a>
    
      <ul>
    
        <li>
    
          <a href="#converter">Converter</a>
    
          <ul>
    
            <li><a href="#system">System</a></li>
    
            <li><a href="#dialect">Dialect</a></li>
    
            <li><a href="#format">Format</a></li>
    
            <li><a href="#delimiter">Delimiter</a></li>
    
            <li><a href="#sandhi">Sandhi</a></li>
    
            <li><a href="#punctuation">Punctuation</a></li>
    
          </ul>
    
        </li>
    
        <li><a href="#tokeniser">Tokeniser</a></li>
    
      </ul>
    
    </li>
    
    <li><a href="#example">Example</a></li>
    
    <li><a href="#data">Data</a></li>
    
    <li><a href="#licence">Licence</a></li>
    

Install

Taibun can be installed from PyPI:

$ pip install taibun
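To confirm from Python that the install succeeded, one option is to query the installed distribution metadata. The sketch below uses only the standard library; the helper name `installed_version` is hypothetical and not part of Taibun.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # None here would mean that taibun is not installed in this environment.
    print(installed_version("taibun"))
```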

Usage

Converter

The Converter class transliterates Chinese characters into the chosen transliteration system, using the parameters specified by the developer. It works with both Traditional and Simplified characters.

# constructor

c = Converter(system, dialect, format, delimiter, sandhi, punctuation)



# transliterate Chinese characters

c.get(input)



# convert Simplified Chinese characters to Traditional Chinese characters

c.to_traditional(input)

System

system String - transliteration system to use. Defaults to Tailo.

| text | Tailo | POJ | Zhuyin | TLPA | Pingyim | Tongiong |
|------|---------|---------|-------------|-----------|---------|----------|
| 臺灣 | Tâi-uân | Tâi-oân | ㄉㄞˊ ㄨㄢˊ | Tai5 uan5 | Dáiwán | Tāi-uǎn |

Dialect

dialect String - preferred pronunciation.

  • south (default) - Zhangzhou-leaning pronunciation

  • north - Quanzhou-leaning pronunciation

| text | south | north |
|--------|---------------|---------------|
| 五月節 | Gōo-gue̍h-tseh | Gōo-ge̍h-tsueh |

Format

format String - format in which tones will be represented in the converted sentence.

  • mark (default) - uses diacritics for each syllable. Not available for TLPA.

  • number - adds a number representing the tone at the end of each syllable

  • strip - removes any tone marking

| text | mark | number | strip |
|------|---------|-----------|---------|
| 臺灣 | Tâi-uân | Tai5-uan5 | Tai-uan |
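The three formats carry the same syllables and differ only in how the tone is written. As a rough illustration, the sketch below derives the `strip` form from either the `number` or the `mark` form of a Tailo string. It is not Taibun's implementation, both helper names are hypothetical, and the mark-stripping helper assumes Tailo (other systems use diacritics that are not tone marks).

```python
import re
import unicodedata


def strip_tone_numbers(text: str) -> str:
    """Remove a tone digit that directly follows a letter, e.g. 'Tai5' -> 'Tai'."""
    return re.sub(r"(?<=[A-Za-z])[1-9]", "", text)


def strip_tone_marks(text: str) -> str:
    """Remove combining tone diacritics from Tailo text."""
    # Decompose so each tone mark becomes a separate combining character.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every nonspacing mark (Unicode category Mn), then recompose.
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)
```

Applied to the table row above, both helpers should reduce their input to `Tai-uan`.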

Delimiter

delimiter String - sets the delimiter character that will be placed in between syllables of a word.

Default value depends on the chosen system:

  • '-' - for Tailo, POJ, Tongiong

  • '' - for Pingyim

  • ' ' - for Zhuyin, TLPA

| text | '-' | '' | ' ' |
|------|---------|--------|---------|
| 臺灣 | Tâi-uân | Tâiuân | Tâi uân |
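The default-delimiter rule above amounts to a lookup keyed by system, with an explicit `delimiter` taking precedence. A minimal sketch of that behaviour follows; the names `DEFAULT_DELIMITERS` and `join_syllables` are hypothetical and chosen for illustration, not taken from Taibun.

```python
from typing import Optional, Sequence

# Default delimiter for each system, as listed above.
DEFAULT_DELIMITERS = {
    "Tailo": "-",
    "POJ": "-",
    "Tongiong": "-",
    "Pingyim": "",
    "Zhuyin": " ",
    "TLPA": " ",
}


def join_syllables(
    syllables: Sequence[str],
    system: str = "Tailo",
    delimiter: Optional[str] = None,
) -> str:
    """Join the syllables of one word, falling back to the system default."""
    if delimiter is None:
        delimiter = DEFAULT_DELIMITERS[system]
    return delimiter.join(syllables)
```

Using `None` as the "not specified" value keeps the empty string available as a real delimiter, which Pingyim needs.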

Sandhi

sandhi Boolean - applies the sandhi rules of Taiwanese Hokkien to syllables of a single word.

Default value depends on the chosen system:

  • True - for Tongiong

  • False - for Tailo, POJ, Zhuyin, TLPA, Pingyim

| text | False | True |
|----------|--------------|--------------|
| 馬來西亞 | Má-lâi-se-a | Ma-lāi-sē-a |

Sandhi rules also change depending on the dialect chosen.

| text | no sandhi | south | north |
|------|-----------|---------|---------|
| 臺灣 | Tâi-uân | Tāi-uân | Tài-uân |

Note that this differs from actual sandhi in speech, where tone changes apply across the whole sentence rather than only within individual words.

  • Taibun's sandhi rules: Thái-khong pīng-iú, lin-hó! Lín tsià-pá buē?

  • Actual sandhi rules: Thái-khōng pīng-iú, lin-hó! Lin tsià-pa buē?
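The word-level behaviour can be sketched on number-format Tailo: every syllable of a word except the last changes tone. The tone-change table below follows the commonly described Taiwanese Hokkien sandhi pattern and agrees with the examples in this section. It is a simplified illustration, not Taibun's implementation, and the function names are hypothetical.

```python
import re

# Citation tone -> sandhi tone for tones that do not depend on dialect or coda.
BASE_RULES = {"1": "7", "2": "1", "3": "2", "7": "3"}


def sandhi_syllable(syllable: str, dialect: str = "south") -> str:
    """Apply the tone change to one numbered syllable, e.g. 'Tai5' -> 'Tai7'."""
    match = re.fullmatch(r"([A-Za-z]+)([1-8])", syllable)
    if match is None:
        return syllable  # leave anything unexpected untouched
    body, tone = match.groups()
    if tone == "5":
        # Tone 5 becomes 7 in the south dialect and 3 in the north dialect.
        return body + ("7" if dialect == "south" else "3")
    if tone in ("4", "8"):
        if body.endswith("h"):
            # Syllables ending in -h drop it: 4 -> 2, 8 -> 3.
            return body[:-1] + ("2" if tone == "4" else "3")
        # Syllables ending in -p, -t or -k swap tones 4 and 8.
        return body + ("8" if tone == "4" else "4")
    return body + BASE_RULES.get(tone, tone)


def sandhi_word(word: str, dialect: str = "south") -> str:
    """Change every syllable of a hyphenated word except the last one."""
    syllables = word.split("-")
    changed = [sandhi_syllable(s, dialect) for s in syllables[:-1]]
    return "-".join(changed + syllables[-1:])
```

For instance, `sandhi_word("Ma2-lai5-se1-a1")` gives `Ma1-lai7-se7-a1`, the numbered equivalent of Ma-lāi-sē-a in the table above.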

Punctuation

punctuation String - style of punctuation and capitalisation in the output.

  • format (default) - converts Chinese-style punctuation to Latin-style punctuation and capitalises words at the beginning of each sentence.

  • none - preserves Chinese-style punctuation and doesn't capitalise words at the beginning of new sentences.

| text | format | none |
|-|-|-|
| 這是臺南,簡稱「南」(白話字:Tâi-lâm;注音符號:ㄊㄞˊ ㄋㄢˊ,國語:Táinán)。 | Tse sī Tâi-lâm, kán-tshing "lâm" (Pe̍h-uē-jī: Tâi-lâm; tsù-im hû-hō: ㄊㄞˊ ㄋㄢˊ, kok-gí: Táinán). | tse sī Tâi-lâm,kán-tshing「lâm」(Pe̍h-uē-jī:Tâi-lâm;tsù-im hû-hō:ㄊㄞˊ ㄋㄢˊ,kok-gí:Táinán)。 |

Tokeniser

The Tokeniser class tokenises a Taiwanese Hokkien sentence in the manner of NLTK's wordpunct_tokenize.

# constructor

t = Tokeniser()



# tokenise Taiwanese Hokkien sentence

t.tokenise(input)
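For reference, NLTK's wordpunct_tokenize is built on the regular expression `\w+|[^\w\s]+`, which separates runs of word characters from runs of punctuation. The sketch below reproduces that pattern with the standard library only; `wordpunct_tokenise` is a hypothetical name. Because Chinese text has no spaces, the pattern alone returns each run of characters between punctuation marks as a single token, so it does not reproduce the word segmentation that Tokeniser performs.

```python
import re
from typing import List


def wordpunct_tokenise(text: str) -> List[str]:
    """Split text into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)


if __name__ == "__main__":
    # Latin text splits on hyphens and punctuation.
    print(wordpunct_tokenise("Tai-uan, li-ho!"))
```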

Example

from taibun import Converter, Tokeniser



# System

c = Converter() # Tailo system default

c.get('先生講,學生恬恬聽。')

>> Sian-sinn kóng, ha̍k-sing tiām-tiām thiann.



c = Converter(system='Zhuyin')

c.get('先生講,學生恬恬聽。')

>> ㄒㄧㄢ ㄒㆪ ㄍㆲˋ, ㄏㄚㆶ˙ ㄒㄧㄥ ㄉㄧㆰ˫ ㄉㄧㆰ˫ ㄊㄧㆩ.



# Dialect

c = Converter() # south dialect default

c.get("我欲用箸食魚")

>> Guá beh īng tī tsia̍h hî



c = Converter(dialect='north')

c.get("我欲用箸食魚")

>> Guá bueh īng tū tsia̍h hû



# Format

c = Converter() # for Tailo, mark by default

c.get("生日快樂")

>> Senn-ji̍t khuài-lo̍k



c = Converter(format='number')

c.get("生日快樂")

>> Senn1-jit8 khuai3-lok8



c = Converter(format='strip')

c.get("生日快樂")

>> Senn-jit khuai-lok



# Delimiter

c = Converter(delimiter='')

c.get("先生講,學生恬恬聽。")

>> Siansinn kóng, ha̍ksing tiāmtiām thiann.



c = Converter(system='Pingyim', delimiter='-')

c.get("先生講,學生恬恬聽。")

>> Siān-snī gǒng, hág-sīng diâm-diâm tinā.



# Sandhi

c = Converter() # for Tailo, sandhi False by default

c.get("南迴鐵路")

>> Lâm-huê-thih-lōo



c = Converter(sandhi=True)

c.get("南迴鐵路")

>> Lām-huē-thí-lōo



# Punctuation

c = Converter() # format punctuation default

c.get("太空朋友,恁好!恁食飽未?")

>> Thài-khong pîng-iú, lín-hó! Lín tsia̍h-pá buē?



c = Converter(punctuation='none')

c.get("太空朋友,恁好!恁食飽未?")

>> thài-khong pîng-iú,lín-hó!lín tsia̍h-pá buē?





# Tokeniser

t = Tokeniser()

t.tokenise("太空朋友,恁好!恁食飽未?")

>> ['太空', '朋友', ',', '恁好', '!', '恁', '食飽', '未', '?']

Data

Acknowledgements

  • Samuel Jen (GitHub · LinkedIn) - Taiwanese and Mandarin translation

Licence

Because Taibun is MIT-licensed, any developer can essentially do whatever they want with it as long as they include the original copyright and licence notice in any copies of the source code. Note that the data used by the package is licensed separately.

The data is licensed under CC BY-SA 4.0.

