Python implementation of kakasi - kana kanji simple inversion library
Pykakasi
Overview
pykakasi is a Python Natural Language Processing (NLP) library to transliterate hiragana, katakana and kanji (Japanese text) into rōmaji (Latin/Roman alphabet). It can handle characters in NFC form.
Its algorithms are based on the kakasi library, which is written in C.
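Because the library works on NFC text, input from sources that emit decomposed characters may need normalizing first. The following is a minimal sketch using only the standard library's `unicodedata` module; the helper name `to_nfc` is hypothetical and is not part of pykakasi.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (composed form)."""
    return unicodedata.normalize("NFC", text)


# か (U+304B) followed by the combining voiced sound mark (U+3099)
decomposed = "か\u3099"
composed = to_nfc(decomposed)

# The two code points are composed into the single character が (U+304C).
print(len(decomposed), len(composed))
print(composed == "\u304c")
```

Text that is already in NFC passes through unchanged, so applying the helper to all input is harmless.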
Install (from PyPI): pip install pykakasi
Supported Python versions
pykakasi supports Python 3.6, 3.7, 3.8, 3.9, and PyPy3.
Usage
Transliterate Japanese text to katakana, hiragana and romaji:
import pykakasi
kks = pykakasi.kakasi()
text = "かな漢字"
result = kks.convert(text)
for item in result:
    print("{}: kana '{}', hiragana '{}', romaji: '{}'".format(item['orig'], item['kana'], item['hira'], item['hepburn']))
かな: kana 'カナ', hiragana 'かな', romaji: 'kana'
漢字: kana 'カンジ', hiragana 'かんじ', romaji: 'kanji'
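Each item returned by convert() is a dictionary, so a whole-sentence reading can be assembled by joining one field across the items. The sketch below assumes only the four keys used in the example above ('orig', 'kana', 'hira', 'hepburn'); the helper names `join_field` and `romanize` are hypothetical, and the sample list is hand-written to mirror the output shown above so that the snippet runs without pykakasi installed.

```python
def join_field(items, key="hepburn", sep=" "):
    """Join one field of every converted item into a single string."""
    return sep.join(item[key] for item in items)


def romanize(text):
    """Convert text with pykakasi and join the romaji of each item."""
    import pykakasi  # imported lazily; requires `pip install pykakasi`

    return join_field(pykakasi.kakasi().convert(text))


# Hand-written sample mirroring the convert() result for "かな漢字".
sample = [
    {"orig": "かな", "kana": "カナ", "hira": "かな", "hepburn": "kana"},
    {"orig": "漢字", "kana": "カンジ", "hira": "かんじ", "hepburn": "kanji"},
]

print(join_field(sample))                    # kana kanji
print(join_field(sample, key="hira", sep=""))  # かなかんじ
```

Passing a different key selects katakana or hiragana instead of romaji, and an empty separator gives a continuous reading.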
The following example produces output similar to furigana mode:
import pykakasi
kks = pykakasi.kakasi()
text = "かな漢字交じり文"
result = kks.convert(text)
for item in result:
    print("{}[{}] ".format(item['orig'], item['hepburn'].capitalize()), end='')
print()
かな[Kana] 漢字[Kanji] 交じり[Majiri] 文[Bun]
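A variation on the example above is to annotate with hiragana rather than romaji and to skip segments that are already written in hiragana, which is closer to how furigana is normally printed. This is a sketch, not a pykakasi feature: the function `annotate_hira` is hypothetical, and the sample list is hand-written from the readings shown above so that the snippet runs without the library.

```python
def annotate_hira(items, left="[", right="]"):
    """Append the hiragana reading to each segment that is not already hiragana."""
    parts = []
    for item in items:
        if item["orig"] == item["hira"]:
            # Already hiragana: a reading would only repeat the text.
            parts.append(item["orig"])
        else:
            parts.append("{}{}{}{}".format(item["orig"], left, item["hira"], right))
    return " ".join(parts)


# Hand-written sample; readings follow the romaji shown in the example above.
furigana_sample = [
    {"orig": "かな", "hira": "かな", "hepburn": "kana"},
    {"orig": "漢字", "hira": "かんじ", "hepburn": "kanji"},
    {"orig": "交じり", "hira": "まじり", "hepburn": "majiri"},
    {"orig": "文", "hira": "ぶん", "hepburn": "bun"},
]

print(annotate_hira(furigana_sample))
# かな 漢字[かんじ] 交じり[まじり] 文[ぶん]
```

With real data the list would come from kks.convert(text), as in the example above.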
Benchmark result
Benchmark results for various versions and platforms are available at https://github.com/miurahr/pykakasi/issues/123
Copyright and License
- PyKakasi::
Copyright (C) 2010-2021 Hiroshi Miura and contributors(see AUTHORS)
- KAKASI Dictionary::
Copyright (C) 2010-2021 Hiroshi Miura and contributors(see AUTHORS)
Copyright (C) 1992 1993 1994 Hironobu Takahashi, Masahiko Sato, Yukiyoshi Kameyama, Miki Inooka, Akihiko Sasaki, Dai Ando, Junichi Okukawa, Katsushi Sato and Nobuhiro Yamagishi
- UniDic::
Copyright (c) 2011-2021, The UniDic Consortium
All rights reserved.
UniDic is released under any of the GPL2, the LGPL2.1, or the 3-clause BSD License. (See src/data/unidic/BSD.txt) PyKakasi relicenses a part of UniDic under GPL3+.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.
PyKakasi ChangeLog
All notable changes to this project will be documented in this file.
Unreleased
Added
Changed
Fixed
Deprecated
Removed
Security
v2.1.1 (16, May 2021)
Added
Provide Kakasi.normalize(text) class method
Add unidic data into data (not used yet), and add parse utility.
Fixed
Put type hint stub into package
Copyright notifications
Changed
Expand all cletter into dictionary (#139)
Change primary kanwadict index from str to int
test: gather all legacy test into test_pykakasi_legacy.py file.
v2.1.0 (6, May 2021)
Added
Deprecation warning when using the old API (#124)
Add type hint file(pyi) (#124)
Benchmark test codes(#122)
Changed
Cache internal results, improving performance by about 30-40 times (#128)
Use standard pickle for database file(#128)
Exceptions module is now pykakasi, not pykakasi.exceptions
Removed
Dependency for klepto(#128)
v2.0.8 (4, May 2021)
Added
test: Benchmark and profiling (#122)
Changed
Performance: avoid ord() when checking long-mark, speed up about 6%
Reformat code by black(#121)
v2.0.7 (26, Feb. 2021)
Fixed
Infinite loop after running for a while, handle independent HW VOICED SOUND MARK (#115, #118)
v2.0.6 (7, Feb. 2021)
Fixed
Hiragana for age counters (#116, #117)
v2.0.5 (5, Feb. 2021)
Changed
CLI: use argparse for option parse(#113)
Fixed
Handle 思った、言った、行った properly.(#114)
CI: fix coveralls error
Deprecated
CI: drop travis-ci test and badge
v2.0.4 (26, Nov. 2020)
Fixed
CLI: Fix -v and -h option crash on python 3.7 and before (#108).
v2.0.3 (25, Nov. 2020)
Fixed
CLI: Fix -v and -h option crash (#108).
v2.0.2 (23, Jul. 2020)
Fixed
Fix convert() to handle Katakana correctly.(#103)
v2.0.1 (23, Jul. 2020)
Changed
Update setup.py, setup.cfg, tox.ini(#102)
Fixed
Fix convert() missing the last part of a text (#99, #100)
Fix CI, coverage, and coveralls configurations(#101)
v2.0.0 (31, May. 2020)