Japanese text normalizer for mecab-neologd
neologdn
neologdn is a Japanese text normalizer for mecab-neologd.
The normalization is based on neologd's rules: https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja
Contributions are welcome!
NOTE: Installing this module requires a C++11 compiler.
Installation
$ pip install neologdn
Usage
import neologdn
neologdn.normalize("ﾊﾝｶｸｶﾅ")
# => 'ハンカクカナ'
neologdn.normalize("全角記号！？＠＃")
# => '全角記号!?@#'
neologdn.normalize("全角記号例外「・」")
# => '全角記号例外「・」'
neologdn.normalize("長音短縮ウェーーーーイ")
# => '長音短縮ウェーイ'
neologdn.normalize("チルダ削除ウェ~∼∾〜〰~イ")
# => 'チルダ削除ウェイ'
neologdn.normalize("いろんなハイフン˗֊‐‑‒–⁃⁻₋−")
# => 'いろんなハイフン-'
neologdn.normalize(" PRML 副 読 本 ")
# => 'PRML副読本'
neologdn.normalize(" Natural Language Processing ")
# => 'Natural Language Processing'
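To make the rules behind these examples more concrete, here is a minimal pure-Python sketch that approximates a few of them with the standard library. It is not neologdn's implementation (which is compiled), and `approx_normalize` is a hypothetical name used only for illustration. It uses Unicode NFKC as a stand-in for the half-width/full-width conversions, which is an assumption that does not match neologd's rules in every case, and it does not handle whitespace.

```python
import re
import unicodedata

# Character sets taken from the usage examples (hyphen-like and tilde-like marks).
HYPHENS = "˗֊‐‑‒–⁃⁻₋−"
TILDES = "~∼∾〜〰～"


def approx_normalize(text):
    """Partial approximation of the rules shown above (hypothetical helper)."""
    # NFKC folds half-width katakana to full-width and
    # full-width ASCII letters and symbols to half-width.
    text = unicodedata.normalize("NFKC", text)
    # Delete tilde-like characters.
    text = text.translate({ord(ch): None for ch in TILDES})
    # Collapse any run of hyphen-like characters into a single ASCII hyphen.
    text = re.sub("[" + HYPHENS + "]+", "-", text)
    # Shorten repeated prolonged sound marks to one.
    text = re.sub("ー+", "ー", text)
    return text


print(approx_normalize("長音短縮ウェーーーーイ"))  # 長音短縮ウェーイ
print(approx_normalize("いろんなハイフン˗֊‐‑‒–⁃⁻₋−"))  # いろんなハイフン-
```

For real use, call `neologdn.normalize` instead; the sketch only shows which kinds of characters each rule touches.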
Benchmark
# Sample code from
# https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja#python-written-by-hideaki-t--overlast
import normalize_neologd
%timeit normalize(normalize_neologd.normalize_neologd)
# => 1000 loops, best of 3: 1.18 ms per loop
import neologdn
%timeit normalize(neologdn.normalize)
# => 10000 loops, best of 3: 140 µs per loop
In this measurement, neologdn is roughly 8 times faster than the sample code (140 µs vs. 1.18 ms per loop).
Details are in the following notebook: https://github.com/ikegami-yukino/neologdn/blob/master/benchmark/benchmark.ipynb
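The `%timeit` lines above depend on IPython and on a `normalize(...)` driver that is not shown here. As a rough sketch of how a similar measurement could be made with the standard `timeit` module, the following times one pass of a normalizer over a list of lines. The names `stand_in_normalizer` and `microseconds_per_pass` and the sample lines are assumptions made for illustration; the stand-in uses NFKC only so that the sketch runs without neologdn installed.

```python
import timeit
import unicodedata


def stand_in_normalizer(text):
    # Placeholder normalizer (assumption); swap in neologdn.normalize when installed.
    return unicodedata.normalize("NFKC", text)


def microseconds_per_pass(func, lines, number=1000, repeat=3):
    """Return the best-of-`repeat` time for one pass over `lines`, in microseconds."""
    def one_pass():
        for line in lines:
            func(line)

    # timeit.repeat returns the total seconds for `number` calls, once per repeat.
    best = min(timeit.repeat(one_pass, number=number, repeat=repeat))
    return best / number * 1e6


# Illustrative input only; the original benchmark uses its own data.
sample_lines = ["長音短縮ウェーーーーイ", " Natural Language Processing "] * 50
elapsed = microseconds_per_pass(stand_in_normalizer, sample_lines, number=100)
print("%.1f µs per pass" % elapsed)
```

Passing `neologdn.normalize` and the sample implementation in turn to `microseconds_per_pass` would give two numbers that can be compared directly; absolute timings will vary by machine.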
License
Apache Software License.
CHANGES
0.1.1.1 (2016-03-19)
Support Windows
Explicitly specify -std=c++11 in the build (many thanks to id774)
0.1 (2015-10-10)
Initial release.
Download files
Source Distribution
neologdn-0.1.1.1.tar.gz (44.9 kB)
File details
Details for the file neologdn-0.1.1.1.tar.gz.
File metadata
- Download URL: neologdn-0.1.1.1.tar.gz
- Upload date:
- Size: 44.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 393b19030c1d2867f373046a4de44c56e947309db27556a956b3965983e6c3f3
MD5 | e4a139c7f01b431dcbef4a2826638c33
BLAKE2b-256 | 87cbaf959540735544e28f0372e5e79624d8ad667c87452320678be6acb9f3e0