Japanese text normalizer for mecab-neologd
Project description
neologdn
neologdn is a Japanese text normalizer for mecab-neologd.
The normalization is based on neologd's rules: https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja
Contributions are welcome!
NOTE: Installing this module requires a C++11 compiler.
Installation
$ pip install neologdn
Usage
import neologdn
neologdn.normalize("ﾊﾝｶｸｶﾅ")
# => 'ハンカクカナ'
neologdn.normalize("全角記号！？＠＃")
# => '全角記号!?@#'
neologdn.normalize("全角記号例外「・」")
# => '全角記号例外「・」'
neologdn.normalize("長音短縮ウェーーーーイ")
# => '長音短縮ウェーイ'
neologdn.normalize("チルダ削除ウェ~∼∾〜〰～イ")
# => 'チルダ削除ウェイ'
neologdn.normalize("いろんなハイフン˗֊‐‑‒–⁃⁻₋−")
# => 'いろんなハイフン-'
neologdn.normalize(" PRML 副 読 本 ")
# => 'PRML副読本'
neologdn.normalize(" Natural Language Processing ")
# => 'Natural Language Processing'
neologdn.normalize("かわいいいいいいいいい", repeat=6)
# => 'かわいいいいいい'
neologdn.normalize("無駄無駄無駄無駄ァ", repeat=1)
# => '無駄ァ'
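The effect of the repeat argument in the last two examples can be approximated in pure Python. The following is a minimal sketch, not neologdn's actual implementation (which is compiled code): the helper name shorten_repeat_sketch is hypothetical, and treating a threshold below 1 as "no shortening" is an assumption made for illustration.

```python
import re


def shorten_repeat_sketch(text: str, repeat: int) -> str:
    """Collapse runs of a repeated substring down to `repeat` copies.

    Hypothetical helper for illustration only; not neologdn's code.
    """
    if repeat < 1:
        # Assumption: a non-positive threshold leaves the text unchanged.
        return text
    # (.+?) lazily captures the shortest repeating unit, and \1{repeat,}
    # requires at least `repeat` further copies directly after it, so only
    # runs longer than `repeat` copies are rewritten.
    pattern = re.compile(r"(.+?)\1{%d,}" % repeat)
    return pattern.sub(lambda m: m.group(1) * repeat, text)


print(shorten_repeat_sketch("かわいいいいいいいいい", 6))  # かわいいいいいい
print(shorten_repeat_sketch("無駄無駄無駄無駄ァ", 1))      # 無駄ァ
```

The lazy capture group makes the regex prefer the shortest repeating unit, which is why a single character is collapsed in the first call and the two-character unit 無駄 in the second.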
Benchmark
# Sample code from
# https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja#python-written-by-hideaki-t--overlast
import normalize_neologd
%timeit normalize(normalize_neologd.normalize_neologd)
# => 1 loop, best of 3: 18.3 s per loop
import neologdn
%timeit normalize(neologdn.normalize)
# => 1 loop, best of 3: 9.05 s per loop
neologdn is about 2x faster than the sample code.
Details are described in the following notebook: https://github.com/ikegami-yukino/neologdn/blob/master/benchmark/benchmark.ipynb
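The normalize(...) helper called inside %timeit above is defined in the notebook, not in this description. Outside of IPython, a comparable measurement could be sketched with the standard timeit module. The names apply_to_corpus and time_normalizer and the tiny sample corpus below are hypothetical and are not taken from the notebook; str.strip stands in for a real normalizer so the sketch runs without neologdn installed.

```python
import timeit


def apply_to_corpus(normalizer, sentences):
    """Run one normalizer function over every sentence (hypothetical helper)."""
    return [normalizer(sentence) for sentence in sentences]


def time_normalizer(normalizer, sentences, repeat=3, number=1):
    """Return the best wall-clock time in seconds over `repeat` runs."""
    timer = timeit.Timer(lambda: apply_to_corpus(normalizer, sentences))
    return min(timer.repeat(repeat=repeat, number=number))


if __name__ == "__main__":
    # Made-up sample corpus; replace with real text for a meaningful result.
    corpus = ["  Natural Language Processing  "] * 1000
    # Swap str.strip for neologdn.normalize to time the library itself.
    best = time_normalizer(str.strip, corpus)
    print("best of 3: %.6f s per loop" % best)
```

Taking the minimum over several runs mirrors the "best of 3" figure that %timeit reports.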
License
Apache Software License.
CHANGES
0.4 (2018-12-06)
Add shorten_repeat function, which shortens contiguous repeated substrings. For example: neologdn.normalize("無駄無駄無駄無駄ァ", repeat=1) -> 無駄ァ
0.3.2 (2018-05-17)
Add an option to suppress the removal of spaces between Japanese characters
0.2.2 (2018-03-10)
Fix bug (daku-ten & handaku-ten)
Support macOS 10.13 (Many thanks @r9y9)
0.2.1 (2017-01-23)
Fix bug (check whether the character preceding a daku-ten character is in the maps) (Many thanks @unnonouno)
0.2 (2016-04-12)
Add lengthened expression (repeating character) threshold
0.1.2 (2016-03-29)
Fix installation bug
0.1.1.1 (2016-03-19)
Support Windows
Explicitly specify -std=c++11 in the build (Many thanks @id774)
0.1.1 (2015-10-10)
Initial release.
Download files
Built Distributions
Hashes for neologdn-0.4-cp37-cp37m-win_amd64.whl
Algorithm | Hash digest
---|---
SHA256 | a40e795eaa2b57709fd58ef00ecc7c93ba6be42b8956b65769a358b05fa9207e
MD5 | d33ded862c061c2510c02681055ca152
BLAKE2b-256 | 1c9e0bf2bb9c98bb85535adb10881eaf544b84ee6b85ec7f43f910e73cacd49c
Hashes for neologdn-0.4-cp37-cp37m-win32.whl
Algorithm | Hash digest
---|---
SHA256 | 045b0bf5574cdf872536e1b477db14b02f0e1e5af41661a7e43d5baf5e874d16
MD5 | 53d104f48da0032473fd7df0141c56e1
BLAKE2b-256 | bf8f1e600138acff804a93981987c234f01e1fd60309ae1910fca27f0275ab3d
Hashes for neologdn-0.4-cp37-cp37m-macosx_10_13_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 150b05b1c5f352979df2a042a812c42c9c9594c16fa6a6aa89b1d53b5abc1310
MD5 | 8470ea1689f72c3400d9a92a66ae70dc
BLAKE2b-256 | ccbee9485d74933d0707213086902fb406315b345c83fde18247aba7855ac5e0
Hashes for neologdn-0.4-cp36-cp36m-win_amd64.whl
Algorithm | Hash digest
---|---
SHA256 | 5bde59e2947fba72d2443bf7f0014c6a9660c440a826a5387842ba76e70b83f4
MD5 | 51ce14b9fbbd1cb99cdafea6ac3bb6e1
BLAKE2b-256 | ee0173745c0ee8e872832d20064f7539e99e085a169b511a030a17150e23df95
Hashes for neologdn-0.4-cp36-cp36m-win32.whl
Algorithm | Hash digest
---|---
SHA256 | 326c4d1945d59d0359b49d5bf103e6a52e62524f11bd538435bc2798ff239695
MD5 | 01cff616e92db1ecc8760e8e4ba6e39a
BLAKE2b-256 | 63787eadb67b3e1ec64a79d6da2969ffd4737f01a41656b73179095eaeffbde1
Hashes for neologdn-0.4-cp36-cp36m-macosx_10_12_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 3c981d62ba3c8407e7eab50349dfd65b4ea5d04944adfe628d8c869acf6c00c8
MD5 | a7a9607107c379da5c62a4ff23afb093
BLAKE2b-256 | d01ea7bdf833ad710a547e73640cecf5cadd82799d4a098cb85e8de3a318a6dc
Hashes for neologdn-0.4-cp27-cp27m-win_amd64.whl
Algorithm | Hash digest
---|---
SHA256 | 86d8638dbe83c9ad063c0c51b2e10ab58907532d69d0a294fa8f3ca6f6cb1b14
MD5 | 46fff8acf389dbf19886841d14351c8b
BLAKE2b-256 | ad91557190832f98cc017f1d72a449afd7cacea7f2cad215e3d68a3a90a9034e
Hashes for neologdn-0.4-cp27-cp27m-win32.whl
Algorithm | Hash digest
---|---
SHA256 | e5d44f05e2bf1b941612b923ac95c94e114acd907206c04ccaaa0a2b5b4cef8c
MD5 | 3c87cb3beaac3ba8ccb29aef5e6283cc
BLAKE2b-256 | 04dcab8c6df7fb2097b88045ee03f3f1e14787dfb47c9474dc8749dae22da0f3