Japanese text normalizer for mecab-neologd
neologdn
neologdn is a Japanese text normalizer for mecab-neologd.
The normalization is based on neologd's rules: https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja
Contributions are welcome!
NOTE: Installing this module requires a C++11 compiler.
Installation
$ pip install neologdn
Usage
import neologdn
neologdn.normalize("ﾊﾝｶｸｶﾅ")
# => 'ハンカクカナ'
neologdn.normalize("全角記号！？＠＃")
# => '全角記号!?@#'
neologdn.normalize("全角記号例外「・」")
# => '全角記号例外「・」'
neologdn.normalize("長音短縮ウェーーーーイ")
# => '長音短縮ウェーイ'
neologdn.normalize("チルダ削除ウェ~∼∾〜〰~イ")
# => 'チルダ削除ウェイ'
neologdn.normalize("いろんなハイフン˗֊‐‑‒–⁃⁻₋−")
# => 'いろんなハイフン-'
neologdn.normalize(" PRML 副 読 本 ")
# => 'PRML副読本'
neologdn.normalize(" Natural Language Processing ")
# => 'Natural Language Processing'
neologdn.normalize("かわいいいいいいいいい", repeat=6)
# => 'かわいいいいいい'
neologdn.normalize("無駄無駄無駄無駄ァ", repeat=1)
# => '無駄ァ'
neologdn.normalize("1995〜2001年", tilde="normalize")
# => '1995~2001年'
neologdn.normalize("1995~2001年", tilde="normalize_zenkaku")
# => '1995〜2001年'
neologdn.normalize("1995〜2001年", tilde="ignore") # Don't convert tilde
# => '1995〜2001年'
neologdn.normalize("1995〜2001年", tilde="remove")
# => '19952001年'
neologdn.normalize("1995〜2001年") # Default parameter
# => '19952001年'
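To make the character-level rules in the examples above more concrete, here is a minimal pure-Python sketch of four of them: width folding, hyphen normalization, tilde removal, and long-vowel shortening. It is only an approximation built on `re` and `unicodedata`, not neologdn's C++ implementation. The name `sketch_normalize` and the width ranges are assumptions made for illustration; the hyphen and tilde code points are the ones listed in the examples.

```python
import re
import unicodedata

# Code points from the hyphen and tilde examples above, written as escapes.
HYPHENS = "\u02d7\u058a\u2010\u2011\u2012\u2013\u2043\u207b\u208b\u2212"
TILDES = "~\u223c\u223e\u301c\u3030\uff5e"
# Full-width ASCII variants and half-width katakana (assumed ranges).
WIDTH_VARIANTS = re.compile("[\uff01-\uff5e\uff61-\uff9f]+")


def sketch_normalize(text: str) -> str:
    """Approximate a few neologd-style rules (hypothetical helper)."""
    # Fold full-width ASCII to half-width and half-width kana to full-width.
    text = WIDTH_VARIANTS.sub(
        lambda m: unicodedata.normalize("NFKC", m.group()), text
    )
    # Collapse any run of hyphen look-alikes into one ASCII hyphen.
    text = re.sub("[" + HYPHENS + "]+", "-", text)
    # Drop tilde look-alikes, mirroring the default tilde="remove" behavior.
    text = re.sub("[" + TILDES + "]+", "", text)
    # Shorten a run of prolonged sound marks to a single mark.
    text = re.sub("ー+", "ー", text)
    return text


print(sketch_normalize("長音短縮ウェーーーーイ"))
```

The real library covers more cases (for example, whitespace handling and the `tilde` modes), so this sketch should be read as a description of the idea rather than a drop-in replacement.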
Benchmark
# Sample code from
# https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja#python-written-by-hideaki-t--overlast
import normalize_neologd
%timeit normalize(normalize_neologd.normalize_neologd)
# => 1 loop, best of 3: 18.3 s per loop
import neologdn
%timeit normalize(neologdn.normalize)
# => 1 loop, best of 3: 9.05 s per loop
neologdn is about 2x faster than the sample code.
Details are described in the following notebook: https://github.com/ikegami-yukino/neologdn/blob/master/benchmark/benchmark.ipynb
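The `%timeit` lines above call a `normalize(...)` helper that is defined in the notebook, not in this README. For readers outside IPython, a comparable measurement can be sketched with the standard `timeit` module. Everything here is an assumption for illustration: the helper names `apply_to_corpus` and `best_of`, the tiny made-up corpus, and the use of NFKC as a stand-in normalizer so that the snippet runs without neologdn installed.

```python
import timeit
import unicodedata


def apply_to_corpus(normalizer, sentences):
    """Run the normalizer over every sentence; this is the unit being timed."""
    return [normalizer(s) for s in sentences]


def nfkc(text):
    """Stand-in normalizer so the sketch runs without neologdn."""
    return unicodedata.normalize("NFKC", text)


# Made-up corpus; the notebook uses its own data.
sentences = ["長音短縮ウェーーーーイ", " Natural Language Processing "] * 1000


def best_of(normalizer, repeat=3):
    """Return the best wall-clock time over `repeat` single runs."""
    timer = timeit.Timer(lambda: apply_to_corpus(normalizer, sentences))
    return min(timer.repeat(repeat=repeat, number=1))


elapsed = best_of(nfkc)
print(f"best of 3: {elapsed:.4f} s per loop")
```

With neologdn installed, passing `neologdn.normalize` to `best_of` in place of `nfkc` would time the library on the same corpus.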
License
Apache Software License.
Contribution
Contributions are welcome! See: https://github.com/ikegami-yukino/neologdn/blob/master/.github/CONTRIBUTING.md
CHANGES
0.5.1 (2021-05-02)
Improve performance of shorten_repeat function (Many thanks @yskn67)
Add tilde option to normalize function
0.4 (2018-12-06)
Add shorten_repeat function, which shortens contiguous repeated substrings. For example: neologdn.normalize("無駄無駄無駄無駄ァ", repeat=1) -> 無駄ァ
0.3.2 (2018-05-17)
Add option to suppress removal of spaces between Japanese characters
0.2.2 (2018-03-10)
Fix bug (daku-ten & handaku-ten)
Support macOS 10.13 (Many thanks @r9y9)
0.2.1 (2017-01-23)
Fix bug (check whether the character preceding a daku-ten character is in maps) (Many thanks @unnonouno)
0.2 (2016-04-12)
Add lengthened expression (repeating character) threshold
0.1.2 (2016-03-29)
Fix installation bug
0.1.1.1 (2016-03-19)
Support Windows
Explicitly specify -std=c++11 in build (Many thanks @id774)
0.1.1 (2015-10-10)
Initial release.
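The repeat threshold (0.2) and the substring shortening added in 0.4 can be illustrated with a backreference regex. This is a rough sketch of the behavior shown in the usage examples, not the library's shorten_repeat code. The name `shorten_repeat_sketch` is hypothetical, and treating `repeat <= 0` as "no limit" is an assumption.

```python
import re


def shorten_repeat_sketch(text: str, repeat: int) -> str:
    """Cap consecutive repetitions of any substring at `repeat` copies."""
    if repeat <= 0:
        # Assumption: a non-positive threshold means "leave text unchanged".
        return text
    # (.+?) captures the shortest repeating unit; \1{n,} requires at least
    # n more copies, so the whole match holds more than n copies in total.
    pattern = re.compile(r"(.+?)\1{%d,}" % repeat)
    return pattern.sub(lambda m: m.group(1) * repeat, text)


print(shorten_repeat_sketch("無駄無駄無駄無駄ァ", repeat=1))
print(shorten_repeat_sketch("かわ" + "い" * 9, repeat=6))
```

A backtracking regex like this can be slow on long inputs, which is one reason a dedicated implementation is preferable in practice.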