Japanese text normalizer for mecab-neologd

Project description

neologdn

neologdn is a Japanese text normalizer for mecab-neologd.

The normalization follows neologd's rules: https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja
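As a rough illustration of the width-folding part of such rules, the sketch below uses Unicode NFKC normalization from the Python standard library. It is not neologdn's implementation, and the helper name fold_width is hypothetical. NFKC covers only width folding; it does not, for example, shorten long vowel marks or remove tildes.

```python
import unicodedata


def fold_width(text: str) -> str:
    """Fold half-width kana and full-width ASCII with Unicode NFKC.

    Hypothetical helper for illustration only; neologdn itself is a
    compiled extension and does not necessarily work this way.
    """
    return unicodedata.normalize("NFKC", text)


print(fold_width("ﾊﾝｶｸｶﾅ"))      # half-width katakana become full-width
print(fold_width("ＰＲＭＬ！？"))  # full-width Latin letters and symbols become ASCII
```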

Contributions are welcome!

NOTE: Installing this module requires a C++11 compiler.

Installation

$ pip install neologdn

Usage

import neologdn
neologdn.normalize("ﾊﾝｶｸｶﾅ")  # half-width katakana input
# => 'ハンカクカナ'
neologdn.normalize("全角記号！？＠＃")  # full-width symbols input
# => '全角記号!?@#'
neologdn.normalize("全角記号例外「・」")
# => '全角記号例外「・」'
neologdn.normalize("長音短縮ウェーーーーイ")
# => '長音短縮ウェーイ'
neologdn.normalize("チルダ削除ウェ~∼∾〜〰~イ")
# => 'チルダ削除ウェイ'
neologdn.normalize("いろんなハイフン˗֊‐‑‒–⁃⁻₋−")
# => 'いろんなハイフン-'
neologdn.normalize("   PRML  副 読 本   ")
# => 'PRML副読本'
neologdn.normalize(" Natural Language Processing ")
# => 'Natural Language Processing'
neologdn.normalize("かわいいいいいいいいい", repeat=6)
# => 'かわいいいいいい'
neologdn.normalize("無駄無駄無駄無駄ァ", repeat=1)
# => '無駄ァ'
neologdn.normalize("1995〜2001年", tilde="normalize")
# => '1995~2001年'
neologdn.normalize("1995~2001年", tilde="normalize_zenkaku")
# => '1995〜2001年'
neologdn.normalize("1995〜2001年", tilde="ignore")  # Don't convert tilde
# => '1995〜2001年'
neologdn.normalize("1995〜2001年", tilde="remove")
# => '19952001年'
neologdn.normalize("1995〜2001年")  # Default parameter
# => '19952001年'
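To make the repeat and tilde options easier to reason about, here is a minimal pure-Python sketch that reproduces the example outputs above for those two options. It is an approximation built on the standard library, not neologdn's actual code; the function names and the TILDES set are assumptions taken from the characters that appear in the examples.

```python
import re

# Tilde-like characters that appear in the usage examples (assumed set).
TILDES = "~∼∾〜〰～"


def shorten_repeat_sketch(text: str, repeat: int) -> str:
    """Collapse a substring repeated more than `repeat` times in a row."""
    if repeat < 1:
        return text
    # (.+?) captures the shortest unit; \1{n,} requires n or more further copies.
    pattern = re.compile(r"(.+?)\1{%d,}" % repeat)
    return pattern.sub(lambda m: m.group(1) * repeat, text)


def convert_tilde_sketch(text: str, mode: str = "remove") -> str:
    """Emulate the four tilde modes shown in the usage examples."""
    if mode == "ignore":
        return text
    replacement = {"remove": "", "normalize": "~", "normalize_zenkaku": "〜"}[mode]
    # Mapping a code point to "" in str.translate deletes the character.
    return text.translate({ord(c): replacement for c in TILDES})


print(shorten_repeat_sketch("無駄無駄無駄無駄ァ", 1))
print(convert_tilde_sketch("1995〜2001年", "normalize"))
```

The backreference pattern is simple but can be slow on long inputs, so treat it as a way to understand the behavior rather than a replacement for the library.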

Benchmark

# Sample code from
# https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja#python-written-by-hideaki-t--overlast
import normalize_neologd

%timeit normalize(normalize_neologd.normalize_neologd)
# => 9.55 s ± 29.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


import neologdn
%timeit normalize(neologdn.normalize)
# => 6.66 s ± 35.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

neologdn is about 1.43x faster than the sample code.

Details are described in the notebook below: https://github.com/ikegami-yukino/neologdn/blob/master/benchmark/benchmark.ipynb
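The normalize helper called in the benchmark above is not defined in this section, so the following is only a hypothetical harness showing how such a comparison could be timed with the standard timeit module. The names time_normalizer and speedup are assumptions. It also reproduces the ratio computed from the two reported means.

```python
import timeit
from typing import Callable, Sequence


def time_normalizer(normalizer: Callable[[str], str],
                    texts: Sequence[str],
                    repeat: int = 3) -> float:
    """Return the best time in seconds to normalize every text once."""
    def run() -> None:
        for text in texts:
            normalizer(text)
    # Take the minimum over several runs to reduce noise from the system.
    return min(timeit.repeat(run, number=1, repeat=repeat))


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Ratio of the two mean timings reported above.
print(round(speedup(9.55, 6.66), 2))  # => 1.43
```

Any callable that takes and returns a string can be passed as normalizer, for example neologdn.normalize once the package is installed.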

License

Apache Software License.

Contribution

Contributions are welcome! See: https://github.com/ikegami-yukino/neologdn/blob/master/.github/CONTRIBUTING.md

CHANGES

0.5.2 (2023-08-03)

  • Support Python 3.10 and 3.11 (Many thanks @polm)

0.5.1 (2021-05-02)

  • Improve performance of shorten_repeat function (Many thanks @yskn67)

  • Add tilde option to normalize function

0.4 (2018-12-06)

  • Add shorten_repeat function, which shortens contiguous repeated substrings. For example: neologdn.normalize("無駄無駄無駄無駄ァ", repeat=1) -> 無駄ァ

0.3.2 (2018-05-17)

  • Add option to suppress removal of spaces between Japanese characters

0.2.2 (2018-03-10)

  • Fix bug (daku-ten & handaku-ten)

  • Support macOS 10.13 (Many thanks @r9y9)

0.2.1 (2017-01-23)

  • Fix bug (check whether the character preceding a daku-ten character is in the maps) (Many thanks @unnonouno)

0.2 (2016-04-12)

  • Add lengthened expression (repeating character) threshold

0.1.2 (2016-03-29)

  • Fix installation bug

0.1.1.1 (2016-03-19)

  • Support Windows

  • Explicitly specify -std=c++11 in build (Many thanks @id774)

0.1.1 (2015-10-10)

Initial release.

Download files

Source Distribution

  • neologdn-0.5.2.tar.gz (86.2 kB): Source

Built Distributions

  • neologdn-0.5.2-cp311-cp311-win_amd64.whl (53.0 kB): CPython 3.11, Windows x86-64
  • neologdn-0.5.2-cp310-cp310-win_amd64.whl (58.1 kB): CPython 3.10, Windows x86-64
  • neologdn-0.5.2-cp39-cp39-win_amd64.whl (58.1 kB): CPython 3.9, Windows x86-64
  • neologdn-0.5.2-cp38-cp38-win_amd64.whl (52.6 kB): CPython 3.8, Windows x86-64
  • neologdn-0.5.2-cp37-cp37m-win_amd64.whl (65.9 kB): CPython 3.7m, Windows x86-64
  • neologdn-0.5.2-cp36-cp36m-win_amd64.whl (68.2 kB): CPython 3.6m, Windows x86-64
  • neologdn-0.5.2-cp27-cp27m-win_amd64.whl (54.3 kB): CPython 2.7m, Windows x86-64

File details

Details for the file neologdn-0.5.2.tar.gz.

File metadata

  • Download URL: neologdn-0.5.2.tar.gz
  • Upload date:
  • Size: 86.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for neologdn-0.5.2.tar.gz
Algorithm Hash digest
SHA256 2f56b2ffddfe7f8613d52b9f6366c224af2bb217c47c1e80e227a348345cce52
MD5 baa609fd1e44fc83e68147e89f042f70
BLAKE2b-256 2574a0a015e7ce8da5d12be013f3f0cf7ce85c83b9308f4b7419b70a981e41d9

See more details on using hashes here.

File details

Details for the file neologdn-0.5.2-cp311-cp311-win_amd64.whl.

File hashes

Hashes for neologdn-0.5.2-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 61a7b3d9b8f6c6a49de333f618051e73312bf84241c8cdc4093e71e4b94bef9a
MD5 7931088c08442224e7e4aa537410ae8b
BLAKE2b-256 7da4d3b937acabe5039d0869c93325f195012f31545bcd7c395e26712ff91013

File details

Details for the file neologdn-0.5.2-cp310-cp310-win_amd64.whl.

File hashes

Hashes for neologdn-0.5.2-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 b6446d648b3d2f73a69746138b6f8037117b2d14bc1336256e98745dc68577c2
MD5 f7ffc219bafa4100d795df2d2ca7c525
BLAKE2b-256 7030645df850d36cbeee3c9df89deb09b72815ef183e89f9341f092a5828481b

File details

Details for the file neologdn-0.5.2-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: neologdn-0.5.2-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 58.1 kB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.2

File hashes

Hashes for neologdn-0.5.2-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 607c22febe363666fdab9a8fae0650eab2df5dcd12324e97239a6767caabeca4
MD5 00c85b08a6a8c87f19fe2811ea24be61
BLAKE2b-256 5a74e14b9f814b3122413f81b07d72718900fcc9fd0c8d1690d4a8e2418b5a60

File details

Details for the file neologdn-0.5.2-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: neologdn-0.5.2-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 52.6 kB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.2

File hashes

Hashes for neologdn-0.5.2-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 ebbe4df98b4784b75c18aed22db79dc985912a090ce8cda876cac103e89f2bae
MD5 05130539ea9681e4a70aa2deab06af38
BLAKE2b-256 d558a7452f5a0c110566f8a271438939e8a61a9e370b6deb45b8b89d3676e4f8

File details

Details for the file neologdn-0.5.2-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: neologdn-0.5.2-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 65.9 kB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.2

File hashes

Hashes for neologdn-0.5.2-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 881098e7478cfd76181f7967ab47424cd60c2fd19507e0334c33509a63c8af1c
MD5 cb5bfd8c969d2f44cc07966301f5ff63
BLAKE2b-256 edaf6db458262272640c3c12796849d90a3f97a91a1601a95e027c2cfd40ddb9

File details

Details for the file neologdn-0.5.2-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: neologdn-0.5.2-cp36-cp36m-win_amd64.whl
  • Upload date:
  • Size: 68.2 kB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.2

File hashes

Hashes for neologdn-0.5.2-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 5ae0fb12a2816d65f1ecc4e09a6e7555320283c2cb7d881e2baf2449bb1fc794
MD5 0dd8286656042b98afeb13497180e2e5
BLAKE2b-256 039665fcd58d305f7b4b846ca4734705c5f98f78bd3b4675595a199206731df8

File details

Details for the file neologdn-0.5.2-cp27-cp27m-win_amd64.whl.

File metadata

  • Download URL: neologdn-0.5.2-cp27-cp27m-win_amd64.whl
  • Upload date:
  • Size: 54.3 kB
  • Tags: CPython 2.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.2

File hashes

Hashes for neologdn-0.5.2-cp27-cp27m-win_amd64.whl
Algorithm Hash digest
SHA256 f4032406ef974aa3d452ba121475f70bb35325588d4695a589d363ded59b076a
MD5 e7f5ced96eb7d7e7926bd0b71ad01b13
BLAKE2b-256 028a4a979d01235313a0b18bf5591fda4c87acf5cc1ffd8d99b7c80af33fc714
