Japanese text normalizer for mecab-neologd

neologdn

neologdn is a Japanese text normalizer for mecab-neologd.

The normalization is based on neologd's rules: https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja

Some optional features are also added.

Contributions are welcome!

NOTE: Installing this module requires a C++11 compiler.

Installation

pip install neologdn

If setuptools is not installed, you must install it:

pip install setuptools

If you encounter the following error:

ERROR: Could not find a version that satisfies the requirement setuptools (from versions: none)

Then the following commands may resolve it:

pip install wheel
pip install --no-build-isolation neologdn
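
Before retrying the build, it can help to confirm which of these packages the active interpreter can actually see. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of neologdn.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be located."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Report the packages mentioned in the installation steps above.
    for name in ("setuptools", "wheel", "neologdn"):
        status = "found" if is_installed(name) else "missing"
        print(f"{name}: {status}")
```

Run it with the same interpreter that `pip` installs into; a mismatch between the two is a common reason for a package to appear missing.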

Usage

import neologdn
neologdn.normalize("ﾊﾝｶｸｶﾅ")  # half-width katakana input
# => 'ハンカクカナ'
neologdn.normalize("全角記号！？＠＃")  # full-width symbols input
# => '全角記号!?@#'
neologdn.normalize("全角記号例外「・」")
# => '全角記号例外「・」'
neologdn.normalize("長音短縮ウェーーーーイ")
# => '長音短縮ウェーイ'
neologdn.normalize("チルダ削除ウェ~∼∾〜〰～イ")
# => 'チルダ削除ウェイ'
neologdn.normalize("いろんなハイフン˗֊‐‑‒–⁃⁻₋−")
# => 'いろんなハイフン-'
neologdn.normalize("   PRML  副 読 本   ")
# => 'PRML副読本'
neologdn.normalize(" Natural Language Processing ")
# => 'Natural Language Processing'
neologdn.normalize("かわいいいいいいいいい", repeat=6)
# => 'かわいいいいいい'
neologdn.normalize("無駄無駄無駄無駄ァ", repeat=1)
# => '無駄ァ'
neologdn.normalize("1995〜2001年", tilde="normalize")
# => '1995~2001年'
neologdn.normalize("1995~2001年", tilde="normalize_zenkaku")
# => '1995〜2001年'
neologdn.normalize("1995〜2001年", tilde="ignore")  # Don't convert tilde
# => '1995〜2001年'
neologdn.normalize("1995〜2001年", tilde="remove")
# => '19952001年'
neologdn.normalize("1995〜2001年")  # Default parameter
# => '19952001年'
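
To make the long-vowel and `repeat` examples above more concrete, here is a rough pure-Python sketch that reproduces those outputs with regular expressions. It is not neologdn's implementation (which is compiled code), and the function names `collapse_long_vowel` and `shorten_repeat_sketch` are made up for illustration.

```python
import re


def collapse_long_vowel(text: str) -> str:
    """Collapse a run of the prolonged sound mark 'ー' into a single mark."""
    return re.sub(r"ー+", "ー", text)


def shorten_repeat_sketch(text: str, repeat: int) -> str:
    """Cap any back-to-back repeated substring at `repeat` occurrences."""
    if repeat < 1:
        raise ValueError("repeat must be >= 1")
    # (.+?) captures the shortest repeating unit; \1{repeat,} requires at
    # least `repeat` further copies, i.e. more than `repeat` in total.
    pattern = re.compile(r"(.+?)\1{" + str(repeat) + r",}")
    return pattern.sub(lambda m: m.group(1) * repeat, text)


if __name__ == "__main__":
    print(collapse_long_vowel("長音短縮ウェーーーーイ"))
    print(shorten_repeat_sketch("かわ" + "い" * 9, repeat=6))
    print(shorten_repeat_sketch("無駄無駄無駄無駄ァ", repeat=1))
```

The sketch only mirrors the documented examples; edge cases in the real library may behave differently.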

Benchmark

# Sample code from
# https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja#python-written-by-hideaki-t--overlast
import normalize_neologd

%timeit normalize(normalize_neologd.normalize_neologd)
# => 9.55 s ± 29.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

import neologdn
%timeit normalize(neologdn.normalize)
# => 6.66 s ± 35.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

neologdn is about 1.43x faster than the sample code.

Details are described in the notebook below: https://github.com/ikegami-yukino/neologdn/blob/master/benchmark/benchmark.ipynb
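
The `%timeit` magic above only works inside IPython or Jupyter. As a rough sketch of the same kind of measurement in a plain script, the standard-library `timeit` module can be used. The `benchmark` and `speedup` helpers and the sample data here are assumptions for illustration, not the harness from the notebook.

```python
import timeit


def benchmark(func, texts, repeat=7, number=1):
    """Return mean seconds per loop for applying func to every text."""
    def run():
        for text in texts:
            func(text)

    timings = timeit.repeat(run, repeat=repeat, number=number)
    return sum(timings) / len(timings) / number


def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    sample = ["長音短縮ウェーーーーイ"] * 1000
    # str.strip is only a placeholder; pass the normalizer under test instead.
    print(f"{benchmark(str.strip, sample):.6f} s per loop")
    # Ratio of the two timings reported above.
    print(round(speedup(9.55, 6.66), 2))
```

Dividing the reported means, 9.55 s by 6.66 s, gives the 1.43x figure quoted above.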

License

Apache Software License.

CHANGES

0.5.4 (2025-03-15)

  • Support Python 3.13
  • Fix tilde loss after latin and whitespace (Many thanks @a-lucky)

0.5.3 (2024-05-03)

  • Support Python 3.12

0.5.2 (2023-08-03)

  • Support Python 3.10 and 3.11 (Many thanks @polm)

0.5.1 (2021-05-02)

  • Improve performance of shorten_repeat function (Many thanks @yskn67)
  • Add tilde option to normalize function

0.4 (2018-12-06)

  • Add shorten_repeat function, which shortens contiguous repeated substrings. For example: neologdn.normalize("無駄無駄無駄無駄ァ", repeat=1) -> 無駄ァ

0.3.2 (2018-05-17)

  • Add option to suppress removal of spaces between Japanese characters

0.2.2 (2018-03-10)

  • Fix bug (daku-ten & handaku-ten)
  • Support macOS 10.13 (Many thanks @r9y9)

0.2.1 (2017-01-23)

  • Fix bug (check whether the character preceding a daku-ten character is in maps) (Many thanks @unnonouno)

0.2 (2016-04-12)

  • Add lengthened expression (repeating character) threshold

0.1.2 (2016-03-29)

  • Fix installation bug

0.1.1.1 (2016-03-19)

  • Support Windows
  • Explicitly specify -std=c++11 in build (Many thanks @id774)

0.1.1 (2015-10-10)

Initial release.

Contribution

Contributions are welcome! See: https://github.com/ikegami-yukino/neologdn/blob/master/.github/CONTRIBUTING.md

Cited by

Book

  • 山本 和英. テキスト処理の要素技術. 近代科学社. p.41. 2021.

