sengiri

Yet another sentence-level tokenizer for Japanese text

DEPENDENCIES

  • MeCab

  • emoji

INSTALLATION

$ pip install sengiri
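
After installing, it can help to confirm that the packages listed under DEPENDENCIES are importable. The following is a minimal sketch, not part of sengiri: the helper name missing_modules is made up for illustration, and it assumes the bindings are exposed under the module names MeCab and emoji.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed module names for the dependencies and for the package itself.
    missing = missing_modules(["MeCab", "emoji", "sengiri"])
    print("missing:", missing if missing else "none")
```

An empty result only shows that the Python modules were found; it does not check that a MeCab dictionary is installed.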

USAGE

import sengiri

print(sengiri.tokenize('うーん🤔🤔🤔どうしよう'))
#=>['うーん🤔🤔🤔', 'どうしよう']
print(sengiri.tokenize('モー娘。のコンサートに行った。'))
#=>['モー娘。のコンサートに行った。']
print(sengiri.tokenize('ありがとう^^ 助かります。'))
#=>['ありがとう^^', '助かります。']
print(sengiri.tokenize('顔文字テスト(*´ω`*)うまくいくかな?'))
#=>['顔文字テスト(*´ω`*)うまくいくかな?']
# Using the NEologd dictionary is recommended (adjust the path to your installation).
print(sengiri.tokenize('顔文字テスト(*´ω`*)うまくいくかな?', mecab_args='-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd'))
#=>['顔文字テスト(*´ω`*)', 'うまくいくかな?']
print(sengiri.tokenize('子供が大変なことになった。'
                       '(後で聞いたのだが、脅されたらしい)'
                       '(脅迫はやめてほしいと言っているのに)'))
#=>['子供が大変なことになった。', '(後で聞いたのだが、脅されたらしい)', '(脅迫はやめてほしいと言っているのに)']
print(sengiri.tokenize('楽しかったw また遊ぼwww'))
#=>['楽しかったw', 'また遊ぼwww']
print(sengiri.tokenize('http://www.inpaku.go.jp/'))
#=>['http://www.inpaku.go.jp/']
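
The calls above each handle a single string. For a collection of documents, a thin wrapper may be convenient. The sketch below is not part of sengiri: split_documents is a hypothetical helper that accepts any callable shaped like sengiri.tokenize (a string in, a list of sentences out), and naive_split is a purely illustrative stand-in so the example runs without MeCab.

```python
import re
from typing import Callable, Dict, Iterable, List


def split_documents(
    docs: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Map each non-blank document to the sentences returned by `tokenize`."""
    result = {}
    for doc in docs:
        text = doc.strip()
        if not text:
            continue  # skip blank inputs instead of calling the tokenizer
        result[text] = tokenize(text)
    return result


def naive_split(text: str) -> List[str]:
    """Stand-in tokenizer (hypothetical): cut after each Japanese full stop."""
    return re.findall(r"[^。]+。?", text)


if __name__ == "__main__":
    print(split_documents(["今日は晴れ。明日は雨。", "テスト"], naive_split))
```

With sengiri installed, sengiri.tokenize can be passed in place of naive_split.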

CHANGES

0.2.3 (2025-11-28)

  • Enhanced URL handling stability using MeCab’s partial parsing and regex

  • Resolved a problem with the emoji dependency

  • Added support for Python 3.9-3.14

  • Dropped support for Python 3.4-3.8
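
The regex half of the URL handling mentioned in this release can be pictured as marking URL spans so that later splitting never cuts through them. The sketch below is an assumption about the general idea only, not sengiri's implementation: URL_PATTERN and segment_urls are hypothetical names, and MeCab's partial parsing is not shown.

```python
import re

# ASCII-only matching keeps Japanese text after a URL out of the match.
URL_PATTERN = re.compile(r"https?://[\w/:%#$&?()~.=+\-]+", re.ASCII)


def segment_urls(text):
    """Return (segment, is_url) pairs that cover `text` in order."""
    segments, pos = [], 0
    for match in URL_PATTERN.finditer(text):
        if match.start() > pos:
            segments.append((text[pos:match.start()], False))
        segments.append((match.group(), True))
        pos = match.end()
    if pos < len(text):
        segments.append((text[pos:], False))
    return segments


if __name__ == "__main__":
    print(segment_urls("詳しくはhttp://www.inpaku.go.jp/を見て。"))
```

Segments flagged as URLs could then be passed through unchanged while only the remaining text is examined for sentence boundaries.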

0.2.2 (2019-10-15)

  • Added the emoji_threshold parameter to tokenize()

  • Bugfix
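
This entry does not spell out what emoji_threshold controls. One plausible reading, which is an assumption here rather than documented behaviour, is that a run of at least that many consecutive emoji is treated as a sentence delimiter. The sketch below illustrates that idea with the standard library only; split_on_emoji_runs and its crude code-point ranges are hypothetical and do not reproduce sengiri's code.

```python
import re

# Crude approximation of "emoji": a few common Unicode blocks only.
_EMOJI = "[\U0001F300-\U0001FAFF\u2600-\u27BF]"


def split_on_emoji_runs(text, threshold):
    """Split `text` after every run of at least `threshold` emoji."""
    if threshold < 1:
        raise ValueError("threshold must be >= 1")
    run = re.compile("(?:%s){%d,}" % (_EMOJI, threshold))
    sentences, start = [], 0
    for match in run.finditer(text):
        # The emoji run stays attached to the sentence it ends.
        sentences.append(text[start:match.end()])
        start = match.end()
    if start < len(text):
        sentences.append(text[start:])
    return sentences


if __name__ == "__main__":
    print(split_on_emoji_runs("うーん🤔🤔🤔どうしよう", threshold=3))
    print(split_on_emoji_runs("うーん🤔🤔🤔どうしよう", threshold=4))
```

Under this reading, raising the threshold makes short emoji runs stay inside a sentence instead of ending it.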

0.2.1 (2019-10-12)

  • Now also handles text that contains emoticons and "www" (an expression of laughter)

  • Emoji are always treated as delimiters, regardless of MeCab’s POS tags

0.1.1 (2019-10-05)

  • First release
