sengiri
Yet another sentence-level tokenizer for Japanese text
DEPENDENCIES
MeCab
emoji
INSTALLATION
$ pip install sengiri
USAGE
import sengiri
print(sengiri.tokenize('うーん🤔🤔🤔どうしよう'))
#=>['うーん🤔🤔🤔', 'どうしよう']
print(sengiri.tokenize('モー娘。のコンサートに行った。'))
#=>['モー娘。のコンサートに行った。']
print(sengiri.tokenize('ありがとう^^ 助かります。'))
#=>['ありがとう^^', '助かります。']
print(sengiri.tokenize('顔文字テスト(*´ω`*)うまくいくかな?'))
#=>['顔文字テスト(*´ω`*)うまくいくかな?']
# I recommend using the NEologd dictionary.
print(sengiri.tokenize('顔文字テスト(*´ω`*)うまくいくかな?', mecab_args='-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd'))
#=>['顔文字テスト(*´ω`*)', 'うまくいくかな?']
print(sengiri.tokenize('子供が大変なことになった。'
                       '(後で聞いたのだが、脅されたらしい)'
                       '(脅迫はやめてほしいと言っているのに)'))
#=>['子供が大変なことになった。', '(後で聞いたのだが、脅されたらしい)', '(脅迫はやめてほしいと言っているのに)']
print(sengiri.tokenize('楽しかったw また遊ぼwww'))
#=>['楽しかったw', 'また遊ぼwww']
print(sengiri.tokenize('http://www.inpaku.go.jp/'))
#=>['http://www.inpaku.go.jp/']
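The NEologd example above hard-codes a dictionary path that will not exist on every machine. The following is a minimal sketch, not part of sengiri itself, of a helper that passes mecab_args only when the dictionary directory is present. The helper name neologd_kwargs and the constant NEOLOGD_DIR are hypothetical; the path is simply copied from the example above, and the sketch assumes the path contains no spaces.

```python
import os
from typing import Dict

# Path copied from the usage example above; adjust it for your system.
NEOLOGD_DIR = "/usr/local/lib/mecab/dic/mecab-ipadic-neologd"


def neologd_kwargs(dic_dir: str = NEOLOGD_DIR) -> Dict[str, str]:
    """Build keyword arguments for sengiri.tokenize (hypothetical helper).

    If dic_dir exists, select it with MeCab's -d option. Otherwise return
    an empty dict so that tokenize is called without mecab_args.
    """
    if os.path.isdir(dic_dir):
        return {"mecab_args": f"-d {dic_dir}"}
    return {}


# Intended use, with sengiri and MeCab installed:
#   sengiri.tokenize('顔文字テスト(*´ω`*)うまくいくかな?', **neologd_kwargs())
```

Returning a dict rather than a string avoids guessing the default value of mecab_args: when the dictionary is missing, the argument is simply not passed.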
CHANGES
0.2.3 (2025-11-28)
Enhanced URL handling stability using MeCab’s partial parsing and regex
Resolved a problem with the emoji dependency
Support Python 3.9-3.14
Dropped support for Python 3.4-3.8
0.2.2 (2019-10-15)
The emoji_threshold parameter is now available in tokenize()
Bugfix
0.2.1 (2019-10-12)
Also works well with text that includes emoticons and www (an expression of laughter)
Always treat emoji as delimiters, regardless of MeCab’s POS
0.1.1 (2019-10-05)
First release
File details
Details for the file sengiri-0.2.3.tar.gz.
File metadata
- Download URL: sengiri-0.2.3.tar.gz
- Upload date:
- Size: 6.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.16
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5003b3cd1e237890502ae957e78642d5a310755cc8771b568c1795e030d25ff2 |
| MD5 | 06f220129ae0f0851da386cb97b94ebf |
| BLAKE2b-256 | c054ee8c0a26530f605b002ab043a9df25a8283dc226ec3da7fdc09d7c5ab0e3 |
File details
Details for the file sengiri-0.2.3-py3-none-any.whl.
File metadata
- Download URL: sengiri-0.2.3-py3-none-any.whl
- Upload date:
- Size: 6.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.16
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e9db681da9bbaf82b60a42c4356a2ee70164a4665b8a6d68b5a688f9ab9dba7a |
| MD5 | d57fc4cd92e4359afa3d006deddf34f3 |
| BLAKE2b-256 | 876ed47f33f18c119f798680a2b0ab3ec37b045242f43c10c2560bdad7cf45c3 |