Yet another fork of a sentence-level tokenizer for Japanese text

sengiri

Yet another sentence-level tokenizer for Japanese text

DEPENDENCIES

  • MeCab

  • emoji

INSTALLATION

$ pip install sengiri

USAGE

import sengiri

print(sengiri.tokenize('うーん🤔🤔🤔どうしよう'))
#=>['うーん🤔🤔🤔', 'どうしよう']
print(sengiri.tokenize('モー娘。のコンサートに行った。'))
#=>['モー娘。のコンサートに行った。']
print(sengiri.tokenize('ありがとう^^ 助かります。'))
#=>['ありがとう^^', '助かります。']
print(sengiri.tokenize('顔文字テスト(*´ω`*)うまくいくかな?'))
#=>['顔文字テスト(*´ω`*)うまくいくかな?']
# I recommend using the NEologd dictionary.
print(sengiri.tokenize('顔文字テスト(*´ω`*)うまくいくかな?', mecab_args='-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd'))
#=>['顔文字テスト(*´ω`*)', 'うまくいくかな?']
print(sengiri.tokenize('子供が大変なことになった。'
                       '(後で聞いたのだが、脅されたらしい)'
                       '(脅迫はやめてほしいと言っているのに)'))
#=>['子供が大変なことになった。', '(後で聞いたのだが、脅されたらしい)', '(脅迫はやめてほしいと言っているのに)']
print(sengiri.tokenize('楽しかったw また遊ぼwww'))
#=>['楽しかったw', 'また遊ぼwww']
print(sengiri.tokenize('http://www.inpaku.go.jp/'))
#=>['http://www.inpaku.go.jp/']
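
To see why a morphological analyzer such as MeCab helps here, it can be useful to compare against a purely character-based splitter. The following is a minimal sketch using only the standard library; it is not sengiri's implementation, and the name naive_split and the delimiter and bracket sets are assumptions made for illustration.

```python
# Hypothetical, simplified character sets; not sengiri's actual tables.
DELIMITERS = "。!?!?"
OPEN_BRACKETS = "((「『"
CLOSE_BRACKETS = "))」』"


def naive_split(text):
    """Split after runs of delimiters, keeping bracketed spans whole."""
    sentences = []
    current = []
    depth = 0  # current bracket nesting level
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch in OPEN_BRACKETS:
            # A bracket opening at the top level starts a new sentence.
            if depth == 0 and current:
                sentences.append("".join(current))
                current = []
            depth += 1
            current.append(ch)
        elif ch in CLOSE_BRACKETS:
            depth = max(depth - 1, 0)
            current.append(ch)
            # Closing the outermost bracket ends the bracketed sentence.
            if depth == 0:
                sentences.append("".join(current))
                current = []
        elif ch in DELIMITERS and depth == 0:
            current.append(ch)
            # Absorb consecutive delimiters such as "!?" into one boundary.
            while i + 1 < n and text[i + 1] in DELIMITERS:
                i += 1
                current.append(text[i])
            sentences.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if current:
        sentences.append("".join(current))
    return [s.strip() for s in sentences if s.strip()]


print(naive_split('モー娘。のコンサートに行った。'))
#=>['モー娘。', 'のコンサートに行った。']
```

The naive version cuts the group name 'モー娘。' in half, whereas sengiri.tokenize keeps the sentence intact in the example above. It would also split an emoticon such as (*´ω`*) off as a bracketed sentence, which is one reason a dictionary-backed approach is preferable.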

CHANGES

0.2.2 (2019-10-15)

  • tokenize() now accepts an emoji_threshold parameter

  • Bugfix
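
The changelog does not spell out what emoji_threshold means or what its default is. One plausible reading, assumed here, is that a run of consecutive emoji acts as a sentence delimiter only when it is at least emoji_threshold characters long. The sketch below illustrates that reading with the standard library only; split_after_emoji_runs and the rough emoji character class are hypothetical and are not part of sengiri or the emoji package.

```python
import re

# Rough approximation of emoji code points (common pictograph blocks).
# An assumption for illustration, not the table the `emoji` package ships.
EMOJI_RUN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]+")


def split_after_emoji_runs(text, emoji_threshold):
    """Split after each emoji run of at least `emoji_threshold` characters."""
    sentences = []
    start = 0
    for match in EMOJI_RUN.finditer(text):
        # Shorter runs are left inside the surrounding sentence.
        if len(match.group()) >= emoji_threshold:
            sentences.append(text[start:match.end()])
            start = match.end()
    if start < len(text):
        sentences.append(text[start:])
    return sentences


print(split_after_emoji_runs('うーん🤔🤔🤔どうしよう', emoji_threshold=1))
#=>['うーん🤔🤔🤔', 'どうしよう']
print(split_after_emoji_runs('うーん🤔🤔🤔どうしよう', emoji_threshold=4))
#=>['うーん🤔🤔🤔どうしよう']
```

Under this assumed reading, a larger threshold makes the splitter more conservative: isolated emoji stay attached to the text that follows them.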

0.2.1 (2019-10-12)

  • Also works well with text that includes emoticons and "www" (an expression of laughter)

  • Always treat emoji as delimiters, regardless of MeCab’s POS tags

0.1.1 (2019-10-05)

  • First release
