
What’s this?

This is a simple Python wrapper for Japanese tokenizers (morphological analyzers).

This project aims to make calling tokenizers and splitting sentences into tokens as easy as possible.

It also provides a common interface across various tokenization tools, so it is easy to compare the output of different tokenizers.
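As a quick sketch of that common interface: the snippet below tokenizes one sentence with two wrappers and prints both token lists. MecabWrapper and its tokenize().convert_list_object() chain come from the Usage section below; JumanWrapper is assumed to follow the same pattern.

import JapaneseTokenizer

sentence = '彼は自然言語処理の研究をしている。'

# MecabWrapper and its tokenize().convert_list_object() chain are documented in
# the Usage section; JumanWrapper is assumed to expose the same interface.
wrappers = {
    'mecab (ipadic)': JapaneseTokenizer.MecabWrapper(dictType='ipadic'),
    'juman': JapaneseTokenizer.JumanWrapper(),  # assumption: same constructor pattern
}
for name, wrapper in wrappers.items():
    print(name, wrapper.tokenize(sentence).convert_list_object())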

This project is also available on GitHub.

If you find any bugs, please report them via GitHub issues. Pull requests are also welcome!

Requirements

  • Python 2.7

  • Python 3.x

    • tested with 3.5, 3.6, and 3.7

Features

  • a simple, common interface shared among various tokenizers

  • a simple, common interface for filtering tokens with stopwords or part-of-speech conditions

  • a simple interface for adding a user dictionary (MeCab only); see the sketch after this list
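A minimal sketch of the user-dictionary feature. The keyword argument name below (path_user_dict_csv) is a hypothetical placeholder, not the confirmed API; check MecabWrapper's actual signature before using it.

import JapaneseTokenizer

# Hypothetical illustration: path_user_dict_csv is an assumed argument name for
# passing a user dictionary in CSV form; check MecabWrapper's real signature.
mecab_wrapper = JapaneseTokenizer.MecabWrapper(
    dictType='ipadic',
    path_user_dict_csv='/path/to/user_dict.csv'
)
print(mecab_wrapper.tokenize('ユーザー辞書の単語を含む文').convert_list_object())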

Supported Tokenizers

MeCab

MeCab is an open-source tokenizer system for various languages (provided you have a dictionary for them).

See the English documentation for details.

Juman

Juman is a tokenizer system developed by the Kurohashi laboratory at Kyoto University, Japan.

Juman handles ambiguous writing styles in Japanese well, and copes with newly coined words thanks to its huge web-based dictionary.

Juman also gives you semantic information about words.

Juman++

Juman++ is a tokenizer system developed by the Kurohashi laboratory at Kyoto University, Japan.

Juman++ is the successor to Juman. It adopts an RNN model for tokenization.

Like Juman, it handles ambiguous writing styles in Japanese well, and copes with newly coined words thanks to its huge web-based dictionary.

It also gives you semantic information about words.

Note: the new Juman++ development version (2.x and later) is available on GitHub.

Kytea

Kytea is a tokenizer tool developed by Graham Neubig.

Kytea uses a different algorithm from those of MeCab and Juman.

Setting up

Tokenizers auto-install

make install

mecab-neologd dictionary auto-install

make install_neologd

Tokenizers manual-install

MeCab

See here to install the MeCab system.

Mecab Neologd dictionary

The mecab-neologd dictionary is an extension of the IPADIC dictionary, which is MeCab's standard dictionary.

With the mecab-neologd dictionary, you can parse newly coined words as single tokens.

Here, newly coined words are things such as movie actor names or company names.

See here to install the mecab-neologd dictionary.

Juman

wget -O juman7.0.1.tar.bz2 "http://nlp.ist.i.kyoto-u.ac.jp/DLcounter/lime.cgi?down=http://nlp.ist.i.kyoto-u.ac.jp/nl-resource/juman/juman-7.01.tar.bz2&name=juman-7.01.tar.bz2"
bzip2 -dc juman7.0.1.tar.bz2  | tar xvf -
cd juman-7.01
./configure
make
[sudo] make install
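To confirm the Juman installation from Python, you can call it through pyknp, the Juman binding this package depends on. A minimal check, assuming pyknp is installed and the juman binary is on your PATH:

# Minimal sanity check of the Juman installation via pyknp.
# Assumes pyknp is installed and the `juman` binary is on PATH.
from pyknp import Juman

juman = Juman()
result = juman.analysis('すもももももももものうち')
print([mrph.midasi for mrph in result.mrph_list()])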

Juman++

  • GCC version must be >= 5

wget http://lotus.kuee.kyoto-u.ac.jp/nl-resource/jumanpp/jumanpp-1.02.tar.xz
tar xJvf jumanpp-1.02.tar.xz
cd jumanpp-1.02/
./configure
make
[sudo] make install
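Similarly, you can sanity-check Juman++ via pyknp's Jumanpp class (assuming your pyknp version ships it and the jumanpp binary is on your PATH):

# Quick check of the Juman++ installation via pyknp.
# Assumes pyknp provides the Jumanpp class and `jumanpp` is on PATH.
from pyknp import Jumanpp

jumanpp = Jumanpp()
result = jumanpp.analysis('外国人参政権')
print([mrph.midasi for mrph in result.mrph_list()])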

Kytea

Install the Kytea system

wget http://www.phontron.com/kytea/download/kytea-0.4.7.tar.gz
tar -xvf kytea-0.4.7.tar.gz
cd kytea-0.4.7
./configure
make
make install

Kytea has a Python wrapper thanks to Michiaki Ariga. Install the Kytea Python wrapper:

pip install kytea
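A quick way to check that the wrapper works. The Mykytea module name and the getWS() call follow the wrapper's README as far as I recall, and the model path below is an assumption that depends on your installation.

# Quick check of the Kytea Python wrapper (module name: Mykytea).
# The model path is an assumption; point it at the model installed on your system.
import Mykytea

mk = Mykytea.Mykytea('-model /usr/local/share/kytea/model.bin')
for word in mk.getWS('これはテストです。'):
    print(word)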

Install

[sudo] python setup.py install

Note

During installation, you may see a warning message if it fails to install pyknp or kytea.

If you see these messages, try re-installing these packages manually.

Usage

Tokenization example (for Python 3.x; to see example code for Python 2.x, please see here)

import JapaneseTokenizer
input_sentence = '10日放送の「中居正広のミになる図書館」(テレビ朝日系)で、SMAPの中居正広が、篠原信一の過去の勘違いを明かす一幕があった。'
# ipadic is well-maintained dictionary #
mecab_wrapper = JapaneseTokenizer.MecabWrapper(dictType='ipadic')
print(mecab_wrapper.tokenize(input_sentence).convert_list_object())

# neologd is automatically-generated dictionary from huge web-corpus #
mecab_neologd_wrapper = JapaneseTokenizer.MecabWrapper(dictType='neologd')
print(mecab_neologd_wrapper.tokenize(input_sentence).convert_list_object())

Filtering example

import JapaneseTokenizer
# reuses mecab_wrapper and input_sentence from the tokenization example above
# with word filtering by stopword & part-of-speech condition #
print(mecab_wrapper.tokenize(input_sentence).filter(stopwords=['テレビ朝日'], pos_condition=[('名詞', '固有名詞')]).convert_list_object())
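The two filters should also be usable separately; for example, a sketch that keeps only the stopword filter (assuming stopwords and pos_condition are independent keyword arguments):

# A sketch assuming stopwords and pos_condition can be passed independently.
print(mecab_wrapper.tokenize(input_sentence).filter(stopwords=['テレビ朝日']).convert_list_object())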

Part-of-speech structure

MeCab, Juman, and Kytea each have different part-of-speech (POS) systems.

You can check the part-of-speech (POS) tables here.

Similar Package

natto-py

natto-py is a sophisticated package for tokenization. It supports the following features (a minimal usage sketch follows this list):

  • easy interface for tokenization

  • importing additional dictionary

  • partial parsing mode
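For comparison, a minimal natto-py call, following the pattern in natto-py's own README (not part of this package):

# Minimal natto-py example (follows natto-py's README; not part of this package).
from natto import MeCab

nm = MeCab()
print(nm.parse('ピンチの時には必ずヒーローが現れる。'))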

LICENSE

MIT license

For developers

You can build an environment with the dependencies needed to test this package.

Simply build the Docker image and run a Docker container.

Dev environment

The development environment is defined in test/docker-compose-dev.yml.

With this docker-compose file, you can call Python 2.7 or Python 3.7.

If you're using PyCharm Professional edition, you can set this docker-compose file as a remote interpreter.

To call Python 2.7, set the interpreter path to /opt/conda/envs/p27/bin/python2.7

To call Python 3.7, set the interpreter path to /opt/conda/envs/p37/bin/python3.7

Test environment

The following commands check everything from the package installation procedure through to running the package tests.

$ docker-compose build
$ docker-compose up

Download files


Source Distribution

JapaneseTokenizer-1.6.tar.gz (29.5 kB)


Built Distribution

JapaneseTokenizer-1.6-py3-none-any.whl (44.4 kB)


File details

Details for the file JapaneseTokenizer-1.6.tar.gz.

File metadata

  • Download URL: JapaneseTokenizer-1.6.tar.gz
  • Upload date:
  • Size: 29.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.5.0.1 requests/2.18.4 setuptools/36.5.0.post20170921 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.6.8

File hashes

Hashes for JapaneseTokenizer-1.6.tar.gz:

  • SHA256: c9b93fb9d355a10b4e6484ac25febcecee00e40e42e6ad38beb2258f9e8d7900
  • MD5: f5d47f7d1d6bd0381f4ff56f25a52dee
  • BLAKE2b-256: 134e758f36d3d7f51d9c10d07d46a265f3f2337062b1913c231223156b719d0f


File details

Details for the file JapaneseTokenizer-1.6-py3-none-any.whl.

File metadata

  • Download URL: JapaneseTokenizer-1.6-py3-none-any.whl
  • Upload date:
  • Size: 44.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.5.0.1 requests/2.18.4 setuptools/36.5.0.post20170921 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.6.8

File hashes

Hashes for JapaneseTokenizer-1.6-py3-none-any.whl:

  • SHA256: 72e54e004071d6e26e53ee6fbf384ac696260a4ccd98517b460c7322d600c2cb
  • MD5: 2023647d0aea042e51fac3be0450dee8
  • BLAKE2b-256: 2963b71a2fa2ba0ad681d0877bab89b56d162b0e07533b99c5f3477f9a3df7f0

