JpTokenPreprocessing

JpTokenPreprocessing is a Python library for token preprocessing.

JpTokenPreprocessing – Japanese Token Preprocessing

JpTokenPreprocessing is a Python library for token preprocessing. It filters out noise (e.g. tokens that are too short, number-only tokens, and symbol-only tokens) and normalizes tokens (alphabet case and Unicode normalization). These are common preprocessing steps in natural language processing (NLP).
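To make the normalization step concrete, here is a minimal standard-library sketch of Unicode and case normalization applied to a single token. The helper name `normalize_token` and its parameters are hypothetical and are not part of this library's API; the sketch only assumes that normalization relies on `unicodedata.normalize()`, as the METHODS section states.

```python
import unicodedata


def normalize_token(token, form="NFKC", case="lower"):
    """Hypothetical helper: Unicode-normalize a token, then normalize its case."""
    token = unicodedata.normalize(form, token)
    if case == "lower":
        return token.lower()
    if case == "upper":
        return token.upper()
    if case == "capitalize":
        return token.capitalize()
    return token  # any other value: leave the case untouched


print(normalize_token("ＰＹＴＨＯＮ"))  # python
print(normalize_token("Word"))  # word
```

NFKC is what folds the full-width letters into their ASCII equivalents here; the canonical forms NFC and NFD would leave them full-width.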

Usage

# coding: utf-8
# Python 3
from jp_token_preprocessing import JpTokenPreprocessing
import MeCab

# Yield Japanese word tokens using the MeCab morphological analyzer,
# keeping only nouns.
def tokenize(text):
    tagger = MeCab.Tagger()
    node = tagger.parseToNode(text)
    while node:
        if '名詞' in node.feature:
            surface = node.surface
            yield surface
        node = node.next

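if __name__ == '__main__':
    # Sample input (Japanese). In English: "This is a module for the
    # preprocessing that natural language processing requires. It assists
    # filtering and normalization after tokenizing with morphological analysis
    # or n-grams. It can filter one-character tokens, number-only tokens such
    # as '1234', and symbol-only tokens such as '!!', and it can normalize by
    # converting full-width 'ＰＹＴＨＯＮ' to half-width and lowercasing English
    # words such as 'Word'. Tokens that must always be excluded can also be
    # set as stopwords."
    text = """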
    これは自然言語処理に必須な前処理のためのモジュールです。
    形態素解析や、n-gramでトークン化した後のフィルタリング、正規化を補助します。
    一語だけのトークンや'1234'のような数字だけのトークン、'!!'のような記号だけのトークンのフィルタリング、
    全角文字'ＰＹＴＨＯＮ'の半角化、英単語'Word'の小文字化といった正規化も行えます。
    さらに必ず除外したいトークンをストップワードに設定することもできます。
    """
    stopwords = ['これ', 'こと']

    tokens = tokenize(text)
    """
    >>> print(list(tokens))

    ['', '', '言語', '処理', '必須', '前', '処理', 'ため', 'モジュール', '形態素',
    '解析', 'n', '-', 'gram', 'トー', 'クン', '化', '後', 'フィルタ', 'リング', '正規',
    '化', '補助', '一語', 'トーク', 'ン', "'", '1234', "'", 'よう', '数字','トー',
    'クン', "'!!'", 'よう', '記号', 'トー', 'クン', 'フィルタ', 'リング', '全角',
    '文字', "'", 'ＰＹＴＨＯＮ', "'", '半角', '化', '英単語', "'", 'Word',"'", '小文字',
    '化', '正規', '化', '除外', 'トーク', 'ン', 'ストップ', 'ワード', '設定', 'こと']
    """

    tokens = tokenize(text)
    preprocessor = JpTokenPreprocessing(number=False,
                                        symbol=False,
                                        case='lower',
                                        unicode='NFKC',
                                        min_len=2,
                                        stopwords=stopwords)
    tokens = preprocessor.preprocessing(tokens)
    # preprocessing() returns an iterator of tokens; list() is used below only to print the sample.
    """
    >>> print(list(tokens))
    ['言語', '処理', '必須', '処理', 'ため', 'モジュール', '形態素', '解析', 'gram',
    'トー', 'クン', 'フィルタ', 'リング', '正規', '補助', '一語', 'トーク', 'よう',
    '数字', 'トー', 'クン', 'よう', '記号', 'トー', 'クン', 'フィルタ', 'リング',
    '全角', '文字', 'python', '半角', '英単語', 'word', '小文字', '正規', '除外',
    'トーク', 'ストップ', 'ワード', '設定']
    """

Installation

pip install JpTokenPreprocessing

MeCab for Python 3

To install and use the MeCab module with Python 3, apply the patch below (as of 2014/09/07, MeCab 0.996).

https://code.google.com/p/mecab/issues/detail?id=7

METHODS

JpTokenPreprocessing(args)

number = bool (default: False)

Allow number-only tokens. If False, number-only tokens are filtered out.

symbol = bool (default: False)

Allow symbol-only tokens. If False, symbol-only tokens are filtered out.

case = 'lower' or 'upper' or 'capitalize'

Normalize alphabet case.

unicode = 'NFC' or 'NFKC' or 'NFD' or 'NFKD' (default: 'NFKC')

Normalize Unicode strings with unicodedata.normalize().

min_len = int (default: 2)

Filter out tokens shorter than min_len characters. With min_len = 2, tokens of 0 or 1 characters are filtered out.

stopwords = list (default: [])

Filter out any token contained in the stopword list.

JpTokenPreprocessing.preprocessing(iterable)

Return an iterator over the preprocessed tokens.
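As a rough illustration of how the options above might combine, here is a standard-library sketch of a lazy filter-and-normalize pipeline. It is not this library's implementation: the function name `preprocess_sketch`, the parameter name `unicode_form`, the order of the steps, and the tests for "number-only" (`str.isdigit`) and "symbol-only" (no alphanumeric character) are all assumptions made for the example.

```python
import unicodedata


def preprocess_sketch(tokens, number=False, symbol=False, case="lower",
                      unicode_form="NFKC", min_len=2, stopwords=()):
    """Hypothetical sketch: lazily yield tokens that survive the filters."""
    stopword_set = set(stopwords)
    for token in tokens:
        # Assumption: normalize first so the filters see the normalized form.
        token = unicodedata.normalize(unicode_form, token)
        if case in ("lower", "upper", "capitalize"):
            token = getattr(token, case)()
        if len(token) < min_len:
            continue  # too short
        if not number and token.isdigit():
            continue  # number-only token (assumed definition)
        if not symbol and not any(ch.isalnum() for ch in token):
            continue  # symbol-only token (assumed definition)
        if token in stopword_set:
            continue  # explicitly excluded
        yield token


sample = ["これ", "言語", "n", "-", "gram", "1234", "'!!'",
          "ＰＹＴＨＯＮ", "Word", "こと"]
result = list(preprocess_sketch(sample, stopwords=["これ", "こと"]))
print(result)  # ['言語', 'gram', 'python', 'word']
```

Writing the pipeline as a generator mirrors the documented behavior of preprocessing(), which returns an iterator rather than a list.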

Future work

Add hook points so that users can plug in their own preprocessing steps.

Authors

Kenta kase kesin1202000@gmail.com

License

MIT License

Release history

0.1.5a2 (pre-release, this version): Oct 12, 2015
0.1.5a (pre-release): Sep 14, 2014
0.1.4a (pre-release): Sep 14, 2014
0.1.3a (pre-release): Sep 14, 2014
0.1.2a (pre-release): Sep 14, 2014
0.1.1a (pre-release): Sep 14, 2014
0.1a (pre-release): Sep 14, 2014

Download files

Source Distribution

JpTokenPreprocessing-0.1.5a2.tar.gz (4.3 kB)

Uploaded Oct 12, 2015 Source

File details

Details for the file JpTokenPreprocessing-0.1.5a2.tar.gz.

File metadata

Download URL: JpTokenPreprocessing-0.1.5a2.tar.gz
Upload date: Oct 12, 2015
Size: 4.3 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for JpTokenPreprocessing-0.1.5a2.tar.gz
Algorithm	Hash digest
SHA256	`b3c4d4520cf676f2fb236aed302195c4761751569332ebd11f2ee3ab07766f85`
MD5	`715b704f4992e85162806a33636c88ca`
BLAKE2b-256	`b9cbc5c3d000513afaad2c8b2216da83cd64cc4a7b1801d10d9a80f6bc607559`
