言葉のしっぽ(tails-of-words)

An experimental implementation of notation-variant (表記ゆれ) detection for Japanese text

Overview

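  • Noun detection via morphological analysis (jumanpp)
  • Named-entity detection with knp
  • Report of occurrence counts for any part of speech
  • Report of edit distances for any part of speech
    • Levenshtein distance
    • Damerau-Levenshtein distance
    • Jaro-Winkler distance
    • The same distances computed on the readings
  • Notation-variant detection for any part of speech
  • Detection of auxiliary verbs written in kanji and of ra-nuki (ら抜き) forms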

Install

pip

pip install tails-of-words

jumanpp and knp must be installed separately.

e.g. brew install jumanpp

docker

docker pull srzzumix/tails-of-words

Usage

swing (notation-variant detection)

$ echo コンピュータとコンピューター | tails-of-words swing -
 1, 0.86, 0.86: コンピュータ(1) vs コンピューター(1) : 1.03
$ curl -fsSL https://srz-zumix.blogspot.com/2021/09/cedec.html | tails-of-words --stdin-type html swing --exclude-alphabet --exclude-ascii -t 1 -
 1, 0.75, 0.75: ブクログ(1) vs ブログ(6) : 1.29
 1, 0.67, 0.67: ホスト(1) vs リスト(3) : 1.00
 1, 0.67, 0.67: ホスト(1) vs テスト(3) : 1.00
$ docker run --rm -w /work -v $(pwd):/work srzzumix/tails-of-words swing /work/testdata -t 1
 1, 0.86, 0.86: コンピューター(1) vs コンピュータ(1) : 1.03
 0, 1.00, 0.67: Max(1) vs max(1) : 1.00
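
The similarity values in the sample output above (0.86, 0.75, 0.67) are consistent with a Levenshtein distance normalized by the length of the longer word. The following is a minimal sketch of that idea, not the tool's actual implementation: the function names `levenshtein`, `similarity`, and `similar_pairs` are hypothetical, the word list is made up from the samples, and the final score column (for example 1.03) is not reproduced here.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]: 1 - distance / length of the longer word."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def similar_pairs(words, min_similarity):
    """Return (distance, similarity, a, b) for distinct word pairs at or above the cutoff."""
    pairs = []
    for a, b in combinations(sorted(set(words)), 2):
        sim = similarity(a, b)
        if sim >= min_similarity:
            pairs.append((levenshtein(a, b), round(sim, 2), a, b))
    return pairs


# Hypothetical word list assembled from the sample output above.
words = ["コンピュータ", "コンピューター", "ブクログ", "ブログ", "テスト"]
for dist, sim, a, b in similar_pairs(words, min_similarity=0.7):
    print(f"{dist}, {sim:.2f}: {a} vs {b}")
```

Comparing every pair is quadratic in the number of distinct words, which is acceptable for a single document but may need pruning for larger corpora.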

Customizing the morphological analysis

use knp

$ echo 奈良先端科学技術大学院大学 | tails-of-words --knp count -
1 : 奈良
1 : 先端
1 : 科学
1 : 技術
1 : 大学院
1 : 大学
1 : 奈良先端科学技術大学院大学

use Jaro-Winkler

$ echo 時間と歌人 | tails-of-words distance --jw -
0.00, 0.89: 時間(1) vs 歌人(1) : 0.00

use Damerau-Levenshtein

$ echo 時間と歌人 | tails-of-words distance --damerau -
 2,  1, 0.00, 0.67: 時間(1) vs 歌人(1) : 0.00
$ echo 時間と歌人 | tails-of-words distance -
 2,  2, 0.00, 0.33: 時間(1) vs 歌人(1) : 0.00
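
The two runs above differ only in the second and fourth columns (2 and 0.33 versus 1 and 0.67). One reading that fits these numbers, inferred from the sample output rather than from documented behavior, is that the columns are surface distance, reading distance, surface similarity, and reading similarity. The sketch below assumes the readings じかん and かじん and uses the restricted (optimal string alignment) variant of Damerau-Levenshtein; the function name `edit_distance` is hypothetical, and the tool itself may rely on a library implementation.

```python
def edit_distance(a: str, b: str, transpositions: bool = False) -> int:
    """Levenshtein distance; with transpositions=True, the restricted
    Damerau-Levenshtein (optimal string alignment) distance, which also
    counts a swap of two adjacent characters as a single edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


def similarity(a: str, b: str, transpositions: bool = False) -> float:
    """1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b, transpositions) / longest


# Assumed readings of 時間 and 歌人 (not taken from the tool's output).
surface = ("時間", "歌人")
reading = ("じかん", "かじん")

for label, flag in (("levenshtein", False), ("damerau", True)):
    print(
        label,
        edit_distance(*surface, flag),
        edit_distance(*reading, flag),
        f"{similarity(*surface, flag):.2f}",
        f"{similarity(*reading, flag):.2f}",
    )
```

The readings differ by a swap of the first two kana, which is why allowing transpositions lowers the reading distance while the surface distance stays the same.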

typo (detection of auxiliary verbs written in kanji and of ra-nuki forms)

$ echo 5時に来て頂く予定です | tails-of-words typo -
1:2: に来て頂く: 補助動詞の漢字
$ echo あの人が来るとは考えれない | tails-of-words typo -
1:8: 考えれない: ら抜き言葉

Help

usage: tails-of-words [-h] [-v] [--dumpversion] [--log {DEBUG,INFO,WARN,ERROR,CRITICAL,debug,info,warn,error,critical}] [-c CONFIG]
                      [-f {csv,xml,html,plain}] [--h2t] [--knp]
                      {count,distance,show,swing,help} ...

positional arguments:
  {count,distance,show,swing,help}
    count               count words. see `count -h`
    distance            distance counted words. see `distance -h`
    show                show words. see `show -h`
    swing               show notation fluctuations. see `swing -h`
    typo                check typo. see `typo -h`
    help                show subcommand help. see `help -h`

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --dumpversion         show program's version number and exit
  --log {DEBUG,INFO,WARN,ERROR,CRITICAL,debug,info,warn,error,critical}
                        set log level
  -c CONFIG, --config CONFIG
                        config.yml file path
  -f {csv,xml,html,plain}, --stdin-type {csv,xml,html,plain}, --stdin-format {csv,xml,html,plain}
                        set stdin format type
  --h2t, --html2text    Convert input text with html2text
  --knp                 use knp.

$ tails-of-words help swing
usage: tails-of-words swing [-h] [-n NUM] [-t THRESHOLD] [--jw] [--damerau] [--no-alnum] [--no-ascii] [-o OUTPUT] [-c COLUMN] [-i HINSI] [-e EXCLUDE]
                            SOURCE [SOURCE ...]

show notation fluctuations

positional arguments:
  SOURCE                input files

optional arguments:
  -h, --help            show this help message and exit
  -n NUM, --num NUM     Display n items from the highest score. All if n is less than or equal to 0
  -t THRESHOLD, --threshold THRESHOLD
                        Display words whose score exceeds the threshold.
  --jw, --jaro-winkler  use jaro_winkler.
  --damerau, --damerau-levenshtein
                        use damerau_levenshtein.
  --no-alnum, --exclude-alphabet
                        exclude isalpha or isalnum string.
  --no-ascii, --exclude-ascii
                        exclude isascii string.
  -o OUTPUT, --output OUTPUT
                        output json file path.
  -c COLUMN, --column COLUMN
                        specific csv file column name.
  -i HINSI, --hinsi HINSI
                        set collect hinsi_id. default [6, 15]
  -e EXCLUDE, --exclude EXCLUDE
                        exclude files

Contributing

This repository is an experimental implementation of notation-variant detection. Ideas and PRs are welcome.
