Thai tokenizer, POS-tagger and sentence segmenter.

Project description

This package provides utilities for Thai sentence segmentation, word tokenization and POS tagging. Because of how sentence segmentation is performed, prior tokenization and POS tagging are required and are therefore also provided by this package.

Besides functions for sentence segmentation, tokenization, and tokenization with POS tagging of single sentence strings, there are also functions for processing large amounts of data in a streaming fashion. These are also accessible through the command-line script thai-segmenter, which accepts files or standard input/output. Options allow working with meta-headers or tab-separated data files.

The main functionality for sentence segmentation was extracted, reformatted and slightly rewritten from another project, Question Generation Thai.

LongLexTo is used as a state-of-the-art word/lexeme tokenizer. An implementation was packaged in the above project, but there are also (original?) versions on GitHub and on its homepage. To make it easier to use for bulk processing in Python, it has been rewritten from Java to pure Python.
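
As its name suggests, LongLexTo follows a dictionary-based longest-matching approach. The following is a minimal sketch of greedy longest matching in general, not the package's actual code; the function name `longest_match_tokenize` and the three-word toy dictionary are hypothetical and chosen only for illustration.

```python
def longest_match_tokenize(text, dictionary):
    """Greedy longest-match tokenization against a set of known words."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first and shrink until a dictionary hit.
        for end in range(min(len(text), pos + max_len), pos, -1):
            if text[pos:end] in dictionary:
                tokens.append(text[pos:end])
                pos = end
                break
        else:
            # No dictionary word starts here: emit one unknown character.
            tokens.append(text[pos])
            pos += 1
    return tokens


# Hypothetical toy dictionary; a real tokenizer ships a large lexicon.
toy_dictionary = {"ฉัน", "กิน", "ข้าว"}
print(longest_match_tokenize("ฉันกินข้าว", toy_dictionary))
# ['ฉัน', 'กิน', 'ข้าว']
```

A purely greedy matcher like this one cannot recover from an early wrong choice, which is one reason real tokenizers add handling for ambiguous and unknown words.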

For POS tagging, a Viterbi model with the annotated Orchid corpus is used (see the paper).
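
To illustrate what Viterbi decoding does for POS tagging, here is a small self-contained sketch over a toy hidden Markov model. It is not the package's implementation: the function name `viterbi`, the tag labels and all probabilities are invented for this example and are not taken from the Orchid corpus.

```python
import math


def viterbi(tokens, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence for tokens under a toy HMM."""
    if not tokens:
        return []

    def log(p):
        # Unseen events get a small floor probability instead of zero.
        return math.log(p if p > 0 else floor)

    # best[i][tag] = (log-probability of the best path ending in tag, previous tag)
    best = [{
        t: (log(start_p.get(t, 0)) + log(emit_p[t].get(tokens[0], 0)), None)
        for t in tags
    }]
    for token in tokens[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + log(trans_p[p].get(t, 0)), p) for p in tags
            )
            column[t] = (score + log(emit_p[t].get(token, 0)), prev)
        best.append(column)

    # Backtrack from the best final tag to the start of the sentence.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


# Invented toy model, for illustration only.
toy_tags = ["PRON", "VERB", "NOUN"]
toy_start = {"PRON": 0.6, "NOUN": 0.3, "VERB": 0.1}
toy_trans = {
    "PRON": {"VERB": 0.8, "NOUN": 0.1, "PRON": 0.1},
    "VERB": {"NOUN": 0.7, "PRON": 0.2, "VERB": 0.1},
    "NOUN": {"VERB": 0.5, "NOUN": 0.3, "PRON": 0.2},
}
toy_emit = {"PRON": {"ฉัน": 0.9}, "VERB": {"กิน": 0.8}, "NOUN": {"ข้าว": 0.7}}

print(viterbi(["ฉัน", "กิน", "ข้าว"], toy_tags, toy_start, toy_trans, toy_emit))
# ['PRON', 'VERB', 'NOUN']
```

Working in log space avoids numeric underflow on long sentences, and the floor value stands in for proper smoothing of unseen words.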

  • Free software: MIT license

Installation

pip install thai-segmenter

Documentation

To use the project:

sentence = """foo bar 1234"""

# [A] Sentence Segmentation
from thai_segmenter.tasks import sentence_segment
# or even easier:
from thai_segmenter import sentence_segment
sentences = sentence_segment(sentence)

# use a separate loop variable so the input string `sentence` is not overwritten
for segmented in sentences:
    print(str(segmented))

# [B] Lexeme Tokenization
from thai_segmenter import tokenize
tokens = tokenize(sentence)
for token in tokens:
    print(token, end=" ", flush=True)

# [C] POS Tagging
from thai_segmenter import tokenize_and_postag
sentence_info = tokenize_and_postag(sentence)
for token, pos in sentence_info.pos:
    print("{}|{}".format(token, pos), end=" ", flush=True)

See more possibilities in tasks.py or cli.py.

Streaming larger sequences can be achieved like this:

# Streaming
sentences = ["sent1\n", "sent2\n", "sent3\n"]  # or any iterable (like File)
from thai_segmenter import line_sentence_segmenter
sentences_segmented = line_sentence_segmenter(sentences)
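
Because an open file is itself an iterable of lines, the same pattern extends to whole files. The helper below is a sketch under assumptions: `segment_file` is a hypothetical name, and it assumes the segmenter passed in yields one item per output line that can be converted with `str()`, which is not spelled out here.

```python
def segment_file(in_path, out_path, segmenter):
    """Stream in_path through segmenter and write one result per line to out_path.

    `segmenter` is any callable that takes an iterable of lines and yields
    results lazily (assumption: each result is printable with str()).
    """
    count = 0
    with open(in_path, encoding="utf-8") as fin, \
            open(out_path, "w", encoding="utf-8") as fout:
        # The file object is consumed line by line, so nothing is loaded up front.
        for item in segmenter(fin):
            fout.write(str(item).rstrip("\n") + "\n")
            count += 1
    return count
```

With the package installed, a call might look like `segment_file("input.txt", "output.txt", line_sentence_segmenter)`; passing the segmenter as a parameter also makes it easy to swap in a stand-in function for testing.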

Commandline tool

This project also provides a nifty commandline tool thai-segmenter that does most of the work for you:

usage: thai-segmenter [-h] {clean,sentseg,tokenize,tokpos} ...

Thai Segmentation utilities.

optional arguments:
  -h, --help            show this help message and exit

Tasks:
  {clean,sentseg,tokenize,tokpos}
    clean               Clean input from non-thai and blank lines.
    sentseg             Sentence segmentize input lines.
    tokenize            Tokenize input lines.
    tokpos              Tokenize and POS-tag input lines.

You can run sentence segmentation like this:

thai-segmenter sentseg -i input.txt -o output.txt

or even pipe data:

cat input.txt | thai-segmenter sentseg > output.txt

Use -h/--help to get more information about possible control flow options.

You can run it somewhat interactively with:

thai-segmenter tokpos --stats

and standard input and output are used. Lines terminated with Enter are immediately processed and printed. Stop with the key combination Ctrl + D, and the --stats parameter will output some statistics.

WebApp

The project also provides a demo WebApp (using Flask and gevent) that can be installed with:

pip install -e .[webapp]

and then simply run (in the foreground):

thai-segmenter-webapp

Consider running it in a screen session.

# create the screen detached and then attach
screen -dmS thai-senseg-webapp
screen -r thai-senseg-webapp

# in the screen:
thai-segmenter-webapp

# and detach with keys [Ctrl]+[A], then [D]

Please note that it is only a demo webapp to test and visualize how the sentence segmenter works.

Development

To install the package for development:

git clone https://github.com/Querela/thai-segmenter.git
cd thai-segmenter/
pip install -e .[dev]

After changing the source, run auto code formatting with:

isort <file>.py
black <file>.py

And check it afterwards with:

flake8 <file>.py

The setup.py also contains the flake8 subcommand as well as an extended clean command.

Tests

To run all the tests, run:

tox

You can also optionally run pytest alone:

pytest

Or with:

python setup.py test

Note: to combine the coverage data from all the tox environments, run:

Windows

set PYTEST_ADDOPTS=--cov-append
tox

Other

PYTEST_ADDOPTS=--cov-append tox

Changelog

0.4.2 (2023-08-23)

  • Fix signature of tasks.tokenize_and_postag function

  • Update tox.ini to include newer python version, as well as older parameters and flags

  • Reformat and lint

0.4.1 (2019-04-08)

  • Fix tokenization / tokenization + POS tagging: return words instead of subwords

  • Add --escape-special and --subwords parameter to CLI script for tokenization. Allows tokenization to further tokenize unknown words (e. g. names) as well as escape special characters with angle bracket entities.

0.4.0 (2019-04-08)

  • Add demo webapp with sentence segmentation. (NOTE: Running both the webapp and (batch) sentence segmentation at the same time from the same installation is not recommended. It can have unexpected side effects.)

  • Some reformat of README.rst

0.3.3 (2019-04-07)

  • Fix duplicate names (class/method for sentence_segment), rename class to sentence_segmenter (.py).

0.3.2 (2019-04-07)

  • Add twine to extras dependencies.

  • Publish module on PyPI. (Only sdist, bdist_wheel can’t be built currently.)

  • Fix some TravisCI warnings.

0.3.1 (2019-04-07)

  • Add tasks to __init__.py for easier access.

0.3.0 (2019-04-06)

  • Refactor tasks into tasks.py to enable better import in case of embedding thai-segmenter into other projects.

  • Have it almost release ready. :-)

  • Add some more parameters to functions (optional header detection function)

  • Flesh out README.rst with examples and descriptions.

  • Add Changelog items.

0.2.1 / 0.2.2 (2019-04-05)

  • Many changes, bumpversion needs to run where .bumpversion.cfg is located else it silently fails …

  • Strip Typehints and add support for Python3.5 again.

  • Add CLI tasks for cleaning, sentseg, tokenize, pos-tagging.

  • Add various params, e. g. for selecting columns, skipping headers.

  • Fix many bugs for TravisCI (isort, flake8)

  • Use iterators / streaming approach for file input/output.

0.2.0 (2019-04-05)

  • Remove support for Python 2.7 and for Python 3.5 and lower because of type hints.

  • Added CLI skeleton.

  • Add really good setup.py. (with black, flake8)

0.1.0 (2019-04-05)

  • First release version as package.
