Thai tokenizer, POS-tagger and sentence segmenter.
This package provides utilities for Thai sentence segmentation, word tokenization and POS tagging. Because sentence segmentation works on tokenized and POS-tagged input, tokenization and POS tagging are required first and are therefore also provided by this package.
Besides functions for sentence segmentation, tokenization, and tokenization with POS tagging of single sentence strings, there are also functions for processing large amounts of data in a streaming fashion. They are also accessible through the commandline script thai-segmenter, which reads from and writes to files or standard input/output. Options allow working with meta-headers or tab-separated data files.
The main functionality for sentence segmentation was extracted, reformatted and slightly rewritten from another project, Question Generation Thai.
LongLexTo is used as a state-of-the-art word/lexeme tokenizer. An implementation was packaged in the above project, but there are also (original?) versions on GitHub and on its homepage. To make it easier to use for bulk processing in Python, it has been rewritten from Java to pure Python.
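To give an idea of what a dictionary-based tokenizer does, here is a minimal sketch of greedy longest matching, the general technique LongLexTo is usually described as being built on. It is not the code shipped in this package: the function name `longest_match_tokenize` and the tiny lexicon are hypothetical, and the sketch has no backtracking or special handling of unknown words.

```python
def longest_match_tokenize(text, lexicon):
    """Greedy longest-match tokenization (toy version, no backtracking)."""
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first and shrink until a lexicon entry matches.
        for end in range(min(len(text), pos + max_len), pos, -1):
            if text[pos:end] in lexicon:
                break
        else:
            # No entry matches: emit a single character as an unknown token.
            end = pos + 1
        tokens.append(text[pos:end])
        pos = end
    return tokens


# Hypothetical mini lexicon; a real tokenizer ships a full dictionary.
lexicon = {"ฉัน", "กิน", "ข้า", "ข้าว"}
print(longest_match_tokenize("ฉันกินข้าว", lexicon))
```

Because the longest candidate is tried first, "ข้าว" is preferred over its prefix "ข้า" even though both are in the lexicon.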
For POS tagging, a Viterbi model trained on the annotated Orchid corpus is used (see the paper).
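As an illustration of the idea behind Viterbi tagging, the sketch below decodes the most probable tag sequence under a small bigram hidden Markov model. Everything in it is an assumption made for the example: the function `viterbi`, the two tags, and all probabilities are invented and are unrelated to the Orchid tag set or to the model bundled with this package.

```python
import math


def viterbi(tokens, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for tokens under a bigram HMM."""
    if not tokens:
        return []

    def lp(p):
        # Log-probability with a floor so unseen events do not produce log(0).
        return math.log(max(p, floor))

    # best[i][tag] = (log-probability of the best path ending in tag, previous tag)
    best = [{
        t: (lp(start_p.get(t, 0)) + lp(emit_p[t].get(tokens[0], 0)), None)
        for t in tags
    }]
    for token in tokens[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + lp(trans_p[p].get(t, 0)), p) for p in tags
            )
            column[t] = (score + lp(emit_p[t].get(token, 0)), prev)
        best.append(column)

    # Backtrack from the best final tag to the start of the sentence.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Toy model with made-up probabilities (not derived from any corpus).
tags = ["NOUN", "VERB"]
start_p = {"NOUN": 0.8, "VERB": 0.2}
trans_p = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}
emit_p = {
    "NOUN": {"ฉัน": 0.5, "ข้าว": 0.5},
    "VERB": {"กิน": 1.0},
}
print(viterbi(["ฉัน", "กิน", "ข้าว"], tags, start_p, trans_p, emit_p))
```

Working in log space avoids numeric underflow on long sentences, which is the usual reason to add log-probabilities instead of multiplying probabilities.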
Free software: MIT license
Installation
pip install thai-segmenter
Documentation
To use the project:
sentence = """foo bar 1234"""
# [A] Sentence Segmentation
from thai_segmenter.tasks import sentence_segment
# or even easier:
from thai_segmenter import sentence_segment
sentences = sentence_segment(sentence)
# use a separate loop variable so that `sentence` keeps the input string
for sent in sentences:
    print(str(sent))
# [B] Lexeme Tokenization
from thai_segmenter import tokenize
tokens = tokenize(sentence)
for token in tokens:
    print(token, end=" ", flush=True)
# [C] POS Tagging
from thai_segmenter import tokenize_and_postag
sentence_info = tokenize_and_postag(sentence)
for token, pos in sentence_info.pos:
    print("{}|{}".format(token, pos), end=" ", flush=True)
See more possibilities in tasks.py or cli.py.
Streaming larger sequences can be achieved like this:
# Streaming
sentences = ["sent1\n", "sent2\n", "sent3\n"] # or any iterable (like File)
from thai_segmenter import line_sentence_segmenter
sentences_segmented = line_sentence_segmenter(sentences)
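Since any iterable of lines is accepted, an open file can be passed directly so that the input is never loaded into memory as a whole. The following is a minimal sketch of that pattern. The helper `stream_file` is hypothetical, and `whitespace_segmenter` is only a stand-in so the example runs without this package. It also assumes that each yielded item has a useful `str()` form.

```python
import os
import tempfile


def stream_file(in_path, out_path, segmenter):
    """Feed a file lazily to `segmenter` and write one result per line."""
    count = 0
    with open(in_path, encoding="utf-8") as fin, \
            open(out_path, "w", encoding="utf-8") as fout:
        # A file object is an iterable of lines, so nothing is read up front.
        for item in segmenter(fin):
            fout.write(str(item).rstrip("\n") + "\n")
            count += 1
    return count


def whitespace_segmenter(lines):
    """Stand-in segmenter (hypothetical): yields whitespace-separated chunks."""
    for line in lines:
        yield from line.split()


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.txt")
    dst = os.path.join(tmp, "output.txt")
    with open(src, "w", encoding="utf-8") as fh:
        fh.write("sent1 sent2\nsent3\n")
    print(stream_file(src, dst, whitespace_segmenter))
```

With the package installed, line_sentence_segmenter could be passed in place of the stand-in, provided its results can be written out this way.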
Commandline tool
This project also provides a nifty commandline tool thai-segmenter that does most of the work for you:
usage: thai-segmenter [-h] {clean,sentseg,tokenize,tokpos} ...
Thai Segmentation utilities.
optional arguments:
-h, --help show this help message and exit
Tasks:
{clean,sentseg,tokenize,tokpos}
clean Clean input from non-thai and blank lines.
sentseg Sentence segmentize input lines.
tokenize Tokenize input lines.
tokpos Tokenize and POS-tag input lines.
You can run sentence segmentation like this:
thai-segmenter sentseg -i input.txt -o output.txt
or even pipe data:
cat input.txt | thai-segmenter sentseg > output.txt
Use -h/--help to get more information about possible control flow options.
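If the commandline tool should be driven from a Python script, for example in a batch job, it can be called through the standard subprocess module. This is only a sketch: the helper names `sentseg_command` and `run_sentseg` are hypothetical, and only the sentseg task with the -i and -o options shown above is used.

```python
import shutil
import subprocess


def sentseg_command(input_path, output_path):
    """Build the argument list for `thai-segmenter sentseg -i ... -o ...`."""
    return ["thai-segmenter", "sentseg", "-i", str(input_path), "-o", str(output_path)]


def run_sentseg(input_path, output_path):
    """Run sentence segmentation via the CLI and raise if it fails."""
    if shutil.which("thai-segmenter") is None:
        raise RuntimeError("thai-segmenter is not on PATH; install the package first")
    # check=True turns a non-zero exit status into CalledProcessError.
    subprocess.run(sentseg_command(input_path, output_path), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.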
You can run it somewhat interactively with:
thai-segmenter tokpos --stats
and standard input and output are used. Lines terminated with Enter are immediately processed and printed. Stop with the key combination Ctrl + D, and the --stats parameter will then output some statistics.
WebApp
The project also provides a demo WebApp (using Flask and gevent) that can be installed with:
pip install -e .[webapp]
and then simply run (in the foreground):
thai-segmenter-webapp
Consider running it in a screen session.
# create the screen detached and then attach
screen -dmS thai-senseg-webapp
screen -r thai-senseg-webapp
# in the screen:
thai-segmenter-webapp
# and detach with keys [Ctrl]+[A], then [D]
Please note that it is only a demo webapp to test and visualize how the sentence segmenter works.
Development
To install the package for development:
git clone https://github.com/Querela/thai-segmenter.git
cd thai-segmenter/
pip install -e .[dev]
After changing the source, run auto code formatting with:
isort <file>.py
black <file>.py
And check it afterwards with:
flake8 <file>.py
The setup.py also contains the flake8 subcommand as well as an extended clean command.
Tests
To run all the tests, run:
tox
You can also optionally run pytest alone:
pytest
Or with:
python setup.py test
Note: to combine the coverage data from all the tox environments, run:
Windows:
set PYTEST_ADDOPTS=--cov-append
tox
Other:
PYTEST_ADDOPTS=--cov-append tox
Changelog
0.4.2 (2023-08-23)
Fix signature of tasks.tokenize_and_postag function
Update tox.ini to include newer python version, as well as older parameters and flags
Reformat and lint
0.4.1 (2019-04-08)
Fix tokenization / tokenization + POS tagging: return words instead of subwords
Add --escape-special and --subwords parameter to CLI script for tokenization. Allows tokenization to further tokenize unknown words (e. g. names) as well as escape special characters with angle bracket entities.
0.4.0 (2019-04-08)
Add demo webapp with sentence segmentation. (NOTE: Running both the webapp and (batch) sentence segmentation at the same time from the same installation is not recommended. It can have unexpected side-effects.)
Some reformat of README.rst
0.3.3 (2019-04-07)
Fix duplicate names (class/method for sentence_segment), rename class to sentence_segmenter (.py).
0.3.2 (2019-04-07)
Add twine to extras dependencies.
Publish module on PyPI. (Only sdist, bdist_wheel can’t be built currently.)
Fix some TravisCI warnings.
0.3.1 (2019-04-07)
Add tasks to __init__.py for easier access.
0.3.0 (2019-04-06)
Refactor tasks into tasks.py to enable better import in case of embedding thai-segmenter into other projects.
Have it almost release ready. :-)
Add some more parameters to functions (optional header detection function)
Flesh out README.rst with examples and descriptions.
Add Changelog items.
0.2.1 / 0.2.2 (2019-04-05)
Many changes, bumpversion needs to run where .bumpversion.cfg is located else it silently fails …
Strip Typehints and add support for Python3.5 again.
Add CLI tasks for cleaning, sentseg, tokenize, pos-tagging.
Add various params, e. g. for selecting columns, skipping headers.
Fix many bugs for TravisCI (isort, flake8)
Use iterators / streaming approach for file input/output.
0.2.0 (2019-04-05)
Remove support for Python 2.7 and for Python 3.5 and lower because of type hints.
Added CLI skeleton.
Add really good setup.py. (with black, flake8)
0.1.0 (2019-04-05)
First release version as package.