toki Python bindings
Project description
About
The toki library was originally developed by Tomasz Śniatowski and Adam Radziszewski at Wroclaw University of Science and Technology. Its main purpose was to provide a fast SRX-based tokenizer. This Python library is a set of Python bindings to the C++ toki that has been further developed at Alphamoon.
The original toki was released under GNU LGPL 3.0. The sources can be obtained from the git repositories:
git clone http://nlp.pwr.wroc.pl/corpus2.git # contains pwrutils library that is needed for building toki
git clone http://nlp.pwr.wroc.pl/toki.git
To build the code you will need CMake 2.8 or later. You will also need:
- ICU 4.2
- Boost 1.41 or later (tested with 1.41 and 1.42)
- Loki (libloki-dev)
- libxml++2.6 (for SRX support)
- libpwrutils from corpus2 repository (its build process is based on CMake, see the project site)
- To create a working tokeniser, instantiate Toki::LayerTokenizer. There are several constructors available; the simplest one assumes the default configuration (for Polish). To access a named configuration, use Toki::get_named_config(config_name) and pass the acquired object to the Toki::LayerTokenizer constructor.
- To create a working tokeniser with a sentence splitter, first instantiate a Toki::LayerTokenizer object and then wrap a Toki::SentenceSplitter around it. The sentencer object provides a convenient has_more/get_next_sentence interface. The default config loads sentence-splitting rules, so it is suitable for this purpose. NOTE: when using a custom config, check whether it contains working sentence-splitting rules. If it doesn't, Toki::SentenceSplitter will buffer all the input and finally produce one enormous sentence containing all the tokens.
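The has_more/get_next_sentence interface described above can be illustrated with a small pure-Python sketch. It does not call toki: ToySentenceSplitter and its set of boundary tokens are hypothetical stand-ins, meant only to show the iteration pattern and why a configuration without working sentence-splitting rules ends up producing a single sentence holding every token.

```python
class ToySentenceSplitter:
    """Hypothetical stand-in for the wrapper pattern; this is not the toki API."""

    def __init__(self, tokens, boundaries):
        self._tokens = iter(tokens)
        self._boundaries = set(boundaries)  # tokens that close a sentence
        self._buffer = None                 # next sentence, read ahead

    def _read_ahead(self):
        if self._buffer is not None:
            return
        sentence = []
        # Buffer tokens until a boundary token is seen or the input ends.
        for token in self._tokens:
            sentence.append(token)
            if token in self._boundaries:
                break
        self._buffer = sentence  # empty list once the input is exhausted

    def has_more(self):
        self._read_ahead()
        return bool(self._buffer)

    def get_next_sentence(self):
        self._read_ahead()
        sentence, self._buffer = self._buffer, None
        return sentence


def collect(splitter):
    """Drain a splitter using the has_more/get_next_sentence loop."""
    sentences = []
    while splitter.has_more():
        sentences.append(splitter.get_next_sentence())
    return sentences


tokens = ["To", "jest", "zdanie", ".", "To", "jest", "inne", "zdanie", "."]
with_rules = collect(ToySentenceSplitter(tokens, {"."}))
without_rules = collect(ToySentenceSplitter(tokens, set()))
print(len(with_rules))     # 2
print(len(without_rules))  # 1: everything was buffered into one sentence
```

With no boundary tokens defined, the toy splitter never finds a place to cut, which mirrors the behaviour the NOTE warns about for custom configs.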
Examples
For now, the Python interface is simple and only allows sentence splitting and tokenizing within a sequence, with Polish as the default language.
Sentence splitting:
import toki
tokenizer = toki.Toki()
tokenizer.get_all_sentences("To jest zdanie. To jest np. inne zdanie.")
Sentence tokenizing:
import toki
tokenizer = toki.Toki()
tokenizer.get_all_sentences_tokenized("To jest zdanie. To jest np. inne zdanie.")
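When several texts need processing, a single tokenizer instance can be reused across them. The following is a minimal sketch: split_texts is a hypothetical helper, and it assumes only that the tokenizer object exposes get_all_sentences as in the examples above. The return type of that method is not documented here, so the helper passes results through unchanged. A stub tokenizer is used so the snippet runs without pytoki installed.

```python
from typing import Any, Dict, Iterable


def split_texts(tokenizer: Any, texts: Iterable[str]) -> Dict[str, Any]:
    """Map each text to whatever tokenizer.get_all_sentences returns for it."""
    return {text: tokenizer.get_all_sentences(text) for text in texts}


class StubTokenizer:
    """Hypothetical stand-in for toki.Toki(); splits naively on '. '."""

    def get_all_sentences(self, text: str):
        return [part for part in text.split(". ") if part]


# With pytoki installed, pass toki.Toki() instead of StubTokenizer().
results = split_texts(StubTokenizer(), ["To jest zdanie. To jest inne zdanie."])
print(results)
```

The stub is deliberately naive and would mishandle abbreviations such as "np."; it exists only to make the helper runnable on its own.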
More languages will be supported in upcoming releases.
It is recommended to build the package from source if possible, to make use of AVX and other CPU instructions. The package was originally built with core2 optimization, so on any CPU older than that, or one that lacks MMX, SSE, SSE2, SSE3 or SSSE3, the package must be built from source.
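Whether a machine supports the instruction sets listed above can be checked before installing a prebuilt wheel. The sketch below is an assumption-laden helper for x86 Linux only: it reads the flags line of /proc/cpuinfo, where SSE3 is reported under the name "pni". The function name missing_cpu_flags and the REQUIRED_FLAGS set are hypothetical and not part of pytoki.

```python
# Flag names as they appear in /proc/cpuinfo on x86 Linux; SSE3 is listed as "pni".
REQUIRED_FLAGS = {"mmx", "sse", "sse2", "pni", "ssse3"}


def missing_cpu_flags(cpuinfo_text: str) -> set:
    """Return the required flags absent from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            present = set(line.partition(":")[2].split())
            return REQUIRED_FLAGS - present
    # No flags line found (e.g. non-x86 system): report everything as missing.
    return set(REQUIRED_FLAGS)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            missing = missing_cpu_flags(handle.read())
    except OSError:
        print("No /proc/cpuinfo here; check the CPU flags another way.")
    else:
        print("missing:", sorted(missing) if missing else "none")
```

If the script reports any missing flag, the prebuilt package is unlikely to run on that CPU and building from source is the safer route.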
Download files
Built Distributions
File details
Details for the file pytoki-0.1.2-cp38-cp38-manylinux2010_x86_64.whl.
File metadata
- Download URL: pytoki-0.1.2-cp38-cp38-manylinux2010_x86_64.whl
- Size: 14.1 MB
- Tags: CPython 3.8, manylinux: glibc 2.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.8.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 90bb2eb9d2018bc299ffcbdf0535d638e91986aa5f43b4142a16be6ada6f4d41
MD5 | cc8fd9c6c1cefec89fc54a9a5177e410
BLAKE2b-256 | e75ee9679da8842af0bd9d31c66f61841f0424655e63f39084865f0a366f3407
File details
Details for the file pytoki-0.1.2-cp37-cp37m-manylinux2010_x86_64.whl.
File metadata
- Download URL: pytoki-0.1.2-cp37-cp37m-manylinux2010_x86_64.whl
- Size: 14.2 MB
- Tags: CPython 3.7m, manylinux: glibc 2.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.8.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0f338e8f410eb872422255d8975e76aa0c72e3cc9c43e6c36a5ee7c8fbbb3032
MD5 | 32915d070f2f1d750ce0a00a891b85ce
BLAKE2b-256 | ce97e1e1d21afc3ac20fe85c2f624f8974f80827aa85aeaab55deeb919ab8f21
File details
Details for the file pytoki-0.1.2-cp36-cp36m-manylinux2010_x86_64.whl.
File metadata
- Download URL: pytoki-0.1.2-cp36-cp36m-manylinux2010_x86_64.whl
- Size: 14.2 MB
- Tags: CPython 3.6m, manylinux: glibc 2.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.8.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2ddb4c9197aeb19b2ceddd1e0bef9c7075f3c969acf4a20011933b9c8e827cd8
MD5 | 5eb9c0ce3f1774412cabe5ba825e01f4
BLAKE2b-256 | 1481e4528f5fb2fed3cf8d7e4755fbf1c25a9273b6edbd7411a8c74c9be6e17d
File details
Details for the file pytoki-0.1.2-cp35-cp35m-manylinux2010_x86_64.whl.
File metadata
- Download URL: pytoki-0.1.2-cp35-cp35m-manylinux2010_x86_64.whl
- Size: 14.2 MB
- Tags: CPython 3.5m, manylinux: glibc 2.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.8.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 5b0519a6836077048a0e0a9332259571c38fac290a20f9438b906b76370b5889
MD5 | d1365776a25093b0cff8b6cd051ca756
BLAKE2b-256 | d55e6b094a97a6a382fcfcfaf547d8c0ea3d54df09214621c9a30665626371f4