toki Python bindings

Project description

About

The Toki library was originally developed by Tomasz Śniatowski and Adam Radziszewski at Wrocław University of Science and Technology. Its main purpose was to provide a fast SRX-based tokenizer. This Python library is a set of Python bindings to the C++ toki code that has been further developed at Alphamoon.

The original toki has been released under the GNU LGPL 3.0. The sources may be obtained from the following git repositories:

git clone http://nlp.pwr.wroc.pl/corpus2.git # contains pwrutils library that is needed for building toki
git clone http://nlp.pwr.wroc.pl/toki.git

To build the code you will need CMake 2.8 or later. Besides that, you will need:

  • ICU 4.2
  • Boost 1.41 or later (tested with 1.41 and 1.42)
  • Loki (libloki-dev)
  • libxml++2.6 (for SRX support)
  • libpwrutils from corpus2 repository (its build process is based on CMake, see the project site)
Basic usage of the C++ library:

  1. To create a working tokenizer, instantiate Toki::LayerTokenizer. Several constructors are available; the simplest one assumes the default configuration (for Polish). To access a named configuration, use Toki::get_named_config(config_name) and pass the acquired object to the Toki::LayerTokenizer constructor.
  2. To create a working tokenizer with a sentence splitter, first instantiate a Toki::LayerTokenizer object and then wrap a Toki::SentenceSplitter around it. The sentencer object provides a convenient has_more/get_next_sentence interface. The default config loads sentence-splitting rules, so it is suitable for this purpose. NOTE: when using a custom config, check whether it contains working sentence-splitting rules. If it doesn't, Toki::SentenceSplitter will buffer all the input and finally produce one enormous sentence containing all the tokens.
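The has_more/get_next_sentence interface, and the buffering pitfall described above, can be sketched in plain Python. SimpleSentencer and its naive period-based rule below are illustrative stand-ins, not part of the toki API:

```python
# Illustrative stand-in for Toki::SentenceSplitter's interface (not the real API):
# sentences are pulled one at a time via has_more()/get_next_sentence().
class SimpleSentencer:
    def __init__(self, text, split_rules=True):
        if split_rules:
            # Naive rule: a period followed by a space ends a sentence.
            self.sentences = [s.strip() + "." for s in text.rstrip(".").split(". ")]
        else:
            # No working sentence-splitting rules: all input is buffered and
            # comes back as one enormous "sentence".
            self.sentences = [text]
        self.pos = 0

    def has_more(self):
        return self.pos < len(self.sentences)

    def get_next_sentence(self):
        sentence = self.sentences[self.pos]
        self.pos += 1
        return sentence

sentencer = SimpleSentencer("To jest zdanie. To jest inne zdanie.")
while sentencer.has_more():
    print(sentencer.get_next_sentence())
```

With split_rules=False the loop runs exactly once and yields the whole input, which is the failure mode the note above warns about.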

Examples

For now, the Python interface is minimal: it supports only sentence splitting and tokenization of a text, with Polish as the default language.

Sentence splitting:

import toki
tokenizer = toki.Toki()
# Split the text into sentences; "np." (Polish for "e.g.") is an
# abbreviation, so the period after it does not end a sentence.
tokenizer.get_all_sentences("To jest zdanie. To jest np. inne zdanie.")

Sentence tokenizing:

import toki
tokenizer = toki.Toki()
# Split into sentences and additionally tokenize each sentence.
tokenizer.get_all_sentences_tokenized("To jest zdanie. To jest np. inne zdanie.")

More languages will be supported in upcoming releases.

It is recommended to build the package from source if possible, to make use of AVX and other CPU instruction set extensions. The prebuilt packages were compiled with core2 optimizations, so on any CPU older than that, or one lacking MMX, SSE, SSE2, SSE3 and SSSE3, the package must be built from source.
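A small helper like the following (hypothetical, not part of the package) can check /proc/cpuinfo-style output for the flags listed above before deciding whether the prebuilt wheel is safe to use:

```python
def missing_flags(cpuinfo_text, required=("mmx", "sse", "sse2", "sse3", "ssse3")):
    """Return the required CPU flags absent from /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
    return [f for f in required if f not in flags]

# Example against a fabricated cpuinfo excerpt:
sample = "flags\t\t: fpu mmx sse sse2 sse3 ssse3 avx"
missing_flags(sample)  # → []
```

On Linux, pass the contents of /proc/cpuinfo; if the result is non-empty, build from source instead of installing the wheel.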

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

pytoki-0.1.2-cp38-cp38-manylinux2010_x86_64.whl (14.1 MB)

Uploaded: CPython 3.8, manylinux: glibc 2.12+, x86-64

pytoki-0.1.2-cp37-cp37m-manylinux2010_x86_64.whl (14.2 MB)

Uploaded: CPython 3.7m, manylinux: glibc 2.12+, x86-64

pytoki-0.1.2-cp36-cp36m-manylinux2010_x86_64.whl (14.2 MB)

Uploaded: CPython 3.6m, manylinux: glibc 2.12+, x86-64

pytoki-0.1.2-cp35-cp35m-manylinux2010_x86_64.whl (14.2 MB)

Uploaded: CPython 3.5m, manylinux: glibc 2.12+, x86-64
