toki Python bindings

Project description

About

The Toki library was originally developed by Tomasz Śniatowski and Adam Radziszewski at Wroclaw University of Science and Technology. Its main purpose is to provide a fast SRX-based tokenizer. This package is a set of Python bindings to the C++ toki library as further developed at Alphamoon.

The original toki was released under GNU LGPL 3.0. The sources may be obtained from the following git repositories:

git clone http://nlp.pwr.wroc.pl/corpus2.git # contains pwrutils library that is needed for building toki
git clone http://nlp.pwr.wroc.pl/toki.git

To build the code you will need CMake 2.8 or later. You will also need:

  • ICU 4.2
  • Boost 1.41 or later (tested with 1.41 and 1.42)
  • Loki (libloki-dev)
  • libxml++2.6 (for SRX support)
  • libpwrutils from corpus2 repository (its build process is based on CMake, see the project site)

  1. To create a working tokenizer, instantiate Toki::LayerTokenizer. Several constructors are available; the simplest one uses the default configuration (for Polish). To use a named configuration, call Toki::get_named_config(config_name) and pass the returned object to the Toki::LayerTokenizer constructor.
  2. To create a working tokenizer with a sentence splitter, first instantiate a Toki::LayerTokenizer object and then wrap a Toki::SentenceSplitter around it. The splitter object provides a convenient has_more / get_next_sentence interface. The default config loads sentence-splitting rules, so it is suitable for this purpose. NOTE: when using a custom config, check whether it contains working sentence-splitting rules. If it doesn't, Toki::SentenceSplitter will buffer all the input and finally produce one enormous sentence containing all the tokens.
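
The has_more / get_next_sentence pattern and the NOTE about missing rules can be illustrated with a small pure-Python toy. This is only a sketch: ToySentenceSplitter, collect_sentences and the sentence_enders parameter are hypothetical names invented for illustration, and real toki decides sentence boundaries from SRX rules in its config, not from a fixed set of end tokens.

```python
class ToySentenceSplitter:
    """Hypothetical stand-in for a has_more / get_next_sentence style splitter.

    It is NOT toki's implementation; it only mimics the shape of the interface.
    """

    def __init__(self, tokens, sentence_enders=(".", "!", "?")):
        self._tokens = list(tokens)
        self._enders = set(sentence_enders)  # toy replacement for splitting rules
        self._pos = 0

    def has_more(self):
        # True while there are tokens that have not been returned yet.
        return self._pos < len(self._tokens)

    def get_next_sentence(self):
        # Consume tokens until a sentence-ending token or the end of input.
        sentence = []
        while self._pos < len(self._tokens):
            token = self._tokens[self._pos]
            self._pos += 1
            sentence.append(token)
            if token in self._enders:
                break
        return sentence


def collect_sentences(splitter):
    """Drain a splitter using the has_more / get_next_sentence loop."""
    sentences = []
    while splitter.has_more():
        sentences.append(splitter.get_next_sentence())
    return sentences
```

Constructing the toy with an empty sentence_enders mimics a config without working sentence-splitting rules: the loop still terminates, but it yields a single sentence holding every token, which is the behaviour the NOTE warns about.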

Examples

For now, the Python interface is simple: it supports only sentence splitting and tokenization within sentences, with Polish as the default language.

Sentence splitting:

import toki
tokenizer = toki.Toki()
tokenizer.get_all_sentences("To jest zdanie. To jest np. inne zdanie.")
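
The example text contains the abbreviation "np.", which is the kind of input where splitting on every period goes wrong. The sketch below is a deliberately naive baseline for comparison; naive_split is a hypothetical helper, not part of toki, and toki's own output is not shown in this description, so none is asserted here.

```python
def naive_split(text):
    """Split on every period. Deliberately naive; this is not how toki works."""
    pieces = [piece.strip() for piece in text.split(".")]
    return [piece + "." for piece in pieces if piece]


text = "To jest zdanie. To jest np. inne zdanie."
for sentence in naive_split(text):
    print(sentence)
# prints:
# To jest zdanie.
# To jest np.
# inne zdanie.
```

The baseline cuts the second sentence in two at the abbreviation, which is the failure that rule-based (SRX) splitting is meant to avoid.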

Sentence tokenizing:

import toki
tokenizer = toki.Toki()
tokenizer.get_all_sentences_tokenized("To jest zdanie. To jest np. inne zdanie.")

More languages will be supported in upcoming releases.

It is recommended to build the package from source if possible, to make use of AVX and other CPU instruction sets. The prebuilt package was built with core2 optimizations, so on any CPU older than that, or one that lacks MMX, SSE, SSE2, SSE3 or SSSE3, the package must be built from source.
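
One way to check whether a machine has the instruction sets listed above is to inspect the flags line of /proc/cpuinfo. This is a rough sketch that assumes an x86 Linux system; the helper names are made up for illustration, and "pni" is the name the Linux kernel uses for SSE3 in that file.

```python
# Flags as they appear in /proc/cpuinfo on x86 Linux; "pni" means SSE3.
REQUIRED_FLAGS = {"mmx", "sse", "sse2", "pni", "ssse3"}


def missing_cpu_flags(cpuinfo_text, required=REQUIRED_FLAGS):
    """Return the sorted required flags absent from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return sorted(set(required) - set(value.split()))
    # No 'flags' line at all (non-x86 or non-Linux): report everything as missing.
    return sorted(required)


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo file, returning an empty string if it is unavailable."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except OSError:
        return ""
```

Calling missing_cpu_flags(read_cpuinfo()) and getting an empty list suggests the prebuilt wheel's baseline is met; a non-empty list suggests building from source.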

Download files

Source Distributions

No source distribution files available for this release.

Built Distributions

pytoki-0.1.2-cp38-cp38-manylinux2010_x86_64.whl (14.1 MB)

Uploaded CPython 3.8 manylinux: glibc 2.12+ x86-64

pytoki-0.1.2-cp37-cp37m-manylinux2010_x86_64.whl (14.2 MB)

Uploaded CPython 3.7m manylinux: glibc 2.12+ x86-64

pytoki-0.1.2-cp36-cp36m-manylinux2010_x86_64.whl (14.2 MB)

Uploaded CPython 3.6m manylinux: glibc 2.12+ x86-64

pytoki-0.1.2-cp35-cp35m-manylinux2010_x86_64.whl (14.2 MB)

Uploaded CPython 3.5m manylinux: glibc 2.12+ x86-64
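
Since wheels are listed only for CPython 3.5 to 3.8 on x86-64 Linux, a quick check can tell whether a given interpreter falls inside that set. This is a rough sketch: has_prebuilt_wheel is a hypothetical helper, and it ignores the glibc 2.12+ requirement, so a True result on a non-glibc Linux system would still be misleading.

```python
import platform
import sys

# Python versions for which a pytoki 0.1.2 wheel is listed above.
PREBUILT_PYTHONS = {(3, 5), (3, 6), (3, 7), (3, 8)}


def has_prebuilt_wheel(version_info=None, system=None, machine=None,
                       implementation=None):
    """Roughly check whether the listed wheels cover this interpreter.

    Arguments default to the running interpreter; pass explicit values to
    check another target. The glibc version is not checked.
    """
    if version_info is None:
        version_info = sys.version_info
    if system is None:
        system = platform.system()
    if machine is None:
        machine = platform.machine()
    if implementation is None:
        implementation = platform.python_implementation()
    return (
        implementation == "CPython"
        and tuple(version_info[:2]) in PREBUILT_PYTHONS
        and system == "Linux"
        and machine == "x86_64"
    )
```

If the check returns False, no listed wheel matches, and because no source distribution is published for this release, the package would have to be built from the upstream sources.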

File details

Details for the file pytoki-0.1.2-cp38-cp38-manylinux2010_x86_64.whl.

File metadata

  • Download URL: pytoki-0.1.2-cp38-cp38-manylinux2010_x86_64.whl
  • Size: 14.1 MB
  • Tags: CPython 3.8, manylinux: glibc 2.12+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.8.2

File hashes

Hashes for pytoki-0.1.2-cp38-cp38-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 90bb2eb9d2018bc299ffcbdf0535d638e91986aa5f43b4142a16be6ada6f4d41
MD5 cc8fd9c6c1cefec89fc54a9a5177e410
BLAKE2b-256 e75ee9679da8842af0bd9d31c66f61841f0424655e63f39084865f0a366f3407
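
A downloaded wheel can be checked against the SHA256 digest listed above with the standard hashlib module. This is a minimal sketch: the function names are illustrative, and the local file path is whatever location the wheel was saved to.

```python
import hashlib

# SHA256 digest published above for the cp38 wheel.
EXPECTED_CP38_SHA256 = (
    "90bb2eb9d2018bc299ffcbdf0535d638e91986aa5f43b4142a16be6ada6f4d41"
)


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns empty bytes at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, matches_published_hash("pytoki-0.1.2-cp38-cp38-manylinux2010_x86_64.whl", EXPECTED_CP38_SHA256) should return True for an intact download of that file. The same check works for the other wheels using their own digests.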

File details

Details for the file pytoki-0.1.2-cp37-cp37m-manylinux2010_x86_64.whl.

File metadata

  • Download URL: pytoki-0.1.2-cp37-cp37m-manylinux2010_x86_64.whl
  • Size: 14.2 MB
  • Tags: CPython 3.7m, manylinux: glibc 2.12+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.8.2

File hashes

Hashes for pytoki-0.1.2-cp37-cp37m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 0f338e8f410eb872422255d8975e76aa0c72e3cc9c43e6c36a5ee7c8fbbb3032
MD5 32915d070f2f1d750ce0a00a891b85ce
BLAKE2b-256 ce97e1e1d21afc3ac20fe85c2f624f8974f80827aa85aeaab55deeb919ab8f21

File details

Details for the file pytoki-0.1.2-cp36-cp36m-manylinux2010_x86_64.whl.

File metadata

  • Download URL: pytoki-0.1.2-cp36-cp36m-manylinux2010_x86_64.whl
  • Size: 14.2 MB
  • Tags: CPython 3.6m, manylinux: glibc 2.12+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.8.2

File hashes

Hashes for pytoki-0.1.2-cp36-cp36m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 2ddb4c9197aeb19b2ceddd1e0bef9c7075f3c969acf4a20011933b9c8e827cd8
MD5 5eb9c0ce3f1774412cabe5ba825e01f4
BLAKE2b-256 1481e4528f5fb2fed3cf8d7e4755fbf1c25a9273b6edbd7411a8c74c9be6e17d

File details

Details for the file pytoki-0.1.2-cp35-cp35m-manylinux2010_x86_64.whl.

File metadata

  • Download URL: pytoki-0.1.2-cp35-cp35m-manylinux2010_x86_64.whl
  • Size: 14.2 MB
  • Tags: CPython 3.5m, manylinux: glibc 2.12+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.8.2

File hashes

Hashes for pytoki-0.1.2-cp35-cp35m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 5b0519a6836077048a0e0a9332259571c38fac290a20f9438b906b76370b5889
MD5 d1365776a25093b0cff8b6cd051ca756
BLAKE2b-256 d55e6b094a97a6a382fcfcfaf547d8c0ea3d54df09214621c9a30665626371f4
