Skip to main content

tokenizer tool

Project description

Description

Most tokenizers are eithor too cumbersom (Neural Network based), or too simple. This simple rule based tokenizer is type, small, and sufficient good. Specially, it handles long strings very often parsed wrong by some simple tokenizers, deal url, email, long digits rather well.

Try with the following script:

easy_tokenizer -s input_text

or

easy_tokenizer -f input_file

CI Status

https://travis-ci.org/tilaboy/easy-tokenizer.svg?branch=master Documentation Status Updates

Requirements

Python 3.6+

Installation

pip install easy-tokenizer

Usage

  • easy-tokenizer:

    input:

    • string: input string to tokenize

    • filename: input text file to tokenize

    • output: output filename, optional. print out to STDOUT when not set

    output:

    • a sequence of space separated tokens

examples:

# string input
easy-tokenizer -s "this is   a simple test."

easy-tokenizer -f foo.txt
easy-tokenizer -f foo.txt -o bar.txt

output will be “this is a simple test .”

Development

To install package and its dependencies, run the following from project root directory:

python setup.py install

To work the code and develop the package, run the following from project root directory:

python setup.py develop

To run unit tests, execute the following from the project root directory:

python setup.py test

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

easy_tokenizer-0.0.10.tar.gz (21.3 kB view details)

Uploaded Source

Built Distribution

easy_tokenizer-0.0.10-py2.py3-none-any.whl (8.9 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file easy_tokenizer-0.0.10.tar.gz.

File metadata

  • Download URL: easy_tokenizer-0.0.10.tar.gz
  • Upload date:
  • Size: 21.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/38.4.0 requests-toolbelt/0.9.1 tqdm/4.23.4 CPython/3.6.4

File hashes

Hashes for easy_tokenizer-0.0.10.tar.gz
Algorithm Hash digest
SHA256 d2da094a2e61637ae4db1d6fe5ae85ad6596e056a7c59f2d094d2ab31d937c63
MD5 ef26ccae9b106844186661a2218ad267
BLAKE2b-256 2c946712dd75e5ace020714c98e6a17303fa1b5d4588e1f252deb88cde9bcf1c

See more details on using hashes here.

File details

Details for the file easy_tokenizer-0.0.10-py2.py3-none-any.whl.

File metadata

  • Download URL: easy_tokenizer-0.0.10-py2.py3-none-any.whl
  • Upload date:
  • Size: 8.9 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/38.4.0 requests-toolbelt/0.9.1 tqdm/4.23.4 CPython/3.6.4

File hashes

Hashes for easy_tokenizer-0.0.10-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 29577595625fbbba7325a28dbb8d3f25510e01abbf38fd81b5e50c4666e02f52
MD5 93ad5a67caefef773321cb5c8c2cdb76
BLAKE2b-256 111634cfdbbe64c1e7649d4f57eb009a3e7b558f8ff0271417c16f53baea14fd

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page