Skip to main content

tokenizer tool

Project description

Easy-Tokenizer

Description

Most tokenizers are eithor too cumbersom (Neural Network based), or too simple. This simple rule based tokenizer is type, small, and sufficient good. Specially, it handles long strings very often parsed wrong by some simple tokenizers, deal url, email, long digits rather well.

Try with the following script:

easy_tokenizer -s input_text

or

easy_tokenizer -f input_file

CI Status

https://travis-ci.org/tilaboy/easy-tokenizer.svg?branch=master Documentation Status Updates

Requirements

Python 3.6+

Installation

pip install easy-tokenizer

Usage

  • easy-tokenizer:

    input:

    • string: input string to tokenize

    • filename: input text file to tokenize

    • output: output filename, optional. print out to STDOUT when not set

    output:

    • a sequence of space separated tokens

examples:

# string input
easy-tokenizer -s "this is   a simple test."

easy-tokenizer -f foo.txt
easy-tokenizer -f foo.txt -o bar.txt

output will be “this is a simple test .”

Development

To install package and its dependencies, run the following from project root directory:

python setup.py install

To work the code and develop the package, run the following from project root directory:

python setup.py develop

To run unit tests, execute the following from the project root directory:

python setup.py test

0.0.9 (2020-01-16)

  • [BUG] fixed infinite loop for long url

0.0.8 (2019-11-14)

  • [New] added a function module for char/string normalization

0.0.7 (2019-11-14)

  • [Bugfix] update the url patter to fix the regexp loop for long url string

0.0.5 (2019-10-23)

  • [Bugfix] encryption and doc generation

0.0.3 (2019-10-23)

  • Test the CI/CD and auto documentation generation

0.0.2 (2019-10-23)

  • support script to output result to a file, add documentation

0.0.1 (2019-10-22)

  • Add the first version of the tokenizer

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

easy_tokenizer-0.0.9.tar.gz (14.8 kB view details)

Uploaded Source

Built Distribution

easy_tokenizer-0.0.9-py2.py3-none-any.whl (9.3 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file easy_tokenizer-0.0.9.tar.gz.

File metadata

  • Download URL: easy_tokenizer-0.0.9.tar.gz
  • Upload date:
  • Size: 14.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.0.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.6.7

File hashes

Hashes for easy_tokenizer-0.0.9.tar.gz
Algorithm Hash digest
SHA256 25c2917bc84f67b75916ac510df1c980930fdcb765c9f5d91a0876eb236cf847
MD5 a6c6e10d1128447c3a870eed0fa9c984
BLAKE2b-256 cedaf5ccccae14ee2395c676ca0d4d6574986deb0b5bcbeba2205445e0dd5d87

See more details on using hashes here.

File details

Details for the file easy_tokenizer-0.0.9-py2.py3-none-any.whl.

File metadata

  • Download URL: easy_tokenizer-0.0.9-py2.py3-none-any.whl
  • Upload date:
  • Size: 9.3 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.0.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.6.7

File hashes

Hashes for easy_tokenizer-0.0.9-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 2afcc03a8711cb0af3a733f88754784ebb26f33ca79369f2a06c2beaf08f81b2
MD5 ed19596947e8677ed49a1f0e2fe51cad
BLAKE2b-256 050a6d3f18eeb85e70b12b98e2d8a87140e06406e95f2f7557cbf53d61fa91d7

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page