tokenizer tool
Easy-Tokenizer
Description
Most tokenizers are either too cumbersome (neural network based) or too simple. This simple rule-based tokenizer is small and sufficiently good. Specifically, it handles long strings that simple tokenizers very often parse incorrectly, and it deals with URLs, emails, and long digit sequences rather well.
Try with the following script:
easy_tokenizer -s input_text
or
easy_tokenizer -f input_file
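To illustrate what "rule based" means here, the following is a minimal sketch of a regular-expression tokenizer that keeps URLs, email addresses, and long digit sequences as single tokens. It is not the code shipped in this package: the names TOKEN_PATTERN and sketch_tokenize and every pattern are assumptions made up for this example.

```python
import re

# Hypothetical rules, ordered from most to least specific.
# These are NOT the patterns used by easy-tokenizer itself.
TOKEN_PATTERN = re.compile(
    r"""
    https?://[^\s<>]*[^\s<>.,;:!?)]    # URL, not ending in punctuation
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+       # email address
  | \d+(?:[.,]\d+)*                    # digits with , or . separators
  | \w+                                # plain word
  | [^\w\s]                            # any single punctuation mark
    """,
    re.VERBOSE,
)


def sketch_tokenize(text):
    """Return the tokens of `text` joined by single spaces."""
    return " ".join(TOKEN_PATTERN.findall(text))


if __name__ == "__main__":
    print(sketch_tokenize("this is a simple test."))
    print(sketch_tokenize("Mail bob.smith@example.org or see https://example.com/a?b=1, ref 1,234,567.89."))
```

Putting the specific rules first is what prevents a URL or an email address from being split at every dot or slash by the generic word and punctuation rules.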
Requirements
Python 3.6+
Installation
pip install easy-tokenizer
Usage
easy-tokenizer:
input:
string: input string to tokenize (-s)
filename: input text file to tokenize (-f)
output: output filename, optional (-o); prints to STDOUT when not set
output:
a sequence of space separated tokens
examples:
# string input
easy-tokenizer -s "this is a simple test."
# file input
easy-tokenizer -f foo.txt
# file input, result written to an output file
easy-tokenizer -f foo.txt -o bar.txt
For the string input, the output will be “this is a simple test .”
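The command line tool can also be driven from Python. The sketch below wraps it with the standard subprocess module and uses only the -s, -f, and -o options shown above. The helper names build_command and run_tokenizer are hypothetical, and the executable name is a parameter because this page shows both easy-tokenizer and easy_tokenizer; no Python API of the package itself is assumed.

```python
import subprocess


def build_command(text=None, path=None, output=None, executable="easy-tokenizer"):
    """Build the argument list for the command line tool (hypothetical helper)."""
    # Exactly one input source is allowed: a string (-s) or a file (-f).
    if (text is None) == (path is None):
        raise ValueError("pass exactly one of text or path")
    command = [executable]
    command += ["-s", text] if text is not None else ["-f", path]
    if output is not None:
        command += ["-o", output]
    return command


def run_tokenizer(**kwargs):
    """Run the tool and return whatever it prints to STDOUT."""
    result = subprocess.run(
        build_command(**kwargs),
        stdout=subprocess.PIPE,
        universal_newlines=True,  # decode output as text; works on Python 3.6
        check=True,               # raise CalledProcessError on a non-zero exit
    )
    return result.stdout


if __name__ == "__main__":
    # Requires the package to be installed so that the executable is on PATH.
    print(run_tokenizer(text="this is a simple test."))
```

Passing the arguments as a list rather than a shell string means the input text needs no extra quoting or escaping.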
Development
To install package and its dependencies, run the following from project root directory:
python setup.py install
To work on the code and develop the package, run the following from the project root directory:
python setup.py develop
To run unit tests, execute the following from the project root directory:
python setup.py test
0.0.9 (2020-01-16)
[BUG] fixed infinite loop for long url
0.0.8 (2019-11-14)
[New] added a function module for char/string normalization
0.0.7 (2019-11-14)
[Bugfix] updated the URL pattern to fix the regexp loop for long URL strings
0.0.5 (2019-10-23)
[Bugfix] encryption and doc generation
0.0.3 (2019-10-23)
Test the CI/CD and auto documentation generation
0.0.2 (2019-10-23)
support script to output result to a file, add documentation
0.0.1 (2019-10-22)
Add the first version of the tokenizer