Simple tokenizers: n-gram and char-gram splitting, whitespace splitting, splitting on a configurable regular expression (REGEX), or detection of patterns inside contexts. Based on the ExtractionString object from the extractionstring package.
Project description
Tokenization for language processing
This package contains some basic tools that allow cutting a string into sub-parts (cf. Wikipedia), called Tokens.
The iamtokenizing classes allow basic tokenization of text, such as:
- word splitting and n-gram splitting (using the `NGrams` class),
- char-gram splitting of arbitrary size (using the `CharGrams` class).
`NGrams` also accepts any regular expression (REGEX) to match the patterns that serve as splitting strings. The `RegexDetector` class allows extracting the REGEX matches themselves as tokens. In addition, `ContextDetector` splits text on one REGEX and detects another REGEX inside the resulting splits, keeping some organisation (called the context) of the text between the two splitting and detection scales.
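The exact API is documented at the links in the next section; the following is only an illustrative sketch, assuming each tokenizer is constructed from a string and exposes a `tokenize()` method. The import path, the method name and its parameters are assumptions, not taken from this description.

```python
# Illustrative sketch only: the import path, the tokenize() method and its
# parameters are assumptions, not the documented API.
from iamtokenizing import NGrams, CharGrams

text = "Tokenization for language processing"

# Hypothetical word / n-gram splitting on whitespace.
words = NGrams(text).tokenize()

# Hypothetical char-gram splitting of arbitrary size (here 3 characters).
trigrams = CharGrams(text).tokenize(size=3)
```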
Installation
- The documentation is available at https://nlp.frama.io/iamtokenizing/
- The PyPI package is available at https://pypi.org/project/iamtokenizing/
- The official repository is at https://framagit.org/nlp/iamtokenizing
From the Python Package Index (PyPI)
Running
pip install iamtokenizing
is sufficient.
From the repository
The official repository is at https://framagit.org/nlp/iamtokenizing
Once the repository has been downloaded (or cloned), one can install this package using pip:
git clone https://framagit.org/nlp/iamtokenizing.git
cd iamtokenizing/
pip install .
Once installed, one can run the tests using
cd tests/
python3 -m unittest -v
(the -v flag enables verbose output and is optional).
Basic examples
Basic examples can be found in the documentation.
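Pending the documentation, here is a hedged sketch of the pattern-based detectors described above; the method names and keyword arguments (`tokenize`, `regex`, `context`, `detection`) are assumptions and may differ from the real API.

```python
# Illustrative sketch only: method names and keyword arguments are assumptions.
from iamtokenizing import RegexDetector, ContextDetector

text = "sec1: the cat. sec2: the dog."

# Hypothetical: extract every match of a REGEX pattern as a token.
sections = RegexDetector(text).tokenize(regex=r"sec\d")

# Hypothetical: split on one REGEX (the section markers) and detect another
# REGEX (the words) inside each split, keeping the section as context.
contexts = ContextDetector(text).tokenize(context=r"sec\d:", detection=r"\w+")
```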
Versions
- Versions before 0.4 only provide the `Token` and `Tokens` classes. They have since been split into three classes, named `Span`, `Token` and `Tokens`. Importantly, the methods `Token.append` and `Token.remove` no longer exist in later versions; they have been replaced by `Token.append_range`, `Token.append_ranges`, `Token.remove_range` and `Token.remove_ranges`.
- Version 0.4 adds the `Span` class alongside `Token` and `Tokens`. `Span` handles the sub-part splitting of a given string, whereas `Token` and `Tokens` now consume `Span` objects and handle the attributes of the `Token`.
- From version 0.5, the basic tools `Span`, `Token` and `Tokens` have been split out of the `iamtokenizing` package (see https://pypi.org/project/iamtokenizing/). Only the advanced tokenizers remain in `iamtokenizing`, which depends on the `tokenspan` package. The objects `Span`, `Token` and `Tokens` can be imported as before from the newly deployed `tokenspan` package, available on https://pypi.org/project/tokenspan/ (see the import sketch after this list).
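Concretely, code that used the basic classes should now import them from `tokenspan`; a minimal sketch, assuming the old classes were previously imported directly from `iamtokenizing`:

```python
# Before version 0.5 (assumed import path):
# from iamtokenizing import Span, Token, Tokens

# From version 0.5 onward, the basic classes live in the tokenspan package:
from tokenspan import Span, Token, Tokens
```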
About us
Package developed for Natural Language Processing at IAM: Unité d'Informatique et d'Archivistique Médicale, Service d'Informatique Médicale, Pôle de Santé Publique, Centre Hospitalo-Universitaire (CHU) de Bordeaux, France.
You are kindly encouraged to report any issues, and to propose improvements and/or suggestions to the authors, via issues or merge requests.
Last version: August 6, 2021
Download files
Source Distribution
File details
Details for the file iamtokenizing-0.7.0.tar.gz.
File metadata
- Download URL: iamtokenizing-0.7.0.tar.gz
- Upload date:
- Size: 24.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `b4f5f12ebbb640cc6d9f15d03048528249b7fd99a50d755d3ace665f9a32cebd` |
| MD5 | `27eec3d19230792d37d5aa4ee6001459` |
| BLAKE2b-256 | `cf030b7f8c76479ad9958a35d6f17067cf42c042362c3c34c3ff960e6c2e7844` |