Basic tools to tokenize a string (i.e. to construct atomic entities/sub-strings from it) for Natural Language Processing (NLP). Also useful for annotation, tree parsing, entity linking, ... (in fact, anything that links a string or its sub-parts to another object). Key concepts are interoperability with other libraries, and the freedom to define many concepts on top of a string.
Project description
Tokenization for language processing
This package contains some generic configurable tools for cutting a string into sub-parts (cf. Wikipedia), called Token, and for grouping them into sequences called Tokens. A Token is a sub-string of a parent string (say, the initial complete text), with associated ranges of non-overlapping characters; the number of associated ranges is arbitrary. A Tokens is a collection of Token objects. These two classes make it possible to attach a collection of attributes to any Token in a versatile way, and to pass these attributes from one object to the next while cutting a Token into sub-parts (collected as Tokens) and eventually re-merging them into larger Token objects.
The Token and Tokens classes allow basic tokenization of text, such as word splitting, n-gram splitting, and char-gram splitting of arbitrary size. In addition, several non-overlapping sub-strings can be gathered into a given Token, and arbitrary attributes can be attached to these parts. One can compare two different Token objects in terms of their attributes and/or ranges. One can also apply basic mathematical and logical operations to them (+, -, *, /), corresponding to the union, difference, intersection and symmetric difference implemented by Python sets; here the sets are the ranges of positions in the parent string.
Installation
From Python Package Index (PIP)
Simply run
pip install tokenspan
From the repository
The official repository is at https://framagit.org/nlp/tokenspan. To install the package from the repository, run the following commands
git clone https://framagit.org/nlp/tokenspan.git
cd tokenspan/
pip install .
Once installed, one can run some tests using
cd tests/
python3 -m unittest -v
(the verbosity flag -v is optional).
Philosophy of this library
In tokenspan, one thinks of a string as a collection of integers: the positions of its characters. For instance
'Simple string for demonstration and for illustration.' # the parent string
'01234567891123456789212345678931234567894123456789512' # the positions
' string for illustration ' # the Span span1
' 789112 678 412345678951 ' # the ranges
'Simple ' # the Span span2
'012345 ' # the ranges
Defining the Span 'string for illustration' consists in selecting the positions [range(7,13),range(36,39),range(40,52)] from the parent string, and the Span 'Simple' is defined by the positions [range(0,6),]. Underneath, each of these Span objects keeps the principal methods associated with strings, e.g. lower(), upper(), islower(), ... So a Span is primarily a list of ranges on top of a string, which still behaves like a string.
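To make this correspondence concrete, the text of span1 can be recovered from its ranges with the standard library alone (assuming, as the later examples suggest, that non-contiguous parts are joined by a single space):

```python
string = 'Simple string for demonstration and for illustration.'
span1_ranges = [range(7, 13), range(36, 39), range(40, 52)]

# Slice each range out of the parent string and join the parts
text = ' '.join(string[r.start:r.stop] for r in span1_ranges)
print(text)  # -> 'string for illustration'
```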
In addition, one can see the above ranges as sets of positions. It is then quite easy to perform basic operations on Span objects; for instance, the addition of two Span objects
str(span1 + span2) = 'Simple string for illustration'
is interpreted as the union of their respective sets of positions.
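A minimal sketch of this interpretation with plain Python: take the union of the position sets of span1 and span2, then rebuild the sorted ranges (the helper name below is ours, not part of the package):

```python
def positions_to_ranges(positions):
    """Convert a set of character positions back into a sorted list of ranges."""
    ranges = []
    for p in sorted(positions):
        if ranges and p == ranges[-1].stop:
            ranges[-1] = range(ranges[-1].start, p + 1)  # extend the last range
        else:
            ranges.append(range(p, p + 1))               # start a new range
    return ranges

string = 'Simple string for demonstration and for illustration.'
span1 = [range(7, 13), range(36, 39), range(40, 52)]  # 'string for illustration'
span2 = [range(0, 6)]                                  # 'Simple'

merged = positions_to_ranges(set().union(*span1, *span2))
print(' '.join(string[r.start:r.stop] for r in merged))
# -> 'Simple string for illustration'
```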
In addition to these logical operations, there are a few utilities, like the possibility to split or slice a Span into Span objects, as long as they are all related to the same parent string.
Basic example
Below we give a simple example of usage of the Token and Tokens classes.
import re
from tokenspan import Token
string = 'Simple string for demonstration and for illustration.'
initial_token = Token(string)
# char-gram generation
chargrams = initial_token.slice(0,len(initial_token),3)
str(chargrams[2])
# return 'mpl'
# each char-gram conserves a memory of the initial string
chargrams[2].string
# return 'Simple string for demonstration and for illustration.'
cuts = [range(r.start(),r.end()) for r in re.finditer(r'\w+',string)]
tokens = initial_token.split(cuts)
# --> this is a Tokens instance, not a Token one ! (see documentation for explanation)
# tokens conserves the cut parts, but behaves like a list
interesting_tokens = tokens[1::2]
# so one has to take only odd elements
# n-gram construction
ngram = interesting_tokens.slice(0,len(interesting_tokens),2)
ngram[2]
# return Token('for demonstration', 2 ranges)
str(ngram[2])
# return 'for demonstration'
ngram[2].ranges
# return [range(14, 17), range(18, 31)]
ngram[2].subTokens
# return the Tokens instance composed of token 'for' and token 'demonstration'
# add attributes to a Token
tok0 = interesting_tokens[0]
tok0.setattr('name_of_attribute',{'some_key':'some_value'})
# and take the attribute back
tok0.name_of_attribute
# return {'some_key':'some_value'}
# are the two 'for' Token objects the same?
interesting_tokens[2] == interesting_tokens[-2]
# return False, because they are not at the same position
# basic operations among Token
for_for = interesting_tokens[2] + interesting_tokens[-2]
str(for_for)
# return 'for for'
for_for.ranges
# return [range(14, 17), range(36, 39)]
for_for.string
# return 'Simple string for demonstration and for illustration.'
# to check the positions of the two 'for' Token :
# '01234567890...456...01234567890.....678.0123456789012'
# also available :
# tok1 + tok2 : union of the sets of tok1.ranges and tok2.ranges
# tok1 - tok2 : difference of tok1.ranges and tok2.ranges
# tok1 * tok2 : intersection of tok1.ranges and tok2.ranges
# tok1 / tok2 : symmetric difference of tok1.ranges and tok2.ranges
# reconstruction of a Token
simple_demonstration = interesting_tokens[0:5:3].join()
# one could have done interesting_tokens.join(0,5,3) as well
# it contains two non-overlapping sub-parts
str(simple_demonstration)
# return 'Simple demonstration'
# basic string methods from Python are still there
simple_demonstration.lower()
# return 'simple demonstration'
Other examples can be found in the documentation.
Comparison with other Python libraries
A comparison with some other NLP libraries (nltk, gensim, spaCy, gateNLP, ...) can be found in the documentation.
Versions
- Versions before 0.4 only provide the Token and Tokens classes. They have since been split into three classes, named Span, Token and Tokens. Importantly, the methods Token.append and Token.remove no longer exist in later versions. They have been replaced by Token.append_range, Token.append_ranges, Token.remove_range and Token.remove_ranges.
- Version 0.4 adds the Span class alongside Token and Tokens. Span handles the sub-part splitting of a given string, whereas Token and Tokens now consume Span objects and handle the attributes of the Token.
- From version 0.5, the basic tools Span, Token and Tokens have been split off from the iamtokenizing package (see https://pypi.org/project/iamtokenizing/). Only the advanced tokenizers are now present in the package iamtokenizing, which depends on the package tokenspan. The objects Span, Token and Tokens can be imported as before from the newly deployed package tokenspan, available at https://pypi.org/project/tokenspan/.
About us
Package developed for Natural Language Processing at IAM: Unité d'Informatique et d'Archivistique Médicale, Service d'Informatique Médicale, Pôle de Santé Publique, Centre Hospitalo-Universitaire (CHU) de Bordeaux, France.
You are kindly encouraged to contact the authors by opening an issue on the official repository, and to propose improvements and/or suggestions via issues or merge requests.
Last version : August 5, 2021
Hashes for tokenspan-0.5.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 5475465f053474135ec761213c036b0ff7f5e5ed56392d94e4e310a24134e971
MD5 | a1e716c5fb21f7fc5d8e5e8d37f2e70e
BLAKE2b-256 | 37a8148aba1f02f3f3b9a297e1764cc694978312b922656696329f281919885c