Basic tools to tokenize (i.e. to construct atomic entities or sub-strings of) a string, for Natural Language Processing (NLP). Also useful for annotation, tree parsing, entity linking, ... (in fact, anything that links a string or its sub-parts to another object). Key concepts are versatility with respect to other libraries, and freedom to define many concepts on top of a string.
Tokenization for language processing
This package contains some generic, configurable tools for cutting a string into sub-parts (cf. Wikipedia), called ExtractionString. An ExtractionString is a sub-string of a parent string (say, the initial complete text), with associated intervals of non-overlapping characters. The number of associated intervals is arbitrary.
The ExtractionString class allows basic tokenization of text, such as word splitting, n-gram splitting, and char-gram splitting of arbitrary size. In addition, it allows one to associate several non-overlapping sub-strings into a given ExtractionString. One can compare two different ExtractionString objects in terms of their intervals. One can also apply basic operations to them (+, -, *, /), corresponding respectively to the union, difference, intersection and symmetric difference implemented by Python set; here the sets are the sets of positions in the parent string. Finally, there are some ordering possibilities among the different ExtractionString objects constructed from the same parent string.
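The correspondence between the four operators and Python set operations can be illustrated without the library. The following is a minimal sketch on plain Python sets of character positions; the helper positions is hypothetical and is not part of the extractionstring API, and the intervals are assumed to be half-open (start included, stop excluded).

```python
def positions(intervals):
    """Expand (start, stop) pairs into the set of character positions they cover."""
    return {p for start, stop in intervals for p in range(start, stop)}

# two selections that share the interval (14, 17)
a = positions([(0, 6), (14, 17)])
b = positions([(14, 17), (36, 39)])

union = a | b                 # the operation described for span1 + span2
difference = a - b            # the operation described for span1 - span2
intersection = a & b          # the operation described for span1 * span2
symmetric_difference = a ^ b  # the operation described for span1 / span2

print(sorted(intersection))   # [14, 15, 16]
print(sorted(difference))     # [0, 1, 2, 3, 4, 5]
```

This only shows the set semantics; how the library stores and renders the resulting intervals is not covered by this sketch.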
Repositories and online documentation
The different sources of information for this package are:
- the official Python Package Index (PyPI) page is on https://pypi.org/project/extractionstring
- the official git repository is on https://framagit.org/nlp/extractionstring
- the official documentation is on https://nlp.frama.io/extractionstring/
Philosophy of this library
In extractionstring, one thinks of a string as a collection of integers: the position of each character in the string. For instance
'Simple string for demonstration and for illustration.' # the parent string
'01234567891123456789212345678931234567894123456789512' # the positions
'       string                       for illustration ' # the ExtractionString es1
'       789112                       678 412345678951 ' # the associated positions
'Simple                                               ' # the ExtractionString es2
'012345                                               ' # the associated positions
Defining the ExtractionString 'string for illustration' consists in selecting the positions [7,13, 36,39, 40,52] from the parent string, each pair giving the start (included) and the end (excluded) of an interval. The ExtractionString 'Simple' is defined by the positions [0,6].
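The selection of positions can be reproduced with ordinary string slicing. The sketch below is standalone Python and does not use the library; the function extract is a hypothetical helper, and joining the pieces with a single space is an assumption based on the string representations quoted in this document.

```python
parent = 'Simple string for demonstration and for illustration.'

def extract(string, flat_intervals):
    """Return the sub-strings selected by a flat [start, stop, start, stop, ...] list."""
    pairs = zip(flat_intervals[0::2], flat_intervals[1::2])
    # half-open slices: start included, stop excluded
    return [string[start:stop] for start, stop in pairs]

pieces = extract(parent, [7, 13, 36, 39, 40, 52])
print(pieces)                   # ['string', 'for', 'illustration']
print(' '.join(pieces))         # string for illustration
print(extract(parent, [0, 6]))  # ['Simple']
```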
In addition, one can see the above ranges as sets of positions. It is then quite easy to perform some basic operations on ExtractionString objects; for instance, the addition of two ExtractionString objects,
str(es1 + es2) = 'Simple string for illustration'
is interpreted as the union of their respective sets of positions.
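As an illustration of this interpretation, the following standalone sketch takes the union of two sets of positions and groups the result back into intervals. The helpers to_positions and to_intervals are hypothetical and not taken from the library. Note that this sketch merges contiguous intervals into one, which may or may not match what the library does.

```python
parent = 'Simple string for demonstration and for illustration.'

def to_positions(intervals):
    """Expand half-open (start, stop) pairs into a set of positions."""
    return {p for start, stop in intervals for p in range(start, stop)}

def to_intervals(positions):
    """Group a set of positions into sorted, maximal half-open (start, stop) runs."""
    intervals = []
    for p in sorted(positions):
        if intervals and intervals[-1][1] == p:
            # p directly follows the current run: extend it by one position
            intervals[-1] = (intervals[-1][0], p + 1)
        else:
            intervals.append((p, p + 1))
    return intervals

es1 = [(7, 13), (36, 39), (40, 52)]  # 'string for illustration'
es2 = [(0, 6)]                       # 'Simple'

merged = to_intervals(to_positions(es1) | to_positions(es2))
print(merged)  # [(0, 6), (7, 13), (36, 39), (40, 52)]
print(' '.join(parent[start:stop] for start, stop in merged))
# Simple string for illustration
```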
In addition to these logical operations, there are a few utilities, such as the possibility to split or slice an ExtractionString object, as long as all the objects involved relate to the same parent string.
Basic example
Below we give a simple example of usage of the ExtractionString class.
import re
from extractionstring import ExtractionString
string = 'Simple string for demonstration and for illustration.'
initial_span = ExtractionString(string)
# char-gram generation
chargrams = initial_span.slice(0, len(initial_span), 3)
str(chargrams[2])
# return 'mpl'
# each char-gram conserves a memory of the initial string
chargrams[2].string
# return 'Simple string for demonstration and for illustration.'
# word splitting: cut at every run of non-word characters
cuts = []
for r in re.finditer(r'\W+', string):
    cuts += [r.start(), r.end()]
spans = initial_span.split(cuts)
# this returns a list of ExtractionString objects
# representing the tokens as if string.split() was applied
# another possibility, to keep only the words, is to construct them explicitly
cuts = []
for r in re.finditer(r'\w+', string):
    cuts += [r.start(), r.end()]
spans = ExtractionString(string, intervals=cuts).extractions
# the extractions attribute contains the list of sub-tokens
# 2-gram construction
ngram = [ExtractionString(string, intervals=cuts[2*i:2*i+4])
         for i in range(len(cuts)//2-1)]
ngram[2]
# return ExtractionString('for demonstration', [(14,17),(18,31)])
str(ngram[2])
# return 'for demonstration'
ngram[2].intervals
# return EvenSizedSortedSet[(14,17);(18,31)]
ngram[2].extractions
# return [ExtractionString('for', [(14,17)]), ExtractionString('demonstration', [(18,31)])]
# are the two 'for' ExtractionString objects the same?
spans[2] == spans[-2]
# return False, because they are not at the same position
# basic operations among ExtractionString objects
for_for = spans[2] + spans[-2]
str(for_for)
# return 'for for'
for_for.intervals
# return EvenSizedSortedSet[(14,17);(36,39)]
for_for.string
# return 'Simple string for demonstration and for illustration.'
# to check the positions of the two 'for' ExtractionString :
# '01234567890...456...01234567890.....67890............'
# also available :
# span1 + span2 : union of the sets of span1.intervals and span2.intervals
# span1 - span2 : difference of span1.intervals and span2.intervals
# span1 * span2 : intersection of span1.intervals and span2.intervals
# span1 / span2 : symmetric difference of span1.intervals and span2.intervals
Other examples can be found in the documentation.
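The interval values quoted in the comments of the example can be checked with the standard library alone. The sketch below does not call extractionstring; it recomputes the word intervals with re, and assumes that the char-grams are produced by a sliding window of step 1, which is what the quoted result 'mpl' suggests.

```python
import re

string = 'Simple string for demonstration and for illustration.'

# (start, stop) of every word, mirroring the explicit construction with r'\w+'
words = [(m.start(), m.end()) for m in re.finditer(r'\w+', string)]

# 2-grams: pairs of consecutive word intervals
bigrams = [words[i:i + 2] for i in range(len(words) - 1)]
print(bigrams[2])  # [(14, 17), (18, 31)]

# character 3-grams with a sliding window of step 1
chargrams = [string[i:i + 3] for i in range(len(string) - 2)]
print(chargrams[2])  # mpl

# the two occurrences of 'for' share their text but not their position
print(words[2], words[-2])  # (14, 17) (36, 39)
```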
Comparison with other Python libraries
A comparison with some other NLP libraries (nltk, gensim, spaCy, gateNLP, ...) can be found in the documentation.
Installation
To install the library from the Python Package Index (PyPI), simply run
pip install extractionstring
The official repository is on https://framagit.org/nlp/extractionstring. To install the package from the repository, run the following commands
git clone https://framagit.org/nlp/extractionstring.git
cd extractionstring/
pip install .
Once installed, one can run some tests using
cd tests/
python3 -m unittest -v
(the verbosity flag -v is optional).
Versions
See the CHANGES file in this folder.
About us
Package developed for Natural Language Processing at IAM: Unité d'Informatique et d'Archivistique Médicale, Service d'Informatique Médicale, Pôle de Santé Publique, Centre Hospitalo-Universitaire (CHU) de Bordeaux, France.
You are kindly encouraged to contact the authors by opening an issue on the official repository, and to propose improvements and/or suggestions via issues or merge requests.
Last version: Jan 3, 2023
Hashes for extractionstring-0.8.2-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | 482b6f5a9bb0f4f2f5b2dcbef9c0b4a4f2c3e39262239469bf809d6b0dfe5e97 |
MD5 | 4a2f322a9f3bdfd79071bf077fd0da13 |
BLAKE2b-256 | 0bc4e1f6c36e66f6f0b682d89aa03835160695e16c38283f1a7423224d42faa7 |