
A tokenizer for Icelandic text


Overview

Tokenization is a necessary first step in many natural language processing tasks, such as word counting, parsing, spell checking, corpus generation, and statistical analysis of text.

Tokenizer is a compact pure-Python (2 and 3) module for tokenizing Icelandic text. It converts Python text strings to streams of token objects, where each token object is a separate word, punctuation mark, number/amount, date, e-mail address, URL/URI, etc. It also segments the token stream into sentences, handling corner cases such as abbreviations and dates in the middle of sentences.

The package contains a dictionary of common Icelandic abbreviations, in the file src/tokenizer/Abbrev.conf.

Tokenizer is derived from a corresponding module in the Greynir project (see the Greynir repository on GitHub), by the same author. Note that Tokenizer is licensed under the MIT license while Greynir is licensed under GPLv3.

You might also find the Reynir natural language parser for Icelandic interesting. The Reynir parser uses Tokenizer on its input.

To install:

$ pip install tokenizer

To use (for Python 3, you can omit the u"" string prefix):

from tokenizer import tokenize, TOK

text = (u"Málinu var vísað til stjórnskipunar- og eftirlitsnefndar "
        u"skv. 3. gr. XVII. kafla laga nr. 10/2007 þann 3. janúar 2010.")

for token in tokenize(text):
    print(u"{0}: '{1}' {2}".format(
        TOK.descr[token.kind],
        token.txt or "-",
        token.val or ""))

Output:

S_BEGIN: '-' (0, None)
WORD: 'Málinu'
WORD: 'var'
WORD: 'vísað'
WORD: 'til'
WORD: 'stjórnskipunar- og eftirlitsnefndar'
WORD: 'skv.' [('samkvæmt', 0, 'fs', 'skst', 'skv.', '-')]
ORDINAL: '3.' 3
WORD: 'gr.' [('grein', 0, 'kvk', 'skst', 'gr.', '-')]
ORDINAL: 'XVII.' 17
WORD: 'kafla'
WORD: 'laga'
WORD: 'nr.' [('númer', 0, 'hk', 'skst', 'nr.', '-')]
NUMBER: '10' (10, None, None)
PUNCTUATION: '/' 4
YEAR: '2007' 2007
WORD: 'þann'
DATE: '3. janúar 2010' (2010, 1, 3)
PUNCTUATION: '.' 3
S_END: '-'

Note the following:

  • Sentences are delimited by TOK.S_BEGIN and TOK.S_END tokens.

  • Composite words, such as stjórnskipunar- og eftirlitsnefndar, are coalesced into one token.

  • Well-known abbreviations are recognized and their full expansion is available in the token.val field.

  • Ordinal numbers (3., XVII.) are recognized and their value (3, 17) is available in the token.val field.

  • Dates, years and times are recognized and the respective year, month, day, hour, minute and second values are included as a tuple in token.val.

  • Numbers, both integer and real, are recognized and their value is available in the token.val field.

The tokenize() function

To tokenize a text string, call tokenizer.tokenize(text). This function returns a Python generator of token objects. Each token object is a simple namedtuple with three fields: (kind, txt, val) (see below).

The tokenizer.tokenize() function is typically called in a for loop:

for token in tokenizer.tokenize(mystring):
    kind, txt, val = token
    if kind == tokenizer.TOK.WORD:
        # Do something with word tokens
        pass
    else:
        # Do something else
        pass

Alternatively, create a token list from the returned generator:

token_list = list(tokenizer.tokenize(mystring))
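
Because sentence boundaries are marked by TOK.S_BEGIN and TOK.S_END tokens (see below), the generator output can be regrouped into sentences. A minimal sketch follows; the sentences() helper is illustrative, not part of the package API:

from tokenizer import tokenize, TOK

def sentences(text):
    """Yield each sentence as a list of its tokens,
    excluding the S_BEGIN/S_END markers themselves."""
    current = []
    for token in tokenize(text):
        if token.kind == TOK.S_BEGIN:
            current = []
        elif token.kind == TOK.S_END:
            yield current
        else:
            current.append(token)

for sent in sentences(u"Þetta er setning. Hér er önnur."):
    print(u" ".join(t.txt for t in sent))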

The token object

Each token is represented by a namedtuple with three fields: (kind, txt, val).

The kind field contains one of the following integer constants, defined within the TOK class:

PUNCTUATION = 1
TIME = 2
DATE = 3
YEAR = 4
NUMBER = 5
WORD = 6
TELNO = 7
PERCENT = 8
URL = 9
ORDINAL = 10
TIMESTAMP = 11
CURRENCY = 12       # Not used
AMOUNT = 13
PERSON = 14         # Not used
EMAIL = 15
ENTITY = 16         # Not used
UNKNOWN = 17

S_BEGIN = 11001     # Sentence begin
S_END = 11002       # Sentence end

To obtain a descriptive text for a token kind, use TOK.descr[token.kind] (see example above).

The txt field contains the original source text for the token.

In the case of abbreviations that end a sentence, the final period ‘.’ is a separate token, and it is consequently omitted from the abbreviation token’s txt field. A sentence ending in o.s.frv. will thus end with two tokens, the next-to-last one being the tuple (TOK.WORD, "o.s.frv", "og svo framvegis") - note the omitted period in the txt field - and the last one being (TOK.PUNCTUATION, ".", 3) (the 3 is explained below).
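
For instance, the trailing tokens of such a sentence can be checked as follows (a sketch; per the description above, the final token of the stream is the S_END marker):

from tokenizer import tokenize, TOK

tokens = list(tokenize(u"Hann nefndi epli, appelsínur o.s.frv."))
# tokens[-1] is the S_END marker; before it come the separate
# final period and the abbreviation without its period
abbrev, period = tokens[-3], tokens[-2]
assert abbrev.kind == TOK.WORD and abbrev.txt == "o.s.frv"
assert period.kind == TOK.PUNCTUATION and period.txt == "."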

The val field contains auxiliary information, corresponding to the token kind, as follows (a worked example follows the list):

  • For TOK.PUNCTUATION, the val field specifies the whitespace normally found around the symbol in question:

    TP_LEFT = 1   # Whitespace to the left
    TP_CENTER = 2 # Whitespace to the left and right
    TP_RIGHT = 3  # Whitespace to the right
    TP_NONE = 4   # No whitespace
  • For TOK.TIME, the val field contains an (hour, minute, second) tuple.

  • For TOK.DATE, the val field contains a (year, month, day) tuple (all 1-based).

  • For TOK.YEAR, the val field contains the year as an integer.

  • For TOK.NUMBER, the val field contains a tuple (number, None, None). (The two empty fields are included for compatibility with Greynir.)

  • For TOK.WORD, the val field contains the full expansion of an abbreviation, as a list containing a single tuple, or None if the word is not abbreviated.

  • For TOK.PERCENT, the val field contains a tuple of (percentage, None, None).

  • For TOK.ORDINAL, the val field contains the ordinal value as an integer.

  • For TOK.TIMESTAMP, the val field contains a (year, month, day, hour, minute, second) tuple.

  • For TOK.AMOUNT, the val field contains an (amount, currency, None, None) tuple. The amount is a float, and the currency is an ISO currency code, i.e. “USD” for dollars ($ sign) or “EUR” for euros (€ sign). (The two empty fields are included for compatibility with Greynir.)
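
Here is a sketch of unpacking val for a few of these token kinds; the sample sentence, and which token kinds it actually produces, are assumptions for illustration:

from tokenizer import tokenize, TOK

for token in tokenize(u"Fundurinn hófst kl. 14:30 þann 3. janúar 2010."):
    if token.kind == TOK.TIME:
        hour, minute, second = token.val
        print(u"Time {0:02}:{1:02}:{2:02}".format(hour, minute, second))
    elif token.kind == TOK.DATE:
        year, month, day = token.val
        print(u"Date {0}-{1:02}-{2:02}".format(year, month, day))
    elif token.kind == TOK.PUNCTUATION:
        # val is one of the TP_* whitespace constants listed above
        print(u"Punctuation '{0}' spacing={1}".format(token.txt, token.val))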

The correct_spaces() function

Tokenizer also contains the utility function tokenizer.correct_spaces(text). This function tokenizes its input string and re-joins the tokens with correct whitespace around punctuation. Example:

>>> tokenizer.correct_spaces("Frétt \n  dagsins:Jón\t ,Friðgeir og Páll ! 100  /  2  =   50")
'Frétt dagsins: Jón, Friðgeir og Páll! 100/2 = 50'
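
This pairs naturally with the token stream: text re-joined naively from token.txt fields (as in the sentence sketch above) can be passed through correct_spaces() to restore normal spacing. A sketch, assuming the spacing rules shown above:

>>> tokens = [u"Páll", u",", u"Jón", u"og", u"Friðgeir", u"komu", u"."]
>>> tokenizer.correct_spaces(u" ".join(tokens))
'Páll, Jón og Friðgeir komu.'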

The Abbrev.conf file

Abbreviations recognized by Tokenizer are defined in the Abbrev.conf file, found in the src/tokenizer/ directory. This is a text file with abbreviations, their definitions and explanatory comments. The file is loaded into memory during the first call to tokenizer.tokenize() within a process.
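
The load-once behaviour described above follows a common lazy-initialization pattern, sketched here for illustration only; the names and the parsing shown are hypothetical, not the package's actual internals:

import io

_abbrev_cache = None  # populated on first use, then reused

def load_abbreviations(path):
    """Hypothetical load-once reader: skips blank lines and
    '#' comments; the real file also carries definitions."""
    global _abbrev_cache
    if _abbrev_cache is None:
        with io.open(path, encoding="utf-8") as f:
            _abbrev_cache = [line.strip() for line in f
                             if line.strip() and not line.lstrip().startswith("#")]
    return _abbrev_cache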

Development installation

To install Tokenizer in development mode, where you can easily modify the source files (assuming you have git available):

$ git clone https://github.com/vthorsteinsson/Tokenizer
$ cd Tokenizer
$ # [ Activate your virtualenv here, if you have one ]
$ python setup.py develop

To run the built-in tests, install pytest, cd to your Tokenizer subdirectory (and optionally activate your virtualenv), then run:

$ python -m pytest
