tokenizer

A tokenizer for Icelandic text

Overview

Tokenization is a necessary first step in many natural language processing tasks, such as word counting, parsing, spell checking, corpus generation, and statistical analysis of text.

Tokenizer is a compact pure-Python (2 and 3) module for tokenizing Icelandic text. It converts Python text strings to streams of token objects, where each token object is a separate word, punctuation sign, number/amount, date, e-mail, URL/URI, etc. It also segments the token stream into sentences, considering corner cases such as abbreviations and dates in the middle of sentences.

The package contains a dictionary of common Icelandic abbreviations, in the file src/tokenizer/Abbrev.conf.

Tokenizer is an independent spinoff from the Greynir project (which has its own GitHub repository), by the same authors. Note that Tokenizer is licensed under the MIT license, while Greynir is licensed under GPLv3.

You might also find the Reynir natural language parser for Icelandic interesting. The Reynir parser uses Tokenizer on its input.

To install:

$ pip install tokenizer

To use (for Python 3, you can omit the u"" string prefix):

from tokenizer import tokenize, TOK

text = (u"Málinu var vísað til stjórnskipunar- og eftirlitsnefndar "
    u"skv. 3. gr. XVII. kafla laga nr. 10/2007 þann 3. janúar 2010.")

for token in tokenize(text):

    print(u"{0}: '{1}' {2}".format(
        TOK.descr[token.kind],
        token.txt or "-",
        token.val or ""))

Output:

BEGIN SENT: '-' (0, None)
WORD: 'Málinu'
WORD: 'var'
WORD: 'vísað'
WORD: 'til'
WORD: 'stjórnskipunar- og eftirlitsnefndar'
WORD: 'skv.' [('samkvæmt', 0, 'fs', 'skst', 'skv.', '-')]
ORDINAL: '3.' 3
WORD: 'gr.' [('grein', 0, 'kvk', 'skst', 'gr.', '-')]
ORDINAL: 'XVII.' 17
WORD: 'kafla'
WORD: 'laga'
WORD: 'nr.' [('númer', 0, 'hk', 'skst', 'nr.', '-')]
NUMBER: '10' (10, None, None)
PUNCTUATION: '/' 4
YEAR: '2007' 2007
WORD: 'þann'
DATEABS: '3. janúar 2010' (2010, 1, 3)
PUNCTUATION: '.' 3
END SENT: '-'

Note the following:

Sentences are delimited by TOK.S_BEGIN and TOK.S_END tokens.

Composite words, such as stjórnskipunar- og eftirlitsnefndar, are coalesced into one token.

Well-known abbreviations are recognized and their full expansion is available in the token.val field.

Ordinal numbers (3., XVII.) are recognized and their value (3, 17) is available in the token.val field.

Dates, years and times, both absolute and relative, are recognized and the respective year, month, day, hour, minute and second values are included as a tuple in token.val.

Numbers, both integer and real, are recognized and their value is available in the token.val field.
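The sentence markers mentioned in the first note make it easy to regroup a flat token stream into sentences. The following is a minimal sketch in Python 3, not part of the package: it uses a stand-in namedtuple and a hand-built token list instead of real tokenize() output so that it runs without Tokenizer installed. The helper name split_sentences is hypothetical, the numeric kind values are taken from the table under "The kind field", and the txt and val values of the marker tokens are placeholders.

```python
from collections import namedtuple

# Stand-in for the library's token tuple; field names follow this README.
Tok = namedtuple("Tok", ["kind", "txt", "val"])

# Kind values copied from the table under "The kind field".
PUNCTUATION, WORD, S_BEGIN, S_END = 1, 6, 11001, 11002


def split_sentences(tokens):
    """Yield one list of tokens per sentence, dropping the sentence markers."""
    current = []
    for tok in tokens:
        if tok.kind == S_BEGIN:
            current = []
        elif tok.kind == S_END:
            yield current
        else:
            current.append(tok)


# Hand-built stream imitating two short sentences (marker fields are placeholders).
stream = [
    Tok(S_BEGIN, None, (0, None)),
    Tok(WORD, "Málinu", None),
    Tok(WORD, "var", None),
    Tok(WORD, "vísað", None),
    Tok(PUNCTUATION, ".", 3),
    Tok(S_END, None, None),
    Tok(S_BEGIN, None, (0, None)),
    Tok(WORD, "Takk", None),
    Tok(PUNCTUATION, "!", 3),
    Tok(S_END, None, None),
]

sentences = [[t.txt for t in sent] for sent in split_sentences(stream)]
print(sentences)
```

With the real package, the same function could presumably be fed the generator returned by tokenize() directly, since it only relies on the kind field.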

Further details of how Tokenizer processes text can be inferred from the test module in the project’s GitHub repository.

The tokenize() function

To tokenize a text string, call tokenizer.tokenize(text, **options). This function returns a Python generator of token objects. Each token object is a simple namedtuple with three fields: (kind, txt, val) (see below).

The tokenizer.tokenize() function is typically called in a for loop:

for token in tokenizer.tokenize(mystring):
    kind, txt, val = token
    if kind == tokenizer.TOK.WORD:
        # Do something with word tokens
        pass
    else:
        # Do something else
        pass

Alternatively, create a token list from the returned generator:

token_list = list(tokenizer.tokenize(mystring))

In Python 2.7, you can pass either unicode strings or str byte strings to tokenizer.tokenize(). In the latter case, the byte string is assumed to be encoded in UTF-8.

Tokenization options

You can optionally pass one or more of the following options to the tokenizer.tokenize() function:

tokenizer.tokenize(text, convert_numbers=True)

This option causes the tokenizer to convert numbers and amounts with English-style decimal points (.) and thousands separators (,) to Icelandic format, where the decimal separator is a comma (,) and the thousands separator is a period (.). $1,234.56 is thus converted to a token whose text is $1.234,56.
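The separator swap itself can be illustrated in a few lines. This is only a sketch in Python 3 of the text transformation, not the tokenizer's implementation: the function name to_icelandic_number is hypothetical, and it assumes the input is already known to be an English-style number, a decision the real tokenizer has to make from context.

```python
# Swap "," and "." in a single pass; str.maketrans builds the character mapping.
_SWAP = str.maketrans({",": ".", ".": ","})


def to_icelandic_number(txt):
    """Return txt with English-style separators replaced by Icelandic ones."""
    return txt.translate(_SWAP)


print(to_icelandic_number("$1,234.56"))
print(to_icelandic_number("10,345.67"))
```

A single translate() call is used rather than two replace() calls because chained replacements would overwrite each other's results.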

The default value for the convert_numbers option is False.

Note that versions of Tokenizer prior to 1.4 behaved as if convert_numbers were set to True.

tokenizer.tokenize(text, convert_telnos=True)

This option causes the tokenizer to convert telephone numbers to a standard format with the template 888-9999, i.e. three digits followed by a hyphen and then four digits. 1234567 and 123 4567 are thus converted to a token whose text is 123-4567.
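The target format can be sketched with a regular expression. The helper below is hypothetical and only covers the seven-digit forms mentioned in the text; it makes no claim about how the tokenizer itself recognizes telephone numbers.

```python
import re

# Seven digits, optionally split 3+4 by a single space or hyphen.
_TELNO = re.compile(r"^(\d{3})[ -]?(\d{4})$")


def normalize_telno(txt):
    """Return txt as 'ddd-dddd', or None if it is not a 7-digit number."""
    m = _TELNO.match(txt)
    if m is None:
        return None
    return "{0}-{1}".format(m.group(1), m.group(2))


for candidate in ("1234567", "123 4567", "699-4244", "12345"):
    print(candidate, "->", normalize_telno(candidate))
```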

The default value for the convert_telnos option is False.

Note that versions of Tokenizer prior to 1.4 behaved as if convert_telnos were set to True.

tokenizer.tokenize(text, handle_kludgy_ordinals=[value])

This option controls the way Tokenizer handles ‘kludgy’ ordinals, such as 1sti, 4ðu, or 2ja. By default, such ordinals are returned unmodified (‘passed through’) as word tokens (TOK.WORD). However, this can be modified as follows:
- tokenizer.KLUDGY_ORDINALS_MODIFY: Kludgy ordinals are corrected to become ‘proper’ word tokens, i.e. 1sti becomes fyrsti and 2ja becomes tveggja.
- tokenizer.KLUDGY_ORDINALS_TRANSLATE: Kludgy ordinals that represent proper ordinal numbers are translated to ordinal tokens (TOK.ORDINAL), with their original text and their ordinal value. 1sti thus becomes a TOK.ORDINAL token with a value of 1, and 3ja becomes a TOK.ORDINAL with a value of 3.
- tokenizer.KLUDGY_ORDINALS_PASS_THROUGH is the default value of the option. It causes kludgy ordinals to be returned unmodified as word tokens.
Note that versions of Tokenizer prior to 1.4 behaved as if handle_kludgy_ordinals were set to tokenizer.KLUDGY_ORDINALS_TRANSLATE.
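The three modes can be compared side by side with a small stand-in. Everything in this sketch is illustrative: the mode constants, the handle_kludgy helper and the two lookup tables are hypothetical and contain only the examples quoted in this README. Leaving 2ja out of the ordinal table is an assumption, made because tveggja is not an ordinal form.

```python
from collections import namedtuple

Tok = namedtuple("Tok", ["kind", "txt", "val"])
WORD, ORDINAL = 6, 10  # values from the kind table

# Hypothetical stand-ins for the three documented option values.
PASS_THROUGH, MODIFY, TRANSLATE = 0, 1, 2

# Tiny illustrative tables; the real lists are certainly longer.
PROPER_FORM = {"1sti": "fyrsti", "2ja": "tveggja", "3ja": "þriðja"}
ORDINAL_VALUE = {"1sti": 1, "3ja": 3}  # assumption: 2ja is not an ordinal


def handle_kludgy(txt, mode=PASS_THROUGH):
    """Return a token for txt according to the chosen kludgy-ordinal mode."""
    if mode == MODIFY and txt in PROPER_FORM:
        return Tok(WORD, PROPER_FORM[txt], None)
    if mode == TRANSLATE and txt in ORDINAL_VALUE:
        return Tok(ORDINAL, txt, ORDINAL_VALUE[txt])
    return Tok(WORD, txt, None)


for mode in (PASS_THROUGH, MODIFY, TRANSLATE):
    print(handle_kludgy("1sti", mode))
```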

The token object

Each token is represented by a namedtuple with three fields: (kind, txt, val).

The kind field

The kind field contains one of the following integer constants, defined within the TOK class:

Constant	Value	Explanation	Examples
PUNCTUATION	1	Punctuation	.
TIME	2	Time (h, m, s)	11:35:40
DATE *	3	Date (y, m, d)	[Unused, see DATEABS and DATEREL]
YEAR	4	Year	árið 874 e.Kr. 1965 44 f.Kr.
NUMBER	5	Number	100 1.965 1.965,34 1,965.34 2⅞
WORD	6	Word	kattaeftirlit hunda- og kattaeftirlit
TELNO	7	Telephone number	5254764 699-4244 410 4000
PERCENT	8	Percentage	78%
URL	9	URL	https://greynir.is http://tiny.cc/28695y
ORDINAL	10	Ordinal number	30. XVIII.
TIMESTAMP *	11	Timestamp	[Unused, see TIMESTAMPABS and TIMESTAMPREL]
CURRENCY *	12	Currency name	[Unused]
AMOUNT	13	Amount	€2.345,67 750 þús.kr. 2,7 mrð. USD kr. 9.900 EUR 200
PERSON *	14	Person name	[Unused]
EMAIL	15	E-mail	fake@news.is
ENTITY *	16	Named entity	[Unused]
UNKNOWN	17	Unknown token
DATEABS	18	Absolute date	30. desember 1965 30/12/1965 1965-12-30
DATEREL	19	Relative date	15. mars
TIMESTAMPABS	20	Absolute timestamp	30. desember 1965 11:34 1965-12-30 kl. 13:00
TIMESTAMPREL	21	Relative timestamp	desember kl. 13:00
MEASUREMENT	22	Value with a measurement unit	690 MW 1.010 hPa 220 m² 80° C
NUMWLETTER	23	Number followed by a single letter	14a 7B
DOMAIN	24	Domain name	greynir.is Reddit.com www.wikipedia.org
HASHTAG	25	Hashtag	#MeToo #12stig
S_BEGIN	11001	Start of sentence
S_END	11002	End of sentence

(*) The token types marked with an asterisk are reserved for the Reynir package and not currently returned by the tokenizer.

To obtain a descriptive text for a token kind, use TOK.descr[token.kind] (see example above).
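As an illustration of how the integer kinds and the descr mapping fit together, the sketch below builds a cut-down stand-in for the TOK class and counts token kinds by their descriptive names. The stand-in holds only five of the constants from the table, and the description strings are those visible in the example output near the top of this document; the real class is larger.

```python
from collections import Counter


class TOK:
    """Cut-down stand-in for the library's TOK class (a few kinds only)."""

    PUNCTUATION = 1
    WORD = 6
    ORDINAL = 10
    S_BEGIN = 11001
    S_END = 11002

    # Maps each kind to the label printed in the example output.
    descr = {
        PUNCTUATION: "PUNCTUATION",
        WORD: "WORD",
        ORDINAL: "ORDINAL",
        S_BEGIN: "BEGIN SENT",
        S_END: "END SENT",
    }


# Kinds of a short, hand-written token sequence.
kinds = [TOK.S_BEGIN, TOK.WORD, TOK.WORD, TOK.ORDINAL, TOK.PUNCTUATION, TOK.S_END]

histogram = Counter(TOK.descr[k] for k in kinds)
print(histogram)
```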

The txt field

The txt field contains the original source text for the token. However, in a few cases, the tokenizer auto-corrects the original source text:

If the appropriate options are specified (see above), it converts kludgy ordinals (3ja) to proper ones (þriðja), English-style thousand and decimal separators to Icelandic ones (10,345.67 becomes 10.345,67), and telephone numbers to a canonical format (123-4567).
It converts single and double quotes to the correct Icelandic ones (i.e. „these“ or ‚these‘).
Tokenizer automatically merges Unicode COMBINING ACUTE ACCENT (code point 769) and COMBINING DIAERESIS (code point 776) with vowels to form single code points for the Icelandic letters á, é, í, ó, ú, ý and ö, in both lower and upper case.
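The effect of merging combining marks can be reproduced with the standard library. The sketch below (Python 3) uses Unicode NFC normalization as a stand-in; this is not necessarily how Tokenizer implements the merge, and NFC also composes many sequences beyond the Icelandic letters listed above.

```python
import unicodedata

# Code points 769 and 776 are U+0301 and U+0308.
assert chr(769) == "\u0301" and chr(776) == "\u0308"

# "a" + COMBINING ACUTE ACCENT, "o" + COMBINING DIAERESIS, "Y" + COMBINING ACUTE ACCENT
decomposed = "a\u0301 o\u0308 Y\u0301"
composed = unicodedata.normalize("NFC", decomposed)

print(composed)
print(len(decomposed), "->", len(composed))
```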

In the case of abbreviations that end a sentence, the final period ‘.’ is a separate token, and it is consequently omitted from the abbreviation token’s txt field. A sentence ending in o.s.frv. will thus end with two tokens, the next-to-last one being the tuple (TOK.WORD, "o.s.frv", "og svo framvegis") - note the omitted period in the txt field - and the last one being (TOK.PUNCTUATION, ".", 3) (the 3 is explained below).

The val field

The val field contains auxiliary information, corresponding to the token kind, as follows:

For TOK.PUNCTUATION, the val field specifies the whitespace normally found around the symbol in question:

TP_LEFT = 1   # Whitespace to the left
TP_CENTER = 2 # Whitespace to the left and right
TP_RIGHT = 3  # Whitespace to the right
TP_NONE = 4   # No whitespace

For TOK.TIME, the val field contains an (hour, minute, second) tuple.
For TOK.DATEABS, the val field contains a (year, month, day) tuple (all 1-based).
For TOK.DATEREL, the val field contains a (year, month, day) tuple (all 1-based), except that at least one of the tuple fields is missing and set to 0. Example: þriðja júní becomes TOK.DATEREL with the fields (0, 6, 3) as the year is missing.
For TOK.YEAR, the val field contains the year as an integer. A negative number indicates that the year is BCE (fyrir Krist), specified with the suffix f.Kr. (e.g. árið 33 f.Kr.).
For TOK.NUMBER, the val field contains a tuple (number, None, None). (The two empty fields are included for compatibility with Greynir.)
For TOK.WORD, the val field contains the full expansion of an abbreviation, as a list containing a single tuple, or None if the word is not abbreviated.
For TOK.PERCENT, the val field contains a tuple of (percentage, None, None).
For TOK.ORDINAL, the val field contains the ordinal value as an integer. The original ordinal may be a decimal number or a Roman numeral.
For TOK.TIMESTAMP, the val field contains a (year, month, day, hour, minute, second) tuple.
For TOK.AMOUNT, the val field contains an (amount, currency, None, None) tuple. The amount is a float, and the currency is an ISO currency code, e.g. USD for dollars ($ sign), EUR for euros (€ sign) or ISK for Icelandic króna (kr. abbreviation). (The two empty fields are included for compatibility with Greynir.)
For TOK.MEASUREMENT, the val field contains a (unit, value) tuple, where unit is a base SI unit (such as g, m, m², s, W, Hz, K for temperature in Kelvin).
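Since a TOK.DATEREL value marks missing fields with 0, a caller will usually need to fill them in from context. The helper below is a hypothetical sketch of one way to do that, taking the missing fields from a reference date; Tokenizer itself does not provide it. Note that date() raises ValueError if the combined fields do not form a valid date.

```python
from datetime import date


def resolve_relative_date(val, reference):
    """Fill the zero fields of a (year, month, day) tuple from a reference date."""
    year, month, day = val
    return date(
        year or reference.year,
        month or reference.month,
        day or reference.day,
    )


# þriðja júní -> (0, 6, 3); the year is taken from the reference date.
resolved = resolve_relative_date((0, 6, 3), date(2019, 10, 22))
print(resolved.isoformat())
```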

The correct_spaces() function

Tokenizer also contains the utility function tokenizer.correct_spaces(text). This function returns a string after splitting it up and re-joining it with correct whitespace around punctuation tokens. Example:

>>> tokenizer.correct_spaces("Frétt \n  dagsins:Jón\t ,Friðgeir og Páll ! 100  /  2  =   50")
'Frétt dagsins: Jón, Friðgeir og Páll! 100/2 = 50'
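The whitespace classes listed for TOK.PUNCTUATION suggest how such re-joining might work. The sketch below is a simplified illustration, not the library's algorithm: the TP_WORD marker, the join_with_spaces helper and the spacing class assigned to each symbol (apart from '/' and '.', which appear in the example output above) are assumptions made for the example. A space is inserted between two tokens only when the left one accepts whitespace after it and the right one accepts whitespace before it.

```python
TP_LEFT, TP_CENTER, TP_RIGHT, TP_NONE = 1, 2, 3, 4
TP_WORD = 5  # hypothetical marker for non-punctuation tokens

# Spacing classes that accept whitespace on a given side.
_SPACE_BEFORE_OK = {TP_LEFT, TP_CENTER, TP_WORD}
_SPACE_AFTER_OK = {TP_CENTER, TP_RIGHT, TP_WORD}


def join_with_spaces(tokens):
    """Join (text, spacing) pairs, adding a space only where both neighbours allow it."""
    out = []
    prev = None
    for txt, spacing in tokens:
        if prev in _SPACE_AFTER_OK and spacing in _SPACE_BEFORE_OK:
            out.append(" ")
        out.append(txt)
        prev = spacing
    return "".join(out)


tokens = [
    ("Jón", TP_WORD), (",", TP_RIGHT), ("Friðgeir", TP_WORD), ("og", TP_WORD),
    ("Páll", TP_WORD), ("!", TP_RIGHT), ("100", TP_WORD), ("/", TP_NONE),
    ("2", TP_WORD), ("=", TP_CENTER), ("50", TP_WORD),
]
joined = join_with_spaces(tokens)
print(joined)
```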

The Abbrev.conf file

Abbreviations recognized by Tokenizer are defined in the Abbrev.conf file, found in the src/tokenizer/ directory. This is a text file with abbreviations, their definitions and explanatory comments. The file is loaded into memory during the first call to tokenizer.tokenize() within a process.

Development installation

To install Tokenizer in development mode, where you can easily modify the source files (assuming you have git available):

$ git clone https://github.com/mideind/Tokenizer
$ cd Tokenizer
$ # [ Activate your virtualenv here, if you have one ]
$ python setup.py develop

To run the built-in tests, install pytest, cd to your Tokenizer subdirectory (and optionally activate your virtualenv), then run:

$ python -m pytest

Changelog

Version 1.4.0: Added the **options parameter to the tokenize() function, giving control over the handling of numbers, telephone numbers, and ‘kludgy’ ordinals
Version 1.3.0: Added TOK.DOMAIN and TOK.HASHTAG token types; improved handling of capitalized month name Ágúst, which is now recognized when following an ordinal number; improved recognition of telephone numbers; added abbreviations
Version 1.2.3: Added abbreviations; updated GitHub URLs
Version 1.2.2: Added support for composites with more than two parts, e.g. „dómsmála-, ferðamála-, iðnaðar- og nýsköpunarráðherra“; added support for ± sign; added several abbreviations
Version 1.2.1: Fixed bug where the name Ágúst was recognized as a month name; Unicode nonbreaking and invisible space characters are now removed before tokenization
Version 1.2.0: Added support for Unicode fraction characters; enhanced handling of degrees (°, °C, °F); fixed bug in cubic meter measurement unit; more abbreviations
Version 1.1.2: Fixed bug in liter (l and ltr) measurement units
Version 1.1.1: Added mark_paragraphs() function
Version 1.1.0: All abbreviations in Abbrev.conf are now returned with their meaning in a tuple in token.val; handling of ‘mbl.is’ fixed
Version 1.0.9: Added abbreviation ‘MAST’; harmonized copyright headers
Version 1.0.8: Bug fixes in DATEREL, MEASUREMENT and NUMWLETTER token handling; added ‘kWst’ and ‘MWst’ measurement units; blackened
Version 1.0.7: Added TOK.NUMWLETTER token type
Version 1.0.6: Automatic merging of Unicode COMBINING ACUTE ACCENT and COMBINING DIAERESIS code points with vowels
Version 1.0.5: Date/time and amount tokens coalesced to a further extent
Version 1.0.4: Added TOK.DATEABS, TOK.TIMESTAMPABS, TOK.MEASUREMENT
