Trigram statistics for Icelandic

Overview

Icegrams is a Python 3.x package that encapsulates a large trigram library for Icelandic. (A trigram is a tuple of three consecutive words or tokens that appear in real-world text.)

The almost 34 million trigrams are heavily compressed using radix tries and quasi-succinct indexes employing Elias-Fano encoding. This enables the compressed trigram file to be mapped directly into memory, with no prior decompression step, for fast queries (typically ~40 microseconds per lookup).
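
To give a feel for the encoding mentioned above, here is a minimal, unoptimized sketch of plain (non-partitioned) Elias-Fano coding of a sorted integer list. It is not the Icegrams C/C++ implementation: the names ef_encode and ef_decode are made up for illustration, the bits are kept in ordinary Python lists rather than packed, and no select structure for random access is built.

```python
from typing import List, Tuple


def ef_encode(values: List[int], universe: int) -> Tuple[int, List[int], List[int]]:
    """Encode a non-decreasing list of integers in [0, universe).

    Returns (low_bits, lows, highs): the number of low bits per element,
    the low bits of each element, and the unary-coded high parts as a
    list of 0/1 flags.
    """
    n = len(values)
    # Low bits per element: floor(log2(universe / n)), but never negative
    low_bits = max(0, (universe // n).bit_length() - 1) if n else 0
    mask = (1 << low_bits) - 1
    lows = [v & mask for v in values]
    # The i-th element sets the flag at position (high part + i)
    highs = [0] * (n + (universe >> low_bits) + 1)
    for i, v in enumerate(values):
        highs[(v >> low_bits) + i] = 1
    return low_bits, lows, highs


def ef_decode(low_bits: int, lows: List[int], highs: List[int]) -> List[int]:
    """Rebuild the original list from the output of ef_encode()."""
    out = []
    i = 0  # index of the next element to decode
    for pos, flag in enumerate(highs):
        if flag:
            out.append(((pos - i) << low_bits) | lows[i])
            i += 1
    return out


values = [3, 4, 7, 13, 14, 15, 21, 43]
encoded = ef_encode(values, universe=44)
assert ef_decode(*encoded) == values
```

Each element costs low_bits bits for its low part plus roughly two bits in the flag list, which is why the representation stays compact while remaining directly readable without a separate decompression pass.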

The Icegrams library is implemented in Python and C/C++, glued together via CFFI.

The trigram storage approach is based on a 2017 paper by Pibiri and Venturini, also referring to Ottaviano and Venturini (2014) regarding partitioned Elias-Fano indexes.

You can use Icegrams to obtain probabilities (relative frequencies) of over a million different unigrams (single words or tokens), of bigrams (pairs of words or tokens), and of trigrams. You can also ask it to return the N most likely successors to any unigram or bigram.
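
As a rough illustration of what "the N most likely successors" means, the sketch below ranks the successors of a word from a handful of bigram counts. The counts are made up, the function name successors is hypothetical, and the return format (a list of word and relative-frequency pairs) is an assumption; none of this is taken from the Icegrams database or its API.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Toy bigram counts (invented numbers, for illustration only)
bigram_counts: Dict[Tuple[str, str], int] = {
    ("góðan", "dag"): 6,
    ("góðan", "daginn"): 3,
    ("góðan", "mann"): 1,
    ("góða", "nótt"): 4,
}


def successors(counts: Dict[Tuple[str, str], int], n: int, first: str) -> List[Tuple[str, float]]:
    """Return the n most frequent words following `first`, each paired
    with its relative frequency among all successors of `first`."""
    following = Counter({w2: c for (w1, w2), c in counts.items() if w1 == first})
    total = sum(following.values())
    return [(word, count / total) for word, count in following.most_common(n)]


top = successors(bigram_counts, 2, "góðan")
print(top)
```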

Icegrams is useful, for instance, in spelling correction, in predictive typing, in helping disabled people write text faster, and in various text generation, statistics and modelling tasks.

Icegrams is built on the database of Greynir.is, comprising over 6 million sentences parsed from Icelandic news articles.

Examples

>>> from icegrams import Ngrams
>>> ng = Ngrams()
>>> ng.freq("Ísland")
42019
>>> ng.prob("Ísland")
0.0003979926900206475
>>> ng.logprob("Ísland")
-7.8290769196308005
>>> ng.freq("Katrín", "Jakobsdóttir")
3518
>>> ng.prob("Katrín", "Jakobsdóttir")
0.23298013245033142
>>> ng.prob("Katrín", "Júlíusdóttir")
0.013642384105960274
>>> ng.freq("velta", "fyrirtækisins", "er")
5
>>> ng.prob("velta", "fyrirtækisins", "er")
0.2272727272727272
>>> ng.prob("velta", "fyrirtækisins", "var")
0.04545454545454544
>>> ng.freq("xxx", "yyy", "zzz")
1
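
The printed figures can be cross-checked with a little arithmetic. The sketch below uses only numbers copied from the session above. Reading prob() for a trigram as the probability of the last word given the two preceding ones, and the denominator 22, are inferences from those numbers rather than documented behaviour.

```python
import math

p_island = 0.0003979926900206475   # ng.prob("Ísland")
lp_island = -7.8290769196308005    # ng.logprob("Ísland")

# logprob agrees with the natural logarithm of prob
assert math.isclose(math.log(p_island), lp_island, rel_tol=1e-9)

p_er = 0.2272727272727272          # ng.prob("velta", "fyrirtækisins", "er")
p_var = 0.04545454545454544        # ng.prob("velta", "fyrirtækisins", "var")

# The trigram frequency of 5 is consistent with p_er being 5/22 (inferred),
# and "er" comes out five times as likely as "var" in this context
assert math.isclose(p_er, 5 / 22, rel_tol=1e-9)
assert math.isclose(p_er / p_var, 5.0, rel_tol=1e-9)
```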

Notes

Icegrams is built with a sliding window over the source text. This means that a sentence such as "Maðurinn borðaði ísinn." results in the following trigrams being added to the database:

("", "", "Maðurinn")
("", "Maðurinn", "borðaði")
("Maðurinn", "borðaði", "ísinn")
("borðaði", "ísinn", ".")
("ísinn", ".", "")
(".", "", "")

The same sliding window strategy is applied for bigrams, so the following bigrams would be recorded for the same sentence:

("", "Maðurinn")
("Maðurinn", "borðaði")
("borðaði", "ísinn")
("ísinn", ".")
(".", "")

This means that you can obtain the N unigrams that most often start a sentence by asking for ng.succ(N, "").

And, of course, four unigrams are also added, one for each token in the sentence.
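
The sliding window described above can be sketched in a few lines of Python. This is a minimal illustration, not the code Icegrams uses to build its database: the function name padded_ngrams is made up, and the token list is written out by hand rather than produced by the Tokenizer package.

```python
from typing import List, Tuple


def padded_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Slide a window of width n over the tokens, padding the sentence
    with n - 1 empty strings at each end."""
    pad = [""] * (n - 1)
    seq = pad + tokens + pad
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


tokens = ["Maðurinn", "borðaði", "ísinn", "."]

trigrams = padded_ngrams(tokens, 3)
bigrams = padded_ngrams(tokens, 2)
unigrams = padded_ngrams(tokens, 1)

for trigram in trigrams:
    print(trigram)
```

With n = 3 this yields the six trigrams listed above, with n = 2 the five bigrams, and with n = 1 the four unpadded unigrams.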

The tokenization of the source text into unigrams is done with the Tokenizer package and uses the rules documented there.

Prerequisites

This package runs on CPython 3.4 or newer, and on PyPy 3.5 or newer. It has been tested on Linux (gcc on x86-64 and ARMhf), macOS (clang) and Windows (MSVC).

If a binary wheel package isn't available on PyPI for your system, you may need to have the python3-dev package, and possibly python3.6-dev (or their Windows equivalents), installed to set up Icegrams successfully. This is because installing from a source distribution requires a C++ compiler and linker:

# Debian or Ubuntu:
sudo apt-get install python3-dev
sudo apt-get install python3.6-dev

Installation

To install this package:

$ pip install icegrams

If you want to be able to edit the source, install it like this (assuming you have git installed):

$ git clone https://github.com/vthorsteinsson/Icegrams
$ cd Icegrams
$ # [ Activate your virtualenv here if you have one ]
$ python setup.py develop

The package source code is now in ./src/icegrams.

Tests

To run the built-in tests, install pytest, cd to your Icegrams subdirectory (and optionally activate your virtualenv), then run:

$ python -m pytest

Reference

TBD
