Trigram statistics for Icelandic
Overview
Icegrams is a Python 3.x package that encapsulates a large trigram library for Icelandic. (A trigram is a tuple of three consecutive words or tokens that appear in real-world text.)
Its almost 34 million trigrams are heavily compressed using radix tries and quasi-succinct indexes employing Elias-Fano encoding. This enables the compressed trigram file to be mapped directly into memory, with no decompression step beforehand, for fast queries (typically ~40 microseconds per lookup).
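To give a feel for the Elias-Fano encoding mentioned above, here is a minimal Python sketch. It is not the package's C/C++ implementation: it leaves out the partitioning and the constant-time select structures that a real quasi-succinct index relies on, and the names ef_encode and ef_access are made up for this illustration. It only shows the core idea of splitting each value of a sorted sequence into low bits, stored verbatim, and high bits, stored in unary in a bit vector.

```python
def ef_encode(values, universe):
    """Encode a sorted list of integers in the range 0..universe (illustrative only)."""
    if not values:
        raise ValueError("need at least one value")
    if any(a > b for a, b in zip(values, values[1:])):
        raise ValueError("values must be sorted in non-decreasing order")
    n = len(values)
    # Number of low bits per element: floor(log2(universe / n)), at least 0
    low_bits = max(0, (universe // n).bit_length() - 1)
    mask = (1 << low_bits) - 1
    # Low parts are stored as-is, one per element
    lows = [v & mask for v in values]
    # High parts are stored in unary: element i sets bit number (high_i + i)
    bits = [0] * (n + (universe >> low_bits) + 1)
    for i, v in enumerate(values):
        bits[(v >> low_bits) + i] = 1
    return low_bits, lows, bits


def ef_access(encoded, i):
    """Return the i-th value (0-based) without decoding the whole sequence."""
    low_bits, lows, bits = encoded
    seen = -1
    # Linear-scan select; real implementations answer this in constant time
    for pos, bit in enumerate(bits):
        if bit:
            seen += 1
            if seen == i:
                return ((pos - i) << low_bits) | lows[i]
    raise IndexError(i)


values = [5, 8, 8, 15, 32]
encoded = ef_encode(values, universe=36)
decoded = [ef_access(encoded, i) for i in range(len(values))]
```

Because each element can be read back on its own from the two bit arrays, a structure of this kind can be queried in place, which is what makes memory-mapping the compressed file practical.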
The Icegrams library is implemented in Python and C/C++, glued together via CFFI.
The trigram storage approach is based on a 2017 paper by Pibiri and Venturini, also referring to Ottaviano and Venturini (2014) regarding partitioned Elias-Fano indexes.
You can use Icegrams to obtain probabilities (relative frequencies) of over a million different unigrams (single words or tokens), of bigrams (pairs of words or tokens), or of trigrams. You can also ask it to return the N most likely successors to any unigram or bigram.
Icegrams is useful, for instance, in spelling correction, in predictive typing, in helping disabled people write text faster, and in various text generation, statistics and modelling tasks.
Icegrams is built on the database of Greynir.is, comprising over 6 million sentences parsed from Icelandic news articles.
Examples
    >>> from icegrams import Ngrams
    >>> ng = Ngrams()
    >>> ng.freq("Ísland")
    42019
    >>> ng.prob("Ísland")
    0.0003979926900206475
    >>> ng.logprob("Ísland")
    -7.8290769196308005
    >>> ng.freq("Katrín", "Jakobsdóttir")
    3518
    >>> ng.prob("Katrín", "Jakobsdóttir")
    0.23298013245033142
    >>> ng.prob("Katrín", "Júlíusdóttir")
    0.013642384105960274
    >>> ng.freq("velta", "fyrirtækisins", "er")
    5
    >>> ng.prob("velta", "fyrirtækisins", "er")
    0.2272727272727272
    >>> ng.prob("velta", "fyrirtækisins", "var")
    0.04545454545454544
    >>> ng.freq("xxx", "yyy", "zzz")
    1
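The numbers in the session above can be cross-checked with plain arithmetic, without the package installed. The sketch below copies the printed values and shows two things: the logprob value equals the natural logarithm of the prob value, and dividing a frequency by its probability recovers the count it was normalized by. Reading those implied counts as "occurrences of the preceding context" is an inference from the printed numbers, not something the text above states.

```python
import math

# Values copied verbatim from the interpreter session above
freq_island = 42019
prob_island = 0.0003979926900206475
logprob_island = -7.8290769196308005

# logprob agrees with the natural logarithm of prob
log_of_prob = math.log(prob_island)
logs_agree = math.isclose(log_of_prob, logprob_island, rel_tol=1e-9)

# freq / prob gives the implied normalizing count (inferred, see lead-in)
implied_unigram_total = freq_island / prob_island            # about 105.6 million

freq_bigram = 3518                                           # ("Katrín", "Jakobsdóttir")
prob_bigram = 0.23298013245033142
implied_bigram_context = freq_bigram / prob_bigram           # about 15100

freq_trigram = 5                                             # ("velta", "fyrirtækisins", "er")
prob_trigram = 0.2272727272727272
implied_trigram_context = freq_trigram / prob_trigram        # about 22

print(logs_agree, round(implied_bigram_context), round(implied_trigram_context))
```

The bigram and trigram values therefore behave like conditional relative frequencies: the probability of the last token given the token or tokens before it.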
Notes
Icegrams is built with a sliding window over the source text. This means that a sentence such as "Maðurinn borðaði ísinn." results in the following trigrams being added to the database:
    ("", "", "Maðurinn")
    ("", "Maðurinn", "borðaði")
    ("Maðurinn", "borðaði", "ísinn")
    ("borðaði", "ísinn", ".")
    ("ísinn", ".", "")
    (".", "", "")
The same sliding window strategy is applied for bigrams, so the following bigrams would be recorded for the same sentence:
    ("", "Maðurinn")
    ("Maðurinn", "borðaði")
    ("borðaði", "ísinn")
    ("ísinn", ".")
    (".", "")
This means that you can obtain the N unigrams that most often start a sentence by asking for ng.succ(N, "").
And, of course, four unigrams are also added, one for each token in the sentence.
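The sliding window described in these notes can be sketched in a few lines of Python. This is an illustration of the padding scheme only, not the code used to build the Icegrams database; the helper names padded_ngrams and sentence_starters are hypothetical, and the input is assumed to be already tokenized.

```python
from collections import Counter


def padded_ngrams(tokens, n):
    """Return all n-grams over tokens, padded with n-1 empty strings at each end."""
    pad = [""] * (n - 1)
    seq = pad + list(tokens) + pad
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def sentence_starters(sentences, top_n):
    """Count which tokens follow the empty start-of-sentence marker in bigrams."""
    counts = Counter()
    for tokens in sentences:
        for first, second in padded_ngrams(tokens, 2):
            if first == "" and second != "":
                counts[second] += 1
    return counts.most_common(top_n)


sentence = ["Maðurinn", "borðaði", "ísinn", "."]
trigrams = padded_ngrams(sentence, 3)   # the six trigrams listed above
bigrams = padded_ngrams(sentence, 2)    # the five bigrams listed above
unigrams = padded_ngrams(sentence, 1)   # four one-element tuples, one per token
```

Since every sentence contributes exactly one bigram whose first element is the empty string, counting the successors of "" is the same as counting sentence-initial tokens, which is why a query for the successors of "" yields the most common sentence starters.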
The tokenization of the source text into unigrams is done with the Tokenizer package and uses the rules documented there.
Prerequisites
This package runs on CPython 3.4 or newer, and on PyPy 3.5 or newer. It has been tested on Linux (gcc on x86-64 and ARMhf), macOS (clang) and Windows (MSVC).
If a binary wheel package isn't available on PyPI for your system, you may need to install the python3-dev package, and possibly a version-specific package such as python3.6-dev (or their Windows equivalents), to set up Icegrams successfully. This is because installing from a source distribution requires a C++ compiler and linker:
    # Debian or Ubuntu:
    sudo apt-get install python3-dev
    sudo apt-get install python3.6-dev
Installation
To install this package:
$ pip install icegrams
If you want to be able to edit the source, install the package as follows (assuming you have git installed):
    $ git clone https://github.com/vthorsteinsson/Icegrams
    $ cd Icegrams
    $ # [ Activate your virtualenv here if you have one ]
    $ python setup.py develop
The package source code is now in ./src/icegrams.
Tests
To run the built-in tests, install pytest, cd to your Icegrams subdirectory (and optionally activate your virtualenv), then run:
$ python -m pytest
Reference
TBD