
Colibri Core is an NLP tool as well as a C++ and Python library (all included in this package) for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool colibri-patternmodeller, which allows you to build, view, manipulate and query pattern models.

Colibri Core

by Maarten van Gompel, proycon@anaproy.nl, Radboud University Nijmegen

Licensed under GPLv3 (See http://www.gnu.org/licenses/gpl-3.0.html)

Colibri Core is software to quickly and efficiently count and extract patterns from large corpus data, to extract various statistics on the extracted patterns, and to compute relations between the extracted patterns. The employed notion of pattern or construction encompasses the following categories:

  • n-gram -- n consecutive words
  • skipgram -- An abstract pattern of predetermined length with one or multiple gaps (of specific size).
  • flexgram -- An abstract pattern with one or more gaps of variable size.
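
The three categories can be illustrated with a minimal pure-Python sketch. It does not use the Colibri Core API; the function names and the gap markers `{*}` and `{**}` are assumptions chosen for this illustration only.

```python
from itertools import combinations

GAP = "{*}"        # hypothetical marker for a gap of exactly one token
FLEXGAP = "{**}"   # hypothetical marker for a gap of variable size

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skipgrams(ngram):
    """Return all skipgrams made by replacing inner tokens with gaps.

    The first and last token are always kept, so the total length is fixed.
    """
    inner = range(1, len(ngram) - 1)
    result = []
    for k in range(1, len(ngram) - 1):
        for positions in combinations(inner, k):
            result.append(tuple(GAP if i in positions else tok
                                for i, tok in enumerate(ngram)))
    return result

def flexgram(skipgram):
    """Collapse each run of fixed gaps into a single variable-size gap."""
    out = []
    for tok in skipgram:
        if tok == GAP:
            if not out or out[-1] != FLEXGAP:
                out.append(FLEXGAP)
        else:
            out.append(tok)
    return tuple(out)

tokens = "to be or not to be".split()
print(ngrams(tokens, 3))
print(skipgrams(("to", "be", "or", "not")))
print(flexgram(("to", GAP, GAP, "not")))
```

A skipgram keeps its overall length because every gap stands for exactly one token, whereas the flexgram only records that something of unknown length sits between the remaining words.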

N-gram extraction may seem fairly trivial at first: with a few lines in your favourite scripting language, you can move a simple sliding window of size n over your corpus and store the results in some kind of hashmap. This trivial approach, however, makes an unnecessarily high demand on memory resources, which often becomes prohibitive when unleashed on large corpora. Colibri Core tries to minimise these space requirements in several ways:

  • Compressed binary representation -- Each word type is assigned a numeric class, which is encoded in a compact binary format in which highly frequent classes take less space than less frequent classes. Colibri Core always uses this representation rather than a full string representation, both on disk and in memory.
  • Informed iterative counting -- Counting is performed more intelligently by iteratively processing the corpus in several passes and quickly discarding patterns that won't reach the desired occurrence threshold.
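
Both ideas can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Colibri Core's implementation: the variable-length byte encoding below is a generic stand-in for the actual binary format, and all function names are hypothetical.

```python
from collections import Counter

def build_classes(tokens):
    """Assign small integers to frequent word types (1 = most frequent)."""
    freq = Counter(tokens)
    # Sort by descending frequency, then alphabetically for determinism.
    ordered = sorted(freq, key=lambda w: (-freq[w], w))
    return {word: cls for cls, word in enumerate(ordered, start=1)}

def encode_varint(number):
    """Encode a positive integer in 7-bit groups; small numbers fit in one byte."""
    out = bytearray()
    while True:
        byte = number & 0x7F
        number >>= 7
        if number:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode(tokens, classes):
    """Encode a token sequence as the concatenation of its class numbers."""
    return b"".join(encode_varint(classes[t]) for t in tokens)

def count_iteratively(tokens, threshold, maxlength):
    """Count n-grams for n = 1..maxlength, one pass per n, pruning early.

    An n-gram can only reach the threshold if both (n-1)-grams it contains
    reached it in the previous pass, so all other candidates are never stored.
    """
    kept = {}
    previous = set()
    for n in range(1, maxlength + 1):
        counts = Counter()
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if n > 1 and (gram[:-1] not in previous or gram[1:] not in previous):
                continue  # cannot reach the threshold, discard immediately
            counts[gram] += 1
        previous = {g for g, c in counts.items() if c >= threshold}
        if not previous:
            break  # no longer pattern can qualify either
        kept.update({g: counts[g] for g in previous})
    return kept

tokens = "to be or not to be".split()
classes = build_classes(tokens)
print(classes)
print(len(encode(tokens, classes)), "bytes")
print(count_iteratively(tokens, threshold=2, maxlength=3))
```

The pruning step is what keeps memory use low: candidates that cannot reach the threshold are dropped before they are ever added to the table.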

Skipgram and flexgram extraction are computationally more demanding but have been implemented with similar optimisations. Skipgrams are computed by abstracting over n-grams, and flexgrams in turn are computed either by abstracting over skipgrams, or directly from n-grams on the basis of co-occurrence information (pointwise mutual information).
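
As a reminder of the co-occurrence measure mentioned here, pointwise mutual information compares how often two patterns occur together with how often they would co-occur if they were independent. The sketch below computes it from raw counts; the function name and its parameters are assumptions for illustration, and it does not show how Colibri Core applies the measure internally.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information (in bits) from raw counts.

    count_xy: number of contexts in which x and y occur together
    count_x, count_y: number of contexts containing x and y respectively
    total: total number of contexts
    """
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# x and y co-occur ten times more often than chance would predict.
print(pmi(count_xy=10, count_x=20, count_y=50, total=1000))
```

A value near zero indicates independence; a clearly positive value suggests the two patterns belong together.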

At the heart of the software is the notion of pattern models. The core tool, to be used from the command line, is colibri-patternmodeller, which enables you to build pattern models, generate statistical reports, query for specific patterns and relations, and manipulate models.

A pattern model is simply a collection of extracted patterns (any of the three categories) and their counts from a specific corpus. Pattern models come in two varieties:

  • Unindexed Pattern Model -- The simplest form, which simply stores the patterns and their counts.
  • Indexed Pattern Model -- The more informed form, which retains all indices to the original corpus, at the cost of more memory and disk space.

The Indexed Pattern Model is much more powerful, and allows more statistics and relations to be inferred.
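
The difference between the two varieties can be sketched with plain Python dictionaries. This is a conceptual illustration only: the function names and the (sentence, token) index layout are assumptions, not the library's data structures.

```python
from collections import Counter, defaultdict

def unindexed_model(sentences, n):
    """Map each n-gram to its occurrence count only."""
    counts = Counter()
    for sentence in sentences:
        for i in range(len(sentence) - n + 1):
            counts[tuple(sentence[i:i + n])] += 1
    return counts

def indexed_model(sentences, n):
    """Map each n-gram to every (sentence number, token offset) where it occurs."""
    index = defaultdict(list)
    for s, sentence in enumerate(sentences, start=1):
        for i in range(len(sentence) - n + 1):
            index[tuple(sentence[i:i + n])].append((s, i))
    return index

sentences = [
    "to be or not to be".split(),
    "to be continued".split(),
]
counts = unindexed_model(sentences, 2)
index = indexed_model(sentences, 2)

print(counts[("to", "be")])       # count only
print(index[("to", "be")])        # positions; the count is len() of this list
```

Because the indexed form knows where each pattern occurs, relations such as which patterns co-occur in the same sentence can be derived from it, at the price of storing one entry per occurrence instead of one integer per pattern.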

The generation of pattern models is optionally parametrised by a minimum occurrence threshold, a maximum pattern length, and a lower bound on the number of different types that may instantiate a skipgram (i.e. possible fillings of the gaps).
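
The last of these parameters is perhaps the least obvious. The sketch below, which uses hypothetical names and a hypothetical `{*}` gap marker and is restricted to trigrams with a single middle gap, shows what it means to require a minimum number of distinct fillers for a skipgram.

```python
from collections import defaultdict

GAP = "{*}"  # hypothetical gap marker used only in this sketch

def skipgram_fillers(trigrams):
    """Map each 'left {*} right' skipgram to the set of tokens filling its gap."""
    fillers = defaultdict(set)
    for left, middle, right in trigrams:
        fillers[(left, GAP, right)].add(middle)
    return fillers

def filter_by_types(fillers, min_types):
    """Keep only skipgrams instantiated by at least min_types distinct fillers."""
    return {sg: f for sg, f in fillers.items() if len(f) >= min_types}

trigrams = [
    ("the", "big", "dog"),
    ("the", "small", "dog"),
    ("the", "big", "cat"),
]
fillers = skipgram_fillers(trigrams)
print(filter_by_types(fillers, min_types=2))
```

With a bound of 2, only "the {*} dog" survives, since "the {*} cat" is instantiated by a single type and is therefore no more general than the n-gram it came from.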

Technical Details

Colibri Core is available as a collection of standalone command-line tools, as a C++ library, and as a Python library.

Please consult the full documentation at https://proycon.github.io/colibri-core

Installation

Python binding

For the Colibri Core Python library, just install using:

pip install colibricore

We strongly recommend you use a virtual environment for this. Note that the Python binding is only available for Unix-like systems; Windows is not supported.

Installation from source

For the command-line tools, check if your distribution has a package available. There are packages for Alpine Linux (apk add colibri-core) and for macOS with homebrew (brew tap fbkarsdorp/homebrew-lamachine && brew install colibri-core). Note that these do not contain the Python binding!

If no packages are available, you will need to compile from source or use the container build (e.g. Docker) as explained later on.

To compile from source, you need a sane build environment. Install the necessary dependencies for your distribution.

For Debian/Ubuntu:

$ sudo apt-get install make gcc g++ pkg-config autoconf-archive libtool autotools-dev libbz2-dev zlib1g-dev libtar-dev python3 python3-dev cython3

For RedHat-based systems (run as root):

# yum install pkgconfig libtool autoconf automake autoconf-archive make gcc gcc-c++ libtar libtar-devel python3 python3-devel zlib zlib-devel python3-pip bzip2 bzip2-devel cython3

For macOS with homebrew:

$ brew install autoconf automake libtool autoconf-archive python3 pkg-config

Then clone this repository and install as follows:

$ bash bootstrap
$ ./configure
$ make
$ sudo make install

Container usage

The Colibri Core command-line tools are also available as an OCI/Docker container.

A pre-made container image can be obtained from Docker Hub as follows:

docker pull proycon/colibri-core

You can also build a container image yourself as follows. Make sure you are in the root of this repository:

docker build -t proycon/colibri-core .

This builds the latest stable release. If you want to use the latest development version from the git repository instead, do:

docker build -t proycon/colibri-core --build-arg VERSION=development .

Run the Colibri Core container interactively as follows. It will drop you into a shell where the various command-line tools are available:

docker run -t -i proycon/colibri-core

Add the -v /path/to/your/data:/data parameter if you want to mount your data volume into the container at /data.

Publication

This software is extensively described in the following peer-reviewed publication:

van Gompel, M and van den Bosch, A (2016)
Efficient n-gram, Skipgram and Flexgram Modelling with Colibri Core.
Journal of Open Research Software
4: e30, DOI: http://dx.doi.org/10.5334/jors.105

Please cite this publication if you make use of Colibri Core in your work.
