A lightweight Python package to work with ngrams and skipgrams
Project description
nskipgrams is a lightweight Python package to work with ngrams and skipgrams. Fields of study using ngrams and skipgrams from sequential data, especially computational linguistics and natural language processing, will find this package helpful.
Highlights:
Simple: Store, access, and count ngrams and skipgrams – that’s it!
Memory-efficient: Tries are used for internal storage.
Hassle-free: No dependencies. Written in pure Python. Today is a great day.
Download and Install
To download and install the most recent version:
$ pip install --upgrade nskipgrams
Usage
The following are defined:
- Ngrams
The class Ngrams handles a collection of ngrams.
The function ngrams_from_seq yields ngrams for a given sequence.
- Skipgrams
The class Skipgrams handles a collection of skipgrams.
The function skipgrams_from_seq yields skipgrams for a given sequence.
Getting Ngrams from a Sequence
If you simply need ngrams from a sequence, ngrams_from_seq is what you’re looking for:
>>> from nskipgrams import ngrams_from_seq
>>> for ngram in ngrams_from_seq("abcdef", n=2):
... print(ngram)
('a', 'b')
('b', 'c')
('c', 'd')
('d', 'e')
('e', 'f')
Initializing an Ngram Collection
>>> from nskipgrams import Ngrams
>>> char_ngrams = Ngrams(n=3) # handles unigrams, bigrams, and trigrams
Adding Ngrams
>>> char_ngrams.add_from_seq("my cats")
>>> char_ngrams.add_from_seq("your cat", count=2)
Here, a sequence is anything that can be iterated over, and the corresponding ngrams are extracted from the individual elements off of the sequence. For example, if the sequence is a text string like "my cats" above, then the ngrams are character-based (hence the chosen variable name char_ngrams).
To add a single ngram:
>>> char_ngrams.add(("y", "o", "u"))
As a best practice, it is recommended that an ngram be represented as a tuple regardless of what the individual elements are, e.g., ("y", "o", "u") for character-based ngrams. As output examples show below, the tuple data type is also what this package uses to represent ngrams.
Accessing Ngrams
>>> for ngram, count in char_ngrams.ngrams_with_counts(n=1): # unigrams
... print(ngram, count)
...
('m',), 1
('y',), 3
(' ',), 3
('c',), 3
('a',), 3
('t',), 3
('s',), 1
('o',), 2
('u',), 2
('r',), 2
>>>
>>> for ngram, count in char_ngrams.ngrams_with_counts(n=2): # bigrams
... print(ngram, count)
...
('m', 'y'), 1
('y', ' '), 1
('y', 'o'), 2
(' ', 'c'), 3
('c', 'a'), 3
('a', 't'), 3
('t', 's'), 1
('o', 'u'), 2
('u', 'r'), 2
('r', ' '), 2
>>>
>>> for ngram, count in char_ngrams.ngrams_with_counts(n=3): # trigrams
... print(ngram, count)
...
('m', 'y', ' '), 1
('y', ' ', 'c'), 1
('y', 'o', 'u'), 3
(' ', 'c', 'a'), 3
('c', 'a', 't'), 3
('a', 't', 's'), 1
('o', 'u', 'r'), 2
('u', 'r', ' '), 2
('r', ' ', 'c'), 2
Accessing Ngrams with a Specific Prefix
>>> for ngram, count in char_ngrams.ngrams_with_counts(n=3, prefix=("y",)):
... print(ngram, count)
...
('y', ' ', 'c'), 1
('y', 'o', 'u'), 3
Accessing the Count of a Specific Ngram
>>> char_ngrams.count(("c", "a", "t"))
3
Checking Membership
To check if an ngram has an exact match in the collection so far:
>>> ("c", "a", "t") in char_ngrams
True
Combining Collections of Ngrams
To combine collections of ngrams (e.g., when you process data sources in parallel and have multiple Ngrams objects):
>>> char_ngrams1 = Ngrams(n=2)
>>> char_ngrams1.add_from_seq("my cat")
>>> set(char_ngrams1.ngrams_with_counts(n=2))
{((' ', 'c'), 1),
(('a', 't'), 1),
(('c', 'a'), 1),
(('m', 'y'), 1),
(('y', ' '), 1)}
>>>
>>> char_ngrams2 = Ngrams(n=2)
>>> char_ngrams2.add_from_seq("your cats")
>>> set(char_ngrams2.ngrams_with_counts(n=2))
{((' ', 'c'), 1),
(('a', 't'), 1),
(('c', 'a'), 1),
(('o', 'u'), 1),
(('r', ' '), 1),
(('t', 's'), 1),
(('u', 'r'), 1),
(('y', 'o'), 1)}
>>>
>>> char_ngrams3 = Ngrams(n=2)
>>> char_ngrams3.add_from_seq("her cats")
>>> set(char_ngrams3.ngrams_with_counts(n=2))
{((' ', 'c'), 1),
(('a', 't'), 1),
(('c', 'a'), 1),
(('e', 'r'), 1),
(('h', 'e'), 1),
(('r', ' '), 1),
(('t', 's'), 1)}
>>>
>>> char_ngrams1.combine(char_ngrams2, char_ngrams3) # `combine` takes as many Ngrams objects as desired
>>> set(char_ngrams1.ngrams_with_counts(n=2))
{((' ', 'c'), 3),
(('a', 't'), 3),
(('c', 'a'), 3),
(('e', 'r'), 1),
(('h', 'e'), 1),
(('m', 'y'), 1),
(('o', 'u'), 1),
(('r', ' '), 2),
(('t', 's'), 2),
(('u', 'r'), 1),
(('y', ' '), 1),
(('y', 'o'), 1)}
If you don’t want to mutate any of the Ngrams instances (the combine method works in-place and mutates these_ngrams when these_ngrams.combine is called), then you can create an empty ngram collection and combine into it all of your ngrams:
>>> collections = [char_ngrams1, char_ngrams2, char_ngrams3]
>>> all_ngrams = Ngrams(n=2) # A new, empty collection of ngrams
>>> all_ngrams.combine(*collections)
Any “Sequences” and their Corresponding “Ngrams” Work
While the examples above use text strings as sequences and character-based ngrams, another common usage in computational linguistics and NLP is to have segmented phrases/sentences as sequences and word-based ngrams:
>>> from nskipgrams import Ngrams
>>> word_ngrams = Ngrams(n=2)
>>> word_ngrams.add_from_seq(("in", "the", "beginning"))
>>> word_ngrams.add_from_seq(("in", "the", "end"))
>>> for ngram, count in word_ngrams.ngrams_with_counts(n=2):
... print(ngram, count)
...
('in', 'the'), 2
('the', 'beginning'), 1
('the', 'end'), 1
Skipgrams
Ngrams are a special case of skipgrams, with skip = 0. The class Skipgrams works the same as Ngrams, with the following differences:
Skipgrams has the method skipgrams_with_counts rather than ngrams_with_counts. skipgrams_with_counts also has the keyword argument skip (in addition to n and prefix).
For Skipgrams, the methods add and count, as well as collection instantiation (i.e., __init__), also have a meaningful skip keyword argument.
The function skipgrams_from_seq works the same as ngrams_from_seq, but has the skip keyword argument (in addition to seq and n).
Citation
Lee, Jackson L. 2023. nskipgrams: A lightweight Python package to work with ngrams and skipgrams. https://doi.org/10.5281/zenodo.4002095
@software{leengrams,
author = {Jackson L. Lee},
title = {nskipgrams: A lightweight Python package to work with ngrams and skipgrams},
year = 2021,
doi = {10.5281/zenodo.4002095},
url = {https://doi.org/10.5281/zenodo.4002095}
}
License
MIT License. Please see LICENSE.txt in the GitHub source code for details.
Changelog
Please see CHANGELOG.md in the GitHub source code.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for nskipgrams-0.5.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | bf9b50453a623254fff1e4d953a06969e23207b74d433458dd99ad4e1bd82b68 |
|
MD5 | cb22c17ee19f5745e89cb35c208bc791 |
|
BLAKE2b-256 | f87e637bf86ee93dc81c6760b1765ab28f644bcbca81d0750e42b0033cec4ac5 |