ngrammatcher

Matches pre-defined ngrams from a given list of words/tokens.

Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent

Project description

NGramMatcher is a module that can be used to extract n-grams, tokens, or keywords from a list of tokens.

Installation

$ pip install ngrammatcher

Usage

Overview

from ngrammatcher import NGramMatcher

# initialize an NGramMatcher object
ngm = NGramMatcher()

# add ngrams and their corresponding data
ngm.insert_ngram(['programming','language'], 'programming language')
ngm.insert_ngram(['Python'], 'Python')

# match ngrams
ngm.match_ngrams(['Python', 'is', 'a', 'programming', 'language'])
# ['Python', 'programming language']
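
The comments further down refer to the n-grams being stored in a trie. To illustrate how trie-based matching of this kind can work, here is a minimal sketch. `TinyNGramTrie` and its methods are hypothetical names used only for illustration and are not part of ngrammatcher, and the greedy, longest-match, non-overlapping scan is an assumption rather than the library's documented behavior.

```python
from typing import Any, Dict, List, Sequence

_MISSING = object()  # sentinel: distinguishes "no data stored" from data that is None


class _Node:
    """One trie node: children keyed by token, plus an optional payload."""

    def __init__(self) -> None:
        self.children: Dict[str, "_Node"] = {}
        self.data: Any = _MISSING


class TinyNGramTrie:
    """Hypothetical stand-in for illustration; NOT the ngrammatcher implementation."""

    def __init__(self) -> None:
        self.root = _Node()

    def insert(self, ngram: Sequence[str], data: Any) -> None:
        # Walk (and create) one node per token, then attach the data at the end.
        node = self.root
        for token in ngram:
            node = node.children.setdefault(token, _Node())
        node.data = data

    def match(self, tokens: Sequence[str]) -> List[Any]:
        # Assumption: scan left to right and keep the longest n-gram at each position.
        results: List[Any] = []
        i = 0
        while i < len(tokens):
            node = self.root
            best_data, best_end = _MISSING, i
            j = i
            while j < len(tokens) and tokens[j] in node.children:
                node = node.children[tokens[j]]
                j += 1
                if node.data is not _MISSING:
                    best_data, best_end = node.data, j
            if best_data is not _MISSING:
                results.append(best_data)
                i = best_end  # skip past the matched n-gram
            else:
                i += 1  # no n-gram starts here; move on one token
        return results


trie = TinyNGramTrie()
trie.insert(["programming", "language"], "programming language")
trie.insert(["Python"], "Python")
matches = trie.match(["Python", "is", "a", "programming", "language"])
print(matches)  # ['Python', 'programming language']
```

Because lookups follow one trie edge per token, a scan of this shape does work proportional to the tokens it visits rather than to the number of stored n-grams.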

Adding n-grams

from ngrammatcher import NGramMatcher
ngm = NGramMatcher()

# You can add n-grams of any size
ngm.insert_ngram(['programming','language'], 'programming language') # 2-gram
ngm.insert_ngram(['Python'], 'Python') # 1-gram
ngm.insert_ngram(['a']*10000, 'a'*10000) # 10_000-gram

# you can map any kind of data to an n-gram
data = {
    'word': 'programming language',
    'wikipedia': 'https://en.wikipedia.org/wiki/Programming_language',
    'desc': 'A programming language is any set of rules that converts...'
}
ngm.insert_ngram(['programming', 'language'], data)

# you can also insert n-grams using dictionary syntax
ngm[['c','plus','plus']] = 'c++'

# or add a word as a sequence of characters (['t', 'e', 's', 't'])
ngm.insert_ngram(list('test'), 'test')

Finding n-grams

from ngrammatcher import NGramMatcher
ngm = NGramMatcher()
ngm.insert_ngram(['programming','language'], 'programming language')
ngm.insert_ngram(['Python'], 'Python')

# here we use spaCy to create tokens
# (the model must be downloaded first: python -m spacy download en_core_web_lg)
import spacy
nlp = spacy.load('en_core_web_lg')
text = 'Python is a programming language'

tokens = [tok.text for tok in nlp(text)]

# find n-grams
ngm.match_ngrams(tokens)
# ['Python', 'programming language']
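
The matcher only needs a list of string tokens, so spaCy is one option among several. As a lightweight alternative, the sketch below builds tokens with the standard library only. `simple_tokenize` is a hypothetical helper, not part of ngrammatcher, and its output can differ from spaCy's tokenization on contractions, hyphenated words, and similar cases.

```python
import re
from typing import List


def simple_tokenize(text: str) -> List[str]:
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a word; [^\w\s] matches one non-space, non-word character.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("Python is a programming language.")
print(tokens)  # ['Python', 'is', 'a', 'programming', 'language', '.']
```

The resulting list can be passed to `match_ngrams` in the same way as the spaCy tokens above.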

Additional Functionality

from ngrammatcher import NGramMatcher
ngm = NGramMatcher()
ngm.insert_ngram(['programming','language'], 'programming language')
ngm.insert_ngram(['Python'], 'Python')

# get all n-grams in the trie
ngm.get_all_ngrams()
# [(['Python'], 'Python'), (['programming', 'language'], 'programming language')]

# you can exclude the data object too
ngm.get_all_ngrams(keys_only=True)
# [['Python'], ['programming', 'language']]

# delete n-grams (returns True if deleted, False otherwise)
ngm.delete_ngram(['Python'])
# True


# Additional Quality-of-Life functionality
len(ngm) # get the number of n-grams in trie

['programming', 'language'] in ngm # check if n-gram is in trie

ngm[['programming', 'language']] = 'PL' # insert an n-gram into the trie

ngm[['programming', 'language']] # get the data for a specific n-gram

del ngm[['programming', 'language']] # delete an ngram using del
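
The container-style operations above correspond to Python's special methods (`__len__`, `__contains__`, `__setitem__`, `__getitem__`, `__delitem__`). The sketch below shows one way a class can accept list keys with that syntax. `NGramStore` is a hypothetical, dict-backed stand-in and not the library's implementation; converting each list to a tuple and raising `KeyError` for a missing n-gram are assumptions made for this example.

```python
from typing import Any, Dict, Sequence, Tuple


class NGramStore:
    """Hypothetical dict-backed container; NOT the ngrammatcher implementation."""

    def __init__(self) -> None:
        self._items: Dict[Tuple[str, ...], Any] = {}

    def __setitem__(self, ngram: Sequence[str], data: Any) -> None:
        # Lists are unhashable, so store the n-gram as a tuple.
        self._items[tuple(ngram)] = data

    def __getitem__(self, ngram: Sequence[str]) -> Any:
        return self._items[tuple(ngram)]  # raises KeyError if absent (assumption)

    def __delitem__(self, ngram: Sequence[str]) -> None:
        del self._items[tuple(ngram)]

    def __contains__(self, ngram: Sequence[str]) -> bool:
        return tuple(ngram) in self._items

    def __len__(self) -> int:
        return len(self._items)


store = NGramStore()
store[["programming", "language"]] = "PL"
print(len(store))                              # 1
print(["programming", "language"] in store)    # True
print(store[["programming", "language"]])      # PL
del store[["programming", "language"]]
print(len(store))                              # 0
```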

Test

$ git clone https://github.com/jwnz/ngrammatcher
$ cd ngrammatcher
$ pip install pytest
$ python setup.py test

Release history

1.0 (Jul 1, 2022)

Download files


Source Distribution

ngrammatcher-1.0.tar.gz (6.7 kB), uploaded Jul 1, 2022

Hashes for ngrammatcher-1.0.tar.gz
Algorithm	Hash digest
SHA256	`8a5935839dd2fbbf2da0483b39472445073176c63c26a7155c7b99aa55b42a87`
MD5	`79c5e0d06ad2726236c617ca6bdc0c79`
BLAKE2b-256	`066087a4cc31d524a3deaa7828a30b5fda43db35f3a10d9fd2c6f4a6e7da0878`