Matches predefined n-grams in a given list of words or tokens.

NGramMatcher is a module that can be used to extract n-grams, tokens, or keywords from a list of tokens.

Installation

$ pip install ngrammatcher

Usage

Overview

from ngrammatcher import NGramMatcher

# initialize an NGramMatcher object
ngm = NGramMatcher()

# add ngrams and their corresponding data
ngm.insert_ngram(['programming','language'], 'programming language')
ngm.insert_ngram(['Python'], 'Python')

# match ngrams
ngm.match_ngrams(['Python', 'is', 'a', 'programming', 'language'])
# ['Python', 'programming language']

Adding n-grams

from ngrammatcher import NGramMatcher
ngm = NGramMatcher()

# You can add n-grams of any size
ngm.insert_ngram(['programming','language'], 'programming language') # 2-gram
ngm.insert_ngram(['Python'], 'Python') # 1-gram
ngm.insert_ngram(['a']*10000, 'a'*10000) # 10_000-gram

# you can map any kind of data to an n-gram
data = {
    'word': 'programming language',
    'wikipedia': 'https://en.wikipedia.org/wiki/Programming_language',
    'desc': 'A programming language is any set of rules that converts...'
}
ngm.insert_ngram(['programming', 'language'], data)

# you can also insert n-grams using dictionary syntax
ngm[['c','plus','plus']] = 'c++'

# or add a word as a sequence of characters
ngm.insert_ngram(list('test'), 'test')

Finding n-grams

from ngrammatcher import NGramMatcher
ngm = NGramMatcher()
ngm.insert_ngram(['programming','language'], 'programming language')
ngm.insert_ngram(['Python'], 'Python')

# here we will use spacy to create tokens
import spacy
nlp = spacy.load('en_core_web_lg')
text = 'Python is a programming language'

tokens = [tok.text for tok in nlp(text)]

# find n-grams
ngm.match_ngrams(tokens)
# ['Python', 'programming language']
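
The comments in this README refer to a trie, so it may help to see how a token trie can support this kind of lookup. The following is a minimal, self-contained sketch and not the library's actual implementation: the `TrieSketch` class, its `insert` and `match` methods, and the greedy longest-match policy are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Sequence

# Sentinel key marking "an n-gram ends at this node" (hypothetical design).
_END = object()


class TrieSketch:
    """Hypothetical token trie; not the ngrammatcher implementation."""

    def __init__(self) -> None:
        self.root: Dict[Any, Any] = {}

    def insert(self, tokens: Sequence[str], data: Any) -> None:
        node = self.root
        for tok in tokens:
            # Create the child node for this token if it does not exist yet.
            node = node.setdefault(tok, {})
        # Store the payload at the node reached by the final token.
        node[_END] = data

    def match(self, tokens: Sequence[str]) -> List[Any]:
        results: List[Any] = []
        i = 0
        while i < len(tokens):
            node = self.root
            best_data, best_end = None, None
            j = i
            # Walk as far as the trie allows, remembering the longest hit.
            while j < len(tokens) and tokens[j] in node:
                node = node[tokens[j]]
                j += 1
                if _END in node:
                    best_data, best_end = node[_END], j
            if best_end is None:
                i += 1  # no stored n-gram starts here; advance one token
            else:
                results.append(best_data)
                i = best_end  # continue after the matched n-gram
        return results


trie = TrieSketch()
trie.insert(['programming', 'language'], 'programming language')
trie.insert(['Python'], 'Python')
matches = trie.match(['Python', 'is', 'a', 'programming', 'language'])
print(matches)
```

With this layout, each lookup step costs one dictionary access per token, and a prefix such as `['programming']` on its own produces no match because no payload is stored at that node.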

Additional Functionality

from ngrammatcher import NGramMatcher
ngm = NGramMatcher()
ngm.insert_ngram(['programming','language'], 'programming language')
ngm.insert_ngram(['Python'], 'Python')

# get all n-grams in the trie
ngm.get_all_ngrams()
# [(['Python'], 'Python'), (['programming', 'language'], 'programming language')]

# you can exclude the data object too
ngm.get_all_ngrams(keys_only=True)
# [['Python'], ['programming', 'language']]

# delete n-grams (returns True if deleted, False otherwise)
ngm.delete_ngram(['Python'])
# True


# Additional Quality-of-Life functionality
len(ngm) # get the number of n-grams in trie

['programming', 'language'] in ngm # check if n-gram is in trie

ngm[['programming', 'language']] = 'PL' # insert an n-gram into the trie

ngm[['programming', 'language']] # get the data for a specific n-gram

del ngm[['programming', 'language']] # delete an ngram using del
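
The operators above (`len`, `in`, `[]`, `del`) correspond to Python's container protocol. As a rough illustration of how a class can accept list keys like these, here is a small sketch. `NGramStore` is a hypothetical, dict-backed stand-in and says nothing about how NGramMatcher is written internally; the conversion of each list to a tuple is an assumption needed only because lists cannot be dictionary keys.

```python
from typing import Any, Dict, Sequence, Tuple


class NGramStore:
    """Hypothetical dict-backed store; illustrates the protocol only."""

    def __init__(self) -> None:
        self._items: Dict[Tuple[str, ...], Any] = {}

    def __len__(self) -> int:
        # Called by len(store)
        return len(self._items)

    def __contains__(self, ngram: Sequence[str]) -> bool:
        # Called by `ngram in store`
        return tuple(ngram) in self._items

    def __getitem__(self, ngram: Sequence[str]) -> Any:
        # Called by store[ngram]; raises KeyError if the n-gram is absent
        return self._items[tuple(ngram)]

    def __setitem__(self, ngram: Sequence[str], data: Any) -> None:
        # Called by store[ngram] = data
        self._items[tuple(ngram)] = data

    def __delitem__(self, ngram: Sequence[str]) -> None:
        # Called by del store[ngram]
        del self._items[tuple(ngram)]


store = NGramStore()
store[['programming', 'language']] = 'PL'
size_after_insert = len(store)
found = ['programming', 'language'] in store
value = store[['programming', 'language']]
del store[['programming', 'language']]
size_after_delete = len(store)
```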

Test

$ git clone https://github.com/jwnz/ngrammatcher
$ cd ngrammatcher
$ pip install pytest
$ python setup.py test

This version

1.0

