
A Python package for lexicons.

Supported Python versions: 2.7, 3.5, 3.6, 3.7, 3.8

A lexicon is a data structure that stores a set of words. Unlike a dictionary, a lexicon associates no values with its words. A lexicon resembles a list or a set of words, but its internal representation is different and is optimized for faster searches (of words, prefixes, and wildcard patterns). Specifically, the search time for a word is O(W), where W is the length of the word.
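To make the O(W) bound concrete, here is a minimal sketch of a trie built from nested dictionaries. The class `TinyTrie` and its methods `add`, `_walk`, and `with_prefix` are hypothetical names used only for illustration; this is not lexpy's implementation. The point is that a lookup performs one dictionary step per character of the query, however many words are stored.

```python
class TinyTrie:
    """Hypothetical minimal trie for illustration; not lexpy's implementation."""

    _END = None  # dictionary key that marks the end of a stored word

    def __init__(self):
        self.root = {}

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self._END] = True

    def _walk(self, text):
        """Follow text from the root; return the node reached, or None."""
        node = self.root
        for ch in text:  # one dictionary lookup per character
            node = node.get(ch)
            if node is None:
                return None
        return node

    def __contains__(self, word):
        node = self._walk(word)
        return node is not None and self._END in node

    def with_prefix(self, prefix):
        """Return all stored words that start with prefix, sorted."""
        start = self._walk(prefix)
        if start is None:
            return []
        found, stack = [], [(start, prefix)]
        while stack:
            node, text = stack.pop()
            for key, child in node.items():
                if key is self._END:
                    found.append(text)  # text is a complete stored word
                else:
                    stack.append((child, text + key))
        return sorted(found)


lexicon = TinyTrie()
for w in ('abhor', 'abuzz', 'acorn'):
    lexicon.add(w)

print('abuzz' in lexicon)         # True
print('abu' in lexicon)           # False: a prefix, but not a stored word
print(lexicon.with_prefix('ab'))  # ['abhor', 'abuzz']
```

Membership costs at most W steps for a query of length W. A prefix search costs the walk down the prefix plus a traversal of the subtree below it.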

Two important lexicon data structures are:

  1. Trie.
  2. Directed Acyclic Word Graph (DAWG).

Both the Trie and the DAWG are finite-state automata (FSA).

Install

pip install lexpy

Interface

| Interface description | Trie method | DAWG method |
|---|---|---|
| Add a single word | add('apple') | add('apple') |
| Add multiple words | add_all(['advantage', 'courage']) | add_all(['advantage', 'courage']) |
| Check whether a word exists | in operator | in operator |
| Search using a wildcard expression | search('a?b*') | search('a?b*') |
| Search for prefix matches | search_with_prefix('bar') | search_with_prefix('bar') |
| Search for similar words within a given edit distance (here, edit distance means Levenshtein distance, LD) | search_within_distance('apble', dist=1) | search_within_distance('apble', dist=1) |
| Get the number of nodes in the automaton | len(trie) | len(dawg) |

Examples

Although the examples below use only a Trie, they work the same way for a DAWG. Both Trie and DAWG support the same set of operations, as shown in the table above. However, do read the section on "DAWG" before using one.

Ways to build a Trie or a DAWG.

  1. From an input list, set, or tuple of words.
from lexpy.trie import Trie

trie = Trie()
input_words = [
    'ampyx', 'abuzz', 'athie', 'amato', 'aneto', 'aruba', 'arrow', 'agony', 'altai', 'alisa',
    'acorn', 'abhor', 'aurum', 'albay', 'arbil', 'albin', 'almug', 'artha', 'algin', 'auric',
    'sore', 'quilt', 'psychotic', 'eyes', 'cap', 'suit', 'tank', 'common', 'lonely', 'likeable',
    'language', 'shock', 'look', 'pet', 'dime', 'small', 'dusty', 'accept', 'nasty', 'thrill',
    'foot', 'steel'
]

trie.add_all(input_words)  # Accepts any sequence type or a file-like object

print(trie.get_word_count())
42
  2. Use the build_trie_from_file() function
from lexpy.utils import build_trie_from_file
trie = build_trie_from_file('/path/to/file')
  3. From a file path or a file-like object.
from lexpy.trie import Trie

trie = Trie()

# Either pass the path to a file
trie.add_all('/path/to/file.txt')

# Or pass an open file object
with open('/path/to/file.txt', 'r') as infile:
    trie.add_all(infile)

Search

  1. Check if exists using the in operator
print('ampyx' in trie)
True
  2. Prefix search
print(trie.search_with_prefix('ab'))
['abhor', 'abuzz']
  3. Wildcard search using ? and *

? = 0 or 1 occurrence of any character

* = 0 or more occurrences of any character

print(trie.search('a*o*'))
['amato', 'abhor', 'aneto', 'arrow', 'agony', 'acorn']

print(trie.search('su?t'))
['suit']
  4. Search for similar words using the notion of Levenshtein distance (LD)
print(trie.search_within_distance('arie', dist=2))
['athie', 'arbil', 'auric']
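
The wildcard and edit-distance semantics used above can be sketched in plain Python. The helpers `wildcard_match` and `levenshtein` below are hypothetical names, and the code is not how lexpy searches its automaton. It only checks a single word against a pattern, or measures the distance between two words, which is useful for reasoning about which results to expect. It assumes Python 3.4 or later for `re.fullmatch`.

```python
import re


def wildcard_match(pattern, word):
    """Return True if word matches pattern (? = 0 or 1 chars, * = 0 or more)."""
    regex = ''.join(
        '.?' if ch == '?' else '.*' if ch == '*' else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, word) is not None


def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))  # distances from '' to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ''
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


print(wildcard_match('su?t', 'suit'))   # True
print(wildcard_match('a*o*', 'aruba'))  # False: 'aruba' contains no 'o'
print(levenshtein('apble', 'apple'))    # 1
print(levenshtein('arie', 'athie'))     # 2
```

Filtering a word list with these helpers takes time proportional to the size of the list, whereas an automaton-based search can prune whole branches that cannot match.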

Directed Acyclic Word Graph (DAWG)

A DAWG supports the same set of operations as a Trie. The difference is that the number of nodes in a DAWG is always less than or equal to the number of nodes in the Trie built from the same words. Both are deterministic finite-state automata (DFA); the DAWG is a minimized version of the Trie DFA. In a Trie, only prefix redundancy is removed. In a DAWG, both prefix and suffix redundancies are removed.

In the current implementation of the DAWG, words should be inserted in alphabetical order.
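
One simple way to meet the ordering requirement is to sort and de-duplicate the words before inserting them. The helper `alphabetical_unique` below is a hypothetical name, and the sketch assumes that Python's default string ordering (by code point, so uppercase sorts before lowercase) is the order the DAWG expects, which is most likely to hold for all-lowercase input.

```python
def alphabetical_unique(words):
    """Return the distinct words in sorted order."""
    return sorted(set(words))


words = ['courageous', 'advantageous', 'courage', 'advantageous']
ordered = alphabetical_unique(words)
print(ordered)  # ['advantageous', 'courage', 'courageous']

# With lexpy installed, the ordered list can then be passed on:
#     dawg = DAWG()
#     dawg.add_all(ordered)
#     dawg.reduce()
```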

from lexpy.trie import Trie
from lexpy.dawg import DAWG

trie = Trie()
trie.add_all(['advantageous', 'courageous'])

dawg = DAWG()
dawg.add_all(['advantageous', 'courageous'])

len(trie) # Number of Nodes in Trie
23

dawg.reduce() # Perform DFA minimization. Call this each time a batch of words has been added to the DAWG.

len(dawg) # Number of nodes in DAWG
16
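
The node counts above can be reproduced without lexpy. The sketch below builds a trie from nested dictionaries and then counts how many nodes remain once nodes with identical sub-structure are merged, which is the effect of DFA minimization. The function names and the use of '$' as an end-of-word key are assumptions made for illustration; this is not lexpy's algorithm.

```python
def build_trie(words):
    """Build a trie as nested dicts; the key '$' marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = True
    return root


def count_trie_nodes(node):
    """Count every node in the trie, including the root."""
    return 1 + sum(
        count_trie_nodes(child) for key, child in node.items() if key != '$'
    )


def count_minimized_nodes(root):
    """Count the nodes left after merging nodes that accept the same suffixes."""
    registry = {}  # node signature -> id of the merged node

    def canonical_id(node):
        # Two nodes are equivalent when they agree on being word ends and
        # their outgoing edges lead to equivalent nodes.
        signature = (
            '$' in node,
            tuple(sorted(
                (key, canonical_id(child))
                for key, child in node.items() if key != '$'
            )),
        )
        return registry.setdefault(signature, len(registry))

    canonical_id(root)
    return len(registry)


trie = build_trie(['advantageous', 'courageous'])
print(count_trie_nodes(trie))       # 23
print(count_minimized_nodes(trie))  # 16
```

The two words share the suffix "ageous", so the nodes reached after "advant" and after "cour" merge, along with the six nodes that follow them. That accounts for the drop from 23 to 16.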

Fun Facts:

  1. The 45-letter word pneumonoultramicroscopicsilicovolcanoconiosis is the longest English word that appears in a major dictionary. So for all English words, the search time is bounded by O(45).
  2. The longest technical word (not in any dictionary) is the chemical name of the protein titin. It has 189,819 letters, and whether it counts as a word is disputed.
