Skip to main content

Python package for lexicon

Project description

Lexpy

lexpy Downloads PyPI version

Python 3.10 Python 3.11 Python 3.12 Python 3.13 Python 3.14

PyPy3.10 PyPy3.11

  • A lexicon is a data-structure which stores a set of words. The difference between a dictionary and a lexicon is that in a lexicon there are no values associated with the words.

  • A lexicon is similar to a list or a set of words, but the internal representation is different and optimized for faster searches of words, prefixes and wildcard patterns.

  • Given a word, precisely, the search time is O(W) where W is the length of the word.

  • 2 important lexicon data-structures are Trie and Directed Acyclic Word Graph (DAWG).

Install

lexpy can be installed via Python Package Index (PyPI) using pip. The only installation requirement is that you need Python 3.10 or higher. It has no extra dependency requirements.

pip install lexpy

Interface

Interface Description Trie DAWG
Add a single word add('apple', count=2) add('apple', count=2)
Add multiple words add_all(['advantage', 'courage']) add_all(['advantage', 'courage'])
Check if exists? in operator in operator
Search using wildcard expression search('a?b*', with_count=True) search('a?b*, with_count=True)
Search for prefix matches search_with_prefix('bar', with_count=True) search_with_prefix('bar')
Search for similar words within given edit distance. Here, the notion of edit distance is same as Levenshtein distance search_within_distance('apble', dist=1, with_count=True) search_within_distance('apble', dist=1, with_count=True)
Get the number of nodes in the automaton len(trie) len(dawg)

Examples

Trie

Build from an input list, set, or tuple of words.

from lexpy import Trie

trie = Trie()

input_words = ['ampyx', 'abuzz', 'athie', 'athie', 'athie', 'amato', 'amato', 'aneto', 'aneto', 'aruba',
               'arrow', 'agony', 'altai', 'alisa', 'acorn', 'abhor', 'aurum', 'albay', 'arbil', 'albin',
               'almug', 'artha', 'algin', 'auric', 'sore', 'quilt', 'psychotic', 'eyes', 'cap', 'suit',
               'tank', 'common', 'lonely', 'likeable' 'language', 'shock', 'look', 'pet', 'dime', 'small'
               'dusty', 'accept', 'nasty', 'thrill', 'foot', 'steel', 'steel', 'steel', 'steel', 'abuzz']

trie.add_all(input_words) # You can pass any sequence types or a file-like object here

print(trie.get_word_count())

>>> 48

Build from a file or file path.

In the file, words should be newline separated.

from lexpy import Trie

# Either
trie = Trie()
trie.add_all('/path/to/file.txt')

# Or
with open('/path/to/file.txt', 'r') as infile:
     trie.add_all(infile)

Check if exists using the in operator

print('ampyx' in trie)

>>> True

Prefix search

print(trie.search_with_prefix('ab'))

>>> ['abhor', 'abuzz']
print(trie.search_with_prefix('ab', with_count=True))

>>> [('abuzz', 2), ('abhor', 1)]

Wildcard search using ? and *

  • ? = 0 or 1 occurrence of any character

  • * = 0 or more occurrence of any character

print(trie.search('a*o*'))

>>> ['amato', 'abhor', 'aneto', 'arrow', 'agony', 'acorn']

print(trie.search('a*o*', with_count=True))

>>> [('amato', 2), ('abhor', 1), ('aneto', 2), ('arrow', 1), ('agony', 1), ('acorn', 1)]

print(trie.search('su?t'))

>>> ['suit']

print(trie.search('su?t', with_count=True))

>>> [('suit', 1)]

Search for similar words using the notion of Levenshtein distance

print(trie.search_within_distance('arie', dist=2))

>>> ['athie', 'arbil', 'auric']

print(trie.search_within_distance('arie', dist=2, with_count=True))

>>> [('athie', 3), ('arbil', 1), ('auric', 1)]

Increment word count

  • You can either add a new word or increment the counter for an existing word.
trie.add('athie', count=1000)

print(trie.search_within_distance('arie', dist=2, with_count=True))

>>> [('athie', 1003), ('arbil', 1), ('auric', 1)]

Directed Acyclic Word Graph (DAWG)

  • DAWG supports the same set of operations as a Trie. The difference is the number of nodes in a DAWG is always less than or equal to the number of nodes in Trie.

  • They both are Deterministic Finite State Automata. However, DAWG is a minimized version of the Trie DFA.

  • In a Trie, prefix redundancy is removed. In a DAWG, both prefix and suffix redundancies are removed.

  • In the current implementation of DAWG, the insertion order of the words should be alphabetical.

  • The implementation idea of DAWG is borrowed from http://stevehanov.ca/blog/?id=115

from lexpy import Trie, DAWG

trie = Trie()
trie.add_all(['advantageous', 'courageous'])

dawg = DAWG()
dawg.add_all(['advantageous', 'courageous'])

len(trie) # Number of Nodes in Trie
23

dawg.reduce() # Perform DFA minimization. Call this every time a chunk of words are uploaded in DAWG.

len(dawg) # Number of nodes in DAWG
21

DAWG

The APIs are exactly same as the Trie APIs

Build a DAWG

from lexpy import DAWG
dawg = DAWG()

input_words = ['ampyx', 'abuzz', 'athie', 'athie', 'athie', 'amato', 'amato', 'aneto', 'aneto', 'aruba',
               'arrow', 'agony', 'altai', 'alisa', 'acorn', 'abhor', 'aurum', 'albay', 'arbil', 'albin',
               'almug', 'artha', 'algin', 'auric', 'sore', 'quilt', 'psychotic', 'eyes', 'cap', 'suit',
               'tank', 'common', 'lonely', 'likeable' 'language', 'shock', 'look', 'pet', 'dime', 'small'
               'dusty', 'accept', 'nasty', 'thrill', 'foot', 'steel', 'steel', 'steel', 'steel', 'abuzz']


dawg.add_all(input_words)
dawg.reduce()

dawg.get_word_count()

>>> 48

Check if exists using the in operator

print('ampyx' in dawg)

>>> True

Prefix search

print(dawg.search_with_prefix('ab'))

>>> ['abhor', 'abuzz']
print(dawg.search_with_prefix('ab', with_count=True))

>>> [('abuzz', 2), ('abhor', 1)]

Wildcard search using ? and *

? = 0 or 1 occurance of any character

* = 0 or more occurance of any character

print(dawg.search('a*o*'))

>>> ['amato', 'abhor', 'aneto', 'arrow', 'agony', 'acorn']

print(dawg.search('a*o*', with_count=True))

>>> [('amato', 2), ('abhor', 1), ('aneto', 2), ('arrow', 1), ('agony', 1), ('acorn', 1)]

print(dawg.search('su?t'))

>>> ['suit']

print(dawg.search('su?t', with_count=True))

>>> [('suit', 1)]

Search for similar words using the notion of Levenshtein distance

print(dawg.search_within_distance('arie', dist=2))

>>> ['athie', 'arbil', 'auric']

print(dawg.search_within_distance('arie', dist=2, with_count=True))

>>> [('athie', 3), ('arbil', 1), ('auric', 1)]

Alphabetical order insertion

If you insert a word which is lexicographically out-of-order, ValueError will be raised.

dawg.add('athie', count=1000)

ValueError

ValueError: Words should be inserted in Alphabetical order. <Previous word - thrill>, <Current word - athie>

Increment the word count

  • You can either add an alphabetically greater word with a specific count or increment the count of the previous added word.
dawg.add_all(['thrill']*20000) # or dawg.add('thrill', count=20000)

print(dawg.search('thrill', with_count=True))

>> [('thrill', 20001)]

Special Characters

Special characters, except ? and *, are matched literally.

from lexpy import Trie
t = Trie()
t.add('a©')
t.search('a©')
>> ['a©']
t.search('a?')
>> ['a©']
t.search('?©')
>> ['a©']

Trie vs DAWG

Number of nodes comparison

Build time comparison

Future Work

These are some ideas which I would love to work on next in that order. Pull requests or discussions are invited.

  • Merge trie and DAWG features in one data structure
    • Support all functionalities and still be as compressed as possible.
  • Serialization / Deserialization
    • Pickle is definitely an option.
  • Server (TCP or HTTP) to serve queries over the network.

Fun Facts

  1. The 45-letter word pneumonoultramicroscopicsilicovolcanoconiosis is the longest English word that appears in a major dictionary. So for all english words, the search time is bounded by O(45).
  2. The longest technical word(not in dictionary) is the name of a protein called as titin. It has 189,819 letters and it is disputed whether it is a word.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lexpy-1.2.0.tar.gz (28.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

lexpy-1.2.0-py3-none-any.whl (27.7 kB view details)

Uploaded Python 3

File details

Details for the file lexpy-1.2.0.tar.gz.

File metadata

  • Download URL: lexpy-1.2.0.tar.gz
  • Upload date:
  • Size: 28.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for lexpy-1.2.0.tar.gz
Algorithm Hash digest
SHA256 6ad835c726f0e3ec40e4fa4ff3dfe85275d7f756082ddd5e6c6c827707790469
MD5 377d8303c81a69e59909ebd704028638
BLAKE2b-256 75039ace9dd4624fa94456480ce9d8b9c7e94fc6dc2691472280881a31085a9c

See more details on using hashes here.

File details

Details for the file lexpy-1.2.0-py3-none-any.whl.

File metadata

  • Download URL: lexpy-1.2.0-py3-none-any.whl
  • Upload date:
  • Size: 27.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for lexpy-1.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b2394884550e015d6ee38411e69fcef1027516a3b3d2c794fa97a546fd56b8cc
MD5 4eaa92e9729b5c3a05975dba8b77eab1
BLAKE2b-256 829f7c5219b707888a0486935c2c807e38a3aaa8efe276c6c4373c7a6d3a8e58

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page