Skip to main content

Fast, non-overlapping simultaneous multiple keyword search

Project description

Non-Overlapping Aho-Corasick Trie

Features:
- 'short' and 'long' (longest key) searches, both one-off and iteration
over all in some text (non-overlapping). (Short probably isn't useful
without overlap.)
- Works with both unicode and str in Python 2, and unicode in Python 3
(it's all UCS4 under the hood).
- Allows you to associate an arbitrary Python object payload with each
keyword, and supports dict operations len, [], and 'in' for the
keywords (though no del or traversal).
- Does the 'compilation' (generation of Aho-Corasick failure links) of
the trie on-demand; you can mix adding keywords and searching text
freely.
- Can be used commercially, it has a minimal license.

Anti-Features:
- Does not support overlapping keywords, unless you move along the
string manually, one character at a time, which would defeat the
purpose.
- find[all]_short is kind of useless.
- Lacks key iteration and deletion from the mapping (dict) protocol
- memory leaking untested (should be ok but ...)
- No /testcase/ for unicode in Python 2 (did manual test however)
- Unicode chars represented as ucs4, and, each character has its own
hashtable, so it's relatively memory-heavy.
- Requires a C++ compiler.

Bug reports and patches welcome of course!



To build and install, use either
# Python 2
python setup.py install # (or build, and copy the .so to where you want it)
or
# Python 3
python3 setup.py install # (or build, and copy the .so to where you want it)


'text' below applies to str and unicode in Python 2, or unicode in Python 3 (all there is)
t.add(key_text, optional payload)
(start, end, key_payload) = t.find_short(text_to_search)
(start, end, key_payload) = t.find_long(text_to_search)
(start, end, key_payload) = t.findall_short(text_to_search)
(start, end, key_payload) = t.findall_long(text_to_search)

Examples:
>>> a = NoAho()
>>> a.add('ms windows')
>>> a.add('ms windows 2000', "this is canonical")
>>> a.add('windows', None)
>>> a.add('windows 2000')
>>> text = 'windows 2000 ms windows 2000 windows'
>>> for k in a.findall_short(text):
... print text[k[0]:k[1]]
...
windows
ms windows
windows
>>> for k in a.findall_long(text):
... print text[k[0]:k[1]]
...
windows 2000
ms windows 2000
windows

Mapping (dictionary) methods:
trie = NoAho()
trie['apple'] = apple_handling_function
trie['orange'] = Orange()
trie.add('banana') # payload will be None
trie['pear'] # will give key error
assert isinstance(trie['orange'], Orange)
assert 'banana' in trie
len(trie)
# No del;
# no iteration over keys

The 'find[all]_short' forms are named as long and awkwardly as they are,
to leave plain 'find[all]' free if overlapping matches are ever implemented.


Deep Rebuilding:
Written in C++ and Cython.

You should not need to use Cython to build and use this module, but if
you want to change it and regenerate, there's a build command line at
the top of noaho.pyx. To allow the generated noaho.cpp file to be
used in both Py2 and Py3, use a Python 3-built Cython. This was built
with Cython-0.15.1, but even as far back as 0.13 should work (for that
version though, you'd have to manually use a c++ compiler to do the
last step, of linking).


For the fullest spec of what the code will and will not so, check out
test-noaho.py (python[3] test-noaho.py)

Untested: whether the payload handling is complete, ie that there are no
memory leaks. It should be correct though.

Project details


Release history Release notifications

History Node

0.9.6.1

History Node

0.9.6

History Node

0.9.5

History Node

0.9.4.2

History Node

0.9.02

History Node

0.9.01

This version
History Node

0.9

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Filename, size & hash SHA256 hash help File type Python version Upload date
NoAho-0.9.tar.gz (39.0 kB) Copy SHA256 hash SHA256 Source None Mar 20, 2012

Supported by

Elastic Elastic Search Pingdom Pingdom Monitoring Google Google BigQuery Sentry Sentry Error logging CloudAMQP CloudAMQP RabbitMQ AWS AWS Cloud computing Fastly Fastly CDN DigiCert DigiCert EV certificate StatusPage StatusPage Status page