Skip to main content

Fast, non-overlapping simultaneous multiple keyword search

Project description

Non-Overlapping Aho-Corasick Trie

Features:
- 'short' and 'long' (longest matching key) searches, both one-off and
iteration over all non-overlapping keyword matches in some text.
- Works with both unicode and str in Python 2, and unicode in Python 3. NOTE:
As everything is simply single UCS4 / UTF-32 codepoints under the hood, all
substrings and input unicode must be normalized, ie any separate modifying
marks must be folded into each codepoint. See:
http://stackoverflow.com/questions/16467479/normalizing-unicode
Or, theoretically, you could put into the tree all forms of the
keywords you expect to see in your text.
- Allows you to associate an arbitrary Python object payload with each
keyword, and supports dict operations len(), [], and 'in' for the
keywords (though no del or traversal).
- Does the 'compilation' (generation of Aho-Corasick failure links) of
the trie on-demand, ie you could mix adding keywords and searching
text, freely, but mostly it just relieves you of worrying about
compiling.
- Can be used commercially, it's under the minimal, MIT license (if you
somehow need a different license, ask me, I mean for it to be used).

Anti-Features:
- Will not find overlapped keywords (eg given keywords 'abc' and 'cdef', will
not find 'cdef' in 'abcdef'. Any full Aho-Corasick implementation would give
you both. The package 'Acora' is an alternative package for this use. (noaho
can be relatively easily modified to be a normal Aho-Corasick, but it wasn't
what I personally needed.)
- Lacking overlap, find[all]_short is kind of useless.
- Lacks key iteration and deletion from the mapping (dict) protocol.
- Memory leaking untested (one run under valgrind turned up nothing, but it
wasn't extensive).
- No /testcase/ for unicode in Python 2 (did manual test however)
Unicode chars represented as ucs4, and, each character has its own hashtable,
so it's relatively memory-heavy (see 'Ways to Reduce Memory Use' below).
- Requires a C++ compiler (C++98 support is enough).

Bug reports and patches welcome of course!


To build and install, use either
pip install noaho
or
# Python 2
python2 setup.py install # (or ... build, and copy the .so to where you want it)
pip install
or
# Python 3
python3 setup.py install # (or ... build, and copy the .so to where you want it)


API:
from noaho import NoAho
trie = NoAho()
'text' below applies to str and unicode in Python 2, or unicode in Python 3 (all there is)
trie.add(key_text, optional payload)
(key_start, key_end, key_value) = trie.find_short(text_to_search)
(key_start, key_end, key_value) = trie.find_long(text_to_search)
(key_start, key_end, key_value) = trie.findall_short(text_to_search)
(key_start, key_end, key_value) = trie.findall_long(text_to_search)
# keyword = text_to_search[key_start:key_end]
trie['keyword] = key_value
key_value = trie.find_long(text_to_search)
assert len(trie)
assert keyword in trie

Examples:
>>> a = NoAho()
>>> a.add('ms windows')
>>> a.add('ms windows 2000', "this is canonical")
>>> a.add('windows', None)
>>> a.add('windows 2000')
>>> a['apple'] = None
>>> text = 'windows 2000 ms windows 2000 windows'
>>> for k in a.findall_short(text):
... print text[k[0]:k[1]]
...
windows
ms windows
windows
>>> for k in a.findall_long(text):
... print text[k[0]:k[1]]
...
windows 2000
ms windows 2000
windows

Mapping (dictionary) methods:
trie = NoAho()
trie['apple'] = apple_handling_function
trie['orange'] = Orange()
trie.add('banana') # payload will be None
trie['pear'] # will give key error
assert isinstance(trie['orange'], Orange)
assert 'banana' in trie
len(trie)
# No del;
# no iteration over keys

The 'find[all]_short' forms are named as long and awkwardly as they are,
to leave plain 'find[all]' free if overlapping matches are ever implemented.


For the fullest spec of what the code will and will not do, check out
test-noaho.py (run it with: python[3] test-noaho.py)

Untested: whether the payload handling is complete, ie that there are no
memory leaks. It should be correct though.


Regenerating the Python Wrapper:
- Needs a C++ compiler (C++98 is fine) and Cython.

You do not need to rebuild the Cython wrapper (the generated noaho.cpp), but if
you want to make changes to the module itself, there is a script:

test-all-configurations.sh

which will, with minor configuration tweaking, rebuild and test against both
python 2 and 3. It requires you to have a Cython tarball in the top directory.
Note that the python you used to install Cython should be the same as the one
you use to do the regeneration, because the regeneration setup includes a module
Cython.Distutils, from the installation.

Cython generates python-wrapper noaho.cpp from noaho.pyx (be careful
to distinguish it from the misnamed array-aho.* (it uses hash tables),
which is the original C++ code).

Ways to Reduce Memory Use:
One of its aims is to handle Unicode, which means you have to accommodate a huge
branching factor, thus the hashtable (a full array would be out of the
question). Ways to attack memory size might be, to either force very
conservative hashtable growth, or, once the trie is complete (in 'compile', say)
go through the tree and replace the hashtables with just-the-right-size arrays -
linear scan / binary search should be fast enough if small enough, and take less
memory. If you're willing to do a linear scan at that point, you could switch to
UTF-8, too, saving quite a bit of memory. Danny Yoo's original code I think just
started out as arrays and would grow to hashtables when needed.

Also, if all you need is ASCII, you could re-define AC_CHAR_TYPE to be 'char'.
I've tried to be careful to use AC_CHAR_TYPE consistently, but you'd probably
want to go through the code to make sure if you're going to rely on this. Python
3 uses Unicode internally though and would do a lot of conversions anyway.
Otherwise, I don't trust my knowledge of Unicode enough to try to play games
with storing fewer bits.

In the Hopper:
I have a case-insensitive version (the easiest thing is just to downcase
everything you add or search for in noaho.pyx), and, one that will only yield
keywords at word boundaries, thanks to Python's unicode character classes.
(However, this second is a bit raw, and you can do it manually anyway.)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

NoAho-0.9.6.1.zip (54.0 kB view details)

Uploaded Source

NoAho-0.9.6.1.tar.gz (47.2 kB view details)

Uploaded Source

NoAho-0.9.6.1.tar.bz2 (40.5 kB view details)

Uploaded Source

File details

Details for the file NoAho-0.9.6.1.zip.

File metadata

  • Download URL: NoAho-0.9.6.1.zip
  • Upload date:
  • Size: 54.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for NoAho-0.9.6.1.zip
Algorithm Hash digest
SHA256 4ffcf29e7002203f39a8f254ad7f792351dcb601aea84988f7e5cd4c4a027888
MD5 0304c2d48e5db251debf74f044ae8ce1
BLAKE2b-256 f1ea4c6f909222da563daf274ddb4a9be70807c91d2a2bafa79d2c1f4372b183

See more details on using hashes here.

File details

Details for the file NoAho-0.9.6.1.tar.gz.

File metadata

  • Download URL: NoAho-0.9.6.1.tar.gz
  • Upload date:
  • Size: 47.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for NoAho-0.9.6.1.tar.gz
Algorithm Hash digest
SHA256 d41961d8b46ca260c102be9b9ebde4a4d3e0d5babc3a1e63bb6478a2f2029feb
MD5 85816f8c2b099014c52f3f73256a0c8a
BLAKE2b-256 2def1462ac8dc402b0ad622a64f10b394cf5136b47636e6e20ef39f3357406a6

See more details on using hashes here.

File details

Details for the file NoAho-0.9.6.1.tar.bz2.

File metadata

  • Download URL: NoAho-0.9.6.1.tar.bz2
  • Upload date:
  • Size: 40.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for NoAho-0.9.6.1.tar.bz2
Algorithm Hash digest
SHA256 f07b3b313cf063395dac0f2030a7e5c281f039c630ed565d125f441e11c975a5
MD5 1418a6b84ba5454629b5861f14c0f81f
BLAKE2b-256 334a3ea8d85ce45ed7fcb3785c80e8047bb1ad4407608381bb1fc516ea6a4780

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page