Fast and memory efficient DAWG for Python
Project description
DAWG
This package provides DAWG-based dictionary-like read-only objects for Python (2.x and 3.x).
String data in a DAWG (Directed Acyclic Word Graph) may take 200x less memory than in a standard Python dict or list and the raw lookup speed is comparable. DAWG may be even faster than built-in dict for some operations. It also provides fast advanced methods like prefix search.
Based on dawgdic C++ library.
Installation
pip install DAWG
Usage
There are several DAWG classes in this package:
dawg.DAWG - basic DAWG wrapper; it can store unicode keys and do exact lookups;
dawg.CompletionDAWG - dawg.DAWG subclass that supports key completion and prefix lookups (but requires more memory);
dawg.BytesDAWG - dawg.CompletionDAWG subclass that maps unicode keys to lists of bytes objects.
dawg.RecordDAWG - dawg.BytesDAWG subclass that maps unicode keys to lists of data tuples. All tuples must be of the same format (the data is packed using python struct module).
dawg.IntDAWG - dawg.DAWG subclass that maps unicode keys to integer values.
DAWG and CompletionDAWG
DAWG and CompletionDAWG are useful when you need fast & memory efficient simple string storage. These classes does not support assigning values to keys.
DAWG and CompletionDAWG constructors accept an iterable with keys:
>>> import dawg >>> words = [u'foo', u'bar', u'foobar', u'foö', u'bör'] >>> base_dawg = dawg.DAWG(words) >>> completion_dawg = dawg.CompletionDAWG(words)
It is then possible to check if the key is in a DAWG:
>>> u'foo' in base_dawg True >>> u'baz' in completion_dawg False
It is possible to find all keys that starts with a given prefix in a CompletionDAWG:
>>> completion_dawg.keys(u'foo') >>> [u'foo', u'foobar']
It is possible to find all keys similar to a given key (using a one-way char translation table):
>>> replaces = dawg.DAWG.compile_replaces({u'o': u'ö'}) >>> base_dawg.similar_keys(u'foo', replaces) [u'foo', u'foö'] >>> base_dawg.similar_keys(u'foö', replaces) [u'foö'] >>> base_dawg.similar_keys(u'bor', replaces) [u'bör']
BytesDAWG
BytesDAWG is a CompletionDAWG subclass that can store binary data for each key.
BytesDAWG constructor accepts an iterable with (unicode_key, bytes_value) tuples:
>>> data = [(u'key1', b'value1'), (u'key2', b'value2'), (u'key1', b'value3')] >>> bytes_dawg = dawg.BytesDAWG(data)
There can be duplicate keys; all unique values are stored in this case:
>>> bytes_dawg[u'key1'] [b'value1, b'value3']
For unique keys a list with a single value is returned for consistency:
>>> bytes_dawg[u'key2'] [b'value2']
KeyError is raised for missing keys; use get method if you need a default value instead:
>>> bytes_dawg.get(u'foo', None) None
BytesDAWG support keys and items methods (they both accept optional key prefix). There is also support for similar_keys, similar_items and similar_item_values methods.
RecordDAWG
RecordDAWG is a BytesDAWG subclass that automatically packs & unpacks the binary data from/to Python objects using struct module from the standard library.
First, you have to define a format of the data. Consult Python docs (http://docs.python.org/library/struct.html#format-strings) for the format string specification.
For example, let’s store 3 short unsigned numbers (in a Big-Endian byte order) as values:
>>> format = ">HHH"
RecordDAWG constructor accepts an iterable with (unicode_key, value_tuple). Let’s create such iterable using zip function:
>>> keys = [u'foo', u'bar', u'foobar', u'foo'] >>> values = [(1, 2, 3), (2, 1, 0), (3, 3, 3), (2, 1, 5)] >>> data = zip(keys, values) >>> record_dawg = RecordDAWG(format, data)
As with BytesDAWG, there can be several values for the same key:
>>> record_dawg['foo'] [(1, 2, 3), (2, 1, 5)] >>> record_dawg['foobar'] [(3, 3, 3)]
IntDAWG
IntDAWG is a {unicode -> int} mapping. It is possible to use RecordDAWG for this, but IntDAWG is natively supported by dawgdic C++ library and so __getitem__ is much faster.
Unlike BytesDAWG and RecordDAWG, IntDAWG doesn’t support having several values for the same key.
IntDAWG constructor accepts an iterable with (unicode_key, integer_value) tuples:
>>> data = [ (u'foo', 1), (u'bar', 2) ] >>> int_dawg = dawg.IntDAWG(data)
It is then possible to get a value from the IntDAWG:
>>> int_dawg[u'foo'] 1
Persistence
All DAWGs support saving/loading and pickling/unpickling.
Write DAWG to a stream:
>>> with open('words.dawg', 'wb') as f: ... d.write(f)
Save DAWG to a file:
>>> d.save('words.dawg')
Load DAWG from a file:
>>> d = dawg.DAWG() >>> d.load('words.dawg')
Read DAWG from a stream:
>>> d = dawg.RecordDAWG(format_string) >>> with open('words.record-dawg', 'rb') as f: ... d.read(f)
DAWG objects are picklable:
>>> import pickle >>> data = pickle.dumps(d) >>> d2 = pickle.loads(data)
Benchmarks
For a list of 3000000 (3 million) Russian words memory consumption with different data structures (under Python 2.7):
dict(unicode words -> word lenghts): about 600M
list(unicode words) : about 300M
marisa_trie.RecordTrie : 11M
marisa_trie.Trie: 7M
dawg.DAWG: 2M
dawg.CompletionDAWG: 3M
dawg.IntDAWG: 2.7M
dawg.RecordDAWG: 4M
Benchmark results (100k unicode words, integer values (lenghts of the words), Python 3.2, macbook air i5 1.8 Ghz):
dict __getitem__ (hits): 4.102M ops/sec DAWG __getitem__ (hits): not supported BytesDAWG __getitem__ (hits): 1.558M ops/sec RecordDAWG __getitem__ (hits): 0.950M ops/sec IntDAWG __getitem__ (hits): 2.835M ops/sec dict get() (hits): 3.053M ops/sec DAWG get() (hits): not supported BytesDAWG get() (hits): 1.340M ops/sec RecordDAWG get() (hits): 0.882M ops/sec IntDAWG get() (hits): 2.370M ops/sec dict get() (misses): 3.250M ops/sec DAWG get() (misses): not supported BytesDAWG get() (misses): 2.483M ops/sec RecordDAWG get() (misses): 2.249M ops/sec IntDAWG get() (misses): 2.806M ops/sec dict __contains__ (hits): 4.068M ops/sec DAWG __contains__ (hits): 3.065M ops/sec BytesDAWG __contains__ (hits): 2.627M ops/sec RecordDAWG __contains__ (hits): 2.613M ops/sec IntDAWG __contains__ (hits): 3.021M ops/sec dict __contains__ (misses): 3.471M ops/sec DAWG __contains__ (misses): 3.537M ops/sec BytesDAWG __contains__ (misses): 3.381M ops/sec RecordDAWG __contains__ (misses): 3.361M ops/sec IntDAWG __contains__ (misses): 3.540M ops/sec dict items(): 58.754 ops/sec DAWG items(): not supported BytesDAWG items(): 15.914 ops/sec RecordDAWG items(): 10.699 ops/sec IntDAWG items(): not supported dict keys(): 214.499 ops/sec DAWG keys(): not supported BytesDAWG keys(): 23.929 ops/sec RecordDAWG keys(): 23.726 ops/sec IntDAWG keys(): not supported RecordDAWG.keys(prefix="xxx"), avg_len(res)==415: 6.057K ops/sec RecordDAWG.keys(prefix="xxxxx"), avg_len(res)==17: 130.680K ops/sec RecordDAWG.keys(prefix="xxxxxxxx"), avg_len(res)==3: 507.355K ops/sec RecordDAWG.keys(prefix="xxxxx..xx"), avg_len(res)==1.4: 745.566K ops/sec RecordDAWG.keys(prefix="xxx"), NON_EXISTING: 3032.758K ops/sec
Please take this benchmark results with a grain of salt; this is a very simple benchmark on a single data set.
Current limitations
IntDAWG is currently a subclass of DAWG and so it doesn’t support keys() and items() methods;
read() method reads the whole stream (DAWG must be the last or the only item in a stream if it is read with read() method) - pickling doesn’t have this limitation;
DAWGs loaded with read() and unpickled DAWGs uses 3x-4x memory compared to DAWGs loaded with load() method;
iterator versions of methods are not always implemented;
there are keys() and items() methods but no values() method.
prefixes() method for getting all prefixes of a given work is not implemented yet;
BytesDAWG and RecordDAWG has a limitation: values larger than 8KB are unsupported.
Contributions are welcome!
Contributing
Development happens at github and bitbucket:
The main issue tracker is at github: https://github.com/kmike/DAWG/issues
Feel free to submit ideas, bugs, pull requests (git or hg) or regular patches.
If you found a bug in a C++ part please report it to the original bug tracker.
How is source code organized
There are 4 folders in repository:
bench - benchmarks & benchmark data;
lib - original unmodified dawgdic C++ library and a customized version of libb64 library. They are bundled for easier distribution; if something is have to be fixed in these libraries consider fixing it in the original repositories;
src - wrapper code; src/dawg.pyx is a wrapper implementation; src/*.pxd files are Cython headers for corresponding C++ headers; src/*.cpp files are the pre-built extension code and shouldn’t be modified directly (they should be updated via update_cpp.sh script).
tests - the test suite.
Running tests and benchmarks
Make sure tox is installed and run
$ tox
from the source checkout. Tests should pass under python 2.6, 2.7, 3.2 and 3.3.
In order to run benchmarks, type
$ tox -c bench.ini
License
Wrapper code is licensed under MIT License. Bundled dawgdic C++ library is licensed under BSD license. libb64 is Public Domain.
0.3.1 (2012-09-20)
bundled dawgdic C++ library is updated to the latest version.
0.3 (2012-09-13)
similar_keys, similar_items and similar_item_values methods for more permissive lookups (they may be useful e.g. for umlaut handling);
load method returns self;
Python 3.3 support.
0.2 (2012-09-08)
Greatly improved memory usage for DAWGs loaded with load method.
There is currently a bug somewhere in a wrapper so DAWGs loaded with read() method or unpickled DAWGs uses 3x-4x memory compared to DAWGs loaded with load() method. load() is fixed in this release but other methods are not.
0.1 (2012-09-08)
Initial release.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.