
Pure-Python reader for DAWGs (DAFSAs) created by the dawgdic C++ library or the DAWG Python extension.


DAWG2-Python


This pure-Python package provides read-only access to files created by the dawgdic C++ library and the DAWG Python package.

This package cannot create DAWGs. It works with DAWGs built by the dawgdic C++ library or the DAWG Python extension module. The main purpose of DAWG2-Python is to provide access to DAWGs without requiring compiled extensions. It is also quite fast under PyPy (see the benchmarks below).

Installation

pip install DAWG2-Python

Usage

The aim of DAWG2-Python is to be API- and binary-compatible with DAWG wherever possible.
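Because the two packages aim to share an API, one possible pattern is to prefer the compiled extension and fall back to the pure-Python reader. The following is a minimal sketch under that assumption; the names `dawg_impl` and `load_dawg` are hypothetical and not part of either package, and it only relies on read-only features.

```python
# Prefer the compiled DAWG extension; fall back to the pure-Python reader.
try:
    import dawg as dawg_impl              # C extension (can build and read)
except ImportError:
    try:
        import dawg_python as dawg_impl   # pure Python (read-only)
    except ImportError:
        dawg_impl = None                  # neither package is installed


def load_dawg(path):
    """Load a plain DAWG from `path` with whichever backend is available.

    `load_dawg` is a hypothetical helper, not an API of DAWG or DAWG2-Python.
    """
    if dawg_impl is None:
        raise RuntimeError("install DAWG or DAWG2-Python to read .dawg files")
    d = dawg_impl.DAWG()
    d.load(path)
    return d
```

Code written this way should avoid the features that the pure-Python reader does not support, such as building or saving DAWGs.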

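First, create a DAWG using the DAWG module (this step requires the compiled extension):

import dawg

data = ['foo', 'bar', 'foobar']  # example keys; replace with your own
d = dawg.DAWG(data)
d.save('words.dawg')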

This DAWG can then be loaded without any C extensions:

import dawg_python

d = dawg_python.DAWG().load('words.dawg')

Please consult the DAWG docs for detailed usage. Some features (such as constructor parameters and the save method) are intentionally unsupported.
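As a rough illustration of read-only usage, the sketch below loads a saved DAWG and runs the two queries that the benchmarks mention for the plain DAWG class: a membership test and `prefixes`. The helper name `describe_word` is hypothetical; it returns None when the package or the file is missing, so it can be tried without a prepared `words.dawg`.

```python
import os


def describe_word(path, word):
    """Return (is_member, prefixes) for `word`, or None if nothing can be loaded.

    `describe_word` is a hypothetical helper written for this example.
    """
    try:
        import dawg_python
    except ImportError:
        return None                      # DAWG2-Python is not installed
    if not os.path.exists(path):
        return None                      # no DAWG file at this path
    d = dawg_python.DAWG().load(path)
    # `in` tests membership; prefixes() lists stored keys that are prefixes of `word`.
    return word in d, d.prefixes(word)


print(describe_word('words.dawg', 'foobar'))
```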

Benchmarks

Benchmark results (100k unicode words, integer values (lengths of the words), PyPy 1.9, MacBook Air, i5 1.8 GHz):

dict __getitem__ (hits):        11.090M ops/sec
DAWG __getitem__ (hits):        not supported
BytesDAWG __getitem__ (hits):   0.493M ops/sec
RecordDAWG __getitem__ (hits):  0.376M ops/sec

dict get() (hits):              10.127M ops/sec
DAWG get() (hits):              not supported
BytesDAWG get() (hits):         0.481M ops/sec
RecordDAWG get() (hits):        0.402M ops/sec

dict get() (misses):            14.885M ops/sec
DAWG get() (misses):            not supported
BytesDAWG get() (misses):       1.259M ops/sec
RecordDAWG get() (misses):      1.337M ops/sec

dict __contains__ (hits):           11.100M ops/sec
DAWG __contains__ (hits):           1.317M ops/sec
BytesDAWG __contains__ (hits):      1.107M ops/sec
RecordDAWG __contains__ (hits):     1.095M ops/sec

dict __contains__ (misses):         10.567M ops/sec
DAWG __contains__ (misses):         1.902M ops/sec
BytesDAWG __contains__ (misses):    1.873M ops/sec
RecordDAWG __contains__ (misses):   1.862M ops/sec

dict items():           44.401 ops/sec
DAWG items():           not supported
BytesDAWG items():      3.226 ops/sec
RecordDAWG items():     2.987 ops/sec

dict keys():            426.250 ops/sec
DAWG keys():            not supported
BytesDAWG keys():       6.050 ops/sec
RecordDAWG keys():      6.363 ops/sec

DAWG.prefixes (hits):    0.756M ops/sec
DAWG.prefixes (mixed):   1.965M ops/sec
DAWG.prefixes (misses):  1.773M ops/sec

RecordDAWG.keys(prefix="xxx"), avg_len(res)==415:       1.429K ops/sec
RecordDAWG.keys(prefix="xxxxx"), avg_len(res)==17:      36.994K ops/sec
RecordDAWG.keys(prefix="xxxxxxxx"), avg_len(res)==3:    121.897K ops/sec
RecordDAWG.keys(prefix="xxxxx..xx"), avg_len(res)==1.4: 265.015K ops/sec
RecordDAWG.keys(prefix="xxx"), NON_EXISTING:            2450.898K ops/sec

Under CPython, expect it to be about 50x slower. Memory consumption of DAWG2-Python should be the same as that of DAWG.
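For readers who want to get a feel for how an ops/sec figure is obtained, here is a minimal sketch using the standard `timeit` module. It is not the project's `bench.speed` script: it only times `dict.__contains__` hits on synthetic keys, and the helper name `ops_per_sec` and the word list are made up for illustration.

```python
import timeit


def ops_per_sec(func, number=100_000, repeat=3):
    """Best-of-`repeat` throughput when calling func() `number` times."""
    best = min(timeit.repeat(func, number=number, repeat=repeat))
    return number / best


# Synthetic data: keys mapped to their lengths, similar in shape to the
# benchmark data described above (not the actual word list).
words = ["word%d" % i for i in range(1000)]
lengths = {w: len(w) for w in words}

rate = ops_per_sec(lambda: "word500" in lengths)
print("dict __contains__ (hits): %.3fM ops/sec" % (rate / 1e6))
```

The same helper could be pointed at a loaded DAWG object (for example `lambda: "word500" in d`) to compare interpreters on your own machine.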

Current limitations

  • This package is not capable of creating DAWGs;
  • all the limitations of DAWG apply.

Contributions are welcome!

Contributing

Feel free to submit ideas, bugs or pull requests.

Running tests and benchmarks

Make sure pytest is installed, then run

$ pytest .

from the source checkout. Tests should pass under Python 3.8, 3.9, 3.10, 3.11 and PyPy3 >= 7.3.

In order to run benchmarks, type

$ pypy3 -m bench.speed

This runs benchmarks under PyPy (they are about 50x slower under CPython).

Authors & Contributors

The algorithms are from the dawgdic C++ library by Susumu Yata & contributors.

License

This package is licensed under the MIT License.

Changes

0.8.1 (2024-08-01)

Minor technical update:

  • fixed typo in GitHub link
  • updated dependencies

0.8.0 (2023-09-27)

  • Allow more flexible char substitutes by @bt2901
  • minimal Python version changed to 3.8 by @insolor
  • setup.py building changed to poetry by @insolor

0.7.2 (2015-04-18)

  • minor speedup;
  • bitbucket mirror is no longer maintained.

0.7.1 (2014-06-05)

  • Switch to setuptools;
  • upload wheel to pypi;
  • check Python 3.4 compatibility.

0.7 (2013-10-13)

IntDAWG and IntCompletionDAWG are implemented.

0.6 (2013-03-23)

Use less shared state internally. This should fix thread-safety bugs and make iterkeys/iteritems reentrant.

0.5.1 (2013-03-01)

Internal tweaks: memory usage is reduced; some operations are a bit faster, others a bit slower.

0.5 (2012-10-08)

Storage scheme is updated to match DAWG==0.5. This enables the alphabetical ordering of BytesDAWG and RecordDAWG items.

To read a BytesDAWG or RecordDAWG created with DAWG < 0.5, use the payload_separator constructor argument:

>>> BytesDAWG(payload_separator=b'\xff').load('old.dawg')
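To see why the separator can affect ordering, consider a simplified model in which each stored entry is `key + separator + payload` and entries are returned in byte order. This is an illustrative sketch, not the library's storage code: the helper names are hypothetical, and the low-valued separator `b'\x01'` is an assumption made for the example (the text above only names `b'\xff'` as the old value).

```python
def entry(key, payload, sep):
    """Model of a stored entry: encoded key, then separator, then payload."""
    return key.encode("utf8") + sep + payload


def stored_order(keys, sep):
    """Keys in the order their entries sort as raw bytes."""
    entries = sorted(entry(k, b"data", sep) for k in keys)
    return [e.split(sep, 1)[0].decode("utf8") for e in entries]


keys = ["foo", "foobar", "fo"]

old = stored_order(keys, b"\xff")   # high separator: longer keys sort first
new = stored_order(keys, b"\x01")   # low separator (assumed): alphabetical

print(old)
print(new)
```

With a separator byte higher than any key byte, a key sorts after every longer key that starts with it; with a separator lower than any key byte, the byte order of the entries matches the alphabetical order of the keys.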

0.3.1 (2012-10-01)

Bug with empty DAWGs is fixed.

0.3 (2012-09-26)

  • iterkeys and iteritems methods.

0.2 (2012-09-24)

prefixes support.

0.1 (2012-09-20)

Initial release.
