Skip to main content

Python bindings to marisa-trie (unofficial)

Project description


MARISA-Trie structure for Python (2.x and 3.x). Uses marisa-trie C++ library.

MARISA-Trie is a static trie that is very memory efficient and fairly fast.

There are official SWIG-based Python bindings included in C++ library distribution; this package provides an alternative unofficial Cython-based pip-installable Python bindings.


pip install marisa-trie


Create a new trie:

>>> import marisa_trie
>>> trie = marisa_trie.Trie()

Build a trie:

>>>[u'key1', u'key2', u'key12'])
<marisa_trie.Trie at ...>

Check if key is in trie:

>>> u'key1' in trie
>>> u'key20' in trie

Each key is assigned an unique ID from 0 to (n - 1), where n is the number of keys; you can use this ID to store a value in a separate structure (e.g. python list):

>>> trie.key_id(u'key2')


In future versions dict-like interface may become builtin.

Key can be reconstructed from the ID:

>>> trie.restore_key(1)

Find all prefixes of a given key:

>>> trie.prefixes(u'key12')
[u'key1', u'key12']

There is also a generator version of .prefixes method called .iter_prefixes.

Find all keys from this trie that starts with a given prefix:

>> trie.keys(u'key1')
[u'key1', u'key12']

(iterator version .iterkeys(prefix) is also available).

It is possible to save a trie to a file:

>>> with open('my_trie.marisa', 'w') as f:
...     trie.write(f)



Load a trie:

>>> trie2 = marisa.Trie()
>>> with open('my_trie.marisa', 'r') as f:
...     trie.load(f)


>>> trie2.load('my_trie.marisa')

Trie objects are picklable:

>>> import pickle
>>> data = pickle.dumps(trie)
>>> trie3 = pickle.loads(data)

You could also build a trie using marisa-build command-line utility (provided by underlying C library; it should be downloaded and compiled separately) and then load it from resulting file using .load() method.


My quick tests show that memory usage is quite decent. For a list of 3000000 (3 million) Russian words memory consumption with different data structures (under Python 2.7):

  • list(unicode words) : about 300M
  • BaseTrie from datrie library: about 70M
  • marisa_trie.Trie: 7M


This is not a fair comparison because datrie.BaseTrie is able to store arbitrary integers as values and marisa_trie.Trie uses auto-assigned IDs.

Some speed data for marisa_trie.Trie (100k unicode words, Python 3.2, macbook air i5 1.8 Ghz):

dict __contains__ (hits):       4.147M ops/sec
trie __contains__ (hits):       0.887M ops/sec
dict __contains__ (misses):     3.234M ops/sec
trie __contains__ (misses):     1.529M ops/sec
dict __len__:                   599186.286 ops/sec
trie __len__:                   433893.517 ops/sec
dict keys():                    215.424 ops/sec
trie keys():                    3.425 ops/sec
trie.iter_prefixes (hits):      0.169M ops/sec
trie.iter_prefixes (misses):    0.822M ops/sec
trie.iter_prefixes (mixed):     0.747M ops/sec

trie.keys(prefix="xxx"), avg_len(res)==415:         0.840K ops/sec
trie.keys(prefix="xxxxx"), avg_len(res)==17:        19.172K ops/sec
trie.keys(prefix="xxxxxxxx"), avg_len(res)==3:      82.777K ops/sec
trie.keys(prefix="xxxxx..xx"), avg_len(res)==1.4:   131.348K ops/sec
trie.keys(prefix="xxx"), NON_EXISTING:              1027.093K ops/sec

So marisa_trie.Trie uses less memory, datrie.Trie is faster.


Development happens at github and bitbucket:

The main issue tracker is at github:

Feel free to submit ideas, bugs, pull requests (git or hg) or regular patches.

If you found a bug in a C++ part please report it to the original bug tracker.

Running tests and benchmarks

Make sure tox is installed and run

$ tox

from the source checkout. Tests should pass under python 2.6, 2.7, 3.2 and 3.3.


At the moment of writing the latest pip release (1.1) does not support Python 3.3; in order to run tox tests under Python 3.3 find the “virtualenv_support” directory in site-packages (of the env you run tox from) and place an sdist zip/tarball of the newer pip (from github) there.

$ tox -c bench.ini

runs benchmarks.

Authors & Contributors

This module is based on marisa-trie C++ library by Susumu Yata & contributors.


Wrapper code is licensed under MIT License. Bundled marisa-trie C++ library is licensed under BSD license.


0.2 (2012-08-19)

  • Pickling/unpickling support;
  • dumps/loads methods;
  • python 3.3 workaround;
  • improved tests;
  • benchmarks.

0.1 (2012-08-17)

Initial release.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for marisa-trie, version 0.2
Filename, size File type Python version Upload date Hashes
Filename, size marisa-trie-0.2.tar.gz (125.3 kB) File type Source Python version None Upload date Hashes View

Supported by

Pingdom Pingdom Monitoring Google Google Object Storage and Download Analytics Sentry Sentry Error logging AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN DigiCert DigiCert EV certificate StatusPage StatusPage Status page