Skip to main content

A self-compressing, dependency-free radix trie

Project description

Build Status

Bytetrie

A fast, dependency-free, self-compressing trie with radix 256 in pure python.

Bytetrie allows fast prefix search in a large corpus of keys. Each key can be associated with arbitrary data. It features fast lookup times at the cost of expensive insertion. A Bytetrie is best used if it can be pre-filled with data. However, due to its in-band compression it can be also used for on-the-fly updates.

Keys

Keys are byte strings. Therefore, each node in the trie can have up to 256 children (the radix). Keys do work well with utf-8 and other encodings as long as the encoding is consistent and deterministic. That is, grapheme clusters are always encoded to the same byte sequence -- even if the standard allows for ambiguity. Usually that's a non-issue as long as the same encoder is used for insertion and lookup.

Since prefix search in unicode strings is one of the most common use-cases of bytetrie, a unicode layer on top of bytetrie is planned.

Data

Bytetrie can associate arbitrary python objects with keys. Data (or rather a reference thereof) is kept in-tree. No further processing is done.

In addition, bytrie allows multi-valued tries. Every key is then associated with a sequence of arbitrary objects.

Performance

Despite being in pure python bytetrie is fast. Sifting through the full geonames "allCountries" dataset for places starting with Vienna takes a mere 512µs. That's not even a millisecond for searching through 12,041,359 places. For comparison, a warmed-up ripgrep search through the same dataset takes three orders of magnitude (400ms) longer on the same machine.

On the downside, building the trie takes about 20 minutes and considerable memory. Also, the performance is mostly trumped by the time it takes to collect terminal nodes. The higher up the trie the search ends (and hence the more results the prefix search yields) the longer it takes. There are several low-hanging fruits left and further performance improvements are in the pipeline.

Dependencies

None. That's the point.

Getting started

Install bytetrie via pip.

pip install -U bytetrie

The public interface is ByteTrie with the two methods insert and find. Find returns a list of Terminals from which the key and the value of the node can be retrieved.

from bytetrie import ByteTrie

t = ByteTrie(multi_value=True)
t.insert(b"Hallo", "Dutch")
t.insert(b"Hello", "English")
t.insert(b"Hug", "Gaelic")
t.insert(b"Hallo", "German")
t.insert("Hē".encode("utf-8"), "Hindi")
t.insert("Halló".encode("utf-8"), "Icelandic")
t.insert(b"Hej", "Polish")
t.insert(b"Hei", "Romanian")
t.insert(b"Hujambo", "Swahili")
t.insert(b"Hej", "Swedish")
t.insert(b"Helo", "Welsh")

print("Where to say 'Hi' with 'He'?") 
print(f"{[(n.key(), n.value()) for n in t.find(b'He')]}")
# [(b'Hei', ['Romanian']), (b'Hej', ['Swedish', 'Polish']), (b'Helo', ['Welsh']), (b'Hello', ['English'])]

print("Where to say 'Hi' with 'Ha'?") 
print(f"{[(n.key().decode(), n.value()) for n in t.find(b'Ha')]}")
# [('Halló', ['Icelandic']), ('Hallo', ['German', 'Dutch'])]

print("Where to say 'Hi' with 'Hē'?") 
print(f"Say 'Hi' with utf-8: {[(n.key().decode(), n.value()) for n in t.find('Hē'.encode())]}")
# [('Hē', ['Hindi'])]

Contribute

If you want to contribute to bytetrie feel free to send patches to dev[at]friedl[dot]net. Alternatviely, you can issue a pull request on GitHub which will be cherry picked into my tree. If you plan significant long-term contributions drop me a mail for access to the incubator repository.

Github Users

If you are visiting this repository on GitHub, you are on a mirror of https://git.friedl.net/incubator/bytetrie. This mirror is regularily updated with my other GitHub mirrors.

Like with my other incubator projects, once I consider bytetrie reasonable stable the main tree will move to GitHub.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bytetrie-0.1.0.tar.gz (7.4 kB view details)

Uploaded Source

Built Distribution

bytetrie-0.1.0-py3-none-any.whl (7.9 kB view details)

Uploaded Python 3

File details

Details for the file bytetrie-0.1.0.tar.gz.

File metadata

  • Download URL: bytetrie-0.1.0.tar.gz
  • Upload date:
  • Size: 7.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/50.3.0 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.9.0

File hashes

Hashes for bytetrie-0.1.0.tar.gz
Algorithm Hash digest
SHA256 bfc17587534b5431c38d2323dd2c8d053b3c28392b61599c8153ae37167f6f1c
MD5 d6bfb70cc64c009eb5007c2fe494b147
BLAKE2b-256 589d5bde2642a576f9823d9f83728e1d09552094b98ac00af07c90146c94397d

See more details on using hashes here.

File details

Details for the file bytetrie-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: bytetrie-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 7.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/50.3.0 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.9.0

File hashes

Hashes for bytetrie-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 1704d9171921bc693599bdbc28a0431f8ee968ec60bf8a8f78bb398f09d06de4
MD5 0abbc432562d0c6bb8278d52e6c6a809
BLAKE2b-256 7a693264c4e3c8814b4057760c6c1acea2f7cde3f302930e58f198476bbca118

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page