Skip to main content

Fast and compact maps and sets with Billions of keys, based on finite-state-transducers.

Project description

Ducer

This package provides Python bindings for the excellent Rust crate fst by Andrew Gallant. Map and Set classes allow building compact representations from sorted Python iterables.

Performance

Ducer maps and sets can be built and queried at millions of keys per second. Consider the following example:

n = 1_000_000_000
items = ((b"%09d" % i, n-i) for i in range(n))
data = ducer.Map.build(items, ":memory:")
print(len(data))

Creating this map takes about 4 minutes on my humble laptop, which translates to almost 4 Million items per second. About 3 minutes is spent in Python just creating the item tuples. This is a pathological scenario, the resulting output is just 464 bytes. A real-world example with 1.1 Billion keys, where the msgpacked key-value pairs occupy 21 GiB (without any kind of searchability), results in a 4.6 GiB file.

Limitations

Performance is rarely free, so there are some limitations you should consider before proceeding:

  • Keys must be bytes
  • Keys must be inserted in lexicographical order
  • Map values must be positive integers less than 2^64
  • Once built, maps and sets cannot be altered

Usage

Building & opening

Above, we created a map with data stored in memory. If you want to store your map in a file, you can also give a path instead:

ducer.Map.build(items, "path/to/my.map")

Building a map like this is a very memory-efficient. You can also stream a map without loading all data into memory, e.g., using the builtin mmap:

with open("path/to/my.map", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    mm.madvise(mmap.MADV_RANDOM)
    m = ducer.Map(mm)

Access

Other than being read-only, maps and sets behave the same as the builtin Python dict and set. Please open an issue if you find something that doesn't work. For map m: Map and key k: bytes, the following all work as intended:

k in m  # bool
m[k]  # value except KeyError
m.get(k)  # value or None
m.get(k, 0)  # value or 0
len(m)  # number of items
for k in m:  # iterate over keys
    pass
for v in m.values():  # iterate over values
    pass
for k, v in m.items():  # iterate over items
    pass

TODO advanced search & operations

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ducer-0.2.1.tar.gz (15.8 kB view hashes)

Uploaded Source

Built Distributions

ducer-0.2.1-cp312-cp312-win_amd64.whl (255.4 kB view hashes)

Uploaded CPython 3.12 Windows x86-64

ducer-0.2.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (328.1 kB view hashes)

Uploaded CPython 3.12 manylinux: glibc 2.17+ x86-64

ducer-0.2.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (309.5 kB view hashes)

Uploaded CPython 3.12 manylinux: glibc 2.17+ ARM64

ducer-0.2.1-cp312-cp312-macosx_11_0_arm64.whl (292.9 kB view hashes)

Uploaded CPython 3.12 macOS 11.0+ ARM64

ducer-0.2.1-cp312-cp312-macosx_10_9_x86_64.whl (306.1 kB view hashes)

Uploaded CPython 3.12 macOS 10.9+ x86-64

ducer-0.2.1-cp311-cp311-win_amd64.whl (256.7 kB view hashes)

Uploaded CPython 3.11 Windows x86-64

ducer-0.2.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (327.9 kB view hashes)

Uploaded CPython 3.11 manylinux: glibc 2.17+ x86-64

ducer-0.2.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (309.7 kB view hashes)

Uploaded CPython 3.11 manylinux: glibc 2.17+ ARM64

ducer-0.2.1-cp311-cp311-macosx_11_0_arm64.whl (292.6 kB view hashes)

Uploaded CPython 3.11 macOS 11.0+ ARM64

ducer-0.2.1-cp311-cp311-macosx_10_9_x86_64.whl (306.9 kB view hashes)

Uploaded CPython 3.11 macOS 10.9+ x86-64

ducer-0.2.1-cp310-cp310-win_amd64.whl (256.7 kB view hashes)

Uploaded CPython 3.10 Windows x86-64

ducer-0.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (327.9 kB view hashes)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

ducer-0.2.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (309.7 kB view hashes)

Uploaded CPython 3.10 manylinux: glibc 2.17+ ARM64

ducer-0.2.1-cp310-cp310-macosx_11_0_arm64.whl (292.6 kB view hashes)

Uploaded CPython 3.10 macOS 11.0+ ARM64

ducer-0.2.1-cp310-cp310-macosx_10_9_x86_64.whl (306.8 kB view hashes)

Uploaded CPython 3.10 macOS 10.9+ x86-64

ducer-0.2.1-cp39-cp39-win_amd64.whl (257.2 kB view hashes)

Uploaded CPython 3.9 Windows x86-64

ducer-0.2.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (328.4 kB view hashes)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

ducer-0.2.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (310.1 kB view hashes)

Uploaded CPython 3.9 manylinux: glibc 2.17+ ARM64

ducer-0.2.1-cp39-cp39-macosx_11_0_arm64.whl (293.1 kB view hashes)

Uploaded CPython 3.9 macOS 11.0+ ARM64

ducer-0.2.1-cp39-cp39-macosx_10_9_x86_64.whl (307.3 kB view hashes)

Uploaded CPython 3.9 macOS 10.9+ x86-64

ducer-0.2.1-cp38-cp38-win_amd64.whl (257.2 kB view hashes)

Uploaded CPython 3.8 Windows x86-64

ducer-0.2.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (328.7 kB view hashes)

Uploaded CPython 3.8 manylinux: glibc 2.17+ x86-64

ducer-0.2.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (310.2 kB view hashes)

Uploaded CPython 3.8 manylinux: glibc 2.17+ ARM64

ducer-0.2.1-cp38-cp38-macosx_11_0_arm64.whl (293.2 kB view hashes)

Uploaded CPython 3.8 macOS 11.0+ ARM64

ducer-0.2.1-cp38-cp38-macosx_10_9_x86_64.whl (307.4 kB view hashes)

Uploaded CPython 3.8 macOS 10.9+ x86-64

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page