Stringzilla 🦖

Crunch 100+ GB Strings in Python with ease, leveraging SIMD Assembly

Stringzilla began years ago as a tutorial on SIMD-accelerated string processing. Later, while processing 100+ GB of chemistry and AI datasets, I turned it into a library. It is designed to replace open(...).readlines(), str().splitlines(), and many other common operations on very long strings.

Benchmark                 IoT         Arm Laptop   x86 Server
Python: str.find          4 MB/s      14 MB/s      11 MB/s
C++: std::string::find    560 MB/s    1.2 GB/s     1.3 GB/s
Stringzilla               4.3 GB/s    12 GB/s      12.1 GB/s

Usage

pip install stringzilla

Two classes, Str and File, can be used interchangeably; Strs is the list-like container returned by the splitting methods:

from stringzilla import Str, File, Strs

text: str = 'some-string'
text: Str = Str('some-string')
text: File = File('some-file.txt')

Once constructed, the following interfaces are supported:

len(text) -> int
'substring' in text -> bool
text[42] -> str

text.contains(
    'substring',
    start=0, # optional
    end=9223372036854775807, # optional
) -> bool

text.find(
    'substring',
    start=0, # optional
    end=9223372036854775807, # optional
) -> int
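The start and end arguments mirror the optional bounds of Python's built-in str.find, and the default end of 9223372036854775807 equals 2**63 - 1. As a rough illustration, assuming Stringzilla follows the built-in windowing semantics (an assumption, not something stated here), the pure-Python helpers below show how the window restricts a search. The names find_in_window and contains_in_window are hypothetical and not part of the library.

```python
# Pure-Python reference sketch; `find_in_window` and `contains_in_window`
# are hypothetical helpers, not part of Stringzilla.
END_DEFAULT = 2**63 - 1  # 9223372036854775807, the default `end` shown above


def find_in_window(text: str, needle: str, start: int = 0, end: int = END_DEFAULT) -> int:
    """Return the lowest index of `needle` within text[start:end], or -1."""
    return text.find(needle, start, end)


def contains_in_window(text: str, needle: str, start: int = 0, end: int = END_DEFAULT) -> bool:
    """True when `needle` occurs entirely inside text[start:end]."""
    return find_in_window(text, needle, start, end) != -1


haystack = "needle in a haystack with another needle"
print(find_in_window(haystack, "needle"))             # 0: the first match
print(find_in_window(haystack, "needle", start=1))    # index of the second match
print(contains_in_window(haystack, "needle", 1, 20))  # False: no full match in haystack[1:20]
```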

text.count(
    'substring',
    start=0, # optional
    end=9223372036854775807, # optional
    *, # keyword-only, non-traditional arguments:
    allowoverlap=False, # optional
) -> int
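The allowoverlap flag has no counterpart in str.count, so a small reference may help. The sketch below is pure Python and only illustrates the presumed meaning: with overlap disabled, the search resumes after each match; with overlap enabled, it resumes one character later. The function name count_occurrences is hypothetical, and the semantics are an assumption rather than a description of Stringzilla's internals.

```python
def count_occurrences(text: str, needle: str, allowoverlap: bool = False) -> int:
    """Count matches of `needle`, optionally letting matches share characters."""
    if not needle:
        raise ValueError("needle must be non-empty")
    count = 0
    position = text.find(needle)
    while position != -1:
        count += 1
        # Overlapping search resumes one character later; non-overlapping
        # search resumes right after the current match.
        step = 1 if allowoverlap else len(needle)
        position = text.find(needle, position + step)
    return count


print(count_occurrences("aaaa", "aa"))                     # 2, same as "aaaa".count("aa")
print(count_occurrences("aaaa", "aa", allowoverlap=True))  # 3: matches start at 0, 1, and 2
```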

text.splitlines(
    keeplinebreaks=False, # optional
    *, # keyword-only, non-traditional arguments:
    separator='\n', # optional
) -> Strs # similar to list[str]

text.split(
    separator=' ', # optional
    maxsplit=9223372036854775807, # optional
    *, # keyword-only, non-traditional arguments:
    keepseparator=False, # optional
) -> Strs # similar to list[str]
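To make the non-traditional keepseparator and separator arguments more concrete, here is a pure-Python reference sketch. It assumes that a kept separator stays attached to the end of the piece that precedes it, and that splitlines drops the empty tail after a trailing separator the way str.splitlines does. Both are assumptions. The helper names split_reference and splitlines_reference are hypothetical, and the helpers return plain lists rather than Strs.

```python
from typing import List


def split_reference(text: str, separator: str = " ", maxsplit: int = 2**63 - 1,
                    keepseparator: bool = False) -> List[str]:
    """Split `text` on `separator`, optionally keeping it on each piece."""
    if not separator:
        raise ValueError("separator must be non-empty")
    parts: List[str] = []
    start = 0
    while len(parts) < maxsplit:
        position = text.find(separator, start)
        if position == -1:
            break
        # Assumption: a kept separator stays attached to the preceding piece.
        stop = position + len(separator) if keepseparator else position
        parts.append(text[start:stop])
        start = position + len(separator)
    parts.append(text[start:])  # tail after the last split (may be empty)
    return parts


def splitlines_reference(text: str, keeplinebreaks: bool = False,
                         separator: str = "\n") -> List[str]:
    """Split into lines on a single custom separator."""
    lines = split_reference(text, separator, keepseparator=keeplinebreaks)
    # Like str.splitlines, drop the empty tail produced by a trailing separator.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


print(split_reference("a b c", keepseparator=True))          # ['a ', 'b ', 'c']
print(splitlines_reference("x;y;", separator=";"))           # ['x', 'y']
print(splitlines_reference("x\ny\n", keeplinebreaks=True))   # ['x\n', 'y\n']
```

Unlike str.splitlines, this sketch recognizes only the one separator it is given, which is what a configurable separator argument suggests.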

Development

rm -rf build && pip install -e . && pytest scripts/test.py -s -x

pip install -e . --no-index --no-deps

To benchmark on some custom file and pattern combinations:

python scripts/bench.py --haystack_path "your file" --needle "your pattern"

To benchmark on synthetic data:

python scripts/bench.py --haystack_pattern "abcd" --haystack_length 1e9 --needle "abce"
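For a quick baseline without the repository checked out, the idea behind the synthetic run can be approximated with the standard library alone. The sketch below is not scripts/bench.py and makes no claim about how that script works; make_haystack and measure_find_throughput are hypothetical helpers that repeat a pattern to a target length and time the built-in str.find. The needle "abce" never occurs in repeated "abcd", so every search scans the whole haystack. The length here is reduced to one million characters to keep the run short.

```python
import time


def make_haystack(pattern: str, length: int) -> str:
    """Repeat `pattern` until the result is exactly `length` characters long."""
    repeats = length // len(pattern) + 1
    return (pattern * repeats)[:length]


def measure_find_throughput(haystack: str, needle: str, repeats: int = 5) -> float:
    """Return the best observed str.find rate in characters per second."""
    best = float("inf")
    for _ in range(repeats):
        started = time.perf_counter()
        haystack.find(needle)
        elapsed = time.perf_counter() - started
        best = min(best, elapsed)
    # Guard against a zero reading from a coarse timer.
    return len(haystack) / max(best, 1e-9)


haystack = make_haystack("abcd", 1_000_000)
rate = measure_find_throughput(haystack, "abce")
# For an ASCII haystack, one character is one byte, so this reads as MB/s.
print(f"str.find scanned {len(haystack)} characters at {rate / 1e6:.1f} MB/s")
```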

To validate packaging:

cibuildwheel --platform linux

To compile the C++ tests (the commands below assume macOS with Homebrew LLVM):

brew install libomp llvm
cmake -B ./build_release \
    -DCMAKE_C_COMPILER="/opt/homebrew/opt/llvm/bin/clang" \
    -DCMAKE_CXX_COMPILER="/opt/homebrew/opt/llvm/bin/clang++" \
    -DSTRINGZILLA_USE_OPENMP=1 \
    -DSTRINGZILLA_BUILD_TEST=1 \
    && \
    make -C ./build_release -j && ./build_release/stringzilla_test

Source Distributions

No source distribution files available for this release.

Built Distributions

stringzilla-1.0.2-cp311-cp311-win_amd64.whl (104.2 kB): CPython 3.11, Windows x86-64
stringzilla-1.0.2-cp311-cp311-manylinux_2_28_x86_64.whl (248.4 kB): CPython 3.11, manylinux (glibc 2.28+) x86-64
stringzilla-1.0.2-cp311-cp311-manylinux_2_28_aarch64.whl (243.2 kB): CPython 3.11, manylinux (glibc 2.28+) ARM64
stringzilla-1.0.2-cp311-cp311-macosx_11_0_arm64.whl (133.8 kB): CPython 3.11, macOS 11.0+ ARM64
stringzilla-1.0.2-cp311-cp311-macosx_10_9_x86_64.whl (138.8 kB): CPython 3.11, macOS 10.9+ x86-64
stringzilla-1.0.2-cp311-cp311-macosx_10_9_universal2.whl (270.4 kB): CPython 3.11, macOS 10.9+ universal2 (ARM64, x86-64)
stringzilla-1.0.2-cp310-cp310-win_amd64.whl (103.0 kB): CPython 3.10, Windows x86-64
stringzilla-1.0.2-cp310-cp310-manylinux_2_28_x86_64.whl (247.3 kB): CPython 3.10, manylinux (glibc 2.28+) x86-64
stringzilla-1.0.2-cp310-cp310-manylinux_2_28_aarch64.whl (242.0 kB): CPython 3.10, manylinux (glibc 2.28+) ARM64
stringzilla-1.0.2-cp310-cp310-macosx_11_0_arm64.whl (132.5 kB): CPython 3.10, macOS 11.0+ ARM64
stringzilla-1.0.2-cp310-cp310-macosx_10_9_x86_64.whl (137.3 kB): CPython 3.10, macOS 10.9+ x86-64
stringzilla-1.0.2-cp310-cp310-macosx_10_9_universal2.whl (267.5 kB): CPython 3.10, macOS 10.9+ universal2 (ARM64, x86-64)
stringzilla-1.0.2-cp39-cp39-win_amd64.whl (103.1 kB): CPython 3.9, Windows x86-64
stringzilla-1.0.2-cp39-cp39-manylinux_2_28_x86_64.whl (247.6 kB): CPython 3.9, manylinux (glibc 2.28+) x86-64
stringzilla-1.0.2-cp39-cp39-manylinux_2_28_aarch64.whl (242.2 kB): CPython 3.9, manylinux (glibc 2.28+) ARM64
stringzilla-1.0.2-cp39-cp39-macosx_11_0_arm64.whl (132.7 kB): CPython 3.9, macOS 11.0+ ARM64
stringzilla-1.0.2-cp39-cp39-macosx_10_9_x86_64.whl (137.4 kB): CPython 3.9, macOS 10.9+ x86-64
stringzilla-1.0.2-cp39-cp39-macosx_10_9_universal2.whl (267.8 kB): CPython 3.9, macOS 10.9+ universal2 (ARM64, x86-64)
stringzilla-1.0.2-cp38-cp38-win_amd64.whl (102.9 kB): CPython 3.8, Windows x86-64
stringzilla-1.0.2-cp38-cp38-manylinux_2_28_x86_64.whl (247.4 kB): CPython 3.8, manylinux (glibc 2.28+) x86-64
stringzilla-1.0.2-cp38-cp38-manylinux_2_28_aarch64.whl (241.8 kB): CPython 3.8, manylinux (glibc 2.28+) ARM64
stringzilla-1.0.2-cp38-cp38-macosx_11_0_arm64.whl (132.4 kB): CPython 3.8, macOS 11.0+ ARM64
stringzilla-1.0.2-cp38-cp38-macosx_10_9_x86_64.whl (137.1 kB): CPython 3.8, macOS 10.9+ x86-64
stringzilla-1.0.2-cp38-cp38-macosx_10_9_universal2.whl (267.2 kB): CPython 3.8, macOS 10.9+ universal2 (ARM64, x86-64)
stringzilla-1.0.2-cp37-cp37m-win_amd64.whl (102.9 kB): CPython 3.7m, Windows x86-64
stringzilla-1.0.2-cp37-cp37m-manylinux_2_28_x86_64.whl (249.5 kB): CPython 3.7m, manylinux (glibc 2.28+) x86-64
stringzilla-1.0.2-cp37-cp37m-manylinux_2_28_aarch64.whl (244.9 kB): CPython 3.7m, manylinux (glibc 2.28+) ARM64
stringzilla-1.0.2-cp37-cp37m-macosx_10_9_x86_64.whl (132.0 kB): CPython 3.7m, macOS 10.9+ x86-64
stringzilla-1.0.2-cp36-cp36m-win_amd64.whl (102.7 kB): CPython 3.6m, Windows x86-64
stringzilla-1.0.2-cp36-cp36m-manylinux_2_28_x86_64.whl (249.6 kB): CPython 3.6m, manylinux (glibc 2.28+) x86-64
stringzilla-1.0.2-cp36-cp36m-manylinux_2_28_aarch64.whl (244.9 kB): CPython 3.6m, manylinux (glibc 2.28+) ARM64
stringzilla-1.0.2-cp36-cp36m-macosx_10_9_x86_64.whl (131.8 kB): CPython 3.6m, macOS 10.9+ x86-64
