
Crunch 100+ GB Strings in Python with ease


StringZilla 🦖

StringZilla is the Godzilla of string libraries, splitting, sorting, and shuffling large textual datasets faster than you can say "Tokyo Tower" 😅

Performance

StringZilla uses a heuristic so simple it's almost stupid... but it works. It matches the first few letters of words with hyper-scalar code to achieve memcpy speeds. The implementation fits into a single C99 header file and uses different SIMD flavors, falling back to SWAR on older platforms. So if you're haunted by open(...).readlines() and str().splitlines() taking forever, this should help 😊
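
To make the heuristic concrete, here is a minimal Python sketch of prefix filtering. It is not StringZilla's C code: the function name find_with_prefix_filter and the four-byte window are assumptions chosen for illustration. The idea is to compare the first four bytes of the needle as a single integer at every offset and to run the full comparison only when that cheap check succeeds.

```python
def find_with_prefix_filter(haystack: bytes, needle: bytes) -> int:
    """Return the first offset of `needle` in `haystack`, or -1 if absent."""
    n, m = len(haystack), len(needle)
    if m < 4:
        # Too short for a 4-byte prefix; fall back to the built-in search.
        return haystack.find(needle)

    # Pack the first four bytes of the needle into one integer.
    prefix = int.from_bytes(needle[:4], "little")
    tail = needle[4:]

    for i in range(n - m + 1):
        # One integer comparison rejects most offsets cheaply.
        window = int.from_bytes(haystack[i:i + 4], "little")
        if window == prefix and haystack[i + 4:i + m] == tail:
            return i
    return -1
```

In C the same comparison is a single 32-bit load and compare, and a SIMD register can test many offsets at once, which is where the speedup would come from.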

Substring Search

Backend \ Device          | IoT                      | Laptop                   | Server
Speed Comparison 🐇
Python for loop           | 4 MB/s                   | 14 MB/s                  | 11 MB/s
C++ for loop              | 520 MB/s                 | 1.0 GB/s                 | 900 MB/s
C++ string.find           | 560 MB/s                 | 1.2 GB/s                 | 1.3 GB/s
Scalar StringZilla        | 2 GB/s                   | 3.3 GB/s                 | 3.5 GB/s
Hyper-Scalar StringZilla  | 4.3 GB/s                 | 12 GB/s                  | 12.1 GB/s
Efficiency Metrics 📊
CPU Specs                 | 8-core ARM, 0.5 W/core   | 8-core Intel, 5.6 W/core | 22-core Intel, 6.3 W/core
Performance/Core          | 2.1 - 3.3 GB/s           | 11 GB/s                  | 10.5 GB/s
Bytes/Joule               | 4.2 GB/J                 | 2 GB/J                   | 1.6 GB/J
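
The Bytes/Joule row looks like per-core throughput divided by per-core power draw. That reading is inferred from the numbers and is not stated in the table. A quick check, using the hypothetical helper gigabytes_per_joule and the lower bound of the IoT range:

```python
def gigabytes_per_joule(gb_per_second: float, watts: float) -> float:
    # 1 W equals 1 J/s, so (GB/s) / (J/s) gives GB/J.
    return gb_per_second / watts


iot = gigabytes_per_joule(2.1, 0.5)      # 8-core ARM
laptop = gigabytes_per_joule(11, 5.6)    # 8-core Intel
server = gigabytes_per_joule(10.5, 6.3)  # 22-core Intel

print(f"{iot:.2f} {laptop:.2f} {server:.2f}")
```

Under that assumption the results are 4.20, 1.96, and 1.67 GB/J, close to the 4.2, 2, and 1.6 GB/J listed.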

Partition & Sort

Coming soon.

Quick Start: Python 🐍

1. Install via pip: pip install stringzilla
2. Import classes: from stringzilla import Str, File, Strs

Basic Usage

StringZilla offers two mostly interchangeable core classes:

from stringzilla import Str, File

text1 = Str('some-string')
text2 = File('some-file.txt')

Str is designed to replace long Python str strings and wraps our C-level API. File, on the other hand, memory-maps a file from persistent storage without loading a copy of it into RAM. The contents of that file remain immutable, and the mapping can be shared by multiple Python processes simultaneously. A standard dataset pre-processing use case is to map a sizeable textual dataset like Common Crawl into memory, spawn child processes, and split the job between them.
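
The map-and-split pattern can be sketched with the standard library alone. The example below uses Python's mmap module as a stand-in for File, and the helper names chunk_ranges and count_newlines are hypothetical. Each byte range is independent, so each call could be handed to a separate child process.

```python
import mmap


def chunk_ranges(total: int, parts: int) -> list:
    """Split [0, total) into at most `parts` contiguous (start, end) ranges."""
    if total == 0:
        return []
    step = -(-total // parts)  # ceiling division
    return [(start, min(start + step, total)) for start in range(0, total, step)]


def count_newlines(path: str, start: int, end: int) -> int:
    """Count newline bytes in [start, end) of a read-only memory-mapped file."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        count = 0
        pos = mm.find(b"\n", start, end)
        while pos != -1:
            count += 1
            pos = mm.find(b"\n", pos + 1, end)
        return count
```

Because every worker opens its own read-only mapping, the operating system can back all of them with the same cached pages instead of one copy per process.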

Basic Operations

  • Length: len(text) -> int
  • Indexing: text[42] -> str
  • Slicing: text[42:46] -> str

Advanced Operations

  • 'substring' in text -> bool
  • text.contains('substring', start=0, end=9223372036854775807) -> bool
  • text.find('substring', start=0, end=9223372036854775807) -> int
  • text.count('substring', start=0, end=9223372036854775807, allowoverlap=False) -> int
  • text.splitlines(keeplinebreaks=False, separator='\n') -> Strs
  • text.split(separator=' ', maxsplit=9223372036854775807, keepseparator=False) -> Strs
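
The allowoverlap flag on count is easiest to see with a reference implementation on a plain Python str. The sketch below assumes the flag means what its name suggests: when it is False, the search resumes after the end of each match, and when it is True, it resumes one character after the start. The helper count_occurrences is hypothetical and is not part of StringZilla.

```python
def count_occurrences(text: str, sub: str, allowoverlap: bool = False) -> int:
    """Count matches of `sub` in `text`, optionally letting matches overlap."""
    if not sub:
        raise ValueError("substring must not be empty")

    # Overlapping matches advance by one character, others by a full match.
    step = 1 if allowoverlap else len(sub)
    count = 0
    pos = text.find(sub)
    while pos != -1:
        count += 1
        pos = text.find(sub, pos + step)
    return count
```

With these semantics, 'aa' occurs twice in 'aaaa' without overlap and three times with it.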

Collection-Level Operations

Once split into a Strs object, you can sort, shuffle, and reorganize the slices.

lines: Strs = text.split(separator='\n')
lines.sort()
lines.shuffle(seed=42)

Need copies?

sorted_copy: Strs = lines.sorted()
shuffled_copy: Strs = lines.shuffled(seed=42)

Basic list-like operations are also supported:

lines.append('Pythonic string')
lines.extend(shuffled_copy)
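
The split between in-place methods (sort, shuffle) and copying methods (sorted, shuffled) mirrors list.sort() versus sorted() in plain Python. The sketch below shows the same contract on a regular list. It uses random.Random as a stand-in for the seeded shuffle, so the resulting order will not match what StringZilla produces for the same seed, and the helper shuffled is hypothetical.

```python
import random


def shuffled(items: list, seed: int) -> list:
    """Return a shuffled copy; the same seed always gives the same order."""
    copy = list(items)
    random.Random(seed).shuffle(copy)
    return copy


lines = ["pear", "apple", "fig"]

sorted_copy = sorted(lines)             # new list, `lines` is untouched
shuffled_copy = shuffled(lines, seed=42)

lines.sort()                            # reorders `lines` itself
```

A fixed seed makes a shuffle reproducible across runs, which matters when several processes must agree on the same ordering of a dataset.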

Quick Start: C 🛠️

There is an ABI-stable C99 interface, in case you have a database, an operating system, or a runtime you want to integrate with StringZilla.

#include "stringzilla.h"

// Initialize your haystack and needle
strzl_haystack_t haystack = {your_text, your_text_length};
strzl_needle_t needle = {your_subtext, your_subtext_length, your_anomaly_offset};

// Perform string-level operations
size_t character_count = strzl_naive_count_char(haystack, 'a');
size_t character_position = strzl_naive_find_char(haystack, 'a');
size_t substring_position = strzl_naive_find_substr(haystack, needle);

// Perform collection-level operations
strzl_array_t array = {your_order, your_count, your_get_begin, your_get_length, your_handle};
strzl_sort(&array, &your_config);
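
The field names in the strzl_array_t initializer suggest that sorting permutes an array of indices and reaches each string through accessor callbacks, so the strings themselves never move. That is an assumption drawn from the names alone. The Python sketch below shows the pattern with a hypothetical helper, sort_order:

```python
def sort_order(order: list, get_begin, get_length, buffer: bytes) -> None:
    """Sort `order` in place by the slices of `buffer` that its indices refer to."""

    def slice_for(index: int) -> bytes:
        begin = get_begin(index)
        return buffer[begin:begin + get_length(index)]

    order.sort(key=slice_for)


# Three strings packed into one buffer, described by offsets and lengths.
buffer = b"pear\napple\nfig"
begins = [0, 5, 11]
lengths = [4, 5, 3]

order = [0, 1, 2]
sort_order(order, begins.__getitem__, lengths.__getitem__, buffer)
```

After the call, order holds the indices in lexicographic order of the slices they refer to, while buffer is unchanged.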

Contributing 👾

Future development plans include:

  • Faster string sorting algorithm.
  • Bindings for JavaScript, Java, and Rust.
  • Support for reverse-order operations in Python.
  • Splitting CSV rows into columns.
  • Arm SVE backend.

Here's how to set up your dev environment and run some tests.

Development

# Clean up and install
rm -rf build && pip install -e . && pytest scripts/test.py -s -x

# Install without dependencies
pip install -e . --no-index --no-deps

Benchmarking

To benchmark on some custom file and pattern combinations:

python scripts/bench.py --haystack_path "your file" --needle "your pattern"

To benchmark on synthetic data:

python scripts/bench.py --haystack_pattern "abcd" --haystack_length 1e9 --needle "abce"
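
As a rough picture of the synthetic case, the sketch below assumes that --haystack_pattern is repeated until the text reaches --haystack_length. That is a guess from the flag names and has not been checked against scripts/bench.py. It times the built-in str.find on a much smaller input, with hypothetical helpers make_haystack and measure_find:

```python
import time


def make_haystack(pattern: str, length: int) -> str:
    """Repeat `pattern` and trim the result to exactly `length` characters."""
    repeats = -(-length // len(pattern))  # ceiling division
    return (pattern * repeats)[:length]


def measure_find(haystack: str, needle: str):
    """Return the match position and the elapsed time in seconds."""
    start = time.perf_counter()
    position = haystack.find(needle)
    return position, time.perf_counter() - start


haystack = make_haystack("abcd", 1_000_000)
position, seconds = measure_find(haystack, "abce")
print(position, f"{len(haystack) / max(seconds, 1e-9) / 1e9:.2f} GB/s")
```

The needle "abce" never occurs in repeated "abcd" but shares its first three characters, so the search has to scan the whole text and repeatedly gets past a partial match before failing.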

Packaging

To validate packaging:

cibuildwheel --platform linux

Compiling C++ Tests

# Install dependencies
brew install libomp llvm

# Compile and run tests
cmake -B ./build_release \
    -DCMAKE_C_COMPILER="/opt/homebrew/opt/llvm/bin/clang" \
    -DCMAKE_CXX_COMPILER="/opt/homebrew/opt/llvm/bin/clang++" \
    -DSTRINGZILLA_USE_OPENMP=1 \
    -DSTRINGZILLA_BUILD_TEST=1 \
    && \
    make -C ./build_release -j && ./build_release/stringzilla_test

License 📜

Feel free to use the project under Apache 2.0 or the Three-clause BSD license at your preference.


If you like this project, you may also enjoy USearch, UCall, UForm, UStore, SimSIMD, and TenPack 🤗


