Skip to main content

Crunch multi-gigabyte strings with ease

Project description

StringZilla 🦖

StringZilla is the Godzilla of string libraries, splitting, sorting, and shuffling large textual datasets faster than you can say "Tokyo Tower" 😅

Performance

StringZilla uses a heuristic so simple it's almost stupid... but it works. It matches the first few letters of words with hyper-scalar code to achieve memcpy speeds. The implementation fits into a single C 99 header file and uses different SIMD flavors and SWAR on older platforms. So if you're haunted by open(...).readlines() and str().splitlines() taking forever, this should help 😊

Substring Search

Backend \ Device IoT Laptop Server
Speed Comparison 🐇
Python for loop 4 MB/s 14 MB/s 11 MB/s
C++ for loop 520 MB/s 1.0 GB/s 900 MB/s
C++ string.find 560 MB/s 1.2 GB/s 1.3 GB/s
Scalar StringZilla 2 GB/s 3.3 GB/s 3.5 GB/s
Hyper-Scalar StringZilla 4.3 GB/s 12 GB/s 12.1 GB/s
Efficiency Metrics 📊
CPU Specs 8-core ARM, 0.5 W/core 8-core Intel, 5.6 W/core 22-core Intel, 6.3 W/core
Performance/Core 2.1 - 3.3 GB/s 11 GB/s 10.5 GB/s
Bytes/Joule 4.2 GB/J 2 GB/J 1.6 GB/J

Partition & Sort

Coming soon.

Quick Start: Python 🐍

  1. Install via pip: pip install stringzilla
  2. Import the classes you need: from stringzilla import Str, Strs, File

Basic Usage

StringZilla offers two mostly interchangeable core classes:

from stringzilla import Str, File

text_from_str = Str('some-string')
text_from_file = Str(File('some-file.txt'))

The Str is designed to replace long Python str strings and wrap our C-level API. On the other hand, the File memory-maps a file from persistent memory without loading its copy into RAM. The contents of that file would remain immutable, and the mapping can be shared by multiple Python processes simultaneously. A standard dataset pre-processing use case would be to map a sizeable textual dataset like Common Crawl into memory, spawn child processes, and split the job between them.

Basic Operations

  • Length: len(text) -> int
  • Indexing: text[42] -> str
  • Slicing: text[42:46] -> Str
  • String conversion: str(text) -> str
  • Substring check: 'substring' in text -> bool
  • Hashing: hash(text) -> int

Advanced Operations

  • text.contains('substring', start=0, end=9223372036854775807) -> bool
  • text.find('substring', start=0, end=9223372036854775807) -> int
  • text.count('substring', start=0, end=9223372036854775807, allowoverlap=False) -> int
  • text.splitlines(keeplinebreaks=False, separator='\n') -> Strs
  • text.split(separator=' ', maxsplit=9223372036854775807, keepseparator=False) -> Strs

Collection-Level Operations

Once split into a Strs object, you can sort, shuffle, and reorganize the slices.

lines: Strs = text.split(separator='\n')
lines.sort()
lines.shuffle(seed=42)

Need copies?

sorted_copy: Strs = lines.sorted()
shuffled_copy: Strs = lines.shuffled(seed=42)

Basic list-like operations are also supported:

lines.append('Pythonic string')
lines.extend(shuffled_copy)

Low-Level Python API

The StringZilla CPython bindings implement vector-call conventions for faster calls.

import stringzilla as sz

contains: bool = sz.contains("haystack", "needle", start=0, end=9223372036854775807)
offset: int = sz.find("haystack", "needle", start=0, end=9223372036854775807)
count: int = sz.count("haystack", "needle", start=0, end=9223372036854775807, allowoverlap=False)
levenstein: int = sz.levenstein("needle", "nidl")

Quick Start: C 🛠️

There is an ABI-stable C 99 interface, in case you have a database, an operating system, or a runtime you want to integrate with StringZilla.

#include "stringzilla.h"

// Initialize your haystack and needle
sz_string_view_t haystack = {your_text, your_text_length};
sz_string_view_t needle = {your_subtext, your_subtext_length};

// Perform string-level operations
sz_size_t character_count = sz_count_char(haystack.start, haystack.length, "a");
sz_size_t substring_position = sz_find_substring(haystack.start, haystack.length, needle.start, needle.length);

// Hash strings
sz_u32_t crc32 = sz_hash_crc32(haystack.start, haystack.length);

// Perform collection level operations
sz_sequence_t array = {your_order, your_count, your_get_start, your_get_length, your_handle};
sz_sort(&array, &your_config);

Contributing 👾

Future development plans include:

Here's how to set up your dev environment and run some tests.

Development

CPython:

# Clean up, install, and test!
rm -rf build && pip install -e . && pytest scripts/ -s -x

# Install without dependencies
pip install -e . --no-index --no-deps

NodeJS:

npm install && npm test

Benchmarking

To benchmark on some custom file and pattern combinations:

python scripts/bench.py --haystack_path "your file" --needle "your pattern"

To benchmark on synthetic data:

python scripts/bench.py --haystack_pattern "abcd" --haystack_length 1e9 --needle "abce"

Packaging

To validate packaging:

cibuildwheel --platform linux

Compiling C++ Tests

cmake -B ./build_release -DSTRINGZILLA_BUILD_TEST=1 && make -C ./build_release -j && ./build_release/stringzilla_test

On MacOS it's recommended to use non-default toolchain:

# Install dependencies
brew install libomp llvm

# Compile and run tests
cmake -B ./build_release \
    -DCMAKE_C_COMPILER="/opt/homebrew/opt/llvm/bin/clang" \
    -DCMAKE_CXX_COMPILER="/opt/homebrew/opt/llvm/bin/clang++" \
    -DSTRINGZILLA_USE_OPENMP=1 \
    -DSTRINGZILLA_BUILD_TEST=1 \
    && \
    make -C ./build_release -j && ./build_release/stringzilla_test

License 📜

Feel free to use the project under Apache 2.0 or the Three-clause BSD license at your preference.


If you like this project, you may also enjoy USearch, UCall, UForm, UStore, SimSIMD, and TenPack 🤗

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

stringzilla-2.0.1-cp312-cp312-win_amd64.whl (31.3 kB view hashes)

Uploaded CPython 3.12 Windows x86-64

stringzilla-2.0.1-cp312-cp312-manylinux_2_28_x86_64.whl (203.4 kB view hashes)

Uploaded CPython 3.12 manylinux: glibc 2.28+ x86-64

stringzilla-2.0.1-cp312-cp312-manylinux_2_28_aarch64.whl (201.2 kB view hashes)

Uploaded CPython 3.12 manylinux: glibc 2.28+ ARM64

stringzilla-2.0.1-cp312-cp312-macosx_11_0_arm64.whl (27.3 kB view hashes)

Uploaded CPython 3.12 macOS 11.0+ ARM64

stringzilla-2.0.1-cp312-cp312-macosx_10_9_x86_64.whl (29.5 kB view hashes)

Uploaded CPython 3.12 macOS 10.9+ x86-64

stringzilla-2.0.1-cp312-cp312-macosx_10_9_universal2.whl (48.2 kB view hashes)

Uploaded CPython 3.12 macOS 10.9+ universal2 (ARM64, x86-64)

stringzilla-2.0.1-cp311-cp311-win_amd64.whl (31.2 kB view hashes)

Uploaded CPython 3.11 Windows x86-64

stringzilla-2.0.1-cp311-cp311-manylinux_2_28_x86_64.whl (202.4 kB view hashes)

Uploaded CPython 3.11 manylinux: glibc 2.28+ x86-64

stringzilla-2.0.1-cp311-cp311-manylinux_2_28_aarch64.whl (200.8 kB view hashes)

Uploaded CPython 3.11 manylinux: glibc 2.28+ ARM64

stringzilla-2.0.1-cp311-cp311-macosx_11_0_arm64.whl (27.2 kB view hashes)

Uploaded CPython 3.11 macOS 11.0+ ARM64

stringzilla-2.0.1-cp311-cp311-macosx_10_9_x86_64.whl (29.4 kB view hashes)

Uploaded CPython 3.11 macOS 10.9+ x86-64

stringzilla-2.0.1-cp311-cp311-macosx_10_9_universal2.whl (48.0 kB view hashes)

Uploaded CPython 3.11 macOS 10.9+ universal2 (ARM64, x86-64)

stringzilla-2.0.1-cp310-cp310-win_amd64.whl (30.3 kB view hashes)

Uploaded CPython 3.10 Windows x86-64

stringzilla-2.0.1-cp310-cp310-manylinux_2_28_x86_64.whl (201.5 kB view hashes)

Uploaded CPython 3.10 manylinux: glibc 2.28+ x86-64

stringzilla-2.0.1-cp310-cp310-manylinux_2_28_aarch64.whl (199.8 kB view hashes)

Uploaded CPython 3.10 manylinux: glibc 2.28+ ARM64

stringzilla-2.0.1-cp310-cp310-macosx_11_0_arm64.whl (27.2 kB view hashes)

Uploaded CPython 3.10 macOS 11.0+ ARM64

stringzilla-2.0.1-cp310-cp310-macosx_10_9_x86_64.whl (29.4 kB view hashes)

Uploaded CPython 3.10 macOS 10.9+ x86-64

stringzilla-2.0.1-cp310-cp310-macosx_10_9_universal2.whl (48.0 kB view hashes)

Uploaded CPython 3.10 macOS 10.9+ universal2 (ARM64, x86-64)

stringzilla-2.0.1-cp39-cp39-win_amd64.whl (31.3 kB view hashes)

Uploaded CPython 3.9 Windows x86-64

stringzilla-2.0.1-cp39-cp39-manylinux_2_28_x86_64.whl (200.8 kB view hashes)

Uploaded CPython 3.9 manylinux: glibc 2.28+ x86-64

stringzilla-2.0.1-cp39-cp39-manylinux_2_28_aarch64.whl (198.9 kB view hashes)

Uploaded CPython 3.9 manylinux: glibc 2.28+ ARM64

stringzilla-2.0.1-cp39-cp39-macosx_11_0_arm64.whl (27.2 kB view hashes)

Uploaded CPython 3.9 macOS 11.0+ ARM64

stringzilla-2.0.1-cp39-cp39-macosx_10_9_x86_64.whl (29.4 kB view hashes)

Uploaded CPython 3.9 macOS 10.9+ x86-64

stringzilla-2.0.1-cp39-cp39-macosx_10_9_universal2.whl (48.0 kB view hashes)

Uploaded CPython 3.9 macOS 10.9+ universal2 (ARM64, x86-64)

stringzilla-2.0.1-cp38-cp38-win_amd64.whl (30.4 kB view hashes)

Uploaded CPython 3.8 Windows x86-64

stringzilla-2.0.1-cp38-cp38-manylinux_2_28_x86_64.whl (200.0 kB view hashes)

Uploaded CPython 3.8 manylinux: glibc 2.28+ x86-64

stringzilla-2.0.1-cp38-cp38-manylinux_2_28_aarch64.whl (198.1 kB view hashes)

Uploaded CPython 3.8 manylinux: glibc 2.28+ ARM64

stringzilla-2.0.1-cp38-cp38-macosx_11_0_arm64.whl (27.2 kB view hashes)

Uploaded CPython 3.8 macOS 11.0+ ARM64

stringzilla-2.0.1-cp38-cp38-macosx_10_9_x86_64.whl (29.4 kB view hashes)

Uploaded CPython 3.8 macOS 10.9+ x86-64

stringzilla-2.0.1-cp38-cp38-macosx_10_9_universal2.whl (48.0 kB view hashes)

Uploaded CPython 3.8 macOS 10.9+ universal2 (ARM64, x86-64)

stringzilla-2.0.1-cp37-cp37m-win_amd64.whl (30.3 kB view hashes)

Uploaded CPython 3.7m Windows x86-64

stringzilla-2.0.1-cp37-cp37m-manylinux_2_28_x86_64.whl (198.5 kB view hashes)

Uploaded CPython 3.7m manylinux: glibc 2.28+ x86-64

stringzilla-2.0.1-cp37-cp37m-manylinux_2_28_aarch64.whl (196.2 kB view hashes)

Uploaded CPython 3.7m manylinux: glibc 2.28+ ARM64

stringzilla-2.0.1-cp37-cp37m-macosx_10_9_x86_64.whl (29.2 kB view hashes)

Uploaded CPython 3.7m macOS 10.9+ x86-64

stringzilla-2.0.1-cp36-cp36m-win_amd64.whl (30.2 kB view hashes)

Uploaded CPython 3.6m Windows x86-64

stringzilla-2.0.1-cp36-cp36m-manylinux_2_28_x86_64.whl (198.5 kB view hashes)

Uploaded CPython 3.6m manylinux: glibc 2.28+ x86-64

stringzilla-2.0.1-cp36-cp36m-manylinux_2_28_aarch64.whl (196.2 kB view hashes)

Uploaded CPython 3.6m manylinux: glibc 2.28+ ARM64

stringzilla-2.0.1-cp36-cp36m-macosx_10_9_x86_64.whl (29.2 kB view hashes)

Uploaded CPython 3.6m macOS 10.9+ x86-64

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page