Crunch multi-gigabyte strings with ease
StringZilla 🦖
StringZilla is the Godzilla of string libraries, splitting, sorting, and shuffling large textual datasets faster than you can say "Tokyo Tower" 😅
- Python docs
- C docs
- JavaScript docs
- Rust docs
Performance
StringZilla uses a heuristic so simple it's almost stupid... but it works.
It matches the first few letters of words with hyper-scalar code to achieve memcpy speeds.
The implementation fits into a single C99 header file and uses different SIMD flavors, with SWAR on older platforms.
So if you're haunted by open(...).readlines() and str().splitlines() taking forever, this should help 😊
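To make the "first few letters" heuristic concrete, here is a minimal pure-Python sketch. It is an illustration only, not the library's implementation: the function name `find_with_prefix_filter` and the 4-byte prefix width are assumptions, and pure Python will be far slower than the SIMD/SWAR code described above. The idea is to compare a short prefix as a single integer first and run the full comparison only when the prefix matches.

```python
def find_with_prefix_filter(haystack: bytes, needle: bytes, prefix_len: int = 4) -> int:
    """Return the first offset of needle in haystack, or -1 if absent.

    Hypothetical helper: the first `prefix_len` bytes are compared as one
    integer, and the full comparison runs only on prefix matches.
    """
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    if m > n:
        return -1
    k = min(prefix_len, m)
    # Pack the needle prefix into a single integer once, up front
    needle_prefix = int.from_bytes(needle[:k], "little")
    for i in range(n - m + 1):
        # Cheap check: one integer comparison on the first k bytes
        if int.from_bytes(haystack[i:i + k], "little") == needle_prefix:
            # Expensive check: compare the whole candidate window
            if haystack[i:i + m] == needle:
                return i
    return -1


print(find_with_prefix_filter(b"hello world", b"world"))
```

On most inputs the cheap prefix check rejects a position before the full comparison is ever needed, which is what makes the approach attractive for vectorized code.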
Substring Search
Backend \ Device | IoT | Laptop | Server |
---|---|---|---|
Speed Comparison 🐇 | |||
Python for loop | 4 MB/s | 14 MB/s | 11 MB/s |
C++ for loop | 520 MB/s | 1.0 GB/s | 900 MB/s |
C++ string.find | 560 MB/s | 1.2 GB/s | 1.3 GB/s |
Scalar StringZilla | 2 GB/s | 3.3 GB/s | 3.5 GB/s |
Hyper-Scalar StringZilla | 4.3 GB/s | 12 GB/s | 12.1 GB/s |
Efficiency Metrics 📊 | |||
CPU Specs | 8-core ARM, 0.5 W/core | 8-core Intel, 5.6 W/core | 22-core Intel, 6.3 W/core |
Performance/Core | 2.1 - 3.3 GB/s | 11 GB/s | 10.5 GB/s |
Bytes/Joule | 4.2 GB/J | 2 GB/J | 1.6 GB/J |
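The Bytes/Joule row appears to follow from dividing per-core throughput by per-core power draw. That derivation is an assumption, as the table does not state it, but the arithmetic can be checked with a short sketch. The helper name `bytes_per_joule` is hypothetical, and the IoT entry uses the lower bound of its 2.1 - 3.3 GB/s range.

```python
def bytes_per_joule(throughput_gb_per_s: float, watts_per_core: float) -> float:
    """GB/s divided by W (which is J/s) gives GB/J."""
    return throughput_gb_per_s / watts_per_core


# (per-core throughput in GB/s, per-core power in W), copied from the table
devices = {
    "IoT": (2.1, 0.5),
    "Laptop": (11.0, 5.6),
    "Server": (10.5, 6.3),
}

for name, (throughput, watts) in devices.items():
    print(f"{name}: {bytes_per_joule(throughput, watts):.2f} GB/J")
```

The results land close to the rounded figures in the table, with the low-power ARM device doing the most work per joule.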
Partition & Sort
Coming soon.
Quick Start: Python 🐍
1. Install via pip: pip install stringzilla
2. Import classes: from stringzilla import Str, File, Strs
Basic Usage
StringZilla offers two mostly interchangeable core classes:
from stringzilla import Str, File
text1 = Str('some-string')
text2 = File('some-file.txt')
Str is designed to replace long Python str strings and wraps our C-level API.
File, on the other hand, memory-maps a file from persistent storage without loading a copy into RAM.
The contents of that file remain immutable, and the mapping can be shared by multiple Python processes simultaneously.
A standard dataset pre-processing use case would be to map a sizeable textual dataset like Common Crawl into memory, spawn child processes, and split the job between them.
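The map-then-split pattern can be sketched with the standard library's mmap module standing in for File. This is not StringZilla code: the helper names `chunk_ranges` and `count_newlines` are hypothetical, and a real pipeline would likely align chunk boundaries to line breaks.

```python
import mmap


def chunk_ranges(size: int, parts: int) -> list[tuple[int, int]]:
    """Split [0, size) into at most `parts` contiguous (start, end) ranges."""
    if size == 0:
        return []
    step = -(-size // parts)  # ceiling division
    return [(start, min(start + step, size)) for start in range(0, size, step)]


def count_newlines(path: str, start: int, end: int) -> int:
    """Count newline bytes in [start, end) of a read-only memory-mapped file."""
    with open(path, "rb") as handle:
        # Length 0 maps the whole file; pages are loaded lazily by the OS
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as view:
            count = 0
            position = view.find(b"\n", start, end)
            while position != -1:
                count += 1
                position = view.find(b"\n", position + 1, end)
            return count
```

Each (start, end) pair can be handed to a separate worker, for example through multiprocessing.Pool.starmap. Because every worker maps the same file read-only, the operating system can share the underlying pages between processes.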
Basic Operations
- Length: len(text) -> int
- Indexing: text[42] -> str
- Slicing: text[42:46] -> str
Advanced Operations
'substring' in text -> bool
text.contains('substring', start=0, end=9223372036854775807) -> bool
text.find('substring', start=0, end=9223372036854775807) -> int
text.count('substring', start=0, end=9223372036854775807, allowoverlap=False) -> int
text.splitlines(keeplinebreaks=False, separator='\n') -> Strs
text.split(separator=' ', maxsplit=9223372036854775807, keepseparator=False) -> Strs
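The allowoverlap flag of count is easiest to understand through a reference implementation. The sketch below uses only built-in str methods and reflects an assumption about the intended semantics, not the library's code. The function name `count_occurrences` is hypothetical.

```python
def count_occurrences(text: str, substring: str, allow_overlap: bool = False) -> int:
    """Count matches of substring, optionally letting matches overlap."""
    if not substring:
        raise ValueError("substring must not be empty")
    # After a match, resume either one character later (overlapping)
    # or right past the match (non-overlapping)
    step = 1 if allow_overlap else len(substring)
    count = 0
    position = text.find(substring)
    while position != -1:
        count += 1
        position = text.find(substring, position + step)
    return count


print(count_occurrences("aaaa", "aa"))                      # 2
print(count_occurrences("aaaa", "aa", allow_overlap=True))  # 3
```

The non-overlapping mode matches the behavior of Python's built-in str.count.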
Collection-Level Operations
Once split into a Strs object, you can sort, shuffle, and reorganize the slices.
lines: Strs = text.split(separator='\n')
lines.sort()
lines.shuffle(seed=42)
Need copies?
sorted_copy: Strs = lines.sorted()
shuffled_copy: Strs = lines.shuffled(seed=42)
Basic list-like operations are also supported:
lines.append('Pythonic string')
lines.extend(shuffled_copy)
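One way to picture a collection of slices is as a list of (start, end) offsets into the parent text, so that sorting and shuffling reorder offsets rather than the text itself. The sketch below is plain Python under that assumption. It is not how Strs is implemented, and all three helper names are hypothetical.

```python
import random


def split_offsets(text: str, separator: str = "\n") -> list[tuple[int, int]]:
    """Return (start, end) offsets of the pieces between separators."""
    if not separator:
        raise ValueError("separator must not be empty")
    offsets, start = [], 0
    while True:
        position = text.find(separator, start)
        if position == -1:
            offsets.append((start, len(text)))
            return offsets
        offsets.append((start, position))
        start = position + len(separator)


def sorted_offsets(text: str, offsets: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Return a new list ordered by the text each offset pair refers to."""
    return sorted(offsets, key=lambda span: text[span[0]:span[1]])


def shuffled_offsets(offsets: list[tuple[int, int]], seed: int) -> list[tuple[int, int]]:
    """Return a shuffled copy; the same seed yields the same order."""
    copy = list(offsets)
    random.Random(seed).shuffle(copy)
    return copy


text = "pear\napple\nfig"
lines = split_offsets(text)
print([text[a:b] for a, b in sorted_offsets(text, lines)])
```

Both helpers return copies, mirroring the sorted() and shuffled() variants above, while the parent text stays untouched.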
Quick Start: C 🛠️
There is an ABI-stable C99 interface, in case you have a database, an operating system, or a runtime you want to integrate with StringZilla.
#include "stringzilla.h"
// Initialize your haystack and needle
strzl_haystack_t haystack = {your_text, your_text_length};
strzl_needle_t needle = {your_subtext, your_subtext_length, your_anomaly_offset};
// Perform string-level operations
size_t character_count = strzl_naive_count_char(haystack, 'a');
size_t character_position = strzl_naive_find_char(haystack, 'a');
size_t substring_position = strzl_naive_find_substr(haystack, needle);
// Perform collection-level operations
strzl_array_t array = {your_order, your_count, your_get_begin, your_get_length, your_handle};
strzl_sort(&array, &your_config);
Contributing 👾
Future development plans include:
- Replace PyBind11 with CPython.
- Reverse-order operations in Python #12.
- Bindings for JavaScript #25, Java, and Rust.
- Faster string sorting algorithm.
- Splitting CSV rows into columns.
- Splitting with multiple separators at once #29.
- UTF-8 validation.
- Arm SVE backend.
Here's how to set up your dev environment and run some tests.
Development
CPython:
# Clean up and install
rm -rf build && pip install -e . && pytest scripts/test.py -s -x
# Install without dependencies
pip install -e . --no-index --no-deps
NodeJS:
npm install && node javascript/test.js
Benchmarking
To benchmark on some custom file and pattern combinations:
python scripts/bench.py --haystack_path "your file" --needle "your pattern"
To benchmark on synthetic data:
python scripts/bench.py --haystack_pattern "abcd" --haystack_length 1e9 --needle "abce"
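For a quick baseline without the repository checked out, the synthetic setup can be approximated with the built-in str.find. This sketch is not scripts/bench.py, and it assumes the synthetic haystack is simply the pattern repeated up to the requested length. The helper names `synthetic_haystack` and `measure_find` are hypothetical.

```python
import time


def synthetic_haystack(pattern: str, length: int) -> str:
    """Repeat pattern until the result is exactly `length` characters long."""
    repeats = -(-length // len(pattern))  # ceiling division
    return (pattern * repeats)[:length]


def measure_find(haystack: str, needle: str, repeats: int = 3) -> tuple[int, float]:
    """Return (match position, best observed throughput in characters/second)."""
    best = float("inf")
    position = -1
    for _ in range(repeats):
        started = time.perf_counter()
        position = haystack.find(needle)
        best = min(best, time.perf_counter() - started)
    # Guard against a zero reading on coarse timers
    return position, len(haystack) / max(best, 1e-9)


haystack = synthetic_haystack("abcd", 10**6)
position, rate = measure_find(haystack, "abce")
print(f"position={position}, throughput={rate / 1e6:.1f} M chars/s")
```

A needle such as "abce" shares a long prefix with the pattern but never matches, so the whole haystack is scanned on every run.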
Packaging
To validate packaging:
cibuildwheel --platform linux
Compiling C++ Tests
# Install dependencies
brew install libomp llvm
# Compile and run tests
cmake -B ./build_release \
-DCMAKE_C_COMPILER="/opt/homebrew/opt/llvm/bin/clang" \
-DCMAKE_CXX_COMPILER="/opt/homebrew/opt/llvm/bin/clang++" \
-DSTRINGZILLA_USE_OPENMP=1 \
-DSTRINGZILLA_BUILD_TEST=1 \
&& \
make -C ./build_release -j && ./build_release/stringzilla_test
License 📜
Feel free to use the project under either the Apache 2.0 or the three-clause BSD license, whichever you prefer.
If you like this project, you may also enjoy USearch, UCall, UForm, UStore, SimSIMD, and TenPack 🤗