Crunch 100+ GB Strings in Python with ease
StringZilla 🦖
StringZilla is the Godzilla of string libraries, splitting, sorting, and shuffling large textual datasets faster than you can say "Tokyo Tower" 😅
- Python docs
- C docs
- JavaScript docs
- Rust docs
Performance
StringZilla uses a heuristic so simple it's almost stupid... but it works.
It matches the first few letters of words with hyper-scalar code to achieve memcpy speeds.
The implementation fits into a single C99 header file and uses different SIMD flavors, with SWAR on older platforms.
So if you're haunted by open(...).readlines() and str().splitlines() taking forever, this should help 😊
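To make the "first few letters" idea concrete, here is a minimal scalar sketch in pure Python. It is an illustration only, not StringZilla's implementation: the function name `find_with_prefix_filter` and the choice of a 4-byte prefix are assumptions made for this example. The point is that one cheap integer comparison rejects most positions before the full needle is ever compared.

```python
def find_with_prefix_filter(haystack: bytes, needle: bytes) -> int:
    """Return the first offset of `needle` in `haystack`, or -1 if absent."""
    n = len(needle)
    if n == 0:
        return 0
    if n < 4:
        # Too short to form a 4-byte prefix; defer to the built-in search.
        return haystack.find(needle)
    # Pack the first four bytes of the needle into one integer.
    prefix = int.from_bytes(needle[:4], "little")
    for i in range(len(haystack) - n + 1):
        # Cheap filter: compare four bytes at once as a single integer.
        if int.from_bytes(haystack[i:i + 4], "little") == prefix:
            # Only candidates that pass the filter pay for the full comparison.
            if haystack[i + 4:i + n] == needle[4:]:
                return i
    return -1


print(find_with_prefix_filter(b"abcdabce", b"abce"))
```

A SIMD or SWAR version would apply the same filter to many offsets per instruction, which is where the speedup over this scalar loop would come from.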
Substring Search
Backend \ Device | IoT | Laptop | Server |
---|---|---|---|
Speed Comparison 🐇 | | | |
Python for loop | 4 MB/s | 14 MB/s | 11 MB/s |
C++ for loop | 520 MB/s | 1.0 GB/s | 900 MB/s |
C++ string.find | 560 MB/s | 1.2 GB/s | 1.3 GB/s |
Scalar StringZilla | 2 GB/s | 3.3 GB/s | 3.5 GB/s |
Hyper-Scalar StringZilla | 4.3 GB/s | 12 GB/s | 12.1 GB/s |
Efficiency Metrics 📊 | | | |
CPU Specs | 8-core ARM, 0.5 W/core | 8-core Intel, 5.6 W/core | 22-core Intel, 6.3 W/core |
Performance/Core | 2.1 - 3.3 GB/s | 11 GB/s | 10.5 GB/s |
Bytes/Joule | 4.2 GB/J | 2 GB/J | 1.6 GB/J |
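The Bytes/Joule row looks roughly consistent with dividing per-core throughput by per-core power, since a watt is one joule per second. That derivation is an assumption, as the table does not state how the figures were computed. The sketch below redoes the arithmetic with the table's own numbers, using the lower bound of the IoT range.

```python
# (throughput per core in GB/s, power per core in W), copied from the table.
# For IoT, the lower bound of the 2.1 - 3.3 GB/s range is used.
devices = {
    "IoT": (2.1, 0.5),
    "Laptop": (11.0, 5.6),
    "Server": (10.5, 6.3),
}

# GB/s divided by J/s leaves GB/J.
gb_per_joule = {name: gbps / watts for name, (gbps, watts) in devices.items()}

for name, value in gb_per_joule.items():
    print(f"{name}: {value:.2f} GB/J")
```

The results land close to the table's 4.2, 2, and 1.6 GB/J, with small differences that may come from rounding.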
Partition & Sort
Coming soon.
Quick Start: Python 🐍
1. Install via pip: pip install stringzilla
2. Import classes: from stringzilla import Str, File, Strs
Basic Usage
StringZilla offers two mostly interchangeable core classes:
from stringzilla import Str, File
text1 = Str('some-string')
text2 = File('some-file.txt')
The Str is designed to replace long Python str strings and wrap our C-level API.
On the other hand, the File memory-maps a file from persistent storage without loading a copy of it into RAM.
The contents of that file remain immutable, and the mapping can be shared by multiple Python processes simultaneously.
A standard dataset pre-processing use case would be to map a sizeable textual dataset like Common Crawl into memory, spawn child processes, and split the job between them.
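The pre-processing pattern above can be sketched with the standard library alone. This example uses Python's `mmap` module as a stand-in for `File`, so it shows the general technique rather than StringZilla's API. The helper names `chunk_ranges` and `count_newlines` are hypothetical. Each worker maps the same file read-only and touches only its own byte range.

```python
import mmap
import os
import tempfile


def chunk_ranges(size: int, parts: int) -> list:
    """Split [0, size) into at most `parts` contiguous byte ranges."""
    if size <= 0 or parts <= 0:
        return []
    step = -(-size // parts)  # ceiling division
    return [(lo, min(lo + step, size)) for lo in range(0, size, step)]


def count_newlines(path: str, start: int, end: int) -> int:
    """Map the file read-only and count newline bytes in [start, end)."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            count = 0
            pos = view.find(b"\n", start, end)
            while pos != -1:
                count += 1
                pos = view.find(b"\n", pos + 1, end)
            return count


# Build a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"".join(b"line %d\n" % i for i in range(1000)))
    path = tmp.name

ranges = chunk_ranges(os.path.getsize(path), parts=4)
# Each (start, end) pair could be handed to a separate worker process,
# for example via multiprocessing.Pool.starmap; here they run sequentially.
total = sum(count_newlines(path, lo, hi) for lo, hi in ranges)
os.remove(path)
print(len(ranges), total)
```

Counting a single byte keeps the ranges independent. A multi-byte pattern would need extra care at the range boundaries, where a match can straddle two workers.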
Basic Operations
- Length: len(text) -> int
- Indexing: text[42] -> str
- Slicing: text[42:46] -> str
Advanced Operations
'substring' in text -> bool
text.contains('substring', start=0, end=9223372036854775807) -> bool
text.find('substring', start=0, end=9223372036854775807) -> int
text.count('substring', start=0, end=9223372036854775807, allowoverlap=False) -> int
text.splitlines(keeplinebreaks=False, separator='\n') -> Strs
text.split(separator=' ', maxsplit=9223372036854775807, keepseparator=False) -> Strs
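The allowoverlap flag of count is easiest to see on a tiny input. The pure-Python sketch below illustrates the semantics the flag name suggests; that reading is an assumption, and the code operates on built-in str rather than calling StringZilla. The helper name `count_occurrences` is hypothetical.

```python
def count_occurrences(text: str, sub: str, allowoverlap: bool = False) -> int:
    """Count matches of `sub`, optionally letting matches share characters."""
    if not sub:
        raise ValueError("the substring must not be empty")
    # After a match, resume one character later (overlapping)
    # or right past the match (non-overlapping).
    step = 1 if allowoverlap else len(sub)
    count = 0
    pos = text.find(sub)
    while pos != -1:
        count += 1
        pos = text.find(sub, pos + step)
    return count


print(count_occurrences("aaaa", "aa"))                     # matches at 0 and 2
print(count_occurrences("aaaa", "aa", allowoverlap=True))  # matches at 0, 1, 2
```

With allowoverlap=False the result agrees with Python's built-in str.count.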
Collection-Level Operations
Once split into a Strs object, you can sort, shuffle, and reorganize the slices.
lines: Strs = text.split(separator='\n')
lines.sort()
lines.shuffle(seed=42)
Need copies?
sorted_copy: Strs = lines.sorted()
shuffled_copy: Strs = lines.shuffled(seed=42)
Basic list-like operations are also supported:
lines.append('Pythonic string')
lines.extend(shuffled_copy)
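One way to picture sorting and shuffling slices is as reordering (start, end) offsets into a single shared buffer, leaving the text itself in place. The sketch below illustrates that idea with the standard library. Whether Strs works this way internally is an assumption, and the helper names `split_spans` and `materialize` are hypothetical.

```python
import random


def split_spans(text: str, separator: str) -> list:
    """Return (start, end) offsets of the pieces between separators."""
    if not separator:
        raise ValueError("the separator must not be empty")
    spans, start = [], 0
    while True:
        pos = text.find(separator, start)
        if pos == -1:
            spans.append((start, len(text)))
            return spans
        spans.append((start, pos))
        start = pos + len(separator)


def materialize(text: str, spans: list) -> list:
    """Turn offsets back into strings, for display only."""
    return [text[lo:hi] for lo, hi in spans]


text = "pear\napple\nfig\nbanana"
spans = split_spans(text, "\n")

# Sorted copy: only the offsets are reordered; `text` is never modified.
sorted_spans = sorted(spans, key=lambda span: text[span[0]:span[1]])

# Seeded shuffles of copies: the same seed yields the same order.
shuffled_a = spans[:]
random.Random(42).shuffle(shuffled_a)
shuffled_b = spans[:]
random.Random(42).shuffle(shuffled_b)

print(materialize(text, sorted_spans))
print(shuffled_a == shuffled_b)
```

Fixing the seed makes a shuffle reproducible across runs, which is useful when a dataset split has to be regenerated later.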
Quick Start: C 🛠️
There is an ABI-stable C99 interface, in case you have a database, an operating system, or a runtime you want to integrate with StringZilla.
#include "stringzilla.h"
// Initialize your haystack and needle
strzl_haystack_t haystack = {your_text, your_text_length};
strzl_needle_t needle = {your_subtext, your_subtext_length, your_anomaly_offset};
// Perform string-level operations
size_t character_count = strzl_naive_count_char(haystack, 'a');
size_t character_position = strzl_naive_find_char(haystack, 'a');
size_t substring_position = strzl_naive_find_substr(haystack, needle);
// Perform collection-level operations
strzl_array_t array = {your_order, your_count, your_get_begin, your_get_length, your_handle};
strzl_sort(&array, &your_config);
Contributing 👾
Future development plans include:
- Faster string sorting algorithm.
- Bindings for JavaScript, Java, and Rust.
- Support for reverse-order operations in Python.
- Splitting CSV rows into columns.
- Arm SVE backend.
Here's how to set up your dev environment and run some tests.
Development
# Clean up and install
rm -rf build && pip install -e . && pytest scripts/test.py -s -x
# Install without dependencies
pip install -e . --no-index --no-deps
Benchmarking
To benchmark on some custom file and pattern combinations:
python scripts/bench.py --haystack_path "your file" --needle "your pattern"
To benchmark on synthetic data:
python scripts/bench.py --haystack_pattern "abcd" --haystack_length 1e9 --needle "abce"
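For intuition about the synthetic benchmark, the sketch below builds a haystack by repeating a pattern and times a search for a needle that never occurs, which forces a full scan. It is not the code of scripts/bench.py: the helper names `synthetic_haystack` and `throughput_gb_per_s` are hypothetical, it searches with Python's built-in str.find, and the length is kept far below 1e9 so it runs quickly.

```python
import time


def synthetic_haystack(pattern: str, length: int) -> str:
    """Repeat `pattern` until the text is exactly `length` characters long."""
    repeats = -(-length // len(pattern))  # ceiling division
    return (pattern * repeats)[:length]


def throughput_gb_per_s(haystack: str, needle: str) -> float:
    """Time one full search and convert it to GB/s (ASCII: 1 char = 1 byte)."""
    start = time.perf_counter()
    haystack.find(needle)
    elapsed = time.perf_counter() - start
    return len(haystack) / elapsed / 1e9 if elapsed > 0 else float("inf")


haystack = synthetic_haystack("abcd", 1_000_000)
# "abce" shares a three-letter prefix with the pattern but never matches,
# so the search has to scan the whole haystack.
position = haystack.find("abce")
print(position, f"{throughput_gb_per_s(haystack, 'abce'):.2f} GB/s")
```

A single timed run is noisy, so a real measurement would repeat the search several times and report the median or the best run.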
Packaging
To validate packaging:
cibuildwheel --platform linux
Compiling C++ Tests
# Install dependencies
brew install libomp llvm
# Compile and run tests
cmake -B ./build_release \
-DCMAKE_C_COMPILER="/opt/homebrew/opt/llvm/bin/clang" \
-DCMAKE_CXX_COMPILER="/opt/homebrew/opt/llvm/bin/clang++" \
-DSTRINGZILLA_USE_OPENMP=1 \
-DSTRINGZILLA_BUILD_TEST=1 \
&& \
make -C ./build_release -j && ./build_release/stringzilla_test
License 📜
Feel free to use the project under Apache 2.0 or the Three-clause BSD license at your preference.
If you like this project, you may also enjoy USearch, UCall, UForm, UStore, SimSIMD, and TenPack 🤗