Crunch 100+ GB Strings in Python with ease
Stringzilla 🦖
Crunch 100+ GB Strings in Python with ease, leveraging SIMD Assembly
Stringzilla was born many years ago as a tutorial for SIMD-accelerated string processing.
But one day, while processing 100+ GB of Chemistry and AI datasets, I transformed it into a library.
It's designed to replace open(...).readlines(), str().splitlines(), and many other typical operations on very long strings.
Usage
pip install stringzilla
There are two classes you can use interchangeably, Str and File. A third class, Strs, is the collection returned when you split either of them:
from stringzilla import Str, File, Strs
text: str = 'some-string'
text: Str = Str('some-string')
text: File = File('some-file.txt')
Once constructed, the following interfaces are supported:
len(text) -> int
'substring' in text -> bool
text[42] -> str
text[42:46] -> str
text.contains(
    'substring',
    start=0,  # optional
    end=9223372036854775807,  # optional
) -> bool
text.find(
    'substring',
    start=0,  # optional
    end=9223372036854775807,  # optional
) -> int
text.count(
    'substring',
    start=0,  # optional
    end=9223372036854775807,  # optional
    *,  # non-traditional arguments:
    allowoverlap=False,  # optional
) -> int
text.splitlines(
    keeplinebreaks=False,  # optional
    *,  # non-traditional arguments:
    separator='\n',  # optional
) -> Strs  # similar to list[str]
text.split(
    separator=' ',  # optional
    maxsplit=9223372036854775807,  # optional
    *,  # non-traditional arguments:
    keepseparator=False,  # optional
) -> Strs  # similar to list[str]
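The non-traditional arguments above have no direct counterpart on Python's built-in str, so their meaning may be easier to see in plain Python. The sketch below is not Stringzilla code: count_occurrences and split_keeping_separator are hypothetical helpers, and the assumption that a kept separator stays attached to the end of the preceding part is ours, not something the listing confirms.

```python
def count_occurrences(haystack: str, needle: str, allow_overlap: bool = False) -> int:
    """Count matches of `needle`, optionally letting matches overlap."""
    if not needle:
        raise ValueError("needle must be non-empty")
    count = 0
    position = haystack.find(needle)
    while position != -1:
        count += 1
        # Advance by one character to permit overlaps, or past the whole match otherwise.
        step = 1 if allow_overlap else len(needle)
        position = haystack.find(needle, position + step)
    return count


def split_keeping_separator(text: str, separator: str = " ") -> list[str]:
    """Split `text`, leaving each separator attached to the part before it (assumed behavior)."""
    if not separator:
        raise ValueError("separator must be non-empty")
    parts = []
    start = 0
    position = text.find(separator, start)
    while position != -1:
        end = position + len(separator)
        parts.append(text[start:end])  # the slice includes the separator itself
        start = end
        position = text.find(separator, start)
    parts.append(text[start:])  # trailing part, which has no separator after it
    return parts


print(count_occurrences("aaaa", "aa"))                      # non-overlapping matches
print(count_occurrences("aaaa", "aa", allow_overlap=True))  # overlapping matches
print(split_keeping_separator("a b c"))
```

With the default, "aa" is found twice in "aaaa"; with overlaps allowed it is found three times, which is the distinction the allowoverlap flag appears to expose.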
Once split, you can sort, shuffle, and perform other collection-level operations on strings:
lines: Strs = text.split(separator='\n')
lines.sort()
lines.shuffle(seed=42)
sorted_copy: Strs = lines.sorted()
shuffled_copy: Strs = lines.shuffled(seed=42)
lines.append(shuffled_copy.pop(0))
lines.append('Pythonic string')
lines.extend(shuffled_copy)
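For comparison, the same sequence of operations can be sketched on a plain Python list. This is only a standard-library analogue: shuffled is a hypothetical helper, and nothing here implies that Stringzilla's seeded shuffle yields the same order as Python's random module. It shows the split between in-place methods (sort, shuffle) and copy-returning ones (sorted, shuffled), and why a fixed seed makes the shuffle reproducible.

```python
import random


def shuffled(items: list[str], seed: int) -> list[str]:
    """Return a shuffled copy; the same seed always yields the same order."""
    copy = list(items)
    random.Random(seed).shuffle(copy)  # private generator, global random state is untouched
    return copy


lines = "pear\napple\nfig".split("\n")

sorted_copy = sorted(lines)               # new list, `lines` keeps its original order
shuffled_copy = shuffled(lines, seed=42)  # new list, reproducible for a given seed

lines.sort()                              # in place
lines.append(shuffled_copy.pop(0))        # move one element across collections
lines.append("Pythonic string")
lines.extend(shuffled_copy)

print(sorted_copy)
print(len(lines))
```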
Development
rm -rf build && pip install -e . && pytest scripts/test.py -s -x
pip install -e . --no-index --no-deps
To benchmark on some custom file and pattern combinations:
python scripts/bench.py --haystack_path "your file" --needle "your pattern"
To benchmark on synthetic data:
python scripts/bench.py --haystack_pattern "abcd" --haystack_length 1e9 --needle "abce"
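A rough idea of what a synthetic benchmark of this kind measures, sketched with the standard library only. The contents of scripts/bench.py are not shown here, so build_haystack and time_find are hypothetical names and the approach (repeat the pattern up to the requested length, then time one search) is an assumption. The needle "abce" never occurs in repeated "abcd", so the search has to scan the entire haystack.

```python
import time


def build_haystack(pattern: str, length: int) -> str:
    """Repeat `pattern` until the result is exactly `length` characters long."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    repeats = length // len(pattern) + 1
    return (pattern * repeats)[:length]


def time_find(haystack: str, needle: str) -> tuple[int, float]:
    """Return the position reported by str.find and the seconds it took."""
    start = time.perf_counter()
    position = haystack.find(needle)
    elapsed = time.perf_counter() - start
    return position, elapsed


# A much smaller haystack than 1e9 characters, so the sketch runs quickly.
haystack = build_haystack("abcd", 1_000_000)
position, seconds = time_find(haystack, "abce")
print(f"position={position}, seconds={seconds:.6f}")
```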
To validate packaging:
cibuildwheel --platform linux
Compiling C++ tests:
brew install libomp llvm
cmake -B ./build_release \
-DCMAKE_C_COMPILER="/opt/homebrew/opt/llvm/bin/clang" \
-DCMAKE_CXX_COMPILER="/opt/homebrew/opt/llvm/bin/clang++" \
-DSTRINGZILLA_USE_OPENMP=1 \
-DSTRINGZILLA_BUILD_TEST=1 \
&& \
make -C ./build_release -j && ./build_release/stringzilla_test