
Binary Search in plaintext and Gzip files.


BREP: Binary Search in plaintext and gzip files

Search large sorted files in O(log n) time using binary search.
Both plaintext and gzipped files are supported.
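To illustrate the idea, here is a minimal sketch of a prefix binary search over a sorted plaintext file. It is not brep's actual implementation: the `prefix_search` and `_seek_to_line_start` helpers are hypothetical names, the file is assumed to be sorted in byte order with newline-terminated lines, and gzip input is not handled.

```python
import os


def _seek_to_line_start(f, offset):
    """Position f at the first line that starts at or after `offset`."""
    if offset == 0:
        f.seek(0)
    else:
        # Step back one byte so a line starting exactly at `offset` is kept.
        f.seek(offset - 1)
        f.readline()


def prefix_search(path, prefix):
    """Yield lines of a byte-sorted file that start with `prefix` (hypothetical helper)."""
    key = prefix.encode()
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        # Find the smallest offset whose next full line compares >= key.
        while lo < hi:
            mid = (lo + hi) // 2
            _seek_to_line_start(f, mid)
            line = f.readline()
            if not line or line.rstrip(b"\n") >= key:
                hi = mid
            else:
                lo = mid + 1
        # Matching lines are contiguous in a sorted file: read until one fails.
        _seek_to_line_start(f, lo)
        for line in iter(f.readline, b""):
            if not line.startswith(key):
                break
            yield line.rstrip(b"\n").decode()
```

Each iteration halves the byte range, so the number of seeks grows with the logarithm of the file size rather than with the file size itself.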

Benchmark: roughly 8x faster than grep on a 2 GB dataset!

brep is usually faster than grep on datasets larger than 1 GB.

Check tests/benchmark.py to reproduce the results.

grep ^777 test.txt : 1.594 s (15 runs)
brep 777 test.txt : 206.8 ms (15 runs)

Installation

Run pip install brep, or run pip install . from a clone of this repo.

Index your file

In order to conduct binary search, your file needs to be sorted.
We recommend GNU sort, as it's multithreaded and supports large files.
LC_ALL=C sort -u -o output_file input_file
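Before searching, it can help to confirm that a file really is in the order that `LC_ALL=C sort -u` produces, meaning strictly increasing byte order with no duplicate lines. The check below is a small sketch; `is_byte_sorted` is a hypothetical helper, not part of brep.

```python
def is_byte_sorted(path):
    """Return True if lines are in strictly increasing byte order (hypothetical helper)."""
    previous = None
    with open(path, "rb") as f:
        for raw in f:
            line = raw.rstrip(b"\n")
            # Python compares bytes by unsigned byte value, like the C locale.
            if previous is not None and line <= previous:
                return False
            previous = line
    return True
```

The check streams the file once, so it uses constant memory even on very large inputs.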

BREP supports files compressed in the gzip format.
We recommend pigz for fast multicore compression: pigz file
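If pigz is not available, Python's standard `gzip` module also writes standard gzip files, though on a single core. This is a sketch only: `gzip_file` is a hypothetical helper, and this README does not state whether brep places any extra requirements on how the archive is produced.

```python
import gzip
import shutil


def gzip_file(src, dst):
    """Compress src into dst in gzip format (hypothetical helper, single-threaded)."""
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        # Stream in chunks so large files never have to fit in memory.
        shutil.copyfileobj(fin, fout)
```

Sort the file first, then compress it, so that the decompressed content stays in sorted order.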

Usage

Provide one prefix search term and one file path:
brep 77777 test/large.gz

You can also search using the Python class:

from brep import Search

for result in Search("77777", "test/large.gz"):
    print(result)

Contribute

PRs are welcome!

Install dev dependencies: pip install -e .[dev]
Test and lint before submitting: pytest && flake8

Todo

  • Reimplement in Rust
  • Faster gz size estimation
  • Search multiple strings at once


