Project description

G-Zeus

G-Zeus is a package that chunk-reads gzipped text files lightning fast.

What is this package for?

This package is designed for workloads that:

  1. Need to read data from a very large .csv.gz file.

  2. Apply additional rules while reading, working chunk by chunk to keep memory usage low.

  3. Call for a more customizable experience than a package like polars_streaming_csv_decompression, provided you know what you are doing.

This package provides a Chunker class that reads a gzip-compressed text file in chunks. For CSV files, each chunk is a valid decompressed CSV file, and only the first chunk carries the header row, if headers are present. The Chunker produces these chunks in a streaming fashion, thus minimizing memory load.

This package can potentially be used to stream other large gzipped text files as well, but it is not capable of semantic chunking, which is often needed when preparing text for LLMs. It only chunks by finding the last needle (the new-line character) in the haystack (the text in the current buffer).
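The "last needle in the haystack" idea can be sketched in pure Python (this is an illustration of the technique, not GZeus internals): decompress into a buffer, cut at the last newline, and carry the remainder into the next buffer.

```python
import gzip
import io

def chunk_lines(compressed: bytes, buffer_size: int, needle: bytes = b"\n"):
    """Yield decompressed chunks, each cut at the last newline in the buffer."""
    leftover = b""
    with gzip.open(io.BytesIO(compressed)) as f:
        while True:
            piece = f.read(buffer_size)
            if not piece:               # EOF: flush whatever is left
                if leftover:
                    yield leftover
                return
            buf = leftover + piece
            cut = buf.rfind(needle)     # last needle in the haystack
            if cut == -1:               # no newline yet: keep buffering
                leftover = buf
                continue
            yield buf[: cut + 1]        # complete lines only
            leftover = buf[cut + 1 :]

data = gzip.compress(b"id,val\na,1\nb,2\nc,3\n")
chunks = list(chunk_lines(data, buffer_size=8))
assert b"".join(chunks) == data and False or b"".join(chunks) == gzip.decompress(data)
```

Note that only the first chunk contains the header line, which is why the CSV snippets below scan the first chunk with headers and the rest with `has_header=False`.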

Assumptions

The new_line_symbol provided by the user appears in the underlying text file only as a line separator.
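This assumption can fail for CSVs with quoted multi-line fields, which RFC 4180 permits; a byte-level split could then land inside a quoted field and cut one logical row in half. A quick illustration with the standard library:

```python
import csv
import io

# A row whose first field legally contains a newline inside quotes.
buf = io.StringIO()
csv.writer(buf).writerow(["hello\nworld", "x"])
text = buf.getvalue()   # '"hello\nworld",x\r\n'

# One logical row, but two "\n" bytes: splitting at the last newline
# could break the quoted field into two invalid records.
assert text.count("\n") == 2
assert len(list(csv.reader(io.StringIO(text)))) == 1
```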

We get decompressed bytes in chunks. Then what?

Most of the time, we only need to extract part of the data from a large .csv.gz file. This is where the combination of GZeus and Polars really shines.

If you have Polars installed already:

from gzeus import stream_polars_csv_gz

for output_of_your_func in stream_polars_csv_gz("PATH TO YOUR DATA", func=your_func):
    ...  # do work on the output of your_func

where your_func should have the signature pl.LazyFrame -> Any. If you need more control over the iteration and the data bytes, you can structure your code as below:

from gzeus import Chunker
import polars as pl

# Turn the LazyFrame scanned from a chunk of bytes into a DataFrame.
def bytes_into_df(df: pl.LazyFrame) -> pl.DataFrame:
    return df.filter(
        pl.col("City_Category") == 'A'
    ).select("City_Category", "Primary_Bank_Type", "Source").collect()

ck = (
    Chunker(buffer_size=1_000_000, new_line_symbol='\n')
    .with_local_file("../data/test.csv.gz")
)

df_temp = pl.scan_csv(ck.read_one()) # first chunk
schema = df_temp.collect_schema() # Infer schema from first chunk
dfs = [bytes_into_df(df_temp)]

dfs.extend(
    bytes_into_df(
        pl.scan_csv(byte_chunk, has_header=False, schema=schema)
    )
    for byte_chunk in ck.chunks()
)

df = pl.concat(dfs)
df.head()

Performance vs. Pandas


It is extremely hard to make an apples-to-apples comparison with other tools. Here I will focus on the comparison with pandas.read_csv, which has an iterator option. Note: GZeus chunks are defined by byte size, while the pandas.read_csv iterator yields a fixed number of rows per chunk.

However, generally speaking, I find that for .csv.gz files:

  1. GZeus + Polars takes at least 50% less time than pd.read_csv, with zero additional work on each chunk.
  2. With a larger buffer size, GZeus + Polars can take as little as 1/5 of the time of pandas.read_csv.
  3. The gap widens with more workload per chunk (mostly because of Polars).

Cloud Files

Supporting "chunk reads" from the major cloud providers is no easy task. Not only would it require an async interface in Rust, which is much harder to write and maintain, but there are also performance issues with fetching only a small chunk of data at a time. To name a few:

  1. An increased number of calls to the storage service.
  2. Repeatedly opening the file and seeking to the last read position.
  3. Rate-limit issues, especially behind a VPN. E.g., for good performance, gzeus needs to read 10MB+ per chunk, but this increases "packets per second" significantly.

A workaround is to use temp files. For example, with AWS S3, one can do the following:

import tempfile
import boto3

s3 = boto3.client('s3')

tmp = tempfile.NamedTemporaryFile()
s3.download_fileobj('amzn-s3-demo-bucket', 'OBJECT_NAME', tmp)
tmp.flush()  # make sure all downloaded bytes are on disk before re-reading
df = chunk_load_data_using_gzeus(tmp.name)  # a wrapper for the code shown above
tmp.close()  # the temp file is deleted on close

The machine will almost always have enough disk space for the temp file. Inside chunk_load_data_using_gzeus, data is still read in chunks and therefore won't lead to OOM errors; it can be any wrapper around the stream_polars_csv_gz function provided by the package.

Other Projects to Check Out

  1. Dataframe-friendly data analysis package polars_ds
  2. For a more sophisticated and feature-complete package for streaming csv.gz, see polars_streaming_csv_decompression

Download files

Download the file for your platform.

Source Distribution

gzeus-0.1.2.tar.gz (16.8 kB)

Uploaded: Source

Built Distributions


gzeus-0.1.2-cp39-abi3-win_amd64.whl (184.4 kB)

Uploaded: CPython 3.9+, Windows x86-64

gzeus-0.1.2-cp39-abi3-manylinux_2_24_aarch64.whl (261.7 kB)

Uploaded: CPython 3.9+, manylinux: glibc 2.24+, ARM64

gzeus-0.1.2-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (263.3 kB)

Uploaded: CPython 3.9+, manylinux: glibc 2.17+, x86-64

gzeus-0.1.2-cp39-abi3-macosx_11_0_arm64.whl (234.5 kB)

Uploaded: CPython 3.9+, macOS 11.0+, ARM64

gzeus-0.1.2-cp39-abi3-macosx_10_12_x86_64.whl (249.8 kB)

Uploaded: CPython 3.9+, macOS 10.12+, x86-64

File details

Details for the file gzeus-0.1.2.tar.gz.

File metadata

  • Download URL: gzeus-0.1.2.tar.gz
  • Upload date:
  • Size: 16.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: maturin/1.9.6

File hashes

Hashes for gzeus-0.1.2.tar.gz

  • SHA256: 5a0bf8309d72efc35b882a7ea055aaa891d196d244912406dfb19278e0ad1657
  • MD5: c5d960d46e907203e2f3087cfa0cfab5
  • BLAKE2b-256: 394fdb12dff2cd9768747aae4c5391db4923c1b8ccb60ed8bfede7c88c2c6f69

File details

Details for the file gzeus-0.1.2-cp39-abi3-win_amd64.whl.

File metadata

  • Download URL: gzeus-0.1.2-cp39-abi3-win_amd64.whl
  • Upload date:
  • Size: 184.4 kB
  • Tags: CPython 3.9+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: maturin/1.9.6

File hashes

Hashes for gzeus-0.1.2-cp39-abi3-win_amd64.whl

  • SHA256: c3a3e2b61e8ef6a02ce0508e9b5d76cb70048437f3955c503e6de9ec6466abb5
  • MD5: d5ee53157b1bd12f5ca3743ec25ee23b
  • BLAKE2b-256: 352ebe1269d186b0f5f21546b4c17161819c12b1fe37115627b9b8491b03c7a4

File details

Details for the file gzeus-0.1.2-cp39-abi3-manylinux_2_24_aarch64.whl.

File hashes

Hashes for gzeus-0.1.2-cp39-abi3-manylinux_2_24_aarch64.whl

  • SHA256: 8010d5e86f8008681ea48484b8e95920f93a38f3d5207cecdeee435ce1962bdf
  • MD5: ee0b9b5ca4987a43372a6152fc63e771
  • BLAKE2b-256: 027d065843de61902a405c821bd509d45ca9c553e25a73e70ee8260221e52d2e

File details

Details for the file gzeus-0.1.2-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for gzeus-0.1.2-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

  • SHA256: bedbecdcce93d018da883641cb766400ec659e008c2ce9ba882b2af69e879b95
  • MD5: 73f047592b710b2935d43e6f96b6e524
  • BLAKE2b-256: f9bbfdd01e5f3e9696af72a77f4aab04a4f9353750a8dec98d3323796318c2e1

File details

Details for the file gzeus-0.1.2-cp39-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for gzeus-0.1.2-cp39-abi3-macosx_11_0_arm64.whl

  • SHA256: f743d49375f8e7e4ee2185943d50cd4e44eb4504db2bb61e296b5947aa364b9e
  • MD5: a6076b0d060357b3df195260c70a12b4
  • BLAKE2b-256: 6fb6efe64ccf6b6a3d8a9e802874112837da501a1c1c753f3d0b56d6a0fa9c85

File details

Details for the file gzeus-0.1.2-cp39-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for gzeus-0.1.2-cp39-abi3-macosx_10_12_x86_64.whl

  • SHA256: 2123b826599133c856db4dd95d1788dc780922bfd8e01a953e20100d2c1c38fd
  • MD5: a8ecc1569108d5dbb89e25e8dc752352
  • BLAKE2b-256: 5349ef5e12eb8a5ff1072bf6a3c469b41e0a96e1b77212535d64e87bfa0abbb9
