Project description

GZeus is a package that chunk-reads GZipped text files LIGHTNING fast.

What is this package for?

This package is designed for situations where you:

  1. Need to read data from a very large .csv.gz file.

  2. Want to apply additional rules while reading, working chunk by chunk to save memory.

  3. Know what you are doing and prefer a more customizable experience than a package like polars_streaming_csv_decompression.

This package provides a Chunker class that reads a gz-compressed text file in chunks. For csv files, each chunk is itself a valid decompressed csv file, and only the first chunk carries the header row, if headers are present. The Chunker produces these chunks in a streaming fashion, thus minimizing memory load.

This package can potentially be used to stream large gzipped text files as well. However, it is not capable of semantic chunking, which is often needed when preparing text for LLMs. It only chunks by finding the last needle (the new line character) in the haystack (the text currently in the buffer).
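The last-needle rule can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the package's internals, and split_at_last_newline is a made-up name:

```python
# Keep everything up to and including the last newline in the buffer;
# carry the partial last line over to be prepended to the next buffer.
def split_at_last_newline(buffer: bytes, needle: bytes = b"\n") -> tuple[bytes, bytes]:
    idx = buffer.rfind(needle)
    if idx == -1:
        return b"", buffer  # no newline yet; keep accumulating
    return buffer[: idx + 1], buffer[idx + 1 :]

chunk, leftover = split_at_last_newline(b"a,1\nb,2\nc,")
# chunk is b"a,1\nb,2\n" (complete rows); leftover is b"c," (partial row)
```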

Assumptions

The new_line_symbol provided by the user appears in the underlying text file only as a line separator, never inside field content.
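To see why this assumption matters, consider a quoted CSV field that contains the newline symbol. GZeus does not parse CSV quoting, so a chunk boundary landing on that embedded newline would cut the record in half (a contrived illustration):

```python
row = b'1,"line one\nline two"\n'  # one CSV record with an embedded newline

# If the buffer happens to end inside the quoted field, the last newline
# the chunker sees is the embedded one, and the split breaks the record:
cut = row.rfind(b"\n", 0, 15)  # pretend the buffer ended at byte 15
first, rest = row[:cut + 1], row[cut + 1:]
# first is b'1,"line one\n' -- not a complete CSV record
```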

We get decompressed bytes in chunks. Then what?

Most of the time, we only need to extract partial data from large .csv.gz files. This is where the combination of GZeus and Polars really shines.

If you have Polars installed already:

from gzeus import stream_polars_csv_gz

for output_of_your_func in stream_polars_csv_gz("PATH TO YOUR DATA", func=your_func):
    ...  # do work on the output of your_func

where your_func should be a pl.LazyFrame -> Any function. If you need more control over the iteration and the raw bytes, you can structure your code as below:

from gzeus import Chunker
import polars as pl

# Turn a LazyFrame (scanned from one chunk's bytes) into a filtered DataFrame.
def bytes_into_df(df: pl.LazyFrame) -> pl.DataFrame:
    return df.filter(
        pl.col("City_Category") == 'A'
    ).select("City_Category", "Primary_Bank_Type", "Source").collect()

ck = (
    Chunker(buffer_size=1_000_000, new_line_symbol='\n')
    .with_local_file("../data/test.csv.gz")
)

df_temp = pl.scan_csv(ck.read_one()) # first chunk
schema = df_temp.collect_schema() # Infer schema from first chunk
dfs = [bytes_into_df(df_temp)]

dfs.extend(
    bytes_into_df(
        pl.scan_csv(byte_chunk, has_header=False, schema=schema)
    )
    for byte_chunk in ck.chunks()
)

df = pl.concat(dfs)
df.head()

Performance vs. Pandas

See here.

It is extremely hard to make an apples-to-apples comparison with other tools, so here I will focus on the comparison with pandas.read_csv, which has an iterator option. Note: GZeus chunks are defined by byte size, while the pandas.read_csv iterator yields a fixed number of rows per chunk.
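For reference, the pandas side of the comparison looks roughly like this (a sketch; the chunk size and the sum_column_pandas name are placeholders, not part of either library):

```python
import pandas as pd

# pandas iterates in fixed-row chunks; gzip compression is inferred from
# the .gz extension. Contrast with GZeus, whose chunks are fixed byte sizes.
def sum_column_pandas(path: str, col: str) -> float:
    total = 0.0
    with pd.read_csv(path, chunksize=100_000) as reader:
        for chunk in reader:
            total += chunk[col].sum()
    return total
```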

However, generally speaking, I find that for .csv.gz files:

  1. GZeus + Polars takes at least 50% less time than pd.read_csv, with zero additional work on each chunk.
  2. With a higher buffer size, GZeus + Polars can take as little as 1/5 of the time of pandas.read_csv.
  3. The gap grows with more workload per chunk (mostly because of Polars).

Cloud Files

Supporting chunked reads from every major cloud provider is no easy task. Not only would it require an async interface in Rust, which is much harder to write and maintain, but there are also performance issues when fetching only a small chunk of data at a time. To name a few:

  1. An increased number of calls to the storage service.
  2. Repeatedly opening the file and seeking to the last read position.
  3. Rate-limit issues, especially with a VPN. E.g., for good performance, gzeus needs to read 10MB+ per chunk, but this increases "packets per second" significantly.

A workaround is to use temp files. For AWS S3, for example, one can do the following:

import tempfile
import boto3

s3 = boto3.client('s3')

tmp = tempfile.NamedTemporaryFile()
s3.download_fileobj('amzn-s3-demo-bucket', 'OBJECT_NAME', tmp)
tmp.flush()  # make sure all downloaded bytes are on disk before re-reading
df = chunk_load_data_using_gzeus(tmp.name)  # a wrapper around the code shown above
tmp.close()  # the temp file is deleted on close

In most setups the machine will have enough disk space for the temp file. Inside chunk_load_data_using_gzeus, data is still read in chunks and therefore won't lead to OOM errors; it can be any wrapper around the stream_polars_csv_gz function provided by the package.
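A minimal sketch of such a wrapper, assuming each chunk's LazyFrame only needs to be collected and the results concatenated (chunk_load_data_using_gzeus is a hypothetical name, as above):

```python
def chunk_load_data_using_gzeus(path: str):
    # Deferred imports keep this sketch importable without the libraries.
    import polars as pl
    from gzeus import stream_polars_csv_gz

    # Collect each chunk's LazyFrame; only one chunk's worth of
    # decompressed bytes is held in memory at a time.
    pieces = list(stream_polars_csv_gz(path, func=lambda lf: lf.collect()))
    return pl.concat(pieces)
```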

Road Map

  1. To be decided

Other Projects to Check Out

  1. Dataframe-friendly data analysis package polars_ds
  2. For a more sophisticated but more feature-complete package for streaming csv.gz, see polars_streaming_csv_decompression

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gzeus-0.1.1.tar.gz (20.8 kB)

Uploaded: Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

gzeus-0.1.1-cp39-abi3-win_amd64.whl (211.9 kB)

Uploaded: CPython 3.9+, Windows x86-64

gzeus-0.1.1-cp39-abi3-manylinux_2_24_aarch64.whl (269.0 kB)

Uploaded: CPython 3.9+, manylinux glibc 2.24+, ARM64

gzeus-0.1.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (288.1 kB)

Uploaded: CPython 3.9+, manylinux glibc 2.17+, x86-64

gzeus-0.1.1-cp39-abi3-macosx_11_0_arm64.whl (248.2 kB)

Uploaded: CPython 3.9+, macOS 11.0+, ARM64

gzeus-0.1.1-cp39-abi3-macosx_10_12_x86_64.whl (270.9 kB)

Uploaded: CPython 3.9+, macOS 10.12+, x86-64

File details

Details for the file gzeus-0.1.1.tar.gz.

File metadata

  • Download URL: gzeus-0.1.1.tar.gz
  • Upload date:
  • Size: 20.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: maturin/1.8.2

File hashes

Hashes for gzeus-0.1.1.tar.gz
Algorithm Hash digest
SHA256 a6f8cfdea9f70e7f616f4c43bcadedef81b4ceb6e279101a996231fb5aa72ad8
MD5 d67c25e8db0145eecc27615787e29453
BLAKE2b-256 b82fd726dc72ce64ffde44488cafcfff77d586eb2a89c7ce584139f22066fdb7

See more details on using hashes here.

File details

Details for the file gzeus-0.1.1-cp39-abi3-win_amd64.whl.

File metadata

  • Download URL: gzeus-0.1.1-cp39-abi3-win_amd64.whl
  • Upload date:
  • Size: 211.9 kB
  • Tags: CPython 3.9+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: maturin/1.8.2

File hashes

Hashes for gzeus-0.1.1-cp39-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 f758993f5e6cbc16090727eec016c27de2f994105fe9e9a2d0a1757e9abcc517
MD5 583840069675f89c2374971a3d070633
BLAKE2b-256 1520a40c5a336ce2a24b23403e180012cd3ad6ed5019552b9cb1de9049917e31

File details

Details for the file gzeus-0.1.1-cp39-abi3-manylinux_2_24_aarch64.whl.

File hashes

Hashes for gzeus-0.1.1-cp39-abi3-manylinux_2_24_aarch64.whl
Algorithm Hash digest
SHA256 4a3b8fdbcd4eaeb85f3d980e1c74a55cb2c83ef19f73d79f037cbd5978b34ef0
MD5 841c8ee40767818862e5a152cf5a155c
BLAKE2b-256 0265e820ec6a5eaebcd7c6c778d00eaf1f50ce4a47b5ccc43a75369524ffa410

File details

Details for the file gzeus-0.1.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for gzeus-0.1.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 9a127cafccf894f186a60442087d1d7c4d2af6d6aad95434928024d1754b78cb
MD5 4fb3b54b830b5a18c6a8e02e28bc4d25
BLAKE2b-256 75ddfcde9b15a9c68af31f9ca3aed046856afd8a0964d1f90807a62cb4f68aa0

File details

Details for the file gzeus-0.1.1-cp39-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for gzeus-0.1.1-cp39-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 8044813aa50a63d9e63eadd777aa014c3d8936bd45e8677a18534d930ea0b08b
MD5 1bbed162a226fe89177c2c56b62d4053
BLAKE2b-256 143dcdc49f132c555cb5f49764aa4c82673108b0b9825a515c7ba390287a7966

File details

Details for the file gzeus-0.1.1-cp39-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for gzeus-0.1.1-cp39-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 c1dd41284015f8a4ca62fda65386c8650d4bd556f487bb4e57ff7f743f06f87a
MD5 cbe00dd1602dc831c01f6d95e56a1a5a
BLAKE2b-256 af28ed2a2d6565a439dd2ad9ae8f79f6dee436d0953e2a9659e9640a704eb8b2
