Write streaming data to Parquet files with automatic sharding.

Project description

parquet-stream-writer

parquet-stream-writer enables streaming data to be written to Parquet files with automatic sharding (splitting data across multiple files). When a file reaches a user-defined size limit, the writer automatically creates a new file. This prevents the accumulation of unwieldy, monolithic Parquet files during stream processing.

Installation

You can install parquet-stream-writer from PyPI using pip or from conda-forge with Pixi.

Using pip

pip install parquet-stream-writer

Using pixi

pixi init my_workspace && cd my_workspace
pixi add parquet-stream-writer

Usage

The library's core class is ParquetStreamWriter, which works as a context manager and lets you write data incrementally using its write_batch method.

import pyarrow as pa
from parquet_stream_writer import ParquetStreamWriter

# Define your schema
schema = pa.schema(
    [("col_a", pa.int64()), ("col_b", pa.string()), ("col_c", pa.bool_())]
)

# Simulate a data stream
def data_stream():
    for i in range(1_000):
        yield {"col_a": [i, i + 1], "col_b": ["foo", "bar"], "col_c": [True, False]}

# Initialize an instance of `ParquetStreamWriter` and write data to `output_data.parquet`
with ParquetStreamWriter("output_data.parquet", schema, overwrite=True) as writer:
    for batch in data_stream():
        writer.write_batch(batch)
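
After the block exits, the output can be read back with standard PyArrow tooling to check the schema and row count. This is plain pyarrow.parquet usage, not part of parquet-stream-writer:

import pyarrow.parquet as pq

# Read the finished file back to confirm the schema and number of rows
table = pq.read_table("output_data.parquet")
print(table.schema)
print(f"Wrote {table.num_rows} rows")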

Writing with automatic sharding

By default, ParquetStreamWriter writes to a single Parquet file. However, you can enable automatic sharding to split the output into multiple files based on a size threshold. To do that, use the shard_size_bytes parameter to set the approximate maximum uncompressed size of each file. In this mode, path acts as the base directory where shards will be written.

When sharding is enabled, the prefix of the generated files defaults to the name of the output directory. For example, if path="my_dataset", the files will be named my_dataset-0.parquet, my_dataset-1.parquet, etc. You can override this using the file_prefix parameter.

with ParquetStreamWriter(
    "my_dataset",                        # Base directory path
    schema,
    shard_size_bytes=50 * 1024 * 1024,   # Shards will be approx. 50 MiB
    file_prefix="prefix",                # Custom prefix
) as writer:
    for batch in data_stream():
        writer.write_batch(batch)
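
To see how the shards were named, a plain directory listing is enough. Assuming the naming scheme described above, the files should appear as prefix-0.parquet, prefix-1.parquet, and so on:

from pathlib import Path

# List the shard files created under the base directory
for shard in sorted(Path("my_dataset").glob("*.parquet")):
    print(shard.name, shard.stat().st_size, "bytes")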

Configuring buffer size

By default, ParquetStreamWriter uses an in-memory buffer of 16 MiB to accumulate data before writing it to disk. You can adjust this size using the buffer_size_bytes parameter. A larger buffer can improve write performance by reducing the number of write operations, but it also increases memory usage. Smaller buffers will lead to more frequent writes and larger files, as encoding overhead is incurred with each write.

with ParquetStreamWriter(
    "my_dataset",                        # Base directory path
    schema,
    buffer_size_bytes=200 * 1024 * 1024,   # The in-memory buffer will be approx. 200 MiB
) as writer:
    for batch in data_stream():
        writer.write_batch(batch)

Configuring row group size

The row_group_size parameter controls how many rows are grouped together within the file. By default, it is set to None, which means the group size will be either the total number of rows or 1,048,576, whichever is smaller. Setting a smaller value, such as 10,000, can make filtered reads faster because readers can use per-group statistics to skip row groups that cannot contain matching rows.

with ParquetStreamWriter(
    "output_data.parquet",
    schema,
    overwrite=True,
    row_group_size=10_000
) as writer:
    for batch in data_stream():
        writer.write_batch(batch)
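
To see the benefit, compare a filtered read against the file written above. This is plain PyArrow usage: read_table can use per-row-group statistics to skip groups whose min/max values rule out a match, and smaller row groups make that skipping finer-grained:

import pyarrow.parquet as pq

# Row-group statistics let the reader skip groups that cannot match the filter
filtered = pq.read_table("output_data.parquet", filters=[("col_a", ">", 900)])
print(f"{filtered.num_rows} matching rows")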

Passing additional parameters to ParquetWriter

ParquetStreamWriter uses PyArrow's ParquetWriter class under the hood. You can further customize the Parquet writing behavior by passing any additional parameters supported by ParquetWriter via **kwargs.

with ParquetStreamWriter(
    "output_data.parquet",
    schema,
    overwrite=True,
    compression="zstd",                 # Use ZSTD for compression
    use_content_defined_chunking=True,  # Write data pages according to content-defined chunk boundaries
) as writer:
    for batch in data_stream():
        writer.write_batch(batch)

Accessing created files

After the writer closes, you can inspect which files it created via the written_files attribute.

# The 'writer' object stores a list of the files it created
print("Data was written to the following files:")
for file_path in writer.written_files:
print(f"{file_path}: {file_path.stat().st_size} bytes")

ParquetStreamWriter API reference

A writer for streaming data to Parquet files with automatic file rollover.

This class manages writing large or infinite datasets to multiple Parquet files
(shards), automatically creating new files when a size threshold is reached.

Parameters
----------
path : str or Path
    Path where Parquet files will be written. If shard_size_bytes is None,
    this is the path to the single output file. If shard_size_bytes is set,
    this is the base directory where shards will be created.
schema : pa.Schema
    PyArrow schema defining the structure of the data to be written.
shard_size_bytes : int or None, default None
    Approximate maximum uncompressed memory size in bytes for each shard
    before starting to write to a new file. If None (default), sharding is
    disabled and a single file is written to path. If set to an integer,
    path is treated as a base directory and shards are created inside it.
row_group_size : int or None, default None
    Maximum number of rows in written row group.
buffer_size_bytes : int, default 16_777_216
    Maximum size in bytes of the in-memory buffer before flushing to disk.
    Must be <= shard_size_bytes.
file_prefix : str or None, default None
    Prefix to use for generated filenames (only used when sharding is
    enabled). If None (default), the value of `path` will be used as the
    prefix and files will be named '{file_prefix}-{index}.parquet'.
overwrite : bool, default False
    If True, deletes existing output file or directory before writing.
    If False, raises FileExistsError when the output exists.
    Default is False.
**kwargs : dict, optional
    Additional keyword arguments passed to pyarrow.parquet.ParquetWriter.

Attributes
----------
path : Path
    The output path.
schema : pa.Schema
    The PyArrow schema for the data.
shard_size_bytes : int or None
    Maximum uncompressed size threshold for each file.
row_group_size : int or None
    Maximum number of rows in written row group.
buffer_size_bytes : int or None
    Maximum size of in-memory buffer before flushing.
file_prefix : str
    Prefix used for naming files if sharding is enabled.
writer : pq.ParquetWriter or None
    Current active Parquet writer instance.
written_files : list[Path]
    List of absolute paths to all successfully created Parquet files.

Methods
-------
write_batch
    Write a data batch to the output.
flush
    Flush buffered data to the current shard.

Raises
------
FileExistsError
    If the output path already exists and overwrite is False.
FileNotFoundError
    If the parent directory of the output path does not exist.
ValueError
    If shard_size_bytes or buffer_size_bytes is negative.
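
As a short sketch of the documented overwrite behavior (reusing the schema and data_stream defined in the usage examples above): with overwrite=False, the default, writing to a path that already exists raises FileExistsError.

from parquet_stream_writer import ParquetStreamWriter

# overwrite defaults to False, so an existing output path raises FileExistsError
try:
    with ParquetStreamWriter("output_data.parquet", schema) as writer:
        for batch in data_stream():
            writer.write_batch(batch)
except FileExistsError:
    print("Output already exists; pass overwrite=True to replace it")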

Download files

Download the file for your platform.

Source Distribution

parquet_stream_writer-0.2.0.tar.gz (6.7 kB)

Uploaded Source

Built Distribution

parquet_stream_writer-0.2.0-py3-none-any.whl (8.3 kB)

Uploaded Python 3

File details

Details for the file parquet_stream_writer-0.2.0.tar.gz.

File metadata

  • Download URL: parquet_stream_writer-0.2.0.tar.gz
  • Size: 6.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for parquet_stream_writer-0.2.0.tar.gz

  • SHA256: 2b74fc464c41722498e5ce8481f8cd6bddd1607db7d39b87760a6e50740760cc
  • MD5: d7a5f34ab727242d49e50a310fa7c7bc
  • BLAKE2b-256: 1dc0e830aa314ba46c4813f4b8a287a530d59512813473ab4425a55a04ffd9b8

File details

Details for the file parquet_stream_writer-0.2.0-py3-none-any.whl.

File hashes

Hashes for parquet_stream_writer-0.2.0-py3-none-any.whl

  • SHA256: 044e0132827de9e66545c4d80876aba97f553ca1dcdaa9ac36aa49eef34c36ba
  • MD5: aefaef99e92102981a93b0ee9c877c5a
  • BLAKE2b-256: f5c0606f6053311b2e78ba1677e3ed8ac36eb6ba7c063127db2a6a2fd5e38d95
