Write streaming data to Parquet files with automatic file rollover.
Project description
parquet-stream-writer
parquet-stream-writer provides a memory-efficient way to write streaming data to Parquet. It buffers incoming records and writes them incrementally to disk; when a configurable size threshold is reached, it starts a new Parquet shard, so the full dataset never has to be held in memory. This makes the library suitable for datasets too large to fit in available memory and for continuously generated data.
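The rollover behavior described above follows a common pattern: write records to the current shard, track how much has been written, and open a new shard once the size threshold is crossed. A minimal, storage-agnostic sketch of that pattern (plain byte files rather than Parquet; class and attribute names here are illustrative, not the library's API):

```python
import os

class RolloverWriter:
    """Illustrative sketch of size-based shard rollover: writes records to
    shard-N files, starting a new shard when the current one reaches max_bytes."""

    def __init__(self, directory, max_bytes=1024):
        os.makedirs(directory, exist_ok=True)
        self.directory = directory
        self.max_bytes = max_bytes
        self.shard_index = 0
        self.bytes_written = 0
        self.written_files = []
        self._open_shard()

    def _open_shard(self):
        path = os.path.join(self.directory, f"shard-{self.shard_index}.bin")
        self._fh = open(path, "wb")
        self.written_files.append(path)
        self.bytes_written = 0

    def write(self, record: bytes):
        # Roll over before writing if the current shard has hit the threshold.
        if self.bytes_written >= self.max_bytes:
            self._fh.close()
            self.shard_index += 1
            self._open_shard()
        self._fh.write(record)
        self.bytes_written += len(record)

    def close(self):
        self._fh.close()
```

The real library applies the same idea to Parquet row groups, which is why shard sizes are approximate: rollover can only happen on batch boundaries.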
Installation
You can install parquet-stream-writer from PyPI using pip or from conda-forge with Pixi.
Using pip
pip install parquet-stream-writer
Using pixi
pixi init my_workspace && cd my_workspace
pixi add parquet-stream-writer
Usage
The core class is ParquetStreamWriter. It works as a context manager to ensure files are closed properly.
import pyarrow as pa
from parquet_stream_writer import ParquetStreamWriter

# Define your schema
schema = pa.schema([
    ("timestamp", pa.int64()),
    ("event_type", pa.string()),
    ("value", pa.float64()),
])

# Simulate a data stream
def data_stream():
    for i in range(100):
        yield {
            "timestamp": [i],
            "event_type": ["reading"],
            "value": [float(i)],
        }

# Initialize the writer.
# This writes to ./output_data/shard-0.parquet, ./output_data/shard-1.parquet, etc.
with ParquetStreamWriter("output_data", schema, overwrite=True) as writer:
    for batch in data_stream():
        writer.write_batch(batch)
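Each batch passed to write_batch above is a plain dict mapping column names to equal-length lists. If your stream comes from a less controlled source, a small validation step can catch malformed batches before they reach the writer (this helper is hypothetical, not part of the library):

```python
def validate_batch(batch, column_names):
    """Check that a batch dict has exactly the expected columns
    and that every column holds the same number of rows."""
    if set(batch) != set(column_names):
        raise ValueError(f"expected columns {sorted(column_names)}, got {sorted(batch)}")
    lengths = {name: len(values) for name, values in batch.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    # Return the number of rows in the batch.
    return next(iter(lengths.values()))
```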
Configuring file size and naming
You can configure when new files are created and how they are named. Additional PyArrow parameters can be passed through via **kwargs.
with ParquetStreamWriter(
    "data_stream",
    schema,
    shard_size_bytes=50 * 1024 * 1024,  # shards will be approx. 50 MiB each
    file_prefix="events",               # output names: events-0.parquet, events-1.parquet, ...
    compression="snappy",
) as writer:
    for batch in stream:
        writer.write_batch(batch)
Accessing created files
After the writer closes, you can inspect which files were actually created. This is useful for logging or triggering downstream processes.
with ParquetStreamWriter("output", schema) as writer:
    writer.write_batch(batch_1)
    writer.write_batch(batch_2)

# The writer object keeps a list of the files it created.
print("Data was written to the following files:")
for file_path in writer.written_files:
    print(file_path)
File details
Details for the file parquet_stream_writer-0.1.0.tar.gz.
File metadata
- Download URL: parquet_stream_writer-0.1.0.tar.gz
- Upload date:
- Size: 3.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2628edf077604ca40b409f09b6ff3eae864107c2a9394ea56ab52e79ffcedd67 |
| MD5 | 850a47b418ca21283189255a128d9de2 |
| BLAKE2b-256 | 4c2c8abc459d86d525f7b8d97fde8ea032020796daccfeafe61e18f4d35cdf78 |
File details
Details for the file parquet_stream_writer-0.1.0-py3-none-any.whl.
File metadata
- Download URL: parquet_stream_writer-0.1.0-py3-none-any.whl
- Upload date:
- Size: 5.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 96785a6c3b5b3cdffa766b3cfc34612306420c9006873d0aa810c6bf014483e9 |
| MD5 | a42ffee073288fcb0437c2f1c501b765 |
| BLAKE2b-256 | 4c7aa5107b73c624e02c2af1c1f3901d052cfd625daff0deea827e1d0b34e5ff |