IO Bench is a library designed to benchmark the performance of standard flat file formats and partitioning schemes.
IOBench Quick Start Guide
Generating Sample Data
To generate sample data, initialize the IOBench object with the path to the source CSV file and call the generate_sample method:
from io_bench import IOBench
bench = IOBench(source_file='./data/source_100K.csv', runs=20, parsers=['avro', 'parquet_polars', 'parquet_arrow', 'parquet_fast', 'feather', 'feather_arrow'])
bench.generate_sample(records=100000) # default value
NOTE: source_file behavior is contextual. If you provide the desired name for a sample file and then call generate_sample, the file is created at that path. Otherwise, source_file must be a valid path to an existing file.
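The contextual rule in the note can be expressed as a simple existence check. This is a minimal sketch, not IOBench's own logic: the helper name source_needs_generation is hypothetical, and it assumes "existing file" means a regular file on disk.

```python
from pathlib import Path


def source_needs_generation(source_file: str) -> bool:
    """Return True when no file exists at the path yet.

    Hypothetical helper: in that case a sample would have to be
    generated before any benchmark could read the source file.
    """
    return not Path(source_file).is_file()
```

A caller could use this to decide whether to call generate_sample before partitioning or running.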
Converting Data to Partitioned Formats
Convert the generated CSV data to partitioned formats (Avro, Parquet, Feather). If rows is not defined, the data is partitioned automatically using default chunk sizes.
bench.partition(rows={'avro': 500000, 'parquet': 3000000, 'feather': 1600000})
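To make the rows argument concrete, the sketch below computes the row ranges each partition would cover. It assumes each value in rows is the maximum number of rows per partition file, which the text above does not state explicitly; partition_bounds is a hypothetical helper, not part of IOBench.

```python
from typing import List, Tuple


def partition_bounds(total_rows: int, rows_per_partition: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) row ranges, one per partition."""
    if rows_per_partition <= 0:
        raise ValueError("rows_per_partition must be positive")
    return [
        (start, min(start + rows_per_partition, total_rows))
        for start in range(0, total_rows, rows_per_partition)
    ]
```

Under that assumption, a 100,000-row source with 'avro': 500000 would fit in a single partition, so smaller values are needed if the goal is to compare partitioning schemes on a small sample.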
Running Benchmarks
NOTE: Partitioning is stateful per bench object. If partition is not called manually, it is called automatically on the first run only, provided a valid source file exists.
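The behavior described in the note resembles a lazy, one-time initialization pattern. The class below is a hypothetical stand-in that illustrates the pattern only; it is not the IOBench implementation, and its attribute names are assumptions.

```python
class LazyPartitioner:
    """Hypothetical stand-in showing partition-on-first-run behavior."""

    def __init__(self) -> None:
        self.partitioned = False
        self.partition_calls = 0

    def partition(self) -> None:
        # Record that partitioning happened for this object.
        self.partition_calls += 1
        self.partitioned = True

    def run(self) -> str:
        # Partition automatically only if it has not happened yet.
        if not self.partitioned:
            self.partition()
        return "ran"
```

Because the flag lives on the object, a second bench object would partition again on its own first run.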
Without Column Selection
Run benchmarks without column selection:
benchmarks_no_select = bench.run(suffix='_no_select')
With Column Selection
Run benchmarks with column selection:
columns = ['Region', 'Country', 'Total Cost']
benchmarks_column_select = bench.run(columns=columns, suffix='_column_select')
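As an illustration of what column selection means, the sketch below keeps only the requested columns from CSV text using the standard library. The sample data and the "Units Sold" column are made up for the example, and read_rows is a hypothetical helper. Columnar formats such as Parquet and Feather can typically skip unselected columns while reading, whereas this CSV sketch parses every field and then discards the rest.

```python
import csv
import io
from typing import Dict, List, Optional

# Made-up sample data; only Region, Country and Total Cost come from the guide.
SAMPLE = (
    "Region,Country,Units Sold,Total Cost\n"
    "Europe,France,10,25.50\n"
    "Asia,Japan,4,9.75\n"
)


def read_rows(text: str, columns: Optional[List[str]] = None) -> List[Dict[str, str]]:
    """Parse CSV text, keeping only the requested columns when given."""
    reader = csv.DictReader(io.StringIO(text))
    if columns is None:
        return [dict(row) for row in reader]
    return [{name: row[name] for name in columns} for row in reader]


selected = read_rows(SAMPLE, ["Region", "Country", "Total Cost"])
```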
Generating Reports
Combine results and generate the final report:
all_benchmarks = benchmarks_no_select + benchmarks_column_select
bench.report(all_benchmarks, report_dir='./result')
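Combining results with + suggests that each run returns a list-like collection of results. The sketch below shows one way such results could be produced and merged: a function is timed a fixed number of times, and each record is tagged with a suffix. The time_runs helper and the record fields are assumptions for illustration, not the structure IOBench actually returns.

```python
import time
from typing import Callable, Dict, List


def time_runs(
    label: str, func: Callable[[], object], runs: int, suffix: str = ""
) -> List[Dict[str, object]]:
    """Time func `runs` times and return one record per run."""
    results: List[Dict[str, object]] = []
    for run in range(runs):
        start = time.perf_counter()
        func()
        elapsed = time.perf_counter() - start
        results.append({"name": label + suffix, "run": run, "seconds": elapsed})
    return results


no_select = time_runs("demo", lambda: sum(range(1000)), runs=3, suffix="_no_select")
column_select = time_runs("demo", lambda: sum(range(100)), runs=2, suffix="_column_select")
combined = no_select + column_select  # plain list concatenation
```

Tagging each record with the suffix keeps the two configurations distinguishable after they are merged into one list.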
Full Example
Here is a full example of using IOBench:
from io_bench import IOBench
def main() -> None:
    # Initialize the IOBench object with runs and parsers
    bench = IOBench(source_file='./data/source_100K.csv', runs=20, parsers=['avro', 'parquet_polars'])
    # Generate sample data - (optional)
    bench.generate_sample()
    # Convert the source file to partitioned formats - (optional)
    bench.partition(rows={'avro': 500000, 'parquet': 3000000, 'feather': 1600000})
    # Run benchmarks without column selection
    benchmarks_no_select = bench.run(suffix='_no_select')
    # Run benchmarks with column selection
    columns = ['Region', 'Country', 'Total Cost']
    benchmarks_column_select = bench.run(columns=columns, suffix='_column_select')
    # Combine results and generate the final report
    all_benchmarks = benchmarks_no_select + benchmarks_column_select
    bench.report(all_benchmarks, report_dir='./result')

if __name__ == "__main__":
    main()