Benchmark python coroutines in various ways
tiff-dumper
Dump TIFF headers into .parquet files using obstore and async-tiff:
- List a bucket.
- Fetch and parse the header of each TIFF we find.
- Save the headers to parquet.
This library can process ~4,000 images per second with a single Python thread (managing multiple Rust threads).
Usage
The library provides a CLI that accepts a YAML config file; see config.yaml for an example.
❯ pip install tiff-dumper
❯ mkdir outputs
❯ tiff-dumper headers config.yaml outputs
The output parquet files have the following schema:
│ COLUMN │ TYPE │ ANNOTATION │ REPETITION │ COMPRESSION │
├────────────────────────────┼────────┼────────────┼────────────┼─────────────┤
│ artist │ int32 │ null │ 0..1 │ snappy │
│ bits_per_sample │ │ list │ 0..1 │ │
│ compression │ int64 │ │ 0..1 │ snappy │
│ copyright │ int32 │ null │ 0..1 │ snappy │
│ date_time │ int32 │ null │ 0..1 │ snappy │
│ document_name │ int32 │ null │ 0..1 │ snappy │
│ extra_samples │ int32 │ null │ 0..1 │ snappy │
│ host_computer │ int32 │ null │ 0..1 │ snappy │
│ image_description │ int32 │ null │ 0..1 │ snappy │
│ image_height │ int64 │ │ 0..1 │ snappy │
│ image_width │ int64 │ │ 0..1 │ snappy │
│ jpeg_tables │ int32 │ null │ 0..1 │ snappy │
│ max_sample_value │ int32 │ null │ 0..1 │ snappy │
│ min_sample_value │ int32 │ null │ 0..1 │ snappy │
│ model_pixel_scale │ │ list │ 0..1 │ │
│ model_tiepoint │ │ list │ 0..1 │ │
│ new_subfile_type │ int32 │ null │ 0..1 │ snappy │
│ orientation │ int32 │ null │ 0..1 │ snappy │
│ other_tags │ │ group │ 0..1 │ │
│ photometric_interpretation │ int64 │ │ 0..1 │ snappy │
│ planar_configuration │ int64 │ │ 0..1 │ snappy │
│ predictor │ int64 │ │ 0..1 │ snappy │
│ resolution_unit │ int32 │ null │ 0..1 │ snappy │
│ rows_per_strip │ int32 │ null │ 0..1 │ snappy │
│ sample_format │ │ list │ 0..1 │ │
│ samples_per_pixel │ int64 │ │ 0..1 │ snappy │
│ software │ int32 │ null │ 0..1 │ snappy │
│ tile_height │ int64 │ │ 0..1 │ snappy │
│ tile_width │ int64 │ │ 0..1 │ snappy │
│ x_resolution │ int32 │ null │ 0..1 │ snappy │
│ y_resolution │ int32 │ null │ 0..1 │ snappy │
│ geokeys │ │ group │ 0..1 │ │
│ path │ binary │ string │ 0..1 │ snappy │
├────────────────────────────┼────────┴────────────┴────────────┴─────────────┤
│ Rows │ 4329 │
│ Row Groups │ 1 │
Performance
The config file checked into the repo lists 235,639 TIFFs and takes 58 seconds on an m5.8xlarge, which works out to ~4,000 TIFF headers per second.
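As a quick check of that figure, the arithmetic can be reproduced directly from the two numbers quoted above. The variable names are illustrative only.

```python
# Figures quoted in the Performance section.
tiff_count = 235_639
elapsed_seconds = 58

# Average throughput over the whole run.
headers_per_second = tiff_count / elapsed_seconds
print(f"{headers_per_second:,.0f} headers per second")
```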
Why is this so fast?
The library uses anyio streams to move data between coroutines efficiently. Each stream has a sending end and a receiving end, which together act as a bounded queue. Keys are placed onto a stream as we scan the bucket and are processed by N consumers listening to that stream. Each consumer (a Python coroutine) is responsible for fetching and parsing a TIFF header and placing the result on the output stream. A single coroutine listens to the output stream and writes the headers to parquet.
This decouples listing the bucket, reading and parsing TIFF headers, and writing to parquet, so each stage can scale independently of the others. It also lets us saturate the host machine's network bandwidth more efficiently than the simpler approach of processing each page of LIST results as it arrives (e.g. with asyncio.gather).
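The pipeline shape described above can be sketched with the standard library alone. This is a minimal illustration, not the library's actual code: it swaps anyio streams for `asyncio.Queue(maxsize=...)` as the bounded queue, uses a sentinel value where an anyio stream would simply be closed, and replaces the real LIST, fetch, and parquet steps with hypothetical stand-ins (`list_bucket`, `fetch_header`, `write_headers`).

```python
import asyncio

SENTINEL = None  # end-of-stream marker (asyncio.Queue has no close())


async def list_bucket(keys: asyncio.Queue, n_consumers: int) -> None:
    """Producer: hypothetical stand-in for paging through LIST responses."""
    for page in range(3):
        for i in range(4):
            # put() waits while the queue is full, which gives backpressure.
            await keys.put(f"page-{page}/image-{i}.tif")
    for _ in range(n_consumers):
        await keys.put(SENTINEL)  # one stop signal per consumer


async def fetch_header(key: str) -> dict:
    """Hypothetical placeholder for fetching and parsing one TIFF header."""
    await asyncio.sleep(0.01)  # simulated network latency
    return {"path": key, "image_width": 256, "image_height": 256}


async def consume(keys: asyncio.Queue, headers: asyncio.Queue) -> None:
    """Consumer: turn keys into parsed headers until a stop signal arrives."""
    while (key := await keys.get()) is not SENTINEL:
        await headers.put(await fetch_header(key))
    await headers.put(SENTINEL)  # tell the writer this consumer is done


async def write_headers(headers: asyncio.Queue, n_consumers: int) -> list[dict]:
    """Single writer: collects rows here; the real tool writes parquet."""
    rows, finished = [], 0
    while finished < n_consumers:
        item = await headers.get()
        if item is SENTINEL:
            finished += 1
        else:
            rows.append(item)
    return rows


async def run_pipeline(n_consumers: int = 4, queue_size: int = 8) -> list[dict]:
    keys = asyncio.Queue(maxsize=queue_size)
    headers = asyncio.Queue(maxsize=queue_size)
    # All three stages run concurrently, coupled only through the two queues.
    results = await asyncio.gather(
        list_bucket(keys, n_consumers),
        *(consume(keys, headers) for _ in range(n_consumers)),
        write_headers(headers, n_consumers),
    )
    return results[-1]  # the writer's collected rows


rows = asyncio.run(run_pipeline())
print(len(rows))
```

Because the stages only meet at the queues, `n_consumers` and `queue_size` can be tuned without touching the listing or writing code, which is the decoupling the text describes.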
The diagram below shows the high-level architecture:
File details
Details for the file tiff_dumper-0.1.0.tar.gz.
File metadata
- Download URL: tiff_dumper-0.1.0.tar.gz
- Upload date:
- Size: 9.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.4 CPython/3.11.4 Darwin/22.6.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3e32834ddbd0264c6e3ba78913c21ae2701bd68e07b233d041792e7e188b9a36 |
| MD5 | 2fde40ba6a0d6e22edf29b17a06da41f |
| BLAKE2b-256 | dd50790124f65443faf1b5a2a6a4f6244adcb99b6dc591348baf1a62dc7ac015 |
File details
Details for the file tiff_dumper-0.1.0-py3-none-any.whl.
File metadata
- Download URL: tiff_dumper-0.1.0-py3-none-any.whl
- Upload date:
- Size: 10.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.4 CPython/3.11.4 Darwin/22.6.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1bf18d4ed6ebea025320ce439d417980be39c515271e7bbf95ce9de642c6a6dc |
| MD5 | fc86f2f6ae2aedb5acee1fe6c31c339f |
| BLAKE2b-256 | e496edeed2a8ef1beb5611398f6c2bad3a9660949660afbeafbce2e31493508d |