Tiny Python module to bulk-convert large numbers of images into Zarr stores

Development Status
- 4 - Beta
Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Programming Language
- Python :: 3
- Python :: 3.11
Topic
- Scientific/Engineering
- Scientific/Engineering :: Astronomy

Project description

images_to_zarr

A Python module to efficiently bulk-convert large collections of heterogeneous images (FITS, PNG, JPEG, TIFF) into sharded Zarr v3 stores for fast analysis and cloud-native workflows.

Features

Multi-format support: FITS, PNG, JPEG, TIFF images
Consistent NCHW format: All images stored in (batch, channels, height, width) format for ML workflows
Direct memory conversion: Convert numpy arrays directly to Zarr without intermediate files
Efficient storage: Sharded Zarr v3 format with configurable compression
Metadata preservation: Combines image data with tabular metadata
Parallel processing: Multi-threaded conversion for large datasets
Cloud-friendly: S3-compatible storage backend
Visual inspection: Built-in plotting tools to sample and display stored images
Easy inspection: Built-in tools to analyze converted stores

Installation

From PyPI

pip install images-to-zarr

After installation, the images_to_zarr CLI command is available in the active Python environment.

From source

git clone https://github.com/gomezzz/images_to_zarr.git
cd images_to_zarr
pip install -e .

Using conda

conda env create -f environment.yml
conda activate img2zarr
pip install -e .

Quick Start

Command Line Interface

Convert image folders to Zarr:

# Basic conversion with metadata
images_to_zarr convert /path/to/images --metadata metadata.csv --out /output/dir

# Basic conversion without metadata (filenames only)
images_to_zarr convert /path/to/images --out /output/dir

# Advanced options with resize
images_to_zarr convert /path/to/images1 /path/to/images2 \
    --metadata metadata.csv \
    --out /output/dir \
    --recursive \
    --workers 16 \
    --fits-ext 0 \
    --chunk-shape 1,512,512 \
    --compressor zstd \
    --clevel 5 \
    --resize 256,256 \
    --interpolation-order 1 \
    --overwrite

Inspect a Zarr store:

images_to_zarr inspect /path/to/store.zarr

Python API

from images_to_zarr import convert, inspect, display_sample_images
from images_to_zarr.convert import convert_from_memory
import numpy as np
from pathlib import Path

# Convert images to Zarr with metadata
zarr_path = convert(
    folders=["/path/to/images"],
    recursive=True,
    metadata="/path/to/metadata.csv",  # Optional
    output_dir="/output/dir",
    num_parallel_workers=8,
    chunk_shape=(1, 256, 256),
    compressor="zstd",
    clevel=4
)

# Convert images to Zarr with automatic resizing
zarr_path = convert(
    folders=["/path/to/images"],
    recursive=True,
    metadata="/path/to/metadata.csv",  # Optional
    output_dir="/output/dir",
    resize=(256, 256),  # Resize all images to 256x256
    interpolation_order=1,  # Bi-linear interpolation
    num_parallel_workers=8,
    chunk_shape=(1, 256, 256),
    compressor="zstd",
    clevel=4
)

# Convert images to Zarr without metadata (filenames only)
zarr_path = convert(
    folders=["/path/to/images"],
    recursive=True,
    metadata=None,  # or simply omit this parameter
    output_dir="/output/dir"
)

# Convert numpy arrays directly to Zarr (memory-to-zarr conversion)
# Images must be in NCHW format: (batch, channels, height, width)
images = np.random.rand(100, 3, 224, 224).astype(np.float32)  # 100 RGB images
zarr_path = convert_from_memory(
    images=images,
    output_dir="/output/dir",
    compressor="lz4",
    overwrite=True
)

# Convert with custom metadata for memory conversion
metadata = [{"id": i, "source": "generated"} for i in range(100)]
zarr_path = convert(
    output_dir="/output/dir",
    images=images,
    image_metadata=metadata,
    chunk_shape=(10, 224, 224),  # Chunk 10 images together
    overwrite=True
)

# Inspect the result
inspect(zarr_path)

# Display random sample images from the store (with auto-normalization for .fits)
display_sample_images(zarr_path, num_samples=6, figsize=(15, 10))

# Save sample images to file
display_sample_images(zarr_path, num_samples=4, save_path="samples.png")

Usage

Metadata CSV Format

The metadata CSV file is optional. If provided, it must contain at least a filename column. Additional columns are preserved:

filename,source_id,ra,dec,magnitude
image001.fits,12345,123.456,45.678,18.5
image002.png,12346,124.567,46.789,19.2
image003.jpg,12347,125.678,47.890,17.8
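
Before starting a long conversion, it can help to check that a metadata file meets the one stated requirement, a filename column. The following is a minimal sketch using only the standard library; `check_metadata_csv` is a hypothetical helper and not part of images_to_zarr, and rejecting rows with an empty filename is an assumption rather than documented behaviour.

```python
import csv
import io


def check_metadata_csv(text: str) -> list[dict]:
    """Parse metadata CSV text and verify the required 'filename' column."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames is None or "filename" not in reader.fieldnames:
        raise ValueError("metadata CSV must contain a 'filename' column")
    rows = list(reader)
    # Assumption: a row with an empty filename cannot be matched to an image.
    missing = [i for i, row in enumerate(rows) if not (row["filename"] or "").strip()]
    if missing:
        raise ValueError(f"rows without a filename: {missing}")
    return rows


sample = """filename,source_id,ra,dec,magnitude
image001.fits,12345,123.456,45.678,18.5
image002.png,12346,124.567,46.789,19.2
"""
rows = check_metadata_csv(sample)
print(len(rows), rows[0]["filename"])
```

All values come back as strings here; any type conversion of the extra columns is left to the caller.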

If no metadata file is provided, metadata will be automatically created from the filenames:

# Convert without metadata - will use filenames only
images_to_zarr convert /path/to/images --out /output/dir

# Convert with metadata
images_to_zarr convert /path/to/images --metadata metadata.csv --out /output/dir

Supported Image Formats

FITS (.fits, .fit): Astronomical images with flexible HDU support
PNG (.png): Lossless compressed images
JPEG (.jpg, .jpeg): Compressed photographic images
TIFF (.tif, .tiff): Uncompressed or losslessly compressed images
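
The extension list above can be expressed as a small lookup table, which is handy for pre-screening a folder before conversion. This is an illustrative sketch only: `detect_format` and `find_images` are hypothetical names, and case-insensitive matching of extensions is an assumption, not documented behaviour of the library.

```python
from pathlib import Path

# Extension-to-format table taken from the list above.
FORMATS = {
    ".fits": "FITS", ".fit": "FITS",
    ".png": "PNG",
    ".jpg": "JPEG", ".jpeg": "JPEG",
    ".tif": "TIFF", ".tiff": "TIFF",
}


def detect_format(path):
    """Return the format name for a path, or None if the extension is unsupported."""
    # Assumption: extensions are matched case-insensitively.
    return FORMATS.get(Path(path).suffix.lower())


def find_images(folder, recursive=False):
    """List supported image files in a folder, optionally descending into subfolders."""
    pattern = "**/*" if recursive else "*"
    return sorted(p for p in Path(folder).glob(pattern)
                  if p.is_file() and detect_format(p) is not None)


print(detect_format("m31.FITS"), detect_format("notes.txt"))
```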

FITS Extension Handling

# Use primary HDU (default)
convert(..., fits_extension=None)

# Use specific extension by number
convert(..., fits_extension=1)

# Use extension by name
convert(..., fits_extension="SCI")

# Combine multiple extensions
convert(..., fits_extension=[0, 1, "ERR"])
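
The four accepted forms of fits_extension (None, an index, a name, or a sequence of these) can be reduced to a single list of HDU selectors. The sketch below shows one way to do that; `normalize_fits_extension` is a hypothetical helper written for illustration and is not taken from the library's source.

```python
def normalize_fits_extension(fits_extension=None):
    """Return the requested HDUs as a list of indices/names (hypothetical helper)."""
    if fits_extension is None:
        return [0]  # primary HDU, the documented default
    # Check str before treating the value as a sequence: strings are iterable too.
    if isinstance(fits_extension, (int, str)):
        return [fits_extension]
    selectors = list(fits_extension)
    for item in selectors:
        if not isinstance(item, (int, str)):
            raise TypeError(f"HDU selector must be int or str, got {type(item).__name__}")
    return selectors


for value in (None, 1, "SCI", [0, 1, "ERR"]):
    print(repr(value), "->", normalize_fits_extension(value))
```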

Image Resizing

When dealing with images of different sizes, you can use the resize functionality:

# Resize all images to 512x512 using bi-linear interpolation
convert(
    folders=["/path/to/images"],
    output_dir="/output/dir",
    resize=(512, 512),
    interpolation_order=1  # 0=nearest, 1=linear, 2=quadratic, etc.
)

# If resize is not specified, all images must have the same dimensions
# or an error will be raised

Interpolation orders:

0: Nearest-neighbor (fastest, lowest quality)
1: Bi-linear (default, good balance)
2: Bi-quadratic
3: Bi-cubic (slower, higher quality)
4: Bi-quartic
5: Bi-quintic (slowest, highest quality)
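
To make the difference between orders 0 and 1 concrete, here is a small pure-Python sketch that resizes a 2x2 image to 3x3 both ways. It is not the library's implementation: `resize_2d` is a hypothetical function, it covers only orders 0 and 1, and it assumes corner-aligned sampling, so edge values may differ from what a real resizing backend produces.

```python
def resize_2d(img, out_h, out_w, order=1):
    """Resize a 2D list-of-lists image with order 0 (nearest) or 1 (bi-linear)."""
    if order not in (0, 1):
        raise ValueError("this sketch only implements orders 0 and 1")
    in_h, in_w = len(img), len(img[0])
    # Scale factors that map output indices onto input coordinates (corner-aligned).
    sy = (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
    sx = (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
    out = []
    for i in range(out_h):
        y = i * sy
        row = []
        for j in range(out_w):
            x = j * sx
            if order == 0:
                # Nearest neighbour: round the source coordinate to a pixel index.
                row.append(img[int(y + 0.5)][int(x + 0.5)])
                continue
            # Bi-linear: weighted average of the four surrounding pixels.
            y0, x0 = int(y), int(x)
            y1, x1 = min(y0 + 1, in_h - 1), min(x0 + 1, in_w - 1)
            fy, fx = y - y0, x - x0
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


small = [[0.0, 10.0], [20.0, 30.0]]
print(resize_2d(small, 3, 3, order=1))
print(resize_2d(small, 3, 3, order=0))
```

Order 1 introduces intermediate values between neighbouring pixels, while order 0 only repeats existing ones, which is why nearest-neighbor is the safer choice for label or mask images.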

Configuration Options

| Parameter | Description | Default |
|---|---|---|
| `chunk_shape` | Zarr chunk dimensions (n_images, height, width) | (1, 256, 256) |
| `compressor` | Compression codec (zstd, lz4, gzip, etc.) | "lz4" |
| `clevel` | Compression level (1-9) | 1 |
| `num_parallel_workers` | Number of processing threads | 8 |
| `recursive` | Scan subdirectories recursively | False |
| `fits_extension` | FITS HDU(s) to read (int, str, or sequence) | None (uses 0) |
| `resize` | Resize images to (height, width) | None |
| `interpolation_order` | Resize interpolation order (0-5) | 1 (bi-linear) |
| `overwrite` | Overwrite existing store if present | False |

Output Structure

output_dir/
├── images.zarr/              # Main Zarr store (if output_dir doesn't end with .zarr)
│   ├── images/              # Image data arrays
│   └── .zarray, .zgroup     # Zarr metadata
└── images_metadata.parquet  # Combined metadata

Note: If you specify an output directory ending with .zarr (e.g., /path/to/my_dataset.zarr), that path will be used directly as the Zarr store, creating a cleaner output structure.
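
The path rule described in the note can be summarised in a few lines. This is a sketch of the rule as stated here, not the library's code: `resolve_store_path` is a hypothetical helper, and the default store name images.zarr is taken from the directory tree above.

```python
from pathlib import PurePosixPath


def resolve_store_path(output_dir, store_name="images.zarr"):
    """Return the Zarr store path implied by output_dir (hypothetical helper)."""
    out = PurePosixPath(output_dir)
    if out.suffix == ".zarr":
        return out            # the path itself is used as the store
    return out / store_name   # the store is created inside the directory


print(resolve_store_path("/output/dir"))
print(resolve_store_path("/path/to/my_dataset.zarr"))
```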


### Zarr Store Contents

- **`images`**: Main array containing all image data
- **Attributes**: Store metadata, compression info, creation parameters
- **Chunks**: Sharded for efficient cloud access

### Metadata Parquet

Combined metadata includes:
- Original CSV columns
- Image-specific metadata (dimensions, dtype, file size)
- Processing statistics (min/max/mean values)

## Performance Tips

1. **Chunk size**: Match your typical access patterns
   - Single image access: `(1, H, W)`
   - Batch processing: `(B, H, W)` where B > 1

2. **Compression**: Balance speed vs. size
   - Fast: `lz4` with low compression level
   - Compact: `zstd` with high compression level

3. **Parallelism**: Scale with your I/O capacity
   - Local SSD: 8-16 workers
   - Network storage: 4-8 workers
   - S3: 16-32 workers

4. **Memory**: Monitor for large images
   - Consider smaller chunk sizes for very large images
   - Reduce batch size if memory usage is high
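
When choosing a chunk shape, a quick estimate of the uncompressed chunk size helps judge memory use per worker. The sketch below is plain arithmetic under one assumption: the chunk covers only the `(B, H, W)` axes listed above, so any channel axis would multiply the result further. `chunk_nbytes` is a hypothetical helper.

```python
import math


def chunk_nbytes(chunk_shape, itemsize):
    """Uncompressed size in bytes of one chunk with the given shape."""
    return math.prod(chunk_shape) * itemsize


# itemsize is the number of bytes per pixel: 1 for uint8, 4 for float32.
for shape, itemsize, label in [
    ((1, 256, 256), 1, "single uint8 image"),
    ((16, 256, 256), 1, "batch of 16 uint8 images"),
    ((1, 4096, 4096), 4, "one large float32 image"),
]:
    mib = chunk_nbytes(shape, itemsize) / 2**20
    print(f"{label}: {shape} -> {mib:.2f} MiB")
```

Multiplying the per-chunk figure by the number of workers gives a rough lower bound on the memory held in flight during conversion.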

## Inspection Output Example

================================================================================
SUMMARY STATISTICS

Total images across all files: 104,857,600
Total storage size: 126,743.31 MB
Image dimensions: (3, 256, 256)
Data type: uint8
Compression: lz4 (level 1)

Format distribution:
  FITS: 60,000,000 (57.2%)
  PNG: 30,000,000 (28.6%)
  JPEG: 10,000,000 (9.5%)
  TIFF: 4,857,600 (4.6%)

Original data type distribution:
  uint8: 78.0%
  int16: 12.0%
  float32: 10.0%


## Image Display and Visualization

The `display_sample_images` function plots a random sample of stored images and normalizes each one automatically for display:

```python
from images_to_zarr import display_sample_images

# Display with automatic normalization (handles .fits files with arbitrary ranges)
display_sample_images("/path/to/store.zarr", num_samples=6)
```

Error Handling

The library handles common failure modes as follows:

Missing files: Warnings logged, processing continues
Corrupted images: Replaced with zero arrays, errors recorded in metadata
Incompatible formats: Clear error messages with suggested fixes
Storage issues: Detailed error reporting for disk/network problems
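
The first two behaviours in the list can be pictured as a loading loop that never aborts on a single bad file. The sketch below is an illustration under stated assumptions, not the library's code: `load_all` and `fake_loader` are hypothetical, images are plain nested lists, and skipping missing files entirely (rather than inserting a placeholder) is an assumption.

```python
import logging

logger = logging.getLogger("conversion_sketch")


def load_all(paths, loader, shape):
    """Load images, skipping missing files and zero-filling unreadable ones."""
    images, records = [], []
    height, width = shape
    for path in paths:
        try:
            image = loader(path)
            error = None
        except FileNotFoundError:
            # Missing file: log a warning and carry on with the next path.
            logger.warning("missing file skipped: %s", path)
            continue
        except Exception as exc:
            # Corrupted or unreadable image: substitute zeros, keep the error text.
            image = [[0] * width for _ in range(height)]
            error = str(exc)
        images.append(image)
        records.append({"filename": path, "error": error})
    return images, records


def fake_loader(path):
    """Stand-in loader that fails in two ways to exercise both branches."""
    if path == "gone.png":
        raise FileNotFoundError(path)
    if path == "bad.png":
        raise ValueError("truncated file")
    return [[1, 2], [3, 4]]


images, records = load_all(["ok.png", "gone.png", "bad.png"], fake_loader, (2, 2))
print(records)
```

Keeping the error text alongside each filename makes it possible to filter out placeholder images later by checking the metadata.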

Logging Configuration

from images_to_zarr import configure_logging

# Enable detailed logging
configure_logging(enable=True, level="DEBUG")

# Disable for production
configure_logging(enable=False)

Contributing

1. Fork the repository
2. Create a feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request

Development Setup

# Clone your fork (replace "username" with your GitHub account)
git clone https://github.com/username/images_to_zarr.git
cd images_to_zarr
conda env create -f environment.yml
conda activate img2zarr
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black .

# Check linting
flake8

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Built on Zarr for array storage
Uses Astropy for FITS support
Inspired by the needs of astronomical data processing pipelines

Channel Order and Format Consistency

All images are automatically converted to NCHW format (batch, channels, height, width) for consistency across different input formats:

2D grayscale: (H, W) → (1, 1, H, W)
3D RGB (HWC): (H, W, C) → (1, C, H, W)
3D CHW: (C, H, W) → (1, C, H, W)
4D batched: Already in NCHW format

For 3D inputs, the layout is inferred from the size of the last dimension:

Arrays whose last dimension has at most 4 entries are treated as HWC (Height-Width-Channels)
Arrays whose last dimension has more than 4 entries are treated as CHW (Channels-Height-Width)
FITS files and other scientific formats are handled appropriately

This ensures consistent tensor shapes for machine learning workflows while preserving the original data.
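
The shape rules above can be written down as a small function that maps an input shape to its NCHW shape and the axis permutation needed to get there. This is a sketch of the stated rules only: `to_nchw` is a hypothetical helper, it works on shape tuples rather than real arrays, and FITS-specific handling is not covered.

```python
def to_nchw(shape):
    """Return (axes, nchw_shape) for an input array shape.

    `axes` is the permutation to apply to the input axes before the
    leading singleton dimensions are added.
    """
    ndim = len(shape)
    if ndim == 2:                       # (H, W) grayscale
        h, w = shape
        return (0, 1), (1, 1, h, w)
    if ndim == 3:
        if shape[-1] <= 4:              # (H, W, C): channels last
            h, w, c = shape
            return (2, 0, 1), (1, c, h, w)
        c, h, w = shape                 # (C, H, W): channels first
        return (0, 1, 2), (1, c, h, w)
    if ndim == 4:                       # already (N, C, H, W)
        return (0, 1, 2, 3), tuple(shape)
    raise ValueError(f"unsupported number of dimensions: {ndim}")


for shape in [(64, 48), (64, 48, 3), (5, 64, 48), (10, 3, 64, 48)]:
    print(shape, "->", to_nchw(shape)[1])
```

With NumPy, the returned permutation can be passed to `np.transpose` before reshaping to the NCHW shape. One consequence of the rule is worth noting: a channels-first image whose width is 4 or less would be read as channels-last, so very narrow CHW inputs are best converted to 4D NCHW explicitly.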

Direct Memory Conversion

Convert numpy arrays directly to Zarr without saving intermediate files:

import numpy as np
from images_to_zarr.convert import convert_from_memory

# Your image data (must be 4D NCHW format)
images = np.random.rand(1000, 3, 256, 256).astype(np.float32)

# Convert directly to zarr
zarr_path = convert_from_memory(
    images=images,
    output_dir="./data",
    compressor="lz4",
    chunk_shape=(100, 256, 256),  # Chunk 100 images together
    overwrite=True
)

Release history

- 0.3.5: Nov 19, 2025
- 0.3.4: Nov 12, 2025
- 0.3.3: Sep 24, 2025
- 0.3.2: Sep 3, 2025
- 0.3.1: Sep 3, 2025
- 0.3.0: Jun 13, 2025 (this version)
- 0.2.0: Jun 6, 2025
- 0.1.0: May 31, 2025

Download files

Source Distribution

images_to_zarr-0.3.0.tar.gz (73.8 kB)

- Upload date: Jun 13, 2025
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.12

| Algorithm | Hash digest |
|---|---|
| SHA256 | `eef91be26ca00250bb794cd81189fc699767800a09a140dc7a7636d2c1e6bd49` |
| MD5 | `ba902c5bad729eb0dc8bbc71f8e4a748` |
| BLAKE2b-256 | `82bc46cdf16149e1a0504d2bc95cee34d70c2c03ce7a305d4f208411e01107e1` |

Built Distribution

images_to_zarr-0.3.0-py3-none-any.whl (46.8 kB)

- Upload date: Jun 13, 2025
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.12

| Algorithm | Hash digest |
|---|---|
| SHA256 | `e77ae5db3bc95bc60b6938b762850a537699eee18b624ed5c92d2211a5ee7216` |
| MD5 | `864138478d94f6fcf83f1f7659cbe933` |
| BLAKE2b-256 | `c3c5065f8bd058b7aa303b0da27c2e73bf1371a5201b1079376e3fee093e46b4` |
