A high-performance array storage and manipulation library

NumPack

NumPack is a high-performance array storage library that combines Rust's performance with Python's ease of use. It provides exceptional performance for both reading and writing large NumPy arrays, with special optimizations for in-place modifications.

Key Features

  • 🚀 397x faster row replacement than NPY
  • 405x faster data append than NPY
  • 💨 54x faster lazy loading than NPY mmap
  • 📖 1.3x faster full data loading than NPY
  • 🔄 174x speedup with Writable Batch Mode for frequent modifications
  • 💾 Zero-copy operations with minimal memory footprint
  • 🛠 Seamless integration with existing NumPy workflows

Features

  • High Performance: Optimized for both reading and writing large numerical arrays
  • Lazy Loading Support: Efficient memory usage through on-demand data loading
  • In-place Operations: Support for in-place array modifications without full file rewrite
  • Batch Processing Modes:
    • Batch Mode: 25-37x speedup for batch operations
    • Writable Batch Mode: 174x speedup for frequent modifications
  • Multiple Data Types: Supports various numerical data types including:
    • Boolean
    • Unsigned integers (8-bit to 64-bit)
    • Signed integers (8-bit to 64-bit)
    • Floating point (16-bit, 32-bit and 64-bit)
    • Complex numbers (64-bit and 128-bit)
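
The in-place modification idea can be illustrated with the standard library alone. The sketch below is not NumPack's implementation or file format: it assumes a hypothetical headerless file of row-major float32 values, and `write_rows`, `replace_row`, and `read_rows` are illustrative helpers. One row is overwritten by seeking to its byte offset, so the rest of the file is never rewritten.

```python
import os
import struct
import tempfile

FLOAT32_SIZE = 4  # bytes per float32 value


def write_rows(path, rows):
    """Write rows of floats as raw little-endian float32, row-major, no header."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(struct.pack(f"<{len(row)}f", *row))


def replace_row(path, row_index, new_row):
    """Overwrite a single row in place by seeking to its byte offset."""
    row_bytes = len(new_row) * FLOAT32_SIZE
    with open(path, "r+b") as f:
        f.seek(row_index * row_bytes)
        f.write(struct.pack(f"<{len(new_row)}f", *new_row))


def read_rows(path, n_cols):
    """Read the whole file back as a list of rows."""
    with open(path, "rb") as f:
        data = f.read()
    values = struct.unpack(f"<{len(data) // FLOAT32_SIZE}f", data)
    return [list(values[i:i + n_cols]) for i in range(0, len(values), n_cols)]


def demo():
    """Replace the middle row of a 3x2 file and report whether the size changed."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    os.close(fd)
    try:
        write_rows(path, [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
        size_before = os.path.getsize(path)
        replace_row(path, 1, [30.0, 40.0])
        return read_rows(path, 2), size_before == os.path.getsize(path)
    finally:
        os.remove(path)
```

Only the bytes of the replaced row are written, which is why the cost of such an update does not grow with the total file size.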

Installation

From PyPI (Recommended)

Prerequisites

  • Python >= 3.9
  • NumPy >= 1.26.0
pip install numpack
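
Before or after installing, the prerequisites listed above can be checked from Python. This is a small standard-library sketch; `check_prerequisites`, `installed_version`, and `parse_major_minor` are hypothetical helper names, not part of NumPack.

```python
import re
import sys
from importlib import metadata

MIN_PYTHON = (3, 9)   # from the prerequisites list
MIN_NUMPY = (1, 26)   # NumPy >= 1.26.0


def parse_major_minor(version):
    """Return (major, minor) from a version string such as '1.26.4'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_prerequisites():
    """Summarize whether the interpreter and NumPy meet the stated minimums."""
    numpy_version = installed_version("numpy")
    numpy_ok = (
        numpy_version is not None
        and parse_major_minor(numpy_version) >= MIN_NUMPY
    )
    return {
        "python_ok": sys.version_info[:2] >= MIN_PYTHON,
        "numpy_ok": numpy_ok,
        "numpack_version": installed_version("numpack"),
    }
```

Calling `check_prerequisites()` returns a dict; `numpack_version` is `None` until the package has been installed.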

From Source

Prerequisites (All Platforms including Windows)

  • Python >= 3.9
  • Rust >= 1.70.0 (Required on all platforms, install from rustup.rs)
  • NumPy >= 1.26.0
  • Appropriate C/C++ compiler
    • Windows: Microsoft C++ Build Tools
    • macOS: Xcode Command Line Tools (xcode-select --install)
    • Linux: GCC/Clang (build-essential on Ubuntu/Debian)

Build Steps

  1. Clone the repository:
git clone https://github.com/BirchKwok/NumPack.git
cd NumPack
  2. Install maturin (quote the requirement so the shell does not treat > and < as redirections):
pip install "maturin>=1.0,<2.0"
  3. Build and install:
# Install in development mode
maturin develop

# Or build wheel package
maturin build --release
pip install target/wheels/numpack-*.whl

Usage

Basic Operations

import numpy as np
from numpack import NumPack

# Using context manager (Recommended)
with NumPack("data_directory") as npk:
    # Save arrays
    arrays = {
        'array1': np.random.rand(1000, 100).astype(np.float32),
        'array2': np.random.rand(500, 200).astype(np.float32)
    }
    npk.save(arrays)
    
    # Load arrays - Normal mode
    loaded = npk.load("array1")
    
    # Load arrays - Lazy mode
    lazy_array = npk.load("array1", lazy=True)

Advanced Operations

with NumPack("data_directory") as npk:
    # Replace specific rows
    replacement = np.random.rand(10, 100).astype(np.float32)
    npk.replace({'array1': replacement}, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    
    # Append new data
    new_data = {'array1': np.random.rand(100, 100).astype(np.float32)}
    npk.append(new_data)
    
    # Random access operations
    data = npk.getitem('array1', [0, 1, 2])
    data = npk['array1']  # Dictionary-style access
    
    # Stream loading for large arrays
    for batch in npk.stream_load('array1', buffer_size=1000):
        process_batch(batch)  # process_batch is a user-supplied function
    
    # Drop specific rows or entire arrays (done last so the reads above still work)
    npk.drop('array2', [0, 1, 2])  # Drop specific rows
    npk.drop('array1')  # Drop entire array
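
The snippet above does not define `process_batch`. One possible shape for it is a running column mean accumulated batch by batch, sketched here with plain Python lists in place of NumPy arrays so that it runs without NumPack. `RunningMean` and `iter_batches` are hypothetical names; `iter_batches` merely stands in for a batch source such as `stream_load`.

```python
def iter_batches(rows, buffer_size):
    """Yield consecutive slices of at most buffer_size rows."""
    for start in range(0, len(rows), buffer_size):
        yield rows[start:start + buffer_size]


class RunningMean:
    """Accumulate per-column sums across batches without holding all rows."""

    def __init__(self, n_cols):
        self.sums = [0.0] * n_cols
        self.count = 0

    def process_batch(self, batch):
        """Fold one batch of rows into the running sums."""
        for row in batch:
            for j, value in enumerate(row):
                self.sums[j] += value
        self.count += len(batch)

    def result(self):
        """Return the mean of each column over all rows seen so far."""
        return [s / self.count for s in self.sums]


# Ten rows, two columns: column 0 is i, column 1 is 2 * i.
rows = [[float(i), float(2 * i)] for i in range(10)]
stats = RunningMean(n_cols=2)
for batch in iter_batches(rows, buffer_size=4):
    stats.process_batch(batch)
column_means = stats.result()
```

Because only the sums and a row count are kept, memory use depends on the batch size rather than on the full array.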

Batch Processing Modes

NumPack provides two high-performance batch modes for scenarios with frequent modifications:

Batch Mode (25-37x speedup)

with NumPack("data.npk") as npk:
    with npk.batch_mode():
        for i in range(1000):
            arr = npk.load('data')      # Load from cache
            arr[:10] *= 2.0
            npk.save({'data': arr})     # Save to cache
# All changes written to disk on exit
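
The comments above describe loads and saves going through a cache, with the changes written to disk on exit. That general write-back pattern can be sketched as follows. This is an illustration only, not NumPack's implementation: `DictStore`, its `batch_mode` method, and the write counter are hypothetical stand-ins, and a dict plays the role of the disk.

```python
from contextlib import contextmanager


class DictStore:
    """Toy store: `disk` stands in for files, `disk_writes` counts writes to it."""

    def __init__(self):
        self.disk = {}
        self.disk_writes = 0
        self._cache = None  # None means batch mode is off

    def save(self, name, values):
        if self._cache is not None:
            self._cache[name] = list(values)  # deferred until exit
        else:
            self.disk[name] = list(values)
            self.disk_writes += 1

    def load(self, name):
        if self._cache is not None and name in self._cache:
            return list(self._cache[name])
        return list(self.disk[name])

    @contextmanager
    def batch_mode(self):
        self._cache = {}
        try:
            yield self
        finally:
            # Flush each cached entry once, however many times it was saved.
            for name, values in self._cache.items():
                self.disk[name] = values
                self.disk_writes += 1
            self._cache = None


store = DictStore()
store.save("data", [1.0, 2.0, 3.0])   # one direct write
with store.batch_mode():
    for _ in range(3):
        arr = store.load("data")
        arr[0] *= 2.0
        store.save("data", arr)       # goes to the cache only
```

Three saves inside the batch block collapse into a single flush, which is where this kind of mode saves time when the same array is modified repeatedly.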

Writable Batch Mode (174x speedup)

with NumPack("data.npk") as npk:
    with npk.writable_batch_mode() as wb:
        for i in range(1000):
            arr = wb.load('data')   # Memory-mapped view
            arr[:10] *= 2.0         # Direct modification
            # No save needed - changes are automatic
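
A memory-mapped view lets writes land in the file without an explicit save step. Below is a minimal standard-library sketch of that idea, assuming a hypothetical raw float32 file rather than NumPack's actual format; `scale_prefix` is an illustrative helper.

```python
import mmap
import os
import tempfile
from array import array


def scale_prefix(path, count, factor):
    """Multiply the first `count` float32 values of a raw file in place."""
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
        raw = memoryview(mm)
        view = raw.cast("f")  # reinterpret the bytes as native float32
        try:
            for i in range(count):
                view[i] *= factor  # writes go straight to the mapping
        finally:
            # Release both views so the mmap can be closed cleanly.
            view.release()
            raw.release()


def demo():
    """Write four floats, double the first two via mmap, and read them back."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    os.close(fd)
    try:
        with open(path, "wb") as f:
            array("f", [1.0, 2.0, 3.0, 4.0]).tofile(f)
        scale_prefix(path, 2, 2.0)
        result = array("f")
        with open(path, "rb") as f:
            result.fromfile(f, 4)
        return result.tolist()
    finally:
        os.remove(path)
```

No value is copied into a separate buffer and written back; the modification and the persistence are the same operation.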

Performance

All benchmarks were conducted on macOS (Apple Silicon) using the Rust backend with precise timeit measurements.

Performance Comparison (1M rows × 10 columns, Float32, 38.1MB)

| Operation | NumPack | NPY | NPZ | Zarr | HDF5 | Parquet | NumPack Advantage |
|---|---|---|---|---|---|---|---|
| Full Load | 8.27ms 🥇 | 10.51ms | 181.62ms | 41.40ms | 58.39ms | 23.74ms | 1.3x vs NPY |
| Lazy Load | 0.002ms 🥇 | 0.107ms | N/A | 0.397ms | 0.080ms | N/A | 54x vs NPY |
| Replace 100 rows | 0.047ms 🥇 | 18.51ms | 1574ms | 9.08ms | 0.299ms | 187.65ms | 397x vs NPY |
| Append 100 rows | 0.067ms 🥇 | 27.09ms | 1582ms | 9.98ms | 0.212ms | 204.74ms | 405x vs NPY |
| Random Access (1K) | 0.051ms | 0.010ms 🥇 | 183.16ms | 3.46ms | 4.91ms | 22.80ms | 26x vs NPZ |
| Save | 16.15ms | 7.19ms 🥇 | 1378ms | 80.91ms | 55.66ms | 159.14ms | 2.2x slower |
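
The "NumPack Advantage" column can be approximately reproduced from the rounded timings by dividing the NPY time by the NumPack time. The short calculation below uses only values from the table above. Small differences from the quoted figures (for example about 394x rather than 397x for row replacement) presumably come from rounding in the published timings.

```python
# Timings in milliseconds, copied from the 1M-row table: (NumPack, NPY).
timings_ms = {
    "Full Load": (8.27, 10.51),
    "Lazy Load": (0.002, 0.107),
    "Replace 100 rows": (0.047, 18.51),
    "Append 100 rows": (0.067, 27.09),
    "Save": (16.15, 7.19),
}


def ratio_vs_npy(numpack_ms, npy_ms):
    """Return how many times faster NumPack is than NPY (<1 means slower)."""
    return npy_ms / numpack_ms


ratios = {op: ratio_vs_npy(*pair) for op, pair in timings_ms.items()}
for op, r in ratios.items():
    label = f"{r:.1f}x faster" if r >= 1 else f"{1 / r:.1f}x slower"
    print(f"{op}: {label}")
```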

Performance Comparison (100K rows × 10 columns, Float32, 3.8MB)

| Operation | NumPack | NPY | NPZ | Zarr | HDF5 | NumPack Advantage |
|---|---|---|---|---|---|---|
| Full Load | 0.98ms | 0.66ms 🥇 | 18.65ms | 6.24ms | 6.35ms | 1.5x slower |
| Lazy Load | 0.002ms 🥇 | 0.103ms | N/A | 0.444ms | 0.085ms | 51x vs NPY |
| Replace 100 rows | 0.039ms 🥇 | 2.13ms | 159.19ms | 4.39ms | 0.208ms | 55x vs NPY |
| Append 100 rows | 0.059ms 🥇 | 3.29ms | 159.19ms | 4.59ms | 0.206ms | 56x vs NPY |
| Random Access (1K) | 0.116ms | 0.010ms 🥇 | 18.73ms | 1.89ms | 4.82ms | 12x vs NPZ |
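
The dataset sizes quoted in the two comparison headings follow from the array shape: rows × columns × 4 bytes per float32 value. The arithmetic below shows that 38.1MB and 3.8MB correspond to binary megabytes (MiB, 2^20 bytes); `size_mib` is just an illustrative helper and ignores any file header.

```python
FLOAT32_BYTES = 4
MIB = 2 ** 20


def size_mib(rows, cols, itemsize=FLOAT32_BYTES):
    """Raw payload size of a rows x cols array in MiB, ignoring any header."""
    return rows * cols * itemsize / MIB


large = size_mib(1_000_000, 10)   # 40,000,000 bytes
small = size_mib(100_000, 10)     # 4,000,000 bytes
print(f"{large:.1f} MiB, {small:.1f} MiB")  # prints "38.1 MiB, 3.8 MiB"
```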

Batch Mode Performance (1M rows × 10 columns)

100 consecutive modify operations:

| Mode | Time | Speedup |
|---|---|---|
| Normal Mode | 856ms | 1.0x |
| Batch Mode | 34ms | 25x faster 🔥 |
| Writable Batch Mode | 4.9ms | 174x faster 🔥🔥 |

Key Performance Highlights

  1. Data Modification - Exceptional Performance 🏆

    • Replace operations: 397x faster than NPY (large dataset)
    • Append operations: 405x faster than NPY (large dataset)
    • Supports efficient in-place modification without full file rewrite
    • In-place modification is NumPack's core advantage
  2. Data Loading - Industry Leading

    • Full load: fastest of the tested formats on the 1M-row dataset (8.27ms); NPY is faster on the 100K-row dataset (0.66ms vs 0.98ms)
    • Lazy load: 54x faster than NPY mmap (0.002ms)
    • Optimized batch data transfer with SIMD acceleration
  3. Batch Processing - Revolutionary Performance

    • Batch Mode: 25-37x speedup for batch operations
    • Writable Batch Mode: 174x speedup for frequent modifications
    • Ideal for machine learning pipelines and data processing workflows
  4. Storage Efficiency

    • File size identical to NPY
    • ~10% smaller than Zarr/NPZ (compressed formats)

When to Use NumPack

Strongly Recommended (90% of use cases):

  • Machine learning and deep learning pipelines
  • Real-time data stream processing
  • Data annotation and correction workflows
  • Feature stores with dynamic updates
  • Any scenario requiring frequent data modifications
  • Fast data loading requirements

⚠️ Consider Alternatives (10% of use cases):

  • Write-once, never modify → Use NPY (faster initial write)
  • Frequent single-row access → Use NPY mmap
  • Extreme compression requirements → Use NPZ (10% smaller, but 1000x slower)

Best Practices

1. Use Writable Batch Mode for Frequent Modifications

# 174x speedup for frequent modifications
with NumPack("data.npk") as npk:
    with npk.writable_batch_mode() as wb:
        for i in range(1000):
            arr = wb.load('data')
            arr[:10] *= 2.0
# Automatic persistence on exit

2. Use Batch Mode for Batch Operations

# 25-37x speedup for batch processing
with NumPack("data.npk") as npk:
    with npk.batch_mode():
        for i in range(1000):
            arr = npk.load('data')
            arr[:10] *= 2.0
            npk.save({'data': arr})
# Single write on exit

3. Use Lazy Loading for Large Datasets

with NumPack("large_data.npk") as npk:
    # Only 0.002ms to initialize
    lazy_array = npk.load("array", lazy=True)
    # Data loaded on demand
    subset = lazy_array[1000:2000]

4. Reuse NumPack Instances

# ✅ Efficient: Reuse instance
with NumPack("data.npk") as npk:
    for i in range(100):
        data = npk.load('array')

# ❌ Inefficient: Create new instance each time
for i in range(100):
    with NumPack("data.npk") as npk:
        data = npk.load('array')

Benchmark Methodology

All benchmarks use:

  • timeit for precise timing
  • Multiple repeats, best time selected
  • Pure operation time (excluding file open/close overhead)
  • Float32 arrays
  • macOS Apple Silicon (results may vary by platform)
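
The timing approach listed above (timeit, several repeats, best time kept) can be sketched with the standard library. This is not the project's benchmark script: `best_time_ms` and the toy `workload` are placeholders, and the repeat and call counts are arbitrary.

```python
import timeit


def best_time_ms(func, repeat=5, number=10):
    """Return the best per-call time in milliseconds over `repeat` runs."""
    # Each entry in `totals` is the total time for `number` calls of func.
    totals = timeit.repeat(func, repeat=repeat, number=number)
    return min(totals) / number * 1000.0


def workload():
    # Placeholder for the operation under test, e.g. a load or replace call.
    return sum(range(10_000))


elapsed = best_time_ms(workload)
print(f"best of 5 runs: {elapsed:.4f} ms per call")
```

Taking the minimum rather than the mean reduces the influence of background activity on the machine, which matters for operations that finish in microseconds.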

For complete benchmark code, see comprehensive_format_benchmark.py.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the Apache License, Version 2.0 - see the LICENSE file for details.

Copyright 2024 NumPack Contributors

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
