A high-performance array storage and manipulation library
NumPack
NumPack is a high-performance array storage library that combines Rust's performance with Python's ease of use. It provides exceptional performance for both reading and writing large NumPy arrays, with special optimizations for in-place modifications.
Key Features
- 🚀 397x faster row replacement than NPY
- ⚡ 405x faster data append than NPY
- 💨 54x faster lazy loading than NPY mmap
- 📖 1.3x faster full data loading than NPY
- 🔄 174x speedup with Writable Batch Mode for frequent modifications
- 💾 Zero-copy operations with minimal memory footprint
- 🛠 Seamless integration with existing NumPy workflows
Features
- High Performance: Optimized for both reading and writing large numerical arrays
- Lazy Loading Support: Efficient memory usage through on-demand data loading
- In-place Operations: Support for in-place array modifications without full file rewrite
- Batch Processing Modes:
  - Batch Mode: 25-37x speedup for batch operations
  - Writable Batch Mode: 174x speedup for frequent modifications
- Multiple Data Types: Supports various numerical data types including:
  - Boolean
  - Unsigned integers (8-bit to 64-bit)
  - Signed integers (8-bit to 64-bit)
  - Floating point (16-bit, 32-bit and 64-bit)
  - Complex numbers (64-bit and 128-bit)
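To make the idea of in-place modification without a full file rewrite concrete, here is a minimal standard-library sketch. It assumes a hypothetical raw layout (row-major little-endian float32 with no header), which is not NumPack's actual on-disk format; `write_rows`, `replace_row`, `read_row` and the file name are illustrative names only.

```python
import os
import struct
import tempfile

ROW_LEN = 4                         # values per row (illustrative)
ROW_FMT = f"<{ROW_LEN}f"            # one little-endian float32 row
ROW_BYTES = struct.calcsize(ROW_FMT)


def write_rows(path, rows):
    """Write every row once (the 'full rewrite' path)."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(struct.pack(ROW_FMT, *row))


def replace_row(path, index, row):
    """Overwrite a single row in place by seeking to its byte offset."""
    with open(path, "r+b") as f:
        f.seek(index * ROW_BYTES)
        f.write(struct.pack(ROW_FMT, *row))


def read_row(path, index):
    """Read one row back without loading the rest of the file."""
    with open(path, "rb") as f:
        f.seek(index * ROW_BYTES)
        return list(struct.unpack(ROW_FMT, f.read(ROW_BYTES)))


path = os.path.join(tempfile.mkdtemp(), "rows.bin")
write_rows(path, [[float(i)] * ROW_LEN for i in range(1000)])

# Only ROW_BYTES bytes are written here; the other 999 rows are untouched.
replace_row(path, 500, [1.5, 2.5, 3.5, 4.5])
```

Because every row has a fixed byte width, the cost of a replacement depends on the number of rows changed rather than on the size of the file, which is the property the replace and append benchmarks measure.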
Installation
From PyPI (Recommended)
Prerequisites
- Python >= 3.9
- NumPy >= 1.26.0
pip install numpack
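After installing, a quick check of the prerequisites can save some debugging time. The sketch below uses only the standard library; `installed_version` is a hypothetical helper written for this example, not part of NumPack.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


python_ok = sys.version_info >= (3, 9)      # prerequisite listed above
print("Python >= 3.9:", python_ok)
for name in ("numpy", "numpack"):
    print(name, installed_version(name) or "not installed")
```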
From Source
Prerequisites (All Platforms including Windows)
- Python >= 3.9
- Rust >= 1.70.0 (Required on all platforms, install from rustup.rs)
- NumPy >= 1.26.0
- Appropriate C/C++ compiler
  - Windows: Microsoft C++ Build Tools
  - macOS: Xcode Command Line Tools (xcode-select --install)
  - Linux: GCC/Clang (build-essential on Ubuntu/Debian)
Build Steps
- Clone the repository:
git clone https://github.com/BirchKwok/NumPack.git
cd NumPack
- Install maturin:
pip install "maturin>=1.0,<2.0"
- Build and install:
# Install in development mode
maturin develop
# Or build wheel package
maturin build --release
pip install target/wheels/numpack-*.whl
Usage
Basic Operations
import numpy as np
from numpack import NumPack

# Using context manager (Recommended)
with NumPack("data_directory") as npk:
    # Save arrays
    arrays = {
        'array1': np.random.rand(1000, 100).astype(np.float32),
        'array2': np.random.rand(500, 200).astype(np.float32)
    }
    npk.save(arrays)

    # Load arrays - Normal mode
    loaded = npk.load("array1")

    # Load arrays - Lazy mode
    lazy_array = npk.load("array1", lazy=True)
Advanced Operations
with NumPack("data_directory") as npk:
    # Replace specific rows
    replacement = np.random.rand(10, 100).astype(np.float32)
    npk.replace({'array1': replacement}, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

    # Append new data
    new_data = {'array1': np.random.rand(100, 100).astype(np.float32)}
    npk.append(new_data)

    # Random access operations
    data = npk.getitem('array1', [0, 1, 2])
    data = npk['array1']  # Dictionary-style access

    # Stream loading for large arrays
    # (process_batch is a placeholder for your own function)
    for batch in npk.stream_load('array1', buffer_size=1000):
        process_batch(batch)

    # Drop specific rows or entire arrays
    # (done last, since a dropped array can no longer be read)
    npk.drop('array2', [0, 1, 2])  # Drop specific rows
    npk.drop('array1')  # Drop entire array
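The `process_batch` call in the streaming example is left undefined. The sketch below shows one possible shape for it, assuming each batch arrives as a 2-D NumPy array of at most `buffer_size` rows (an assumption, not something this README states). `iter_batches` is a hypothetical stand-in for `npk.stream_load` so the snippet runs without a NumPack directory, and `RunningMean` is an illustrative name.

```python
import numpy as np


def iter_batches(array, buffer_size):
    """Stand-in for a streaming loader: yield consecutive row blocks."""
    for start in range(0, array.shape[0], buffer_size):
        yield array[start:start + buffer_size]


class RunningMean:
    """Accumulate a column-wise mean one batch at a time."""

    def __init__(self, n_cols):
        self.total = np.zeros(n_cols, dtype=np.float64)
        self.count = 0

    def process_batch(self, batch):
        # Sum in float64 so precision does not degrade across batches.
        self.total += batch.sum(axis=0, dtype=np.float64)
        self.count += batch.shape[0]

    @property
    def mean(self):
        return self.total / self.count


data = np.arange(2500 * 4, dtype=np.float32).reshape(2500, 4)
stats = RunningMean(n_cols=4)
n_batches = 0
for batch in iter_batches(data, buffer_size=1000):
    stats.process_batch(batch)
    n_batches += 1
```

Keeping only running totals means memory use is bounded by one batch, regardless of how many rows the array holds.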
Batch Processing Modes
NumPack provides two high-performance batch modes for scenarios with frequent modifications:
Batch Mode (25-37x speedup)
with NumPack("data.npk") as npk:
    with npk.batch_mode():
        for i in range(1000):
            arr = npk.load('data')  # Load from cache
            arr[:10] *= 2.0
            npk.save({'data': arr})  # Save to cache
    # All changes written to disk on exit
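The Batch Mode comments above describe loads and saves going through a cache, with a single write on exit. A toy write-back cache illustrates why that reduces I/O. This is a conceptual sketch only: `TinyStore` is a hypothetical in-memory class, not NumPack's implementation, and "disk" here is just a dictionary with a write counter.

```python
from contextlib import contextmanager


class TinyStore:
    """Toy key-value store that counts how often it 'writes to disk'."""

    def __init__(self):
        self.disk = {}
        self.disk_writes = 0
        self._cache = None              # active only inside batch_mode()

    def save(self, name, value):
        if self._cache is not None:
            self._cache[name] = value   # defer the write
        else:
            self.disk[name] = value
            self.disk_writes += 1

    def load(self, name):
        if self._cache is not None and name in self._cache:
            return self._cache[name]
        return self.disk[name]

    @contextmanager
    def batch_mode(self):
        self._cache = {}
        try:
            yield self
        finally:
            # Flush each modified entry once, however often it was saved.
            for name, value in self._cache.items():
                self.disk[name] = value
                self.disk_writes += 1
            self._cache = None


store = TinyStore()
store.save("data", 1.0)                 # one write in normal mode
with store.batch_mode():
    for _ in range(1000):
        store.save("data", store.load("data") + 1.0)
```

In this toy model, a thousand saves inside the batch collapse into one flush, so the write count grows with the number of distinct arrays touched rather than the number of save calls.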
Writable Batch Mode (174x speedup)
with NumPack("data.npk") as npk:
    with npk.writable_batch_mode() as wb:
        for i in range(1000):
            arr = wb.load('data')  # Memory-mapped view
            arr[:10] *= 2.0  # Direct modification
            # No save needed - changes are automatic
Performance
All benchmarks were conducted on macOS (Apple Silicon) using the Rust backend with precise timeit measurements.
Performance Comparison (1M rows × 10 columns, Float32, 38.1MB)
| Operation | NumPack | NPY | NPZ | Zarr | HDF5 | Parquet | NumPack Advantage |
|---|---|---|---|---|---|---|---|
| Full Load | 8.27ms 🥇 | 10.51ms | 181.62ms | 41.40ms | 58.39ms | 23.74ms | 1.3x vs NPY |
| Lazy Load | 0.002ms 🥇 | 0.107ms | N/A | 0.397ms | 0.080ms | N/A | 54x vs NPY |
| Replace 100 rows | 0.047ms 🥇 | 18.51ms | 1574ms | 9.08ms | 0.299ms | 187.65ms | 397x vs NPY |
| Append 100 rows | 0.067ms 🥇 | 27.09ms | 1582ms | 9.98ms | 0.212ms | 204.74ms | 405x vs NPY |
| Random Access (1K) | 0.051ms | 0.010ms 🥇 | 183.16ms | 3.46ms | 4.91ms | 22.80ms | 26x vs NPZ |
| Save | 16.15ms | 7.19ms 🥇 | 1378ms | 80.91ms | 55.66ms | 159.14ms | 2.2x slower |
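The "NumPack Advantage" column and the 38.1MB figure can be cross-checked from the table itself. The sketch below copies the NumPack and NPY timings from the rows above and recomputes the ratios; `ratio_vs_npy` is an illustrative helper. Since the table shows rounded timings, recomputed ratios can differ slightly from the published multipliers (for example, roughly 394x rather than 397x for row replacement).

```python
# Timings in milliseconds, copied from the 1M-row table above.
timings_ms = {
    "Full Load":        {"NumPack": 8.27,  "NPY": 10.51},
    "Lazy Load":        {"NumPack": 0.002, "NPY": 0.107},
    "Replace 100 rows": {"NumPack": 0.047, "NPY": 18.51},
    "Append 100 rows":  {"NumPack": 0.067, "NPY": 27.09},
    "Save":             {"NumPack": 16.15, "NPY": 7.19},
}


def ratio_vs_npy(op):
    """NPY time divided by NumPack time (>1 means NumPack is faster)."""
    t = timings_ms[op]
    return t["NPY"] / t["NumPack"]


for op in timings_ms:
    r = ratio_vs_npy(op)
    label = f"{r:.1f}x faster" if r >= 1 else f"{1 / r:.1f}x slower"
    print(f"{op:<18} {label}")

# Raw payload size: rows x columns x 4 bytes per float32, in MiB.
rows, cols, bytes_per_float32 = 1_000_000, 10, 4
size_mib = rows * cols * bytes_per_float32 / 2**20
print(f"payload: {size_mib:.1f} MiB")
```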
Performance Comparison (100K rows × 10 columns, Float32, 3.8MB)
| Operation | NumPack | NPY | NPZ | Zarr | HDF5 | NumPack Advantage |
|---|---|---|---|---|---|---|
| Full Load | 0.98ms | 0.66ms 🥇 | 18.65ms | 6.24ms | 6.35ms | 1.5x slower |
| Lazy Load | 0.002ms 🥇 | 0.103ms | N/A | 0.444ms | 0.085ms | 51x vs NPY |
| Replace 100 rows | 0.039ms 🥇 | 2.13ms | 159.19ms | 4.39ms | 0.208ms | 55x vs NPY |
| Append 100 rows | 0.059ms 🥇 | 3.29ms | 159.19ms | 4.59ms | 0.206ms | 56x vs NPY |
| Random Access (1K) | 0.116ms | 0.010ms 🥇 | 18.73ms | 1.89ms | 4.82ms | 12x vs NPZ |
Batch Mode Performance (1M rows × 10 columns)
100 consecutive modify operations:
| Mode | Time | Speedup |
|---|---|---|
| Normal Mode | 856ms | 1.0x |
| Batch Mode | 34ms | 25x faster 🔥 |
| Writable Batch Mode | 4.9ms | 174x faster 🔥🔥 |
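Dividing the totals above by the 100 operations gives the amortized cost per modification, which is often the more useful number when sizing a workload. The snippet is plain arithmetic on the table values; the variable names are illustrative.

```python
N_OPS = 100     # "100 consecutive modify operations"
total_ms = {
    "Normal Mode": 856.0,
    "Batch Mode": 34.0,
    "Writable Batch Mode": 4.9,
}

baseline = total_ms["Normal Mode"]
summary = {
    mode: {"per_op_ms": t / N_OPS, "speedup": baseline / t}
    for mode, t in total_ms.items()
}

for mode, s in summary.items():
    print(f"{mode:<20} {s['per_op_ms']:.3f} ms/op  {s['speedup']:.1f}x")
```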
Key Performance Highlights
- Data Modification - Exceptional Performance 🏆
  - Replace operations: 397x faster than NPY (large dataset)
  - Append operations: 405x faster than NPY (large dataset)
  - Supports efficient in-place modification without full file rewrite
  - NumPack's core advantage
- Data Loading - Industry Leading
  - Full load: Fastest for large datasets (8.27ms)
  - Lazy load: 54x faster than NPY mmap (0.002ms)
  - Optimized batch data transfer with SIMD acceleration
- Batch Processing - Revolutionary Performance
  - Batch Mode: 25-37x speedup for batch operations
  - Writable Batch Mode: 174x speedup for frequent modifications
  - Ideal for machine learning pipelines and data processing workflows
- Storage Efficiency
  - File size identical to NPY
  - ~10% smaller than Zarr/NPZ (compressed formats)
When to Use NumPack
✅ Strongly Recommended (90% of use cases):
- Machine learning and deep learning pipelines
- Real-time data stream processing
- Data annotation and correction workflows
- Feature stores with dynamic updates
- Any scenario requiring frequent data modifications
- Fast data loading requirements
⚠️ Consider Alternatives (10% of use cases):
- Write-once, never modify → Use NPY (faster initial write)
- Frequent single-row access → Use NPY mmap
- Extreme compression requirements → Use NPZ (10% smaller, but 1000x slower)
Best Practices
1. Use Writable Batch Mode for Frequent Modifications
# 174x speedup for frequent modifications
with NumPack("data.npk") as npk:
    with npk.writable_batch_mode() as wb:
        for i in range(1000):
            arr = wb.load('data')
            arr[:10] *= 2.0
    # Automatic persistence on exit
2. Use Batch Mode for Batch Operations
# 25-37x speedup for batch processing
with NumPack("data.npk") as npk:
    with npk.batch_mode():
        for i in range(1000):
            arr = npk.load('data')
            arr[:10] *= 2.0
            npk.save({'data': arr})
    # Single write on exit
3. Use Lazy Loading for Large Datasets
with NumPack("large_data.npk") as npk:
    # Only 0.002ms to initialize
    lazy_array = npk.load("array", lazy=True)

    # Data loaded on demand
    subset = lazy_array[1000:2000]
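For comparison, the "NPY mmap" baseline used in the benchmarks follows the same load-on-demand pattern with plain NumPy. The sketch below uses `np.load(..., mmap_mode="r")`; the array shape and temporary file path are illustrative, and it says nothing about how NumPack implements `lazy=True` internally.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "large.npy")
np.save(path, np.arange(100_000 * 10, dtype=np.float32).reshape(100_000, 10))

# mmap_mode="r" maps the file instead of reading it; row data is only
# pulled from disk when it is indexed.
lazy = np.load(path, mmap_mode="r")

# Copy just the requested rows into an ordinary in-memory array.
subset = np.array(lazy[1000:2000])
print(type(lazy).__name__, subset.shape)
```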
4. Reuse NumPack Instances
# ✅ Efficient: Reuse instance
with NumPack("data.npk") as npk:
    for i in range(100):
        data = npk.load('array')

# ❌ Inefficient: Create new instance each time
for i in range(100):
    with NumPack("data.npk") as npk:
        data = npk.load('array')
Benchmark Methodology
All benchmarks use:
- timeit for precise timing
- Multiple repeats, best time selected
- Pure operation time (excluding file open/close overhead)
- Float32 arrays
- macOS Apple Silicon (results may vary by platform)
For complete benchmark code, see comprehensive_format_benchmark.py.
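As a rough illustration of the "multiple repeats, best time selected" approach, the sketch below wraps `timeit.repeat` in a small helper. `best_time_ms` and the summed list are hypothetical stand-ins for a real storage operation; this is not taken from the project's benchmark script.

```python
import timeit


def best_time_ms(func, repeat=5, number=100):
    """Best-of-`repeat` average time per call, in milliseconds."""
    runs = timeit.repeat(func, repeat=repeat, number=number)
    return min(runs) / number * 1000.0


payload = list(range(10_000))
elapsed = best_time_ms(lambda: sum(payload))
print(f"sum over 10k ints: {elapsed:.4f} ms per call")
```

Taking the minimum rather than the mean is a common choice for micro-benchmarks, since slower runs mostly reflect interference from other processes rather than the operation itself.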
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the Apache License, Version 2.0 - see the LICENSE file for details.
Copyright 2024 NumPack Contributors
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.