direct-mmap
Introduction
Numpy memory-mapped array with direct I/O. No sequential read caching, which increases the speed of random reads.
This package can greatly increase speed if you do random reads on a memory-mapped array.
In general, when we read a file, the operating system caches the read data in memory. For normal usage, this speeds up repeated reads of the same data. However, when we access the data randomly, the kernel still performs sequential prefetching (reading ahead data that will never be used), which causes a lot of unnecessary I/O. This package avoids that by using direct I/O.
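For intuition, here is a minimal sketch of the underlying idea: opening a file with O_DIRECT so the kernel skips the page cache and does no read-ahead. This is an illustrative assumption about how direct I/O can be used from Python on Linux, not this package's actual implementation; the path is hypothetical.
```python
import os
import mmap

# Open with O_DIRECT: reads bypass the page cache, so no sequential prefetching.
path = './test.npy'  # hypothetical file
fd = os.open(path, os.O_RDONLY | os.O_DIRECT)

# O_DIRECT requires offset, length and buffer address to be block-aligned, so we
# use an anonymous mmap buffer, which is page-aligned (4096 bytes).
BLOCK = 4096
buf = mmap.mmap(-1, BLOCK)
os.preadv(fd, [buf], 0)  # read the first block directly from the device
os.close(fd)
```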
Installation
Requirements
Linux, x86_64 (because direct I/O is only supported on Linux). Python version >= 3.9.
Method 1: pip
```
pip install direct-mmap
```
Method 2: build from source
First install python3-dev (change to your Python version, e.g. python3.9-dev). On Ubuntu this can be done with `sudo add-apt-repository ppa:deadsnakes/ppa; sudo apt update; sudo apt install python3-dev` (change to your Python version).
Then run the following commands:
```
git clone git@github.com:jtc1246/direct-mmap.git
cd direct-mmap/direct_mmap/cpp/
```
Then edit the Makefile and change `python3.9-config --includes` and `python3.9-config --ldflags` to your Python version.
Then run the following commands:
```
make
cd ../..
python setup.py install
```
Usage
```python
from direct_mmap import direct_mmap

path = './test.npy'
mmap = direct_mmap(path, (10000, 200), 'uint64', offset=0)
a = mmap[10:1000:10, 15:30]
```
Indexing is generally similar to numpy, but it currently does not support indexing with a numpy boolean array. Indexing returns a numpy array, not a view.
Currently all shape handling and data selection is done in Python, so it can become very slow when the selection consists of many discontiguous segments. As a suggestion, leave the last dimension out of the subscript and select it with numpy afterwards, as shown below.
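A minimal sketch of that suggestion, reusing the usage example above (assuming indexing only the leading dimension works the same way):
```python
from direct_mmap import direct_mmap

# Last dimension left out of the subscript; direct_mmap returns full rows.
mmap = direct_mmap('./test.npy', (10000, 200), 'uint64', offset=0)
rows = mmap[10:1000:10]   # returns a numpy array of whole rows
a = rows[:, 15:30]        # fine-grained column selection done in numpy
```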
Performance
I ran a benchmark comparing direct-mmap and np.memmap; direct-mmap performs much better.
Each task is the following:
```python
mmap = direct_mmap(file, (170640000, 40), 'uint64', offset=64)              # direct_mmap
mmap = np.memmap(file, shape=(170640000, 40), dtype='uint64', offset=64)    # np.memmap
mmap[id:id + 300000:1000]
```
The file is about 54 GB. There are 3 test cases: 1000 reads in a single thread, 1000 reads across 64 threads, and 10000 reads across 64 threads.
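Below is a hedged sketch of how such a benchmark loop could be driven (illustrative only; the path and driver code here are assumptions, and the actual script is in direct_mmap/main.py):
```python
import time
import random
from concurrent.futures import ThreadPoolExecutor
from direct_mmap import direct_mmap

file = '/data/test.npy'   # hypothetical path to the ~54 GB test file
mmap = direct_mmap(file, (170640000, 40), 'uint64', offset=64)

def task(_):
    id = random.randrange(170640000 - 300000)   # random starting row
    return mmap[id:id + 300000:1000]            # 300 strided rows per task

start = time.time()
with ThreadPoolExecutor(max_workers=64) as pool:
    list(pool.map(task, range(1000)))           # e.g. 1000 tasks across 64 threads
print(f'{time.time() - start:.2f} s')
```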
Results:
| | np.memmap | direct_mmap |
|---|---|---|
| 1000 tasks, single thread | 90.38s | 36s |
| 1000 tasks, 64 threads | 9.24s | 0.98s |
| 10000 tasks, 64 threads | 20.29s | 9.4s |
The environment was as follows:
- Disk: Samsung PM983, 3.84 TB, PCIe; 4K random read with 64 threads about 500K IOPS, single-thread sequential read 910 MB/s
- System: Ubuntu 22.04 (Ubuntu 20.04 inside Docker), Python 3.9.18, 64 GB RAM
- Connection: PCIe 3.0 x4, 4 GB/s
- Caches dropped with `sudo sync; sudo sysctl -w vm.drop_caches=3` before the np.memmap tests
The benchmark code is at the end of direct_mmap/main.py.