NumPy memory-mapped arrays with direct I/O: no sequential read-ahead caching, so random reads are faster.

Project description

direct-mmap

Introduction

This package provides a NumPy memory-mapped array that uses direct I/O. It can speed up random reads from a memory-mapped array considerably.

In general, when a file is read, the operating system caches the data in memory. For normal usage, this speeds up repeated reads of the same data. With random access, however, the operating system also prefetches data sequentially. The prefetched data is never used, so this causes many unnecessary reads. This package avoids the problem by using direct I/O.
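To make the idea concrete, here is a minimal sketch of a direct I/O read in pure Python. It is an illustration under stated assumptions, not this package's implementation: `aligned_range` and `direct_read` are hypothetical helper names, and the 4096-byte block size is an assumed value (the real alignment requirement depends on the device and filesystem). On Linux, `O_DIRECT` generally requires the file offset, the read length, and the buffer address to be block-aligned, so the requested range is widened to block boundaries and then trimmed.

```python
import mmap
import os

BLOCK = 4096  # assumed logical block size; check your device for the real value


def aligned_range(offset, length, block=BLOCK):
    """Widen [offset, offset + length) to block boundaries.

    Returns (start, size), both multiples of `block`.
    """
    start = (offset // block) * block
    end = -(-(offset + length) // block) * block  # round up to a block boundary
    return start, end - start


def direct_read(path, offset, length, block=BLOCK):
    """Read `length` bytes at `offset`, bypassing the page cache (Linux only)."""
    if length <= 0:
        return b""
    start, size = aligned_range(offset, length, block)
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        buf = mmap.mmap(-1, size)  # anonymous mapping, so the buffer is page-aligned
        try:
            n = os.preadv(fd, [buf], start)  # n may be short near end of file
            skip = offset - start
            return bytes(buf[skip:min(skip + length, n)])
        finally:
            buf.close()
    finally:
        os.close(fd)


print(aligned_range(5000, 100))
```

Note that some filesystems (for example tmpfs) reject `O_DIRECT`, so `direct_read` should be tried on a file that lives on a regular disk.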

Installation

Requirements

Linux, x86_64 (because direct I/O is only supported on Linux). Python version >= 3.9.

Method 1: pip

pip install direct-mmap

Method 2: build from source

Install python3-dev first (substitute your Python version, e.g. python3.9-dev). On Ubuntu, this can be done with:

sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3-dev

Then run the following commands:

git clone git@github.com:jtc1246/direct-mmap.git
cd direct-mmap/direct_mmap/cpp/

Then edit the Makefile, changing python3.9-config --includes and python3.9-config --ldflags to match your Python version.

Then run the following commands:

make
cd ../..
python setup.py install

Usage

from direct_mmap import direct_mmap

path = './test.npy'
mmap = direct_mmap(path, (10000,200), 'uint64', offset=0)
a = mmap[10:1000:10, 15:30]

Indexing generally works as in NumPy, but indexing with a NumPy boolean array is not currently supported. Indexing returns a new NumPy array, not a view.
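Since boolean masks are not supported, one possible workaround is to read a slice that encloses the wanted rows and apply the mask in NumPy afterwards. The sketch below is only an illustration: `read_masked_rows` is a hypothetical helper, not part of direct-mmap, and an ordinary NumPy array stands in for the direct_mmap object so that the example runs without the package or a data file.

```python
import numpy as np


def read_masked_rows(source, mask):
    """Return the rows of `source` selected by `mask`, using only a slice read."""
    mask = np.asarray(mask, dtype=bool)
    idx = np.flatnonzero(mask)
    if idx.size == 0:
        raise ValueError("mask selects no rows")
    lo, hi = int(idx[0]), int(idx[-1]) + 1
    block = np.asarray(source[lo:hi])  # one contiguous slice read
    return block[mask[lo:hi]]          # boolean selection done by NumPy


# Stand-in for a direct_mmap object (assumption: same slice semantics).
source = np.arange(10000 * 20, dtype='uint64').reshape(10000, 20)
mask = np.zeros(10000, dtype=bool)
mask[[100, 105, 230]] = True
rows = read_masked_rows(source, mask)
print(rows.shape)
```

This reads every row between the first and last selected row, so it only makes sense when the selected rows are clustered closely together.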

Currently, all shape handling and data selection is done in Python, so indexing is very slow when the selection consists of too many data segments (i.e. too many non-contiguous parts). As a suggestion, leave the last dimension out of the index and select along it with NumPy afterwards.
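The suggestion can be sketched as follows. A plain NumPy array stands in for the direct_mmap object here (an assumption made so the example runs without the package or a data file); with the real object, only the first indexing step would touch the disk. A strided column selection is used because it would presumably split every row into many small pieces if it were passed in the index directly.

```python
import numpy as np

# Stand-in for direct_mmap(path, (10000, 200), 'uint64', offset=0).
source = np.arange(10000 * 200, dtype='uint64').reshape(10000, 200)

# Step 1: index only the leading dimension, so each row is one contiguous segment.
rows = source[10:1000:10]

# Step 2: select along the last dimension with NumPy, in memory.
a = rows[:, 0:200:2]

print(rows.shape, a.shape)
```

The trade-off is that whole rows are read, including columns that are later discarded, in exchange for far fewer segments to handle in Python.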

Performance

I ran a benchmark comparing direct-mmap with np.memmap. direct-mmap was about 2x to 9x faster in the cases below.

Each task is as follows:

mmap = direct_mmap(file, (170640000, 40), 'uint64', offset=64) # direct_mmap
mmap = np.memmap(file, shape=(170640000, 40), dtype='uint64', offset=64) # np.memmap
mmap[id:id + 300000:1000]

The file is about 54 GB. There are three test cases: 1000 reads in a single thread, 1000 reads in 64 threads, and 10000 reads in 64 threads.
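The access pattern of one task can be worked out from the shape, dtype, and slice given above. The short calculation below uses only those numbers; the variable names are just for illustration.

```python
import numpy as np

n_rows, n_cols = 170_640_000, 40
itemsize = np.dtype('uint64').itemsize        # 8 bytes per element
offset = 64                                   # byte offset passed to both constructors

file_bytes = offset + n_rows * n_cols * itemsize
row_bytes = n_cols * itemsize                 # one row is one contiguous segment
rows_per_task = len(range(0, 300_000, 1000))  # rows selected by id:id + 300000:1000
task_bytes = rows_per_task * row_bytes        # useful data per task
stride_bytes = 1000 * row_bytes               # distance between consecutive segments

print(file_bytes, row_bytes, rows_per_task, task_bytes, stride_bytes)
```

So each task wants 300 separate 320-byte rows (96,000 bytes in total) spaced 320,000 bytes apart, which is the kind of sparse access where sequential prefetching reads far more than is needed.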

Results:

Test case	np.memmap	direct_mmap
1000 tasks, single thread	90.38 s	36 s
1000 tasks, 64 threads	9.24 s	0.98 s
10000 tasks, 64 threads	20.29 s	9.4 s

The environment is as follows:

Disk: Samsung PM983, 3.84 TB, PCIe, 4K 64 threads random read about 500K IOPS, single thread sequential read 910 MB/s
System: Ubuntu 22.04 host, Ubuntu 20.04 in Docker, Python 3.9.18, 64 GB RAM
Connection: PCIe 3.0 x4, 4 GB/s
Drop cache: "sudo sync; sudo sysctl -w vm.drop_caches=3", before np.memmap testing

The benchmark code is at the end of direct_mmap/main.py.
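For readers who want to run a similar measurement on their own data, here is a rough sketch of a threaded timing harness. It is not the code from direct_mmap/main.py: `run_benchmark` is a hypothetical function, choosing the start rows at random is an assumption, and so is sharing one array object across all threads. A small in-memory NumPy array is used so the example runs anywhere; to benchmark for real, pass a direct_mmap or np.memmap object instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def run_benchmark(arr, n_rows, n_tasks, n_threads, span=300_000, step=1000, seed=0):
    """Time `n_tasks` strided reads arr[s:s + span:step] across `n_threads` threads.

    Returns (elapsed_seconds, rows_read_per_task).
    """
    rng = np.random.default_rng(seed)
    # Assumed: start rows are drawn uniformly so that every slice fits in the array.
    starts = rng.integers(0, n_rows - span + 1, size=n_tasks)

    def task(start):
        start = int(start)
        return np.asarray(arr[start:start + span:step]).shape[0]

    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        counts = list(pool.map(task, starts))
    return time.perf_counter() - t0, counts


# Small in-memory stand-in; the timing here says nothing about disk performance.
arr = np.zeros((5000, 4), dtype='uint64')
elapsed, counts = run_benchmark(arr, 5000, n_tasks=20, n_threads=4, span=3000, step=10)
print(f"{len(counts)} tasks in {elapsed:.4f} s")
```

When timing np.memmap on a real file, remember to drop the page cache first, as listed in the environment above, or repeated runs will be served from memory.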

Project details

Release history

1.0.1: Mar 31, 2024 (this version)
1.0.0: Mar 31, 2024
0.9.1: Mar 31, 2024
0.9.0: Mar 31, 2024

Download files

Source Distribution

direct_mmap-1.0.1.tar.gz (18.8 kB)

Uploaded Mar 31, 2024 Source

File details

Details for the file direct_mmap-1.0.1.tar.gz.

File metadata

Download URL: direct_mmap-1.0.1.tar.gz
Upload date: Mar 31, 2024
Size: 18.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.31.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.12

File hashes

Hashes for direct_mmap-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`157660467c8bf37dfe27dcc70b9d9633e05043262962effae686fc9678ca9a1f`
MD5	`2f12eb75cdc1be58765774a7a4cbe5d9`
BLAKE2b-256	`3c680a674bae4c8f08f053ace860d647e7ae264db0f17e2814ab9e7f7fbb9e4a`
