mmap.ninja: Memory mapped data structures

mmap.ninja

Install with:

pip install mmap_ninja

Microlib docs can be found here.

Contents

  1. Quick example
  2. What is it?
  3. When to use it?
  4. When not to use it?
  5. How it works
  6. API guide
  7. FAQ
  8. I want to contribute

Quick example

import numpy as np
import matplotlib.image as mpimg
from tqdm import tqdm
from pathlib import Path
from mmap_ninja.ragged import RaggedMmap

coco_path = Path('<PATH TO IMAGE DATASET>')

# Once per project, convert the images to a memory map
RaggedMmap.from_generator(
    # Directory in which the memory map will be persisted
    out_dir='images_mmap',
    # Something that yields np.ndarray
    sample_generator=map(mpimg.imread, coco_path.iterdir()),
    # Maximum number of samples to keep in memory before flushing to disk
    batch_size=1024,
    # Show/hide progress bar
    verbose=True
)

# Open the memory map
images_mmap = RaggedMmap('images_mmap')

# This iteration takes 0.2s on COCO val 2017
# This iteration takes 35s without memory-mapping
for i in tqdm(range(len(images_mmap))):
  img: np.ndarray = images_mmap[i]

Back to Contents

What is it?

Accelerate the iteration over your machine learning dataset by up to 20 times!

mmap_ninja is a library for storing your datasets in memory-mapped files, which leads to a dramatic speedup in the training time.

The only dependencies are numpy and tqdm.

You can use mmap_ninja with any training framework (such as TensorFlow, PyTorch, or MXNet), as it stores your dataset as a memory-mapped numpy array.

A memory-mapped file is a file on disk whose contents are mapped directly into the application's address space, so the application can treat the mapped portions as if they were primary memory, allowing very fast I/O!

When working on a machine learning project, one of the most time-consuming parts is the model's training. However, a large portion of the training time actually consists of just iterating over your dataset and filesystem I/O!

This library, mmap_ninja, provides a high-level, easy-to-use, well-tested API for using memory maps for your datasets, reducing the time needed for training.

Memory maps usually take a bit more disk space, though, so if you are willing to trade some disk space for fast filesystem-to-memory I/O, this is your library!
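To see the underlying idea in isolation, here is a minimal sketch using plain NumPy's np.memmap, the low-level primitive that memory-mapped datasets build on (the file name is illustrative):

```python
import numpy as np

# Create a memory-mapped array backed by a file on disk.
arr = np.memmap('example.dat', dtype=np.float32, mode='w+', shape=(1000,))
arr[:] = np.arange(1000, dtype=np.float32)
arr.flush()   # Persist the written pages to disk
del arr

# Reopening maps the file back into the address space without copying it;
# only the pages you actually touch are read from disk.
reopened = np.memmap('example.dat', dtype=np.float32, mode='r', shape=(1000,))
print(reopened[42])  # 42.0
```

This is why random access is so cheap: indexing the map is a memory read, with the operating system paging data in on demand.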

Back to Contents

When to use it?

Use it whenever you want to store a sequence of np.ndarrays (of varying shapes) that you are going to read from at random positions very often.

mmap_ninja can work with any type of data that can be stored as a np.ndarray, as the memory map is initialized with a generator that yields samples.

In the table below, you can see concrete examples, but beware that these are just examples: mmap_ninja has no logic specific to images, videos, or any other modality.

It just stores np.ndarray and it is up to you to decide what this array represents.
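The batch_size parameter exists because the generator is consumed incrementally: samples are buffered in memory and flushed to disk once the buffer fills. A rough, framework-free sketch of that buffering pattern (not mmap_ninja's actual code):

```python
import numpy as np

def consume_in_batches(sample_generator, batch_size):
    """Buffer samples from a generator and yield them one flush at a time."""
    buffer = []
    for sample in sample_generator:
        buffer.append(sample)
        if len(buffer) == batch_size:
            yield buffer   # In mmap_ninja, this is where a flush to disk would happen
            buffer = []
    if buffer:             # Flush the final, possibly smaller batch
        yield buffer

# 10 samples with batch_size=4 -> flushes of size 4, 4 and 2
gen = (np.full((2, 2), i) for i in range(10))
sizes = [len(batch) for batch in consume_in_batches(gen, batch_size=4)]
print(sizes)  # [4, 4, 2]
```

At no point do more than batch_size samples sit in memory, which is what lets you convert datasets far larger than RAM.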

Use case  Benchmark      Class/Module
Image     COCO 2017      from mmap_ninja.ragged import RaggedMmap
Text      20 newsgroups  from mmap_ninja.string import StringsMmap
Video     Coming soon!   from mmap_ninja.ragged import RaggedMmap

Back to Contents

Memory mapping images with different shapes

You can create a new RaggedMmap from one of its class methods: RaggedMmap.from_lists, RaggedMmap.from_generator.

Create a memory map from a generator, flushing to disk every 1024 images (so that you don't have to keep them all in memory at once):

import matplotlib.pyplot as plt
from mmap_ninja.ragged import RaggedMmap
from pathlib import Path

coco_path = Path('<PATH TO IMAGE DATASET>')
val_images = RaggedMmap.from_generator(
    out_dir='val_images', 
    sample_generator=map(plt.imread, coco_path.iterdir()), 
    batch_size=1024, 
    verbose=True
)

Once created, you can open the map by simply supplying the path to the memory map:

from mmap_ninja.ragged import RaggedMmap

val_images = RaggedMmap('val_images')
print(val_images[3]) # Prints the ndarray image, e.g. with shape (387, 640, 3)

You can also extend an already existing memory map easily by using the .extend method.

The table below shows the time needed for the initial load, one iteration over the COCO 2017 validation dataset, and the memory and disk usage of each method.

Method            Initial load (s)  Time for iteration (s)  Memory usage (GB)  Disk usage (GB)
in_memory         1.356077          0.000403                3.818741           3.819034
ragged_mmap       0.002054          0.057858                0.001144           3.819114
imread_from_disk  0.000000          22.208385               0.001144           0.758753

You can see that once created, the RaggedMmap is 383 times faster for iterating over the dataset. It does require 4 times more disk space though, so if you are willing to trade 4 times more disk space for 383 times speedup (and less memory usage), you definitely should use the RaggedMmap!

This makes the RaggedMmap a fantastic choice for your computer vision, image-based machine learning datasets!

Memory mapping text documents

You can create a new StringsMmap from one of its class methods: StringsMmap.from_strings, StringsMmap.from_generator. Once it's created, you can open it by just supplying the path to the memory map.

An example of creating a memory map for sklearn's 20newsgroups dataset:

from mmap_ninja.string import StringsMmap
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups()
memmap = StringsMmap.from_strings('20newsgroups', data['data'], verbose=True)

Opening an already existing StringsMmap:

from mmap_ninja.string import StringsMmap

texts = StringsMmap('20newsgroups')
print(texts[123])  # Prints the text at index 123

You can also extend an already existing memory map easily by using the .extend method.

The table below shows the time needed for the initial load, 100 iterations over sklearn's 20newsgroups dataset, and the memory and disk usage of each method.

Method          Initial load (s)  Time for iteration (s)  Memory usage (MB)  Disk usage (MB)
in_memory       0.174626          0.068995                0.09               45
ragged_mmap     0.003701          2.052659                0.07               22
read_from_disk  0.000000          13.996738               0.07               45

You can see that once created, the StringsMmap is nearly 7 times faster than reading .txt files from disk one by one. Moreover, it takes 2 times less disk space (this is true only for StringsMmap; in general, for other types the memory map takes more disk space). This makes the StringsMmap a fantastic choice for your NLP, text-based machine learning datasets!
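One plausible way to lay out variable-length strings in a single flat file (a sketch of the general technique, not necessarily StringsMmap's exact on-disk format) is to concatenate the UTF-8 bytes and keep a separate offsets array:

```python
texts = ['hello', 'memory', 'map']

# Flat byte buffer plus start offsets -> O(1) random access by index.
encoded = [t.encode('utf-8') for t in texts]
buffer = b''.join(encoded)
offsets = [0]
for chunk in encoded:
    offsets.append(offsets[-1] + len(chunk))

def get(i):
    # Slice the i-th string out of the flat buffer and decode it.
    return buffer[offsets[i]:offsets[i + 1]].decode('utf-8')

print(get(1))  # 'memory'
```

With the buffer memory-mapped instead of held in RAM, each lookup touches only the bytes of the requested string.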

When not to use it?

mmap_ninja frequently takes more disk space than traditional approaches; for JPEG images, for example, it takes 4 times more disk space.

For this reason, do not use mmap_ninja in the following cases:

  • You are low on disk space
  • You want to send the data over a network - use a compressed format instead

There are other cases in which mmap_ninja is not a good choice:

  • When you want to concurrently append to the memory map (use a queue like RabbitMQ and append from a subscriber instead)
  • If you want to frequently delete samples from the memory map - this will require a new copy of the whole object

and so on.

Back to Contents

How it works

Coming soon
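Until the official write-up lands, here is a sketch of how ragged memory maps are commonly implemented (an educated guess at the general technique, not mmap_ninja's documented internals): all samples are flattened into one contiguous file-backed buffer, with per-sample offsets and shapes stored alongside, so indexing is a cheap slice of the mapped buffer:

```python
import numpy as np

samples = [np.ones((2, 3)), np.zeros((4,)), np.full((3, 2), 7.0)]

# Flatten everything into one contiguous file-backed buffer.
flat = np.concatenate([s.ravel() for s in samples])
data = np.memmap('ragged.dat', dtype=flat.dtype, mode='w+', shape=flat.shape)
data[:] = flat
data.flush()

# Offsets and shapes are the only metadata needed for random access.
shapes = [s.shape for s in samples]
sizes = np.array([s.size for s in samples])
ends = np.cumsum(sizes)
starts = ends - sizes

def get(i):
    # Slicing a memmap reads only the touched pages, not the whole file.
    return np.asarray(data[starts[i]:ends[i]]).reshape(shapes[i])

print(get(2))  # 3x2 array filled with 7.0
```

This layout explains both the speed (each access is one contiguous read) and the extra disk usage (samples are stored uncompressed).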

Back to Contents

API guide

Read the docs

Back to Contents

FAQ

Q: Can I use it with TensorFlow?

A: Of course. You can use it with any framework that can work with numpy arrays. Here's an end-to-end example.
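Because the memory map behaves like a sequence of numpy arrays, plugging it into a framework is usually just a thin adapter. A PyTorch-style dataset, for example, needs only __len__ and __getitem__ (a sketch of the adapter shape; torch itself is not required to illustrate it):

```python
class MmapDataset:
    """Duck-typed dataset: any framework that indexes and asks for length works."""

    def __init__(self, mmap):
        self.mmap = mmap  # e.g. a RaggedMmap, or anything sequence-like

    def __len__(self):
        return len(self.mmap)

    def __getitem__(self, idx):
        return self.mmap[idx]  # Returns the stored sample; frameworks convert from there

# Any sequence-like store works; a plain list stands in for the memory map here.
ds = MmapDataset([[1, 2], [3, 4, 5]])
print(len(ds), ds[1])  # 2 [3, 4, 5]
```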

Back to Contents

I want to contribute

Coming soon!

Back to Contents
