The AcumenIndexer

Project description

Acumen 👉🏻 Indexer 👈🏻

Coded with love and coffee ☕ by Adrian Cosma. But I need more coffee!


Description

AcumenIndexer helps organize large datasets made of many small instances into a common, highly efficient format that supports random access, storing the binary data chunks either in RAM or on the HDD.

But why?

Currently, storing and accessing data is often done inefficiently, especially by beginner data scientists, with each practitioner having their own way of doing things. It is not always possible to store the whole dataset in RAM, so a usual approach is to split each training instance into a separate file. Datasets comprised of many images or other small files are very difficult to handle in practice (e.g., transferring the dataset over ssh or zipping it takes a long time). Moreover, many files in a single folder can cause performance issues on certain filesystems and even lead to crashes.

But how?

A simple way to overcome the problem of big datasets with many small instances is to keep only the metadata and the index in RAM, and use a random-access mechanism for the big binary chunks of data on disk.
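The idea can be illustrated with a toy offset index (not the library's actual on-disk format): pack small records into one chunk file, and keep only `(offset, length)` pairs in RAM.

```python
import os
import tempfile

# Toy example: pack several small records into one chunk file and
# keep only their byte offsets in RAM.
records = [b"first record", b"second", b"third record here"]

chunk_path = os.path.join(tempfile.mkdtemp(), "chunk_0.bin")
index = []  # (offset, length) pairs -- this is all that stays in RAM

with open(chunk_path, "wb") as f:
    for rec in records:
        index.append((f.tell(), len(rec)))
        f.write(rec)

# Random access: jump straight to the third record without reading the rest.
offset, length = index[2]
with open(chunk_path, "rb") as f:
    f.seek(offset)
    third = f.read(length)
```

However large the chunk file grows, the in-RAM index stays tiny: two integers per instance.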

Say what?

We make use of Python's native I/O operations, f.seek() and f.read(), to read and write large binary chunk files. We build a custom index based on byte offsets to access any training instance in O(1). If enough memory is available, chunks can be mmap()-ed into RAM to speed up I/O operations.
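The mmap() variant can be sketched like this (illustrative only, with hard-coded 4-byte records standing in for the real index): once the chunk file is memory-mapped, reading an instance is just a slice of the map.

```python
import mmap
import os
import tempfile

# Write a chunk file containing three fixed-size 4-byte records.
path = os.path.join(tempfile.mkdtemp(), "chunk.bin")
with open(path, "wb") as f:
    f.write(b"aaaa" + b"bbbb" + b"cccc")

# Memory-map the chunk: the OS pages it into RAM on demand,
# and slicing the map replaces explicit seek()/read() calls.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

second = mm[4:8]  # record i lives at [i*4, (i+1)*4)
mm.close()
```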

Installation

Install the PyPI package via pip:

pip install -U acumenindexer

Alternatively, install directly via git:

pip install -U git+https://github.com/cosmaadrian/acumen-indexer

Usage

Building an index

import os
import cv2
import numpy as np
import acumenindexer as ai

def data_read_fn(path):
    # read image from file
    image = cv2.imread(path) # or something like this

    # must return (data: numpy.ndarray, metadata: dict)
    return image, {'path': path}

file_names = [os.path.join('images', f) for f in os.listdir('images')]

ai.split_into_chunks(
    data_list = file_names,
    read_fn = data_read_fn,
    output_path = 'my_data',
    chunk_size_bytes = 5 * 1024 * 1024, # 5 MB
    use_gzip = False,
    dtype = np.float16,
    n_jobs = 1,
)

Reading from index

import numpy as np
import acumenindexer as ai

the_index = ai.load_index('index.csv') # just a pd.DataFrame

# in_memory = False reads directly from chunk in O(1) using f.seek()
# in_memory = True uses mmap to map the data into RAM
read_fn = ai.read_from_index(the_index, dtype = np.float16, in_memory = True, use_gzip = False)

for i in range(10):
    data = read_fn(i)
    print(data) # contains both metadata and actual binary data
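Turning raw chunk bytes back into an array with the right dtype comes down to np.frombuffer plus a shape; a hedged sketch, assuming the index stores each instance's dtype and shape as metadata:

```python
import numpy as np

# Assumption: the index records each instance's dtype and shape, so the
# raw bytes read from a chunk can be reinterpreted without copying.
original = np.arange(12, dtype=np.float16).reshape(3, 4)
raw = original.tobytes()  # what would live inside a chunk

restored = np.frombuffer(raw, dtype=np.float16).reshape(3, 4)
```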

Use with PyTorch Datasets

from torch.utils.data import Dataset
import numpy as np
import acumenindexer as ai

class CustomDataset(Dataset):
    def __init__(self, index_path):
        self.index = ai.load_index(index_path)
        self.read_fn = ai.read_from_index(self.index, dtype = np.float16, in_memory = True, use_gzip = False)

    def __len__(self):
        return len(self.index)

    def __getitem__(self, idx):
        data = self.read_fn(idx)
        return data

License

This repository is released under the MIT License.

Download files

Download the file for your platform.

Source Distribution

acumenindexer-0.0.1.tar.gz (6.4 kB)

Uploaded Source

Built Distribution


acumenindexer-0.0.1-py3-none-any.whl (5.7 kB)

Uploaded Python 3

File details

Details for the file acumenindexer-0.0.1.tar.gz.

File metadata

  • Download URL: acumenindexer-0.0.1.tar.gz
  • Upload date:
  • Size: 6.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.14

File hashes

Hashes for acumenindexer-0.0.1.tar.gz
Algorithm Hash digest
SHA256 82a617afbabc7016269ddeef37101b636563df612182389395009914e1badc5b
MD5 d88c53cb75a3bbbbcb575d8944d74e9c
BLAKE2b-256 da7530e69642165128b19a288ee415d5547e17f47700a6d6fb80db8d403a7fb8


File details

Details for the file acumenindexer-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: acumenindexer-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 5.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.14

File hashes

Hashes for acumenindexer-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 b47c07c65b619b5b80ed6eb8a0c7958be77be0c4a8aee26b2e55a542e31e18cb
MD5 cbc8b48e642a17e9f351c32d888f36d2
BLAKE2b-256 f6ce65f931bf053cfa00f76587a679efeabb655b629ebbf83a78651e6b78bd23

