Acumen 👉🏻 Indexer 👈🏻
Coded with love and coffee ☕ by Adrian Cosma. But I need more coffee!
Description
AcumenIndexer helps organize large datasets with many small instances into a common, highly efficient format that supports random access, storing the binary data chunks either in RAM or on disk.
But why?
Currently, data storage and access is often handled inefficiently, especially by beginner data scientists, with each practitioner having their own way of doing things. It is not always possible to store the whole dataset in RAM, so a common approach is to split each training instance into a separate file. Datasets comprised of many images or other small files are very difficult to handle in practice (e.g., transferring the dataset over SSH or zipping it takes a long time). Having many files in a single folder can also cause performance issues on certain filesystems, and even crashes.
But how?
A simple way to overcome the problem of a big dataset with many small instances is to keep only the metadata and the index in RAM, and use a random-access mechanism for the big binary chunks of data on disk.
Say what?
We make use of the native Python I/O operations f.seek() and f.read() to read from and write to large binary chunk files. We build a custom index based on byte offsets to access any training instance in O(1). Chunks can be mmap()-ed into RAM, if memory is available, to speed up I/O operations.
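The byte-offset idea can be sketched with nothing but the standard library (a toy illustration of the technique, not the package's internals):

```python
import io

# Write variable-length records back-to-back into one binary "chunk",
# keeping an (offset, length) entry per record as the index.
records = [b'first', b'second record', b'x' * 100]
chunk = io.BytesIO()
index = []
for rec in records:
    index.append((chunk.tell(), len(rec)))
    chunk.write(rec)

def read_record(i):
    # O(1) random access: jump to the stored offset and read the length.
    offset, length = index[i]
    chunk.seek(offset)
    return chunk.read(length)

print(read_record(1))  # b'second record'
```

The same pattern works with a real file opened in 'rb' mode; only the index (a few integers per instance) needs to live in RAM.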
Installation
Install the pypi package via pip:
pip install -U acumenindexer
Alternatively, install directly via git:
pip install -U git+https://github.com/cosmaadrian/acumen-indexer
Usage
Building an index
import os
import cv2  # opencv-python
import numpy as np
import acumenindexer as ai

def data_read_fn(path):
    # read image from file
    image = cv2.imread(path) # or something like this
    metadata = {'path': path} # any per-instance metadata dict
    # must return (data:numpy.ndarray, metadata:dict)
    return image, metadata

file_names = [os.path.join('images/', x) for x in os.listdir('images/')]

ai.split_into_chunks(
    data_list = file_names,
    read_fn = data_read_fn,
    output_path = 'my_data',
    chunk_size_bytes = 5 * 1024 * 1024, # 5 MB
    use_gzip = False,
    dtype = np.float16,
    n_jobs = 1,
)
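To see what chunking with a byte-offset index could look like under the hood, here is a minimal standard-library sketch (the function name and index columns are illustrative, not the package's actual on-disk format):

```python
import os
import tempfile

def split_into_chunks_sketch(blobs, output_dir, chunk_size_bytes):
    """Write binary blobs back-to-back into chunk files, rolling over to a
    new chunk once the current one reaches chunk_size_bytes, and record a
    (chunk, offset, length) row per blob for O(1) retrieval later."""
    os.makedirs(output_dir, exist_ok = True)
    index, chunk_id, f = [], -1, None
    for blob in blobs:
        if f is None or f.tell() >= chunk_size_bytes:
            if f is not None:
                f.close()
            chunk_id += 1
            f = open(os.path.join(output_dir, 'chunk_%d.bin' % chunk_id), 'wb')
        index.append({'chunk': 'chunk_%d.bin' % chunk_id,
                      'offset': f.tell(),
                      'length': len(blob)})
        f.write(blob)
    if f is not None:
        f.close()
    return index

out_dir = tempfile.mkdtemp()
rows = split_into_chunks_sketch([b'aaa', b'bbbb', b'cc'], out_dir, chunk_size_bytes = 5)
# The first two blobs fit in chunk_0 (3 + 4 bytes); the third rolls over to chunk_1.
```

A real implementation would also serialize the metadata dict and persist the index (e.g., as a CSV), but the offset bookkeeping is the core of the idea.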
Reading from index
import numpy as np
import acumenindexer as ai
the_index = ai.load_index('index.csv') # just a pd.DataFrame
# in_memory = False reads directly from chunk in O(1) using f.seek()
# in_memory = True uses mmap to map the data into RAM
read_fn = ai.read_from_index(the_index, dtype = np.float16, in_memory = True, use_gzip = False)
for i in range(10):
    data = read_fn(i)
    print(data) # contains both metadata and actual binary data
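The in_memory = True path corresponds to memory-mapping a chunk file, so records become slices instead of seek()/read() calls; a standard-library sketch (file name and layout are illustrative):

```python
import mmap
import os
import tempfile

# Write a tiny chunk file, then map it into memory.
path = os.path.join(tempfile.mkdtemp(), 'chunk_0.bin')
with open(path, 'wb') as f:
    f.write(b'hello' + b'world!')

offsets = [(0, 5), (5, 6)]  # (byte offset, length) per record
with open(path, 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access = mmap.ACCESS_READ)
    offset, length = offsets[1]
    record = mm[offset:offset + length]  # no explicit seek()/read()
    mm.close()

print(record)  # b'world!'
```

The OS pages the mapped file into RAM lazily, so repeated random access gets faster while the initial cost stays low.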
Use with PyTorch Datasets
import numpy as np
from torch.utils.data import Dataset
import acumenindexer as ai

class CustomDataset(Dataset):
    def __init__(self, index_path):
        self.index = ai.load_index(index_path)
        self.read_fn = ai.read_from_index(self.index, dtype = np.float16, in_memory = True, use_gzip = False)

    def __len__(self):
        return len(self.index)

    def __getitem__(self, idx):
        data = self.read_fn(idx)
        return data
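Once wrapped in a Dataset, batching works through the standard DataLoader. A self-contained toy (with a stand-in dataset, since the real one needs an index file on disk):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyIndexedDataset(Dataset):
    # Stand-in for CustomDataset: returns fixed-size tensors so the
    # default collate function can stack them into batches.
    def __len__(self):
        return 100

    def __getitem__(self, idx):
        return torch.full((4,), float(idx))

loader = DataLoader(ToyIndexedDataset(), batch_size = 10, shuffle = False)
first_batch = next(iter(loader))
print(first_batch.shape)  # torch.Size([10, 4])
```

If __getitem__ returns a (data, metadata) tuple, pass a custom collate_fn to DataLoader to decide how the metadata dicts should be batched.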
License
This repository is licensed under the MIT License.