A small library for taking the transpose of arbitrarily large .csvs
torchcsv
A PyTorch Dataset subclass for handling numerical data too large to fit in local memory.
Installation
To install, run pip install torchcsv.
Usage
The CSVDataset class inherits from torch.utils.data.Dataset, as custom Dataset classes usually do. However, rather than reading the entire data and label .csv files into memory, it makes two assumptions:
- The dataset is too large to fit in local memory.
- The labels are contained in a separate file. If this isn't the case, consider using Dask to obtain the column of interest, and then continue.
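If the labels sit in the same file as the features, one way to split them out without loading the whole file is to stream it row by row. The sketch below uses Python's standard csv module rather than Dask, purely as an illustration; the function name extract_column and the file paths in the usage note are hypothetical.

```python
import csv


def extract_column(src_path, dst_path, column):
    """Stream one named column of src_path into its own CSV at dst_path.

    Only one row is held in memory at a time.
    Returns the number of data rows written.
    """
    rows_written = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        if column not in (reader.fieldnames or []):
            raise KeyError(f"column {column!r} not found in {src_path}")
        writer = csv.writer(dst)
        # Keep the header so the column can still be looked up by name.
        writer.writerow([column])
        for row in reader:
            writer.writerow([row[column]])
            rows_written += 1
    return rows_written
```

For instance, extract_column('path/to/combined.csv', 'path/to/labelfile.csv', 'Animal Type') would write a one-column label file. The label column still remains in the source file; whether it also has to be dropped from the data file depends on how CSVDataset parses rows, which this sketch does not assume.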
So, we initialize the CSVDataset object as follows:
from torchcsv import CSVDataset

data = CSVDataset(
    datafile='path/to/datafile.csv',
    labelfile='path/to/labelfile.csv',
    target_label='Animal Type',  # Column name containing targets in labelfile.csv
    # indices=idx_list  # Optionally, pass a list of purely numeric indices to use instead of all indices in the labelfile
)
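The optional indices argument suggests a simple way to carve training and validation subsets out of one pair of files. The sketch below builds two disjoint index lists from a row count. split_indices is a hypothetical helper, and it assumes that indices are zero-based row positions in the label file, which the description above does not state.

```python
import random


def split_indices(n_rows, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx): disjoint, sorted lists covering range(n_rows)."""
    if not 0.0 <= val_fraction <= 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_val = int(n_rows * val_fraction)
    return sorted(indices[n_val:]), sorted(indices[:n_val])


train_idx, val_idx = split_indices(10, val_fraction=0.2)

# Each list could then be passed as the `indices` argument, for example:
# train = CSVDataset(datafile='path/to/datafile.csv',
#                    labelfile='path/to/labelfile.csv',
#                    target_label='Animal Type',
#                    indices=train_idx)
```

Fixing the seed keeps the two subsets stable across runs, so a validation row never leaks into training when the script is restarted.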
For example, fetching a single sample with roughly 16.3k features takes about 6 ms:
%%time
data[1]
CPU times: user 5.99 ms, sys: 576 µs, total: 6.56 ms
Wall time: 6.19 ms
(tensor([0., 0., 0., ..., 0., 0., 0.]), 16)
Now, we can use this like a regular PyTorch Dataset, but without having to worry about memory issues!
For example,
from torch.utils.data import DataLoader
loader = DataLoader(data, batch_size=4, num_workers=0)
gives us
%%time
next(iter(loader))
CPU times: user 25.6 ms, sys: 20.8 ms, total: 46.4 ms
Wall time: 76.9 ms
[tensor([[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 1.2663, ..., 0.0000, 0.0000, 0.0000]]),
tensor([16, 16, 4, 4])]
So loading a minibatch of size 4 takes about 77 ms of wall time in this run. The CSVDataset class should scale to larger files, and it keeps in memory what it can via the linecache library.
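To illustrate the idea of fetching one row on demand, here is a minimal sketch built on linecache.getline from the standard library. It is not torchcsv's actual implementation: the class name LineCacheRows is hypothetical, it returns plain Python lists instead of tensors, and it assumes a single header line followed by purely numeric, comma-separated rows.

```python
import linecache


class LineCacheRows:
    """Index-addressable view over the numeric rows of a CSV file."""

    def __init__(self, path, n_header_lines=1):
        self.path = path
        self.n_header_lines = n_header_lines
        # One pass to count the data rows; the lines themselves are not stored here.
        with open(path) as f:
            total_lines = sum(1 for _ in f)
        self._n_rows = max(total_lines - n_header_lines, 0)

    def __len__(self):
        return self._n_rows

    def __getitem__(self, idx):
        if not 0 <= idx < self._n_rows:
            raise IndexError(idx)
        # linecache line numbers start at 1, so skip the header lines and add 1.
        line = linecache.getline(self.path, idx + self.n_header_lines + 1)
        return [float(value) for value in line.strip().split(",")]
```

A torch.utils.data.Dataset subclass could wrap the same lookup and convert the list to a tensor inside __getitem__, pairing it with the matching row of the label file.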