A small library for taking the transpose of arbitrarily large .csvs

Project description

torchcsv

A PyTorch Dataset subclass for handling numerical data too large to fit in local memory.

Installation

To install, run pip install torchcsv.

Usage

The CSVDataset class inherits from torch.utils.data.Dataset, as custom Dataset classes usually do. However, it does not read the entire data and label .csv files into memory. Instead, it is built around two assumptions:

  1. The dataset is too large to fit in local memory
  2. The labels are contained in a separate file. If this isn't the case, consider using Dask to obtain the column of interest, and then continue.
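
If Dask is not available, the same split can be done with the standard library alone. The following is a minimal sketch, not part of torchcsv: the function name extract_label_column and all file paths are hypothetical, and it assumes the combined file has a header row and that the data file should keep its header. It streams the file one row at a time, so the full dataset never has to fit in memory.

```python
import csv


def extract_label_column(combined_path, data_path, label_path, label_column):
    """Stream a combined CSV, writing features and labels to separate files.

    Hypothetical helper for illustration; returns the number of data rows written.
    """
    with open(combined_path, newline="") as src, \
            open(data_path, "w", newline="") as data_out, \
            open(label_path, "w", newline="") as label_out:
        reader = csv.reader(src)
        data_writer = csv.writer(data_out)
        label_writer = csv.writer(label_out)

        header = next(reader)
        # Raises ValueError if the label column is not in the header.
        label_pos = header.index(label_column)

        data_writer.writerow(header[:label_pos] + header[label_pos + 1:])
        label_writer.writerow([label_column])

        n_rows = 0
        for row in reader:
            # Only one row is held in memory at a time.
            label_writer.writerow([row[label_pos]])
            data_writer.writerow(row[:label_pos] + row[label_pos + 1:])
            n_rows += 1
    return n_rows
```

The two output files can then be passed as datafile and labelfile, with label_column as target_label.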

So, we initialize the CSVDataset object as

from torchcsv import CSVDataset 

data = CSVDataset(
    datafile='path/to/datafile.csv',
    labelfile='path/to/labelfile.csv',
    target_label='Animal Type', # Column name containing targets in labelfile.csv
    # indices=idx_list # Optionally, pass a list of purely numeric indices to use instead of the entire indices of the labelfile 
)
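
One possible use of the optional indices argument is a train/test split over row indices. The sketch below only builds the index lists; split_indices is a hypothetical helper, and it assumes indices are zero-based row positions in the label file, which should be checked against the library before relying on it.

```python
import random


def split_indices(n_rows, test_fraction=0.2, seed=0):
    """Return (train, test) lists of row indices that together cover range(n_rows)."""
    indices = list(range(n_rows))
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


train_idx, test_idx = split_indices(n_rows=10)

# Each list could then be passed to its own dataset, for example:
# train = CSVDataset(datafile='path/to/datafile.csv',
#                    labelfile='path/to/labelfile.csv',
#                    target_label='Animal Type',
#                    indices=train_idx)
```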

For example, fetching a single 16.3k-dimensional sample takes about 6 ms:

> %%time
> data[1]
CPU times: user 5.99 ms, sys: 576 µs, total: 6.56 ms
Wall time: 6.19 ms
(tensor([0., 0., 0.,  ..., 0., 0., 0.]), 16)

Now, we can use this like a regular PyTorch Dataset, but without having to worry about memory issues!

For example,

from torch.utils.data import DataLoader

loader = DataLoader(data, batch_size=4, num_workers=0)

gives

%%time
next(iter(loader))

CPU times: user 25.6 ms, sys: 20.8 ms, total: 46.4 ms
Wall time: 76.9 ms

[tensor([[0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 1.2663,  ..., 0.0000, 0.0000, 0.0000]]),
 tensor([16, 16,  4,  4])]

So loading a minibatch of size 4 takes about 77 ms of wall time in this run. The CSVDataset class is intended to scale to large files, and keeps in memory what it can via the standard-library linecache module.
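
To illustrate the general idea of row lookup through linecache, here is a minimal sketch. It is not the torchcsv implementation: the class name LineCachedRows is hypothetical, it returns plain Python lists rather than tensors, and it assumes purely numeric, comma-separated rows with no quoted fields.

```python
import linecache


class LineCachedRows:
    """Hypothetical helper: random access to numeric CSV rows via linecache."""

    def __init__(self, path, has_header=True):
        self.path = path
        # linecache line numbers start at 1; skip one more line for a header.
        self.offset = 2 if has_header else 1
        # Count lines once by streaming through the file.
        with open(path) as f:
            self.n_rows = sum(1 for _ in f) - (1 if has_header else 0)

    def __len__(self):
        return self.n_rows

    def __getitem__(self, idx):
        if not 0 <= idx < self.n_rows:
            raise IndexError(idx)
        line = linecache.getline(self.path, idx + self.offset)
        # Naive split: assumes no quoted or empty fields.
        return [float(x) for x in line.strip().split(",")]
```

A class with __len__ and __getitem__ like this one is the shape that torch.utils.data.Dataset subclasses take. linecache caches the lines of a file after the first lookup, so repeated access to the same file does not re-read it from disk.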
