Composable data loading modules for PyTorch

TorchData

Why TorchData? | Install guide | What are DataPipes? | Beta Usage and Feedback | Contributing | Future Plans

This library is currently in the Beta stage and new features are under active development. The API may change based on user feedback or performance. We are committed to bringing this library to a stable release, but a few future changes may not be completely backward compatible. If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like to be covered, please open a GitHub issue. We'd love to hear thoughts and feedback.

torchdata is a library of common modular data loading primitives for easily constructing flexible and performant data pipelines.

This library introduces composable Iterable-style and Map-style building blocks called DataPipes that work well out of the box with PyTorch's DataLoader. These built-in DataPipes have the necessary functionalities to reproduce many different datasets in TorchVision and TorchText, namely loading files (from local or cloud), parsing, caching, transforming, filtering, and many more utilities. To understand the basic structure of DataPipes, please see What are DataPipes? below, and to see how DataPipes can be practically composed together into datasets, please see our examples.

On top of DataPipes, this library provides a new DataLoader2 that allows the execution of these data pipelines in various settings and execution backends (ReadingService). You can learn more about the new version of DataLoader2 in our full DataLoader2 documentation. Additional features, such as checkpointing and advanced control of randomness and determinism, are work in progress.

Note that because many features of the original DataLoader have been modularized into DataPipes, their source code lives as standard DataPipes in pytorch/pytorch rather than torchdata to preserve backward-compatibility support and functional parity within torch. Regardless, you can access them by importing them from torchdata.

Why composable data loading?

Over many years of feedback and organic community usage of the PyTorch DataLoader and Dataset, we've found that:

  1. The original DataLoader bundled too many features together, making them difficult to extend, manipulate, or replace. This has created a proliferation of use-case specific DataLoader variants in the community rather than an ecosystem of interoperable elements.
  2. Many libraries, including each of the PyTorch domain libraries, have rewritten the same data loading utilities over and over again. We can save OSS maintainers time and effort rewriting, debugging, and maintaining these commonly used elements.

These reasons inspired the creation of DataPipe and DataLoader2, with a goal to make data loading components more flexible and reusable.

Installation

Version Compatibility

The following table lists the torchdata version that corresponds to each torch version, along with the supported Python versions.

torch             torchdata       python
master / nightly  main / nightly  >=3.8, <=3.10
1.13.1            0.5.1           >=3.7, <=3.10
1.12.1            0.4.1           >=3.7, <=3.10
1.12.0            0.4.0           >=3.7, <=3.10
1.11.0            0.3.0           >=3.7, <=3.10

Colab

Follow the instructions in this Colab notebook. The notebook also contains a simple usage example.

Local pip or conda

First, set up an environment. We will be installing a PyTorch binary as well as torchdata. If you're using conda, create a conda environment:

conda create --name torchdata
conda activate torchdata

If you wish to use venv instead:

python -m venv torchdata-env
source torchdata-env/bin/activate

Install torchdata:

Using pip:

pip install torchdata

Using conda:

conda install -c pytorch torchdata

You can then proceed to run our examples, such as the IMDb one.

From source

pip install .

If you'd like to include the S3 IO datapipes and aws-sdk-cpp, you may also follow the instructions here.

In case building TorchData from source fails, install the nightly version of PyTorch following the linked guide on the contributing page.

From nightly

A nightly version of TorchData is also provided and is updated daily from the main branch.

Using pip:

pip install --pre torchdata --extra-index-url https://download.pytorch.org/whl/nightly/cpu

Using conda:

conda install torchdata -c pytorch-nightly

What are DataPipes?

Early on, we observed widespread confusion between PyTorch Datasets that represented reusable loading tooling (e.g. TorchVision's ImageFolder) and those that represented pre-built iterators/accessors over actual data corpora (e.g. TorchVision's ImageNet). This led to an unfortunate pattern of siloed inheritance of data tooling rather than composition.

DataPipe is simply a renaming and repurposing of the PyTorch Dataset for composed usage. A DataPipe takes in some access function over Python data structures, __iter__ for IterDataPipes and __getitem__ for MapDataPipes, and returns a new access function with a slight transformation applied. For example, take a look at this JsonParser, which accepts an IterDataPipe over file names and raw streams, and produces a new iterator over the filenames and deserialized data:

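import json

from torchdata.datapipes.iter import IterDataPipe


class JsonParserIterDataPipe(IterDataPipe):
    def __init__(self, source_datapipe, **kwargs) -> None:
        # source_datapipe yields (file_name, stream) pairs
        self.source_datapipe = source_datapipe
        # extra keyword arguments are forwarded to json.loads
        self.kwargs = kwargs

    def __iter__(self):
        for file_name, stream in self.source_datapipe:
            data = stream.read()
            yield file_name, json.loads(data, **self.kwargs)

    def __len__(self):
        # one output item per input item, so the length is unchanged
        return len(self.source_datapipe)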

You can see in this example how DataPipes can be easily chained together to compose graphs of transformations that reproduce sophisticated data pipelines, with streamed operation as a first-class citizen.
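
To make the chaining concrete, here is a minimal sketch that runs without torchdata installed. The base class IterPipe and the stages InMemoryFiles and Filter are hypothetical stand-ins written for this illustration; they are not torchdata classes. Only the JsonParser stage mirrors the logic shown above.

```python
import io
import json


class IterPipe:
    """Hypothetical stand-in for IterDataPipe (not the torchdata class)."""

    def __iter__(self):
        raise NotImplementedError


class InMemoryFiles(IterPipe):
    """Yields (file_name, stream) pairs from a dict of name -> text."""

    def __init__(self, files):
        self.files = files

    def __iter__(self):
        for name, text in self.files.items():
            yield name, io.StringIO(text)


class JsonParser(IterPipe):
    """Same logic as the JsonParser above: read each stream and deserialize it."""

    def __init__(self, source, **kwargs):
        self.source = source
        self.kwargs = kwargs

    def __iter__(self):
        for name, stream in self.source:
            yield name, json.loads(stream.read(), **self.kwargs)


class Filter(IterPipe):
    """Keeps only the items for which predicate(item) is true."""

    def __init__(self, source, predicate):
        self.source = source
        self.predicate = predicate

    def __iter__(self):
        for item in self.source:
            if self.predicate(item):
                yield item


files = {"a.json": '{"label": 1}', "b.json": '{"label": 0}', "c.json": '{"label": 1}'}

# Each stage wraps the previous one; nothing is read until iteration starts.
pipeline = Filter(JsonParser(InMemoryFiles(files)), lambda pair: pair[1]["label"] == 1)
result = list(pipeline)
print(result)
```

Because every stage only consumes the iterator of the stage before it, any stage can be swapped out or reordered without touching the others.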

Under this naming convention, Dataset simply refers to a graph of DataPipes, and a dataset module like ImageNet can be rebuilt as a factory function returning the requisite composed DataPipes. Note that the vast majority of built-in features are implemented as IterDataPipes. We encourage using the built-in IterDataPipes as much as possible and converting them to MapDataPipes only when necessary.
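
The two ideas in this paragraph, a dataset as a factory function and converting from iterable-style to map-style only when needed, can be sketched as follows. The names Squares, IterToMap, and build_dataset are hypothetical and written only for this illustration; the converter that ships with the library may behave differently.

```python
class Squares:
    """Iterable-style stage yielding (key, value) pairs lazily."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        for i in range(self.n):
            yield i, i * i


class IterToMap:
    """Hypothetical map-style wrapper: materializes (key, value) pairs into a dict."""

    def __init__(self, source):
        # Consumes the whole source up front, which is why conversion
        # is best deferred until random access is actually required.
        self._data = dict(source)

    def __getitem__(self, key):
        return self._data[key]

    def __len__(self):
        return len(self._data)


def build_dataset(n):
    """Factory function: returns a composed pipeline instead of a Dataset subclass."""
    return Squares(n)


iter_pipe = build_dataset(5)
map_pipe = IterToMap(iter_pipe)
print(map_pipe[3], len(map_pipe))
```

The iterable-style pipeline streams items one at a time, while the map-style wrapper has to hold every item in memory to support lookup by key.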

DataLoader2

A new, lightweight DataLoader2 is introduced to decouple the overloaded data-manipulation functionalities from torch.utils.data.DataLoader and move them into DataPipe operations. In addition, certain features can only be achieved with DataLoader2, such as checkpointing/snapshotting and switching backend services to perform high-performance operations.

Please read the full documentation here.

Tutorial

A tutorial of this library is available here on the documentation site. It covers four topics: using DataPipes, working with DataLoader, implementing DataPipes, and working with Cloud Storage Providers.

There is also a tutorial available on how to work with the new DataLoader2.

Usage Examples

We provide a simple usage example in this Colab notebook. It can also be downloaded and executed locally as a Jupyter notebook.

In addition, there are several data loading implementations of popular datasets across different research domains that use DataPipes. You can find a few selected examples here.

Frequently Asked Questions (FAQ)

What should I do if the existing set of DataPipes does not do what I need?

You can implement your own custom DataPipe. If you believe your use case is common enough such that the community can benefit from having your custom DataPipe added to this library, feel free to open a GitHub issue. We will be happy to discuss!

What happens when the Shuffler DataPipe is used with DataLoader?

To enable shuffling, you need to add a Shuffler to your DataPipe pipeline. Then, by default, shuffling will happen at the point you specified, as long as you do not set shuffle=False within DataLoader.

What happens when the Batcher DataPipe is used with DataLoader?

If you choose to use Batcher while setting batch_size > 1 for DataLoader, your samples will be batched more than once. You should choose one or the other.
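
The effect of batching twice can be seen in a small pure-Python sketch. The batch helper below is a hypothetical stand-in for a batching stage, not the Batcher DataPipe or the DataLoader itself, and it skips the collation into tensors that DataLoader would normally perform.

```python
def batch(items, batch_size):
    """Group an iterable into lists of at most batch_size items."""
    current = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        # Emit the final, possibly smaller, batch.
        yield current


samples = range(8)

# Batching once gives a flat sequence of batches.
once = list(batch(samples, 2))

# Batching the already-batched stream nests the batches one level deeper.
twice = list(batch(batch(samples, 2), 2))

print(once)
print(twice)
```

In the second case each element handed to the training loop is a batch of batches, which is rarely what is intended.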

Why are there fewer built-in MapDataPipes than IterDataPipes?

By design, there are fewer MapDataPipes than IterDataPipes to avoid duplicate implementations of the same functionalities as MapDataPipe. We encourage users to use the built-in IterDataPipe for various functionalities, and convert it to MapDataPipe as needed.

How is multiprocessing handled with DataPipes?

Multi-process data loading is still handled by the DataLoader; see the DataLoader documentation for more details. As of PyTorch version >= 1.12.0 (TorchData version >= 0.4.0), data sharding is automatically done for DataPipes within the DataLoader as long as a ShardingFilter DataPipe exists in your pipeline. Please see the tutorial for an example.
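
As a rough illustration of why sharding matters, the sketch below splits one stream across several workers in round-robin fashion. The shard function is a hypothetical stand-in and assumes a simple index-modulo rule; the actual ShardingFilter implementation may differ in its details.

```python
def shard(items, num_workers, worker_id):
    """Yield only the items assigned to worker_id under round-robin sharding."""
    for index, item in enumerate(items):
        if index % num_workers == worker_id:
            yield item


data = list(range(10))
num_workers = 3

# Each worker iterates over the full pipeline but keeps only its own share.
shards = [list(shard(data, num_workers, w)) for w in range(num_workers)]
print(shards)

# Without sharding, every worker would yield the full stream,
# so each sample would appear num_workers times per epoch.
unsharded = [list(data) for _ in range(num_workers)]
print(sum(len(s) for s in shards), sum(len(s) for s in unsharded))
```

Taken together, the shards cover every sample exactly once, whereas the unsharded workers repeat the whole dataset once per worker.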

What is the upcoming plan for DataLoader?

DataLoader2 is in the prototype phase and more features are actively being developed. Please see the README file in torchdata/dataloader2. If you would like to experiment with it (or other prototype features), we encourage you to install the nightly version of this library.

Why is there an error saying the specified DLL could not be found when importing portalocker?

This happens only for users who run torchdata on Windows, and it is a common problem with pywin32. You can find the cause and the solution in the link.

Contributing

We welcome PRs! See the CONTRIBUTING file.

Beta Usage and Feedback

We'd love to hear from and work with early adopters to shape our designs. Please reach out by raising an issue if you're interested in using this tooling for your project.

License

TorchData is BSD licensed, as found in the LICENSE file.
