Composable data loading modules for PyTorch

TorchData (see note below on current status)

Why TorchData? | Install guide | What are DataPipes? | Beta Usage and Feedback | Contributing | Future Plans

Warning: As of July 2023, we have paused active development on TorchData and have paused new releases. We have learnt a lot from building it and hearing from users, but also believe we need to re-evaluate the technical design and approach given how much the industry has changed since we began the project. During the rest of 2023 we will be re-evaluating our plans in this space. Please reach out if you have suggestions or comments (please use #1196 for feedback).

torchdata is a library of common modular data loading primitives for easily constructing flexible and performant data pipelines.

This library introduces composable Iterable-style and Map-style building blocks called DataPipes that work well out of the box with PyTorch's DataLoader. These built-in DataPipes provide the functionality needed to reproduce many different datasets in TorchVision and TorchText, namely loading files (from local or cloud storage), parsing, caching, transforming, filtering, and many more utilities. To understand the basic structure of DataPipes, please see What are DataPipes? below, and to see how DataPipes can be practically composed together into datasets, please see our examples.

On top of DataPipes, this library provides a new DataLoader2 that allows the execution of these data pipelines in various settings and execution backends (ReadingService). You can learn more about the new version of DataLoader2 in our full DataLoader2 documentation. Additional features, such as checkpointing and advanced control of randomness and determinism, are a work in progress.

Note that because many features of the original DataLoader have been modularized into DataPipes, their source code lives as standard DataPipes in pytorch/pytorch rather than torchdata, to preserve backward compatibility and functional parity within torch. Regardless, you can access them by importing them from torchdata.

Why composable data loading?

Over many years of feedback and organic community usage of the PyTorch DataLoader and Dataset, we've found that:

  1. The original DataLoader bundled too many features together, making them difficult to extend, manipulate, or replace. This has created a proliferation of use-case specific DataLoader variants in the community rather than an ecosystem of interoperable elements.
  2. Many libraries, including each of the PyTorch domain libraries, have rewritten the same data loading utilities over and over again. We can save OSS maintainers time and effort rewriting, debugging, and maintaining these commonly used elements.

These reasons inspired the creation of DataPipe and DataLoader2, with a goal to make data loading components more flexible and reusable.

Installation

Version Compatibility

The following table lists the corresponding torch and torchdata versions and the Python versions they support.

torch             torchdata        python
master / nightly  main / nightly   >=3.8, <=3.11
2.0.0             0.6.0            >=3.8, <=3.11
1.13.1            0.5.1            >=3.7, <=3.10
1.12.1            0.4.1            >=3.7, <=3.10
1.12.0            0.4.0            >=3.7, <=3.10
1.11.0            0.3.0            >=3.7, <=3.10
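
If you want to check a version pair in a setup script, the released rows of the table can be encoded as a small lookup. This is only a sketch: the names COMPAT and torchdata_for are hypothetical and not part of torchdata, and the nightly row is left out because it has no fixed version number.

```python
from typing import Optional, Tuple

# Rows copied from the compatibility table above (released versions only).
# torch version -> (torchdata version, minimum Python, maximum Python)
COMPAT = {
    "2.0.0": ("0.6.0", (3, 8), (3, 11)),
    "1.13.1": ("0.5.1", (3, 7), (3, 10)),
    "1.12.1": ("0.4.1", (3, 7), (3, 10)),
    "1.12.0": ("0.4.0", (3, 7), (3, 10)),
    "1.11.0": ("0.3.0", (3, 7), (3, 10)),
}


def torchdata_for(torch_version: str, python: Tuple[int, int]) -> Optional[str]:
    """Return the matching torchdata version, or None if the pair is not listed."""
    row = COMPAT.get(torch_version)
    if row is None:
        return None
    torchdata_version, py_min, py_max = row
    # Tuples compare element by element, so (3, 8) <= (3, 10) <= (3, 11) holds.
    return torchdata_version if py_min <= python <= py_max else None


print(torchdata_for("2.0.0", (3, 10)))   # 0.6.0
print(torchdata_for("1.13.1", (3, 11)))  # None
```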

Colab

Follow the instructions in this Colab notebook. The notebook also contains a simple usage example.

Local pip or conda

First, set up an environment. We will be installing a PyTorch binary as well as torchdata. If you're using conda, create a conda environment:

conda create --name torchdata
conda activate torchdata

If you wish to use venv instead:

python -m venv torchdata-env
source torchdata-env/bin/activate

Install torchdata:

Using pip:

pip install torchdata

Using conda:

conda install -c pytorch torchdata

You can then proceed to run our examples, such as the IMDb one.
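
To confirm which versions ended up in the environment, one option is a short check with the standard library's importlib.metadata. This is a minimal sketch; the helper name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report both packages so the pair can be compared with the table above.
for name in ("torch", "torchdata"):
    print(name, installed_version(name) or "not installed")
```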

From source

pip install .

If you'd like to include the S3 IO datapipes and aws-sdk-cpp, you may also follow the instructions here.

In case building TorchData from source fails, install the nightly version of PyTorch following the linked guide on the contributing page.

From nightly

A nightly version of TorchData is also provided and is updated daily from the main branch.

Using pip:

pip install --pre torchdata --extra-index-url https://download.pytorch.org/whl/nightly/cpu

Using conda:

conda install torchdata -c pytorch-nightly

What are DataPipes?

Early on, we observed widespread confusion between PyTorch Datasets that represented reusable loading tooling (e.g. TorchVision's ImageFolder) and those that represented pre-built iterators/accessors over actual data corpora (e.g. TorchVision's ImageNet). This led to an unfortunate pattern of siloed inheritance of data tooling rather than composition.

DataPipe is simply a renaming and repurposing of the PyTorch Dataset for composed usage. A DataPipe takes in some access function over Python data structures, __iter__ for IterDataPipes and __getitem__ for MapDataPipes, and returns a new access function with a slight transformation applied. For example, take a look at this JsonParser, which accepts an IterDataPipe over file names and raw streams, and produces a new iterator over the filenames and deserialized data:

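import json

from torchdata.datapipes.iter import IterDataPipe

class JsonParserIterDataPipe(IterDataPipe):
    def __init__(self, source_datapipe, **kwargs) -> None:
        # Upstream DataPipe yielding (file_name, stream) pairs
        self.source_datapipe = source_datapipe
        # Extra keyword arguments forwarded to json.loads
        self.kwargs = kwargs

    def __iter__(self):
        for file_name, stream in self.source_datapipe:
            data = stream.read()
            # Emit the file name alongside the deserialized content
            yield file_name, json.loads(data, **self.kwargs)

    def __len__(self):
        # One output per input, so the length matches the source
        return len(self.source_datapipe)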

You can see in this example how DataPipes can be easily chained together to compose graphs of transformations that reproduce sophisticated data pipelines, with streamed operation as a first-class citizen.
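
To make the chaining concrete without requiring torch to be installed, here is a plain-Python sketch of the same pattern. The classes InMemoryFiles, JsonParser, and Mapper are hypothetical stand-ins that only mimic the `__iter__` contract of an IterDataPipe; they are not the torchdata classes.

```python
import io
import json
from typing import Any, Callable, Dict, Iterable, Iterator, Tuple


class InMemoryFiles:
    """Hypothetical source: yields (file_name, stream) pairs as a file opener would."""

    def __init__(self, files: Dict[str, str]) -> None:
        self.files = files

    def __iter__(self) -> Iterator[Tuple[str, io.StringIO]]:
        for name, text in self.files.items():
            yield name, io.StringIO(text)


class JsonParser:
    """Same shape as the JsonParser above, minus the IterDataPipe base class."""

    def __init__(self, source: Iterable[Tuple[str, Any]], **kwargs: Any) -> None:
        self.source = source
        self.kwargs = kwargs

    def __iter__(self) -> Iterator[Tuple[str, Any]]:
        for file_name, stream in self.source:
            yield file_name, json.loads(stream.read(), **self.kwargs)


class Mapper:
    """Applies a function to every item of the upstream stage."""

    def __init__(self, source: Iterable[Any], fn: Callable[[Any], Any]) -> None:
        self.source = source
        self.fn = fn

    def __iter__(self) -> Iterator[Any]:
        for item in self.source:
            yield self.fn(item)


# Each stage wraps the previous one; nothing is read until iteration starts.
files = InMemoryFiles({"a.json": '{"x": 1}', "b.json": '{"x": 2}'})
pipeline = Mapper(JsonParser(files), lambda pair: pair[1]["x"])
print(list(pipeline))  # [1, 2]
```

Because every stage only needs an iterable upstream, stages can be reordered or replaced independently, which is the property the composition-over-inheritance design is after.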

Under this naming convention, Dataset simply refers to a graph of DataPipes, and a dataset module like ImageNet can be rebuilt as a factory function returning the requisite composed DataPipes. Note that the vast majority of built-in features are implemented as IterDataPipes. We encourage using the built-in IterDataPipes as much as possible and converting them to MapDataPipes only when necessary.
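
The factory-function idea, and the late conversion from iterable-style to map-style access, can be sketched in plain Python as follows. The names tiny_dataset and IterToMap are hypothetical and the sample rows are made up for illustration; this is not the torchdata implementation.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Tuple


def tiny_dataset(split: str) -> Iterator[Tuple[int, str]]:
    """Factory function: returns a composed iterable instead of a Dataset subclass."""
    rows = {
        "train": [("cat", 0), ("dog", 1), ("bird", 2)],
        "test": [("fish", 3)],
    }[split]
    # A single transformation stage: swap to (label, text) and upper-case the text.
    return ((label, text.upper()) for text, label in rows)


class IterToMap:
    """Hypothetical converter: materializes iterable samples into key -> value access."""

    def __init__(
        self,
        source: Iterable[Any],
        key_value_fn: Callable[[Any], Tuple[Any, Any]],
    ) -> None:
        # The whole source is consumed here, which is why conversion is
        # best done late and only when random access is really needed.
        self._data: Dict[Any, Any] = dict(key_value_fn(s) for s in source)

    def __getitem__(self, key: Any) -> Any:
        return self._data[key]

    def __len__(self) -> int:
        return len(self._data)


by_label = IterToMap(tiny_dataset("train"), key_value_fn=lambda sample: sample)
print(by_label[1], len(by_label))  # DOG 3
```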

DataLoader2

A new, lightweight DataLoader2 is introduced to decouple the overloaded data-manipulation functionality of torch.utils.data.DataLoader into DataPipe operations. In addition, certain features are only available with DataLoader2, such as checkpointing/snapshotting and switching backend services to perform high-performance operations.

Please read the full documentation here.

Tutorial

A tutorial of this library is available here on the documentation site. It covers four topics: using DataPipes, working with DataLoader, implementing DataPipes, and working with Cloud Storage Providers.

There is also a tutorial available on how to work with the new DataLoader2.

Usage Examples

We provide a simple usage example in this Colab notebook. It can also be downloaded and executed locally as a Jupyter notebook.

In addition, there are several data loading implementations of popular datasets across different research domains that use DataPipes. You can find a few selected examples here.

Frequently Asked Questions (FAQ)

What should I do if the existing set of DataPipes does not do what I need?

You can implement your own custom DataPipe. If you believe your use case is common enough such that the community can benefit from having your custom DataPipe added to this library, feel free to open a GitHub issue. We will be happy to discuss!

What happens when the Shuffler DataPipe is used with DataLoader?

To enable shuffling, you need to add a Shuffler to your DataPipe pipeline. Then, by default, shuffling will happen at the point you specified, as long as you do not set shuffle=False within DataLoader.
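
As an illustration of a shuffle stage that sits at a fixed point in a pipeline and can be switched off from outside, here is a plain-Python sketch. It assumes a buffer-based strategy; the function buffered_shuffle and its enabled flag are hypothetical and do not reproduce torchdata's Shuffler.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(
    source: Iterable[T], buffer_size: int, seed: int, enabled: bool = True
) -> Iterator[T]:
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    if not enabled:
        # Mirrors the shuffle=False case: the stage passes items through untouched.
        yield from source
        return
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in source:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Drain whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


print(list(buffered_shuffle(range(6), buffer_size=3, seed=0, enabled=False)))  # [0, 1, 2, 3, 4, 5]
print(list(buffered_shuffle(range(6), buffer_size=3, seed=0)))
```

Whatever stages come after the shuffle see the shuffled order, and whatever comes before it does not, which is why the position of the shuffle stage in the pipeline matters.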

What happens when the Batcher DataPipe is used with DataLoader?

If you choose to use Batcher while setting batch_size > 1 for DataLoader, your samples will be batched more than once. You should choose one or the other.
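
The effect of batching twice is easy to see in a plain-Python sketch. The batch helper below is hypothetical and only stands in for the two batching stages (a Batcher in the pipeline plus batch_size in the loader).

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch(source: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group consecutive items into lists of at most batch_size elements."""
    current: List[T] = []
    for item in source:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        # Emit the final, possibly smaller, batch.
        yield current


once = list(batch(range(8), 2))
twice = list(batch(batch(range(8), 2), 2))
print(once)   # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(twice)  # [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]
```

The second stage groups batches rather than samples, so the result is nested one level deeper than intended.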

Why are there fewer built-in MapDataPipes than IterDataPipes?

By design, there are fewer MapDataPipes than IterDataPipes to avoid duplicate implementations of the same functionalities as MapDataPipe. We encourage users to use the built-in IterDataPipe for various functionalities, and convert it to MapDataPipe as needed.

How is multiprocessing handled with DataPipes?

Multi-process data loading is still handled by the DataLoader; see the DataLoader documentation for more details. As of PyTorch version >= 1.12.0 (TorchData version >= 0.4.0), data sharding is automatically done for DataPipes within the DataLoader as long as a ShardingFilter DataPipe exists in your pipeline. Please see the tutorial for an example.
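
The idea behind a sharding stage can be sketched in plain Python: each worker keeps only the items whose position matches its own id, so the workers split the stream instead of each producing a full copy. The shard function below is a hypothetical illustration assuming round-robin assignment, not the ShardingFilter source.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard(source: Iterable[T], num_workers: int, worker_id: int) -> Iterator[T]:
    """Keep every num_workers-th item, starting at position worker_id."""
    for index, item in enumerate(source):
        if index % num_workers == worker_id:
            yield item


data = list(range(10))
shards = [list(shard(data, num_workers=3, worker_id=w)) for w in range(3)]
print(shards)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

Without such a stage, every worker would iterate the full pipeline and each sample would appear once per worker.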

What is the upcoming plan for DataLoader?

DataLoader2 is in the prototype phase and more features are actively being developed. Please see the README file in torchdata/dataloader2. If you would like to experiment with it (or other prototype features), we encourage you to install the nightly version of this library.

Why is there an error saying the specified DLL could not be found when importing portalocker?

This happens only for users who run torchdata on Windows and is a common problem with pywin32. You can find the cause and the solution in the link.

Contributing

We welcome PRs! See the CONTRIBUTING file.

Beta Usage and Feedback

We'd love to hear from and work with early adopters to shape our designs. Please reach out by raising an issue if you're interested in using this tooling for your project.

License

TorchData is BSD licensed, as found in the LICENSE file.
