
TorchStore

A storage solution for PyTorch tensors with distributed tensor support.

TorchStore provides a distributed, asynchronous tensor storage system built on top of Monarch actors. It enables efficient storage and retrieval of PyTorch tensors across multiple processes and nodes with support for various transport mechanisms including RDMA when available.

Key Features:

  • Distributed tensor storage with configurable storage strategies
  • Asynchronous put/get operations for tensors and arbitrary objects
  • Support for PyTorch state_dict serialization/deserialization
  • Multiple transport backends (RDMA, regular TCP) for optimal performance
  • Flexible storage volume management and sharding strategies
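The storage-strategy and sharding ideas in the last two bullets can be illustrated with a toy key-to-volume router. Everything below, including the class and method names, is a hypothetical sketch for intuition, not the TorchStore API:

```python
import hashlib


class HashShardingStrategy:
    """Toy sharding strategy: deterministically map each key to one of
    N storage volumes by hashing the key. Illustrative only; the real
    TorchStore strategies are configurable and may differ."""

    def __init__(self, num_volumes: int):
        self.num_volumes = num_volumes

    def volume_for(self, key: str) -> int:
        # Use a stable hash so every process routes a given key to the
        # same volume, regardless of which rank performs the lookup.
        digest = hashlib.sha256(key.encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.num_volumes


strategy = HashShardingStrategy(num_volumes=4)
volume = strategy.volume_for("my_tensor")
assert 0 <= volume < 4
```

The key property any such strategy needs is determinism across processes: a reader must compute the same volume for a key that the writer did.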

⚠️ Early Development Warning: TorchStore is currently experimental. Expect bugs, incomplete features, and APIs that may change in future versions. The project welcomes bugfixes, but to keep work well coordinated, please discuss any significant change before starting it. Signal your intention in the issue tracker, either by filing a new issue or by claiming an existing one.

Installation

Env Setup

conda create -n torchstore python=3.12
conda activate torchstore
pip install torch

git clone git@github.com:meta-pytorch/monarch.git
python monarch/scripts/install_nightly.py

git clone git@github.com:meta-pytorch/torchstore.git
cd torchstore
pip install -e .

Development Installation

To install the package in development mode:

# Clone the repository
git clone https://github.com/meta-pytorch/torchstore.git
cd torchstore

# Install in development mode
pip install -e .

# Install development dependencies
pip install -e '.[dev]'

Regular Installation

To install the package directly from the repository:

pip install git+https://github.com/meta-pytorch/torchstore.git

Once installed, you can import it in your Python code:

import torchstore

Note: Setup currently assumes you have a working conda environment with both torch and monarch installed (streamlining this is a TODO).

Usage

import torch
import asyncio
import torchstore as ts

async def main():

    # Create a store instance
    await ts.initialize()

    # Store a tensor
    await ts.put("my_tensor", torch.randn(3, 4))

    # Retrieve a tensor
    tensor = await ts.get("my_tensor")


if __name__ == "__main__":
    asyncio.run(main())
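Because put and get are coroutines, independent transfers can be overlapped with asyncio.gather rather than awaited one at a time. The sketch below uses a toy in-memory dict as a stand-in for the store, so it runs without TorchStore; the real calls are the ts.put/ts.get shown above:

```python
import asyncio


class ToyStore:
    """In-memory stand-in mimicking an async put/get surface.
    Illustrative only; not the TorchStore implementation."""

    def __init__(self):
        self._data = {}

    async def put(self, key, value):
        await asyncio.sleep(0)  # yield control, as a real transfer would
        self._data[key] = value

    async def get(self, key):
        await asyncio.sleep(0)
        return self._data[key]


async def demo():
    store = ToyStore()
    # Launch the puts concurrently instead of awaiting them one by one.
    await asyncio.gather(*(store.put(f"w{i}", [i, i]) for i in range(4)))
    # gather returns results in the order the awaitables were passed.
    return await asyncio.gather(*(store.get(f"w{i}") for i in range(4)))


if __name__ == "__main__":
    print(asyncio.run(demo()))  # [[0, 0], [1, 1], [2, 2], [3, 3]]
```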

Resharding Support with DTensor

import asyncio

import torch
import torchstore as ts
from torch.distributed._tensor import distribute_tensor, Replicate, Shard
from torch.distributed.device_mesh import init_device_mesh

async def place_dtensor_in_store():
    device_mesh = init_device_mesh("cpu", (4,))
    tensor = torch.arange(4)
    # Shard the 1-D tensor along dim 0 across the 4 ranks.
    dtensor = distribute_tensor(tensor, device_mesh, placements=[Shard(0)])

    # Store a tensor
    await ts.put("my_tensor", dtensor)


async def fetch_dtensor_from_store():
    # You can now fetch arbitrary shards of this tensor from any rank, e.g.
    device_mesh = init_device_mesh("cpu", (2, 2))
    tensor = torch.rand(4)
    dtensor = distribute_tensor(
        tensor,
        device_mesh,
        placements=[Replicate(), Shard(0)]
    )

    # This line copies the previously stored dtensor into local memory.
    await ts.get("my_tensor", dtensor)

def run_in_parallel(func):
    # just for demonstrative purposes
    return func

if __name__ == "__main__":
    async def main():
        await ts.initialize()
        await run_in_parallel(place_dtensor_in_store)()
        await run_in_parallel(fetch_dtensor_from_store)()
        await ts.shutdown()

    asyncio.run(main())

# Check out tests/test_resharding.py for more end-to-end examples of resharding DTensors.
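The resharding that get() performs can be reasoned about with plain index arithmetic, no torch required. The sketch below is a hypothetical model of which global rows each shard owns under the two layouts above, assuming even Shard(0) splits:

```python
def shard0_slices(length, num_shards):
    """Global index ranges owned by each shard under an even Shard(0) split."""
    step = length // num_shards
    return [(r * step, (r + 1) * step) for r in range(num_shards)]


def overlapping(writer_slices, lo, hi):
    """Writer shards that intersect a reader's [lo, hi) slice. The store
    must gather (pieces of) each of these to fill the reader's buffer."""
    return [s for s in writer_slices if s[0] < hi and s[1] > lo]


# Writer side: a length-4 tensor split 4 ways on a (4,) mesh -> one row each.
writer = shard0_slices(4, 4)
print(writer)                     # [(0, 1), (1, 2), (2, 3), (3, 4)]

# Reader side: mesh (2, 2) with [Replicate(), Shard(0)] -> the second mesh
# dim splits the 4 rows into 2 slices; both replicas hold the same slice.
reader = shard0_slices(4, 2)
print(reader)                     # [(0, 2), (2, 4)]

# A reader holding rows [0, 2) must pull from the first two writer shards:
print(overlapping(writer, 0, 2))  # [(0, 1), (1, 2)]
```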

Testing

Pytest is used for testing. For an example of how to run tests (and capture logs), see:

TORCHSTORE_LOG_LEVEL=DEBUG pytest -vs --log-cli-level=DEBUG tests/test_models.py::test_main

License

Torchstore is BSD-3 licensed, as found in the LICENSE file.

