
A library for test-time training.

Project description

TTT

TTT is a repository for test-time training kernels.

Currently, we only support non-causal TTT-MLP kernels with a head dimension of 64. Rematerialization (remat) is automatically supported by these kernels.

Here is an example of how to invoke the kernels.

import test_time_training as ttt


# Forward pass (both kernels are TTT-MLP)
ttt.ttt_forward(
    # Inputs
    XQ_batch.contiguous(),
    XK_batch.contiguous(),
    XV_batch.contiguous(),
    last_eta.contiguous(),
    # Layer norm parameters
    ttt_norm_weight.contiguous(),
    ttt_norm_bias.contiguous(),
    # Initial hidden states
    W1_init.contiguous(),
    b1_init.contiguous(),
    W2_init.contiguous(),
    b2_init.contiguous(),
    # Checkpoint buffers
    W1_checkpoints.contiguous(),
    b1_checkpoints.contiguous(),
    W2_checkpoints.contiguous(),
    b2_checkpoints.contiguous(),
    # Output buffer
    XQW_batch.contiguous(),
    checkpoint_group_size
)

ttt.ttt_backward(
    # Forward inputs
    XQ_batch.contiguous(),
    XK_batch.contiguous(),
    XV_batch.contiguous(),
    last_eta.contiguous(),
    ttt_norm_weight.contiguous(),
    ttt_norm_bias.contiguous(),
    # Checkpoints
    W1_checkpoints.contiguous(),
    b1_checkpoints.contiguous(),
    W2_checkpoints.contiguous(),
    b2_checkpoints.contiguous(),
    XQW_batch.contiguous(),
    # Rematted Buffers
    W1_init_group.contiguous(),
    b1_init_group.contiguous(),
    W2_init_group.contiguous(),
    b2_init_group.contiguous(),
    x_hat_ln_group.contiguous(),
    std_ln_group.contiguous(),
    X2_group.contiguous(),
    Z1_group.contiguous(),
    Z1_bar_group.contiguous(),
    X2_bar_group.contiguous(),
    grad_l_wrt_Z2_group.contiguous(),
    grad_l_wrt_Z1_group.contiguous(),
    x_hat_fused_group.contiguous(),
    grad_x_hat_fused_group.contiguous(),
    grad_output_fused_group.contiguous(),
    std_fused_group.contiguous(),
    # Upstream grads
    grad_L_W1_last.contiguous(),
    grad_L_b1_last.contiguous(),
    grad_L_W2_last.contiguous(),
    grad_L_b2_last.contiguous(),
    grad_L_XQW_batch.contiguous(),
    # Output grads
    grad_L_ttt_norm_weight.contiguous(),
    grad_L_ttt_norm_bias.contiguous(),
    grad_L_W1_init.contiguous(),
    grad_L_b1_init.contiguous(),
    grad_L_W2_init.contiguous(),
    grad_L_b2_init.contiguous(),
    grad_L_last_eta.contiguous(),
    grad_L_XQ.contiguous(),
    grad_L_XK.contiguous(),
    grad_L_XV.contiguous(),
    checkpoint_group_size
)

Note that these kernels do not support non-contiguous tensors.
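
If your tensors may have become non-contiguous (for example after a transpose or slice), you can normalize them before the call. The ensure_contiguous helper below is just an illustrative sketch, not part of the library.

import torch

def ensure_contiguous(*tensors):
    # Return contiguous copies of any non-contiguous tensors,
    # leaving already-contiguous tensors untouched.
    return tuple(t if t.is_contiguous() else t.contiguous() for t in tensors)

XQ_batch, XK_batch, XV_batch, last_eta = ensure_contiguous(
    XQ_batch, XK_batch, XV_batch, last_eta
)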

ThunderKittens

This repository is forked from ThunderKittens (https://github.com/HazyResearch/ThunderKittens), which was used and modified for kernel development.

Installation

Installation requires the CUDA driver or toolkit (v12.3+) and g++ (v10+).

Pip

pip install test_time_training

From source

source env.src
python setup.py install

Notes on implementation

These kernels use distributed shared memory to implement tensor parallelism and sharding. The hidden states are sharded across SMs to save shared memory.

These kernels also use input staging and pipelining to hide the latency of global memory reads.

We also use mixed precision: matmuls are performed in bf16 to use the tensor cores, while hidden-state (and gradient) accumulation and layer norm computation are kept in float32.
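
As a rough PyTorch sketch of this precision scheme (not the kernel code itself; the shapes, the 256-wide hidden layer, and the learning rate are made up for illustration), the matmuls run in bf16 while the hidden state is updated and layer-normalized in float32:

import torch

# Illustrative shapes only: 64-dim heads, a 256-wide hidden layer, 16 tokens.
W1 = torch.zeros(64, 256, dtype=torch.float32, device="cuda")  # fp32 hidden state
XK = torch.randn(16, 64, dtype=torch.bfloat16, device="cuda")
grad_Z1 = torch.randn(16, 256, dtype=torch.bfloat16, device="cuda")

update = XK.transpose(0, 1) @ grad_Z1    # bf16 matmul, runs on tensor cores
W1 -= 0.01 * update.to(torch.float32)    # accumulate the hidden-state update in fp32

Z1 = (XK @ W1.to(torch.bfloat16)).to(torch.float32)
Z1 = torch.nn.functional.layer_norm(Z1, [256])  # layer norm kept in fp32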

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions


test_time_training-0.7.0-cp312-cp312-manylinux_2_31_x86_64.whl (10.4 MB)

Uploaded: CPython 3.12, manylinux: glibc 2.31+, x86-64

test_time_training-0.7.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.4 MB)

Uploaded: CPython 3.12, manylinux: glibc 2.17+, x86-64

File details

Details for the file test_time_training-0.7.0-cp312-cp312-manylinux_2_31_x86_64.whl.

File hashes

SHA256: b205556dec02edbe23f7946302903eff639e44adb8888ffa0a77512442d8c1bf
MD5: 67eb322dee7c4e1e06a86fb47e701c46
BLAKE2b-256: c440f053e6e6f38b71bc40875724a2b7287d592c4bede319b02b02f278b6e8ca

File details

Details for the file test_time_training-0.7.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

SHA256: 0a82abd5a7d7d3e1be0f928cbb80491f6134aa1e6269b25eded54829880f991e
MD5: 3fd068f29c9eb8277801843eab7b8107
BLAKE2b-256: fa37c0c6936bb5de67719b27746f75a680af665617162ae4b23220b7f5db4dfa
