Tensor library with automatic kernel fusion
🥶 TensorFrost (v0.1.3) 🥶
Yet another Python tensor library with autodifferentiation (TODO). Currently very much a work in progress.
The main idea of this library is to compile optimal fused kernels for the GPU from a set of numpy-like functions, eventually including more complex constructs such as loops and conditionals (also TODO).
Currently working platforms:
Backend/OS | CPU | CUDA | Vulkan |
---|---|---|---|
Windows | 🚧 | ⛔ | ⛔ |
Linux | ⛔ | ⛔ | ⛔ |
Examples
Installation
From PyPI
pip install tensorfrost
From source
You need to have CMake installed to build the library.
First clone the repository:
git clone --recurse-submodules https://github.com/MichaelMoroz/TensorFrost.git
cd TensorFrost
Then run cmake to build the library:
cmake -S . -B build && cmake --build build
The cmake script will automatically install the compiled python module into your python environment.
Building wheel packages (optional)
You can either call clean_rebuild.bat %PYTHON_VERSION% to build the wheel packages for the specified Python version (that version needs to be installed beforehand), or you can build them for all versions by calling build_all_python_versions.bat. The scripts will automatically build and install the library for each Python version, and then build the wheel packages into the PythonBuild/dist folder.
Usage
Setup
For the library to work you need a C++ compiler that supports C++17 (currently only the Microsoft Visual Studio compiler, MSVC, is supported).
First you need to import the library:
import TensorFrost as tf
Then you need to initialize the library with the device you want to use and the kernel compiler flags (different for each platform):
tf.initialize(tf.cpu, "/O2 /fp:fast /openmp") # Windows + MSVC (currently the only working compiler out of the box)
TensorFrost will find any available MSVC installation and use it to compile the kernels. If you want to use a different compiler, you can specify the path to the compiler executable (TODO).
Basic usage
Now you can create and compile functions. For example, here is a very simple function that performs one step of a wave simulation:
dt = 0.1  # time step; not defined in the original snippet, value assumed here

def WaveEq():
    #shape is not specified -> shape is inferred from the input tensor (can result in slower execution)
    u = tf.input([-1, -1], tf.float32)
    #shape must match
    v = tf.input(u.shape, tf.float32)

    #one index tensor per dimension of u
    i,j = u.indices
    #discrete 5-point Laplacian
    laplacian = u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1] - u * 4.0
    v_new = v + dt*laplacian
    u_new = u + dt*v_new

    return [v_new, u_new]

wave_eq = tf.compile(WaveEq)
The tensor programs take and output tensor memory buffers, which can be created from numpy arrays:
import numpy as np

A = tf.tensor(np.zeros([100, 100], dtype=np.float32))
B = tf.tensor(np.zeros([100, 100], dtype=np.float32))
Then you can run the program:
A, B = wave_eq(A, B)
To get the result back into a numpy array, you can use the numpy property:
Anp = A.numpy
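For checking results, it can help to have a plain NumPy version of the same update step. The following is only a reference sketch, not part of TensorFrost: the function name `wave_step_reference` and the default `dt=0.1` are assumptions, and the edge padding mimics the clamped out-of-bounds indexing that TensorFrost currently uses.

```python
import numpy as np

def wave_step_reference(u, v, dt=0.1):
    """One explicit wave-equation step on a 2D grid (NumPy reference, assumed names)."""
    # Replicate the border so that out-of-bounds neighbours are clamped to the edge.
    p = np.pad(u, 1, mode="edge")
    # 5-point Laplacian: up + down + left + right - 4 * centre
    laplacian = p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * u
    v_new = v + dt * laplacian
    u_new = u + dt * v_new
    # Same output order as WaveEq: velocity first, then displacement.
    return v_new, u_new

u0 = np.zeros((100, 100), dtype=np.float32)
v0 = np.zeros((100, 100), dtype=np.float32)
u0[50, 50] = 1.0  # a single initial bump
v1, u1 = wave_step_reference(u0, v0)
```

Note that the outputs are ordered as velocity first and displacement second, matching the return statement of `WaveEq`.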
TensorFrost does not support JIT compilation (currently no plans either), so you must create the program before running it. Therefore the tensor operations must only be used inside a tensor program. Operations outside the function will throw an error, so if you want to do operations outside you must read the data into a numpy array first.
Operations
TensorFrost supports most of the basic numpy operations, including indexing, arithmetic, and broadcasting (only partially for now).
The core operation is the indexing operation, which is used to specify indices for accessing the tensor data. A tensor with N dimensions has N index tensors, one per dimension. This operation is similar to numpy's np.ogrid and np.mgrid functions, but it is basically free due to fusion.
#can be created either from a provided shape or from a tensor
i,j = tf.indices([8, 8])
i,j = A.indices
For example, i contains:
[[0, 0, 0, ..., 0, 0, 0],
[1, 1, 1, ..., 1, 1, 1],
[2, 2, 2, ..., 2, 2, 2],
...,
[7, 7, 7, ..., 7, 7, 7]]
And analogously for j.
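The same index grids can be reproduced in plain NumPy, which is a convenient way to check what i and j hold. This sketch uses only NumPy (`np.indices` and `np.mgrid`), not TensorFrost itself:

```python
import numpy as np

# Dense index grids for an 8x8 shape: i varies along rows, j along columns.
i, j = np.indices((8, 8))

# np.mgrid produces the same dense grids through slice syntax.
mi, mj = np.mgrid[0:8, 0:8]

print(i[3])  # every element of row 3 is 3
print(j[3])  # row 3 of j counts 0..7
```

Unlike in TensorFrost, NumPy materializes these grids in memory, so they are not free there.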
These indices can then be used to index into the tensor data, to either read or write data:
#set elements [16:32, 16:32] to 1.0
i,j = tf.indices([16, 16])
B[i+16, j+16] = 1.0
#read elements [8:24, 8:24]
i,j = tf.indices([16, 16])
C = B[i+8, j+8]
Here we can see that the shape of the "computation" is not the same as the shape of the tensor, and one thread is spawned for each given index. This is the main idea of TensorFrost. Then all sequential computations of the same shape are fused into a single kernel, if they are not dependent on each other in a non-trivial way.
When doing out-of-bounds indexing, the index is currently clamped to the tensor shape. This is not ideal, but it is the simplest way to handle this. In the future there will be a way to specify the boundary conditions.
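The indexed write, the indexed read, and the clamping behaviour can be emulated in NumPy as follows. This is a sketch under assumptions: the helper `clamped_read` is hypothetical (it is not a TensorFrost function), and the 32x32 buffer size is chosen only so that the last read goes out of bounds.

```python
import numpy as np

def clamped_read(tensor, rows, cols):
    """Read tensor[rows, cols], clamping each index to the valid range (hypothetical helper)."""
    rows = np.clip(rows, 0, tensor.shape[0] - 1)
    cols = np.clip(cols, 0, tensor.shape[1] - 1)
    return tensor[rows, cols]

B = np.zeros((32, 32), dtype=np.float32)
i, j = np.indices((16, 16))

# Write: the "computation" has shape 16x16, the tensor has shape 32x32.
B[i + 16, j + 16] = 1.0

# Read elements [8:24, 8:24]; all indices are in bounds here.
C = clamped_read(B, i + 8, j + 8)

# Read with indices 24..39: everything past 31 is clamped to the last row/column.
edge = clamped_read(B, i + 24, j + 24)
```

Here `C` has the shape of the index tensors (16x16), not of `B`, which mirrors the point that the shape of the computation is set by the indices.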
Scatter operations
These operations allow implementing non-trivial reduction operations, including, for example, matrix multiplication:
def MatrixMultiplication():
    A = tf.input([-1, -1], tf.float32)
    N, M = A.shape
    B = tf.input([M, -1], tf.float32) #M must match
    K = B.shape[1]

    C = tf.zeros([N, K])
    i, j, k = tf.indices([N, K, M])
    tf.scatterAdd(C[i, j], A[i, k] * B[k, j])

    return [C]

matmul = tf.compile(MatrixMultiplication)
Here the 3D nature of the matrix multiplication is apparent. The scatter operation is used to accumulate the results of the row-column dot products into the elements of the resulting matrix.
(This is not the most efficient way to implement matrix multiplication, but it is the simplest way to show how scatter operations work. In the future though, some dimensions will be converted into loop indices, and the scatter operation will be used to accumulate the results of the dot products into the resulting matrix.)
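To make the scatter formulation concrete, here is a NumPy analogue that can be compared against the built-in matrix product. It is a sketch, not TensorFrost code: `np.add.at` stands in for the scatter-add, and the function name `matmul_scatter_reference` is an assumption.

```python
import numpy as np

def matmul_scatter_reference(A, B):
    """Matrix product written as a scatter-add over a 3D index space (NumPy analogue)."""
    N, M = A.shape
    M2, K = B.shape
    if M != M2:
        raise ValueError("inner dimensions must match")
    C = np.zeros((N, K), dtype=A.dtype)
    # One "thread" per (i, j, k) triple, as in tf.indices([N, K, M]).
    i, j, k = np.indices((N, K, M))
    # Unbuffered accumulation: each product A[i, k] * B[k, j] is added into C[i, j].
    np.add.at(C, (i, j), A[i, k] * B[k, j])
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3)).astype(np.float32)
B = rng.standard_normal((3, 5)).astype(np.float32)
C = matmul_scatter_reference(A, B)
```

`np.add.at` is used instead of `C[i, j] += ...` because the plain in-place form does not accumulate when the same output element is indexed more than once.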
Loops and conditionals
TODO
Autodifferentiation
TODO
Advanced usage
TODO
Roadmap
Core features:
- Basic operations (memory, indexing, arithmetic, etc.)
- Basic kernel fusion and compilation
- Advanced built-in functions (random, special functions, etc.)
- Advanced operations (loops, conditionals, etc.)
- Autodifferentiation
- Kernel code and execution graph export and editing
- Advanced data types and quantization
- Advanced IR optimizations
- Kernel shape and cache optimization
- Compile from Python AST instead of tracing
Algorithm library:
- Sort, scan, reduction, etc.
- Matrix operations (matrix multiplication, etc.)
- Advanced matrix operations (QR, SVD, eigenvalues, etc.)
- Fast Fourier Transform
- High-level neural network layers (convolution, etc.)
Platforms:
- Windows
- Linux
- MacOS
Backends:
- CPU (using user-provided compiler)
- ISPC (for better CPU utilization)
- Vulkan
- CUDA
- WGPU (for web)
(hopefully im not going to abandon this project before finishing lol)
Contributing
Contributions are welcome! If you want to contribute, please open an issue first to discuss the changes you want to make.