Vulkan Kompute: Blazing fast, mobile-enabled, asynchronous, and optimized for advanced GPU processing use cases.
Vulkan Kompute: The General Purpose Vulkan Compute Framework for C++ and Python.
Principles & Features
- Single header for simple import with flexible build-system configuration
- Multi-language support with C++ as core SDK as well as optimized Python bindings
- Asynchronous & parallel processing support through GPU family queues
- Mobile enabled with examples in Android Studio across several architectures
- BYOV: Bring-your-own-Vulkan design to play nice with existing Vulkan applications
- Explicit relationships for GPU and host memory ownership and memory management
- Short code examples showing the core features
- Longer tutorials for machine learning 🤖, mobile development 📱 and game development 🎮.
Getting Started
Setup
Kompute is provided as a single header file, Kompute.hpp. See the build-system section for the configurations available.
Your First Kompute (Simple Version)
This simple example will show the basics of Kompute through the high level API.
- Create Kompute Manager with default settings (device 0 and first compute compatible queue)
- Create and initialise Kompute Tensors through manager
- Run multiplication operation synchronously
- Map results back from GPU memory to print the results
View the extended version or more examples.
#include <iostream>
#include "Kompute.hpp" // single header; adjust the path to match your include setup

int main() {
    // 1. Create Kompute Manager with default settings (device 0 and first compute compatible queue)
    kp::Manager mgr;
    // 2. Create and initialise Kompute Tensors through manager
    auto tensorInA = mgr.buildTensor({ 2., 2., 2. });
    auto tensorInB = mgr.buildTensor({ 1., 2., 3. });
    auto tensorOut = mgr.buildTensor({ 0., 0., 0. });
    // 3. Run multiplication operation synchronously
    mgr.evalOpDefault<kp::OpMult<>>(
        { tensorInA, tensorInB, tensorOut });
    // 4. Map results back from GPU memory to print the results
    mgr.evalOpDefault<kp::OpTensorSyncLocal>({ tensorInA, tensorInB, tensorOut });
    // Prints the output which is Output: { 2, 4, 6 }
    // (the tensors are shared pointers, hence "->")
    std::cout << fmt::format("Output: {}",
        tensorOut->data()) << std::endl;
}
Your First Kompute (Extended Version)
We will now show the same example as above but leveraging more advanced Kompute features:
- Create Kompute Manager with explicit device 0 and single queue of familyIndex 2
- Explicitly create Kompute Tensors without initializing in GPU
- Initialise the Kompute Tensor in GPU memory and map data into GPU
- Run operation with custom compute shader code asynchronously with explicit dispatch layout
- Create managed sequence to submit batch operations to the GPU
- Map data back to host by running the sequence of batch operations
View more examples.
#include <iostream>
#include <memory>
#include "Kompute.hpp" // single header; adjust the path to match your include setup

int main() {
    // 1. Create Kompute Manager with explicit device 0 and single queue of familyIndex 2
    kp::Manager mgr(0, { 2 });
    // 2. Explicitly create Kompute Tensors without initializing in GPU
    auto tensorInA = std::make_shared<kp::Tensor>(kp::Tensor({ 2., 2., 2. }));
    auto tensorInB = std::make_shared<kp::Tensor>(kp::Tensor({ 1., 2., 3. }));
    auto tensorOut = std::make_shared<kp::Tensor>(kp::Tensor({ 0., 0., 0. }));
    // 3. Initialise the Kompute Tensor in GPU memory and map data into GPU
    mgr.evalOpDefault<kp::OpTensorCreate>({ tensorInA, tensorInB, tensorOut });
    // 4. Run operation with custom compute shader code asynchronously with explicit dispatch layout
    mgr.evalOpAsyncDefault<kp::OpAlgoBase<3, 1, 1>>(
        { tensorInA, tensorInB, tensorOut },
        shaderData); // "shaderData" is defined below and can be a glsl/spirv string, or a path to a file
    // 4.1. Before submitting sequence batch we wait for the async operation
    mgr.evalOpAwaitDefault();
    // 5. Create managed sequence to submit batch operations to the GPU
    std::shared_ptr<kp::Sequence> sq = mgr.getOrCreateManagedSequence("seq");
    // 5.1. Explicitly begin recording batch commands
    sq->begin();
    // 5.2. Record batch commands
    sq->record<kp::OpTensorSyncLocal>({ tensorInA });
    sq->record<kp::OpTensorSyncLocal>({ tensorInB });
    sq->record<kp::OpTensorSyncLocal>({ tensorOut });
    // 5.3. Explicitly stop recording batch commands
    sq->end();
    // 6. Map data back to host by running the sequence of batch operations
    sq->eval();
    // Prints the output which is Output: { 2, 4, 6 }
    // (the tensors are shared pointers, hence "->")
    std::cout << fmt::format("Output: {}",
        tensorOut->data()) << std::endl;
}
Your shader can be provided as a raw GLSL/HLSL string, a SPIR-V byte array (using our CLI), or a string path to a file containing either. The valid ways of providing a shader are shown below.
Passing raw GLSL/HLSL string
static std::string shaderString = (R"(
#version 450
layout (local_size_x = 1) in;
// The input tensors bind index is relative to index in parameter passed
layout(set = 0, binding = 0) buffer bina { float tina[]; };
layout(set = 0, binding = 1) buffer binb { float tinb[]; };
layout(set = 0, binding = 2) buffer bout { float tout[]; };
void main() {
uint index = gl_GlobalInvocationID.x;
tout[index] = tina[index] * tinb[index];
}
)");
static std::vector<char> shaderData(shaderString.begin(), shaderString.end());
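To build intuition for how the dispatch layout and gl_GlobalInvocationID relate, here is a minimal CPU-only sketch in plain Python. It does not use Kompute or Vulkan; run_compute and opmult are hypothetical names introduced for illustration. It assumes the standard GLSL relationship gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID, and it runs invocations one after another, whereas a GPU would run them in parallel.

```python
from itertools import product


def run_compute(dispatch, local_size, shader, *buffers):
    """Call `shader` once per invocation, the way a compute dispatch would.

    dispatch   -- number of workgroups along (x, y, z)
    local_size -- invocations per workgroup along (x, y, z)
    """
    for workgroup in product(*(range(n) for n in dispatch)):
        for local in product(*(range(n) for n in local_size)):
            # Global ID = workgroup ID * workgroup size + local ID, per axis.
            global_id = tuple(
                w * s + l for w, s, l in zip(workgroup, local_size, local)
            )
            shader(global_id, *buffers)


def opmult(global_id, tina, tinb, tout):
    """CPU stand-in for the multiplication shader above."""
    index = global_id[0]  # corresponds to gl_GlobalInvocationID.x
    tout[index] = tina[index] * tinb[index]


tina = [2.0, 2.0, 2.0]
tinb = [1.0, 2.0, 3.0]
tout = [0.0, 0.0, 0.0]

# 3 x 1 x 1 workgroups of size 1 x 1 x 1, matching OpAlgoBase<3, 1, 1>
# together with local_size_x = 1 in the shader.
run_compute((3, 1, 1), (1, 1, 1), opmult, tina, tinb, tout)
print(tout)  # [2.0, 4.0, 6.0]
```

With three workgroups of one invocation each, index takes the values 0, 1 and 2, so every element of the output buffer is written exactly once.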
Passing SPIR-V Bytes array
You can use the Kompute shader-to-cpp-header CLI to convert your GLSL/HLSL or SPIR-V shader into a C++ header file (see documentation link for more info). This is useful if you want your binary to be compiled with all relevant artifacts.
static std::vector<uint8_t> shaderData = { 0x03, /* ... spirv bytes go here */ };
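As a rough illustration of what a bytes-to-header conversion involves, the sketch below renders a byte string as a C++ array definition, similar in spirit to xxd -i. It is not the Kompute CLI: bytes_to_cpp_array is a hypothetical helper, and the only input used here is the 4-byte SPIR-V magic number (0x07230203), not a real shader.

```python
import struct


def bytes_to_cpp_array(data: bytes, name: str, per_line: int = 12) -> str:
    """Render `data` as a C++ byte-vector definition (hypothetical helper)."""
    hex_bytes = [f"0x{b:02x}" for b in data]
    rows = [
        "    " + ", ".join(hex_bytes[i:i + per_line])
        for i in range(0, len(hex_bytes), per_line)
    ]
    body = ",\n".join(rows)
    return f"static std::vector<uint8_t> {name} = {{\n{body}\n}};\n"


# Every SPIR-V module begins with the magic number 0x07230203. Stored
# little-endian, its first byte is 0x03, as in the snippet above.
spirv_prefix = struct.pack("<I", 0x07230203)

header = bytes_to_cpp_array(spirv_prefix, "shaderData")
print(header)
```

In practice you would read the full compiled .spv file in binary mode and pass its contents instead of the 4-byte prefix.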
Path to file containing raw glsl/hlsl or SPIRV bytes
static std::string shaderData = "path/to/shader.glsl";
// Or SPIR-V
static std::string shaderData = "path/to/shader.glsl.spv";
More examples
Simple examples
- Pass shader as raw string
- Record batch commands with a Kompute Sequence
- Run Asynchronous Operations
- Run Parallel Operations Across Multiple GPU Queues
- Create your custom Kompute Operations
- Implementing logistic regression from scratch
End-to-end examples
- Machine Learning Logistic Regression Implementation
- Parallelizing GPU-intensive Workloads via Multi-Queue Operations
- Android NDK Mobile Kompute ML Application
- Game Development Kompute ML in Godot Engine
Architectural Overview
The core architecture of Kompute includes the following:
- Kompute Manager - Base orchestrator which creates and manages device and child components
- Kompute Sequence - Container of operations that can be sent to GPU as batch
- Kompute Operation (Base) - Base class from which all operations inherit
- Kompute Tensor - Tensor structured data used in GPU operations
- Kompute Algorithm - Abstraction for (shader) code executed in the GPU
To see a full breakdown you can read further in the C++ Class Reference.
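The relationships between these components can be sketched with a toy model in plain Python. Nothing here touches a GPU or the real Kompute classes: every Toy* name is hypothetical, "device" memory is just a second Python list, and ToyMult stands in for an operation backed by an algorithm (shader). The point is only to show the record-then-evaluate flow of a sequence.

```python
class ToyTensor:
    """Holds host data plus a stand-in for device memory."""
    def __init__(self, data):
        self.host = list(data)
        self.device = None  # nothing on the "GPU" yet


class ToyOp:
    """Base class from which all toy operations inherit."""
    def __init__(self, tensors):
        self.tensors = tensors

    def run(self):
        raise NotImplementedError


class ToyCreate(ToyOp):
    def run(self):  # copy host data into "device" memory
        for t in self.tensors:
            t.device = list(t.host)


class ToyMult(ToyOp):
    def run(self):  # element-wise multiply in "device" memory
        a, b, out = self.tensors
        out.device = [x * y for x, y in zip(a.device, b.device)]


class ToySyncLocal(ToyOp):
    def run(self):  # copy "device" data back to the host
        for t in self.tensors:
            t.host = list(t.device)


class ToySequence:
    """Records operations, then runs them together as one batch."""
    def __init__(self):
        self.ops = []
        self.recording = False

    def begin(self):
        self.ops, self.recording = [], True

    def record(self, op):
        if not self.recording:
            raise RuntimeError("call begin() before record()")
        self.ops.append(op)

    def end(self):
        self.recording = False

    def eval(self):
        if self.recording:
            raise RuntimeError("call end() before eval()")
        for op in self.ops:
            op.run()


class ToyManager:
    """Owns named sequences, creating them on first use."""
    def __init__(self):
        self.sequences = {}

    def get_or_create_sequence(self, name):
        return self.sequences.setdefault(name, ToySequence())


mgr = ToyManager()
a, b, out = ToyTensor([2, 2, 2]), ToyTensor([1, 2, 3]), ToyTensor([0, 0, 0])

seq = mgr.get_or_create_sequence("seq")
seq.begin()
seq.record(ToyCreate([a, b, out]))
seq.record(ToyMult([a, b, out]))
seq.record(ToySyncLocal([out]))
seq.end()
seq.eval()

print(out.host)  # [2, 4, 6]
```

Note that the host copy of the output only changes once the sync operation has run as part of the batch, which mirrors the explicit host and GPU memory ownership described in the feature list.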
[Diagrams: Full Vulkan Components | Simplified Kompute Components (very tiny, check the full reference diagram in docs for details)]
Asynchronous and Parallel Operations
Kompute provides the flexibility to run operations asynchronously through Vulkan Fences. Furthermore, Kompute enables explicit allocation of queues, which allows for parallel execution of operations across queue families.
The image below provides an intuition of how Kompute Sequences can be allocated to different queues to enable parallel execution based on hardware. You can see the hands-on example, as well as the detailed documentation page describing how it would work using an NVIDIA 1650 as an example.
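The submit-then-await pattern can be sketched by analogy with Python's standard library. This is only an analogy under stated assumptions: a single-worker ThreadPoolExecutor stands in for a GPU queue and a Future stands in for a fence. It does not use Kompute, and Python threads do not behave like GPU queue families.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply(a, b):
    """Stand-in for a GPU operation."""
    return [x * y for x, y in zip(a, b)]


# One single-worker executor per "queue": work submitted to the same queue
# runs in order, while different queues can make progress concurrently.
queue_a = ThreadPoolExecutor(max_workers=1)
queue_b = ThreadPoolExecutor(max_workers=1)

# Submit without blocking (compare evalOpAsyncDefault in the C++ example).
pending_a = queue_a.submit(multiply, [2, 2, 2], [1, 2, 3])
pending_b = queue_b.submit(multiply, [4, 4, 4], [1, 2, 3])

# The host is free to do other work here.

# Block until each result is ready (compare evalOpAwaitDefault).
result_a = pending_a.result()
result_b = pending_b.result()

queue_a.shutdown()
queue_b.shutdown()

print(result_a, result_b)  # [2, 4, 6] [4, 8, 12]
```

Whether two real queues actually execute in parallel depends on the hardware, which is why the text above refers to a specific GPU when describing the behaviour.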
Mobile Enabled
Kompute has been optimized to work in mobile environments. The build system enables dynamic loading of the Vulkan shared library for Android environments, together with a working Android NDK Vulkan wrapper for the C++ headers.
For a full deep dive you can read the blog post "Supercharging your Mobile Apps with On-Device GPU Accelerated Machine Learning". You can also access the end-to-end example code in the repository, which can be run using Android Studio.
Python Package
Besides the C++ core SDK you can also use the Python package of Kompute, which exposes the same core functionality, and supports interoperability with Python objects like Lists, Numpy Arrays, etc.
You can install from the repository by running:
pip install .
For further details you can read the Python Package documentation or the Python Class Reference documentation.
Python Example (Simple)
Then you can interact with it from your interpreter. Below is the same sample as above "Your First Kompute (Simple Version)" but in Python:
from kp import Manager, Tensor  # assumption: the bindings are importable as "kp"
mgr = Manager()
# Can be initialized with List[] or np.Array
tensor_in_a = Tensor([2, 2, 2])
tensor_in_b = Tensor([1, 2, 3])
tensor_out = Tensor([0, 0, 0])
mgr.eval_tensor_create_def([tensor_in_a, tensor_in_b, tensor_out])
shaderFilePath = "shaders/glsl/opmult.comp"
mgr.eval_async_algo_file_def([tensor_in_a, tensor_in_b, tensor_out], shaderFilePath)
# Alternatively can pass raw string/bytes:
# shaderFileData = """ shader code here... """
# mgr.eval_algo_data_def([tensor_in_a, tensor_in_b, tensor_out], list(shaderFileData))
mgr.eval_await_def()
mgr.eval_tensor_sync_local_def([tensor_out])
assert tensor_out.data() == [2.0, 4.0, 6.0]
Python Example (Extended)
Similarly you can find the same extended example as above:
from kp import Manager, Tensor  # assumption: the bindings are importable as "kp"
mgr = Manager(0, [2])
# Can be initialized with List[] or np.Array
tensor_in_a = Tensor([2, 2, 2])
tensor_in_b = Tensor([1, 2, 3])
tensor_out = Tensor([0, 0, 0])
shaderFilePath = "../../shaders/glsl/opmult.comp"
mgr.eval_tensor_create_def([tensor_in_a, tensor_in_b, tensor_out])
seq = mgr.create_sequence("op")
mgr.eval_async_algo_file_def([tensor_in_a, tensor_in_b, tensor_out], shaderFilePath)
mgr.eval_await_def()
seq.begin()
seq.record_tensor_sync_local([tensor_in_a])
seq.record_tensor_sync_local([tensor_in_b])
seq.record_tensor_sync_local([tensor_out])
seq.end()
seq.eval()
assert tensor_out.data() == [2.0, 4.0, 6.0]
Build Overview
The build system provided uses cmake, which allows for cross-platform builds.
The top level Makefile provides a set of optimized configurations for development as well as the docker image build, but you can start a build with the following command:
cmake -Bbuild
You are also able to add Kompute to your repo with add_subdirectory; the Android example CMakeLists.txt file shows how this would be done.
For a more advanced overview of the build configuration check out the Build System Deep Dive documentation.
Kompute Development
We appreciate PRs and Issues. If you want to contribute try checking the "Good first issue" tag, but even using Vulkan Kompute and reporting issues is a great contribution!
Contributing
Dev Dependencies
- Testing
  - GTest
- Documentation
  - Doxygen (with Dot)
  - Sphinx
Development
- Follows Mozilla C++ Style Guide https://www-archive.mozilla.org/hacking/mozilla-style-guide.html
- Uses a post-commit hook to run the linter; you can set it up so it runs the linter before commit
- All dependencies are defined in vcpkg.json
- Uses cmake as build system, and provides a top level makefile with recommended command
- Uses xxd (or xxd.exe windows 64bit port) to convert shader spirv to header files
- Uses doxygen and sphinx for documentation and autodocs
- Uses vcpkg for finding the dependencies, it's the recommended set up to retrieve the libraries
Updating documentation
To update the documentation you will need to:
- Run the gendoxygen target in the build system
- Run the gensphynx target in the build-system
- Push to GitHub Pages with make push_docs_to_ghpages
Running tests
To run tests you can use the helper top level Makefile.
For Visual Studio you can run:
make vs_cmake
make vs_run_tests VS_BUILD_TYPE="Release"
For Unix you can run:
make mk_cmake MK_BUILD_TYPE="Release"
make mk_run_tests
Motivations
This project started after seeing that a lot of new and renowned ML & DL projects like PyTorch, TensorFlow, Alibaba DNN, Tencent NCNN - among others - have either integrated or are looking to integrate the Vulkan SDK to add mobile (and cross-vendor) GPU support.
The Vulkan SDK offers a great low level interface that enables highly specialized optimizations; however, it comes at the cost of highly verbose code, which requires 500-2000 lines of code to even begin writing application code. This has resulted in each of these projects having to implement the same baseline to abstract the non-compute related features of Vulkan. This large amount of non-standardised boilerplate can result in limited knowledge transfer, a higher chance of unique framework implementation bugs being introduced, etc.
We are currently developing Vulkan Kompute not to hide the Vulkan SDK interface (as it's incredibly well designed) but to augment it with a direct focus on Vulkan's GPU computing capabilities. This article provides a high level overview of the motivations of Kompute, together with a set of hands on examples that introduce both GPU computing as well as the core Vulkan Kompute architecture.
File details
Details for the file kompute-0.4.1.tar.gz.
File metadata
- Download URL: kompute-0.4.1.tar.gz
- Upload date:
- Size: 2.8 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.22.0 setuptools/41.4.0 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.7.4
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 42bf206ac885ed8f3733e4167409ee49791a22a2750cba300634db1b39286295 |
MD5 | e9d6ab80ae0a89d0c3941d2a9fb27fba |
BLAKE2b-256 | 88deea432a0d7680fc676740b82fd17466c4980153556b8e7850da051058c954 |
File details
Details for the file kompute-0.4.1-cp38-cp38-win_amd64.whl.
File metadata
- Download URL: kompute-0.4.1-cp38-cp38-win_amd64.whl
- Upload date:
- Size: 201.3 kB
- Tags: CPython 3.8, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.22.0 setuptools/41.4.0 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.7.4
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 2f1441994c942c45d8e87d3148015e3458d1a01cd71f6333f6eca72c90bd8d54 |
MD5 | 87c20a1c73dd9622e315cbe8735d09ed |
BLAKE2b-256 | ca6d9038901b9017977dc628f48a2512dcb0c57e5c5c7f4ca6422b9c4e16a44b |