130 million TDEs per second, Python + CUDA TDEs from Nikkhoo and Walter 2015

Python + CUDA TDEs from Nikkhoo and Walter 2015

CUDA and OpenCL-enabled fullspace triangle dislocation elements. Benchmarked at 130 million TDEs per second. Based on the original MATLAB code from Nikhoo and Walter 2015.. In addition to the basic pair-wise TDE operations for displacement and strain, cutde also has:

• all pairs matrix construction functions.
• matrix free functions for low memory usage settings.
• block-wise functions that are especially helpful in an FMM or hierarchical matrix setting.
• a CUDA adaptive cross approximation implementation for building hierarchical matrices.

See below for basic usage and installation instructions. For more realistic usage examples, please check out the TDE sequence in the BIE book. You'll find examples of using all the above variants.

import matplotlib.pyplot as plt
import numpy as np

import cutde

xs = np.linspace(-2, 2, 200)
ys = np.linspace(-2, 2, 200)
obsx, obsy = np.meshgrid(xs, ys)
pts = np.array([obsx, obsy, 0 * obsy]).reshape((3, -1)).T.copy()

fault_pts = np.array([[-1, 0, 0], [1, 0, 0], [1, 0, -1], [-1, 0, -1]])
fault_tris = np.array([[0, 1, 2], [0, 2, 3]], dtype=np.int64)

disp_mat = cutde.disp_matrix(obs_pts=pts, tris=fault_pts[fault_tris], nu=0.25)

slip = np.array([[1, 0, 0], [1, 0, 0]])
disp = disp_mat.reshape((-1, 6)).dot(slip.flatten())

disp_grid = disp.reshape((*obsx.shape, 3))

plt.figure(figsize=(5, 5), dpi=300)
cntf = plt.contourf(obsx, obsy, disp_grid[:, :, 0], levels=21)
plt.contour(
obsx,
obsy,
disp_grid[:, :, 0],
colors="k",
linestyles="-",
linewidths=0.5,
levels=21,
)
plt.colorbar(cntf)
plt.title("\$u_x\$")
plt.tight_layout()
plt.savefig("docs/example.png", bbox_inches="tight") Usage documentation

Simple pair-wise TDEs

Computing TDEs for observation point/source element pairs is really simple:

import cutde

disp = cutde.disp(obs_pts, src_tris, slips, nu)
strain = cutde.strain(obs_pts, src_tris, slips, nu)
• obs_pts is a np.array with shape (N, 3)
• src_tris is a np.array with shape (N, 3, 3) where the second dimension corresponds to each vertex and the third dimension corresponds to the cooordinates of those vertices.
• slips is a np.array with shape (N, 3) where slips[:,0] is the strike slip component, while component 1 is the dip slip and component 2 is the tensile/opening component.
• the last parameter, nu, is the Poisson ratio.

IMPORTANT: N should be the same for all these arrays. There is exactly one triangle and slip value used for each observation point.

• The output disp is a (N, 3) array with displacement components in the x, y, z directions.
• The output strain is a (N, 6) array representing a symmetric tensor. strain[:,0] is the xx component of strain, 1 is yy, 2 is zz, 3 is xy, 4 is xz, and 5 is yz.

I want stress.

Use:

stress = cutde.strain_to_stress(strain, sm, nu)

to convert from stress to strain assuming isotropic linear elasticity. sm is the shear modulus and nu is the Poisson ratio.

All pairs interactions matrix

If, instead, you want to create a matrix representing the interaction between every observation point and every source triangle, there is a different interface:

import cutde

disp_mat = cutde.disp_matrix(obs_pts, src_tris, nu)
strain_mat = cutde.strain_matrix(obs_pts, src_tris, nu)
• obs_pts is a np.array with shape (N_OBS_PTS, 3)
• src_tris is a np.array with shape (N_SRC_TRIS, 3, 3) where the second dimension corresponds to each vertex and the third dimension corresponds to the cooordinates of those vertices.
• the last parameter, nu, is the Poisson ratio.
• The output disp_mat is a (N_OBS_PTS, 3, N_SRC_TRIS, 3) array. The second dimension corresponds to the components of the observed displacement while the fourth dimension corresponds to the component of the source slip vector. The slip vector components are ordered the same way as in cutde.disp and cutde.strain.
• The output strain_mat is a (N_OBS_PTS, 6, N_SRC_TRIS, 3) array. Like above, the dimension corresponds to the components of the observation strain with the ordering identical to cutde.strain.

Note that to use the strain_to_stress function, you'll need to re-order the axes of the strain array. You can do this with np.transpose(...).

Matrix-free all pairs interactions

A common use of the matrices produced above by cutde.disp_matrix would be to perform matrix-vector products with a input vector with (N_SRC_TRIS * 3) entries and an output vector with (N_OBS_PTS * 6) entries. But, building the entire matrix can require a very large amount of memory. In some situations, it's useful to compute matrix-vector products without ever computing the matrix itself, a so-called "matrix-free" operation. In order to do this, the matrix entries are recomputed whenever they are needed. As a result, performing a matrix-vector product is much slower -- on my machine, about 20x slower. But, the trade-off may be worthwhile if you are memory-constrained.

disp = cutde.disp_free(obs_pts, src_tris, slips, nu)
strain = cutde.strain_free(obs_pts, src_tris, slips, nu)

The parameters are the same as for cutde.disp_matrix with the addition of slips. The slips array is a (N_SRC_TRIS, 3) array containing the source slip vectors.

Block-wise interaction matrices

In some settings, it is useful to compute many sub-blocks of a matrix without computing the full matrix. For example, this is useful for the nearfield component of a hierarchical matrix or fast multipole approximation.

disp_matrices, block_idxs = cutde.disp_block(
obs_pts, src_tris, obs_start, obs_end, src_start, src_end, nu
)
strain_matrices, strain_block_idxs = cutde.strain_block(
obs_pts, src_tris, obs_start, obs_end, src_start, src_end, nu
)
• obs_pts, src_tris and nu are the same as for disp_matrix.
• obs_starts and obs_end are arrays with N_BLOCKS elements representing the first and last observation point indices in each block.
• src_starts and src_end are arrays with N_BLOCKS elements representing the first and last source triangle indices in each block.

The output disp_matrices and strain_matrices will be a densely packed representation with each block's boundaries demarcated by block_idxs. As an example of extracting a single block:

disp_matrices, block_idxs = cutde.disp_block(obs_pts, src_tris, [0, 5], [5, 10], [0, 2], [2, 4], nu)
block1 = disp_matrices[block_idxs:block_idxs].reshape((5, 3, 2, 3))

Sometimes the matrix blocks we want to compute represent far-field interactions where the observation points are all sufficiently far away and separated as a group from the source triangles. In this situation, the matrix blocks are approximately low rank. An approximate matrix will require much less storage space and allow for more efficient matrix-vector products. Adaptive cross approximation is an algorithm for computing such a low rank representation. See Grasedyck 2005 for an accessible and general introduction to ACA. Or, see the ACA section here for an introduction that builds up to using the cutde.disp_aca implementation.

disp_appxs = cutde.disp_aca(
obs_pts, tris, obs_start, obs_end, src_start, src_end, nu, tol, max_iter
)

The parameters are the same as cutde.disp_block with the addition of tol and max_iter. The tolerance, tol, is specified as an array of length N_BLOCKS in terms of the Frobenius norm of the error matrix between the true matrix and the approximation. The algorithm is not guaranteed to reach the specified tolerance but should come very close. The maximum number of iterations (equal to the maximum rank of the approximation) is also specified as an array of length N_BLOCKS.

The output disp_appxs will be a list of (U, V) pairs representing the left and right vectors of the low rank approximation. To approximate a matrix vector product:

U, V = disp_appxs
y = U.dot(V.dot(x))

Installation

To install cutde itself run:

pip install cutde

Then, install either PyCUDA or PyOpenCL following the directions below.

PyCUDA

If you have an NVIDIA GPU, install PyCUDA with:

conda config --prepend channels conda-forge
conda install -c conda-forge pycuda

Mac OS X

Install PyOpenCL and the PoCL OpenCL driver with:

conda config --prepend channels conda-forge
conda install pocl pyopencl

Ubuntu + PyOpenCL/PoCL

Just like on a Mac:

conda config --prepend channels conda-forge
conda install pocl pyopencl

Ubuntu + PyOpenCL with system drivers**

conda install pyopencl ocl-icd ocl-icd-system

You will need to install the system OpenCL drivers yourself depending on the hardware you have. See the "Something else" section below.

Windows

I'm not aware of anyone testing cutde on Windows yet. It should not be difficult to install. I would expect that you install pyopencl via conda and then install the OpenCL libraries and drivers that are provided by your hardware vendor. See the "Something else" section below.

Something else

I'd suggest starting by trying the instructions for the system most similar to yours above. If that doesn't work, never fear! OpenCL should be installable on almost all recent hardware and typical operating systems. These directions can be helpful.. I am happy to try to help if you have OpenCL installation issues, but I can't promise to be useful.

Why can't I use Apple CPU OpenCL?

You might have gotten the message: cutde does not support the Apple CPU OpenCL implementation and no other platform or device was found. Please consult the cutde README.

The Apple OpenCL implementation for Intel CPUs has very poor support for the OpenCL standard and causes lots of difficult-to-resolve errors. Instead, please use the PoCL implementation. You can install it with conda install -c conda-forge pocl.

Development

For developing cutde, clone the repo and set up your conda environment based on the environment.yml with:

conda env create

Next, install either pycuda or pyopencl as instructed in the Installation section above.

Then, you should re-generate the baseline test data derived from the MATLAB code from Mehdi Nikhoo. To do this, first install octave. On Ubuntu, this is just:

sudo apt-get install octave

And run

./tests/setup_test_env

which will run the tests/matlab/gen_test_data.m script.

Finally, to check that cutde is working properly, run pytest!

The library is extremely simple:

• cutde.fullspace - the main entrypoint.
• fullspace.cu - a direct translation of the original MATLAB into CUDA/OpenCL. This probably should not be modified.
• cutde.gpu - a layer that abstracts between CUDA and OpenCL
• cutde.cuda - the PyCUDA interface.
• cutde.opencl - the PyOpenCL interface.

The tests/tde_profile.py script is useful for assessing performance.

Some tests are marked as slow. To run these, run pytest --runslow.

If you have both CUDA and OpenCL installed, cutde will default to using CUDA. To use OpenCL instead, run export CUTDE_USE_OPENCL=1 to set the environment flag before launching the Python session that will use cutde.

The README.md is auto-generated from a template in docs/. To run this process, run docs/build_docs.

Project details

This version 21.8.22 21.8.20 21.8.13 21.8.11 21.6.14 21.6.12 21.4.13 21.4.5.5 21.4.5.4 21.4.5.3 21.4.5.2 21.4.5.1 21.4.5 21.4.1 21.3.31.1 21.3.31 18.5.29 18.5.17.2 18.5.15 18.5.10.2 18.5.10.1