Tracking, debugging, and patching non-determinism in TensorFlow

TensorFlow Determinism

This repository serves three purposes:

Provide up-to-date information (in this file) about non-determinism sources and solutions in TensorFlow and beyond, with a focus on determinism when running on GPUs.
Provide a patch to attain various levels of GPU-specific determinism in stock TensorFlow, via the installation of the tensorflow-determinism pip package.
Be the location where a TensorFlow determinism debug tool will be released as part of the tensorflow-determinism pip package.

For more information, please watch the video of the GTC 2019 talk Determinism in Deep Learning. The description under that video also includes links to the slides from the talk and to a poster presentation on this topic.

Installation

Use pip to install:

pip install tensorflow-determinism

This will install a package that can be imported as tfdeterminism. Installing tensorflow-determinism does not automatically install TensorFlow, so that you can choose which version of TensorFlow to use. You will need to install that version before you can import and use tfdeterminism.

Deterministic TensorFlow Solutions

There are currently two main ways to access GPU-deterministic functionality in TensorFlow for most deep learning applications. The first way is to use an NVIDIA NGC TensorFlow container. The second way is to use version 1.14, 1.15, or 2.0 of stock TensorFlow with GPU support, plus the application of a patch supplied in this repo.

The longer-term plan is to upstream all solutions into stock TensorFlow.

Determinism is not guaranteed when XLA JIT compilation is enabled.

NVIDIA NGC TensorFlow Containers

NGC TensorFlow containers, starting with version 19.06, implement GPU-deterministic TensorFlow functionality. In Python code running inside the container, this can be enabled as follows:

import tensorflow as tf
import os
os.environ['TF_DETERMINISTIC_OPS'] = '1'
# Now build your graph and train it

The following table shows which version of TensorFlow each NGC container version is based on:

| NGC Container Version | TensorFlow Version |
|-----------------------|--------------------|
| 19.06                 | 1.13               |
| 19.07 - 19.09         | 1.14               |

For information about pulling and running the NVIDIA NGC containers, see these instructions.

Stock TensorFlow

Versions 1.14, 1.15, and 2.0 of stock TensorFlow implement a reduced form of GPU determinism, which must be supplemented with a patch provided in this repo. The following Python code assumes a machine on which the pip package tensorflow-gpu==2.0.0 has been installed correctly and on which tensorflow-determinism has also been installed (as shown in the installation section above).

import tensorflow as tf
from tfdeterminism import patch
patch()
# use tf as normal

Stock TensorFlow with GPU support can be installed as follows:

pip install tensorflow-gpu==2.0.0

The TensorFlow project includes detailed instructions for installing TensorFlow with GPU support.

Additional Ingredients in the Determinism Recipe

You'll also need to set any and all appropriate random seeds:

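import os, random
import numpy as np
import tensorflow as tf
SEED = 123  # example value; any fixed integer will do
os.environ['PYTHONHASHSEED'] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)
tf.set_random_seed(SEED)  # TF 1.x; in TF 2.0 use tf.random.set_seed(SEED)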
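For code that has to run under both TensorFlow 1.x and 2.x, the seeding steps can be wrapped in one function. The following is a minimal sketch, not part of tfdeterminism: the name seed_everything is hypothetical, and it assumes that tf.random.set_seed (TF 2.x) and tf.set_random_seed (TF 1.x) are the only TensorFlow seeding calls needed. NumPy and TensorFlow are seeded only if they are installed.

```python
import os
import random


def seed_everything(seed):
    """Seed every available random number generator (hypothetical helper)."""
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is not installed; nothing to seed
    try:
        import tensorflow as tf
    except ImportError:
        return  # TensorFlow is not installed; nothing more to seed
    if hasattr(tf.random, 'set_seed'):
        tf.random.set_seed(seed)  # TensorFlow 2.x
    else:
        tf.set_random_seed(seed)  # TensorFlow 1.x


# Re-seeding with the same value reproduces the same random draw.
seed_everything(123)
first = random.random()
seed_everything(123)
second = random.random()
print(first == second)
```

Calling the function once, before any graph or model is built, keeps all of the seeds in one place.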

If you're using Horovod for multi-GPU training, you may need to disable Tensor Fusion (assuming that the non-determinism associated with Tensor Fusion has not yet been resolved):

os.environ['HOROVOD_FUSION_THRESHOLD']='0'

Detailed Status of Determinism in TensorFlow and Beyond

Confirmed and likely sources of non-determinism, along with any existing solutions, are being tracked here.

GPU-Specific Sources of Non-Determinism

Historic GPU-Specific Sources of Non-Determinism

In the past, tf.math.reduce_sum and tf.math.reduce_mean operated non-deterministically when running on a GPU. This was resolved before TensorFlow version 1.12. These ops now function deterministically by default when running on a GPU.

Confirmed Current GPU-Specific Sources of Non-Determinism (With Solutions)

| Source | NGC 19.06+ / TF 2.1 | TF 1.14, 1.15, 2.0 |
|--------|---------------------|--------------------|
| TF auto-tuning of cuDNN convolution algorithms | TCD or TDO | TCD or TDP |
| cuDNN convolution backprop to weight gradients | TCD or TDO | TCD or TDP |
| cuDNN convolution backprop to data gradients | TCD or TDO | TCD or TDP |
| cuDNN max-pooling backprop | TCD or TDO | TCD or TDP |
| `tf.nn.bias_add` backprop (see XLA note) | TDO | TDP |
| `tf.image.resize_bilinear` fwd and bwd | NS1 | NS1 |

Key to the solutions referenced above:

| Solution | Description |
|----------|-------------|
| TCD | Set environment variable `TF_CUDNN_DETERMINISTIC` to '1' or 'true'. Also do not set environment variable `TF_USE_CUDNN_AUTOTUNE` at all (and particularly do not set it to '0' or 'false'). |
| TDO | Set environment variable `TF_DETERMINISTIC_OPS` to '1' or 'true'. Also do not set environment variable `TF_USE_CUDNN_AUTOTUNE` at all (and particularly do not set it to '0' or 'false'). |
| TDP | Apply `tfdeterminism.patch`. Note that solution TDO will be in stock TensorFlow v2.1 (see PR 31465). |
| NS1 | There is currently no solution available for this, but one is under development. |

Notes:

XLA: These solutions will not work when XLA JIT compilation is enabled.
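Solutions TCD and TDO in the key above both come down to environment settings, so they can be applied in one place. The following is a minimal sketch, not part of tfdeterminism: the function name enable_gpu_determinism is hypothetical, and it assumes that it is called before TensorFlow is imported and starts running ops.

```python
import os


def enable_gpu_determinism(solution='TDO'):
    """Apply solution TCD or TDO from the key above (hypothetical helper)."""
    variables = {
        'TCD': 'TF_CUDNN_DETERMINISTIC',
        'TDO': 'TF_DETERMINISTIC_OPS',
    }
    if solution not in variables:
        raise ValueError('solution must be one of: ' + ', '.join(sorted(variables)))
    os.environ[variables[solution]] = '1'
    # Both solutions require TF_USE_CUDNN_AUTOTUNE to be unset entirely,
    # not merely set to '0' or 'false', so remove it if it is present.
    os.environ.pop('TF_USE_CUDNN_AUTOTUNE', None)


enable_gpu_determinism('TDO')
# Now import TensorFlow, build your graph, and train it.
```

Solution TDP is not covered here because it is applied by calling tfdeterminism.patch rather than by setting an environment variable.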

Other Possible GPU-Specific Sources of Non-Determinism

Going beyond the above-mentioned sources, in version 1.12 of TensorFlow (and also in the master branch on 2019-03-03, after release 1.13.1), the following files call CUDA atomicAdd either directly or indirectly. This makes them candidates for the injection of non-determinism.

crop_and_resize_op_gpu.cu.cc
scatter_functor_gpu.cu.h
scatter_nd_op_gpu.cu.cc
sparse_tensor_dense_matmul_op_gpu.cu.cc
resize_nearest_neighbor_op_gpu.cu.cc
segment_reduction_ops.h
segment_reduction_ops_gpu.cu.cc
dilation_ops_gpu.cu.cc
maxpooling_op_gpu.cu.cc
svd_op_gpu.cu.cc
cuda_kernel_helper_test.cu.cc
depthwise_conv_op_gpu.h
resampler_ops_gpu.cu.cc
histogram_op_gpu.cu.cc
stateful_random_ops_gpu.cu.cc

Unless you are using TensorFlow ops that depend on these files (i.e. ops with similar names), your model will not be affected by these potential sources of non-determinism.

Beyond atomicAdd, there are ten other CUDA atomic functions whose use could lead to the injection of non-determinism, such as atomicCAS (the most generic, atomic compare and swap). Note also that the word 'atomic' was present in 167 files in the TensorFlow repo and some of these may be related to the use of CUDA atomic operations. It's important to remember that CUDA atomic operations can be used without injecting non-determinism, so the presence of CUDA atomic operations in op code does not guarantee that the op injects non-determinism into the computation.
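One reason that atomicAdd can inject non-determinism is that floating-point addition is not associative: when GPU threads race to add into the same location, the order of the additions can vary from run to run, and so can the rounded result. The following is a minimal pure-Python sketch of that effect, not TensorFlow or CUDA code. The function accumulate and the sample values are illustrative only.

```python
def accumulate(values):
    """Add floats strictly left to right, one addend at a time."""
    total = 0.0
    for value in values:
        total += value
    return total


# The same four addends, arriving in two different orders.
order_a = [1e16, 1.0, -1e16, 1.0]
order_b = [1.0, 1.0, 1e16, -1e16]

print(accumulate(order_a))  # prints 1.0
print(accumulate(order_b))  # prints 2.0
```

In the first ordering, adding 1.0 to 1e16 is lost to rounding. In the second, the two 1.0 values are combined before the large values cancel. By contrast, integer addition is associative, which is one way that an atomic operation can remain deterministic.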

Sources of Non-Determinism in TensorFlow Unrelated to GPU

Issue 29101: Random seed not set in graph context of Dataset#map. This may have been resolved in version 1.14 of TensorFlow.
tf.data.Dataset with more than one worker. The work-around is to use only one worker.

Sources of Non-Determinism Beyond TensorFlow

TensorRT timing-based kernel schedule. Each time an inference engine is generated, it could be slightly different, particularly if there is varying load on the machine used to run TensorRT. There is a solution planned for this.
Horovod Tensor Fusion. Work-around: disable Tensor Fusion by setting the environment variable HOROVOD_FUSION_THRESHOLD to '0'. This issue may have been resolved by Horovod pull-request 1130 (not yet confirmed).

Relevant Links

This section catalogs relevant links.

TensorFlow Issues

| Number | Title | Updated |
|--------|-------|---------|
| 2652 | Backward pass of broadcasting on GPU is non-deterministic | 2019-10-08 |
| 2732 | Mention that GPU reductions are nondeterministic in docs | 2019-10-08 |
| 13932 | Non-determinism from `tf.data.Dataset.map` with random ops | |
| 16889 | Problems Getting TensorFlow to behave Deterministically | 2019-10-08 |
| 18096 | Feature Request: Support for configuring deterministic options of cuDNN conv routines | 2019-10-08 |
| 29101 | Random seed not set in graph context of `Dataset#map` | |

TensorFlow Pull Requests

| Number | Title | Status | Updated |
|--------|-------|--------|---------|
| 10636 | Non-determinism Docs | closed (not merged) | 2019-10-08 |
| 24273 | Enable dataset.map to respect seeds from the outer context | closed (not merged) | N/A |
| 24747 | Add cuDNN deterministic env variable (only for convolution). | merged pre-1.14 | N/A |
| 25269 | Add deterministic cuDNN max-pooling | merged pre-1.14 | N/A |
| 25796 | Added tests for `TF_CUDNN_DETERMINISTIC` | merged pre-1.14 | N/A |
| 29667 | Add release note about `TF_CUDNN_DETERMINISTIC` | merged into r1.14 | N/A |
| 31389 | Enhance release notes related to `TF_CUDNN_DETERMINISTIC` | merged into r1.14 | N/A |
| 31465 | Add GPU-deterministic `tf.nn.bias_add` | merged pre-2.1 | N/A |
| 32979 | Fix typo in release note | closed (not merged) | N/A |
| 33483 | Fix small typo in v2.0.0 release note | | N/A |

Miscellaneous

Two Sigma: A Workaround for Non-Determinism in TensorFlow
Keras issue 12800: Unable to get reproducible results using Keras with TF backend on GPU (updated on 2019-10-08)
PyTorch Reproducibility (from the official documentation)
Chainer PR 2710: cuDNN Deterministic mode
Stack Overflow: Tensorflow: Different results with the same random seed
Stack Overflow: Are tensorflow random values guaranteed to be the same inside a single run? (comment) (updated 2019-10-10).
