Skip to main content

A library for exploring and validating machine learning data.

Project description

TensorFlow Data Validation

Python PyPI Documentation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets)
  • Automated data-schema generation to describe expectations about data like required values, ranges, and vocabularies
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify anomalies, such as missing features, out-of-range values, or wrong feature types, to name a few.
  • An anomalies viewer so that you can see what features have anomalies and learn more in order to correct them.

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation

Nightly Packages

TFDV also hosts nightly packages at https://pypi-nightly.tensorflow.org on Google Cloud. To install the latest nightly package, please use the following command:

pip install -i https://pypi-nightly.tensorflow.org/simple tensorflow-data-validation

This will install the nightly packages for the major dependencies of TFDV such as TFX Basic Shared Libraries (TFX-BSL) and TensorFlow Metadata (TFMD).

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

Please first install docker and docker-compose by following the directions: docker; docker-compose.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {36, 37, 38}.

A wheel will be produced under dist/.

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

TFDV wheel is Python version dependent -- to build the pip package that works for a specific Python version, use that Python binary to run:

python setup.py bdist_wheel

You can find the generated .whl file in the dist subdirectory.

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.14.6 (Mojave) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Notable Dependencies

TensorFlow is required.

Apache Beam is required; it's the way that efficient distributed computation is supported. By default, Apache Beam runs in local mode but can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized numpy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

tensorflow-data-validation apache-beam[gcp] pyarrow tensorflow tensorflow-metadata tensorflow-transform tfx-bsl
GitHub master 2.32.0 2.0.0 nightly (1.x/2.x) 1.2.0 n/a 1.3.0
1.3.0 2.32.0 2.0.0 1.15 / 2.6 1.2.0 n/a 1.3.0
1.2.0 2.31.0 2.0.0 1.15 / 2.5 1.2.0 n/a 1.2.0
1.1.1 2.29.0 2.0.0 1.15 / 2.5 1.1.0 n/a 1.1.1
1.1.0 2.29.0 2.0.0 1.15 / 2.5 1.1.0 n/a 1.1.0
1.0.0 2.29.0 2.0.0 1.15 / 2.5 1.0.0 n/a 1.0.0
0.30.0 2.28.0 2.0.0 1.15 / 2.4 0.30.0 n/a 0.30.0
0.29.0 2.28.0 2.0.0 1.15 / 2.4 0.29.0 n/a 0.29.0
0.28.0 2.28.0 2.0.0 1.15 / 2.4 0.28.0 n/a 0.28.1
0.27.0 2.27.0 2.0.0 1.15 / 2.4 0.27.0 n/a 0.27.0
0.26.1 2.28.0 0.17.0 1.15 / 2.3 0.26.0 0.26.0 0.26.0
0.26.0 2.25.0 0.17.0 1.15 / 2.3 0.26.0 0.26.0 0.26.0
0.25.0 2.25.0 0.17.0 1.15 / 2.3 0.25.0 0.25.0 0.25.0
0.24.1 2.24.0 0.17.0 1.15 / 2.3 0.24.0 0.24.1 0.24.1
0.24.0 2.23.0 0.17.0 1.15 / 2.3 0.24.0 0.24.0 0.24.0
0.23.1 2.24.0 0.17.0 1.15 / 2.3 0.23.0 0.23.0 0.23.0
0.23.0 2.23.0 0.17.0 1.15 / 2.3 0.23.0 0.23.0 0.23.0
0.22.2 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.1
0.22.1 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.1
0.22.0 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.0
0.21.5 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.1 0.21.3
0.21.4 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.1 0.21.3
0.21.2 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.21.1 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.21.0 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.15.0 2.16.0 0.14.0 1.15 / 2.0 0.15.0 0.15.0 0.15.0
0.14.1 2.14.0 0.14.0 1.14 0.14.0 0.14.0 n/a
0.14.0 2.14.0 0.14.0 1.14 0.14.0 0.14.0 n/a
0.13.1 2.11.0 n/a 1.13 0.12.1 0.13.0 n/a
0.13.0 2.11.0 n/a 1.13 0.12.1 0.13.0 n/a
0.12.0 2.10.0 n/a 1.12 0.12.1 0.12.0 n/a
0.11.0 2.8.0 n/a 1.11 0.9.0 0.11.0 n/a
0.9.0 2.6.0 n/a 1.9 n/a n/a n/a

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.

Links

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

tensorflow_data_validation-1.3.0-cp38-cp38-win_amd64.whl (1.2 MB view details)

Uploaded CPython 3.8 Windows x86-64

tensorflow_data_validation-1.3.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-1.3.0-cp38-cp38-macosx_10_9_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.8 macOS 10.9+ x86-64

tensorflow_data_validation-1.3.0-cp37-cp37m-win_amd64.whl (1.2 MB view details)

Uploaded CPython 3.7m Windows x86-64

tensorflow_data_validation-1.3.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.7m manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-1.3.0-cp37-cp37m-macosx_10_9_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.7m macOS 10.9+ x86-64

tensorflow_data_validation-1.3.0-cp36-cp36m-win_amd64.whl (1.2 MB view details)

Uploaded CPython 3.6m Windows x86-64

tensorflow_data_validation-1.3.0-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.6m manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-1.3.0-cp36-cp36m-macosx_10_9_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.6m macOS 10.9+ x86-64

File details

Details for the file tensorflow_data_validation-1.3.0-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-1.3.0-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 1.2 MB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.5

File hashes

Hashes for tensorflow_data_validation-1.3.0-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 ec69a601de7daac590958809c1f62e75b0fd4cb5c702e87ed3fb579ec3e366e3
MD5 ba8eb72a8f4eab947c3614b92e753be1
BLAKE2b-256 c9cb44f6bca06755d092e52316b3681b3dca43673cb6e0322d0afddf2c706b5b

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.3.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.3.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 911679121d9c0a4b477cdfb89fc38fea5d8033764e2bf24f8e3bb6932106a8db
MD5 0ca2541b5674ffa2ccb57e29fdf47ab9
BLAKE2b-256 fe8187240aff23fd3abac9d779c1108062a612bd583e35bc838fe04606d1da5d

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.3.0-cp38-cp38-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.3.0-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 918de41ecf788af8ece0bf7e6cc95fc6d66e3075316c786d3705010cb73bffcf
MD5 f7d0ff5063ff55bd79f8f7334dc4a114
BLAKE2b-256 fde71a04db450097882bd6dd7a4c4f8a85b89c1ca1758211ae1b05b5f31a91b0

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.3.0-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-1.3.0-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 1.2 MB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.7.0

File hashes

Hashes for tensorflow_data_validation-1.3.0-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 4d67b13904e5897e5419555bae233b3d11b05ba0833397685bf6f1ef8c6f878d
MD5 c2feffdbd7675527f0a258dd2499c819
BLAKE2b-256 3066ac6d355634e10382891c9c0d9c69b3f047f3b92f7efe88cf8c8d6b20aea8

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.3.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.3.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 7da9604d85c050cd3d05aefca324770b80583cf2f5d5b52f3d560809774f099b
MD5 24110ac29bc20bd3d34669c1321a8177
BLAKE2b-256 438abadd7860ed4587a578c203e1cf14926104cf5eaa3a0351f16e2d796787d8

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.3.0-cp37-cp37m-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.3.0-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 0c28c8020564a46e66a4fed3d83256bf5e464f846b4e8fb4b8c1f905456a37d4
MD5 e00655ab6659444970ad7e71798663e9
BLAKE2b-256 86310432c1c143665c6035636f2480f749c2dd7f8f8001314bd45c08b8fd7245

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.3.0-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-1.3.0-cp36-cp36m-win_amd64.whl
  • Upload date:
  • Size: 1.2 MB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.6.5

File hashes

Hashes for tensorflow_data_validation-1.3.0-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 1dc2daf5b28c4ff5ddf27dd39736d1fab15164ecd7e8e70531748b3c67258ff8
MD5 5613b39eec117dc8d1d2ae6c05b1e517
BLAKE2b-256 2aa1fce84c6800d63808622200fcc22499b1ec0e7742ba82283ca011170b7a7e

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.3.0-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.3.0-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 f747e0275b1e635b86993bb24d7a70055e7fb9556194423e3d18900d9abd9bd3
MD5 fd0ef404ef2bf01a2e9a2a09c421d22b
BLAKE2b-256 6bc5bb30dca10f43c6d12389e925b496755e6743cb3ba2246712375272797363

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.3.0-cp36-cp36m-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.3.0-cp36-cp36m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 793398446df164692fa6cc07c6ce652adcfdb3ff9c86977a719ae1c162fb673e
MD5 f60a794b2def499abe013ebbeca12df6
BLAKE2b-256 b758aee980b083a95fbf8238e9a463364349e140f7d308db8a70d12a1eceef2b

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page