A library for exploring and validating machine learning data.

TensorFlow Data Validation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets).
  • Automated data-schema generation to describe expectations about data, such as required values, ranges, and vocabularies.
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify issues such as missing features, out-of-range values, or wrong feature types.
  • An anomalies viewer so that you can see which features have anomalies and learn more in order to correct them.
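
The features listed above map onto a short workflow: compute statistics, infer a schema, then validate new data against it. The following is a minimal sketch rather than a definitive recipe. The function name validate_csv and its two path parameters are hypothetical, and it assumes both inputs are CSV files with a header row.

```python
def validate_csv(train_csv: str, eval_csv: str):
    """Infer a schema from training data and check evaluation data against it."""
    # Imported lazily so this module still loads where TFDV is not installed.
    import tensorflow_data_validation as tfdv

    # Summary statistics for each dataset.
    train_stats = tfdv.generate_statistics_from_csv(data_location=train_csv)
    eval_stats = tfdv.generate_statistics_from_csv(data_location=eval_csv)

    # A schema describing what the training data looks like.
    schema = tfdv.infer_schema(statistics=train_stats)

    # Anomalies are deviations of the evaluation statistics from that schema.
    anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
    return schema, anomalies
```

In a notebook, tfdv.visualize_statistics, tfdv.display_schema, and tfdv.display_anomalies render the corresponding viewers for these results.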

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation

Nightly Packages

TFDV also hosts nightly packages at https://pypi-nightly.tensorflow.org on Google Cloud. To install the latest nightly package, please use the following command:

export TFX_DEPENDENCY_SELECTOR=NIGHTLY
pip install --extra-index-url https://pypi-nightly.tensorflow.org/simple tensorflow-data-validation

This will install the nightly packages for the major dependencies of TFDV such as TFX Basic Shared Libraries (TFX-BSL) and TensorFlow Metadata (TFMD).

Sometimes TFDV uses those dependencies' most recent changes, which are not yet released. Because of this, it is safer to use nightly versions of those dependent libraries when using nightly TFDV. Export the TFX_DEPENDENCY_SELECTOR environment variable to do so.
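
When the install is driven from a Python script instead of a shell, the same two steps can be expressed as an argument list plus an environment. This is a sketch only. The helper name nightly_install_invocation is hypothetical, and it builds the command without running it.

```python
import os
import sys

NIGHTLY_INDEX_URL = "https://pypi-nightly.tensorflow.org/simple"


def nightly_install_invocation(package="tensorflow-data-validation"):
    """Return (argv, env) for a nightly install; nothing is executed here."""
    env = dict(os.environ)
    # Equivalent of `export TFX_DEPENDENCY_SELECTOR=NIGHTLY` in the shell.
    env["TFX_DEPENDENCY_SELECTOR"] = "NIGHTLY"
    argv = [
        sys.executable, "-m", "pip", "install",
        "--extra-index-url", NIGHTLY_INDEX_URL,
        package,
    ]
    return argv, env
```

Passing the pair to subprocess.run(argv, env=env, check=True) would perform the install in the current interpreter's environment.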

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

First, install Docker and Docker Compose by following their official installation instructions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {37, 38, 39}.

A wheel will be produced under dist/.
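
If several builds end up in dist/, the wheel for a given PYTHON_VERSION can be picked out by its Python tag. The sketch below assumes the standard wheel filename layout (name-version-pythontag-abitag-platform.whl). The helper name find_built_wheels is hypothetical.

```python
from pathlib import Path


def find_built_wheels(dist_dir, python_version):
    """List wheels in dist_dir tagged for CPython `python_version`, e.g. "38"."""
    tag = f"cp{python_version}"
    matches = []
    for path in Path(dist_dir).glob("*.whl"):
        parts = path.name.split("-")
        # The Python tag is third from the end, even if a build tag is present.
        if len(parts) >= 5 and parts[-3] == tag:
            matches.append(path)
    return sorted(matches)
```

For example, find_built_wheels("dist", "39") would return only the wheels built with PYTHON_VERSION=39.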

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

The TFDV wheel is Python-version dependent. To build a pip package that works for a specific Python version, use that version's Python binary to run:

python setup.py bdist_wheel

You can find the generated .whl file in the dist subdirectory.

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.14.6 (Mojave) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Notable Dependencies

TensorFlow is required.

Apache Beam is required; it's the way that efficient distributed computation is supported. By default, Apache Beam runs in local mode but can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized NumPy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

| tensorflow-data-validation | apache-beam[gcp] | pyarrow | tensorflow | tensorflow-metadata | tensorflow-transform | tfx-bsl |
|---|---|---|---|---|---|---|
| GitHub master | 2.40.0 | 6.0.0 | nightly (1.x/2.x) | 1.10.0 | n/a | 1.10.1 |
| 1.10.0 | 2.40.0 | 6.0.0 | 1.15 / 2.9 | 1.10.0 | n/a | 1.10.1 |
| 1.9.0 | 2.38.0 | 5.0.0 | 1.15 / 2.9 | 1.9.0 | n/a | 1.9.0 |
| 1.8.0 | 2.38.0 | 5.0.0 | 1.15 / 2.8 | 1.8.0 | n/a | 1.8.0 |
| 1.7.0 | 2.36.0 | 5.0.0 | 1.15 / 2.8 | 1.7.0 | n/a | 1.7.0 |
| 1.6.0 | 2.35.0 | 5.0.0 | 1.15 / 2.7 | 1.6.0 | n/a | 1.6.0 |
| 1.5.0 | 2.34.0 | 5.0.0 | 1.15 / 2.7 | 1.5.0 | n/a | 1.5.0 |
| 1.4.0 | 2.32.0 | 4.0.1 | 1.15 / 2.6 | 1.4.0 | n/a | 1.4.0 |
| 1.3.0 | 2.32.0 | 2.0.0 | 1.15 / 2.6 | 1.2.0 | n/a | 1.3.0 |
| 1.2.0 | 2.31.0 | 2.0.0 | 1.15 / 2.5 | 1.2.0 | n/a | 1.2.0 |
| 1.1.1 | 2.29.0 | 2.0.0 | 1.15 / 2.5 | 1.1.0 | n/a | 1.1.1 |
| 1.1.0 | 2.29.0 | 2.0.0 | 1.15 / 2.5 | 1.1.0 | n/a | 1.1.0 |
| 1.0.0 | 2.29.0 | 2.0.0 | 1.15 / 2.5 | 1.0.0 | n/a | 1.0.0 |
| 0.30.0 | 2.28.0 | 2.0.0 | 1.15 / 2.4 | 0.30.0 | n/a | 0.30.0 |
| 0.29.0 | 2.28.0 | 2.0.0 | 1.15 / 2.4 | 0.29.0 | n/a | 0.29.0 |
| 0.28.0 | 2.28.0 | 2.0.0 | 1.15 / 2.4 | 0.28.0 | n/a | 0.28.1 |
| 0.27.0 | 2.27.0 | 2.0.0 | 1.15 / 2.4 | 0.27.0 | n/a | 0.27.0 |
| 0.26.1 | 2.28.0 | 0.17.0 | 1.15 / 2.3 | 0.26.0 | 0.26.0 | 0.26.0 |
| 0.26.0 | 2.25.0 | 0.17.0 | 1.15 / 2.3 | 0.26.0 | 0.26.0 | 0.26.0 |
| 0.25.0 | 2.25.0 | 0.17.0 | 1.15 / 2.3 | 0.25.0 | 0.25.0 | 0.25.0 |
| 0.24.1 | 2.24.0 | 0.17.0 | 1.15 / 2.3 | 0.24.0 | 0.24.1 | 0.24.1 |
| 0.24.0 | 2.23.0 | 0.17.0 | 1.15 / 2.3 | 0.24.0 | 0.24.0 | 0.24.0 |
| 0.23.1 | 2.24.0 | 0.17.0 | 1.15 / 2.3 | 0.23.0 | 0.23.0 | 0.23.0 |
| 0.23.0 | 2.23.0 | 0.17.0 | 1.15 / 2.3 | 0.23.0 | 0.23.0 | 0.23.0 |
| 0.22.2 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.1 |
| 0.22.1 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.1 |
| 0.22.0 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.0 |
| 0.21.5 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.1 | 0.21.3 |
| 0.21.4 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.1 | 0.21.3 |
| 0.21.2 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0 |
| 0.21.1 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0 |
| 0.21.0 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0 |
| 0.15.0 | 2.16.0 | 0.14.0 | 1.15 / 2.0 | 0.15.0 | 0.15.0 | 0.15.0 |
| 0.14.1 | 2.14.0 | 0.14.0 | 1.14 | 0.14.0 | 0.14.0 | n/a |
| 0.14.0 | 2.14.0 | 0.14.0 | 1.14 | 0.14.0 | 0.14.0 | n/a |
| 0.13.1 | 2.11.0 | n/a | 1.13 | 0.12.1 | 0.13.0 | n/a |
| 0.13.0 | 2.11.0 | n/a | 1.13 | 0.12.1 | 0.13.0 | n/a |
| 0.12.0 | 2.10.0 | n/a | 1.12 | 0.12.1 | 0.12.0 | n/a |
| 0.11.0 | 2.8.0 | n/a | 1.11 | 0.9.0 | 0.11.0 | n/a |
| 0.9.0 | 2.6.0 | n/a | 1.9 | n/a | n/a | n/a |
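
One way to use the table is to compare an environment's installed packages against the row for its TFDV version. The sketch below copies only two rows for brevity. The names TESTED, installed_version, and mismatches are hypothetical. It flags any difference from the tested version, which is stricter than necessary since untested combinations may also work.

```python
from importlib import metadata

# Two rows copied from the table: TFDV version -> tested companion versions.
TESTED = {
    "1.10.0": {"apache-beam": "2.40.0", "pyarrow": "6.0.0",
               "tensorflow-metadata": "1.10.0", "tfx-bsl": "1.10.1"},
    "1.9.0": {"apache-beam": "2.38.0", "pyarrow": "5.0.0",
              "tensorflow-metadata": "1.9.0", "tfx-bsl": "1.9.0"},
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def mismatches(tfdv_version, installed=installed_version):
    """Map package -> (tested, found) for each package that differs from the row."""
    row = TESTED[tfdv_version]
    found = {pkg: installed(pkg) for pkg in row}
    return {pkg: (want, found[pkg]) for pkg, want in row.items() if found[pkg] != want}
```

An empty result from mismatches("1.10.0") means the environment matches the tested combination for that release.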

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release. See the tutorial on generating distribution archives.

Built Distributions

tensorflow_data_validation-1.10.0-cp39-cp39-win_amd64.whl (1.3 MB)

Uploaded CPython 3.9 Windows x86-64

tensorflow_data_validation-1.10.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.5 MB)

Uploaded CPython 3.9 manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-1.10.0-cp39-cp39-macosx_10_14_x86_64.whl (1.7 MB)

Uploaded CPython 3.9 macOS 10.14+ x86-64

tensorflow_data_validation-1.10.0-cp38-cp38-win_amd64.whl (1.3 MB)

Uploaded CPython 3.8 Windows x86-64

tensorflow_data_validation-1.10.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.5 MB)

Uploaded CPython 3.8 manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-1.10.0-cp38-cp38-macosx_10_9_x86_64.whl (1.7 MB)

Uploaded CPython 3.8 macOS 10.9+ x86-64

tensorflow_data_validation-1.10.0-cp37-cp37m-win_amd64.whl (1.3 MB)

Uploaded CPython 3.7m Windows x86-64

tensorflow_data_validation-1.10.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.5 MB)

Uploaded CPython 3.7m manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-1.10.0-cp37-cp37m-macosx_10_9_x86_64.whl (1.7 MB)

Uploaded CPython 3.7m macOS 10.9+ x86-64

File details

Details for the file tensorflow_data_validation-1.10.0-cp39-cp39-win_amd64.whl.

File hashes

Hashes for tensorflow_data_validation-1.10.0-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 9d38042e1286eaaab31c97ab3ceaf9cb275598b38f33799be5b94c6a00cc68f7
MD5 5ebd9d93b5ee27c909f6cded57f45c5d
BLAKE2b-256 f45758583f764bd4e0e4c196eab0834bde0d79780bd6caf8d29e58ff8e7dd491

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.10.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-1.10.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 b2b8eba2c87d916117bcc45ce1b28ee96096774a2a171fa7b4ebef09cb906f68
MD5 8bf5f5685fe59e9d7a79e7362ab495ba
BLAKE2b-256 a097cbf19900a7bb7bd4feb906c65c7ecec0d99295a34fbdb0eb7dcbfac76d70

File details

Details for the file tensorflow_data_validation-1.10.0-cp39-cp39-macosx_10_14_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-1.10.0-cp39-cp39-macosx_10_14_x86_64.whl
Algorithm Hash digest
SHA256 fa7e8f63cb72adc916e5d28da9deecfd5fd659f7eea584a0eaeebfaaec215118
MD5 35c4a6ba4eb44b31d419461f133e7ecb
BLAKE2b-256 540cad682ce53e510928a9490f7811a1c49787ab0462cd62cf3ec7b533c032e5

File details

Details for the file tensorflow_data_validation-1.10.0-cp38-cp38-win_amd64.whl.

File hashes

Hashes for tensorflow_data_validation-1.10.0-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 b36cb94101ce72a99bc99333b71d7e97e0707f30f38abb8e6283e759d6bfe784
MD5 bd9aa90c0c3fa528cc1297a23c608eea
BLAKE2b-256 6de53bc34f6861366167b07c3d9dd4aa35adb58b70df102f067a63ef86bc95e6

File details

Details for the file tensorflow_data_validation-1.10.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-1.10.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 2f66ad1c0ab593673ce30d62279566748081d85cf4162efec627498ea3c32a5e
MD5 65a5d1c1d8f0aa47be071fbd0ccf8334
BLAKE2b-256 3375a8a43a6942b175e50e78bd8afa99e8e246b30ab8b5556041dadf5c123cdb

File details

Details for the file tensorflow_data_validation-1.10.0-cp38-cp38-macosx_10_9_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-1.10.0-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 832cf39d3010feec86fb5fa960bc854f09ebe1d3c4eb11f5c145d41bf9f6e73d
MD5 9a1562cbde4fb7baf37665d814ff0cb1
BLAKE2b-256 7507b0d97da2f6b2ed2bead9091ba9e25d4cc542e312b7c0613310af6feb4cc8

File details

Details for the file tensorflow_data_validation-1.10.0-cp37-cp37m-win_amd64.whl.

File hashes

Hashes for tensorflow_data_validation-1.10.0-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 eefe23e9fdee011ccd6db622bfafa1a6c2c939d50659c8c0348364f174211281
MD5 f0e11c3bca18669c42b77de0933c6f11
BLAKE2b-256 38255db2cfcec0cda456c0cdc31001878982ed874218c81e8ab464932255a8c9

File details

Details for the file tensorflow_data_validation-1.10.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-1.10.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 6fbc6617aadb86e3afe801e8d13b1f1def651c2540b78470c37435381c6848b8
MD5 902523f7e7751b841107d99ffd4acd19
BLAKE2b-256 e2fad3eceaa900f2e90fb6ad37aaf44ceaefaab9a1eceb64f0bbef36f87ac7d0

File details

Details for the file tensorflow_data_validation-1.10.0-cp37-cp37m-macosx_10_9_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-1.10.0-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 e709f0a567eac407edc3a5213bbf997332aedfa6b6127d18e933e947766344da
MD5 95d9b65a294d014e8b6ec6aaae5fb0b0
BLAKE2b-256 f61a2ddc142ec60f807f772df01dfccc578fe07149fe28ed588a3bc544cf6cd6

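
The SHA256 digests listed above can be checked against a downloaded wheel before installing it. This is a minimal sketch using only the standard library. The helper names sha256_of and matches_published_hash are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches_published_hash(wheel_path, digest) should return True when wheel_path is an intact download and digest is the SHA256 value listed for that file.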
