A library for exploring and validating machine learning data.

TensorFlow Data Validation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets).
  • Automated data-schema generation to describe expectations about data, such as required values, ranges, and vocabularies.
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify problems such as missing features, out-of-range values, or wrong feature types.
  • An anomalies viewer so that you can see which features have anomalies and learn more in order to correct them.

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.
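The workflow behind the feature list above (statistics, schema inference, anomaly detection) can be sketched in a few calls. This is a minimal sketch, not the guide's own example: the function name `validate_csv` and the CSV paths passed to it are hypothetical, and TFDV is imported lazily so the snippet loads even where the package is not installed.

```python
def validate_csv(train_csv: str, eval_csv: str):
    """Infer a schema from training data and check evaluation data against it.

    `train_csv` and `eval_csv` are hypothetical paths to CSV files with a header row.
    """
    # Lazy import: TFDV (and its TensorFlow/Beam dependencies) load only when called.
    import tensorflow_data_validation as tfdv

    # Summary statistics for each dataset.
    train_stats = tfdv.generate_statistics_from_csv(data_location=train_csv)
    eval_stats = tfdv.generate_statistics_from_csv(data_location=eval_csv)

    # Schema describing the expectations inferred from the training statistics.
    schema = tfdv.infer_schema(statistics=train_stats)

    # Anomalies: places where the evaluation statistics violate the schema.
    anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
    return schema, anomalies
```

In a notebook, the results can then be passed to `tfdv.display_schema(schema)` and `tfdv.display_anomalies(anomalies)` to render the viewers mentioned above.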

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation

Nightly Packages

TFDV also hosts nightly packages at https://pypi-nightly.tensorflow.org on Google Cloud. To install the latest nightly package, please use the following command:

export TFX_DEPENDENCY_SELECTOR=NIGHTLY
pip install --extra-index-url https://pypi-nightly.tensorflow.org/simple tensorflow-data-validation

This will install the nightly packages for the major dependencies of TFDV such as TFX Basic Shared Libraries (TFX-BSL) and TensorFlow Metadata (TFMD).

Sometimes TFDV uses those dependencies' most recent changes, which are not yet released. Because of this, it is safer to use nightly versions of those dependent libraries when using nightly TFDV. Export the TFX_DEPENDENCY_SELECTOR environment variable to do so.

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

First install Docker and docker-compose by following their respective installation directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {37, 38, 39}.

A wheel will be produced under dist/.
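The PYTHON_VERSION values above appear to be the interpreter's major and minor version joined without a dot (an inference from the set {37, 38, 39}, not something the build scripts are quoted as saying). Under that assumption, a small helper can derive the value for the running interpreter and check it against the documented set. The names `python_version_tag` and `is_supported` are hypothetical.

```python
import sys

# Values listed in the build instructions above.
SUPPORTED = {"37", "38", "39"}


def python_version_tag(version_info=sys.version_info) -> str:
    """Join major and minor version without a dot, e.g. (3, 9) -> '39'."""
    return f"{version_info[0]}{version_info[1]}"


def is_supported(tag: str) -> bool:
    """True if the tag is one of the documented PYTHON_VERSION values."""
    return tag in SUPPORTED


if __name__ == "__main__":
    tag = python_version_tag()
    status = "supported" if is_supported(tag) else "not in the documented set"
    print(f"PYTHON_VERSION={tag} ({status})")
```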

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

The TFDV wheel is specific to a Python version. To build a pip package that works for a particular Python version, use that Python binary to run:

python setup.py bdist_wheel

You can find the generated .whl file in the dist subdirectory.
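Because the wheel is tied to one Python version, it can help to check the tags in the wheel's filename before installing it. The sketch below parses the standard wheel filename layout (distribution, version, optional build tag, Python tag, ABI tag, platform tag) and compares the Python tag with an interpreter version. The helper names are hypothetical, and the check assumes CPython-style `cpXY` tags.

```python
import sys


def wheel_tags(filename: str) -> dict:
    """Split a wheel filename into its name, version, and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename}")
    parts = filename[: -len(".whl")].split("-")
    # Five fields, or six when an optional build tag is present.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename layout: {filename}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "distribution": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def matches_interpreter(filename: str, version_info=sys.version_info) -> bool:
    """True if the wheel's Python tag is cpXY for the given interpreter version."""
    expected = f"cp{version_info[0]}{version_info[1]}"
    return wheel_tags(filename)["python"] == expected
```

For example, `matches_interpreter("tensorflow_data_validation-1.12.0-cp39-cp39-win_amd64.whl", (3, 9))` is true, while the same filename checked against `(3, 8)` is not.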

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 12.5 (Monterey) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Notable Dependencies

TensorFlow is required.

Apache Beam is required; TFDV uses it to support efficient distributed computation. By default, Apache Beam runs in local mode but can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.
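Switching between local and distributed execution comes down to the Beam pipeline options passed to TFDV. The following is a rough sketch under stated assumptions: the helper names, the project, region, and bucket values are hypothetical, and it does not set up TFDV on the Dataflow workers, which a real distributed run also needs.

```python
def dataflow_flags(project: str, region: str, temp_location: str) -> list:
    """Build Beam command-line flags for the Dataflow runner (values are caller-supplied)."""
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
    ]


def generate_stats(data_location: str, output_path: str, beam_flags=None):
    """Compute statistics over TFRecord files, locally unless other flags are given."""
    # Lazy imports so this module loads without TFDV or Beam installed.
    import tensorflow_data_validation as tfdv
    from apache_beam.options.pipeline_options import PipelineOptions

    # With no flags supplied, run with Beam's local DirectRunner.
    flags = list(beam_flags) if beam_flags else ["--runner=DirectRunner"]
    return tfdv.generate_statistics_from_tfrecord(
        data_location=data_location,
        output_path=output_path,
        pipeline_options=PipelineOptions(flags=flags),
    )
```

Keeping the flag construction separate from the TFDV call makes it easy to use the local runner during development and pass `dataflow_flags(...)` only for large datasets.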

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized NumPy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

tensorflow-data-validation | apache-beam[gcp] | pyarrow | tensorflow | tensorflow-metadata | tensorflow-transform | tfx-bsl
GitHub master | 2.40.0 | 6.0.0 | nightly (2.x) | 1.12.0 | n/a | 1.12.0
1.12.0 | 2.40.0 | 6.0.0 | 2.11 | 1.12.0 | n/a | 1.12.0
1.11.0 | 2.40.0 | 6.0.0 | 1.15 / 2.10 | 1.11.0 | n/a | 1.11.0
1.10.0 | 2.40.0 | 6.0.0 | 1.15 / 2.9 | 1.10.0 | n/a | 1.10.1
1.9.0 | 2.38.0 | 5.0.0 | 1.15 / 2.9 | 1.9.0 | n/a | 1.9.0
1.8.0 | 2.38.0 | 5.0.0 | 1.15 / 2.8 | 1.8.0 | n/a | 1.8.0
1.7.0 | 2.36.0 | 5.0.0 | 1.15 / 2.8 | 1.7.0 | n/a | 1.7.0
1.6.0 | 2.35.0 | 5.0.0 | 1.15 / 2.7 | 1.6.0 | n/a | 1.6.0
1.5.0 | 2.34.0 | 5.0.0 | 1.15 / 2.7 | 1.5.0 | n/a | 1.5.0
1.4.0 | 2.32.0 | 4.0.1 | 1.15 / 2.6 | 1.4.0 | n/a | 1.4.0
1.3.0 | 2.32.0 | 2.0.0 | 1.15 / 2.6 | 1.2.0 | n/a | 1.3.0
1.2.0 | 2.31.0 | 2.0.0 | 1.15 / 2.5 | 1.2.0 | n/a | 1.2.0
1.1.1 | 2.29.0 | 2.0.0 | 1.15 / 2.5 | 1.1.0 | n/a | 1.1.1
1.1.0 | 2.29.0 | 2.0.0 | 1.15 / 2.5 | 1.1.0 | n/a | 1.1.0
1.0.0 | 2.29.0 | 2.0.0 | 1.15 / 2.5 | 1.0.0 | n/a | 1.0.0
0.30.0 | 2.28.0 | 2.0.0 | 1.15 / 2.4 | 0.30.0 | n/a | 0.30.0
0.29.0 | 2.28.0 | 2.0.0 | 1.15 / 2.4 | 0.29.0 | n/a | 0.29.0
0.28.0 | 2.28.0 | 2.0.0 | 1.15 / 2.4 | 0.28.0 | n/a | 0.28.1
0.27.0 | 2.27.0 | 2.0.0 | 1.15 / 2.4 | 0.27.0 | n/a | 0.27.0
0.26.1 | 2.28.0 | 0.17.0 | 1.15 / 2.3 | 0.26.0 | 0.26.0 | 0.26.0
0.26.0 | 2.25.0 | 0.17.0 | 1.15 / 2.3 | 0.26.0 | 0.26.0 | 0.26.0
0.25.0 | 2.25.0 | 0.17.0 | 1.15 / 2.3 | 0.25.0 | 0.25.0 | 0.25.0
0.24.1 | 2.24.0 | 0.17.0 | 1.15 / 2.3 | 0.24.0 | 0.24.1 | 0.24.1
0.24.0 | 2.23.0 | 0.17.0 | 1.15 / 2.3 | 0.24.0 | 0.24.0 | 0.24.0
0.23.1 | 2.24.0 | 0.17.0 | 1.15 / 2.3 | 0.23.0 | 0.23.0 | 0.23.0
0.23.0 | 2.23.0 | 0.17.0 | 1.15 / 2.3 | 0.23.0 | 0.23.0 | 0.23.0
0.22.2 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.1
0.22.1 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.1
0.22.0 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.0
0.21.5 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.1 | 0.21.3
0.21.4 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.1 | 0.21.3
0.21.2 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0
0.21.1 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0
0.21.0 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0
0.15.0 | 2.16.0 | 0.14.0 | 1.15 / 2.0 | 0.15.0 | 0.15.0 | 0.15.0
0.14.1 | 2.14.0 | 0.14.0 | 1.14 | 0.14.0 | 0.14.0 | n/a
0.14.0 | 2.14.0 | 0.14.0 | 1.14 | 0.14.0 | 0.14.0 | n/a
0.13.1 | 2.11.0 | n/a | 1.13 | 0.12.1 | 0.13.0 | n/a
0.13.0 | 2.11.0 | n/a | 1.13 | 0.12.1 | 0.13.0 | n/a
0.12.0 | 2.10.0 | n/a | 1.12 | 0.12.1 | 0.12.0 | n/a
0.11.0 | 2.8.0 | n/a | 1.11 | 0.9.0 | 0.11.0 | n/a
0.9.0 | 2.6.0 | n/a | 1.9 | n/a | n/a | n/a
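To compare an environment against one row of the table, the installed package versions can be read with the standard library and checked against that row. The sketch below hard-codes the 1.12.0 row; the helper names are hypothetical, `importlib.metadata` requires Python 3.8 or later, and exact matching is stricter than the table implies, since untested combinations may also work. TensorFlow is left out because the table gives only a minor version for it.

```python
from importlib import metadata

# Versions copied from the 1.12.0 row of the table above.
TFDV_1_12_0_PINS = {
    "apache-beam": "2.40.0",
    "pyarrow": "6.0.0",
    "tensorflow-metadata": "1.12.0",
    "tfx-bsl": "1.12.0",
}


def installed_versions(packages) -> dict:
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def mismatches(installed: dict, pins: dict = TFDV_1_12_0_PINS) -> dict:
    """Return {package: (installed, tested)} for every package that differs."""
    return {
        name: (installed.get(name), wanted)
        for name, wanted in pins.items()
        if installed.get(name) != wanted
    }


if __name__ == "__main__":
    report = mismatches(installed_versions(TFDV_1_12_0_PINS))
    print(report or "All checked packages match the tested versions.")
```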

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.
