Skip to main content

A library for exploring and validating machine learning data.

Project description

TensorFlow Data Validation

Python PyPI Documentation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets)
  • Automated data-schema generation to describe expectations about data like required values, ranges, and vocabularies
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify anomalies, such as missing features, out-of-range values, or wrong feature types, to name a few.
  • An anomalies viewer so that you can see what features have anomalies and learn more in order to correct them.

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation

Nightly Packages

TFDV also hosts nightly packages on Google Cloud. To install the latest nightly package, please use the following command:

export TFX_DEPENDENCY_SELECTOR=NIGHTLY
pip install --extra-index-url https://pypi-nightly.tensorflow.org/simple tensorflow-data-validation

This will install the nightly packages for the major dependencies of TFDV such as TFX Basic Shared Libraries (TFX-BSL) and TensorFlow Metadata (TFMD).

Sometimes TFDV uses those dependencies' most recent changes, which are not yet released. Because of this, it is safer to use nightly versions of those dependent libraries when using nightly TFDV. Export the TFX_DEPENDENCY_SELECTOR environment variable to do so.

NOTE: These nightly packages are unstable and breakages are likely to happen. The fix could often take a week or more depending on the complexity involved.

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

Please first install docker and docker-compose by following the directions: docker; docker-compose.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {38, 39}.

A wheel will be produced under dist/.

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

TFDV wheel is Python version dependent -- to build the pip package that works for a specific Python version, use that Python binary to run:

python setup.py bdist_wheel

You can find the generated .whl file in the dist subdirectory.

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 12.5 (Monterey) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Notable Dependencies

TensorFlow is required.

Apache Beam is required; it's the way that efficient distributed computation is supported. By default, Apache Beam runs in local mode but can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized numpy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

tensorflow-data-validation apache-beam[gcp] pyarrow tensorflow tensorflow-metadata tensorflow-transform tfx-bsl
GitHub master 2.40.0 6.0.0 nightly (2.x) 1.13.1 n/a 1.13.0
1.13.0 2.40.0 6.0.0 2.12 1.13.1 n/a 1.13.0
1.12.0 2.40.0 6.0.0 2.11 1.12.0 n/a 1.12.0
1.11.0 2.40.0 6.0.0 1.15 / 2.10 1.11.0 n/a 1.11.0
1.10.0 2.40.0 6.0.0 1.15 / 2.9 1.10.0 n/a 1.10.1
1.9.0 2.38.0 5.0.0 1.15 / 2.9 1.9.0 n/a 1.9.0
1.8.0 2.38.0 5.0.0 1.15 / 2.8 1.8.0 n/a 1.8.0
1.7.0 2.36.0 5.0.0 1.15 / 2.8 1.7.0 n/a 1.7.0
1.6.0 2.35.0 5.0.0 1.15 / 2.7 1.6.0 n/a 1.6.0
1.5.0 2.34.0 5.0.0 1.15 / 2.7 1.5.0 n/a 1.5.0
1.4.0 2.32.0 4.0.1 1.15 / 2.6 1.4.0 n/a 1.4.0
1.3.0 2.32.0 2.0.0 1.15 / 2.6 1.2.0 n/a 1.3.0
1.2.0 2.31.0 2.0.0 1.15 / 2.5 1.2.0 n/a 1.2.0
1.1.1 2.29.0 2.0.0 1.15 / 2.5 1.1.0 n/a 1.1.1
1.1.0 2.29.0 2.0.0 1.15 / 2.5 1.1.0 n/a 1.1.0
1.0.0 2.29.0 2.0.0 1.15 / 2.5 1.0.0 n/a 1.0.0
0.30.0 2.28.0 2.0.0 1.15 / 2.4 0.30.0 n/a 0.30.0
0.29.0 2.28.0 2.0.0 1.15 / 2.4 0.29.0 n/a 0.29.0
0.28.0 2.28.0 2.0.0 1.15 / 2.4 0.28.0 n/a 0.28.1
0.27.0 2.27.0 2.0.0 1.15 / 2.4 0.27.0 n/a 0.27.0
0.26.1 2.28.0 0.17.0 1.15 / 2.3 0.26.0 0.26.0 0.26.0
0.26.0 2.25.0 0.17.0 1.15 / 2.3 0.26.0 0.26.0 0.26.0
0.25.0 2.25.0 0.17.0 1.15 / 2.3 0.25.0 0.25.0 0.25.0
0.24.1 2.24.0 0.17.0 1.15 / 2.3 0.24.0 0.24.1 0.24.1
0.24.0 2.23.0 0.17.0 1.15 / 2.3 0.24.0 0.24.0 0.24.0
0.23.1 2.24.0 0.17.0 1.15 / 2.3 0.23.0 0.23.0 0.23.0
0.23.0 2.23.0 0.17.0 1.15 / 2.3 0.23.0 0.23.0 0.23.0
0.22.2 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.1
0.22.1 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.1
0.22.0 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.0
0.21.5 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.1 0.21.3
0.21.4 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.1 0.21.3
0.21.2 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.21.1 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.21.0 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.15.0 2.16.0 0.14.0 1.15 / 2.0 0.15.0 0.15.0 0.15.0
0.14.1 2.14.0 0.14.0 1.14 0.14.0 0.14.0 n/a
0.14.0 2.14.0 0.14.0 1.14 0.14.0 0.14.0 n/a
0.13.1 2.11.0 n/a 1.13 0.12.1 0.13.0 n/a
0.13.0 2.11.0 n/a 1.13 0.12.1 0.13.0 n/a
0.12.0 2.10.0 n/a 1.12 0.12.1 0.12.0 n/a
0.11.0 2.8.0 n/a 1.11 0.9.0 0.11.0 n/a
0.9.0 2.6.0 n/a 1.9 n/a n/a n/a

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.

Links

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

tensorflow_data_validation-1.13.0-cp39-cp39-win_amd64.whl (1.4 MB view details)

Uploaded CPython 3.9 Windows x86-64

tensorflow_data_validation-1.13.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.2 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

tensorflow_data_validation-1.13.0-cp39-cp39-macosx_12_0_x86_64.whl (20.6 MB view details)

Uploaded CPython 3.9 macOS 12.0+ x86-64

tensorflow_data_validation-1.13.0-cp38-cp38-win_amd64.whl (1.4 MB view details)

Uploaded CPython 3.8 Windows x86-64

tensorflow_data_validation-1.13.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.2 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.17+ x86-64

tensorflow_data_validation-1.13.0-cp38-cp38-macosx_12_0_x86_64.whl (20.6 MB view details)

Uploaded CPython 3.8 macOS 12.0+ x86-64

File details

Details for the file tensorflow_data_validation-1.13.0-cp39-cp39-win_amd64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.13.0-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 fcc9a07a1963fcc1566a708e0190484a409995516c3e988b340aa67b1672f758
MD5 760f2ac6afa6a319632577250241cadd
BLAKE2b-256 b0b0be70b70a3698dec48371e3485cc01b9b8dcb1a978a446bdf36afd029ddfa

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.13.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.13.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 0d3fc3f9d40ad3feb9ef77d0d6a992b2f5aa60dffa6bb64e695faaf0ddc8c783
MD5 8f2add50c81275d66affbbeab8f939cf
BLAKE2b-256 2009504416e60eb415812e81eed2f65c3a4f4007df116914375f1de70fd5879a

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.13.0-cp39-cp39-macosx_12_0_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.13.0-cp39-cp39-macosx_12_0_x86_64.whl
Algorithm Hash digest
SHA256 6aaea4a2197aba146c11d4c5b5e043a2850bf635ecdd8900cca9dac522af3924
MD5 432be102b6104c27bc39c931dd297b29
BLAKE2b-256 a13ede7a5ba9bc41d888eb1f73e203271433b062cd448ef7cc5c058fd7c2ae9c

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.13.0-cp38-cp38-win_amd64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.13.0-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 6e372fcc35b643be5865f6ebe6b19ffdf1d255bed5ca4f6443c66b64396a906a
MD5 238f05ed202ea10fe129445c554b5f5b
BLAKE2b-256 9ae73362ced3c6bf4243f4e3a2ca66e4cdbd3ab0e0d9701ea10503a10c4b641c

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.13.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.13.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 b747964fa0ee8312f3c158ebf2984ec84200e666ff1e3ef277f4f376ddac8ab0
MD5 ab3cbda94e576cebfe80cc0eab566087
BLAKE2b-256 f7aa8fa6d0e5c27c0ce5b79a777ef95796e3049536b39c5f70aa9dda731940f7

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.13.0-cp38-cp38-macosx_12_0_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.13.0-cp38-cp38-macosx_12_0_x86_64.whl
Algorithm Hash digest
SHA256 8e81f452bfb0756e65b27f3e017f6d352083812e722183783dd43a36de87b92d
MD5 0ede9104f79e3f2dc0bf58a0adf24a67
BLAKE2b-256 8b9b0ff5ee03375a2279c6132a2c9a3fd0fd77566ebe167a6c49c26981f91bc9

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page