Skip to main content

A library for exploring and validating machine learning data.

Project description

TensorFlow Data Validation

Python PyPI Documentation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets)
  • Automated data-schema generation to describe expectations about data like required values, ranges, and vocabularies
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify anomalies, such as missing features, out-of-range values, or wrong feature types, to name a few.
  • An anomalies viewer so that you can see what features have anomalies and learn more in order to correct them.

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation

Nightly Packages

TFDV also hosts nightly packages at https://pypi-nightly.tensorflow.org on Google Cloud. To install the latest nightly package, please use the following command:

pip install --extra-index-url https://pypi-nightly.tensorflow.org/simple tensorflow-data-validation

This will install the nightly packages for the major dependencies of TFDV such as TFX Basic Shared Libraries (TFX-BSL) and TensorFlow Metadata (TFMD).

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

Please first install docker and docker-compose by following the directions: docker; docker-compose.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {37, 38, 39}.

A wheel will be produced under dist/.

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

TFDV wheel is Python version dependent -- to build the pip package that works for a specific Python version, use that Python binary to run:

python setup.py bdist_wheel

You can find the generated .whl file in the dist subdirectory.

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.14.6 (Mojave) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Notable Dependencies

TensorFlow is required.

Apache Beam is required; it's the way that efficient distributed computation is supported. By default, Apache Beam runs in local mode but can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized numpy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

tensorflow-data-validation apache-beam[gcp] pyarrow tensorflow tensorflow-metadata tensorflow-transform tfx-bsl
GitHub master 2.38.0 5.0.0 nightly (1.x/2.x) 1.8.0 n/a 1.8.0
1.8.0 2.38.0 5.0.0 1.15 / 2.8 1.8.0 n/a 1.8.0
1.7.0 2.36.0 5.0.0 1.15 / 2.8 1.7.0 n/a 1.7.0
1.6.0 2.35.0 5.0.0 1.15 / 2.7 1.6.0 n/a 1.6.0
1.5.0 2.34.0 5.0.0 1.15 / 2.7 1.5.0 n/a 1.5.0
1.4.0 2.32.0 4.0.1 1.15 / 2.6 1.4.0 n/a 1.4.0
1.3.0 2.32.0 2.0.0 1.15 / 2.6 1.2.0 n/a 1.3.0
1.2.0 2.31.0 2.0.0 1.15 / 2.5 1.2.0 n/a 1.2.0
1.1.1 2.29.0 2.0.0 1.15 / 2.5 1.1.0 n/a 1.1.1
1.1.0 2.29.0 2.0.0 1.15 / 2.5 1.1.0 n/a 1.1.0
1.0.0 2.29.0 2.0.0 1.15 / 2.5 1.0.0 n/a 1.0.0
0.30.0 2.28.0 2.0.0 1.15 / 2.4 0.30.0 n/a 0.30.0
0.29.0 2.28.0 2.0.0 1.15 / 2.4 0.29.0 n/a 0.29.0
0.28.0 2.28.0 2.0.0 1.15 / 2.4 0.28.0 n/a 0.28.1
0.27.0 2.27.0 2.0.0 1.15 / 2.4 0.27.0 n/a 0.27.0
0.26.1 2.28.0 0.17.0 1.15 / 2.3 0.26.0 0.26.0 0.26.0
0.26.0 2.25.0 0.17.0 1.15 / 2.3 0.26.0 0.26.0 0.26.0
0.25.0 2.25.0 0.17.0 1.15 / 2.3 0.25.0 0.25.0 0.25.0
0.24.1 2.24.0 0.17.0 1.15 / 2.3 0.24.0 0.24.1 0.24.1
0.24.0 2.23.0 0.17.0 1.15 / 2.3 0.24.0 0.24.0 0.24.0
0.23.1 2.24.0 0.17.0 1.15 / 2.3 0.23.0 0.23.0 0.23.0
0.23.0 2.23.0 0.17.0 1.15 / 2.3 0.23.0 0.23.0 0.23.0
0.22.2 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.1
0.22.1 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.1
0.22.0 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.0
0.21.5 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.1 0.21.3
0.21.4 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.1 0.21.3
0.21.2 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.21.1 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.21.0 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.15.0 2.16.0 0.14.0 1.15 / 2.0 0.15.0 0.15.0 0.15.0
0.14.1 2.14.0 0.14.0 1.14 0.14.0 0.14.0 n/a
0.14.0 2.14.0 0.14.0 1.14 0.14.0 0.14.0 n/a
0.13.1 2.11.0 n/a 1.13 0.12.1 0.13.0 n/a
0.13.0 2.11.0 n/a 1.13 0.12.1 0.13.0 n/a
0.12.0 2.10.0 n/a 1.12 0.12.1 0.12.0 n/a
0.11.0 2.8.0 n/a 1.11 0.9.0 0.11.0 n/a
0.9.0 2.6.0 n/a 1.9 n/a n/a n/a

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.

Links

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

tensorflow_data_validation-1.8.0-cp39-cp39-win_amd64.whl (1.2 MB view details)

Uploaded CPython 3.9 Windows x86-64

tensorflow_data_validation-1.8.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-1.8.0-cp39-cp39-macosx_10_14_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.9 macOS 10.14+ x86-64

tensorflow_data_validation-1.8.0-cp38-cp38-win_amd64.whl (1.2 MB view details)

Uploaded CPython 3.8 Windows x86-64

tensorflow_data_validation-1.8.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-1.8.0-cp38-cp38-macosx_10_9_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.8 macOS 10.9+ x86-64

tensorflow_data_validation-1.8.0-cp37-cp37m-win_amd64.whl (1.2 MB view details)

Uploaded CPython 3.7m Windows x86-64

tensorflow_data_validation-1.8.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.7m manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-1.8.0-cp37-cp37m-macosx_10_9_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.7m macOS 10.9+ x86-64

File details

Details for the file tensorflow_data_validation-1.8.0-cp39-cp39-win_amd64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.8.0-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 b7162511800310b82e60f9f95487508b4220537979f1ef132fc23b9c55b6a3db
MD5 a03b7f2591e27799aaf5e77917f8942e
BLAKE2b-256 ab709868dd14ddfaf5e207ff6053c13701c848a5e9ea9d5e5e4bcf3dc49c46e7

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.8.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.8.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 769d84a22aea2eaa6737a9fa5ded7962d937edccf0569c2b71a505974c55941d
MD5 8d61a24f60093563ce2d854c04e376f1
BLAKE2b-256 ad4debbb0a22320f95d3d7a92cb82094b4ef7449c22d168dce9584ed7b19b372

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.8.0-cp39-cp39-macosx_10_14_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.8.0-cp39-cp39-macosx_10_14_x86_64.whl
Algorithm Hash digest
SHA256 81f533b0511ce2b8d02f105ad0f804fea0383ba3b8a26f25e29b5918b194f682
MD5 dcd9381f19130ca89edb348da969c455
BLAKE2b-256 34cf74dcabffdf5ed4ad38d1eb68695a1e00dd32b23a6f319bc50fe5dcb74f41

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.8.0-cp38-cp38-win_amd64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.8.0-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 85ee1d767712cd9b32c6432167350c8cd8a54a76e96a53da29d4b2b6d9de9fe4
MD5 84e3dd42831d44c66e0264f9cf40ae16
BLAKE2b-256 993cf3f71fdb2deab5da41702844da076eb834f8cfb67537c9ef969dd2dcfe7b

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.8.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.8.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 3df07402a09d2346757fc7f15e4ae4802718d3b8fc7b0dcc247a86d582ffff7e
MD5 4f1bcaf26b3aefa4d73ad4e5e35d28ec
BLAKE2b-256 1d2033508956f00c6863419e172b17d9630739ab6dee743d896621d91eedf87b

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.8.0-cp38-cp38-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.8.0-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 0a4db982ba67334370ac1554a212f2c02a90a374bc37f4d0d955db0e8de2a924
MD5 45cdac71bcccb01f276a5976994163f9
BLAKE2b-256 812f07820aecf784a2b076f0eb7e24587df05079604aa545f41d507ff0f486af

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.8.0-cp37-cp37m-win_amd64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.8.0-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 e182f73a6ded3a5848091a657eb2b460610fcdbb30ec658b736563eadecb89d2
MD5 bd9e40b8b6de7c797b86ba3963f948a9
BLAKE2b-256 2e3f098247f59cf5a2bc353c4eab49ae163e584c5d1d2a34a47c79b0f19cd013

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.8.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.8.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 2a9d0ef83628d3f34e7fc8183628d54eaab3430ece7f2ac256d986a0948409d4
MD5 c3d14304bf1ee61ba5c8b47a25937b71
BLAKE2b-256 f1aabaa940cf8eb5b7b2a24d77bd329d46b3e4b78c33867045c559803f6476d4

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.8.0-cp37-cp37m-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.8.0-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 5fedd0c13a9c35c2d1f6e79bd4741dfca715e371e8f0d91a6b541daac12d1a3e
MD5 6f4d97747d40d7052a458a89ad44e1ef
BLAKE2b-256 b5e027213a40ad746431d693dd03a9458ddecb8778cea0346506e399ddb861d4

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page