Skip to main content

A library for exploring and validating machine learning data.

Project description

TensorFlow Data Validation

Python PyPI Documentation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets)
  • Automated data-schema generation to describe expectations about data like required values, ranges, and vocabularies
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify anomalies, such as missing features, out-of-range values, or wrong feature types, to name a few.
  • An anomalies viewer so that you can see what features have anomalies and learn more in order to correct them.

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation

Nightly Packages

TFDV also hosts nightly packages at https://pypi-nightly.tensorflow.org on Google Cloud. To install the latest nightly package, please use the following command:

pip install -i https://pypi-nightly.tensorflow.org/simple tensorflow-data-validation

This will install the nightly packages for the major dependencies of TFDV such as TFX Basic Shared Libraries (TFX-BSL) and TensorFlow Metadata (TFMD).

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

Please first install docker and docker-compose by following the directions: docker; docker-compose.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {36, 37, 38}.

A wheel will be produced under dist/.

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

TFDV wheel is Python version dependent -- to build the pip package that works for a specific Python version, use that Python binary to run:

python setup.py bdist_wheel

You can find the generated .whl file in the dist subdirectory.

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.14.6 (Mojave) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Notable Dependencies

TensorFlow is required.

Apache Beam is required; it's the way that efficient distributed computation is supported. By default, Apache Beam runs in local mode but can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized numpy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

tensorflow-data-validation apache-beam[gcp] pyarrow tensorflow tensorflow-metadata tensorflow-transform tfx-bsl
GitHub master 2.29.0 2.0.0 nightly (1.x/2.x) 1.1.0 n/a 1.1.0
1.1.0 2.29.0 2.0.0 1.15 / 2.5 1.1.0 n/a 1.1.0
1.0.0 2.29.0 2.0.0 1.15 / 2.5 1.0.0 n/a 1.0.0
0.30.0 2.28.0 2.0.0 1.15 / 2.4 0.30.0 n/a 0.30.0
0.29.0 2.28.0 2.0.0 1.15 / 2.4 0.29.0 n/a 0.29.0
0.28.0 2.28.0 2.0.0 1.15 / 2.4 0.28.0 n/a 0.28.1
0.27.0 2.27.0 2.0.0 1.15 / 2.4 0.27.0 n/a 0.27.0
0.26.1 2.28.0 0.17.0 1.15 / 2.3 0.26.0 0.26.0 0.26.0
0.26.0 2.25.0 0.17.0 1.15 / 2.3 0.26.0 0.26.0 0.26.0
0.25.0 2.25.0 0.17.0 1.15 / 2.3 0.25.0 0.25.0 0.25.0
0.24.1 2.24.0 0.17.0 1.15 / 2.3 0.24.0 0.24.1 0.24.1
0.24.0 2.23.0 0.17.0 1.15 / 2.3 0.24.0 0.24.0 0.24.0
0.23.1 2.24.0 0.17.0 1.15 / 2.3 0.23.0 0.23.0 0.23.0
0.23.0 2.23.0 0.17.0 1.15 / 2.3 0.23.0 0.23.0 0.23.0
0.22.2 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.1
0.22.1 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.1
0.22.0 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.0
0.21.5 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.1 0.21.3
0.21.4 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.1 0.21.3
0.21.2 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.21.1 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.21.0 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.15.0 2.16.0 0.14.0 1.15 / 2.0 0.15.0 0.15.0 0.15.0
0.14.1 2.14.0 0.14.0 1.14 0.14.0 0.14.0 n/a
0.14.0 2.14.0 0.14.0 1.14 0.14.0 0.14.0 n/a
0.13.1 2.11.0 n/a 1.13 0.12.1 0.13.0 n/a
0.13.0 2.11.0 n/a 1.13 0.12.1 0.13.0 n/a
0.12.0 2.10.0 n/a 1.12 0.12.1 0.12.0 n/a
0.11.0 2.8.0 n/a 1.11 0.9.0 0.11.0 n/a
0.9.0 2.6.0 n/a 1.9 n/a n/a n/a

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.

Links

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

tensorflow_data_validation-1.1.0-cp38-cp38-win_amd64.whl (1.1 MB view details)

Uploaded CPython 3.8 Windows x86-64

tensorflow_data_validation-1.1.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-1.1.0-cp38-cp38-macosx_10_9_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.8 macOS 10.9+ x86-64

tensorflow_data_validation-1.1.0-cp37-cp37m-win_amd64.whl (1.1 MB view details)

Uploaded CPython 3.7m Windows x86-64

tensorflow_data_validation-1.1.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.7m manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-1.1.0-cp37-cp37m-macosx_10_9_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.7m macOS 10.9+ x86-64

tensorflow_data_validation-1.1.0-cp36-cp36m-win_amd64.whl (1.1 MB view details)

Uploaded CPython 3.6m Windows x86-64

tensorflow_data_validation-1.1.0-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.6m manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-1.1.0-cp36-cp36m-macosx_10_9_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.6m macOS 10.9+ x86-64

File details

Details for the file tensorflow_data_validation-1.1.0-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-1.1.0-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.1 CPython/3.8.5

File hashes

Hashes for tensorflow_data_validation-1.1.0-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 93829da25839a7f35c420da927258b122c94aff1410fac6bfec289a4bca0d2c0
MD5 b2063eaacf8372d37d1b3259e5b3e411
BLAKE2b-256 6d9ec13f2f687509435fc2abc2c9312bc6622be2ffbc76807e0128e0956cfd27

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.1.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.1.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 879a44e1c5f74191b07591eb6b894bd6176a7a71025d1a1601dd8f7ba6eaf3bb
MD5 de12e40e21b3beededea28199d210133
BLAKE2b-256 c05cf559f8276a18fcc2d83c4624c146da79f27dd9558c81ccf4754e6c8c8bc6

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.1.0-cp38-cp38-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.1.0-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 e7718ab948e7982877a01299b4a7e7f53ea2b15c192a4992067a36e74f053f2b
MD5 b28a1840c28be8db3dce0c778ea27f95
BLAKE2b-256 090b595097e16876d4910873dd27af5c753d5ae4e05798e9f585272ca8392c81

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.1.0-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-1.1.0-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.1 CPython/3.7.0

File hashes

Hashes for tensorflow_data_validation-1.1.0-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 a11d633854fa11759d2549aab0ad90259a0ab79c52eb56ffc431935b70b84258
MD5 9a6cd7683b96d785bf5b5b243c21af37
BLAKE2b-256 0ce5c6f742955c4871ae75ee3c30511ce5c7d213393dda5721ca953230aa01be

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.1.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.1.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 61e689c965e6591b67024a65837be6449c9f851988e9fb41fc6b78783f2b4408
MD5 f8c65ab81b1871bced39c3b6233213de
BLAKE2b-256 b4b5f4a5ed5c44af7157a5606c44769b8050878db15efa7d411d4cc6918f3243

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.1.0-cp37-cp37m-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.1.0-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 d15aa5f6d3453219a8f330f8144afdc3b5636adbdfb7b04e85dd5aabab9b7b14
MD5 93e16c49c07641201a85c5781bf6ea0d
BLAKE2b-256 c19319e5a2ff52bb07add93fc2b26919bd854bef3bcf1a66ee5b27d37ed62eb7

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.1.0-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-1.1.0-cp36-cp36m-win_amd64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.1 CPython/3.6.5

File hashes

Hashes for tensorflow_data_validation-1.1.0-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 9b1fa16c0b813b87b4c23044924135fef01e462034f689b0e9f51fa9b02e2f86
MD5 1d74c0ef19845c97544a09ab0de0b623
BLAKE2b-256 ad26a42eeee0dbb016c1f32a2d9141eee8ca958312d68adf8b1351245e7ef2b2

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.1.0-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.1.0-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 70aeeb07826010e72f0f5d3d27411418ca2a282378c7e01c1c0444944cea6d4a
MD5 3c93657ce35800a2a43a8b6d48a9f798
BLAKE2b-256 3cb9bf8cbd5472bf61f2b985067e9fdc5b9cc939a01a560587c8a87f5a21edab

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.1.0-cp36-cp36m-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.1.0-cp36-cp36m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 2bf0004ecda100be2ca7149780aea3e345dc3d01d000847d3b7e88d8d25a23ab
MD5 5b6b2dbc5cfb0f44cd1ce642f735ff4b
BLAKE2b-256 d479d671a6033401c5a1faf13d34d3fc9f7c6366721088ca3a22849be6807b03

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page