TensorFlow Data Validation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer (Facets) for data distributions and statistics, as well as faceted comparison of pairs of features.
  • Automated data-schema generation to describe expectations about data, such as required values, ranges, and vocabularies.
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify problems such as missing features, out-of-range values, or wrong feature types.
  • An anomalies viewer so that you can see what features have anomalies and learn more in order to correct them.
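
The features listed above map onto a short workflow: compute statistics, infer a schema, then validate new data against it. The following is a minimal sketch, not a definitive recipe. The function name find_anomalies and the CSV paths passed to it are hypothetical; the tfdv calls are the library's public functions, imported lazily so the module loads even where TFDV is not installed.

```python
from typing import Dict


def find_anomalies(train_csv: str, eval_csv: str) -> Dict[str, str]:
    """Infer a schema from training data and check evaluation data against it.

    Both arguments are paths (or file patterns) to CSV files with a header row.
    """
    # Imported lazily so this module can be loaded where TFDV is absent.
    import tensorflow_data_validation as tfdv

    # 1. Compute summary statistics over the training data.
    train_stats = tfdv.generate_statistics_from_csv(data_location=train_csv)

    # 2. Infer a schema (types, domains, presence) from those statistics.
    schema = tfdv.infer_schema(statistics=train_stats)

    # 3. Compute statistics for the evaluation data and validate them.
    eval_stats = tfdv.generate_statistics_from_csv(data_location=eval_csv)
    anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)

    # 4. Map each anomalous feature name to its description.
    return {name: info.description
            for name, info in anomalies.anomaly_info.items()}
```

In a notebook, tfdv.visualize_statistics, tfdv.display_schema, and tfdv.display_anomalies render the corresponding viewers for the objects computed above.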

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Caution: TFDV may be backwards incompatible before version 1.0.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation
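
After installing, it can be useful to confirm which version ended up in the environment. The sketch below uses only the standard library and assumes Python 3.8 or later (importlib.metadata is not available in earlier versions); the helper name installed_version is illustrative.

```python
from importlib import metadata  # standard library from Python 3.8
from typing import Optional


def installed_version(
        distribution: str = "tensorflow-data-validation") -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print(version if version else "tensorflow-data-validation is not installed")
```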

Nightly Packages

TFDV also hosts nightly packages at https://pypi-nightly.tensorflow.org on Google Cloud. To install the latest nightly package, please use the following command:

pip install -i https://pypi-nightly.tensorflow.org/simple tensorflow-data-validation

This also installs the nightly packages for TFDV's major dependencies, such as TFX Basic Shared Libraries (TFX-BSL) and TensorFlow Metadata (TFMD).

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

First install Docker and Docker Compose by following their official installation instructions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {35, 36, 37, 38}.

A wheel will be produced under dist/.
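
If you want the wheel to match the interpreter you are currently running, the PYTHON_VERSION value can be derived rather than typed by hand. This is a small sketch under that assumption; the helper name docker_python_version is hypothetical, and the accepted values are the ones listed above.

```python
import sys

# Values accepted by the Docker build, as listed in this section.
SUPPORTED = {"35", "36", "37", "38"}


def docker_python_version(version_info=sys.version_info) -> str:
    """Return the PYTHON_VERSION value for the given interpreter version."""
    major, minor = version_info[0], version_info[1]
    value = f"{major}{minor}"
    if value not in SUPPORTED:
        raise ValueError(
            f"Python {major}.{minor} is not a supported build target")
    return value
```

For example, docker_python_version((3, 7, 0)) returns "37", which can then be exported as PYTHON_VERSION before running docker-compose.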

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it by following the official NumPy installation instructions.

Install Bazel

If Bazel is not installed on your system, install it by following the official Bazel installation instructions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

The TFDV wheel is Python-version dependent. To build a pip package that works for a specific Python version, use that Python binary to run:

python setup.py bdist_wheel

You can find the generated .whl file in the dist subdirectory.
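
Because the wheel is tied to a Python version, its file name carries the interpreter tag. The sketch below parses a wheel file name following the standard wheel naming layout (distribution, version, Python tag, ABI tag, platform tag) and checks it against an interpreter version. It assumes the name has no optional build tag; the function names are illustrative.

```python
import sys
from typing import NamedTuple


class WheelTags(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_name(filename: str) -> WheelTags:
    """Split a wheel file name into its five components.

    Assumes there is no optional build tag in the name.
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[:-len(".whl")].split("-")
    if len(parts) != 5:
        raise ValueError(f"unexpected wheel name layout: {filename}")
    return WheelTags(*parts)


def matches_interpreter(filename: str, version_info=sys.version_info) -> bool:
    """Return True if the wheel's Python tag matches the CPython version."""
    tag = f"cp{version_info[0]}{version_info[1]}"
    return parse_wheel_name(filename).python_tag == tag
```

A wheel named tensorflow_data_validation-0.27.0-cp38-cp38-manylinux2010_x86_64.whl, for instance, has the Python tag cp38 and would only match a CPython 3.8 interpreter.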

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.14.6 (Mojave) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.
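
Since only 64-bit systems are listed, a quick check of the running interpreter can rule out one common cause of installation failures. This is an illustrative sketch using the standard library only; it reports what Python sees and does not verify the specific OS versions listed above.

```python
import platform
import sys


def is_64bit_python() -> bool:
    """Return True when the running interpreter is a 64-bit build."""
    return sys.maxsize > 2**32


def platform_summary() -> str:
    """Describe the OS and interpreter word size in one line."""
    bits = "64" if is_64bit_python() else "32"
    return f"{platform.system()} {platform.release()} ({bits}-bit Python)"


if __name__ == "__main__":
    print(platform_summary())
```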

Notable Dependencies

TensorFlow is required.

Apache Beam is required; TFDV uses it to run computations efficiently in a distributed fashion. By default, Apache Beam runs in local mode, but it can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.
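
Switching from local mode to a distributed runner is done through Beam pipeline options. The sketch below assumes TFRecord input and Google Cloud Dataflow; the project, region, bucket paths, and both function names are hypothetical placeholders. A real Dataflow job typically needs further options, for example to make TFDV available on the workers.

```python
from typing import List


def dataflow_flags(project: str, region: str, temp_location: str,
                   job_name: str = "tfdv-stats") -> List[str]:
    """Build Beam command-line style flags that select the Dataflow runner."""
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
        f"--job_name={job_name}",
    ]


def generate_stats_on_dataflow(data_location: str, output_path: str,
                               flags: List[str]):
    """Compute statistics for TFRecord data using the given Beam flags."""
    # Imported lazily so the flag helper works without Beam or TFDV installed.
    from apache_beam.options.pipeline_options import PipelineOptions
    import tensorflow_data_validation as tfdv

    return tfdv.generate_statistics_from_tfrecord(
        data_location=data_location,
        output_path=output_path,
        pipeline_options=PipelineOptions(flags),
    )
```

Omitting pipeline_options keeps the computation in Beam's default local mode.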

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized NumPy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

| tensorflow-data-validation | apache-beam[gcp] | pyarrow | tensorflow | tensorflow-metadata | tensorflow-transform | tfx-bsl |
|---|---|---|---|---|---|---|
| GitHub master | 2.27.0 | 2.0.0 | nightly (1.x/2.x) | 0.27.0 | n/a | 0.27.0 |
| 0.27.0 | 2.27.0 | 2.0.0 | 1.15 / 2.4 | 0.27.0 | n/a | 0.27.0 |
| 0.26.0 | 2.25.0 | 0.17.0 | 1.15 / 2.3 | 0.26.0 | 0.26.0 | 0.26.0 |
| 0.25.0 | 2.25.0 | 0.17.0 | 1.15 / 2.3 | 0.25.0 | 0.25.0 | 0.25.0 |
| 0.24.1 | 2.24.0 | 0.17.0 | 1.15 / 2.3 | 0.24.0 | 0.24.1 | 0.24.1 |
| 0.24.0 | 2.23.0 | 0.17.0 | 1.15 / 2.3 | 0.24.0 | 0.24.0 | 0.24.0 |
| 0.23.1 | 2.24.0 | 0.17.0 | 1.15 / 2.3 | 0.23.0 | 0.23.0 | 0.23.0 |
| 0.23.0 | 2.23.0 | 0.17.0 | 1.15 / 2.3 | 0.23.0 | 0.23.0 | 0.23.0 |
| 0.22.2 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.1 |
| 0.22.1 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.1 |
| 0.22.0 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.0 |
| 0.21.5 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.1 | 0.21.3 |
| 0.21.4 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.1 | 0.21.3 |
| 0.21.2 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0 |
| 0.21.1 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0 |
| 0.21.0 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0 |
| 0.15.0 | 2.16.0 | 0.14.0 | 1.15 / 2.0 | 0.15.0 | 0.15.0 | 0.15.0 |
| 0.14.1 | 2.14.0 | 0.14.0 | 1.14 | 0.14.0 | 0.14.0 | n/a |
| 0.14.0 | 2.14.0 | 0.14.0 | 1.14 | 0.14.0 | 0.14.0 | n/a |
| 0.13.1 | 2.11.0 | n/a | 1.13 | 0.12.1 | 0.13.0 | n/a |
| 0.13.0 | 2.11.0 | n/a | 1.13 | 0.12.1 | 0.13.0 | n/a |
| 0.12.0 | 2.10.0 | n/a | 1.12 | 0.12.1 | 0.12.0 | n/a |
| 0.11.0 | 2.8.0 | n/a | 1.11 | 0.9.0 | 0.11.0 | n/a |
| 0.9.0 | 2.6.0 | n/a | 1.9 | n/a | n/a | n/a |
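
One way to use the table is to turn a row into pinned requirements for a reproducible environment. The sketch below hard-codes two rows copied from the table; the COMPATIBLE mapping and the pinned_requirements helper are illustrative and not part of TFDV. TensorFlow is left out because the table lists two tested versions per row, and exact pins are stricter than what the packages themselves require.

```python
from typing import Dict, List

# Two rows copied from the compatibility table above ("n/a" entries omitted).
COMPATIBLE: Dict[str, Dict[str, str]] = {
    "0.27.0": {
        "apache-beam[gcp]": "2.27.0",
        "pyarrow": "2.0.0",
        "tensorflow-metadata": "0.27.0",
        "tfx-bsl": "0.27.0",
    },
    "0.26.0": {
        "apache-beam[gcp]": "2.25.0",
        "pyarrow": "0.17.0",
        "tensorflow-metadata": "0.26.0",
        "tensorflow-transform": "0.26.0",
        "tfx-bsl": "0.26.0",
    },
}


def pinned_requirements(tfdv_version: str) -> List[str]:
    """Return requirements.txt lines pinning the tested versions for a release."""
    deps = COMPATIBLE[tfdv_version]  # raises KeyError for rows not copied here
    lines = [f"tensorflow-data-validation=={tfdv_version}"]
    lines += [f"{name}=={version}" for name, version in sorted(deps.items())]
    return lines


if __name__ == "__main__":
    print("\n".join(pinned_requirements("0.27.0")))
```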

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.
