A library for exploring and validating machine learning data.

TensorFlow Data Validation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets).
  • Automated data-schema generation to describe expectations about data like required values, ranges, and vocabularies.
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify issues such as missing features, out-of-range values, or wrong feature types.
  • An anomalies viewer so that you can see which features have anomalies and learn more in order to correct them.

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.
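To make the feature list above more concrete, the following is a toy, pure-Python illustration of the idea of inferring a schema from training data and then flagging anomalies in new data. It is only a sketch of the concept and does not use or reproduce TFDV's implementation; the names `infer_toy_schema` and `find_anomalies`, and the sample rows, are hypothetical.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def infer_toy_schema(rows: List[Row]) -> Dict[str, dict]:
    """Derive per-feature expectations (type, required, numeric range)."""
    schema: Dict[str, dict] = {}
    for name in sorted({key for row in rows for key in row}):
        values = [row[name] for row in rows if name in row]
        entry = {
            "type": type(values[0]),
            # A feature is required if every training row contains it.
            "required": len(values) == len(rows),
        }
        if all(isinstance(v, (int, float)) for v in values):
            entry["min"], entry["max"] = min(values), max(values)
        schema[name] = entry
    return schema


def find_anomalies(rows: List[Row], schema: Dict[str, dict]) -> List[str]:
    """Compare new rows against the schema and describe each violation."""
    anomalies = []
    for index, row in enumerate(rows):
        for name, entry in schema.items():
            if name not in row:
                if entry["required"]:
                    anomalies.append(f"row {index}: missing feature '{name}'")
                continue
            value = row[name]
            if not isinstance(value, entry["type"]):
                anomalies.append(f"row {index}: wrong type for '{name}'")
            elif "min" in entry and not entry["min"] <= value <= entry["max"]:
                anomalies.append(f"row {index}: '{name}' out of range")
    return anomalies


# Hypothetical sample data, for illustration only.
train = [{"age": 31, "city": "Oslo"}, {"age": 47, "city": "Lima"}]
serving = [{"age": 212, "city": "Oslo"}, {"age": "n/a", "city": "Lima"}, {"city": "Rome"}]

schema = infer_toy_schema(train)
for line in find_anomalies(serving, schema):
    print(line)
```

In TFDV itself, the corresponding steps are handled by library functions such as `tfdv.generate_statistics_from_csv`, `tfdv.infer_schema`, and `tfdv.validate_statistics`; the get started guide is the authoritative reference for their usage.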

Caution: TFDV may be backwards incompatible before version 1.0.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation
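After installing, one way to confirm which version ended up in the environment is to query the package metadata. This is a minimal sketch that assumes Python 3.8 or later (where `importlib.metadata` is in the standard library); the helper name `installed_version` is hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("tensorflow-data-validation")
print(version if version else "tensorflow-data-validation is not installed")
```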

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

First install Docker and docker-compose by following their respective installation directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {35, 36, 37}.

A wheel will be produced under dist/.
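If you script the build, the two commands above can be assembled programmatically. The sketch below assumes that PYTHON_VERSION is the major and minor version digits concatenated (for example 3.7 becomes 37), which is inferred from the listed values; the helper names are hypothetical, and the commands are only printed, not executed.

```python
import sys

SUPPORTED = {"35", "36", "37"}  # values listed in the build instructions


def python_version_tag(version_info=sys.version_info) -> str:
    """Format (major, minor) as PYTHON_VERSION is assumed to expect it."""
    return f"{version_info[0]}{version_info[1]}"


def docker_build_commands(tag: str) -> list:
    """Return the two docker-compose invocations for the given version tag."""
    if tag not in SUPPORTED:
        raise ValueError(
            f"PYTHON_VERSION must be one of {sorted(SUPPORTED)}, got {tag!r}"
        )
    return [
        ["sudo", "docker-compose", "build", "manylinux2010"],
        ["sudo", "docker-compose", "run", "-e", f"PYTHON_VERSION={tag}", "manylinux2010"],
    ]


# Print the commands for a Python 3.7 build.
for command in docker_build_commands(python_version_tag((3, 7))):
    print(" ".join(command))
```

The argument lists can be passed to `subprocess.run` unchanged if you prefer to run the build from Python.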

4. Install the pip package

pip install dist/*.whl
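If dist/ contains wheels for more than one Python version (for example after several Docker builds), `dist/*.whl` matches all of them. The sketch below shows one way to pick the wheel for a given interpreter by reading the python tag from the file name, relying on the standard wheel naming scheme (name-version-python-abi-platform.whl). The helper names are hypothetical, and the file names are taken from the 0.23.0 release.

```python
from typing import List


def wheel_python_tag(filename: str) -> str:
    """Return the python tag (e.g. 'cp37') from a wheel file name."""
    stem = filename[: -len(".whl")]
    # The last three dash-separated fields are python tag, ABI tag, platform tag.
    return stem.split("-")[-3]


def matching_wheels(names: List[str], tag: str) -> List[str]:
    """Keep only the wheels built for the given python tag."""
    return [n for n in names if n.endswith(".whl") and wheel_python_tag(n) == tag]


wheels = [
    "tensorflow_data_validation-0.23.0-cp35-cp35m-manylinux2010_x86_64.whl",
    "tensorflow_data_validation-0.23.0-cp36-cp36m-manylinux2010_x86_64.whl",
    "tensorflow_data_validation-0.23.0-cp37-cp37m-manylinux2010_x86_64.whl",
]

print(matching_wheels(wheels, "cp37"))
```

To match the running interpreter, the tag can be built as `f"cp{sys.version_info[0]}{sys.version_info[1]}"`.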

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

TFDV uses Bazel to build the pip package from source. Before invoking the following commands, make sure the python on your $PATH is the target Python version and has NumPy installed.

bazel run -c opt --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0 tensorflow_data_validation:build_pip_package

Note that this command assumes that dependent packages (e.g. PyArrow) were built with a GCC older than 5.1, so it passes the flag -D_GLIBCXX_USE_CXX11_ABI=0 to stay compatible with the old std::string ABI.

You can find the generated .whl file in the dist subdirectory.

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.14.6 (Mojave) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Notable Dependencies

TensorFlow is required.

Apache Beam is required; TFDV relies on it for efficient distributed computation. By default, Apache Beam runs in local mode, but it can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized NumPy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

| tensorflow-data-validation | apache-beam[gcp] | pyarrow | tensorflow        | tensorflow-metadata | tensorflow-transform | tfx-bsl |
|----------------------------|------------------|---------|-------------------|---------------------|----------------------|---------|
| GitHub master              | 2.23.0           | 0.17.0  | nightly (1.x/2.x) | 0.23.0              | 0.23.0               | 0.23.0  |
| 0.23.0                     | 2.23.0           | 0.17.0  | 1.15 / 2.3        | 0.23.0              | 0.23.0               | 0.23.0  |
| 0.22.2                     | 2.20.0           | 0.16.0  | 1.15 / 2.2        | 0.22.0              | 0.22.0               | 0.22.1  |
| 0.22.1                     | 2.20.0           | 0.16.0  | 1.15 / 2.2        | 0.22.0              | 0.22.0               | 0.22.1  |
| 0.22.0                     | 2.20.0           | 0.16.0  | 1.15 / 2.2        | 0.22.0              | 0.22.0               | 0.22.0  |
| 0.21.5                     | 2.17.0           | 0.15.0  | 1.15 / 2.1        | 0.21.0              | 0.21.1               | 0.21.3  |
| 0.21.4                     | 2.17.0           | 0.15.0  | 1.15 / 2.1        | 0.21.0              | 0.21.1               | 0.21.3  |
| 0.21.2                     | 2.17.0           | 0.15.0  | 1.15 / 2.1        | 0.21.0              | 0.21.0               | 0.21.0  |
| 0.21.1                     | 2.17.0           | 0.15.0  | 1.15 / 2.1        | 0.21.0              | 0.21.0               | 0.21.0  |
| 0.21.0                     | 2.17.0           | 0.15.0  | 1.15 / 2.1        | 0.21.0              | 0.21.0               | 0.21.0  |
| 0.15.0                     | 2.16.0           | 0.14.0  | 1.15 / 2.0        | 0.15.0              | 0.15.0               | 0.15.0  |
| 0.14.1                     | 2.14.0           | 0.14.0  | 1.14              | 0.14.0              | 0.14.0               | n/a     |
| 0.14.0                     | 2.14.0           | 0.14.0  | 1.14              | 0.14.0              | 0.14.0               | n/a     |
| 0.13.1                     | 2.11.0           | n/a     | 1.13              | 0.12.1              | 0.13.0               | n/a     |
| 0.13.0                     | 2.11.0           | n/a     | 1.13              | 0.12.1              | 0.13.0               | n/a     |
| 0.12.0                     | 2.10.0           | n/a     | 1.12              | 0.12.1              | 0.12.0               | n/a     |
| 0.11.0                     | 2.8.0            | n/a     | 1.11              | 0.9.0               | 0.11.0               | n/a     |
| 0.9.0                      | 2.6.0            | n/a     | 1.9               | n/a                 | n/a                  | n/a     |
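The compatibility table can also be kept as data, for example to generate version pins for a requirements file. The sketch below encodes only three rows of the table as an excerpt; the dictionary layout and the `pinned_requirements` helper are hypothetical, and the generated pins are a starting point rather than a guarantee that pip will resolve them together.

```python
from typing import Dict, List

# Excerpt of the compatibility table; "tensorflow" holds the tested
# 1.x and 2.x release lines for each TFDV version.
COMPATIBLE: Dict[str, dict] = {
    "0.23.0": {"apache-beam[gcp]": "2.23.0", "pyarrow": "0.17.0",
               "tensorflow": ("1.15", "2.3"), "tensorflow-metadata": "0.23.0",
               "tensorflow-transform": "0.23.0", "tfx-bsl": "0.23.0"},
    "0.22.2": {"apache-beam[gcp]": "2.20.0", "pyarrow": "0.16.0",
               "tensorflow": ("1.15", "2.2"), "tensorflow-metadata": "0.22.0",
               "tensorflow-transform": "0.22.0", "tfx-bsl": "0.22.1"},
    "0.21.5": {"apache-beam[gcp]": "2.17.0", "pyarrow": "0.15.0",
               "tensorflow": ("1.15", "2.1"), "tensorflow-metadata": "0.21.0",
               "tensorflow-transform": "0.21.1", "tfx-bsl": "0.21.3"},
}


def pinned_requirements(tfdv_version: str, tf_major: int = 2) -> List[str]:
    """Build requirement strings for one row of the table."""
    row = COMPATIBLE[tfdv_version]
    pins = [f"tensorflow-data-validation=={tfdv_version}"]
    for package, version in row.items():
        if package == "tensorflow":
            # Pick the tested release line for the requested major version.
            line = next(v for v in version if v.startswith(f"{tf_major}."))
            pins.append(f"tensorflow=={line}.*")
        else:
            pins.append(f"{package}=={version}")
    return pins


print("\n".join(pinned_requirements("0.23.0")))
```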

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distributions

  • tensorflow_data_validation-0.23.0-cp37-cp37m-win_amd64.whl (1.1 MB): CPython 3.7m, Windows x86-64
  • tensorflow_data_validation-0.23.0-cp37-cp37m-manylinux2010_x86_64.whl (2.5 MB): CPython 3.7m, manylinux: glibc 2.12+ x86-64
  • tensorflow_data_validation-0.23.0-cp37-cp37m-macosx_10_9_x86_64.whl (3.0 MB): CPython 3.7m, macOS 10.9+ x86-64
  • tensorflow_data_validation-0.23.0-cp36-cp36m-win_amd64.whl (1.1 MB): CPython 3.6m, Windows x86-64
  • tensorflow_data_validation-0.23.0-cp36-cp36m-manylinux2010_x86_64.whl (2.5 MB): CPython 3.6m, manylinux: glibc 2.12+ x86-64
  • tensorflow_data_validation-0.23.0-cp36-cp36m-macosx_10_9_x86_64.whl (3.0 MB): CPython 3.6m, macOS 10.9+ x86-64
  • tensorflow_data_validation-0.23.0-cp35-cp35m-win_amd64.whl (1.1 MB): CPython 3.5m, Windows x86-64
  • tensorflow_data_validation-0.23.0-cp35-cp35m-manylinux2010_x86_64.whl (2.5 MB): CPython 3.5m, manylinux: glibc 2.12+ x86-64
  • tensorflow_data_validation-0.23.0-cp35-cp35m-macosx_10_14_x86_64.whl (3.0 MB): CPython 3.5m, macOS 10.14+ x86-64

File details

Details for the file tensorflow_data_validation-0.23.0-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.23.0-cp37-cp37m-win_amd64.whl
  • Size: 1.1 MB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.0

File hashes

Hashes for tensorflow_data_validation-0.23.0-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 baea385051f9c1ea75edb3ee5adf78965b0b72e0d7b9a6eb19b2af09585c24fb
MD5 40d541771ea80d97919b2fe8220c0d35
BLAKE2b-256 581ddf400c50043c925df81bc5c5ccc8d36c3a4b6ef20a2323c26eb4cedf20c4

See more details on using hashes here.
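A downloaded wheel can be checked against the SHA256 digest listed for it. The following is a minimal sketch using Python's standard `hashlib`; the helper names `sha256_of` and `matches` are hypothetical, and the demonstration uses a throwaway temporary file rather than a real wheel.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Return True if the file's SHA256 digest equals the expected value."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demonstration with a temporary stand-in file (not a real wheel).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.whl"
    sample.write_bytes(b"not a real wheel")
    expected = hashlib.sha256(b"not a real wheel").hexdigest()
    print(matches(sample, expected))  # True
```

For a real download, pass the path of the wheel and the SHA256 value listed for that file. pip also offers a hash-checking mode (`--require-hashes`) that performs this comparison during installation.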

File details

Details for the file tensorflow_data_validation-0.23.0-cp37-cp37m-manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.23.0-cp37-cp37m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 429e5c589ec5e48fa4a6073f32d4d69be55ee1ef830bb06d64fb916da01500d0
MD5 33861499b7c7162aa891fddcd66fac1a
BLAKE2b-256 f11efa9cfe298e75524bb919fd3404212a03f9bfd30686870c3a3e8d254f90f7

File details

Details for the file tensorflow_data_validation-0.23.0-cp37-cp37m-macosx_10_9_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.23.0-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 cdaf8ed7bf81c6ddcc6a036eca858a0097f4320913d2b07d753c803ccb48b6b5
MD5 e22286b22952f277607a422ade73d2b6
BLAKE2b-256 cc2cd65e34e08bcd7d27b860e46cd6e9c86ec060accda53d66f88c6b964685f4

File details

Details for the file tensorflow_data_validation-0.23.0-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.23.0-cp36-cp36m-win_amd64.whl
  • Size: 1.1 MB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.6.1

File hashes

Hashes for tensorflow_data_validation-0.23.0-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 f06fea9b200a6220393dd665fad57c755c8d6cf16eb083e5ab06dfe6cdfaa93c
MD5 92693e0c271b4904657c5ce3507d421e
BLAKE2b-256 77268e482666de2c359522ef04f5281d15f04b4361bba3e73bcd26177f557465

File details

Details for the file tensorflow_data_validation-0.23.0-cp36-cp36m-manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.23.0-cp36-cp36m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 621dde0731a3d7d563faca831f7a386b181efb5a45eb780ae94b62083f6677f2
MD5 d141e213cef1af9a722a46429afa3f8c
BLAKE2b-256 caaad57f4fb9840e8a35e44b2d8912809420d657344bd9dc3008606344e851e1

File details

Details for the file tensorflow_data_validation-0.23.0-cp36-cp36m-macosx_10_9_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.23.0-cp36-cp36m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 bcf3cce375afd5d2d8cbdf0761d8c3662d01b09af60557ba8dfaf4b1b472356d
MD5 1204c9d1d03053246922e9080eb60d5f
BLAKE2b-256 43223013b07f7d8855fec4282cfc47902169fd2beb227b31cf468938c9a4af27

File details

Details for the file tensorflow_data_validation-0.23.0-cp35-cp35m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.23.0-cp35-cp35m-win_amd64.whl
  • Size: 1.1 MB
  • Tags: CPython 3.5m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.15.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.5.3

File hashes

Hashes for tensorflow_data_validation-0.23.0-cp35-cp35m-win_amd64.whl
Algorithm Hash digest
SHA256 be28a3e8b75ae1c76dbc7476985a6b80a4df592e732e2850737f466f6c9a6dab
MD5 9da2ec046cfb2573275535e6a0f31059
BLAKE2b-256 95cef1f710312cb4bc54edc8881f0ff03d75f2d5b8559a7f1134d59084bd12ce

File details

Details for the file tensorflow_data_validation-0.23.0-cp35-cp35m-manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.23.0-cp35-cp35m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 bc99d8ece2bf602a00613afc9a9b562755bb8fac894cd125ea92d8375ee3fee8
MD5 673dceca62a07519c47da80fd4da45aa
BLAKE2b-256 8b422ec8852ce193d0c7017bbf97b48c6cd16355877cc7eecf31914a1f330dda

File details

Details for the file tensorflow_data_validation-0.23.0-cp35-cp35m-macosx_10_14_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.23.0-cp35-cp35m-macosx_10_14_x86_64.whl
Algorithm Hash digest
SHA256 06313b3ac87b3a35981494d658bea49126e716100a8e9449d5f7ab5f6c0b7911
MD5 e46984fd382c6422180736e6510c36bd
BLAKE2b-256 ebf18db37977431f9019e560efc20c016f8453e51c6f8697a632b100950c8d8e
