A library for exploring and validating machine learning data.

TensorFlow Data Validation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets).
  • Automated data-schema generation to describe expectations about data, such as required values, ranges, and vocabularies.
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify problems such as missing features, out-of-range values, or wrong feature types.
  • An anomalies viewer that shows which features have anomalies and gives enough detail to correct them.

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.
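The statistics, schema, and anomaly-detection features listed above fit together roughly as in the sketch below. It is a minimal illustration, not the official workflow: the helper name find_anomalies is hypothetical, and it assumes TFDV and pandas are installed and that both inputs are pandas DataFrames.

```python
def find_anomalies(train_df, eval_df):
    """Infer a schema from train_df and report anomalies found in eval_df.

    Both arguments are pandas DataFrames. Returns a dict mapping each
    anomalous feature name to TFDV's short description of the problem.
    """
    # Imported inside the function so this sketch can be loaded even where
    # TFDV is not installed; calling it does require the package.
    import tensorflow_data_validation as tfdv

    # 1. Summary statistics of the training data.
    train_stats = tfdv.generate_statistics_from_dataframe(train_df)
    # 2. A schema describing what the training data looks like.
    schema = tfdv.infer_schema(statistics=train_stats)
    # 3. Statistics of the evaluation data, validated against that schema.
    eval_stats = tfdv.generate_statistics_from_dataframe(eval_df)
    anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)

    return {
        feature: info.short_description
        for feature, info in anomalies.anomaly_info.items()
    }
```

In a notebook, tfdv.display_schema(schema) and tfdv.display_anomalies(anomalies) render the schema and anomalies viewers instead of returning a plain dict.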

Caution: TFDV may be backwards incompatible before version 1.0.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation

Nightly Packages

TFDV also hosts nightly packages at https://pypi-nightly.tensorflow.org on Google Cloud. To install the latest nightly package, please use the following command:

pip install -i https://pypi-nightly.tensorflow.org/simple tensorflow-data-validation

This will install the nightly packages for the major dependencies of TFDV such as TFX Basic Shared Libraries (TFX-BSL) and TensorFlow Metadata (TFMD).

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

First install Docker and docker-compose by following their respective installation directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {35, 36, 37, 38}.

A wheel will be produced under dist/.
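The PYTHON_VERSION values above are the interpreter's major and minor version numbers joined without a dot. As a small sketch, the value matching the interpreter you are currently running could be derived as follows; the helper name python_version_tag is hypothetical and not part of TFDV.

```python
import sys

# Values accepted by the Docker build, as listed above.
SUPPORTED_PYTHON_VERSIONS = {"35", "36", "37", "38"}


def python_version_tag(version_info=sys.version_info):
    """Join the major and minor version without a dot, e.g. (3, 8) -> "38"."""
    return f"{version_info[0]}{version_info[1]}"


if __name__ == "__main__":
    tag = python_version_tag()
    if tag in SUPPORTED_PYTHON_VERSIONS:
        # This line can be pasted into the shell before the docker-compose step.
        print(f"export PYTHON_VERSION={tag}")
    else:
        print(f"Python tag {tag} is not one of {sorted(SUPPORTED_PYTHON_VERSIONS)}")
```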

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following the NumPy installation directions.

Install Bazel

If Bazel is not installed on your system, install it now by following the Bazel installation directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

The TFDV wheel is specific to a Python version. To build a pip package that works for a particular Python version, use that Python binary to run:

python setup.py bdist_wheel

You can find the generated .whl file in the dist subdirectory.
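Because the wheel is tied to one Python version, it can help to confirm that a file in dist matches the interpreter before installing it. The sketch below reads the tags from a wheel filename, whose last three dash-separated fields are the Python, ABI, and platform tags. The helper names wheel_tags and matches_interpreter are hypothetical, and the check is simplified: it only compares CPython tags such as cp38 and ignores compressed tag sets like py2.py3.

```python
import sys


def wheel_tags(filename):
    """Split a wheel filename into (python_tag, abi_tag, platform_tag)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) < 5:
        raise ValueError(f"unexpected wheel filename: {filename}")
    # The final three fields are the python, ABI and platform tags.
    python_tag, abi_tag, platform_tag = parts[-3:]
    return python_tag, abi_tag, platform_tag


def matches_interpreter(filename, version_info=sys.version_info):
    """Return True if the wheel's python tag is the CPython tag of version_info."""
    python_tag, _, _ = wheel_tags(filename)
    return python_tag == f"cp{version_info[0]}{version_info[1]}"
```

For example, matches_interpreter("tensorflow_data_validation-0.28.0-cp38-cp38-manylinux2010_x86_64.whl", (3, 8)) evaluates to True, while the same call with (3, 7) evaluates to False.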

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.14.6 (Mojave) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Notable Dependencies

TensorFlow is required.

Apache Beam is required; TFDV uses it to support efficient distributed computation. By default, Apache Beam runs in local mode, but it can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.

Apache Arrow is also required. TFDV uses Arrow to represent data internally so that it can use vectorized NumPy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

tensorflow-data-validation  apache-beam[gcp]  pyarrow  tensorflow         tensorflow-metadata  tensorflow-transform  tfx-bsl
GitHub master               2.28.0            2.0.0    nightly (1.x/2.x)  0.28.0               n/a                   0.28.1
0.28.0                      2.28.0            2.0.0    1.15 / 2.4         0.28.0               n/a                   0.28.1
0.27.0                      2.27.0            2.0.0    1.15 / 2.4         0.27.0               n/a                   0.27.0
0.26.0                      2.25.0            0.17.0   1.15 / 2.3         0.26.0               0.26.0                0.26.0
0.25.0                      2.25.0            0.17.0   1.15 / 2.3         0.25.0               0.25.0                0.25.0
0.24.1                      2.24.0            0.17.0   1.15 / 2.3         0.24.0               0.24.1                0.24.1
0.24.0                      2.23.0            0.17.0   1.15 / 2.3         0.24.0               0.24.0                0.24.0
0.23.1                      2.24.0            0.17.0   1.15 / 2.3         0.23.0               0.23.0                0.23.0
0.23.0                      2.23.0            0.17.0   1.15 / 2.3         0.23.0               0.23.0                0.23.0
0.22.2                      2.20.0            0.16.0   1.15 / 2.2         0.22.0               0.22.0                0.22.1
0.22.1                      2.20.0            0.16.0   1.15 / 2.2         0.22.0               0.22.0                0.22.1
0.22.0                      2.20.0            0.16.0   1.15 / 2.2         0.22.0               0.22.0                0.22.0
0.21.5                      2.17.0            0.15.0   1.15 / 2.1         0.21.0               0.21.1                0.21.3
0.21.4                      2.17.0            0.15.0   1.15 / 2.1         0.21.0               0.21.1                0.21.3
0.21.2                      2.17.0            0.15.0   1.15 / 2.1         0.21.0               0.21.0                0.21.0
0.21.1                      2.17.0            0.15.0   1.15 / 2.1         0.21.0               0.21.0                0.21.0
0.21.0                      2.17.0            0.15.0   1.15 / 2.1         0.21.0               0.21.0                0.21.0
0.15.0                      2.16.0            0.14.0   1.15 / 2.0         0.15.0               0.15.0                0.15.0
0.14.1                      2.14.0            0.14.0   1.14               0.14.0               0.14.0                n/a
0.14.0                      2.14.0            0.14.0   1.14               0.14.0               0.14.0                n/a
0.13.1                      2.11.0            n/a      1.13               0.12.1               0.13.0                n/a
0.13.0                      2.11.0            n/a      1.13               0.12.1               0.13.0                n/a
0.12.0                      2.10.0            n/a      1.12               0.12.1               0.12.0                n/a
0.11.0                      2.8.0             n/a      1.11               0.9.0                0.11.0                n/a
0.9.0                       2.6.0             n/a      1.9                n/a                  n/a                   n/a
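One way to use the table above is to turn a row into pinned pip requirements for a reproducible environment. The sketch below copies three rows into a dictionary and builds the requirement strings. The names COMPATIBLE_VERSIONS and pip_requirements are hypothetical, and exact pinning is a choice made for this illustration: the table only states which combinations were tested. TensorFlow is left out because each row allows either a 1.x or a 2.x release.

```python
# A few rows copied from the compatibility table; extend as needed.
COMPATIBLE_VERSIONS = {
    "0.28.0": {
        "apache-beam[gcp]": "2.28.0",
        "pyarrow": "2.0.0",
        "tensorflow-metadata": "0.28.0",
        "tfx-bsl": "0.28.1",
    },
    "0.27.0": {
        "apache-beam[gcp]": "2.27.0",
        "pyarrow": "2.0.0",
        "tensorflow-metadata": "0.27.0",
        "tfx-bsl": "0.27.0",
    },
    "0.26.0": {
        "apache-beam[gcp]": "2.25.0",
        "pyarrow": "0.17.0",
        "tensorflow-metadata": "0.26.0",
        "tensorflow-transform": "0.26.0",
        "tfx-bsl": "0.26.0",
    },
}


def pip_requirements(tfdv_version):
    """Return pinned requirement strings for one tested TFDV release."""
    if tfdv_version not in COMPATIBLE_VERSIONS:
        raise KeyError(f"no table row recorded for TFDV {tfdv_version}")
    pins = COMPATIBLE_VERSIONS[tfdv_version]
    requirements = [f"tensorflow-data-validation=={tfdv_version}"]
    # Sort the dependencies so the output is stable across runs.
    requirements.extend(f"{name}=={version}" for name, version in sorted(pins.items()))
    return requirements


if __name__ == "__main__":
    # Print lines suitable for a requirements.txt file.
    print("\n".join(pip_requirements("0.28.0")))
```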

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.
