Skip to main content

A library for exploring and validating machine learning data.

Project description

TensorFlow Data Validation

Python PyPI Documentation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets)
  • Automated data-schema generation to describe expectations about data like required values, ranges, and vocabularies
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify anomalies, such as missing features, out-of-range values, or wrong feature types, to name a few.
  • An anomalies viewer so that you can see what features have anomalies and learn more in order to correct them.

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Caution: TFDV may be backwards incompatible before version 1.0.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation

Nightly Packages

TFDV also hosts nightly packages at https://pypi-nightly.tensorflow.org on Google Cloud. To install the latest nightly package, please use the following command:

pip install -i https://pypi-nightly.tensorflow.org/simple tensorflow-data-validation

This will install the nightly packages for the major dependencies of TFDV such as TFX Basic Shared Libraries (TFX-BSL) and TensorFlow Metadata (TFMD).

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

Please first install docker and docker-compose by following the directions: docker; docker-compose.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {35, 36, 37, 38}.

A wheel will be produced under dist/.

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

TFDV wheel is Python version dependent -- to build the pip package that works for a specific Python version, use that Python binary to run:

python setup.py bdist_wheel

You can find the generated .whl file in the dist subdirectory.

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.14.6 (Mojave) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Notable Dependencies

TensorFlow is required.

Apache Beam is required; it's the way that efficient distributed computation is supported. By default, Apache Beam runs in local mode but can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized numpy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

tensorflow-data-validation apache-beam[gcp] pyarrow tensorflow tensorflow-metadata tensorflow-transform tfx-bsl
GitHub master 2.28.0 2.0.0 nightly (1.x/2.x) 0.29.0 n/a 0.29.0
0.29.0 2.28.0 2.0.0 1.15 / 2.4 0.29.0 n/a 0.29.0
0.28.0 2.28.0 2.0.0 1.15 / 2.4 0.28.0 n/a 0.28.1
0.27.0 2.27.0 2.0.0 1.15 / 2.4 0.27.0 n/a 0.27.0
0.26.0 2.25.0 0.17.0 1.15 / 2.3 0.26.0 0.26.0 0.26.0
0.25.0 2.25.0 0.17.0 1.15 / 2.3 0.25.0 0.25.0 0.25.0
0.24.1 2.24.0 0.17.0 1.15 / 2.3 0.24.0 0.24.1 0.24.1
0.24.0 2.23.0 0.17.0 1.15 / 2.3 0.24.0 0.24.0 0.24.0
0.23.1 2.24.0 0.17.0 1.15 / 2.3 0.23.0 0.23.0 0.23.0
0.23.0 2.23.0 0.17.0 1.15 / 2.3 0.23.0 0.23.0 0.23.0
0.22.2 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.1
0.22.1 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.1
0.22.0 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.0
0.21.5 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.1 0.21.3
0.21.4 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.1 0.21.3
0.21.2 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.21.1 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.21.0 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.15.0 2.16.0 0.14.0 1.15 / 2.0 0.15.0 0.15.0 0.15.0
0.14.1 2.14.0 0.14.0 1.14 0.14.0 0.14.0 n/a
0.14.0 2.14.0 0.14.0 1.14 0.14.0 0.14.0 n/a
0.13.1 2.11.0 n/a 1.13 0.12.1 0.13.0 n/a
0.13.0 2.11.0 n/a 1.13 0.12.1 0.13.0 n/a
0.12.0 2.10.0 n/a 1.12 0.12.1 0.12.0 n/a
0.11.0 2.8.0 n/a 1.11 0.9.0 0.11.0 n/a
0.9.0 2.6.0 n/a 1.9 n/a n/a n/a

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.

Links

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

tensorflow_data_validation-0.29.0-cp38-cp38-win_amd64.whl (1.1 MB view details)

Uploaded CPython 3.8 Windows x86-64

tensorflow_data_validation-0.29.0-cp38-cp38-manylinux2010_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-0.29.0-cp38-cp38-macosx_10_9_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.8 macOS 10.9+ x86-64

tensorflow_data_validation-0.29.0-cp37-cp37m-win_amd64.whl (1.1 MB view details)

Uploaded CPython 3.7m Windows x86-64

tensorflow_data_validation-0.29.0-cp37-cp37m-manylinux2010_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.7m manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-0.29.0-cp37-cp37m-macosx_10_9_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.7m macOS 10.9+ x86-64

tensorflow_data_validation-0.29.0-cp36-cp36m-win_amd64.whl (1.1 MB view details)

Uploaded CPython 3.6m Windows x86-64

tensorflow_data_validation-0.29.0-cp36-cp36m-manylinux2010_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.6m manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-0.29.0-cp36-cp36m-macosx_10_9_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.6m macOS 10.9+ x86-64

File details

Details for the file tensorflow_data_validation-0.29.0-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.29.0-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.8.5

File hashes

Hashes for tensorflow_data_validation-0.29.0-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 2697eed4d5d79561e82f39be7dc91f2ea9286f18edff167bedd555e7ae8a5aca
MD5 d25294f75cf859141a3b2353a9e0932f
BLAKE2b-256 202ad344b48ff7a79c2863d2bf5c7e45ec6ce3fe287300f10baaa38c89e0b179

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-0.29.0-cp38-cp38-manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-0.29.0-cp38-cp38-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 f137911f9f0d96250ad9a5a5d279a7c628bdcb4ba2ada9cea0dfb051d800997e
MD5 9f8957ae699b888ce6b86135690939bc
BLAKE2b-256 50bc02d8187422d502b2fadac93c14c6a7d7d4e2f8afda3ecbb9c2a669115e20

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-0.29.0-cp38-cp38-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-0.29.0-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 e1d23bc20f4b5690fe6802da77c0c1c52ef67f0c8235923ebd80cb0a4cd6787e
MD5 807b3e5bbeee3d5011353e743ffe62ff
BLAKE2b-256 ff569057c300b32434a81ebd7e0448f2ef08d972495d0169aa78137cdbdcaa47

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-0.29.0-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.29.0-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.7.0

File hashes

Hashes for tensorflow_data_validation-0.29.0-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 da1b6f893e5668faeeda09b34e470d7177c118e5bf300934dd036a7639e922e0
MD5 8fabe5539b98daedb7d55f3c3359aa7e
BLAKE2b-256 f7fdafd1251702ea58736f31c1200da58ec744c849df833c8fd1a92a608ac3fd

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-0.29.0-cp37-cp37m-manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-0.29.0-cp37-cp37m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 b972a2b3007e3b519379940d1970206e4ffa1947a27adc344475f4818227db3c
MD5 05afaed520cc0f0f1d86201ed482584d
BLAKE2b-256 01f977d1ac1ed0e1a6729c09e4d36cc5a3b176eb01d77d8e71c32631327c60c1

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-0.29.0-cp37-cp37m-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-0.29.0-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 41b889f0cca1c7d7fd42d2ff8ecbbce0ed4b30341dd8b80d769c80f7d1ce41d4
MD5 53c559cdc332303255220461465be865
BLAKE2b-256 33c841f1e9fafd81ca87ed120f6e9ce1427d3adf04461fa3bbf6e5c68a0f715f

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-0.29.0-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.29.0-cp36-cp36m-win_amd64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.6.1

File hashes

Hashes for tensorflow_data_validation-0.29.0-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 fb407579c9bec174af382248dcd67a904c7632e30ba50bce6094b1e1f83a173e
MD5 a98ae4434c89942f8dd9a16555a5360b
BLAKE2b-256 d3275bc3780c1710a9b141916f6b4f1e151b1e3bd4ceb3476ebd5d03425c08a1

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-0.29.0-cp36-cp36m-manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-0.29.0-cp36-cp36m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 5ef6714add65383196c823ad772f46024c82237dec3c6b89601cf0f145e31844
MD5 77293ada766db45dbf08ba6c7e46f1af
BLAKE2b-256 f51fcf79ee8f87787c80288fc0318c95a0de7c4ed5fabbee5645a0275fbd512e

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-0.29.0-cp36-cp36m-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-0.29.0-cp36-cp36m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 2d4b83ec9d97f5342a1228d82f0b249614b8df5207b75cc23523f335d406fb3e
MD5 687d13e4974b9e02d17ca54f499e278f
BLAKE2b-256 2b0a5323b8b4253a6008b5ea11e8709dd63dd4150088ec8301925ddb79fdb6b7

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page