
A library for exploring and validating machine learning data.


TensorFlow Data Validation


TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets).
  • Automated data-schema generation to describe expectations about data, such as required values, ranges, and vocabularies.
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify problems such as missing features, out-of-range values, or wrong feature types.
  • An anomalies viewer so that you can see what features have anomalies and learn more in order to correct them.

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.
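The features listed above map onto a short workflow: compute statistics, infer a schema, then validate new data against it. The following is a minimal sketch rather than an official recipe. The function name `validate_csv` and its path arguments are hypothetical; the TFDV calls are from the public API, but check the get started guide for the release you install.

```python
def validate_csv(train_path, eval_path):
    """Validate the CSV at eval_path against a schema inferred from train_path.

    Both arguments are hypothetical file paths supplied by the caller.
    """
    # Imported lazily so this module can be loaded without TFDV installed.
    import tensorflow_data_validation as tfdv

    # Summary statistics of the training data.
    train_stats = tfdv.generate_statistics_from_csv(data_location=train_path)

    # Schema describing expected types, ranges, and vocabularies.
    schema = tfdv.infer_schema(statistics=train_stats)

    # Statistics of the data to check, compared against the schema.
    eval_stats = tfdv.generate_statistics_from_csv(data_location=eval_path)
    return tfdv.validate_statistics(statistics=eval_stats, schema=schema)
```

In a notebook, `tfdv.visualize_statistics`, `tfdv.display_schema`, and `tfdv.display_anomalies` render the corresponding viewers for these results.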

Caution: TFDV releases before version 1.0 may introduce backward-incompatible changes.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

First install Docker and docker-compose by following their official installation directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {35, 36, 37, 38}.

A wheel will be produced under dist/.
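The PYTHON_VERSION value is the interpreter's major and minor version written without a dot. As an illustration only, the sketch below derives that value for the running interpreter and checks it against the list given above. The helper names `python_version_tag` and `is_supported` are made up for this example.

```python
import sys

# Values listed in the build instructions above.
SUPPORTED_TAGS = {"35", "36", "37", "38"}


def python_version_tag(version_info=sys.version_info):
    """Return the major and minor version digits, e.g. (3, 7) -> '37'."""
    return f"{version_info[0]}{version_info[1]}"


def is_supported(tag):
    """Return True if the tag is one of the documented PYTHON_VERSION values."""
    return tag in SUPPORTED_TAGS


if __name__ == "__main__":
    tag = python_version_tag()
    status = "supported" if is_supported(tag) else "not in the documented list"
    print(f"PYTHON_VERSION={tag} ({status})")
```

The printed value can be exported as PYTHON_VERSION before running the docker-compose commands.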

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

The TFDV wheel is specific to a Python version. To build a pip package for a given Python version, run the following with that Python binary:

python setup.py bdist_wheel

You can find the generated .whl file in the dist subdirectory.
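Because the wheel is tied to a Python version, its filename records which interpreter it targets. The sketch below splits a wheel filename into its components following the standard wheel naming convention (distribution, version, optional build tag, Python tag, ABI tag, platform tag). The names `WheelTags` and `parse_wheel_filename` are hypothetical helpers, and the sketch assumes no component contains a hyphen.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename):
    """Split a wheel filename into its name, version, and compatibility tags."""
    suffix = ".whl"
    if not filename.endswith(suffix):
        raise ValueError(f"not a wheel file: {filename}")
    parts = filename[: -len(suffix)].split("-")
    # Five parts normally; six when an optional build tag follows the version.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename layout: {filename}")
    return WheelTags(parts[0], parts[1], parts[-3], parts[-2], parts[-1])


if __name__ == "__main__":
    name = "tensorflow_data_validation-0.24.1-cp37-cp37m-manylinux2010_x86_64.whl"
    print(parse_wheel_filename(name))
```

Here a Python tag of `cp37` indicates a wheel built for CPython 3.7, so installing it with a different interpreter version is expected to fail.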

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.14.6 (Mojave) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Notable Dependencies

TensorFlow is required.

Apache Beam is required; TFDV uses it to run efficient distributed computation. By default, Apache Beam runs in local mode, but it can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.
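Switching between local and distributed execution is mostly a matter of the pipeline options passed to Beam. The following sketch builds the option flags for either case and hands them to TFDV. The helper names `beam_flags` and `generate_stats` and all argument values are assumptions for illustration; `--runner`, `--project`, `--region`, and `--temp_location` are standard Beam options, and `pipeline_options` is a parameter of `tfdv.generate_statistics_from_tfrecord`.

```python
def beam_flags(runner="DirectRunner", project=None, region=None, temp_location=None):
    """Build Beam command-line style flags for local or Dataflow execution."""
    flags = [f"--runner={runner}"]
    if runner == "DataflowRunner":
        required = {"project": project, "region": region, "temp_location": temp_location}
        missing = [name for name, value in required.items() if not value]
        if missing:
            raise ValueError(f"Dataflow needs: {', '.join(missing)}")
        flags += [f"--{name}={value}" for name, value in required.items()]
    return flags


def generate_stats(data_location, output_path, flags):
    """Compute statistics over TFRecord files with the given Beam flags."""
    # Imported lazily so the flag helper works without Beam or TFDV installed.
    from apache_beam.options.pipeline_options import PipelineOptions
    import tensorflow_data_validation as tfdv

    return tfdv.generate_statistics_from_tfrecord(
        data_location=data_location,
        output_path=output_path,
        pipeline_options=PipelineOptions(flags),
    )
```

With no arguments, `beam_flags()` selects the local DirectRunner. A Dataflow run would also need TFDV to be available on the workers, which this sketch does not set up.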

Apache Arrow is also required. TFDV uses Arrow to represent data internally so that it can use vectorized NumPy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

| tensorflow-data-validation | apache-beam[gcp] | pyarrow | tensorflow | tensorflow-metadata | tensorflow-transform | tfx-bsl |
|---|---|---|---|---|---|---|
| GitHub master | 2.24.0 | 0.17.0 | nightly (1.x/2.x) | 0.24.0 | 0.24.1 | 0.24.1 |
| 0.24.1 | 2.24.0 | 0.17.0 | 1.15 / 2.3 | 0.24.0 | 0.24.1 | 0.24.1 |
| 0.24.0 | 2.23.0 | 0.17.0 | 1.15 / 2.3 | 0.24.0 | 0.24.0 | 0.24.0 |
| 0.23.0 | 2.23.0 | 0.17.0 | 1.15 / 2.3 | 0.23.0 | 0.23.0 | 0.23.0 |
| 0.22.2 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.1 |
| 0.22.1 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.1 |
| 0.22.0 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.0 |
| 0.21.5 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.1 | 0.21.3 |
| 0.21.4 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.1 | 0.21.3 |
| 0.21.2 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0 |
| 0.21.1 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0 |
| 0.21.0 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0 |
| 0.15.0 | 2.16.0 | 0.14.0 | 1.15 / 2.0 | 0.15.0 | 0.15.0 | 0.15.0 |
| 0.14.1 | 2.14.0 | 0.14.0 | 1.14 | 0.14.0 | 0.14.0 | n/a |
| 0.14.0 | 2.14.0 | 0.14.0 | 1.14 | 0.14.0 | 0.14.0 | n/a |
| 0.13.1 | 2.11.0 | n/a | 1.13 | 0.12.1 | 0.13.0 | n/a |
| 0.13.0 | 2.11.0 | n/a | 1.13 | 0.12.1 | 0.13.0 | n/a |
| 0.12.0 | 2.10.0 | n/a | 1.12 | 0.12.1 | 0.12.0 | n/a |
| 0.11.0 | 2.8.0 | n/a | 1.11 | 0.9.0 | 0.11.0 | n/a |
| 0.9.0 | 2.6.0 | n/a | 1.9 | n/a | n/a | n/a |
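One way to use the compatibility table is to turn a row into pinned requirements. The sketch below transcribes only the three most recent released rows and emits `pip`-style pins for a chosen TFDV version. The names `COMPATIBLE` and `pinned_requirements` are hypothetical, and TensorFlow is left out because the table lists two acceptable versions per row.

```python
# Subset of the compatibility table, keyed by tensorflow-data-validation version.
COMPATIBLE = {
    "0.24.1": {
        "apache-beam[gcp]": "2.24.0",
        "pyarrow": "0.17.0",
        "tensorflow-metadata": "0.24.0",
        "tensorflow-transform": "0.24.1",
        "tfx-bsl": "0.24.1",
    },
    "0.24.0": {
        "apache-beam[gcp]": "2.23.0",
        "pyarrow": "0.17.0",
        "tensorflow-metadata": "0.24.0",
        "tensorflow-transform": "0.24.0",
        "tfx-bsl": "0.24.0",
    },
    "0.23.0": {
        "apache-beam[gcp]": "2.23.0",
        "pyarrow": "0.17.0",
        "tensorflow-metadata": "0.23.0",
        "tensorflow-transform": "0.23.0",
        "tfx-bsl": "0.23.0",
    },
}


def pinned_requirements(tfdv_version):
    """Return requirement lines for a TFDV version found in COMPATIBLE."""
    pins = COMPATIBLE[tfdv_version]  # KeyError for rows not transcribed here
    lines = [f"tensorflow-data-validation=={tfdv_version}"]
    lines += [f"{package}=={version}" for package, version in pins.items()]
    return lines


if __name__ == "__main__":
    print("\n".join(pinned_requirements("0.24.1")))
```

The output can be saved to a requirements file; as the text notes, other untested combinations may also work, so these pins are a conservative starting point.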

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.

