TensorFlow Data Validation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets).
  • Automated data-schema generation to describe expectations about data, such as required values, ranges, and vocabularies.
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify problems such as missing features, out-of-range values, or wrong feature types.
  • An anomalies viewer so that you can see what features have anomalies and learn more in order to correct them.
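The features above can be strung together roughly as follows. This is a minimal sketch, not the project's reference workflow: the function name `validate_against_training_data` and the CSV paths are hypothetical, and the calls assume the `generate_statistics_from_csv`, `infer_schema`, and `validate_statistics` functions exposed by the `tensorflow_data_validation` package.

```python
def validate_against_training_data(train_csv, eval_csv):
    """Infer a schema from training data and check evaluation data against it.

    Hypothetical helper for illustration; train_csv and eval_csv are paths
    to CSV files with a header row.
    """
    # Imported lazily so this module can be loaded without TFDV installed.
    import tensorflow_data_validation as tfdv

    # Summary statistics for the training data.
    train_stats = tfdv.generate_statistics_from_csv(data_location=train_csv)
    # Schema describing the expectations inferred from those statistics.
    schema = tfdv.infer_schema(statistics=train_stats)
    # Statistics for the data to be validated.
    eval_stats = tfdv.generate_statistics_from_csv(data_location=eval_csv)
    # Anomalies: places where the evaluation data departs from the schema.
    anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
    return schema, anomalies


# Example call (paths are placeholders):
# schema, anomalies = validate_against_training_data("train.csv", "eval.csv")
```

In a notebook, the returned objects can then be passed to `tfdv.display_schema` and `tfdv.display_anomalies` to render the schema and anomalies viewers.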

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Caution: TFDV may be backwards incompatible before version 1.0.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation
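To confirm that the install succeeded, the installed version can be queried from Python. The helper below is a small sketch; `installed_version` is a hypothetical name, and the fallback to `pkg_resources` assumes setuptools is available on interpreters older than Python 3.8.

```python
def installed_version(package="tensorflow-data-validation"):
    """Return the installed version string, or None if the package is absent."""
    try:
        from importlib import metadata  # available on Python 3.8+
    except ImportError:
        metadata = None

    if metadata is not None:
        try:
            return metadata.version(package)
        except metadata.PackageNotFoundError:
            return None

    # Older interpreters: fall back to pkg_resources from setuptools.
    import pkg_resources
    try:
        return pkg_resources.get_distribution(package).version
    except pkg_resources.DistributionNotFound:
        return None


if __name__ == "__main__":
    print(installed_version())
```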

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

First install Docker and docker-compose by following their official installation directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

When building on Python 2, make sure to strip the Python types in the source code using the following commands:

pip install strip-hints
python tensorflow_data_validation/tools/strip_type_hints.py tensorflow_data_validation/

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {27, 35, 36, 37}.

A wheel will be produced under dist/.
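If the build is scripted, the two commands above can be wrapped so that an unsupported PYTHON_VERSION is rejected before Docker is invoked. This is a sketch only; `docker_build_commands` and `build_wheel` are hypothetical helper names, and the `use_sudo` switch is an assumption for setups where Docker does not need sudo.

```python
import subprocess

# Values listed in the build instructions for PYTHON_VERSION.
SUPPORTED_PYTHON_VERSIONS = ("27", "35", "36", "37")


def docker_build_commands(python_version, use_sudo=True):
    """Return the two docker-compose commands as argument lists."""
    if python_version not in SUPPORTED_PYTHON_VERSIONS:
        raise ValueError(
            "PYTHON_VERSION must be one of %s, got %r"
            % (", ".join(SUPPORTED_PYTHON_VERSIONS), python_version)
        )
    prefix = ["sudo"] if use_sudo else []
    return [
        prefix + ["docker-compose", "build", "manylinux2010"],
        prefix + [
            "docker-compose", "run",
            "-e", "PYTHON_VERSION=" + python_version,
            "manylinux2010",
        ],
    ]


def build_wheel(python_version, use_sudo=True):
    """Run both commands from the project root, stopping on the first failure."""
    for command in docker_build_commands(python_version, use_sudo):
        subprocess.run(command, check=True)
```

Calling `build_wheel("37")` from the project root would run the same two steps as the manual commands.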

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following the official NumPy installation directions.

Install Bazel

If Bazel is not installed on your system, install it now by following the official Bazel installation directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

When building on Python 2, make sure to strip the Python types in the source code using the following commands:

pip install strip-hints
python tensorflow_data_validation/tools/strip_type_hints.py tensorflow_data_validation/

3. Build the pip package

TFDV uses Bazel to build the pip package from source. Before invoking the following command, make sure the python on your $PATH is the target Python version and has NumPy installed.

bazel run -c opt --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0 tensorflow_data_validation:build_pip_package

Note that this assumes that dependent packages (e.g. PyArrow) were built with a GCC older than 5.1, so the build passes the flag -D_GLIBCXX_USE_CXX11_ABI=0 to stay compatible with the old std::string ABI.

You can find the generated .whl file in the dist subdirectory.
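The interpreter requirement for this step (target Python version on $PATH, NumPy installed) can be checked before running Bazel. The snippet below is a rough sketch; `check_build_interpreter` is a hypothetical helper, and the default target of Python 3.7 is only an example value. Run it with the same `python` that Bazel will pick up.

```python
import importlib.util
import sys


def check_build_interpreter(target=(3, 7)):
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    current = tuple(sys.version_info[:2])
    if current != tuple(target):
        problems.append(
            "interpreter is Python %d.%d, expected %d.%d"
            % (current[0], current[1], target[0], target[1])
        )
    # find_spec returns None when the module cannot be located.
    if importlib.util.find_spec("numpy") is None:
        problems.append("NumPy is not importable from this interpreter")
    return problems


if __name__ == "__main__":
    for problem in check_build_interpreter():
        print("WARNING:", problem)
```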

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.14.6 (Mojave) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Notable Dependencies

TensorFlow is required.

Apache Beam is required; TFDV uses it to support efficient distributed computation. By default, Apache Beam runs in local mode, but it can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized NumPy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

| tensorflow-data-validation | tensorflow        | apache-beam[gcp] | pyarrow |
|----------------------------|-------------------|------------------|---------|
| GitHub master              | nightly (1.x/2.x) | 2.20.0           | 0.16.0  |
| 0.22.0                     | 1.15 / 2.2        | 2.20.0           | 0.16.0  |
| 0.21.5                     | 1.15 / 2.1        | 2.17.0           | 0.15.0  |
| 0.21.4                     | 1.15 / 2.1        | 2.17.0           | 0.15.0  |
| 0.21.2                     | 1.15 / 2.1        | 2.17.0           | 0.15.0  |
| 0.21.1                     | 1.15 / 2.1        | 2.17.0           | 0.15.0  |
| 0.21.0                     | 1.15 / 2.1        | 2.17.0           | 0.15.0  |
| 0.15.0                     | 1.15 / 2.0        | 2.16.0           | 0.14.0  |
| 0.14.1                     | 1.14              | 2.14.0           | 0.14.0  |
| 0.14.0                     | 1.14              | 2.14.0           | 0.14.0  |
| 0.13.1                     | 1.13              | 2.11.0           | n/a     |
| 0.13.0                     | 1.13              | 2.11.0           | n/a     |
| 0.12.0                     | 1.12              | 2.10.0           | n/a     |
| 0.11.0                     | 1.11              | 2.8.0            | n/a     |
| 0.9.0                      | 1.9               | 2.6.0            | n/a     |
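For scripted environment setup, the released rows of the table can be encoded as a lookup. This is an illustrative sketch: `COMPATIBLE_VERSIONS` and `compatible_versions` are hypothetical names, the GitHub master row is omitted, and "n/a" entries are represented as None.

```python
# TFDV release -> (tensorflow, apache-beam[gcp], pyarrow), copied from the table.
COMPATIBLE_VERSIONS = {
    "0.22.0": ("1.15 / 2.2", "2.20.0", "0.16.0"),
    "0.21.5": ("1.15 / 2.1", "2.17.0", "0.15.0"),
    "0.21.4": ("1.15 / 2.1", "2.17.0", "0.15.0"),
    "0.21.2": ("1.15 / 2.1", "2.17.0", "0.15.0"),
    "0.21.1": ("1.15 / 2.1", "2.17.0", "0.15.0"),
    "0.21.0": ("1.15 / 2.1", "2.17.0", "0.15.0"),
    "0.15.0": ("1.15 / 2.0", "2.16.0", "0.14.0"),
    "0.14.1": ("1.14", "2.14.0", "0.14.0"),
    "0.14.0": ("1.14", "2.14.0", "0.14.0"),
    "0.13.1": ("1.13", "2.11.0", None),
    "0.13.0": ("1.13", "2.11.0", None),
    "0.12.0": ("1.12", "2.10.0", None),
    "0.11.0": ("1.11", "2.8.0", None),
    "0.9.0": ("1.9", "2.6.0", None),
}


def compatible_versions(tfdv_version):
    """Return the tested dependency versions for a TFDV release.

    Raises KeyError for releases that are not listed in the table.
    """
    tensorflow, beam, pyarrow = COMPATIBLE_VERSIONS[tfdv_version]
    return {
        "tensorflow": tensorflow,
        "apache-beam[gcp]": beam,
        "pyarrow": pyarrow,  # None where the table says n/a
    }
```

A missing key only means the combination was not listed as tested, not that it cannot work.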

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.
