
A library for exploring and validating machine learning data.


TensorFlow Data Validation


TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets).
  • Automated data-schema generation to describe expectations about data, such as required values, ranges, and vocabularies.
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify problems such as missing features, out-of-range values, or wrong feature types.
  • An anomalies viewer so that you can see which features have anomalies and learn more in order to correct them.
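
To make the statistics, schema, and anomaly steps listed above concrete, here is a minimal conceptual sketch in plain Python. It is not TFDV's API or implementation: the function names (compute_stats, infer_schema, find_anomalies) and the toy records are assumptions made for illustration, and the real library computes statistics at scale with Apache Beam.

```python
from collections import Counter


def compute_stats(rows):
    """Summarise each feature: presence count, observed types, numeric range."""
    stats = {}
    for row in rows:
        for name, value in row.items():
            if value is None:
                continue
            s = stats.setdefault(
                name, {"count": 0, "types": Counter(), "min": None, "max": None}
            )
            s["count"] += 1
            s["types"][type(value).__name__] += 1
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                s["min"] = value if s["min"] is None else min(s["min"], value)
                s["max"] = value if s["max"] is None else max(s["max"], value)
    return stats


def infer_schema(stats, num_rows):
    """Turn statistics into expectations: type, required flag, value range."""
    schema = {}
    for name, s in stats.items():
        schema[name] = {
            "type": s["types"].most_common(1)[0][0],
            "required": s["count"] == num_rows,
            "min": s["min"],
            "max": s["max"],
        }
    return schema


def find_anomalies(rows, schema):
    """Report (row index, feature, reason) for every violated expectation."""
    anomalies = []
    for i, row in enumerate(rows):
        for name, spec in schema.items():
            value = row.get(name)
            if value is None:
                if spec["required"]:
                    anomalies.append((i, name, "missing feature"))
                continue
            if type(value).__name__ != spec["type"]:
                anomalies.append((i, name, "wrong feature type"))
            elif spec["min"] is not None and not (spec["min"] <= value <= spec["max"]):
                anomalies.append((i, name, "out-of-range value"))
    return anomalies


# Hypothetical toy data: the schema is inferred from "train" and checked on "test".
train = [{"age": 31, "country": "US"}, {"age": 45, "country": "DE"}]
test = [{"age": 120, "country": "US"}, {"country": 7}]

schema = infer_schema(compute_stats(train), len(train))
anomalies = find_anomalies(test, schema)
for anomaly in anomalies:
    print(anomaly)
```

The sketch mirrors the workflow only in outline: statistics are computed once, a schema is derived from them, and new data is then checked against that schema.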

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Caution: TFDV releases before version 1.0 may introduce backwards-incompatible changes.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

First install Docker and Docker Compose by following their official installation directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

When building on Python 2, make sure to strip the Python types in the source code using the following commands:

pip install strip-hints
python tensorflow_data_validation/tools/strip_type_hints.py tensorflow_data_validation/

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {27, 35, 36, 37}.

A wheel will be produced under dist/.
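
The PYTHON_VERSION value is the interpreter's major and minor version written without a dot. As a small sketch, the helper below derives that value and checks it against the set listed above. The function name python_version_tag is hypothetical and is not part of TFDV or its build scripts.

```python
import sys

# Values accepted by the Docker build, as listed in the instructions above.
SUPPORTED_TAGS = ("27", "35", "36", "37")


def python_version_tag(version_info=None):
    """Return e.g. "37" for Python 3.7, or raise if the build does not list it."""
    info = sys.version_info if version_info is None else version_info
    tag = "{}{}".format(info[0], info[1])
    if tag not in SUPPORTED_TAGS:
        raise ValueError(
            "PYTHON_VERSION={} is not one of {}".format(tag, ", ".join(SUPPORTED_TAGS))
        )
    return tag


print(python_version_tag((3, 7)))  # 37
```

Called without arguments, the helper inspects the running interpreter, so it raises on Python versions newer than those this release supports.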

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

When building on Python 2, make sure to strip the Python types in the source code using the following commands:

pip install strip-hints
python tensorflow_data_validation/tools/strip_type_hints.py tensorflow_data_validation/

3. Build the pip package

TFDV uses Bazel to build the pip package from source. Before invoking the following commands, make sure the python in your $PATH is the target version and has NumPy installed.
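
One way to check these preconditions before starting the build is a short script such as the following sketch. It uses only the standard library and requires Python 3; the function name check_build_env and the default target of (3, 7) are assumptions for illustration. It inspects the interpreter that runs it, so invoke it with the same python that is on your $PATH.

```python
import importlib.util
import shutil
import sys


def check_build_env(target=(3, 7)):
    """Collect facts about the interpreter that the build is likely to pick up."""
    return {
        # Path of the first "python" executable found on $PATH, or None.
        "python_on_path": shutil.which("python"),
        # Path of the interpreter running this script.
        "running_interpreter": sys.executable,
        # True when this interpreter's major.minor equals the target version.
        "version_matches": tuple(sys.version_info[:2]) == tuple(target),
        # True when NumPy can be imported by this interpreter.
        "numpy_installed": importlib.util.find_spec("numpy") is not None,
    }


if __name__ == "__main__":
    for key, value in check_build_env().items():
        print("{}: {}".format(key, value))
```

If python_on_path and running_interpreter differ, the script was not run with the interpreter that the build will see.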

bazel run -c opt --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0 tensorflow_data_validation:build_pip_package

Note that this assumes dependent packages (e.g. PyArrow) are built with a GCC older than 5.1; the flag -D_GLIBCXX_USE_CXX11_ABI=0 keeps the build compatible with the old std::string ABI.

You can find the generated .whl file in the dist subdirectory.

4. Install the pip package

pip install dist/*.whl
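
If dist/ holds wheels for more than one Python version, the glob above matches all of them. The sketch below splits a wheel filename into its tags (name, version, Python tag, ABI tag, platform tag) so that a single file can be selected. The helper names parse_wheel_name and wheels_for are hypothetical, and the sample filenames follow the pattern of the 0.21.1 release wheels.

```python
def parse_wheel_name(filename):
    """Split "name-version[-build]-python-abi-platform.whl" into its parts."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: {}".format(filename))
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):  # the build tag is optional
        raise ValueError("unexpected wheel name: {}".format(filename))
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def wheels_for(filenames, python_tag):
    """Keep only the wheels built for the given Python tag, e.g. "cp37"."""
    return [f for f in filenames if parse_wheel_name(f)["python"] == python_tag]


wheel = "tensorflow_data_validation-0.21.1-cp37-cp37m-manylinux2010_x86_64.whl"
print(parse_wheel_name(wheel)["python"])  # cp37
```

The selected filename can then be passed to pip install in place of the glob.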

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.12.6 (Sierra) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Dependencies

TFDV requires TensorFlow but does not depend on the tensorflow PyPI package. See the TensorFlow install guides for instructions on how to get started with TensorFlow.

Apache Beam is required; TFDV uses it to support efficient distributed computation. By default, Apache Beam runs in local mode but can also run in distributed mode using Google Cloud Dataflow. TFDV is designed to be extensible for other Apache Beam runners.

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized numpy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

tensorflow-data-validation   tensorflow          apache-beam[gcp]   pyarrow
GitHub master                nightly (1.x/2.x)   2.17.0             0.15.0
0.21.1                       1.15 / 2.1          2.17.0             0.15.0
0.21.0                       1.15 / 2.1          2.17.0             0.15.0
0.15.0                       1.15 / 2.0          2.16.0             0.14.0
0.14.1                       1.14                2.14.0             0.14.0
0.14.0                       1.14                2.14.0             0.14.0
0.13.1                       1.13                2.11.0             n/a
0.13.0                       1.13                2.11.0             n/a
0.12.0                       1.12                2.10.0             n/a
0.11.0                       1.11                2.8.0              n/a
0.9.0                        1.9                 2.6.0              n/a
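
As an illustration of how the table might be used, the sketch below encodes the released rows as a Python mapping and derives exact pins for a chosen TFDV version. The names COMPATIBLE_VERSIONS and pinned_requirements are hypothetical, and pinning with == is an assumption about how you might want to reproduce a tested combination, not a requirement stated by the project.

```python
# Rows copied from the table above: (tensorflow, apache-beam[gcp], pyarrow).
# None stands for "n/a". The "GitHub master" row is omitted because it has no
# fixed version number.
COMPATIBLE_VERSIONS = {
    "0.21.1": ("1.15 / 2.1", "2.17.0", "0.15.0"),
    "0.21.0": ("1.15 / 2.1", "2.17.0", "0.15.0"),
    "0.15.0": ("1.15 / 2.0", "2.16.0", "0.14.0"),
    "0.14.1": ("1.14", "2.14.0", "0.14.0"),
    "0.14.0": ("1.14", "2.14.0", "0.14.0"),
    "0.13.1": ("1.13", "2.11.0", None),
    "0.13.0": ("1.13", "2.11.0", None),
    "0.12.0": ("1.12", "2.10.0", None),
    "0.11.0": ("1.11", "2.8.0", None),
    "0.9.0": ("1.9", "2.6.0", None),
}


def pinned_requirements(tfdv_version):
    """Return pip requirement strings for the tested combination.

    TensorFlow is left out because the table lists alternative minor
    versions for it rather than a single exact release.
    """
    _tensorflow, beam, pyarrow = COMPATIBLE_VERSIONS[tfdv_version]
    pins = [
        "tensorflow-data-validation=={}".format(tfdv_version),
        "apache-beam[gcp]=={}".format(beam),
    ]
    if pyarrow is not None:
        pins.append("pyarrow=={}".format(pyarrow))
    return pins


for line in pinned_requirements("0.21.1"):
    print(line)
```

An unknown version raises KeyError, which keeps the helper from silently suggesting an untested combination.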

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.





