A library for exploring and validating machine learning data.

Project description

TensorFlow Data Validation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets).
  • Automated data-schema generation to describe expectations about data, such as required values, ranges, and vocabularies.
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify problems such as missing features, out-of-range values, or wrong feature types.
  • An anomalies viewer so that you can see which features have anomalies and learn more in order to correct them.
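
To make the schema and anomaly ideas in the list above concrete, here is a toy sketch in plain Python. It is not TFDV's API and does not reproduce its algorithms; the function names infer_schema and find_anomalies, the dictionary layout, and the sample rows are all assumptions made for illustration.

```python
# Toy illustration only: plain Python, NOT TFDV's API or algorithms.
# All names and data structures here are hypothetical.

def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def infer_schema(rows):
    """Derive simple per-feature expectations from example rows (dicts)."""
    features = set()
    for row in rows:
        features.update(row)
    schema = {}
    for name in sorted(features):
        values = [row[name] for row in rows if name in row]
        entry = {"required": len(values) == len(rows)}
        if all(_is_number(v) for v in values):
            entry.update(type="number", min=min(values), max=max(values))
        else:
            entry.update(type="string", vocabulary={str(v) for v in values})
        schema[name] = entry
    return schema


def find_anomalies(rows, schema):
    """Return (row_index, feature, reason) tuples for rows that violate the schema."""
    anomalies = []
    for index, row in enumerate(rows):
        for name, entry in schema.items():
            if name not in row:
                if entry["required"]:
                    anomalies.append((index, name, "missing feature"))
                continue
            value = row[name]
            if entry["type"] == "number":
                if not _is_number(value):
                    anomalies.append((index, name, "wrong feature type"))
                elif not entry["min"] <= value <= entry["max"]:
                    anomalies.append((index, name, "out-of-range value"))
            elif str(value) not in entry["vocabulary"]:
                anomalies.append((index, name, "value not in vocabulary"))
    return anomalies


train_rows = [{"age": 30, "country": "US"}, {"age": 45, "country": "CA"}]
test_rows = [{"age": 200, "country": "US"}, {"country": "FR"}, {"age": "n/a", "country": "CA"}]

schema = infer_schema(train_rows)
for anomaly in find_anomalies(test_rows, schema):
    print(anomaly)
```

The sketch infers expectations from one dataset and checks another against them, which is the general workflow the list describes; TFDV itself computes statistics at scale and stores the schema in its own format.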

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Caution: TFDV may be backwards incompatible before version 1.0.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

First install Docker and Docker Compose by following their official installation instructions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

When building on Python 2, make sure to strip the Python types in the source code using the following commands:

pip install strip-hints
python tensorflow_data_validation/tools/strip_type_hints.py tensorflow_data_validation/

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {27, 35, 36, 37}.

A wheel will be produced under dist/.
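
As a small aid for step 3, the sketch below derives a PYTHON_VERSION value from a Python version tuple. It assumes the values 27, 35, 36, and 37 stand for Python 2.7, 3.5, 3.6, and 3.7 (consistent with the cp27 to cp37 wheel tags); the helper name python_version_tag is hypothetical and not part of TFDV.

```python
import sys

# Values accepted by the Docker build, as listed in the build instructions.
SUPPORTED_TAGS = ("27", "35", "36", "37")


def python_version_tag(version_info=sys.version_info):
    """Map a (major, minor, ...) version tuple to a PYTHON_VERSION value.

    Hypothetical helper: assumes "37" means Python 3.7, and so on.
    """
    tag = "{}{}".format(version_info[0], version_info[1])
    if tag not in SUPPORTED_TAGS:
        raise ValueError(
            "Python {}.{} is not among the supported build targets {}".format(
                version_info[0], version_info[1], ", ".join(SUPPORTED_TAGS)
            )
        )
    return tag


if __name__ == "__main__":
    try:
        print("PYTHON_VERSION=" + python_version_tag())
    except ValueError as error:
        print(error)
```

Run with the interpreter you intend to target, this prints a line you could export before invoking docker-compose, or an explanatory message if that interpreter is not one of the listed targets.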

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following the NumPy installation instructions.

Install Bazel

If Bazel is not installed on your system, install it now by following the Bazel installation instructions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

When building on Python 2, make sure to strip the Python types in the source code using the following commands:

pip install strip-hints
python tensorflow_data_validation/tools/strip_type_hints.py tensorflow_data_validation/

3. Build the pip package

TFDV uses Bazel to build the pip package from source. Before invoking the following commands, make sure the python in your $PATH is the target version and has NumPy installed.

bazel run -c opt --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0 tensorflow_data_validation:build_pip_package

Note that this assumes dependent packages (e.g. PyArrow) were built with a GCC older than 5.1, so the flag -D_GLIBCXX_USE_CXX11_ABI=0 is passed to stay compatible with the old std::string ABI.

You can find the generated .whl file in the dist subdirectory.

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.12.6 (Sierra) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Dependencies

TFDV requires TensorFlow but does not depend on the tensorflow PyPI package. See the TensorFlow install guides for instructions on how to get started with TensorFlow.

Apache Beam is required; TFDV uses it to support efficient distributed computation. By default, Apache Beam runs in local mode but can also run in distributed mode using Google Cloud Dataflow. TFDV is designed to be extensible for other Apache Beam runners.

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized NumPy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

tensorflow-data-validation  tensorflow         apache-beam[gcp]  pyarrow
GitHub master               nightly (1.x/2.x)  2.17.0            0.15.0
0.21.5                      1.15 / 2.1         2.17.0            0.15.0
0.21.4                      1.15 / 2.1         2.17.0            0.15.0
0.21.2                      1.15 / 2.1         2.17.0            0.15.0
0.21.1                      1.15 / 2.1         2.17.0            0.15.0
0.21.0                      1.15 / 2.1         2.17.0            0.15.0
0.15.0                      1.15 / 2.0         2.16.0            0.14.0
0.14.1                      1.14               2.14.0            0.14.0
0.14.0                      1.14               2.14.0            0.14.0
0.13.1                      1.13               2.11.0            n/a
0.13.0                      1.13               2.11.0            n/a
0.12.0                      1.12               2.10.0            n/a
0.11.0                      1.11               2.8.0             n/a
0.9.0                       1.9                2.6.0             n/a
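
The table can also be consulted programmatically. The sketch below encodes a few of its rows and checks whether a given combination matches one of them. The TESTED dictionary and the is_tested_combination helper are hypothetical names, only a subset of rows is included, and a False result means only that the combination is not in this subset, not that it will fail.

```python
# A few rows copied from the compatibility table (subset, for illustration).
# tfdv version -> (tensorflow release series, apache-beam[gcp], pyarrow or None for "n/a")
TESTED = {
    "0.21.5": (("1.15", "2.1"), "2.17.0", "0.15.0"),
    "0.15.0": (("1.15", "2.0"), "2.16.0", "0.14.0"),
    "0.14.1": (("1.14",), "2.14.0", "0.14.0"),
    "0.13.1": (("1.13",), "2.11.0", None),
}


def is_tested_combination(tfdv, tensorflow, beam, pyarrow=None):
    """Return True if the versions match a row encoded in TESTED.

    Hypothetical helper: tensorflow may be a full version such as "2.1.0";
    it matches a series such as "2.1" exactly or as a dotted prefix.
    """
    if tfdv not in TESTED:
        return False
    tf_series, tested_beam, tested_pyarrow = TESTED[tfdv]
    tf_ok = any(
        tensorflow == series or tensorflow.startswith(series + ".")
        for series in tf_series
    )
    return tf_ok and beam == tested_beam and pyarrow == tested_pyarrow


print(is_tested_combination("0.21.5", "2.1.0", "2.17.0", "0.15.0"))
print(is_tested_combination("0.15.0", "2.1.0", "2.16.0", "0.14.0"))
```

Matching on a dotted prefix keeps "2.10.0" from being mistaken for the "2.1" series.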

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release. See the tutorial on generating distribution archives.

Built Distributions

  • tensorflow_data_validation-0.21.5-cp37-cp37m-win_amd64.whl (1.7 MB): CPython 3.7m, Windows x86-64
  • tensorflow_data_validation-0.21.5-cp37-cp37m-manylinux2010_x86_64.whl (2.4 MB): CPython 3.7m, manylinux: glibc 2.12+ x86-64
  • tensorflow_data_validation-0.21.5-cp37-cp37m-macosx_10_9_x86_64.whl (2.9 MB): CPython 3.7m, macOS 10.9+ x86-64
  • tensorflow_data_validation-0.21.5-cp36-cp36m-win_amd64.whl (1.7 MB): CPython 3.6m, Windows x86-64
  • tensorflow_data_validation-0.21.5-cp36-cp36m-manylinux2010_x86_64.whl (2.4 MB): CPython 3.6m, manylinux: glibc 2.12+ x86-64
  • tensorflow_data_validation-0.21.5-cp36-cp36m-macosx_10_9_x86_64.whl (2.9 MB): CPython 3.6m, macOS 10.9+ x86-64
  • tensorflow_data_validation-0.21.5-cp35-cp35m-win_amd64.whl (1.7 MB): CPython 3.5m, Windows x86-64
  • tensorflow_data_validation-0.21.5-cp35-cp35m-manylinux2010_x86_64.whl (2.4 MB): CPython 3.5m, manylinux: glibc 2.12+ x86-64
  • tensorflow_data_validation-0.21.5-cp35-cp35m-macosx_10_6_intel.whl (2.9 MB): CPython 3.5m, macOS 10.6+ intel
  • tensorflow_data_validation-0.21.5-cp27-cp27mu-manylinux2010_x86_64.whl (2.4 MB): CPython 2.7mu, manylinux: glibc 2.12+ x86-64
  • tensorflow_data_validation-0.21.5-cp27-cp27m-macosx_10_9_x86_64.whl (2.9 MB): CPython 2.7m, macOS 10.9+ x86-64
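
If you are unsure which file matches your platform, the wheel filenames themselves encode the answer. The sketch below splits a filename from the list above into its components, following the standard wheel naming convention (distribution, version, Python tag, ABI tag, platform tag). The parse_wheel_filename helper is a hypothetical name, and it deliberately rejects filenames with an optional build tag to stay short.

```python
from collections import namedtuple

WheelTags = namedtuple("WheelTags", "distribution version python abi platform")


def parse_wheel_filename(filename):
    """Split 'name-version-python-abi-platform.whl' into its five parts.

    Hypothetical helper; filenames with an optional build tag are rejected.
    """
    suffix = ".whl"
    if not filename.endswith(suffix):
        raise ValueError("not a wheel filename: " + filename)
    parts = filename[: -len(suffix)].split("-")
    if len(parts) != 5:
        raise ValueError("expected name-version-python-abi-platform: " + filename)
    return WheelTags(*parts)


tags = parse_wheel_filename(
    "tensorflow_data_validation-0.21.5-cp37-cp37m-manylinux2010_x86_64.whl"
)
print(tags.python, tags.abi, tags.platform)
```

Here cp37 and cp37m indicate CPython 3.7, and manylinux2010_x86_64 indicates 64-bit Linux with glibc 2.12 or later, matching the descriptions in the list.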

File details

Details for the file tensorflow_data_validation-0.21.5-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.21.5-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 1.7 MB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.38.0 CPython/3.7.5

File hashes

Hashes for tensorflow_data_validation-0.21.5-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 57a38543243ab4e19f01d501559dccb52d129063f50ac8933bcefc4ccde22cf5
MD5 50ab2b61ce05c47239c4dadae4fa7d82
BLAKE2b-256 096957f7d60cb6952845fc03964aec4678b33e0394e7bb76d4ba27976e72a354

See more details on using hashes here.
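
A published digest is only useful if you compare it against the file you actually downloaded. The following minimal sketch computes a file's SHA256 with Python's standard hashlib module and compares it to an expected value; the helper names sha256_of_file and matches_published_hash are hypothetical.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 with a published digest (case-insensitive)."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())
```

For example, calling matches_published_hash with the local path of the downloaded cp37 Windows wheel and the SHA256 value listed above should return True for an intact download.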

File details

Details for the file tensorflow_data_validation-0.21.5-cp37-cp37m-manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.21.5-cp37-cp37m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 49fb9570648906d477bbe9b64d9a76318441c6927489d6f0a8fac1b7146fbcc8
MD5 e5fa917ad60c2f7b02a5d4936f708d10
BLAKE2b-256 8d86ee71caf5e98e2b2fbf2c4da3248ad94dd8840e30a41d953d9971120f59bf

File details

Details for the file tensorflow_data_validation-0.21.5-cp37-cp37m-macosx_10_9_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.21.5-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 b04dd62ef0548a79ce8e8d9793d8650470e2359a45233f121e8e7fc31f750ce2
MD5 fceccedad9b39d965d664a8effbf5f65
BLAKE2b-256 bbc30707688e7dfc6fd77cea8c11134994b70a5b5335afff9a458c9a7af8fd93

File details

Details for the file tensorflow_data_validation-0.21.5-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.21.5-cp36-cp36m-win_amd64.whl
  • Upload date:
  • Size: 1.7 MB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.38.0 CPython/3.7.5

File hashes

Hashes for tensorflow_data_validation-0.21.5-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 0a40fc44980540720e47a9932547086321ed845dcf8ce9d0addf528401712736
MD5 e977081021b630f116cb45ab15ecfc79
BLAKE2b-256 4b50b7f2966c66a2569affba4d79a39c6648e21de739c982b2f1a393c2f88c17

File details

Details for the file tensorflow_data_validation-0.21.5-cp36-cp36m-manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.21.5-cp36-cp36m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 555e7b764c683d72c8bb3a2e6f21d8475753dbbf587c6895ed2a40abffeb7130
MD5 4cfc75e51367c4680523e632c1c0ffb6
BLAKE2b-256 fad724a3932e633da767047974dbba2032991274e389cc10547d5ab0233939bc

File details

Details for the file tensorflow_data_validation-0.21.5-cp36-cp36m-macosx_10_9_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.21.5-cp36-cp36m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 5c53a04d1d549baeb4ce0d985b6a9118c056d290fcf13f4681f9c4ed8ce462a9
MD5 43af862687453f3956d138187cc0f0bd
BLAKE2b-256 289ab3eefb7f6289cafa623267f025c66434ebe459e3a80e850bad5b1ccb253c

File details

Details for the file tensorflow_data_validation-0.21.5-cp35-cp35m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.21.5-cp35-cp35m-win_amd64.whl
  • Upload date:
  • Size: 1.7 MB
  • Tags: CPython 3.5m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.38.0 CPython/3.7.5

File hashes

Hashes for tensorflow_data_validation-0.21.5-cp35-cp35m-win_amd64.whl
Algorithm Hash digest
SHA256 20ccea206e17752f489f6d9a56cbf50fd3be9b498deba3b1c63f94d1952dc32c
MD5 6d161e7d86b07a23f905def6c1e47b63
BLAKE2b-256 2df486dec001e9ada955f082157dedb9f5deaa7cee5264ca4fad33ad142be071

File details

Details for the file tensorflow_data_validation-0.21.5-cp35-cp35m-manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.21.5-cp35-cp35m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 658230d733ec6bd265eb0d95b969b4390b638bdc2def35e6e7d21d6d244b7207
MD5 8ed2aa1dc3363c2e5409ec78145f6123
BLAKE2b-256 b93d690449e34e36ca1fa8f710288b7c1c7e879caec2644e6ebeccb43d31cda7

File details

Details for the file tensorflow_data_validation-0.21.5-cp35-cp35m-macosx_10_6_intel.whl.

File hashes

Hashes for tensorflow_data_validation-0.21.5-cp35-cp35m-macosx_10_6_intel.whl
Algorithm Hash digest
SHA256 a41db80518072d9b1eeca93114083588f365af8f12ee14cd2fd7067164e7d13e
MD5 7affcf46b6c76975d0d05e6d555d5f5b
BLAKE2b-256 fe6662d59439d73f30e3a629b4e562b0333e26b7d186c017d60fd05e91eec517

File details

Details for the file tensorflow_data_validation-0.21.5-cp27-cp27mu-manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.21.5-cp27-cp27mu-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 4d35e68e16e5be63bd0a4967b06f8cbae930b4f3fd0eb57cf0cdfd78b9f2798b
MD5 97cb9dd224306009e5689b720405d4ed
BLAKE2b-256 dc082ef44147e37544c2d17c1252cd7f085a13c61adcfac52aa477e4bda9e16f

File details

Details for the file tensorflow_data_validation-0.21.5-cp27-cp27m-macosx_10_9_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.21.5-cp27-cp27m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 886c859b20117c2b33d0d0bea44d175d104a8d39d3bd843269aa49290531c71c
MD5 abc5d2d09b4ded3c2315729e1efebb69
BLAKE2b-256 1d103ca0ac7710f7ca4c4a4d9325a05e6e30d2ce837ecaff524460d4b601442f
