Skip to main content

A library for exploring and validating machine learning data.

Project description

TensorFlow Data Validation

Python PyPI Documentation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets)
  • Automated data-schema generation to describe expectations about data like required values, ranges, and vocabularies
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify anomalies, such as missing features, out-of-range values, or wrong feature types, to name a few.
  • An anomalies viewer so that you can see what features have anomalies and learn more in order to correct them.

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Caution: TFDV may be backwards incompatible before version 1.0.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation

Nightly Packages

TFDV also hosts nightly packages at https://pypi-nightly.tensorflow.org on Google Cloud. To install the latest nightly package, please use the following command:

pip install -i https://pypi-nightly.tensorflow.org/simple tensorflow-data-validation

This will install the nightly packages for the major dependencies of TFDV such as TensorFlow Transform (TFT), TFX Basic Shared Libraries (TFX-BSL), TensorFlow Metadata (TFMD).

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

Please first install docker and docker-compose by following the directions: docker; docker-compose.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {35, 36, 37, 38}.

A wheel will be produced under dist/.

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

TFDV wheel is Python version dependent -- to build the pip package that works for a specific Python version, use that Python binary to run:

python setup.py bdist_wheel

You can find the generated .whl file in the dist subdirectory.

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.14.6 (Mojave) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Notable Dependencies

TensorFlow is required.

Apache Beam is required; it's the way that efficient distributed computation is supported. By default, Apache Beam runs in local mode but can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized numpy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

tensorflow-data-validation apache-beam[gcp] pyarrow tensorflow tensorflow-metadata tensorflow-transform tfx-bsl
GitHub master 2.25.0 0.17.0 nightly (1.x/2.x) 0.25.0 0.25.0 0.25.0
0.25.0 2.25.0 0.17.0 1.15 / 2.3 0.25.0 0.25.0 0.25.0
0.24.1 2.24.0 0.17.0 1.15 / 2.3 0.24.0 0.24.1 0.24.1
0.24.0 2.23.0 0.17.0 1.15 / 2.3 0.24.0 0.24.0 0.24.0
0.23.1 2.24.0 0.17.0 1.15 / 2.3 0.23.0 0.23.0 0.23.0
0.23.0 2.23.0 0.17.0 1.15 / 2.3 0.23.0 0.23.0 0.23.0
0.22.2 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.1
0.22.1 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.1
0.22.0 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.0
0.21.5 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.1 0.21.3
0.21.4 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.1 0.21.3
0.21.2 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.21.1 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.21.0 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.15.0 2.16.0 0.14.0 1.15 / 2.0 0.15.0 0.15.0 0.15.0
0.14.1 2.14.0 0.14.0 1.14 0.14.0 0.14.0 n/a
0.14.0 2.14.0 0.14.0 1.14 0.14.0 0.14.0 n/a
0.13.1 2.11.0 n/a 1.13 0.12.1 0.13.0 n/a
0.13.0 2.11.0 n/a 1.13 0.12.1 0.13.0 n/a
0.12.0 2.10.0 n/a 1.12 0.12.1 0.12.0 n/a
0.11.0 2.8.0 n/a 1.11 0.9.0 0.11.0 n/a
0.9.0 2.6.0 n/a 1.9 n/a n/a n/a

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.

Links

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

tensorflow_data_validation-0.25.0-cp38-cp38-win_amd64.whl (1.1 MB view details)

Uploaded CPython 3.8 Windows x86-64

tensorflow_data_validation-0.25.0-cp38-cp38-manylinux2010_x86_64.whl (1.3 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-0.25.0-cp38-cp38-macosx_10_9_x86_64.whl (3.0 MB view details)

Uploaded CPython 3.8 macOS 10.9+ x86-64

tensorflow_data_validation-0.25.0-cp37-cp37m-win_amd64.whl (1.1 MB view details)

Uploaded CPython 3.7m Windows x86-64

tensorflow_data_validation-0.25.0-cp37-cp37m-manylinux2010_x86_64.whl (1.3 MB view details)

Uploaded CPython 3.7m manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-0.25.0-cp37-cp37m-macosx_10_9_x86_64.whl (3.0 MB view details)

Uploaded CPython 3.7m macOS 10.9+ x86-64

tensorflow_data_validation-0.25.0-cp36-cp36m-win_amd64.whl (1.1 MB view details)

Uploaded CPython 3.6m Windows x86-64

tensorflow_data_validation-0.25.0-cp36-cp36m-manylinux2010_x86_64.whl (1.3 MB view details)

Uploaded CPython 3.6m manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-0.25.0-cp36-cp36m-macosx_10_9_x86_64.whl (3.0 MB view details)

Uploaded CPython 3.6m macOS 10.9+ x86-64

File details

Details for the file tensorflow_data_validation-0.25.0-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.25.0-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.24.0 setuptools/50.3.2 requests-toolbelt/0.9.1 tqdm/4.51.0 CPython/3.8.5

File hashes

Hashes for tensorflow_data_validation-0.25.0-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 e173ce9a457e5996c7d84e5fc8c4bcd0d633aca68d63cff7551e98b613aa8995
MD5 a9981d1c62adf8d06c05be79b09e55c4
BLAKE2b-256 8bee8dc77b90a38f476b9d6c985fbf598f0e4dc82ec3fb68af42020ab9b14747

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-0.25.0-cp38-cp38-manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-0.25.0-cp38-cp38-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 44c5e4ba9be5d28e382a22c4de70c1a1bbe0cd4002b6a00bb738984502a10fe6
MD5 ae947f92f942e3ad3ff99e22584bea40
BLAKE2b-256 fed96b69d544e83310cb1583ac2ecd8fcf06d3239a9f9b75b489494ca45f6d98

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-0.25.0-cp38-cp38-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-0.25.0-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 aa7c29f6a3e596993c52733126cfbf2ea87cf39705b98d9d38d1d551d7ead922
MD5 ee5c414dfa72e5bf1a8016f1333cc76d
BLAKE2b-256 a358eff3e1f115eb49459e0797d95afccb11d0cecf61d26eeaac731e321f1c8a

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-0.25.0-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.25.0-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.24.0 setuptools/50.3.2 requests-toolbelt/0.9.1 tqdm/4.51.0 CPython/3.7.0

File hashes

Hashes for tensorflow_data_validation-0.25.0-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 5af3b3fe029d8992c8a0322c69f5c8e504d34f4989284321397b90970fc046b7
MD5 60f7b082bd51af856d0f8ab5741ceb1c
BLAKE2b-256 020f9d0041c68b1b130faed0fb58fc0647f2ffe33aa7bd0574734465eaa3aa4a

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-0.25.0-cp37-cp37m-manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-0.25.0-cp37-cp37m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 2e30495159f9fc81c2e25fc4108fa7324464f9313e2beca0b1e86973d95d8215
MD5 246b8241cc647ef07ebe056802397cd6
BLAKE2b-256 0aa43ed3ad02916bb19b1c0a747e8f6bf3917205b8f3d4e18b6d61ea6fd4ce7e

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-0.25.0-cp37-cp37m-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-0.25.0-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 69aae82f737c707a103f72fbf6d385d67853071948e415caca4e226808ff4185
MD5 7c62b30cac6aa846ed9162a0c1b92fb4
BLAKE2b-256 4a1c01d6fd154122824253e3d99c9e34bdf13439bf711c36a040f0aee15ef45f

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-0.25.0-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.25.0-cp36-cp36m-win_amd64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.24.0 setuptools/50.3.2 requests-toolbelt/0.9.1 tqdm/4.51.0 CPython/3.6.1

File hashes

Hashes for tensorflow_data_validation-0.25.0-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 f0fc1ce9637a45a0fbbb41d8898e4e78a6c9edaea3b77b5639e64af5e8e61199
MD5 086880ca7fbcdccaa72efb2021c30f0b
BLAKE2b-256 fa7761d922d9981d606445ce251faa56201f64611d3ddc8bee05bac82336a588

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-0.25.0-cp36-cp36m-manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-0.25.0-cp36-cp36m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 0fbcede4b44a7922e10f6dd2788251229c0f7160d06d8e0a8769ae9629c06416
MD5 121474ae6d2f152bd3da7c7cb8cafa5a
BLAKE2b-256 62198e4804e565f7bc1d90db37252497c3beb9a7e215ab683a8ca019b5ab9311

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-0.25.0-cp36-cp36m-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-0.25.0-cp36-cp36m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 7f26f15137edb6298a75035ddd0876b410b7a434d962cbc37abd0fd04b3a0c8f
MD5 2e5abd885081ee826207a374e8c96fa6
BLAKE2b-256 27260b4d4561b5e2ad212c1260d36fddd93f087aa7403f2771d58ff765e00689

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page