Skip to main content

A library for exploring and validating machine learning data.

Project description

TensorFlow Data Validation

Python PyPI Documentation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets)
  • Automated data-schema generation to describe expectations about data like required values, ranges, and vocabularies
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify anomalies, such as missing features, out-of-range values, or wrong feature types, to name a few.
  • An anomalies viewer so that you can see what features have anomalies and learn more in order to correct them.

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation

Nightly Packages

TFDV also hosts nightly packages at https://pypi-nightly.tensorflow.org on Google Cloud. To install the latest nightly package, please use the following command:

pip install -i https://pypi-nightly.tensorflow.org/simple tensorflow-data-validation

This will install the nightly packages for the major dependencies of TFDV such as TFX Basic Shared Libraries (TFX-BSL) and TensorFlow Metadata (TFMD).

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

Please first install docker and docker-compose by following the directions: docker; docker-compose.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {36, 37, 38}.

A wheel will be produced under dist/.

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

TFDV wheel is Python version dependent -- to build the pip package that works for a specific Python version, use that Python binary to run:

python setup.py bdist_wheel

You can find the generated .whl file in the dist subdirectory.

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.14.6 (Mojave) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Notable Dependencies

TensorFlow is required.

Apache Beam is required; it's the way that efficient distributed computation is supported. By default, Apache Beam runs in local mode but can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized numpy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

tensorflow-data-validation apache-beam[gcp] pyarrow tensorflow tensorflow-metadata tensorflow-transform tfx-bsl
GitHub master 2.29.0 2.0.0 nightly (1.x/2.x) 1.0.0 n/a 1.0.0
1.0.0 2.29.0 2.0.0 1.15 / 2.5 1.0.0 n/a 1.0.0
0.30.0 2.28.0 2.0.0 1.15 / 2.4 0.30.0 n/a 0.30.0
0.29.0 2.28.0 2.0.0 1.15 / 2.4 0.29.0 n/a 0.29.0
0.28.0 2.28.0 2.0.0 1.15 / 2.4 0.28.0 n/a 0.28.1
0.27.0 2.27.0 2.0.0 1.15 / 2.4 0.27.0 n/a 0.27.0
0.26.1 2.28.0 0.17.0 1.15 / 2.3 0.26.0 0.26.0 0.26.0
0.26.0 2.25.0 0.17.0 1.15 / 2.3 0.26.0 0.26.0 0.26.0
0.25.0 2.25.0 0.17.0 1.15 / 2.3 0.25.0 0.25.0 0.25.0
0.24.1 2.24.0 0.17.0 1.15 / 2.3 0.24.0 0.24.1 0.24.1
0.24.0 2.23.0 0.17.0 1.15 / 2.3 0.24.0 0.24.0 0.24.0
0.23.1 2.24.0 0.17.0 1.15 / 2.3 0.23.0 0.23.0 0.23.0
0.23.0 2.23.0 0.17.0 1.15 / 2.3 0.23.0 0.23.0 0.23.0
0.22.2 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.1
0.22.1 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.1
0.22.0 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.0
0.21.5 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.1 0.21.3
0.21.4 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.1 0.21.3
0.21.2 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.21.1 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.21.0 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.15.0 2.16.0 0.14.0 1.15 / 2.0 0.15.0 0.15.0 0.15.0
0.14.1 2.14.0 0.14.0 1.14 0.14.0 0.14.0 n/a
0.14.0 2.14.0 0.14.0 1.14 0.14.0 0.14.0 n/a
0.13.1 2.11.0 n/a 1.13 0.12.1 0.13.0 n/a
0.13.0 2.11.0 n/a 1.13 0.12.1 0.13.0 n/a
0.12.0 2.10.0 n/a 1.12 0.12.1 0.12.0 n/a
0.11.0 2.8.0 n/a 1.11 0.9.0 0.11.0 n/a
0.9.0 2.6.0 n/a 1.9 n/a n/a n/a

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.

Links

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

tensorflow_data_validation-1.0.0-cp38-cp38-win_amd64.whl (1.1 MB view details)

Uploaded CPython 3.8 Windows x86-64

tensorflow_data_validation-1.0.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-1.0.0-cp38-cp38-macosx_10_9_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.8 macOS 10.9+ x86-64

tensorflow_data_validation-1.0.0-cp37-cp37m-win_amd64.whl (1.1 MB view details)

Uploaded CPython 3.7m Windows x86-64

tensorflow_data_validation-1.0.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.7m manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-1.0.0-cp37-cp37m-macosx_10_9_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.7m macOS 10.9+ x86-64

tensorflow_data_validation-1.0.0-cp36-cp36m-win_amd64.whl (1.1 MB view details)

Uploaded CPython 3.6m Windows x86-64

tensorflow_data_validation-1.0.0-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.6m manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-1.0.0-cp36-cp36m-macosx_10_9_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.6m macOS 10.9+ x86-64

File details

Details for the file tensorflow_data_validation-1.0.0-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-1.0.0-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.8.5

File hashes

Hashes for tensorflow_data_validation-1.0.0-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 24bd9460277371300979f8e7e449b554601b05a8bc78e4861b3fecc82d26f256
MD5 b42b35179dc4e7f050abcaa6fa1be138
BLAKE2b-256 f1eb931b4d615e01e8558e7687233d797a7b5b14b1b35075ab37c1a903313581

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.0.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.0.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 a523211eba106e18e90ecf2e86f071aa426898b82078cd9cb74d1dc0c53fcfba
MD5 2dbacd03e4fe0a6a4c004b7ced626593
BLAKE2b-256 25f483f6a778f889395ea72f776b216882cbc170cb6d2c9174a5e49eef1cb2c8

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.0.0-cp38-cp38-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.0.0-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 8b17b09dd073f3e61080be6b104690b1bede05cf05d543f3ac2e6ddb5bdd14f5
MD5 aebbe1bc308ac0d4b9d0fafaab0d3571
BLAKE2b-256 b076c3376c3d8ceb4f778ee448961a146129e4cefa5e8e80b3050313485a8d11

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.0.0-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-1.0.0-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.7.0

File hashes

Hashes for tensorflow_data_validation-1.0.0-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 cb2aa9a9848b042681f3bdd12d42c6d6d2a505a5a7034de5bc658e0a2483a284
MD5 0f07d1e68f7ebf0e53feb28155d35877
BLAKE2b-256 a09f416965d674d9aee225c78e8e976801a5488fa00b84364073544649c621a7

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.0.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.0.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 9e987887307eb13023f176f191efa84b6a2ecb42bea2f1f15fc0a33437d99d61
MD5 2669881f42b6ba4c1ea813af4e60d20d
BLAKE2b-256 57da1f7a5b8154ed5e969c6411bd86d59ff1a8e8a463eabc382b1cbb875fbbfc

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.0.0-cp37-cp37m-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.0.0-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 cb00f5ba89120f475d270620352e8872367d8cc3311dfef3894b64510fe8a67a
MD5 ec87975e91860147431ec3d2442be73d
BLAKE2b-256 227347b920abbe9dcecbb578c57e50bf09ccac052b893fecdb103bf80d0671a3

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.0.0-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-1.0.0-cp36-cp36m-win_amd64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.6.5

File hashes

Hashes for tensorflow_data_validation-1.0.0-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 c2aaf22f4602ba2d07b480446676a966cb677d872af889789db0b855f925de18
MD5 08cf00661f2369c7348d42b911cf22e3
BLAKE2b-256 accafe00fcc823e7259d54ef9766b1f366b3494080d5eefc27b86cb4bf7450cf

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.0.0-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.0.0-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 104b0b295cb037ed5b4409f2fbf35f6977ee5db6e8a492d0de2ffc1162145c2b
MD5 6fd96d94a1311bac19c0b0e9006b22f2
BLAKE2b-256 72303eba2bfae6d0400d3df814dfccac00df7969de725944e0def6c3c174e5b2

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.0.0-cp36-cp36m-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.0.0-cp36-cp36m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 bf6e317eed199abf1537ee56f4bdff7af069bc4b3bb29279df57edaa2273df54
MD5 56aaef82c4b9264f93cd4ebdefae1fe5
BLAKE2b-256 873e845f75f14fd500382561641621f2b39d635e69ba8fd813ce56129ce22e6a

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page