Skip to main content

A library for exploring and validating machine learning data.

Project description

TensorFlow Data Validation

Python PyPI Documentation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets)
  • Automated data-schema generation to describe expectations about data like required values, ranges, and vocabularies
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify anomalies, such as missing features, out-of-range values, or wrong feature types, to name a few.
  • An anomalies viewer so that you can see what features have anomalies and learn more in order to correct them.

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation

Nightly Packages

TFDV also hosts nightly packages at https://pypi-nightly.tensorflow.org on Google Cloud. To install the latest nightly package, please use the following command:

pip install -i https://pypi-nightly.tensorflow.org/simple tensorflow-data-validation

This will install the nightly packages for the major dependencies of TFDV such as TFX Basic Shared Libraries (TFX-BSL) and TensorFlow Metadata (TFMD).

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

Please first install docker and docker-compose by following the directions: docker; docker-compose.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {37, 38}.

A wheel will be produced under dist/.

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

TFDV wheel is Python version dependent -- to build the pip package that works for a specific Python version, use that Python binary to run:

python setup.py bdist_wheel

You can find the generated .whl file in the dist subdirectory.

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.14.6 (Mojave) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Notable Dependencies

TensorFlow is required.

Apache Beam is required; it's the way that efficient distributed computation is supported. By default, Apache Beam runs in local mode but can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized numpy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

tensorflow-data-validation apache-beam[gcp] pyarrow tensorflow tensorflow-metadata tensorflow-transform tfx-bsl
GitHub master 2.32.0 2.0.0 nightly (1.x/2.x) 1.4.0 n/a 1.4.0
1.4.0 2.32.0 2.0.0 1.15 / 2.6 1.4.0 n/a 1.4.0
1.3.0 2.32.0 2.0.0 1.15 / 2.6 1.2.0 n/a 1.3.0
1.2.0 2.31.0 2.0.0 1.15 / 2.5 1.2.0 n/a 1.2.0
1.1.1 2.29.0 2.0.0 1.15 / 2.5 1.1.0 n/a 1.1.1
1.1.0 2.29.0 2.0.0 1.15 / 2.5 1.1.0 n/a 1.1.0
1.0.0 2.29.0 2.0.0 1.15 / 2.5 1.0.0 n/a 1.0.0
0.30.0 2.28.0 2.0.0 1.15 / 2.4 0.30.0 n/a 0.30.0
0.29.0 2.28.0 2.0.0 1.15 / 2.4 0.29.0 n/a 0.29.0
0.28.0 2.28.0 2.0.0 1.15 / 2.4 0.28.0 n/a 0.28.1
0.27.0 2.27.0 2.0.0 1.15 / 2.4 0.27.0 n/a 0.27.0
0.26.1 2.28.0 0.17.0 1.15 / 2.3 0.26.0 0.26.0 0.26.0
0.26.0 2.25.0 0.17.0 1.15 / 2.3 0.26.0 0.26.0 0.26.0
0.25.0 2.25.0 0.17.0 1.15 / 2.3 0.25.0 0.25.0 0.25.0
0.24.1 2.24.0 0.17.0 1.15 / 2.3 0.24.0 0.24.1 0.24.1
0.24.0 2.23.0 0.17.0 1.15 / 2.3 0.24.0 0.24.0 0.24.0
0.23.1 2.24.0 0.17.0 1.15 / 2.3 0.23.0 0.23.0 0.23.0
0.23.0 2.23.0 0.17.0 1.15 / 2.3 0.23.0 0.23.0 0.23.0
0.22.2 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.1
0.22.1 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.1
0.22.0 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.0
0.21.5 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.1 0.21.3
0.21.4 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.1 0.21.3
0.21.2 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.21.1 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.21.0 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.15.0 2.16.0 0.14.0 1.15 / 2.0 0.15.0 0.15.0 0.15.0
0.14.1 2.14.0 0.14.0 1.14 0.14.0 0.14.0 n/a
0.14.0 2.14.0 0.14.0 1.14 0.14.0 0.14.0 n/a
0.13.1 2.11.0 n/a 1.13 0.12.1 0.13.0 n/a
0.13.0 2.11.0 n/a 1.13 0.12.1 0.13.0 n/a
0.12.0 2.10.0 n/a 1.12 0.12.1 0.12.0 n/a
0.11.0 2.8.0 n/a 1.11 0.9.0 0.11.0 n/a
0.9.0 2.6.0 n/a 1.9 n/a n/a n/a

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.

Links

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

tensorflow_data_validation-1.4.0-cp38-cp38-win_amd64.whl (1.2 MB view details)

Uploaded CPython 3.8 Windows x86-64

tensorflow_data_validation-1.4.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-1.4.0-cp38-cp38-macosx_10_9_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.8 macOS 10.9+ x86-64

tensorflow_data_validation-1.4.0-cp37-cp37m-win_amd64.whl (1.2 MB view details)

Uploaded CPython 3.7m Windows x86-64

tensorflow_data_validation-1.4.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.7m manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-1.4.0-cp37-cp37m-macosx_10_9_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.7m macOS 10.9+ x86-64

File details

Details for the file tensorflow_data_validation-1.4.0-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-1.4.0-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 1.2 MB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.5

File hashes

Hashes for tensorflow_data_validation-1.4.0-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 5ed9bd2fb04148bb3a7b3b29648e0335d94934cdc18e1078e0532546a842b30d
MD5 6d47f926f24750a226f8a19860de3949
BLAKE2b-256 95e673bf17ccb36d81775e4189a309038307959212a35782c11f9f1aef90720e

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.4.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.4.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 07da2e47bc0f4b068714b57348cc563491f8de57b536aa52f225a92c50e69c76
MD5 172d5ddd43a63d82bc09a4f810c4b06e
BLAKE2b-256 6ec62efc85947c5e701b0a79985815d25f6116eae2552a637b1a16f19b6efb04

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.4.0-cp38-cp38-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.4.0-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 1ded03ea75c70e129df281617ced963c1ac489ce7068dd5b834a081715abff19
MD5 b8744ad0af4704aa9162be85e715b5ba
BLAKE2b-256 de5934aab2c91f9a6f035168d447eee9b9d8d1bcafcbc51e1dd5de2178d99bfd

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.4.0-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-1.4.0-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 1.2 MB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.7.0

File hashes

Hashes for tensorflow_data_validation-1.4.0-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 0ee7011b2754608d8881891f80166111aa2c429067989d4a93264fbe14497167
MD5 adcefb0c07bd9f0368dbadfe90f0d498
BLAKE2b-256 fb02ffdd1ddec070060c922bf67ecec343f503af82c50bb3dbfef209a8c6a1fc

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.4.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.4.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 f4cb8e6f13a1d5610ec64b4f70a710c30732b0b661d1763fae87535175aae9fa
MD5 6bbe28faf0030c8f4c0b54d92185c24e
BLAKE2b-256 f7e3f298f8b37d64efa21083ef4083d0eac91c10d7b6b88a3d13687bd71485c9

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.4.0-cp37-cp37m-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.4.0-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 0040f35c108fd8a7720e8aa274b12210d7a83fcd2e57e5ca5f5294aee6857530
MD5 24a597c28e06a3f1832667dbecfdac95
BLAKE2b-256 362e1014cc64d3dcb76816bd74b15a47d9d2873ba937310974898c6ac613680d

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page