A library for exploring and validating machine learning data.

TensorFlow Data Validation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets).
  • Automated data-schema generation to describe expectations about data, such as required values, ranges, and vocabularies.
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify problems such as missing features, out-of-range values, or wrong feature types.
  • An anomalies viewer so that you can see what features have anomalies and learn more in order to correct them.
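
The features listed above map onto a handful of top-level TFDV functions. The following is a minimal sketch, not a complete workflow: it assumes two CSV files with a header row, and the helper name `validate_against_training_data` and its parameters are illustrative only. TFDV is imported inside the function so that defining the helper does not require the package to be installed.

```python
def validate_against_training_data(train_csv: str, eval_csv: str):
    """Infer a schema from training data and check evaluation data against it.

    Both arguments are paths to CSV files (hypothetical inputs for this sketch).
    Returns the inferred schema and the anomalies found in the evaluation data.
    """
    # Imported lazily: TFDV pulls in TensorFlow and Apache Beam.
    import tensorflow_data_validation as tfdv

    # Summary statistics of the training data.
    train_stats = tfdv.generate_statistics_from_csv(data_location=train_csv)

    # Automated schema generation (types, domains, presence expectations).
    schema = tfdv.infer_schema(statistics=train_stats)

    # Statistics of the data to be checked, then anomaly detection.
    eval_stats = tfdv.generate_statistics_from_csv(data_location=eval_csv)
    anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)

    return schema, anomalies
```

In a notebook, the viewers mentioned above are available through `tfdv.visualize_statistics`, `tfdv.display_schema`, and `tfdv.display_anomalies`, which take the statistics, schema, and anomalies objects respectively.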

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation
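
To confirm that the installation succeeded, you can ask Python for the installed version of the distribution. This sketch uses only the standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_version` is illustrative and not part of TFDV.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Prints the installed TFDV version, or None if TFDV is not installed.
print(installed_version("tensorflow-data-validation"))
```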

Nightly Packages

TFDV also hosts nightly packages on Google Cloud. To install the latest nightly package, please use the following command:

export TFX_DEPENDENCY_SELECTOR=NIGHTLY
pip install --extra-index-url https://pypi-nightly.tensorflow.org/simple tensorflow-data-validation

This will install the nightly packages for the major dependencies of TFDV such as TFX Basic Shared Libraries (TFX-BSL) and TensorFlow Metadata (TFMD).

Sometimes TFDV uses those dependencies' most recent changes, which are not yet released. Because of this, it is safer to use nightly versions of those dependent libraries when using nightly TFDV. Export the TFX_DEPENDENCY_SELECTOR environment variable to do so.
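
If you drive the installation from a Python script rather than a shell, the same setting can be passed through the child process environment. This is a sketch under the assumption that setting the variable for the `pip` subprocess is equivalent to exporting it in the shell; the function name `nightly_install_invocation` is hypothetical.

```python
import os
import sys

NIGHTLY_INDEX_URL = "https://pypi-nightly.tensorflow.org/simple"


def nightly_install_invocation(base_env=None):
    """Build the pip command and environment for a nightly TFDV install.

    Nothing is executed here; the caller decides whether to run the command.
    """
    # Copy the environment so the current process is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["TFX_DEPENDENCY_SELECTOR"] = "NIGHTLY"
    cmd = [
        sys.executable, "-m", "pip", "install",
        "--extra-index-url", NIGHTLY_INDEX_URL,
        "tensorflow-data-validation",
    ]
    return cmd, env
```

To actually perform the install, pass the result to `subprocess.run(cmd, env=env, check=True)`.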

NOTE: These nightly packages are unstable and breakages are likely. Fixes can take a week or more, depending on their complexity.

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

First install Docker and docker-compose by following their respective installation instructions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {38, 39, 310}.

A wheel will be produced under dist/.
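
The PYTHON_VERSION value is the interpreter's major and minor version written without a dot. As a small sketch, the value matching the interpreter you are currently running can be derived as follows; the helper names are illustrative, and the supported set is simply the one listed above.

```python
import sys

# Values accepted by the Docker build, as listed in this document.
SUPPORTED_BUILD_TARGETS = {"38", "39", "310"}


def python_version_tag(version_info=None) -> str:
    """Return e.g. "310" for Python 3.10 (defaults to the running interpreter)."""
    info = sys.version_info if version_info is None else version_info
    return f"{info[0]}{info[1]}"


def is_supported_build_target(tag: str) -> bool:
    """Check whether a tag is one of the documented PYTHON_VERSION values."""
    return tag in SUPPORTED_BUILD_TARGETS


tag = python_version_tag()
print(f"PYTHON_VERSION={tag} (supported: {is_supported_build_target(tag)})")
```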

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

The TFDV wheel is Python-version dependent. To build a pip package that works for a specific Python version, use that Python binary to run:

python setup.py bdist_wheel

You can find the generated .whl file in the dist subdirectory.
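
Because the wheel is tied to a Python version, it can be useful to check the tags in a wheel's filename before installing it. The sketch below follows the standard wheel naming convention (`{distribution}-{version}(-{build})?-{python tag}-{abi tag}-{platform tag}.whl`); the type and function names are illustrative, and comparing against a `cp{major}{minor}` tag assumes a CPython interpreter.

```python
import sys
from typing import NamedTuple, Optional


class WheelTags(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split a wheel filename into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, py, abi, plat = parts
        build = None
    elif len(parts) == 6:
        name, version, build, py, abi, plat = parts
    else:
        raise ValueError(f"unexpected wheel filename layout: {filename!r}")
    return WheelTags(name, version, build, py, abi, plat)


def matches_interpreter(filename: str, version_info=None) -> bool:
    """True if the wheel's Python tag matches the given (or running) CPython."""
    info = sys.version_info if version_info is None else version_info
    return parse_wheel_filename(filename).python_tag == f"cp{info[0]}{info[1]}"
```

For example, `matches_interpreter(name)` on each file in `dist/` shows which wheel was built for the interpreter you are about to install into.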

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 12.5 (Monterey) or later.
  • Ubuntu 20.04 or later.
  • Windows 7 or later.

Notable Dependencies

TensorFlow is required.

Apache Beam is required; it's the way that efficient distributed computation is supported. By default, Apache Beam runs in local mode but can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.

Apache Arrow is also required. TFDV uses Arrow to represent data internally so that it can make use of vectorized NumPy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

| tensorflow-data-validation | apache-beam[gcp] | pyarrow | tensorflow | tensorflow-metadata | tensorflow-transform | tfx-bsl |
| GitHub master | 2.47.0 | 10.0.0 | nightly (2.x) | 1.14.0 | n/a | 1.14.0 |
| 1.14.0 | 2.47.0 | 10.0.0 | 2.13 | 1.14.0 | n/a | 1.14.0 |
| 1.13.0 | 2.40.0 | 6.0.0 | 2.12 | 1.13.1 | n/a | 1.13.0 |
| 1.12.0 | 2.40.0 | 6.0.0 | 2.11 | 1.12.0 | n/a | 1.12.0 |
| 1.11.0 | 2.40.0 | 6.0.0 | 1.15 / 2.10 | 1.11.0 | n/a | 1.11.0 |
| 1.10.0 | 2.40.0 | 6.0.0 | 1.15 / 2.9 | 1.10.0 | n/a | 1.10.1 |
| 1.9.0 | 2.38.0 | 5.0.0 | 1.15 / 2.9 | 1.9.0 | n/a | 1.9.0 |
| 1.8.0 | 2.38.0 | 5.0.0 | 1.15 / 2.8 | 1.8.0 | n/a | 1.8.0 |
| 1.7.0 | 2.36.0 | 5.0.0 | 1.15 / 2.8 | 1.7.0 | n/a | 1.7.0 |
| 1.6.0 | 2.35.0 | 5.0.0 | 1.15 / 2.7 | 1.6.0 | n/a | 1.6.0 |
| 1.5.0 | 2.34.0 | 5.0.0 | 1.15 / 2.7 | 1.5.0 | n/a | 1.5.0 |
| 1.4.0 | 2.32.0 | 4.0.1 | 1.15 / 2.6 | 1.4.0 | n/a | 1.4.0 |
| 1.3.0 | 2.32.0 | 2.0.0 | 1.15 / 2.6 | 1.2.0 | n/a | 1.3.0 |
| 1.2.0 | 2.31.0 | 2.0.0 | 1.15 / 2.5 | 1.2.0 | n/a | 1.2.0 |
| 1.1.1 | 2.29.0 | 2.0.0 | 1.15 / 2.5 | 1.1.0 | n/a | 1.1.1 |
| 1.1.0 | 2.29.0 | 2.0.0 | 1.15 / 2.5 | 1.1.0 | n/a | 1.1.0 |
| 1.0.0 | 2.29.0 | 2.0.0 | 1.15 / 2.5 | 1.0.0 | n/a | 1.0.0 |
| 0.30.0 | 2.28.0 | 2.0.0 | 1.15 / 2.4 | 0.30.0 | n/a | 0.30.0 |
| 0.29.0 | 2.28.0 | 2.0.0 | 1.15 / 2.4 | 0.29.0 | n/a | 0.29.0 |
| 0.28.0 | 2.28.0 | 2.0.0 | 1.15 / 2.4 | 0.28.0 | n/a | 0.28.1 |
| 0.27.0 | 2.27.0 | 2.0.0 | 1.15 / 2.4 | 0.27.0 | n/a | 0.27.0 |
| 0.26.1 | 2.28.0 | 0.17.0 | 1.15 / 2.3 | 0.26.0 | 0.26.0 | 0.26.0 |
| 0.26.0 | 2.25.0 | 0.17.0 | 1.15 / 2.3 | 0.26.0 | 0.26.0 | 0.26.0 |
| 0.25.0 | 2.25.0 | 0.17.0 | 1.15 / 2.3 | 0.25.0 | 0.25.0 | 0.25.0 |
| 0.24.1 | 2.24.0 | 0.17.0 | 1.15 / 2.3 | 0.24.0 | 0.24.1 | 0.24.1 |
| 0.24.0 | 2.23.0 | 0.17.0 | 1.15 / 2.3 | 0.24.0 | 0.24.0 | 0.24.0 |
| 0.23.1 | 2.24.0 | 0.17.0 | 1.15 / 2.3 | 0.23.0 | 0.23.0 | 0.23.0 |
| 0.23.0 | 2.23.0 | 0.17.0 | 1.15 / 2.3 | 0.23.0 | 0.23.0 | 0.23.0 |
| 0.22.2 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.1 |
| 0.22.1 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.1 |
| 0.22.0 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.0 |
| 0.21.5 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.1 | 0.21.3 |
| 0.21.4 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.1 | 0.21.3 |
| 0.21.2 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0 |
| 0.21.1 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0 |
| 0.21.0 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0 |
| 0.15.0 | 2.16.0 | 0.14.0 | 1.15 / 2.0 | 0.15.0 | 0.15.0 | 0.15.0 |
| 0.14.1 | 2.14.0 | 0.14.0 | 1.14 | 0.14.0 | 0.14.0 | n/a |
| 0.14.0 | 2.14.0 | 0.14.0 | 1.14 | 0.14.0 | 0.14.0 | n/a |
| 0.13.1 | 2.11.0 | n/a | 1.13 | 0.12.1 | 0.13.0 | n/a |
| 0.13.0 | 2.11.0 | n/a | 1.13 | 0.12.1 | 0.13.0 | n/a |
| 0.12.0 | 2.10.0 | n/a | 1.12 | 0.12.1 | 0.12.0 | n/a |
| 0.11.0 | 2.8.0 | n/a | 1.11 | 0.9.0 | 0.11.0 | n/a |
| 0.9.0 | 2.6.0 | n/a | 1.9 | n/a | n/a | n/a |
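
The compatibility information above can also be used programmatically. The following sketch encodes the three most recent released rows as a dictionary and reports which installed versions differ from the tested ones. The names `COMPATIBLE_VERSIONS` and `version_mismatches` are illustrative, `apache-beam[gcp]` is keyed as `apache-beam`, and treating a table entry such as `2.13` as matching any `2.13.x` release is an assumption of this example rather than a documented rule.

```python
# Tested versions for recent TFDV releases, copied from the table above.
COMPATIBLE_VERSIONS = {
    "1.14.0": {"apache-beam": "2.47.0", "pyarrow": "10.0.0", "tensorflow": "2.13",
               "tensorflow-metadata": "1.14.0", "tfx-bsl": "1.14.0"},
    "1.13.0": {"apache-beam": "2.40.0", "pyarrow": "6.0.0", "tensorflow": "2.12",
               "tensorflow-metadata": "1.13.1", "tfx-bsl": "1.13.0"},
    "1.12.0": {"apache-beam": "2.40.0", "pyarrow": "6.0.0", "tensorflow": "2.11",
               "tensorflow-metadata": "1.12.0", "tfx-bsl": "1.12.0"},
}


def version_mismatches(tfdv_version: str, installed: dict) -> dict:
    """Map each package that differs from the tested version to (tested, installed).

    `installed` maps package names to version strings; a missing package is
    reported with an installed value of None.
    """
    mismatches = {}
    for package, tested in COMPATIBLE_VERSIONS[tfdv_version].items():
        have = installed.get(package)
        # Accept an exact match, or a patch release of a "major.minor" entry.
        ok = have is not None and (have == tested or have.startswith(tested + "."))
        if not ok:
            mismatches[package] = (tested, have)
    return mismatches
```

An empty result means every listed dependency matches a tested version. As noted above, untested combinations may still work, so a mismatch is a hint rather than an error.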

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.
