Skip to main content

A library for exploring and validating machine learning data.

Project description

TensorFlow Data Validation

Python PyPI Documentation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets)
  • Automated data-schema generation to describe expectations about data like required values, ranges, and vocabularies
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify anomalies, such as missing features, out-of-range values, or wrong feature types, to name a few.
  • An anomalies viewer so that you can see what features have anomalies and learn more in order to correct them.

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation

Nightly Packages

TFDV also hosts nightly packages on Google Cloud. To install the latest nightly package, please use the following command:

export TFX_DEPENDENCY_SELECTOR=NIGHTLY
pip install --extra-index-url https://pypi-nightly.tensorflow.org/simple tensorflow-data-validation

This will install the nightly packages for the major dependencies of TFDV such as TFX Basic Shared Libraries (TFX-BSL) and TensorFlow Metadata (TFMD).

Sometimes TFDV uses those dependencies' most recent changes, which are not yet released. Because of this, it is safer to use nightly versions of those dependent libraries when using nightly TFDV. Export the TFX_DEPENDENCY_SELECTOR environment variable to do so.

NOTE: These nightly packages are unstable and breakages are likely to happen. The fix could often take a week or more depending on the complexity involved.

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

Please first install docker and docker-compose by following the directions: docker; docker-compose.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {39, 310, 311}.

A wheel will be produced under dist/.

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

TFDV wheel is Python version dependent -- to build the pip package that works for a specific Python version, use that Python binary to run:

python setup.py bdist_wheel

You can find the generated .whl file in the dist subdirectory.

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 12.5 (Monterey) or later.
  • Ubuntu 20.04 or later.

Notable Dependencies

TensorFlow is required.

Apache Beam is required; it's the way that efficient distributed computation is supported. By default, Apache Beam runs in local mode but can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized numpy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

tensorflow-data-validation apache-beam[gcp] pyarrow tensorflow tensorflow-metadata tensorflow-transform tfx-bsl
GitHub master 2.59.0 10.0.1 nightly (2.x) 1.16.0 n/a 1.16.0
1.16.0 2.59.0 10.0.1 2.16 1.16.0 n/a 1.16.0
1.15.1 2.47.0 10.0.0 2.15 1.15.0 n/a 1.15.1
1.15.0 2.47.0 10.0.0 2.15 1.15.0 n/a 1.15.0
1.14.0 2.47.0 10.0.0 2.13 1.14.0 n/a 1.14.0
1.13.0 2.40.0 6.0.0 2.12 1.13.1 n/a 1.13.0
1.12.0 2.40.0 6.0.0 2.11 1.12.0 n/a 1.12.0
1.11.0 2.40.0 6.0.0 1.15 / 2.10 1.11.0 n/a 1.11.0
1.10.0 2.40.0 6.0.0 1.15 / 2.9 1.10.0 n/a 1.10.1
1.9.0 2.38.0 5.0.0 1.15 / 2.9 1.9.0 n/a 1.9.0
1.8.0 2.38.0 5.0.0 1.15 / 2.8 1.8.0 n/a 1.8.0
1.7.0 2.36.0 5.0.0 1.15 / 2.8 1.7.0 n/a 1.7.0
1.6.0 2.35.0 5.0.0 1.15 / 2.7 1.6.0 n/a 1.6.0
1.5.0 2.34.0 5.0.0 1.15 / 2.7 1.5.0 n/a 1.5.0
1.4.0 2.32.0 4.0.1 1.15 / 2.6 1.4.0 n/a 1.4.0
1.3.0 2.32.0 2.0.0 1.15 / 2.6 1.2.0 n/a 1.3.0
1.2.0 2.31.0 2.0.0 1.15 / 2.5 1.2.0 n/a 1.2.0
1.1.1 2.29.0 2.0.0 1.15 / 2.5 1.1.0 n/a 1.1.1
1.1.0 2.29.0 2.0.0 1.15 / 2.5 1.1.0 n/a 1.1.0
1.0.0 2.29.0 2.0.0 1.15 / 2.5 1.0.0 n/a 1.0.0
0.30.0 2.28.0 2.0.0 1.15 / 2.4 0.30.0 n/a 0.30.0
0.29.0 2.28.0 2.0.0 1.15 / 2.4 0.29.0 n/a 0.29.0
0.28.0 2.28.0 2.0.0 1.15 / 2.4 0.28.0 n/a 0.28.1
0.27.0 2.27.0 2.0.0 1.15 / 2.4 0.27.0 n/a 0.27.0
0.26.1 2.28.0 0.17.0 1.15 / 2.3 0.26.0 0.26.0 0.26.0
0.26.0 2.25.0 0.17.0 1.15 / 2.3 0.26.0 0.26.0 0.26.0
0.25.0 2.25.0 0.17.0 1.15 / 2.3 0.25.0 0.25.0 0.25.0
0.24.1 2.24.0 0.17.0 1.15 / 2.3 0.24.0 0.24.1 0.24.1
0.24.0 2.23.0 0.17.0 1.15 / 2.3 0.24.0 0.24.0 0.24.0
0.23.1 2.24.0 0.17.0 1.15 / 2.3 0.23.0 0.23.0 0.23.0
0.23.0 2.23.0 0.17.0 1.15 / 2.3 0.23.0 0.23.0 0.23.0
0.22.2 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.1
0.22.1 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.1
0.22.0 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.0
0.21.5 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.1 0.21.3
0.21.4 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.1 0.21.3
0.21.2 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.21.1 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.21.0 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.15.0 2.16.0 0.14.0 1.15 / 2.0 0.15.0 0.15.0 0.15.0
0.14.1 2.14.0 0.14.0 1.14 0.14.0 0.14.0 n/a
0.14.0 2.14.0 0.14.0 1.14 0.14.0 0.14.0 n/a
0.13.1 2.11.0 n/a 1.13 0.12.1 0.13.0 n/a
0.13.0 2.11.0 n/a 1.13 0.12.1 0.13.0 n/a
0.12.0 2.10.0 n/a 1.12 0.12.1 0.12.0 n/a
0.11.0 2.8.0 n/a 1.11 0.9.0 0.11.0 n/a
0.9.0 2.6.0 n/a 1.9 n/a n/a n/a

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.

Links

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

tensorflow_data_validation-1.16.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.0 MB view details)

Uploaded CPython 3.11 manylinux: glibc 2.17+ x86-64

tensorflow_data_validation-1.16.1-cp311-cp311-macosx_12_0_x86_64.whl (20.2 MB view details)

Uploaded CPython 3.11 macOS 12.0+ x86-64

tensorflow_data_validation-1.16.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.0 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

tensorflow_data_validation-1.16.1-cp310-cp310-macosx_12_0_x86_64.whl (20.2 MB view details)

Uploaded CPython 3.10 macOS 12.0+ x86-64

tensorflow_data_validation-1.16.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.0 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

tensorflow_data_validation-1.16.1-cp39-cp39-macosx_12_0_x86_64.whl (20.2 MB view details)

Uploaded CPython 3.9 macOS 12.0+ x86-64

File details

Details for the file tensorflow_data_validation-1.16.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.16.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 bb2e666e724a418fb45cca17442d33f906e91c086ad4a897aaf4473c94e0f4ee
MD5 7572efd83c4b1797f720533ac072c534
BLAKE2b-256 da4d4e758f700f1e1b0162ff74c0dca0f0ebd8e366e57088a29ec1f3bdaf0287

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.16.1-cp311-cp311-macosx_12_0_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.16.1-cp311-cp311-macosx_12_0_x86_64.whl
Algorithm Hash digest
SHA256 b21fa86c61da5cee81b4d602953fea16878de4874eb6035bf7f3221cfeb91559
MD5 2e3f1db49ccd1034c8dbfb51a6aba5b6
BLAKE2b-256 c9d88b193132c8769d31d11e92b058cd6651b5d8cba1b91878665bdb7408260b

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.16.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.16.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 e0500a0ece8d2f0539a35243b765f5f227f5b6bfab1b1e5e0ac8cb5a9e7a47a2
MD5 1c1b1a1a755256578ba55e8d5b308cdc
BLAKE2b-256 e65ff35035a955b56e78910880d07fb7d18b8f42ae20586ba027ca210bcd2aea

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.16.1-cp310-cp310-macosx_12_0_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.16.1-cp310-cp310-macosx_12_0_x86_64.whl
Algorithm Hash digest
SHA256 b6e77fb1b9a16780866c07220d2f522774a29e56de83fe518f00bac886a5b185
MD5 f963271651a9f137c9663c63d3d400e1
BLAKE2b-256 5532b6efa864162b60346a3dce9959d99cde0f2d9f01c28f6006e9edd9a1cf4e

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.16.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.16.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 6169d46fc316c19b3401d1b3f255e6ae8fc45b4ea5c03bd2805dbed9656975b9
MD5 27e06a3e5885fef62c4d086251784ddb
BLAKE2b-256 14dadaddb7b74a6b539674fd92be04a877e82bb0b919a09a13ed3aadb5af4f30

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.16.1-cp39-cp39-macosx_12_0_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.16.1-cp39-cp39-macosx_12_0_x86_64.whl
Algorithm Hash digest
SHA256 cc25d5d5bba548425dad033fd8dad875f4eefabaaf611acd2ed57c069198b63d
MD5 6476aae9415bf7d18db483d257c08aff
BLAKE2b-256 de9dc3c1cd8ccf1281f60452c2add5c09a5a3f64533c91fecf6aa5ad1b93b2fb

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page