A library for exploring and validating machine learning data.

TensorFlow Data Validation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets).
  • Automated data-schema generation to describe expectations about data, such as required values, ranges, and vocabularies.
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify anomalies such as missing features, out-of-range values, or wrong feature types.
  • An anomalies viewer so that you can see which features have anomalies and learn more in order to correct them.
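
The capabilities listed above can be chained into a short workflow. The following is a minimal sketch, assuming TFDV is installed and that the inputs are CSV files with a header row; the function name `find_anomalies` and its arguments are hypothetical names chosen for illustration, not part of TFDV.

```python
def find_anomalies(train_csv, eval_csv):
    """Infer a schema from training data and check evaluation data against it.

    `train_csv` and `eval_csv` are hypothetical paths to CSV files.
    Returns a dict mapping feature name -> anomaly description.
    """
    # Imported lazily because TFDV pulls in TensorFlow and Apache Beam.
    import tensorflow_data_validation as tfdv

    # 1. Compute summary statistics over the training data.
    train_stats = tfdv.generate_statistics_from_csv(data_location=train_csv)

    # 2. Generate a schema describing the expectations about the data.
    schema = tfdv.infer_schema(statistics=train_stats)

    # 3. Compute statistics for the evaluation data and validate them.
    eval_stats = tfdv.generate_statistics_from_csv(data_location=eval_csv)
    anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)

    # 4. Collect a readable summary of any anomalies found.
    return {
        feature: info.description
        for feature, info in anomalies.anomaly_info.items()
    }
```

In a notebook, the viewers mentioned above are available through `tfdv.visualize_statistics`, `tfdv.display_schema`, and `tfdv.display_anomalies`.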

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Caution: TFDV may be backwards incompatible before version 1.0.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation
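
To confirm that the installation succeeded, one option is to query the installed distribution's metadata. This is a small sketch that assumes Python 3.8 or later (for `importlib.metadata`); the helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package="tensorflow-data-validation"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_version() or "tensorflow-data-validation is not installed")
```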

Nightly Packages

TFDV also hosts nightly packages at https://pypi-nightly.tensorflow.org on Google Cloud. To install the latest nightly package, please use the following command:

pip install -i https://pypi-nightly.tensorflow.org/simple tensorflow-data-validation

This will install the nightly packages for the major dependencies of TFDV such as TFX Basic Shared Libraries (TFX-BSL) and TensorFlow Metadata (TFMD).

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

First install Docker and Docker Compose by following their official installation directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {36, 37, 38}.

A wheel will be produced under dist/.
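
As an illustration of what to expect in dist/, the sketch below maps a PYTHON_VERSION value to the wheel file name. It assumes the naming pattern of the wheels published for release 0.30.0 and the fact that CPython dropped the "m" ABI suffix in 3.8; the helper names are hypothetical, and a locally built wheel may carry a different platform tag.

```python
def wheel_python_tags(python_version):
    """Map a PYTHON_VERSION value such as '37' to its 'pythontag-abitag' pair."""
    major, minor = int(python_version[0]), int(python_version[1:])
    # CPython 3.7 and earlier use the 'm' (pymalloc) ABI suffix; 3.8+ do not.
    abi_suffix = "m" if (major, minor) < (3, 8) else ""
    return f"cp{python_version}-cp{python_version}{abi_suffix}"


def expected_wheel_name(tfdv_version, python_version,
                        platform_tag="manylinux2010_x86_64"):
    """Build the wheel file name expected for the given versions."""
    tags = wheel_python_tags(python_version)
    return f"tensorflow_data_validation-{tfdv_version}-{tags}-{platform_tag}.whl"


if __name__ == "__main__":
    for version in ("36", "37", "38"):
        print(expected_wheel_name("0.30.0", version))
```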

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

The TFDV wheel is Python-version dependent. To build a pip package that works for a specific Python version, use that Python binary to run:

python setup.py bdist_wheel

You can find the generated .whl file in the dist subdirectory.

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.14.6 (Mojave) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Notable Dependencies

TensorFlow is required.

Apache Beam is required; TFDV uses it to run efficient distributed computation. By default, Apache Beam runs in local mode, but it can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.

Apache Arrow is also required. TFDV uses Arrow to represent data internally so that it can make use of vectorized NumPy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

| tensorflow-data-validation | apache-beam[gcp] | pyarrow | tensorflow | tensorflow-metadata | tensorflow-transform | tfx-bsl |
|---|---|---|---|---|---|---|
| GitHub master | 2.28.0 | 2.0.0 | nightly (1.x/2.x) | 0.30.0 | n/a | 0.30.0 |
| 0.30.0 | 2.28.0 | 2.0.0 | 1.15 / 2.4 | 0.30.0 | n/a | 0.30.0 |
| 0.29.0 | 2.28.0 | 2.0.0 | 1.15 / 2.4 | 0.29.0 | n/a | 0.29.0 |
| 0.28.0 | 2.28.0 | 2.0.0 | 1.15 / 2.4 | 0.28.0 | n/a | 0.28.1 |
| 0.27.0 | 2.27.0 | 2.0.0 | 1.15 / 2.4 | 0.27.0 | n/a | 0.27.0 |
| 0.26.0 | 2.25.0 | 0.17.0 | 1.15 / 2.3 | 0.26.0 | 0.26.0 | 0.26.0 |
| 0.25.0 | 2.25.0 | 0.17.0 | 1.15 / 2.3 | 0.25.0 | 0.25.0 | 0.25.0 |
| 0.24.1 | 2.24.0 | 0.17.0 | 1.15 / 2.3 | 0.24.0 | 0.24.1 | 0.24.1 |
| 0.24.0 | 2.23.0 | 0.17.0 | 1.15 / 2.3 | 0.24.0 | 0.24.0 | 0.24.0 |
| 0.23.1 | 2.24.0 | 0.17.0 | 1.15 / 2.3 | 0.23.0 | 0.23.0 | 0.23.0 |
| 0.23.0 | 2.23.0 | 0.17.0 | 1.15 / 2.3 | 0.23.0 | 0.23.0 | 0.23.0 |
| 0.22.2 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.1 |
| 0.22.1 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.1 |
| 0.22.0 | 2.20.0 | 0.16.0 | 1.15 / 2.2 | 0.22.0 | 0.22.0 | 0.22.0 |
| 0.21.5 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.1 | 0.21.3 |
| 0.21.4 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.1 | 0.21.3 |
| 0.21.2 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0 |
| 0.21.1 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0 |
| 0.21.0 | 2.17.0 | 0.15.0 | 1.15 / 2.1 | 0.21.0 | 0.21.0 | 0.21.0 |
| 0.15.0 | 2.16.0 | 0.14.0 | 1.15 / 2.0 | 0.15.0 | 0.15.0 | 0.15.0 |
| 0.14.1 | 2.14.0 | 0.14.0 | 1.14 | 0.14.0 | 0.14.0 | n/a |
| 0.14.0 | 2.14.0 | 0.14.0 | 1.14 | 0.14.0 | 0.14.0 | n/a |
| 0.13.1 | 2.11.0 | n/a | 1.13 | 0.12.1 | 0.13.0 | n/a |
| 0.13.0 | 2.11.0 | n/a | 1.13 | 0.12.1 | 0.13.0 | n/a |
| 0.12.0 | 2.10.0 | n/a | 1.12 | 0.12.1 | 0.12.0 | n/a |
| 0.11.0 | 2.8.0 | n/a | 1.11 | 0.9.0 | 0.11.0 | n/a |
| 0.9.0 | 2.6.0 | n/a | 1.9 | n/a | n/a | n/a |
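
The compatibility table can also be turned into pinned requirements for a reproducible environment. The sketch below copies three rows from the table and emits `pip`-style pins; the names `COLUMNS`, `ROWS`, and `pip_requirements` are hypothetical, and pinning exact versions is an assumption for illustration, since other untested combinations may also work. Entries marked n/a, and TensorFlow entries that list two supported release lines, are left unpinned.

```python
# Column order of the compatibility table, after the TFDV version column.
COLUMNS = (
    "apache-beam[gcp]",
    "pyarrow",
    "tensorflow",
    "tensorflow-metadata",
    "tensorflow-transform",
    "tfx-bsl",
)

# A few rows copied from the table, keyed by TFDV version.
ROWS = {
    "0.30.0": ("2.28.0", "2.0.0", "1.15 / 2.4", "0.30.0", "n/a", "0.30.0"),
    "0.26.0": ("2.25.0", "0.17.0", "1.15 / 2.3", "0.26.0", "0.26.0", "0.26.0"),
    "0.14.1": ("2.14.0", "0.14.0", "1.14", "0.14.0", "0.14.0", "n/a"),
}


def pip_requirements(tfdv_version):
    """Return pip-style pins for the tested companions of a TFDV version."""
    requirements = [f"tensorflow-data-validation=={tfdv_version}"]
    for package, version in zip(COLUMNS, ROWS[tfdv_version]):
        # Skip 'n/a' entries and cells naming two release lines ('1.15 / 2.4').
        if version == "n/a" or "/" in version:
            continue
        requirements.append(f"{package}=={version}")
    return requirements


if __name__ == "__main__":
    print("\n".join(pip_requirements("0.30.0")))
```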

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release. See the tutorial on generating distribution archives.

Built Distributions

  • tensorflow_data_validation-0.30.0-cp38-cp38-win_amd64.whl (1.1 MB): CPython 3.8, Windows x86-64
  • tensorflow_data_validation-0.30.0-cp38-cp38-manylinux2010_x86_64.whl (1.4 MB): CPython 3.8, manylinux (glibc 2.12+) x86-64
  • tensorflow_data_validation-0.30.0-cp38-cp38-macosx_10_9_x86_64.whl (1.5 MB): CPython 3.8, macOS 10.9+ x86-64
  • tensorflow_data_validation-0.30.0-cp37-cp37m-win_amd64.whl (1.1 MB): CPython 3.7m, Windows x86-64
  • tensorflow_data_validation-0.30.0-cp37-cp37m-manylinux2010_x86_64.whl (1.4 MB): CPython 3.7m, manylinux (glibc 2.12+) x86-64
  • tensorflow_data_validation-0.30.0-cp37-cp37m-macosx_10_9_x86_64.whl (1.5 MB): CPython 3.7m, macOS 10.9+ x86-64
  • tensorflow_data_validation-0.30.0-cp36-cp36m-win_amd64.whl (1.1 MB): CPython 3.6m, Windows x86-64
  • tensorflow_data_validation-0.30.0-cp36-cp36m-manylinux2010_x86_64.whl (1.4 MB): CPython 3.6m, manylinux (glibc 2.12+) x86-64
  • tensorflow_data_validation-0.30.0-cp36-cp36m-macosx_10_9_x86_64.whl (1.5 MB): CPython 3.6m, macOS 10.9+ x86-64

File details

Details for the file tensorflow_data_validation-0.30.0-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.30.0-cp38-cp38-win_amd64.whl
  • Size: 1.1 MB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.8.5

File hashes

Hashes for tensorflow_data_validation-0.30.0-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 418443926c66e6f4781118fdbdb223fdc6505d1c3feb0a308dcc2cc3aa9b5ca6
MD5 08f004a0f5842235b261a9859aee7cee
BLAKE2b-256 9eef94b65f280fed951ac53e23e2b681db47e5c9d440c3d71ea33abb032a7a91

See more details on using hashes here.
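
A downloaded wheel can be checked against its published SHA256 digest before installation. The following is a minimal sketch using only the standard library; the helper names `sha256_of` and `matches_published_hash` are hypothetical, and the file path in the usage note is an assumption about where the wheel was saved.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(Path(path)) == expected_hex.strip().lower()
```

For example, `matches_published_hash("tensorflow_data_validation-0.30.0-cp38-cp38-win_amd64.whl", "418443926c66e6f4781118fdbdb223fdc6505d1c3feb0a308dcc2cc3aa9b5ca6")` should return `True` for an intact copy of that wheel saved in the current directory.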

File details

Details for the file tensorflow_data_validation-0.30.0-cp38-cp38-manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.30.0-cp38-cp38-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 fe1601222a6499db2b0aa43f8c908b48a2eb403073af551be4e6f9da0830ccc5
MD5 be557d8343bcad3a29bc6fe545cbc468
BLAKE2b-256 dbc190a1aeebbd7040fd13793a2c30c4139084df6b69b77f1eb087ff29e11f27

File details

Details for the file tensorflow_data_validation-0.30.0-cp38-cp38-macosx_10_9_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.30.0-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 2a0e766432c37228d872562969d205b96aafd175ad16a57018169c0093171fac
MD5 ddade9198aecb858812a3d660f0120f7
BLAKE2b-256 f04c284a245d785c070765c6587d1b6c3a27b45d4f0a83fd08981b08fa41e2b7

File details

Details for the file tensorflow_data_validation-0.30.0-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.30.0-cp37-cp37m-win_amd64.whl
  • Size: 1.1 MB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.7.0

File hashes

Hashes for tensorflow_data_validation-0.30.0-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 f65c6e6d66bffe43c7e912f4f6aa48743d9c981794a29f88e7252457ac1dfcdf
MD5 094d796bf8747b18fcc8b5f183bf7ff8
BLAKE2b-256 259162a56084247bf436181fc4e614a08e3e5140add4219ba47ddb23e5bf406b

File details

Details for the file tensorflow_data_validation-0.30.0-cp37-cp37m-manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.30.0-cp37-cp37m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 3f18d8546192c7cee2bfd2184a8bf90440c0147964b068b6934226ff0f21d412
MD5 1d1a65bad97c5020b98e20f34dcbc8c9
BLAKE2b-256 d86ba352a0077d902c94f4ab58e3f73d08e6b06a932b99ea8b6b385ac6d42686

File details

Details for the file tensorflow_data_validation-0.30.0-cp37-cp37m-macosx_10_9_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.30.0-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 4d30ae2b5d0d234c1b2aea4678c7be5bb15e97497c0b948770f2e31367410c16
MD5 5fc1627cf54b4d699cede4da8529375f
BLAKE2b-256 17dc590644f7a2ee21703c9f3956bea08d7e69c5922bfcf6025c6f2a03f1dfbc

File details

Details for the file tensorflow_data_validation-0.30.0-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.30.0-cp36-cp36m-win_amd64.whl
  • Size: 1.1 MB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.6.5

File hashes

Hashes for tensorflow_data_validation-0.30.0-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 ea3ad456109c7f31c9c68ccf8c99abec421fc84ae50365218cb2735d29748540
MD5 5306844e6041a6f063796612471e612d
BLAKE2b-256 ddaad3fb82a1b12b49d6fcfb75d957632262aa0a5a98a0c4df464c75be9aed8b

File details

Details for the file tensorflow_data_validation-0.30.0-cp36-cp36m-manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.30.0-cp36-cp36m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 94cf22cf7239a7c96a58cf7017b4c5012ef7c02c208adf63ecda11eefd4ff4b0
MD5 fed427e7966a5d02b6a1eee189454da2
BLAKE2b-256 2c3f40ae8c7769a8dcca3469ab01d1e64f317ad227f40aefaf12852b854d69cf

File details

Details for the file tensorflow_data_validation-0.30.0-cp36-cp36m-macosx_10_9_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.30.0-cp36-cp36m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 d9cf6aced9fb72764893f7ca33fe3b9480e368e73dd7214490ab5dfea5562d6c
MD5 15cbcb164f2318271990dd872c8c67a1
BLAKE2b-256 768d764148c68cbc433b699d196e1ffd8408346cfb801d6a4585c52110cc603a
