Skip to main content

A library for exploring and validating machine learning data.

Project description

TensorFlow Data Validation

Python PyPI Documentation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets)
  • Automated data-schema generation to describe expectations about data like required values, ranges, and vocabularies
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify anomalies, such as missing features, out-of-range values, or wrong feature types, to name a few.
  • An anomalies viewer so that you can see what features have anomalies and learn more in order to correct them.

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Caution: TFDV may be backwards incompatible before version 1.0.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation

Nightly Packages

TFDV also hosts nightly packages at https://pypi-nightly.tensorflow.org on Google Cloud. To install the latest nightly package, please use the following command:

pip install -i https://pypi-nightly.tensorflow.org/simple tensorflow-data-validation

This will install the nightly packages for the major dependencies of TFDV such as TensorFlow Transform (TFT), TFX Basic Shared Libraries (TFX-BSL), TensorFlow Metadata (TFMD).

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

Please first install docker and docker-compose by following the directions: docker; docker-compose.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {35, 36, 37, 38}.

A wheel will be produced under dist/.

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

TFDV wheel is Python version dependent -- to build the pip package that works for a specific Python version, use that Python binary to run:

python setup.py bdist_wheel

You can find the generated .whl file in the dist subdirectory.

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.14.6 (Mojave) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Notable Dependencies

TensorFlow is required.

Apache Beam is required; it's the way that efficient distributed computation is supported. By default, Apache Beam runs in local mode but can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized numpy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

tensorflow-data-validation apache-beam[gcp] pyarrow tensorflow tensorflow-metadata tensorflow-transform tfx-bsl
GitHub master 2.25.0 0.17.0 nightly (1.x/2.x) 0.26.0 0.26.0 0.26.0
0.26.0 2.25.0 0.17.0 1.15 / 2.3 0.26.0 0.26.0 0.26.0
0.25.0 2.25.0 0.17.0 1.15 / 2.3 0.25.0 0.25.0 0.25.0
0.24.1 2.24.0 0.17.0 1.15 / 2.3 0.24.0 0.24.1 0.24.1
0.24.0 2.23.0 0.17.0 1.15 / 2.3 0.24.0 0.24.0 0.24.0
0.23.1 2.24.0 0.17.0 1.15 / 2.3 0.23.0 0.23.0 0.23.0
0.23.0 2.23.0 0.17.0 1.15 / 2.3 0.23.0 0.23.0 0.23.0
0.22.2 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.1
0.22.1 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.1
0.22.0 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.0
0.21.5 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.1 0.21.3
0.21.4 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.1 0.21.3
0.21.2 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.21.1 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.21.0 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.15.0 2.16.0 0.14.0 1.15 / 2.0 0.15.0 0.15.0 0.15.0
0.14.1 2.14.0 0.14.0 1.14 0.14.0 0.14.0 n/a
0.14.0 2.14.0 0.14.0 1.14 0.14.0 0.14.0 n/a
0.13.1 2.11.0 n/a 1.13 0.12.1 0.13.0 n/a
0.13.0 2.11.0 n/a 1.13 0.12.1 0.13.0 n/a
0.12.0 2.10.0 n/a 1.12 0.12.1 0.12.0 n/a
0.11.0 2.8.0 n/a 1.11 0.9.0 0.11.0 n/a
0.9.0 2.6.0 n/a 1.9 n/a n/a n/a

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.

Links

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

tensorflow_data_validation-0.26.0-cp38-cp38-win_amd64.whl (1.1 MB view details)

Uploaded CPython 3.8 Windows x86-64

tensorflow_data_validation-0.26.0-cp38-cp38-manylinux2010_x86_64.whl (1.3 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-0.26.0-cp38-cp38-macosx_10_9_x86_64.whl (3.1 MB view details)

Uploaded CPython 3.8 macOS 10.9+ x86-64

tensorflow_data_validation-0.26.0-cp37-cp37m-win_amd64.whl (1.1 MB view details)

Uploaded CPython 3.7m Windows x86-64

tensorflow_data_validation-0.26.0-cp37-cp37m-manylinux2010_x86_64.whl (1.3 MB view details)

Uploaded CPython 3.7m manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-0.26.0-cp37-cp37m-macosx_10_9_x86_64.whl (3.1 MB view details)

Uploaded CPython 3.7m macOS 10.9+ x86-64

tensorflow_data_validation-0.26.0-cp36-cp36m-win_amd64.whl (1.1 MB view details)

Uploaded CPython 3.6m Windows x86-64

tensorflow_data_validation-0.26.0-cp36-cp36m-manylinux2010_x86_64.whl (1.3 MB view details)

Uploaded CPython 3.6m manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-0.26.0-cp36-cp36m-macosx_10_9_x86_64.whl (3.1 MB view details)

Uploaded CPython 3.6m macOS 10.9+ x86-64

File details

Details for the file tensorflow_data_validation-0.26.0-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.26.0-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.1 setuptools/51.0.0 requests-toolbelt/0.9.1 tqdm/4.54.1 CPython/3.8.5

File hashes

Hashes for tensorflow_data_validation-0.26.0-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 ffdcd4d9a6d10fd36b796438e6a5834e7fffa0f502a963a76fe5cb7a87c56569
MD5 1773735c21152064393cbd4b303548cc
BLAKE2b-256 4f44e0b109b44b50075647294bf67ea3723f7b68f952592d3425a7c6287751e2

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-0.26.0-cp38-cp38-manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-0.26.0-cp38-cp38-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 db58e0b0a9cb75eba099bb771512306b2ff0e76ee7f9e611d34b3b8adc7db351
MD5 2c9ce25748d1b0854c0437833d7a585b
BLAKE2b-256 6979b620c87da1e262b87bba5530134d1d6d34368943520a06a9b0d1fd51162d

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-0.26.0-cp38-cp38-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-0.26.0-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 496509b8b72a988202ad032ad00ab4afc2895379798526a365028ba7ae8b5521
MD5 7de2235942774da3c7728be3bd2b5d63
BLAKE2b-256 7dfef4b9b5b7955ce6ab447ba2b33532991fe4d2b838d1d225cac004722a3991

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-0.26.0-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.26.0-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.1 setuptools/51.0.0 requests-toolbelt/0.9.1 tqdm/4.54.1 CPython/3.7.0

File hashes

Hashes for tensorflow_data_validation-0.26.0-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 ce64f5752541ef96b7b8c0b6ae245bb58ccb84360c19e5de0aa6cf46b7394261
MD5 93e483a2a01b468152541604f8445c38
BLAKE2b-256 75d1c005de80eb495145615b81ddd1280c7a699a4533846094bf27a6b136cfe5

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-0.26.0-cp37-cp37m-manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-0.26.0-cp37-cp37m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 d2b72f5b9e1344ea8c7cad1f537b0e97f748dd11ade29aaa930fad90d335f5cd
MD5 31e9ec8c37470c1aa1a74330cd13b0c8
BLAKE2b-256 019afaa0b345ddd6266a36a210b2e23f861757dc2834639940c16a61f66ed404

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-0.26.0-cp37-cp37m-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-0.26.0-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 f8395239adf5cca4f91dfa91db944daa0ff42a62b3ad4cc540ba7dfa4d4f2591
MD5 7dc49306db2eb43f767602fc6b50e3c0
BLAKE2b-256 8ea4c54507f9ca93c1d89d2c3e77b59fc7b4053b74a1bc8e17601c40bfe2cf6d

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-0.26.0-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.26.0-cp36-cp36m-win_amd64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.1 setuptools/51.0.0 requests-toolbelt/0.9.1 tqdm/4.54.1 CPython/3.6.1

File hashes

Hashes for tensorflow_data_validation-0.26.0-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 1257abba94bc0201d7d024734147b0425ec6abbedad5be1d0e69a9e36dd215c0
MD5 2263939a46dc24ce3d4a687b28e17784
BLAKE2b-256 aa2c5d1c2b4ef264cac7f2c3234a6d2dff0abf3a3ef36e8eb922509f60b3d2ab

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-0.26.0-cp36-cp36m-manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-0.26.0-cp36-cp36m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 6820b94e525de5ff7954d48c8534774fe8e347bd423e6a0d7e16c87c6d2ba920
MD5 3188f8ec721ffcfc3164909fa86f5632
BLAKE2b-256 4c1c31d558bc1c96a13ba630d3fcbc5fd7de93e5ddd7f9d82fb9da0b64e1a210

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-0.26.0-cp36-cp36m-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-0.26.0-cp36-cp36m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 4421e972c575712424e52e69e2a1975e9e88cdca46772d67111a7ea48dcf4f77
MD5 04dabbfa76e69485b30e4fb4637c9726
BLAKE2b-256 85661537e0a0c2b4510739aa40e9d1c4c2ba0d37cb67a6228088ab988ededb28

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page