Skip to main content

A library for exploring and validating machine learning data.

Project description

TensorFlow Data Validation

Python PyPI Documentation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets)
  • Automated data-schema generation to describe expectations about data like required values, ranges, and vocabularies
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify anomalies, such as missing features, out-of-range values, or wrong feature types, to name a few.
  • An anomalies viewer so that you can see what features have anomalies and learn more in order to correct them.

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation

Nightly Packages

TFDV also hosts nightly packages at https://pypi-nightly.tensorflow.org on Google Cloud. To install the latest nightly package, please use the following command:

pip install -i https://pypi-nightly.tensorflow.org/simple tensorflow-data-validation

This will install the nightly packages for the major dependencies of TFDV such as TFX Basic Shared Libraries (TFX-BSL) and TensorFlow Metadata (TFMD).

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

Please first install docker and docker-compose by following the directions: docker; docker-compose.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {37, 38}.

A wheel will be produced under dist/.

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

TFDV wheel is Python version dependent -- to build the pip package that works for a specific Python version, use that Python binary to run:

python setup.py bdist_wheel

You can find the generated .whl file in the dist subdirectory.

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.14.6 (Mojave) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Notable Dependencies

TensorFlow is required.

Apache Beam is required; it's the way that efficient distributed computation is supported. By default, Apache Beam runs in local mode but can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized numpy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

tensorflow-data-validation apache-beam[gcp] pyarrow tensorflow tensorflow-metadata tensorflow-transform tfx-bsl
GitHub master 2.34.0 2.0.0 nightly (1.x/2.x) 1.5.0 n/a 1.5.0
1.5.0 2.34.0 2.0.0 1.15 / 2.7 1.5.0 n/a 1.5.0
1.4.0 2.32.0 2.0.0 1.15 / 2.6 1.4.0 n/a 1.4.0
1.3.0 2.32.0 2.0.0 1.15 / 2.6 1.2.0 n/a 1.3.0
1.2.0 2.31.0 2.0.0 1.15 / 2.5 1.2.0 n/a 1.2.0
1.1.1 2.29.0 2.0.0 1.15 / 2.5 1.1.0 n/a 1.1.1
1.1.0 2.29.0 2.0.0 1.15 / 2.5 1.1.0 n/a 1.1.0
1.0.0 2.29.0 2.0.0 1.15 / 2.5 1.0.0 n/a 1.0.0
0.30.0 2.28.0 2.0.0 1.15 / 2.4 0.30.0 n/a 0.30.0
0.29.0 2.28.0 2.0.0 1.15 / 2.4 0.29.0 n/a 0.29.0
0.28.0 2.28.0 2.0.0 1.15 / 2.4 0.28.0 n/a 0.28.1
0.27.0 2.27.0 2.0.0 1.15 / 2.4 0.27.0 n/a 0.27.0
0.26.1 2.28.0 0.17.0 1.15 / 2.3 0.26.0 0.26.0 0.26.0
0.26.0 2.25.0 0.17.0 1.15 / 2.3 0.26.0 0.26.0 0.26.0
0.25.0 2.25.0 0.17.0 1.15 / 2.3 0.25.0 0.25.0 0.25.0
0.24.1 2.24.0 0.17.0 1.15 / 2.3 0.24.0 0.24.1 0.24.1
0.24.0 2.23.0 0.17.0 1.15 / 2.3 0.24.0 0.24.0 0.24.0
0.23.1 2.24.0 0.17.0 1.15 / 2.3 0.23.0 0.23.0 0.23.0
0.23.0 2.23.0 0.17.0 1.15 / 2.3 0.23.0 0.23.0 0.23.0
0.22.2 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.1
0.22.1 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.1
0.22.0 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.0
0.21.5 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.1 0.21.3
0.21.4 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.1 0.21.3
0.21.2 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.21.1 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.21.0 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.15.0 2.16.0 0.14.0 1.15 / 2.0 0.15.0 0.15.0 0.15.0
0.14.1 2.14.0 0.14.0 1.14 0.14.0 0.14.0 n/a
0.14.0 2.14.0 0.14.0 1.14 0.14.0 0.14.0 n/a
0.13.1 2.11.0 n/a 1.13 0.12.1 0.13.0 n/a
0.13.0 2.11.0 n/a 1.13 0.12.1 0.13.0 n/a
0.12.0 2.10.0 n/a 1.12 0.12.1 0.12.0 n/a
0.11.0 2.8.0 n/a 1.11 0.9.0 0.11.0 n/a
0.9.0 2.6.0 n/a 1.9 n/a n/a n/a

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.

Links

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

tensorflow_data_validation-1.5.0-cp38-cp38-win_amd64.whl (1.2 MB view details)

Uploaded CPython 3.8 Windows x86-64

tensorflow_data_validation-1.5.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-1.5.0-cp38-cp38-macosx_10_9_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.8 macOS 10.9+ x86-64

tensorflow_data_validation-1.5.0-cp37-cp37m-win_amd64.whl (1.2 MB view details)

Uploaded CPython 3.7m Windows x86-64

tensorflow_data_validation-1.5.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.7m manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-1.5.0-cp37-cp37m-macosx_10_9_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.7m macOS 10.9+ x86-64

File details

Details for the file tensorflow_data_validation-1.5.0-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-1.5.0-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 1.2 MB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.0 importlib_metadata/4.8.2 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.5

File hashes

Hashes for tensorflow_data_validation-1.5.0-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 ea868f6da32af2db8f7dad1e39fe924dcc8e796acc73766a5fda2c4a009cf959
MD5 de10b594dda06f3f8579cdb24b0722c2
BLAKE2b-256 9703fe46d6d8cf9a978d8e7ae77646b5e459d2fa6e857c17ce6a4ebef2f0486a

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.5.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.5.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 f5b5a3a7d75ef1abf0da296829680a251a6b3081720de6c395c07c4ed823d117
MD5 9b6b6491b5e20a09e7cfd11da985d94b
BLAKE2b-256 cca717313ad2929354866ef70cdf5ffd93e9bf90fd6fc517cab7b98572e3cd0f

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.5.0-cp38-cp38-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.5.0-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 e95e7c30a86de268599f0dc5e96922ed97c17a44bc6e5780645b8a58abdb4586
MD5 b243b249ebe09aa17e46c166a909cd1d
BLAKE2b-256 4ad4d50282c9cf1199f83e344ca7473cd156721c86d8ba1275a8df2cbf4c4686

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.5.0-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-1.5.0-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 1.2 MB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.0 importlib_metadata/4.8.2 pkginfo/1.8.2 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.7.0

File hashes

Hashes for tensorflow_data_validation-1.5.0-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 6f6d56b01639bd2404547ed933db00b99ba2414910ab246ae4268e31486b41d1
MD5 e7fc18f084ad160bd63c1f39ee59fb05
BLAKE2b-256 d059fd56cd3eaf2e7e0d4642894cbced74e7e0b4e39fe5ef61b2d5d40c861848

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.5.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.5.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 e2ece07afac97f16422e43862d829881f1ff19ea6052d8027dd21844a75cc982
MD5 9d348c911a3f96afea8a76bd088c3e5a
BLAKE2b-256 28765bf1a8b32049af3de81cf857d2266598ca4c55582112c1571bbf3d8e7ef7

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.5.0-cp37-cp37m-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.5.0-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 0885435ec2daaa635a5cc4ab47d6eba599d484064635294919c2ab22bdd79fae
MD5 0325528b9d19209a6b40a234be576582
BLAKE2b-256 606a6905756dc1101eca98e54f8243b6200b5cc0ec9ccc33fbf5e79d8d9f5a2d

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page