A library for exploring and validating machine learning data.

TensorFlow Data Validation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets).
  • Automated data-schema generation to describe expectations about data, such as required values, ranges, and vocabularies.
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify problems such as missing features, out-of-range values, or wrong feature types.
  • An anomalies viewer so that you can see which features have anomalies and learn more in order to correct them.
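
The features listed above map onto a short workflow: compute statistics, infer a schema, then validate new data against it. The following is a minimal sketch, assuming TFDV is installed and that the two CSV paths (hypothetical arguments here) point at real files. The function name `validate_csv` is illustrative and not part of TFDV.

```python
def validate_csv(train_csv, eval_csv):
    """Infer a schema from training data and check evaluation data against it."""
    # Imported lazily so this snippet can be loaded where TFDV is not installed.
    import tensorflow_data_validation as tfdv

    # Compute summary statistics for both datasets.
    train_stats = tfdv.generate_statistics_from_csv(data_location=train_csv)
    eval_stats = tfdv.generate_statistics_from_csv(data_location=eval_csv)

    # Infer a schema (types, ranges, vocabularies) from the training statistics.
    schema = tfdv.infer_schema(statistics=train_stats)

    # Compare the evaluation statistics with the schema to find anomalies.
    anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
    return schema, anomalies
```

In a notebook, the results can then be inspected with `tfdv.visualize_statistics`, `tfdv.display_schema`, and `tfdv.display_anomalies`, which correspond to the three viewers mentioned in the list.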

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Caution: TFDV releases before version 1.0 may introduce backwards-incompatible changes.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation
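
To confirm that the install succeeded, one option is to query the installed package metadata. This is a small sketch that assumes Python 3.8 or later (for `importlib.metadata`); the helper name `installed_version` is illustrative.

```python
from importlib import metadata


def installed_version(package="tensorflow-data-validation"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print(version if version else "tensorflow-data-validation is not installed")
```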

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

First install Docker and Docker Compose by following their respective installation directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

When building on Python 2, strip the Python type hints from the source code using the following commands:

pip install strip-hints
python tensorflow_data_validation/tools/strip_type_hints.py tensorflow_data_validation/

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {27, 35, 36, 37}.

A wheel will be produced under dist/.
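
When scripting the Docker build, it can help to reject an unsupported PYTHON_VERSION before starting a long build. The sketch below only assembles the two command lines shown above; the helper name `docker_build_commands` is an assumption for illustration.

```python
# The values documented above for PYTHON_VERSION.
SUPPORTED_PYTHON_VERSIONS = ("27", "35", "36", "37")


def docker_build_commands(python_version):
    """Return the two docker-compose command lines for the given PYTHON_VERSION."""
    if python_version not in SUPPORTED_PYTHON_VERSIONS:
        raise ValueError(
            "PYTHON_VERSION must be one of %s, got %r"
            % (", ".join(SUPPORTED_PYTHON_VERSIONS), python_version)
        )
    return [
        ["sudo", "docker-compose", "build", "manylinux2010"],
        ["sudo", "docker-compose", "run", "-e",
         "PYTHON_VERSION=" + python_version, "manylinux2010"],
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the project root.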

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

Install PyArrow

TFDV needs to be built with specific PyArrow versions (as indicated in third_party/pyarrow.version). Install PyArrow by following these directions.

When installing, make sure to specify a compatible PyArrow version. For example:

pip install "pyarrow>=0.14.0,<0.15.0"
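
The version specifier in the example reads as "at least 0.14.0 and strictly below 0.15.0". A minimal sketch of that check follows, assuming plain numeric version strings (pre-release suffixes are not handled); the function names are illustrative.

```python
def parse_version(text):
    """Turn a plain 'X.Y.Z' version string into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_compatible_pyarrow(version, lower="0.14.0", upper="0.15.0"):
    """Check lower <= version < upper, mirroring 'pyarrow>=0.14.0,<0.15.0'."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)
```

For example, `is_compatible_pyarrow("0.14.1")` is true while `is_compatible_pyarrow("0.15.0")` is false.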

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

When building on Python 2, strip the Python type hints from the source code using the following commands:

pip install strip-hints
python tensorflow_data_validation/tools/strip_type_hints.py tensorflow_data_validation/

3. Build the pip package

TFDV uses Bazel to build the pip package from source. Before invoking the following commands, make sure the python in your $PATH is the target Python version and has NumPy and PyArrow installed.

./configure.sh
bazel run -c opt --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0 tensorflow_data_validation:build_pip_package

Note that this assumes dependent packages (e.g. PyArrow) were built with a GCC older than 5.1, so the build passes -D_GLIBCXX_USE_CXX11_ABI=0 to stay compatible with the old std::string ABI.

You can find the generated .whl file in the dist subdirectory.
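
Before running the build, a quick way to check that the interpreter on your $PATH has the required packages is to ask it which of them cannot be found. This is a sketch using only the standard library; `missing_modules` is an illustrative name.

```python
import importlib.util
import sys


def missing_modules(names=("numpy", "pyarrow")):
    """Return the module names from `names` that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Show which interpreter is being checked, then report any missing packages.
    print("Python %d.%d at %s" % (sys.version_info[0], sys.version_info[1], sys.executable))
    print("Missing:", missing_modules() or "none")
```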

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.12.6 (Sierra) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Dependencies

TFDV requires TensorFlow but does not depend on the tensorflow PyPI package. See the TensorFlow install guides for instructions on how to get started with TensorFlow.

Apache Beam is required; TFDV uses it to support efficient distributed computation. By default, Apache Beam runs in local mode, but it can also run in distributed mode using Google Cloud Dataflow. TFDV is designed to be extensible to other Apache Beam runners.
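
Switching between local and distributed execution is mostly a matter of which pipeline flags are passed to Beam. The sketch below only builds the flag list. The helper name `beam_flags` is illustrative, and treating `project` and `temp_location` as required for Dataflow is an assumption of this sketch.

```python
def beam_flags(runner="DirectRunner", project=None, temp_location=None):
    """Build Apache Beam command-line style flags for local or Dataflow runs."""
    flags = ["--runner=" + runner]
    if runner == "DataflowRunner":
        # Assumption of this sketch: Dataflow needs a project and a temp location.
        if not project or not temp_location:
            raise ValueError("DataflowRunner needs both project and temp_location")
        flags.append("--project=" + project)
        flags.append("--temp_location=" + temp_location)
    return flags
```

The resulting list can be wrapped as `PipelineOptions(flags=...)` from `apache_beam.options.pipeline_options` and should then be accepted by the `pipeline_options` argument of `tfdv.generate_statistics_from_csv`.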

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized NumPy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

tensorflow-data-validation   tensorflow      apache-beam[gcp]   pyarrow
GitHub master                nightly (1.x)   2.14.0             0.14.0
0.14.1                       1.14            2.14.0             0.14.0
0.14.0                       1.14            2.14.0             0.14.0
0.13.1                       1.13            2.11.0             n/a
0.13.0                       1.13            2.11.0             n/a
0.12.0                       1.12            2.10.0             n/a
0.11.0                       1.11            2.8.0              n/a
0.9.0                        1.9             2.6.0              n/a
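
For scripted environment setup, the released rows of the table above can be kept as a small lookup. This is a sketch: the dictionary mirrors the table (with `None` standing in for "n/a"), and the names `COMPATIBLE_VERSIONS` and `compatible_with` are illustrative.

```python
# Released TFDV versions and the dependency versions they were tested with.
COMPATIBLE_VERSIONS = {
    "0.14.1": {"tensorflow": "1.14", "apache-beam[gcp]": "2.14.0", "pyarrow": "0.14.0"},
    "0.14.0": {"tensorflow": "1.14", "apache-beam[gcp]": "2.14.0", "pyarrow": "0.14.0"},
    "0.13.1": {"tensorflow": "1.13", "apache-beam[gcp]": "2.11.0", "pyarrow": None},
    "0.13.0": {"tensorflow": "1.13", "apache-beam[gcp]": "2.11.0", "pyarrow": None},
    "0.12.0": {"tensorflow": "1.12", "apache-beam[gcp]": "2.10.0", "pyarrow": None},
    "0.11.0": {"tensorflow": "1.11", "apache-beam[gcp]": "2.8.0", "pyarrow": None},
    "0.9.0": {"tensorflow": "1.9", "apache-beam[gcp]": "2.6.0", "pyarrow": None},
}


def compatible_with(tfdv_version):
    """Return the tested dependency versions for a TFDV release, or None if unlisted."""
    return COMPATIBLE_VERSIONS.get(tfdv_version)
```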

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distributions

  • tensorflow_data_validation-0.14.1-cp37-cp37m-manylinux2010_x86_64.whl (2.4 MB): CPython 3.7m, manylinux: glibc 2.12+ x86-64
  • tensorflow_data_validation-0.14.1-cp37-cp37m-macosx_10_9_x86_64.whl (2.8 MB): CPython 3.7m, macOS 10.9+ x86-64
  • tensorflow_data_validation-0.14.1-cp36-cp36m-win_amd64.whl (1.7 MB): CPython 3.6m, Windows x86-64
  • tensorflow_data_validation-0.14.1-cp36-cp36m-manylinux2010_x86_64.whl (2.4 MB): CPython 3.6m, manylinux: glibc 2.12+ x86-64
  • tensorflow_data_validation-0.14.1-cp36-cp36m-macosx_10_9_x86_64.whl (2.8 MB): CPython 3.6m, macOS 10.9+ x86-64
  • tensorflow_data_validation-0.14.1-cp35-cp35m-win_amd64.whl (1.7 MB): CPython 3.5m, Windows x86-64
  • tensorflow_data_validation-0.14.1-cp35-cp35m-manylinux2010_x86_64.whl (2.4 MB): CPython 3.5m, manylinux: glibc 2.12+ x86-64
  • tensorflow_data_validation-0.14.1-cp35-cp35m-macosx_10_6_intel.whl (2.8 MB): CPython 3.5m, macOS 10.6+ intel
  • tensorflow_data_validation-0.14.1-cp27-cp27mu-manylinux2010_x86_64.whl (2.4 MB): CPython 2.7mu, manylinux: glibc 2.12+ x86-64
  • tensorflow_data_validation-0.14.1-cp27-cp27m-macosx_10_9_x86_64.whl (2.8 MB): CPython 2.7m, macOS 10.9+ x86-64
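
The wheel filenames above follow the standard `name-version-python-abi-platform.whl` layout, which is what determines the right file for a given interpreter and operating system. Here is a small sketch that splits a filename into those parts; `parse_wheel_filename` is an illustrative helper, not a TFDV or pip function.

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into its name, version, and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename: %r" % filename)
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 6:
        # An optional build tag sits between the version and the Python tag.
        del parts[2]
    if len(parts) != 5:
        raise ValueError("unexpected wheel filename layout: %r" % filename)
    name, version, python_tag, abi_tag, platform_tag = parts
    return {
        "name": name,
        "version": version,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }
```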

File details

Hashes for each built distribution are listed below. See more details on using hashes here.

tensorflow_data_validation-0.14.1-cp37-cp37m-manylinux2010_x86_64.whl

  • SHA256: c8f43ebd164703550c97d8e39e2fe1165591ef400ce97f58183a9317a5c82837
  • MD5: abfd1cbf7b808828bcb9ab84700b4e83
  • BLAKE2b-256: 8cdccb414fd7ecca4fc145cee7e17d0d97d2383e236ae54b847c36f5bd72fda5

tensorflow_data_validation-0.14.1-cp37-cp37m-macosx_10_9_x86_64.whl

  • SHA256: bd65a8e3c0e792d85f87368f796e0ef0c984b461a6e11116ce9e4d8fda4697c9
  • MD5: 8e7ed8999ae11bbb5d4640b43cd31cc2
  • BLAKE2b-256: 1d2c36c656c4cb2eea5c834c6ded74c25ca6263a3c54678fffb499d61f2fdb27

tensorflow_data_validation-0.14.1-cp36-cp36m-win_amd64.whl

  • Upload date:
  • Size: 1.7 MB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.20.0 setuptools/40.0.0 requests-toolbelt/0.8.0 tqdm/4.25.0 CPython/2.7.16
  • SHA256: 2cba18c385d7de8d346b8db4b9bfec38e8535e1371a6a7f2f375ea51264dfeb8
  • MD5: 91e71e6c37b0512ffbd86bffafae5fc6
  • BLAKE2b-256: 543edec2c051d4a6dd04dcacfd73d4d02be3ad3cd56008ba2251e3bd8cc36adf

tensorflow_data_validation-0.14.1-cp36-cp36m-manylinux2010_x86_64.whl

  • SHA256: fd68fec9a2a1742f731a948197a861e09e1f0939993039b64b3daf56b4d314e8
  • MD5: a116395d404d4ee1aed2b1dd1fe98658
  • BLAKE2b-256: d18677ec6a7c5c91ac69798fc3a9c911ff225a4c2833a42fb59d63c8162679e7

tensorflow_data_validation-0.14.1-cp36-cp36m-macosx_10_9_x86_64.whl

  • SHA256: 156205ad56e786a65b39b115c7c3af4ea981c9383cb8f84947c42d6b56e15b6d
  • MD5: 5b4685059a55c2f2788ad75323504e81
  • BLAKE2b-256: 3da2fb311b2924568699052c718a8bc9041955b8f1773a784f164b02bd13140f

tensorflow_data_validation-0.14.1-cp35-cp35m-win_amd64.whl

  • Upload date:
  • Size: 1.7 MB
  • Tags: CPython 3.5m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.20.0 setuptools/40.0.0 requests-toolbelt/0.8.0 tqdm/4.25.0 CPython/2.7.16
  • SHA256: df5eb52ef53ee9db901aed5a30db183f272cda0a8b4f6981d9843cb6c52fc58a
  • MD5: 3dee9ed34ef329d21fd70373ed1a572e
  • BLAKE2b-256: 7713d0a90ccde514a4547b5d2ce3268f683aa6d5fb9f185c2b4d9a7db15eafca

tensorflow_data_validation-0.14.1-cp35-cp35m-manylinux2010_x86_64.whl

  • SHA256: 6a08cf22eeb8dfac805ff37f54f4f3f76b540bb3d152bd64840348c481593d92
  • MD5: 1559f0e4531cb3f068878b49f0e5192e
  • BLAKE2b-256: eacf4f714f6ec2f2f764086ccb941ac964905ee39efc34decc2e73ac0485b9fe

tensorflow_data_validation-0.14.1-cp35-cp35m-macosx_10_6_intel.whl

  • SHA256: 18ce52cb25fe6be75138916f1f503c345548de15fa20465d3599c681aa469561
  • MD5: d57276441ff34b0167ae89262ee636cb
  • BLAKE2b-256: 58427cfaf2bacf06b99a52228a684ea58dde248fb26284bfc7e83dc07dadd81b

tensorflow_data_validation-0.14.1-cp27-cp27mu-manylinux2010_x86_64.whl

  • SHA256: 06900f329e6e23a7d92aea37a80d341301c497531a2cb52d93168b9029c25ce3
  • MD5: 37ab4ec2498d41836e01ae0962915e34
  • BLAKE2b-256: 35bfb5ce7a4ab497f2fe9e5e379eee2c9044f2cd7de3f53e8c29d5cc5c4ae86b

tensorflow_data_validation-0.14.1-cp27-cp27m-macosx_10_9_x86_64.whl

  • SHA256: 5b057745806c5f766d83cc5e479a82a740e5b9c621440efecfb46e1b58a084ed
  • MD5: ad7a2174ee276340247eb5dba8bd903c
  • BLAKE2b-256: bf5aa00402426453e425fa89d8d59e7c1764656498b234c36a09fa3f6f3765f3
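
A downloaded wheel can be checked against the SHA256 digests listed above before installing it. The following is a minimal sketch using the standard library; the helper names `sha256_of_file` and `matches_sha256` are illustrative, and the path you pass in is whatever location you saved the wheel to.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use flat regardless of the wheel size.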
