
A library for exploring and validating machine learning data.


TensorFlow Data Validation


TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets).
  • Automated data-schema generation to describe expectations about data, such as required values, ranges, and vocabularies.
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify problems such as missing features, out-of-range values, or wrong feature types.
  • An anomalies viewer so that you can see which features have anomalies and learn more in order to correct them.

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.
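The features listed above map onto a short workflow: compute statistics, infer a schema, then validate new data against it. The following is a minimal sketch, not the definitive usage. The function name and the CSV paths are hypothetical, and it assumes an installed TFDV release that exposes generate_statistics_from_csv, infer_schema, and validate_statistics as described in the get started guide.

```python
def validate_against_training_data(train_csv, eval_csv):
    """Infer a schema from training data and check evaluation data against it.

    Hypothetical helper for illustration; the paths are supplied by the caller.
    """
    # Imported lazily so this file can be loaded where TFDV is not installed.
    import tensorflow_data_validation as tfdv

    # Compute summary statistics for each dataset (this runs a Beam pipeline).
    train_stats = tfdv.generate_statistics_from_csv(data_location=train_csv)
    eval_stats = tfdv.generate_statistics_from_csv(data_location=eval_csv)

    # Infer a schema that describes the expectations about the training data.
    schema = tfdv.infer_schema(statistics=train_stats)

    # Check the evaluation statistics against the inferred schema.
    anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
    return schema, anomalies
```

In a notebook, the results could then be inspected with tfdv.display_schema(schema) and tfdv.display_anomalies(anomalies), which correspond to the schema viewer and anomalies viewer mentioned above.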

Caution: TFDV releases before version 1.0 may introduce backwards-incompatible changes.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation
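After installing, it can be useful to confirm which version pip actually resolved. The sketch below uses only the standard library and assumes Python 3.8 or later (for importlib.metadata), which is newer than the interpreters this release targets; the helper name is hypothetical.

```python
from importlib import metadata  # standard library in Python 3.8+


def installed_version(dist_name="tensorflow-data-validation"):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string, or None when the package is not installed.
print(installed_version())
```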

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

First, install Docker and Docker Compose by following their respective installation directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

When building on Python 2, make sure to strip the Python type hints from the source code using the following commands:

pip install strip-hints
python tensorflow_data_validation/tools/strip_type_hints.py tensorflow_data_validation/

3. Build the pip package

Then, run the following at the project root:

DOCKER_SERVICE=manylinux-python${PY_VERSION}
sudo docker-compose build ${DOCKER_SERVICE}
sudo docker-compose run ${DOCKER_SERVICE}

Here PY_VERSION is one of 27, 35, 36, or 37, and it must be set in your shell before you run these commands.

A wheel will be produced under dist/.
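The Docker service name is derived from PY_VERSION, so a typo in that variable produces a service that does not exist. As a rough sketch of the naming rule only (the helper names are hypothetical and not part of the TFDV repository), the mapping and the resulting commands can be written as:

```python
# Python versions listed in the text as valid PY_VERSION values.
SUPPORTED_PY_VERSIONS = ("27", "35", "36", "37")


def docker_service(py_version):
    """Return the docker-compose service name for a given PY_VERSION."""
    py_version = str(py_version)
    if py_version not in SUPPORTED_PY_VERSIONS:
        raise ValueError(
            "PY_VERSION must be one of %s, got %r"
            % (", ".join(SUPPORTED_PY_VERSIONS), py_version)
        )
    return "manylinux-python" + py_version


def docker_commands(py_version):
    """Return the build and run commands as argument lists (nothing is executed)."""
    service = docker_service(py_version)
    return [
        ["sudo", "docker-compose", "build", service],
        ["sudo", "docker-compose", "run", service],
    ]
```

The argument lists are in the form accepted by subprocess.run, should you want to script the build.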

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

Install PyArrow

TFDV needs to be built with specific PyArrow versions (as indicated in third_party/pyarrow.version). Install pyarrow by following these directions.

When installing, make sure to specify a compatible pyarrow version. For example:

pip install "pyarrow>=0.14.0,<0.15.0"
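To see what the version range in that command admits, here is a small standard-library sketch. It is a simplification that compares only the numeric release segment and ignores pre-release suffixes, so it is not a full PEP 440 implementation; the function names are hypothetical.

```python
import re


def release_tuple(version):
    """Turn '0.14.1' into (0, 14, 1), padding short versions with zeros."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError("not a version string: %r" % version)
    parts = tuple(int(part) for part in match.group(0).split("."))
    # Pad to three components so that '0.14' compares equal to '0.14.0'.
    return parts + (0,) * (3 - len(parts))


def in_range(version, lower="0.14.0", upper="0.15.0"):
    """Return True if lower <= version < upper (numeric release segment only)."""
    return release_tuple(lower) <= release_tuple(version) < release_tuple(upper)
```

With the default bounds, in_range("0.14.1") is True and in_range("0.15.0") is False, matching the intent of the pip constraint.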

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

When building on Python 2, make sure to strip the Python type hints from the source code using the following commands:

pip install strip-hints
python tensorflow_data_validation/tools/strip_type_hints.py tensorflow_data_validation/

3. Build the pip package

TFDV uses Bazel to build the pip package from source. Before invoking the following commands, make sure that the python on your $PATH is the target Python version and has NumPy and PyArrow installed.

./configure.sh
bazel run -c opt --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0 tensorflow_data_validation:build_pip_package

Note that this assumes the dependent packages (e.g. PyArrow) were built with a GCC older than 5.1, so the build passes the flag -D_GLIBCXX_USE_CXX11_ABI=0 to stay compatible with the old std::string ABI.

You can find the generated .whl file in the dist subdirectory.
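If the Bazel build fails early, one thing worth checking is the precondition stated before the build commands: that the active Python can import NumPy and PyArrow. A minimal sketch using only the standard library follows (the helper name is hypothetical). Run it with the same python that is first on your $PATH.

```python
import importlib.util
import sys


def missing_modules(names=("numpy", "pyarrow")):
    """Return the top-level modules from `names` that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Report which interpreter is active and which build prerequisites are absent.
print("Python:", sys.version.split()[0])
print("Missing modules:", missing_modules())
```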

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.12.6 (Sierra) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Dependencies

TFDV requires TensorFlow but does not depend on the tensorflow PyPI package. See the TensorFlow install guides for instructions on how to get started with TensorFlow.

Apache Beam is required; TFDV uses it to run efficient distributed computation. By default, Apache Beam runs in local mode, but it can also run in distributed mode using Google Cloud Dataflow. TFDV is designed to be extensible to other Apache Beam runners.
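To illustrate the distributed mode, the sketch below passes Beam pipeline options for Google Cloud Dataflow to TFDV's statistics generation. It follows the pattern in the get started guide, but treat it as an assumption-laden outline: the function name and all arguments are hypothetical placeholders, and it requires apache-beam[gcp], TFDV, and a configured Google Cloud project.

```python
def generate_statistics_on_dataflow(data_location, output_path, project,
                                    job_name, staging_location, temp_location,
                                    tfdv_wheel):
    """Compute TFDV statistics for TFRecord data on Google Cloud Dataflow.

    Hypothetical helper; every argument is a caller-supplied placeholder.
    """
    # Imported lazily so this file loads where Beam or TFDV is not installed.
    from apache_beam.options.pipeline_options import (
        GoogleCloudOptions, PipelineOptions, SetupOptions, StandardOptions)
    import tensorflow_data_validation as tfdv

    options = PipelineOptions()

    # Google Cloud settings: project, job name, and GCS working locations.
    cloud_options = options.view_as(GoogleCloudOptions)
    cloud_options.project = project
    cloud_options.job_name = job_name
    cloud_options.staging_location = staging_location
    cloud_options.temp_location = temp_location

    # Select the Dataflow runner instead of the default local runner.
    options.view_as(StandardOptions).runner = "DataflowRunner"

    # Ship the TFDV wheel to the workers so they can import the library.
    options.view_as(SetupOptions).extra_packages = [tfdv_wheel]

    return tfdv.generate_statistics_from_tfrecord(
        data_location, output_path=output_path, pipeline_options=options)
```

Leaving pipeline_options unset keeps the computation in Beam's local mode, which is usually sufficient for small datasets.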

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized NumPy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

tensorflow-data-validation   tensorflow      apache-beam[gcp]   pyarrow
GitHub master                nightly (1.x)   2.14.0             0.14.0
0.14.0                       1.14            2.14.0             0.14.0
0.13.1                       1.13            2.11.0             n/a
0.13.0                       1.13            2.11.0             n/a
0.12.0                       1.12            2.10.0             n/a
0.11.0                       1.11            2.8.0              n/a
0.9.0                        1.9             2.6.0              n/a
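For scripted environment setup, the released rows of the table can be encoded as a lookup. This is only a sketch that copies the values listed above; the names are hypothetical, "n/a" is represented as None, and the GitHub master row is omitted because it tracks nightly builds.

```python
# Tested combinations copied from the compatibility table (released versions only).
COMPATIBLE_VERSIONS = {
    "0.14.0": {"tensorflow": "1.14", "apache-beam[gcp]": "2.14.0", "pyarrow": "0.14.0"},
    "0.13.1": {"tensorflow": "1.13", "apache-beam[gcp]": "2.11.0", "pyarrow": None},
    "0.13.0": {"tensorflow": "1.13", "apache-beam[gcp]": "2.11.0", "pyarrow": None},
    "0.12.0": {"tensorflow": "1.12", "apache-beam[gcp]": "2.10.0", "pyarrow": None},
    "0.11.0": {"tensorflow": "1.11", "apache-beam[gcp]": "2.8.0", "pyarrow": None},
    "0.9.0": {"tensorflow": "1.9", "apache-beam[gcp]": "2.6.0", "pyarrow": None},
}


def compatible_with(tfdv_version):
    """Return the tested dependency versions for a TFDV release."""
    try:
        return COMPATIBLE_VERSIONS[tfdv_version]
    except KeyError:
        raise KeyError("no tested combination recorded for TFDV %s" % tfdv_version)
```

Because untested combinations may also work, a missing entry means "not recorded here" rather than "incompatible".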

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.
