A library for exploring and validating machine learning data.

Project description

TensorFlow Data Validation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets).
  • Automated data-schema generation to describe expectations about data, such as required values, ranges, and vocabularies.
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify problems such as missing features, out-of-range values, or wrong feature types.
  • An anomalies viewer so that you can see which features have anomalies and learn more in order to correct them.
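
The pieces listed above fit together roughly as follows. This is a minimal sketch, assuming TFDV is installed and the data lives in CSV files with a header row. The function name `find_anomalies` and its arguments are hypothetical; `generate_statistics_from_csv`, `infer_schema`, and `validate_statistics` are TFDV functions.

```python
def find_anomalies(train_csv, eval_csv):
    """Infer a schema from training data and validate evaluation data against it."""
    # Imported lazily so this module can be loaded without TFDV installed.
    import tensorflow_data_validation as tfdv

    # Summary statistics for each dataset.
    train_stats = tfdv.generate_statistics_from_csv(data_location=train_csv)
    eval_stats = tfdv.generate_statistics_from_csv(data_location=eval_csv)

    # A schema describing expectations about the data, inferred from training statistics.
    schema = tfdv.infer_schema(statistics=train_stats)

    # Anomalies in the evaluation data relative to that schema.
    return tfdv.validate_statistics(statistics=eval_stats, schema=schema)
```

In a notebook, `tfdv.visualize_statistics`, `tfdv.display_schema`, and `tfdv.display_anomalies` render the corresponding viewers for the objects produced here.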

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Caution: TFDV releases before version 1.0 may introduce backwards-incompatible changes.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation
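
To confirm that the install succeeded, the installed version can be read back from Python. This is a sketch that assumes Python 3.8 or later, where `importlib.metadata` is in the standard library; that is newer than the interpreters this release targets, so on older versions `pip show tensorflow-data-validation` gives the same information. The helper name `installed_version` is hypothetical.

```python
from importlib import metadata  # standard library from Python 3.8


def installed_version(distribution="tensorflow-data-validation"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_version())
```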

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

First install Docker and docker-compose by following their respective installation directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

When building on Python 2, strip the type hints from the source code using the following commands:

pip install strip-hints
python tensorflow_data_validation/tools/strip_type_hints.py tensorflow_data_validation/

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {27, 35, 36, 37}.

A wheel will be produced under dist/.
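
If wheels are needed for several Python versions, the two commands above can be scripted. The following is a sketch only: the helper names are hypothetical, and it assumes `sudo` and `docker-compose` are available and that it runs from the project root.

```python
import subprocess

# The values of PYTHON_VERSION listed above.
SUPPORTED_PYTHON_VERSIONS = ("27", "35", "36", "37")


def docker_build_commands(python_version):
    """Return the two docker-compose invocations for one Python version."""
    if python_version not in SUPPORTED_PYTHON_VERSIONS:
        raise ValueError("unsupported PYTHON_VERSION: %r" % (python_version,))
    return [
        ["sudo", "docker-compose", "build", "manylinux2010"],
        ["sudo", "docker-compose", "run", "-e",
         "PYTHON_VERSION=" + python_version, "manylinux2010"],
    ]


def build_wheel(python_version):
    """Run the build; raises CalledProcessError if either command fails."""
    for command in docker_build_commands(python_version):
        subprocess.run(command, check=True)
```

Passing the commands as argument lists rather than a shell string avoids quoting problems if the version value ever comes from user input.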

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

When building on Python 2, strip the type hints from the source code using the following commands:

pip install strip-hints
python tensorflow_data_validation/tools/strip_type_hints.py tensorflow_data_validation/

3. Build the pip package

TFDV uses Bazel to build the pip package from source. Before invoking the following command, make sure the python on your $PATH is the target version and has NumPy installed.

bazel run -c opt --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0 tensorflow_data_validation:build_pip_package

Note that this assumes dependent packages (e.g. PyArrow) are built with a GCC older than 5.1; the flag -D_GLIBCXX_USE_CXX11_ABI=0 keeps the build compatible with the old std::string ABI.

You can find the generated .whl file in the dist subdirectory.

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.12.6 (Sierra) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Dependencies

TFDV requires TensorFlow but does not depend on the tensorflow PyPI package. See the TensorFlow install guides for instructions on how to get started with TensorFlow.

Apache Beam is required; TFDV relies on it for efficient distributed computation. By default, Apache Beam runs in local mode, but it can also run in distributed mode using Google Cloud Dataflow. TFDV is designed to be extensible to other Apache Beam runners.
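
The runner is chosen through Beam pipeline options passed to TFDV. The sketch below assumes the input is a set of TFRecord files of tf.Example protos; the function name `generate_statistics_with_runner` and its parameters are hypothetical, while `generate_statistics_from_tfrecord` and its `pipeline_options` argument come from TFDV and `PipelineOptions` from Apache Beam.

```python
def generate_statistics_with_runner(data_location, output_path,
                                    runner="DirectRunner", extra_flags=()):
    """Compute statistics with TFDV on the given Apache Beam runner."""
    # Imported lazily so this module can be loaded without TFDV or Beam installed.
    import tensorflow_data_validation as tfdv
    from apache_beam.options.pipeline_options import PipelineOptions

    # DirectRunner executes locally; other runners are selected by name.
    flags = ["--runner=" + runner] + list(extra_flags)
    options = PipelineOptions(flags)

    return tfdv.generate_statistics_from_tfrecord(
        data_location=data_location,
        output_path=output_path,
        pipeline_options=options,
    )
```

For Google Cloud Dataflow, `runner` would be `"DataflowRunner"` and `extra_flags` would carry the environment-specific settings (for example a project and a temporary storage location); the exact flags depend on the setup and are not covered here.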

Apache Arrow is also required. TFDV uses Arrow to represent data internally so that it can use vectorized NumPy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

tensorflow-data-validation | tensorflow        | apache-beam[gcp] | pyarrow
GitHub master              | nightly (1.x/2.x) | 2.17.0           | 0.15.0
0.21.0                     | 1.15 / 2.1        | 2.17.0           | 0.15.0
0.15.0                     | 1.15 / 2.0        | 2.16.0           | 0.14.0
0.14.1                     | 1.14              | 2.14.0           | 0.14.0
0.14.0                     | 1.14              | 2.14.0           | 0.14.0
0.13.1                     | 1.13              | 2.11.0           | n/a
0.13.0                     | 1.13              | 2.11.0           | n/a
0.12.0                     | 1.12              | 2.10.0           | n/a
0.11.0                     | 1.11              | 2.8.0            | n/a
0.9.0                      | 1.9               | 2.6.0            | n/a
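
One way to use the table above is to turn a row into pinned requirements for pip. The sketch below copies the released rows into a dictionary; the names `COMPATIBLE_VERSIONS` and `pinned_requirements` are hypothetical. TensorFlow is left unpinned because several rows list alternative versions and TFDV does not depend on the tensorflow PyPI package.

```python
# Versions copied from the compatibility table; None stands for "n/a".
COMPATIBLE_VERSIONS = {
    "0.21.0": {"tensorflow": "1.15 / 2.1", "apache-beam[gcp]": "2.17.0", "pyarrow": "0.15.0"},
    "0.15.0": {"tensorflow": "1.15 / 2.0", "apache-beam[gcp]": "2.16.0", "pyarrow": "0.14.0"},
    "0.14.1": {"tensorflow": "1.14", "apache-beam[gcp]": "2.14.0", "pyarrow": "0.14.0"},
    "0.14.0": {"tensorflow": "1.14", "apache-beam[gcp]": "2.14.0", "pyarrow": "0.14.0"},
    "0.13.1": {"tensorflow": "1.13", "apache-beam[gcp]": "2.11.0", "pyarrow": None},
    "0.13.0": {"tensorflow": "1.13", "apache-beam[gcp]": "2.11.0", "pyarrow": None},
    "0.12.0": {"tensorflow": "1.12", "apache-beam[gcp]": "2.10.0", "pyarrow": None},
    "0.11.0": {"tensorflow": "1.11", "apache-beam[gcp]": "2.8.0", "pyarrow": None},
    "0.9.0": {"tensorflow": "1.9", "apache-beam[gcp]": "2.6.0", "pyarrow": None},
}


def pinned_requirements(tfdv_version):
    """Return pip requirement strings for a tested TFDV release.

    Raises KeyError for versions that are not in the table.
    """
    row = COMPATIBLE_VERSIONS[tfdv_version]
    pins = [
        "tensorflow-data-validation==" + tfdv_version,
        "apache-beam[gcp]==" + row["apache-beam[gcp]"],
    ]
    # Older releases have no pyarrow entry, so nothing is pinned for them.
    if row["pyarrow"] is not None:
        pins.append("pyarrow==" + row["pyarrow"])
    return pins
```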

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distributions

tensorflow_data_validation-0.21.0-cp37-cp37m-win_amd64.whl (1.7 MB)

Uploaded CPython 3.7m Windows x86-64

tensorflow_data_validation-0.21.0-cp37-cp37m-manylinux2010_x86_64.whl (2.4 MB)

Uploaded CPython 3.7m manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-0.21.0-cp37-cp37m-macosx_10_9_x86_64.whl (2.9 MB)

Uploaded CPython 3.7m macOS 10.9+ x86-64

tensorflow_data_validation-0.21.0-cp36-cp36m-win_amd64.whl (1.7 MB)

Uploaded CPython 3.6m Windows x86-64

tensorflow_data_validation-0.21.0-cp36-cp36m-manylinux2010_x86_64.whl (2.4 MB)

Uploaded CPython 3.6m manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-0.21.0-cp36-cp36m-macosx_10_9_x86_64.whl (2.9 MB)

Uploaded CPython 3.6m macOS 10.9+ x86-64

tensorflow_data_validation-0.21.0-cp35-cp35m-win_amd64.whl (1.7 MB)

Uploaded CPython 3.5m Windows x86-64

tensorflow_data_validation-0.21.0-cp35-cp35m-manylinux2010_x86_64.whl (2.4 MB)

Uploaded CPython 3.5m manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-0.21.0-cp35-cp35m-macosx_10_6_intel.whl (2.9 MB)

Uploaded CPython 3.5m macOS 10.6+ intel

tensorflow_data_validation-0.21.0-cp27-cp27mu-manylinux2010_x86_64.whl (2.4 MB)

Uploaded CPython 2.7mu manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-0.21.0-cp27-cp27m-macosx_10_9_x86_64.whl (2.9 MB)

Uploaded CPython 2.7m macOS 10.9+ x86-64
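
The file names above follow the standard wheel naming convention (distribution, version, optional build tag, Python tag, ABI tag, platform tag), which is what pip uses to pick the right file. As an illustration, the sketch below splits a name into those parts; the names `WheelTags` and `parse_wheel_filename` are hypothetical, and the code is not part of TFDV.

```python
from collections import namedtuple

WheelTags = namedtuple(
    "WheelTags", "distribution version python_tag abi_tag platform_tag")


def parse_wheel_filename(filename):
    """Split a wheel file name into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file name: %r" % (filename,))
    parts = filename[:-len(".whl")].split("-")
    if len(parts) == 6:
        # An optional build tag sits between the version and the Python tag.
        del parts[2]
    if len(parts) != 5:
        raise ValueError("unexpected wheel file name: %r" % (filename,))
    return WheelTags(*parts)
```

For example, `parse_wheel_filename("tensorflow_data_validation-0.21.0-cp37-cp37m-win_amd64.whl")` yields the Python tag `cp37`, the ABI tag `cp37m`, and the platform tag `win_amd64`.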

File details

Details for the file tensorflow_data_validation-0.21.0-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.21.0-cp37-cp37m-win_amd64.whl
  • Size: 1.7 MB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.38.0 CPython/3.7.5rc1

File hashes

Hashes for tensorflow_data_validation-0.21.0-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 a3f32420f7277eee13ec880098a13ea60432a477a7c91e0a108de454979b0080
MD5 823fce9eddeaf7fd26f4487a9eb45a47
BLAKE2b-256 6511ed4d88f5b1558876aceef6f515aa467769d3f729fad0a51a84e172fa9036

See more details on using hashes here.
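
A downloaded wheel can be checked against the SHA256 digest listed for it. The following sketch uses only the standard library; the helper names `sha256_of_file` and `matches_sha256` are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Compare a file's digest with an expected hex string, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For the Windows CPython 3.7 wheel, the expected value would be the SHA256 digest shown above, a3f32420f7277eee13ec880098a13ea60432a477a7c91e0a108de454979b0080.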

File details

Details for the file tensorflow_data_validation-0.21.0-cp37-cp37m-manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.21.0-cp37-cp37m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 5aef862691558e3297666d09a7bd5e65fbf96990c435ad70bd574f09ebf15af0
MD5 a212767414f5e0ae57b9d7fe3b16f1e0
BLAKE2b-256 3287352fd875ca528f7121fcefa7b4affd4177a80bed75ca24ba41047a4cfeaf

File details

Details for the file tensorflow_data_validation-0.21.0-cp37-cp37m-macosx_10_9_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.21.0-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 c387ed1e1e168ddc2d1683d3675e233ac4a48fd94f1d0891b5d35443efe66404
MD5 9dc194b23f766d57b99828c5b456a44e
BLAKE2b-256 c3b62aa1999599fa01aebbd84da495df27d62e5bfc45f56bcae8c59a05797137

File details

Details for the file tensorflow_data_validation-0.21.0-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.21.0-cp36-cp36m-win_amd64.whl
  • Size: 1.7 MB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.38.0 CPython/3.7.5rc1

File hashes

Hashes for tensorflow_data_validation-0.21.0-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 be8c4f5dc8299be1d5efcc9f7b9f44b3de3398503236faf2e8a8f379807bce83
MD5 a6afcc17c06fb897d514a7da0daaf3d9
BLAKE2b-256 8a7773ea2b294be4a07602969353e6a05004a5baa415659a44889ca9530dcfe4

File details

Details for the file tensorflow_data_validation-0.21.0-cp36-cp36m-manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.21.0-cp36-cp36m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 3970081746a1b8616822ea15fd1ec9b42f821cd27d0e4136999183b2d7f1f7cc
MD5 e7a078b4bcfabfaa1a82ac4691a89823
BLAKE2b-256 ef1f1c1fab1550e62d80f63cee324c0c24b77a4a9413de722af1ca5afade1043

File details

Details for the file tensorflow_data_validation-0.21.0-cp36-cp36m-macosx_10_9_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.21.0-cp36-cp36m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 32e00a7eb33d36212150876d52be00a3680ca4ad0fa82bf4338295570b9100f7
MD5 b5d22ff5570fe655e508624c18f260ba
BLAKE2b-256 50b4e17850d780dc491b27144d504c27d3fc4e4cfaa0fbc3f0a9f4fe9b0446c3

File details

Details for the file tensorflow_data_validation-0.21.0-cp35-cp35m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.21.0-cp35-cp35m-win_amd64.whl
  • Size: 1.7 MB
  • Tags: CPython 3.5m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.38.0 CPython/3.7.5rc1

File hashes

Hashes for tensorflow_data_validation-0.21.0-cp35-cp35m-win_amd64.whl
Algorithm Hash digest
SHA256 3634bede2b3b18a7f1a0fa400d4766c539a156bee27c69c1e9bd51c1c87ac529
MD5 8ae569a84a5b502caab66e7b02e56b74
BLAKE2b-256 2e4a8deab495953f2ec051aa8594853825c597089e271ee22e347b48383c64ec

File details

Details for the file tensorflow_data_validation-0.21.0-cp35-cp35m-manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.21.0-cp35-cp35m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 fd03f3625dc8c78cd5de5b892b7725e6f5e18230f3253e10810fadbd9a9127c5
MD5 c9f6d0d7dd9ef82056fc2337649a8b1c
BLAKE2b-256 a3961f15085aa6bbacf8e167b86609ed9d329681abe914f54205b18a841c4528

File details

Details for the file tensorflow_data_validation-0.21.0-cp35-cp35m-macosx_10_6_intel.whl.

File hashes

Hashes for tensorflow_data_validation-0.21.0-cp35-cp35m-macosx_10_6_intel.whl
Algorithm Hash digest
SHA256 16a5f6cf8c1080ff46c1682ad920e6a209f01cb41652e73b14f9105556d6b677
MD5 9177a0dd15c1e7b8a8c9f7d237380358
BLAKE2b-256 d8abd2077bd868ea8ce0e10e0b33fa57a1c0cf7098d07e8c4b5be8222a0367e2

File details

Details for the file tensorflow_data_validation-0.21.0-cp27-cp27mu-manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.21.0-cp27-cp27mu-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 900c90105115afaa65783dde2d2df5a65db6da387ebec0f42df6cb132ced5640
MD5 cc1f5044c110a54ad6f9eeebea3a647f
BLAKE2b-256 5b130052158fb8acc9c9b5604d0a23b64e47b6f3343a038e3ce3185962a1d630

File details

Details for the file tensorflow_data_validation-0.21.0-cp27-cp27m-macosx_10_9_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.21.0-cp27-cp27m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 9b42c5309b973bd07d81b2af89c95691e42d2bc32b7130b5a57bad9935060b48
MD5 906a90fc8d0dcf7f27eddd007600deca
BLAKE2b-256 6d1ce47264a490e2d9f7fa20e94cdb92195aa1bb42217863894b3d49c936a8c4
