A library for exploring and validating machine learning data.

TensorFlow Data Validation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets).
  • Automated data-schema generation to describe expectations about data, such as required values, ranges, and vocabularies.
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify problems such as missing features, out-of-range values, or wrong feature types.
  • An anomalies viewer so that you can see what features have anomalies and learn more in order to correct them.
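
The features above combine into a short workflow: compute statistics, infer a schema from them, then validate new data against that schema. The following is a minimal sketch, assuming TFDV is installed and that both CSV paths exist. The function name and its arguments are hypothetical, and the import is deferred so that the function can be defined without TFDV present.

```python
def find_anomalies(train_csv, eval_csv):
    """Return {feature_name: short_description} for anomalies in eval_csv.

    Assumes tensorflow_data_validation is installed. Both paths are
    placeholders supplied by the caller.
    """
    import tensorflow_data_validation as tfdv  # deferred: heavy dependency

    # Summary statistics for the training data (computed with Apache Beam).
    train_stats = tfdv.generate_statistics_from_csv(data_location=train_csv)

    # Schema inferred from those statistics.
    schema = tfdv.infer_schema(statistics=train_stats)

    # Statistics for the new data, checked against the inferred schema.
    eval_stats = tfdv.generate_statistics_from_csv(data_location=eval_csv)
    anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)

    return {
        name: info.short_description
        for name, info in anomalies.anomaly_info.items()
    }
```

In a notebook, `tfdv.visualize_statistics`, `tfdv.display_schema`, and `tfdv.display_anomalies` render the corresponding viewers for these objects.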

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Caution: TFDV may be backwards incompatible before version 1.0.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

First install Docker and docker-compose by following their respective installation directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

When building on Python 2, make sure to strip the Python types in the source code using the following commands:

pip install strip-hints
python tensorflow_data_validation/tools/strip_type_hints.py tensorflow_data_validation/

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {27, 35, 36, 37}.

A wheel will be produced under dist/.
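
The accepted values suggest that PYTHON_VERSION is the interpreter's major and minor version with the dot removed (for example, 37 for Python 3.7). A small sketch for deriving and checking that value before starting the build; the helper names are hypothetical and not part of TFDV:

```python
import sys

# Values listed above as accepted by the Docker build.
SUPPORTED_PYTHON_VERSIONS = ("27", "35", "36", "37")


def python_version_tag(version_info=None):
    """Return "<major><minor>" for the given (or the running) interpreter."""
    if version_info is None:
        version_info = sys.version_info
    return "{}{}".format(version_info[0], version_info[1])


def check_supported(tag):
    """Return tag unchanged, or raise ValueError if it is not documented."""
    if tag not in SUPPORTED_PYTHON_VERSIONS:
        raise ValueError(
            "PYTHON_VERSION={} is not one of {}".format(
                tag, ", ".join(SUPPORTED_PYTHON_VERSIONS)))
    return tag
```

For example, `check_supported(python_version_tag((3, 6)))` returns `"36"`, which can then be exported as PYTHON_VERSION for the docker-compose command.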

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

When building on Python 2, make sure to strip the Python types in the source code using the following commands:

pip install strip-hints
python tensorflow_data_validation/tools/strip_type_hints.py tensorflow_data_validation/

3. Build the pip package

TFDV uses Bazel to build the pip package from source. Before invoking the following commands, make sure the python in your $PATH is the target version and has NumPy installed.

bazel run -c opt --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0 tensorflow_data_validation:build_pip_package

Note that we assume here that dependent packages (e.g. PyArrow) are built with a GCC older than 5.1, so we pass the flag -D_GLIBCXX_USE_CXX11_ABI=0 to stay compatible with the old std::string ABI.

You can find the generated .whl file in the dist subdirectory.

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.12.6 (Sierra) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Dependencies

TFDV requires TensorFlow but does not depend on the tensorflow PyPI package. See the TensorFlow install guides for instructions on how to get started with TensorFlow.

Apache Beam is required; TFDV uses it to run efficient distributed computations. By default, Apache Beam runs in local mode but can also run in distributed mode using Google Cloud Dataflow. TFDV is designed to be extensible for other Apache Beam runners.
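
Switching from local mode to Dataflow is a matter of passing Beam pipeline options to the statistics generator. The sketch below is one way to do that, assuming TFDV and apache-beam[gcp] are installed. The helper names are hypothetical, and the project, job name, and bucket values are placeholders chosen by the caller.

```python
def dataflow_flags(project, job_name, temp_location, region=None):
    """Build Beam command-line style flags that select the Dataflow runner."""
    flags = [
        "--runner=DataflowRunner",
        "--project={}".format(project),
        "--job_name={}".format(job_name),
        "--temp_location={}".format(temp_location),
    ]
    if region:
        flags.append("--region={}".format(region))
    return flags


def generate_stats_on_dataflow(data_location, output_path, flags):
    """Compute statistics for CSV data using the given Beam flags."""
    # Deferred imports: both packages are heavy and optional for the helper above.
    import tensorflow_data_validation as tfdv
    from apache_beam.options.pipeline_options import PipelineOptions

    return tfdv.generate_statistics_from_csv(
        data_location=data_location,
        output_path=output_path,
        pipeline_options=PipelineOptions(flags),
    )
```

Omitting `pipeline_options` keeps the default local mode. Dataflow workers also need TFDV available to them; the get started guide covers that setup.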

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized numpy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

tensorflow-data-validation  tensorflow         apache-beam[gcp]  pyarrow
GitHub master               nightly (1.x/2.x)  2.17.0            0.15.0
0.21.2                      1.15 / 2.1         2.17.0            0.15.0
0.21.1                      1.15 / 2.1         2.17.0            0.15.0
0.21.0                      1.15 / 2.1         2.17.0            0.15.0
0.15.0                      1.15 / 2.0         2.16.0            0.14.0
0.14.1                      1.14               2.14.0            0.14.0
0.14.0                      1.14               2.14.0            0.14.0
0.13.1                      1.13               2.11.0            n/a
0.13.0                      1.13               2.11.0            n/a
0.12.0                      1.12               2.10.0            n/a
0.11.0                      1.11               2.8.0             n/a
0.9.0                       1.9                2.6.0             n/a
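
To reproduce a tested combination, the table can be turned into pip requirement pins. The following is an illustrative sketch only: the dictionary copies the released rows above, the helper name is hypothetical, and TensorFlow is left unpinned because several rows list two alternatives.

```python
# Released rows copied from the table above (the "GitHub master" row is omitted).
# TFDV release -> (tensorflow, apache-beam[gcp], pyarrow); None stands for "n/a".
COMPATIBLE_VERSIONS = {
    "0.21.2": ("1.15 / 2.1", "2.17.0", "0.15.0"),
    "0.21.1": ("1.15 / 2.1", "2.17.0", "0.15.0"),
    "0.21.0": ("1.15 / 2.1", "2.17.0", "0.15.0"),
    "0.15.0": ("1.15 / 2.0", "2.16.0", "0.14.0"),
    "0.14.1": ("1.14", "2.14.0", "0.14.0"),
    "0.14.0": ("1.14", "2.14.0", "0.14.0"),
    "0.13.1": ("1.13", "2.11.0", None),
    "0.13.0": ("1.13", "2.11.0", None),
    "0.12.0": ("1.12", "2.10.0", None),
    "0.11.0": ("1.11", "2.8.0", None),
    "0.9.0": ("1.9", "2.6.0", None),
}


def pinned_requirements(tfdv_version):
    """Return pip requirement strings for the tested Beam and PyArrow versions."""
    _tensorflow, beam, pyarrow = COMPATIBLE_VERSIONS[tfdv_version]
    requirements = [
        "tensorflow-data-validation=={}".format(tfdv_version),
        "apache-beam[gcp]=={}".format(beam),
    ]
    if pyarrow is not None:  # releases before 0.14.0 list no PyArrow version
        requirements.append("pyarrow=={}".format(pyarrow))
    return requirements
```

The returned strings can be written to a requirements file, one per line. Other combinations may also work, as noted above.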

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.
