A library for exploring and validating machine learning data.

TensorFlow Data Validation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets).
  • Automated data-schema generation to describe expectations about data, such as required values, ranges, and vocabularies.
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify problems such as missing features, out-of-range values, or wrong feature types.
  • An anomalies viewer so that you can see which features have anomalies and learn enough about them to correct them.
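
The features listed above typically combine into a three-step workflow: compute statistics, infer a schema, then validate new data against that schema. The following is a minimal sketch rather than a definitive recipe; the function name and the CSV paths passed to it are hypothetical, and TFDV is imported lazily so the module loads even where the library is not installed.

```python
def validate_against_training_data(train_csv, eval_csv):
    """Infer a schema from training data and check evaluation data against it."""
    # Imported lazily so this module can be loaded even where TFDV is absent.
    import tensorflow_data_validation as tfdv

    # 1. Compute summary statistics for each dataset.
    train_stats = tfdv.generate_statistics_from_csv(data_location=train_csv)
    eval_stats = tfdv.generate_statistics_from_csv(data_location=eval_csv)

    # 2. Infer a schema (types, domains, presence) from the training statistics.
    schema = tfdv.infer_schema(statistics=train_stats)

    # 3. Check the evaluation statistics against the inferred schema.
    anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
    return schema, anomalies
```

In a notebook, the returned objects can be passed to tfdv.display_schema and tfdv.display_anomalies, and the statistics to tfdv.visualize_statistics, to open the viewers mentioned in the list.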

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Caution: TFDV may be backwards incompatible before version 1.0.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation
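
After installing, it can be useful to confirm which version pip resolved. A minimal sketch, assuming Python 3.8 or later (for importlib.metadata); the helper name installed_version is illustrative and not part of TFDV.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "tensorflow-data-validation") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print(version if version else "tensorflow-data-validation is not installed")
```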

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

First, install Docker and docker-compose by following their respective installation directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

When building on Python 2, make sure to strip the Python type hints from the source code using the following commands:

pip install strip-hints
python tensorflow_data_validation/tools/strip_type_hints.py tensorflow_data_validation/

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {27, 35, 36, 37}.

A wheel will be produced under dist/.
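
PYTHON_VERSION is the interpreter's major and minor version written without a dot. The sketch below derives that value from the running interpreter and checks it against the supported set; the helper names are hypothetical and exist only for illustration.

```python
import sys

# Values accepted by the Docker build, as listed above.
SUPPORTED_PYTHON_VERSIONS = {"27", "35", "36", "37"}


def python_version_tag(version_info=sys.version_info) -> str:
    """Format an interpreter version as the compact tag, e.g. (3, 7) -> '37'."""
    return f"{version_info[0]}{version_info[1]}"


def is_supported(tag: str) -> bool:
    """Return True if the tag is one of the versions the Docker build accepts."""
    return tag in SUPPORTED_PYTHON_VERSIONS


if __name__ == "__main__":
    tag = python_version_tag()
    status = "supported" if is_supported(tag) else "not in the supported set"
    print(f"PYTHON_VERSION={tag} ({status})")
```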

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

When building on Python 2, make sure to strip the Python type hints from the source code using the following commands:

pip install strip-hints
python tensorflow_data_validation/tools/strip_type_hints.py tensorflow_data_validation/

3. Build the pip package

TFDV uses Bazel to build the pip package from source. Before invoking the following command, make sure the python on your $PATH is the target version and has NumPy installed.

bazel run -c opt --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0 tensorflow_data_validation:build_pip_package

Note that this assumes dependent packages (e.g. PyArrow) were built with a GCC older than 5.1, so the build passes the flag -D_GLIBCXX_USE_CXX11_ABI=0 to stay compatible with the old std::string ABI.

You can find the generated .whl file in the dist subdirectory.

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.12.6 (Sierra) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.
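
Because only 64-bit systems are tested, it can help to check the interpreter's pointer width before installing. A small sketch using only the standard library; the mapping and function names are illustrative assumptions, and the check does not verify the minimum OS versions listed above.

```python
import platform
import struct

# Operating systems named in the list above, keyed by platform.system() values.
TESTED_SYSTEMS = {"Darwin": "macOS", "Linux": "Linux (Ubuntu is tested)", "Windows": "Windows"}


def pointer_width_bits() -> int:
    """Return the interpreter's pointer width: 64 on a 64-bit Python build."""
    return struct.calcsize("P") * 8


def describe_platform() -> str:
    """Summarize the OS family and whether Python is a 32- or 64-bit build."""
    system = platform.system()
    name = TESTED_SYSTEMS.get(system, f"{system} (not in the tested list)")
    return f"{name}, {pointer_width_bits()}-bit Python"


if __name__ == "__main__":
    print(describe_platform())
```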

Dependencies

TFDV requires TensorFlow but does not depend on the tensorflow PyPI package. See the TensorFlow install guides for instructions on how to get started with TensorFlow.

Apache Beam is required; TFDV uses it to run computations efficiently in a distributed fashion. By default, Apache Beam runs in local mode, but it can also run in distributed mode using Google Cloud Dataflow. TFDV is designed to be extensible to other Apache Beam runners.
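
Switching between local and distributed execution mostly comes down to the options handed to Beam. The sketch below only assembles the command-line style flags and does not import Beam itself; the helper name beam_flags and any project or bucket names passed to it are hypothetical, and the exact set of options Dataflow requires may differ between Beam releases.

```python
from typing import List, Optional


def beam_flags(runner: str = "DirectRunner",
               project: Optional[str] = None,
               temp_location: Optional[str] = None,
               region: Optional[str] = None) -> List[str]:
    """Build command-line style flags for Apache Beam's PipelineOptions."""
    flags = [f"--runner={runner}"]
    if runner == "DataflowRunner":
        # Dataflow needs a GCP project and a Cloud Storage path for temp files.
        if not project or not temp_location:
            raise ValueError("DataflowRunner needs both project and temp_location")
        flags += [f"--project={project}", f"--temp_location={temp_location}"]
        if region:
            flags.append(f"--region={region}")
    return flags
```

The resulting list can be wrapped as apache_beam.options.pipeline_options.PipelineOptions(flags) and passed through the pipeline_options argument of functions such as tfdv.generate_statistics_from_tfrecord. With no arguments, the helper selects the local DirectRunner.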

Apache Arrow is also required. TFDV uses Arrow to represent data internally so that it can make use of vectorized NumPy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

| tensorflow-data-validation | tensorflow | apache-beam[gcp] | pyarrow |
|---|---|---|---|
| GitHub master | nightly (1.x/2.x) | 2.16.0 | 0.14.0 |
| 0.15.0 | 1.15 / 2.0 | 2.16.0 | 0.14.0 |
| 0.14.1 | 1.14 | 2.14.0 | 0.14.0 |
| 0.14.0 | 1.14 | 2.14.0 | 0.14.0 |
| 0.13.1 | 1.13 | 2.11.0 | n/a |
| 0.13.0 | 1.13 | 2.11.0 | n/a |
| 0.12.0 | 1.12 | 2.10.0 | n/a |
| 0.11.0 | 1.11 | 2.8.0 | n/a |
| 0.9.0 | 1.9 | 2.6.0 | n/a |
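
To pin a tested combination in a requirements file, the released rows of the table can be encoded as data. This is a sketch: the helper name pip_requirements is hypothetical, and TensorFlow is deliberately left unpinned because the table gives only a minor series (for example "1.15 / 2.0"), not an exact release.

```python
# Rows copied from the compatibility table above:
# TFDV version -> (tensorflow, apache-beam[gcp], pyarrow); None means "n/a".
COMPATIBLE_VERSIONS = {
    "0.15.0": ("1.15 / 2.0", "2.16.0", "0.14.0"),
    "0.14.1": ("1.14", "2.14.0", "0.14.0"),
    "0.14.0": ("1.14", "2.14.0", "0.14.0"),
    "0.13.1": ("1.13", "2.11.0", None),
    "0.13.0": ("1.13", "2.11.0", None),
    "0.12.0": ("1.12", "2.10.0", None),
    "0.11.0": ("1.11", "2.8.0", None),
    "0.9.0": ("1.9", "2.6.0", None),
}


def pip_requirements(tfdv_version):
    """Build pinned requirement strings for one tested combination."""
    _tensorflow, beam, pyarrow = COMPATIBLE_VERSIONS[tfdv_version]
    requirements = [
        f"tensorflow-data-validation=={tfdv_version}",
        f"apache-beam[gcp]=={beam}",
    ]
    # Older releases list pyarrow as "n/a", so it is pinned only when present.
    if pyarrow is not None:
        requirements.append(f"pyarrow=={pyarrow}")
    return requirements
```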

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distributions

  • tensorflow_data_validation-0.15.0-cp37-cp37m-manylinux2010_x86_64.whl (2.3 MB): CPython 3.7m, manylinux: glibc 2.12+, x86-64
  • tensorflow_data_validation-0.15.0-cp37-cp37m-macosx_10_9_x86_64.whl (2.8 MB): CPython 3.7m, macOS 10.9+, x86-64
  • tensorflow_data_validation-0.15.0-cp36-cp36m-win_amd64.whl (1.7 MB): CPython 3.6m, Windows, x86-64
  • tensorflow_data_validation-0.15.0-cp36-cp36m-manylinux2010_x86_64.whl (2.3 MB): CPython 3.6m, manylinux: glibc 2.12+, x86-64
  • tensorflow_data_validation-0.15.0-cp36-cp36m-macosx_10_9_x86_64.whl (2.8 MB): CPython 3.6m, macOS 10.9+, x86-64
  • tensorflow_data_validation-0.15.0-cp35-cp35m-win_amd64.whl (1.7 MB): CPython 3.5m, Windows, x86-64
  • tensorflow_data_validation-0.15.0-cp35-cp35m-manylinux2010_x86_64.whl (2.3 MB): CPython 3.5m, manylinux: glibc 2.12+, x86-64
  • tensorflow_data_validation-0.15.0-cp35-cp35m-macosx_10_6_intel.whl (2.8 MB): CPython 3.5m, macOS 10.6+, intel
  • tensorflow_data_validation-0.15.0-cp27-cp27mu-manylinux2010_x86_64.whl (2.3 MB): CPython 2.7mu, manylinux: glibc 2.12+, x86-64
  • tensorflow_data_validation-0.15.0-cp27-cp27m-macosx_10_9_x86_64.whl (2.8 MB): CPython 2.7m, macOS 10.9+, x86-64
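
Each wheel filename encodes the distribution name, version, Python tag, ABI tag, and platform tag, separated by dashes. The sketch below splits a filename into those fields to help pick the right file for an interpreter; the names WheelTags and parse_wheel_filename are hypothetical, and the optional build tag allowed by the wheel format is not handled.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split a wheel filename into its name, version, and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) != 5:
        # A sixth component (the optional build tag) is valid in the wheel
        # format but is outside the scope of this sketch.
        raise ValueError(f"expected 5 dash-separated fields, got {len(parts)}")
    return WheelTags(*parts)
```

For example, parsing tensorflow_data_validation-0.15.0-cp37-cp37m-manylinux2010_x86_64.whl yields the Python tag cp37 and the platform tag manylinux2010_x86_64.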

File details

Details for the file tensorflow_data_validation-0.15.0-cp37-cp37m-manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.15.0-cp37-cp37m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 0738d11740893e133dadefa90ac3478c7ee418d8a803c77168f1b634d286cb9f
MD5 e1b83fbe71823a8336ec8014049efbff
BLAKE2b-256 abeb31aa15dc78a98c891e061869895d7f204a8a5eb30c4d766fc1de9ab83842

See more details on using hashes here.
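
A downloaded wheel can be checked against its published SHA256 digest before installation. The following is a minimal sketch using only hashlib; the function names are hypothetical, and the file is read in chunks so that large wheels do not need to fit in memory.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published one, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For the cp37 manylinux2010 wheel, expected_hex would be the SHA256 value listed above, 0738d11740893e133dadefa90ac3478c7ee418d8a803c77168f1b634d286cb9f.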

File details

Details for the file tensorflow_data_validation-0.15.0-cp37-cp37m-macosx_10_9_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.15.0-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 87c4e73d246f1ba01d77f3dc91ceb7dfcb9616d74a64e2128757e2e9c739cc7a
MD5 7fbfc0cc9fd33cf6a1fc847bd391879a
BLAKE2b-256 267bf521f537fbe081a41342b06bccbdafeb3604dbce262826faaaa3df172512

File details

Details for the file tensorflow_data_validation-0.15.0-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.15.0-cp36-cp36m-win_amd64.whl
  • Upload date:
  • Size: 1.7 MB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.21.0 setuptools/40.0.0 requests-toolbelt/0.8.0 tqdm/4.25.0 CPython/2.7.16

File hashes

Hashes for tensorflow_data_validation-0.15.0-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 314e349a4151cbde8886679aff2e35135ccc880c478a00f23d04d8aebb46d453
MD5 ce9f6e6019b424331c4fa972742bc4e3
BLAKE2b-256 ba1e759da3ee094715d5d699c9b7738256d69a440ed5586de7493660417afe9d

File details

Details for the file tensorflow_data_validation-0.15.0-cp36-cp36m-manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.15.0-cp36-cp36m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 afdaf98ccc650109efa52a6351585c45ab258f300f3d12193708c58d34f114e5
MD5 77acdf351588c154b3f4eb117b1de08c
BLAKE2b-256 3e1e0542f5d28066ac59d3e3efa9f7389ef42ab69393e242187740d49edd3e3c

File details

Details for the file tensorflow_data_validation-0.15.0-cp36-cp36m-macosx_10_9_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.15.0-cp36-cp36m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 99dec42494a11c8e0babc3141b86e9c52dfacc075a532c11bcc001be21eabc10
MD5 472289184e9e50b75d27cb3c86711f09
BLAKE2b-256 88736fa142a1ebfe05c8ea20c358a508985035fccdedab4986020d7afb03a3c9

File details

Details for the file tensorflow_data_validation-0.15.0-cp35-cp35m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.15.0-cp35-cp35m-win_amd64.whl
  • Upload date:
  • Size: 1.7 MB
  • Tags: CPython 3.5m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.21.0 setuptools/40.0.0 requests-toolbelt/0.8.0 tqdm/4.25.0 CPython/2.7.16

File hashes

Hashes for tensorflow_data_validation-0.15.0-cp35-cp35m-win_amd64.whl
Algorithm Hash digest
SHA256 8e80f6114cec610947e8b3ff83287bb7ae9af016c424485440788e674c3f9a2a
MD5 488d1d10a8dc01a4c93599821f9d6910
BLAKE2b-256 066a4895cfbc8602a7ad56028f5b4738482bba655ded22339125def6c9882f3f

File details

Details for the file tensorflow_data_validation-0.15.0-cp35-cp35m-manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.15.0-cp35-cp35m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 3a29da5731d089467f7cbee250d3234796b3402706ced222d05f59f2e1d0643c
MD5 d89fb354f765b265f838f32d8121ab27
BLAKE2b-256 ac21103c4a7ba043ab1f93d8ccf01bb24cdfcc19882e8d14981b4141630c0f91

File details

Details for the file tensorflow_data_validation-0.15.0-cp35-cp35m-macosx_10_6_intel.whl.

File hashes

Hashes for tensorflow_data_validation-0.15.0-cp35-cp35m-macosx_10_6_intel.whl
Algorithm Hash digest
SHA256 43e0ac1c8dce2b47a57af7c41114bec1af5abbc36224b5c51ca87b8878b8f9b3
MD5 55237235bbd5eb5c86f8e1cc72c0fded
BLAKE2b-256 7e1ad9de2d20388b25c39a0efb71a38d483c8e0f643dfcd5731bfd1d20ee53a6

File details

Details for the file tensorflow_data_validation-0.15.0-cp27-cp27mu-manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.15.0-cp27-cp27mu-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 68164955f1e28ae44c8a7022363a1b335fb67ceabb89910760aa41fa364ce168
MD5 db426978192ac11d6a685b5628552fba
BLAKE2b-256 e69c1f0d4f75deb87af13bf9469c003fd879ec58fff65cf682c08eca06585434

File details

Details for the file tensorflow_data_validation-0.15.0-cp27-cp27m-macosx_10_9_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.15.0-cp27-cp27m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 668608f8c6249f63e25379b0573541bb0ec3b8069720e8254956ca6a159bc0aa
MD5 771a8de7971d14367f576a7f6a553d87
BLAKE2b-256 ac7dcfcff16c21a4238241632892b919ba6c5df5b653372127b0dd6bfe203fd5
