A library for exploring and validating machine learning data.

TensorFlow Data Validation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets).
  • Automated data-schema generation to describe expectations about data, such as required values, ranges, and vocabularies.
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify issues such as missing features, out-of-range values, or wrong feature types.
  • An anomalies viewer so that you can see which features have anomalies and learn more in order to correct them.
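
As a rough sketch of how the pieces listed above fit together, the following Python function computes statistics for a training CSV file, infers a schema from them, and validates a second CSV file against that schema. The function name and its two parameters are made up for illustration, and the import is done lazily so the sketch can be loaded without TFDV installed; see the get started guide for the authoritative workflow.

```python
def validate_against_training(train_csv, eval_csv):
    """Compute statistics, infer a schema, and check a second dataset against it."""
    # Imported lazily so this sketch can be loaded without TFDV installed.
    import tensorflow_data_validation as tfdv

    # Summary statistics over the training data (computed with Apache Beam).
    train_stats = tfdv.generate_statistics_from_csv(data_location=train_csv)

    # Infer an initial schema from those statistics.
    schema = tfdv.infer_schema(statistics=train_stats)

    # Compute statistics for the second dataset and validate them.
    eval_stats = tfdv.generate_statistics_from_csv(data_location=eval_csv)
    anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)

    # anomaly_info maps each offending feature name to details about the issue.
    found = {name: info.description
             for name, info in anomalies.anomaly_info.items()}
    return schema, found
```

In a notebook, tfdv.visualize_statistics, tfdv.display_schema, and tfdv.display_anomalies render the corresponding viewers for the objects produced above.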

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Caution: TFDV may be backwards incompatible before version 1.0.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

First install Docker and Docker Compose by following their official installation instructions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

When building on Python 2, make sure to strip the Python types in the source code using the following commands:

pip install strip-hints
python tensorflow_data_validation/tools/strip_type_hints.py tensorflow_data_validation/

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {27, 35, 36, 37}.

A wheel will be produced under dist/.
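
If you are unsure which value to pass as PYTHON_VERSION, a small helper like the one below can derive it from an interpreter's version. This is only a sketch: the helper names are hypothetical, and the supported set is copied from the list above.

```python
import sys

# Values listed in the build instructions above.
SUPPORTED_TAGS = ("27", "35", "36", "37")


def python_version_tag(version_info=None):
    """Return the major+minor tag (for example "37") for a version tuple."""
    major, minor = (version_info or sys.version_info)[:2]
    return "{}{}".format(major, minor)


def is_supported(tag):
    """True if the tag is one of the documented PYTHON_VERSION values."""
    return tag in SUPPORTED_TAGS
```

Calling python_version_tag() with no argument reports the tag for the interpreter running the script; is_supported then tells you whether the Docker build documents that version.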

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

When building on Python 2, make sure to strip the Python types in the source code using the following commands:

pip install strip-hints
python tensorflow_data_validation/tools/strip_type_hints.py tensorflow_data_validation/

3. Build the pip package

TFDV uses Bazel to build the pip package from source. Before invoking the following command, make sure that the python executable on your $PATH is the target version and has NumPy installed.

bazel run -c opt --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0 tensorflow_data_validation:build_pip_package

Note that we assume here that dependent packages (e.g. PyArrow) are built with a GCC older than 5.1, so we pass the flag -D_GLIBCXX_USE_CXX11_ABI=0 to stay compatible with the old std::string ABI.

You can find the generated .whl file in the dist subdirectory.
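
Before running the Bazel build in this step, it can help to confirm which interpreter is on your $PATH and whether it can import NumPy. The following is a minimal sketch for Python 3 (importlib.util is not available on Python 2); the function name is hypothetical.

```python
import importlib.util
import sys


def build_environment_report():
    """Describe the running interpreter and whether NumPy is importable."""
    return {
        "executable": sys.executable,
        "version": "{}.{}".format(*sys.version_info[:2]),
        # find_spec returns None when the package cannot be located.
        "has_numpy": importlib.util.find_spec("numpy") is not None,
    }
```

Run it with the same python command the build will use, and check that the reported version is the one you are targeting and that has_numpy is True.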

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.12.6 (Sierra) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Dependencies

TFDV requires TensorFlow but does not depend on the tensorflow PyPI package. See the TensorFlow install guides for instructions on how to get started with TensorFlow.

Apache Beam is required; TFDV relies on it for efficient distributed computation. By default, Apache Beam runs in local mode, but it can also run in distributed mode using Google Cloud Dataflow. TFDV is designed to be extensible to other Apache Beam runners.
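
To illustrate the difference between local and distributed mode, the sketch below builds a list of Beam pipeline flags for Dataflow and passes them to TFDV through the pipeline_options argument. The function names, the default job name, and the argument values are placeholders; a real Dataflow job typically also needs TFDV made available to the workers, which this sketch omits.

```python
def dataflow_args(project, region, temp_location, job_name="tfdv-stats"):
    """Build Beam command-line style flags that select the Dataflow runner."""
    return [
        "--runner=DataflowRunner",
        "--project={}".format(project),
        "--region={}".format(region),
        "--temp_location={}".format(temp_location),
        "--job_name={}".format(job_name),
    ]


def generate_stats(data_location, beam_args=None):
    """Compute statistics locally, or on the runner named in beam_args."""
    # Lazy imports: TFDV and Beam are only needed when this is called.
    import tensorflow_data_validation as tfdv
    from apache_beam.options.pipeline_options import PipelineOptions

    # With no options, Beam falls back to its local runner.
    options = PipelineOptions(beam_args) if beam_args else None
    return tfdv.generate_statistics_from_csv(
        data_location=data_location, pipeline_options=options)
```

Calling generate_stats with only a data location keeps everything on the local machine; passing the list returned by dataflow_args switches the same computation to Dataflow.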

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized NumPy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. These combinations are verified by our testing framework; other, untested combinations may also work.

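| tensorflow-data-validation | tensorflow | apache-beam[gcp] | pyarrow |
|----------------------------|-------------------|------------------|---------|
| GitHub master | nightly (1.x/2.x) | 2.17.0 | 0.15.0 |
| 0.21.4 | 1.15 / 2.1 | 2.17.0 | 0.15.0 |
| 0.21.2 | 1.15 / 2.1 | 2.17.0 | 0.15.0 |
| 0.21.1 | 1.15 / 2.1 | 2.17.0 | 0.15.0 |
| 0.21.0 | 1.15 / 2.1 | 2.17.0 | 0.15.0 |
| 0.15.0 | 1.15 / 2.0 | 2.16.0 | 0.14.0 |
| 0.14.1 | 1.14 | 2.14.0 | 0.14.0 |
| 0.14.0 | 1.14 | 2.14.0 | 0.14.0 |
| 0.13.1 | 1.13 | 2.11.0 | n/a |
| 0.13.0 | 1.13 | 2.11.0 | n/a |
| 0.12.0 | 1.12 | 2.10.0 | n/a |
| 0.11.0 | 1.11 | 2.8.0 | n/a |
| 0.9.0 | 1.9 | 2.6.0 | n/a |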

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • tensorflow_data_validation-0.21.4-cp37-cp37m-win_amd64.whl (1.7 MB): CPython 3.7m, Windows x86-64
  • tensorflow_data_validation-0.21.4-cp37-cp37m-manylinux2010_x86_64.whl (2.4 MB): CPython 3.7m, manylinux: glibc 2.12+ x86-64
  • tensorflow_data_validation-0.21.4-cp37-cp37m-macosx_10_9_x86_64.whl (2.9 MB): CPython 3.7m, macOS 10.9+ x86-64
  • tensorflow_data_validation-0.21.4-cp36-cp36m-win_amd64.whl (1.7 MB): CPython 3.6m, Windows x86-64
  • tensorflow_data_validation-0.21.4-cp36-cp36m-manylinux2010_x86_64.whl (2.4 MB): CPython 3.6m, manylinux: glibc 2.12+ x86-64
  • tensorflow_data_validation-0.21.4-cp36-cp36m-macosx_10_9_x86_64.whl (2.9 MB): CPython 3.6m, macOS 10.9+ x86-64
  • tensorflow_data_validation-0.21.4-cp35-cp35m-win_amd64.whl (1.7 MB): CPython 3.5m, Windows x86-64
  • tensorflow_data_validation-0.21.4-cp35-cp35m-manylinux2010_x86_64.whl (2.4 MB): CPython 3.5m, manylinux: glibc 2.12+ x86-64
  • tensorflow_data_validation-0.21.4-cp35-cp35m-macosx_10_6_intel.whl (2.9 MB): CPython 3.5m, macOS 10.6+ intel
  • tensorflow_data_validation-0.21.4-cp27-cp27mu-manylinux2010_x86_64.whl (2.4 MB): CPython 2.7mu, manylinux: glibc 2.12+ x86-64
  • tensorflow_data_validation-0.21.4-cp27-cp27m-macosx_10_9_x86_64.whl (2.9 MB): CPython 2.7m, macOS 10.9+ x86-64
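
The wheel filenames above follow the standard wheel naming convention (name, version, Python tag, ABI tag, platform tag, separated by dashes). As an illustration, the sketch below splits such a filename into its parts; the function name is hypothetical, and the sketch assumes the five-part form without the optional build tag.

```python
def parse_wheel_filename(filename):
    """Split a five-part wheel filename into its dash-separated fields."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename: {!r}".format(filename))
    parts = filename[:-len(".whl")].split("-")
    if len(parts) != 5:
        raise ValueError("expected 5 dash-separated fields, got {}".format(len(parts)))
    name, version, python_tag, abi_tag, platform_tag = parts
    return {
        "name": name,
        "version": version,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }
```

This makes it easy to pick the file whose Python tag and platform tag match your interpreter and operating system.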

File details

Details for the file tensorflow_data_validation-0.21.4-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.21.4-cp37-cp37m-win_amd64.whl
  • Size: 1.7 MB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.38.0 CPython/3.7.5

File hashes

Hashes for tensorflow_data_validation-0.21.4-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 de37453bba6907cda3895d9d75805c28de7aa4d641bfcf05b48f7862a9fea094
MD5 db961ec1d9402c10795e55e8121a4434
BLAKE2b-256 a6433cf2f78534bc61f6ed9812c354123438544159a43e07fdea56c7996f08bb

See more details on using hashes here.
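
A downloaded wheel can be checked against the SHA256 digests listed in these sections. The following is a minimal sketch using Python's standard hashlib module; the function names and the chunk size are illustrative choices.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, after downloading tensorflow_data_validation-0.21.4-cp37-cp37m-win_amd64.whl, passing its path together with the SHA256 value listed for that file should return True if the download is intact.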

File details

Details for the file tensorflow_data_validation-0.21.4-cp37-cp37m-manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.21.4-cp37-cp37m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 2d1b2af843b119b0626f3ad2958266a04c2b1c0329986aad615dc2a5583f889e
MD5 b966db6cf3cd72065fabb41d58ae4cff
BLAKE2b-256 0f019f08453decfa8c35fe38842d24be53cb74adbfbeae0213a7b9a19d5c4948

File details

Details for the file tensorflow_data_validation-0.21.4-cp37-cp37m-macosx_10_9_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.21.4-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 7a4e516f56bce6ccb85ec8e7a47392767e020f39082bfb16fb744fa5b2960e51
MD5 20717a33f5f69b113b788bd4d32132b2
BLAKE2b-256 9ea89803e9826537ec1326dc42b75623068d0b8bf041a04eabcbf89b9f4d999b

File details

Details for the file tensorflow_data_validation-0.21.4-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.21.4-cp36-cp36m-win_amd64.whl
  • Size: 1.7 MB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.38.0 CPython/3.7.5

File hashes

Hashes for tensorflow_data_validation-0.21.4-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 963d53f1a3e9f4c4a93549f4c4aec512b84a573496d7a356560f00165375a994
MD5 2d89e8f08421117769a208de4cee03e3
BLAKE2b-256 6dfc52ca1f84d05c19ac47177c1048286bc9744e4a8231349d0c5b8f1f40bc21

File details

Details for the file tensorflow_data_validation-0.21.4-cp36-cp36m-manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.21.4-cp36-cp36m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 7ea0269e08d3a3f68726e271c35c17d9480d7d73a8b039aba41c891351999dae
MD5 d7a45b39eea87c183dfbba4a0e1ff3cb
BLAKE2b-256 fba1de0e45e1654bd5e443e8bf37e6ce4a24a2747bc7ddd69f3a8f4d84370f1f

File details

Details for the file tensorflow_data_validation-0.21.4-cp36-cp36m-macosx_10_9_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.21.4-cp36-cp36m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 5f4a2983c196332b33156411acc69f4e9b4d27c13b7741fec8ad1e98e5be92f7
MD5 63d7a7c4327d88c956b2945d9729b16f
BLAKE2b-256 1c53c05744506e439d51f5b70d9a98295ab6beea389284be48f988f96555b43e

File details

Details for the file tensorflow_data_validation-0.21.4-cp35-cp35m-win_amd64.whl.

File metadata

  • Download URL: tensorflow_data_validation-0.21.4-cp35-cp35m-win_amd64.whl
  • Size: 1.7 MB
  • Tags: CPython 3.5m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.38.0 CPython/3.7.5

File hashes

Hashes for tensorflow_data_validation-0.21.4-cp35-cp35m-win_amd64.whl
Algorithm Hash digest
SHA256 4e171449ff25a68c244104af2c1d2d3c1e2133a94634dcd5c3a11a8d90ff93eb
MD5 05c301b9ba79326b0da63f1f5e0b618d
BLAKE2b-256 7054fef2a6397357593274357e315b13611c3d2ad479497c73d6702d172c6300

File details

Details for the file tensorflow_data_validation-0.21.4-cp35-cp35m-manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.21.4-cp35-cp35m-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 657d31cfbf4b0bb3f45e3acbc46e63ee5aa552a6d2362fd39ce65e2cf9e8780a
MD5 9cb46e19b43ffe9e11dd48f8a253a40c
BLAKE2b-256 0dc0887e2b9112436a1b9540b1b07539a069505ce5873b254ba3999a189331f1

File details

Details for the file tensorflow_data_validation-0.21.4-cp35-cp35m-macosx_10_6_intel.whl.

File hashes

Hashes for tensorflow_data_validation-0.21.4-cp35-cp35m-macosx_10_6_intel.whl
Algorithm Hash digest
SHA256 209ddd648663c7996720b5851df2d28ec8585dcbbb264c9961cfc645525fa500
MD5 4ff92836b684734cd24ed85adf3288e6
BLAKE2b-256 cb5cae0663053fafc7299427bdf9359a81d8e768946ff6caa0534fc26de22d4d

File details

Details for the file tensorflow_data_validation-0.21.4-cp27-cp27mu-manylinux2010_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.21.4-cp27-cp27mu-manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 ee99162266c74f5cf7d12a168f60f560660d618429d8720fea7243bd8ae5ec62
MD5 44d6bc7e74bd6e50103ba8e6085be59e
BLAKE2b-256 332f068c42ec0b79f6dddbba18d635f573cb880e3ba6978241b78b627f4bb3eb

File details

Details for the file tensorflow_data_validation-0.21.4-cp27-cp27m-macosx_10_9_x86_64.whl.

File hashes

Hashes for tensorflow_data_validation-0.21.4-cp27-cp27m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 241efba9b0f3c95bb711e5ee99c82d03d4cf20b876c213a68d6c1929e3a4fe2f
MD5 1ed1ed4c32f51b1af27e2c26da5fdefb
BLAKE2b-256 e744103259d30d95875f87c0eece03480522da46563560f9af8e43a27e1f6048
