TensorFlow Data Validation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets).
  • Automated data-schema generation to describe expectations about data, such as required values, ranges, and vocabularies.
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify problems such as missing features, out-of-range values, or wrong feature types.
  • An anomalies viewer so that you can see which features have anomalies and learn more in order to correct them.

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.
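As a rough illustration of how the features listed above fit together, the sketch below computes statistics, infers a schema from training data, and validates evaluation data against it. The function name `find_anomalies` and the CSV paths are hypothetical. The optional `tfdv` parameter is an assumption added here so the function can be exercised without TFDV installed; it is not part of the library.

```python
def find_anomalies(train_csv, eval_csv, tfdv=None):
    """Validate eval_csv against a schema inferred from train_csv.

    train_csv and eval_csv are hypothetical paths to CSV files with a
    header row. Returns whatever tfdv.validate_statistics returns.
    """
    if tfdv is None:
        # Imported lazily so this sketch can be loaded without TFDV installed.
        import tensorflow_data_validation as tfdv

    # Summary statistics for the training data.
    train_stats = tfdv.generate_statistics_from_csv(data_location=train_csv)
    # A schema describing expected types, ranges, and vocabularies.
    schema = tfdv.infer_schema(statistics=train_stats)
    # Statistics for the data to be checked.
    eval_stats = tfdv.generate_statistics_from_csv(data_location=eval_csv)
    # Compare the evaluation statistics against the inferred schema.
    return tfdv.validate_statistics(statistics=eval_stats, schema=schema)
```

In a notebook, the result can be passed to `tfdv.display_anomalies`, and the statistics and schema to `tfdv.visualize_statistics` and `tfdv.display_schema`.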

Caution: TFDV may be backwards incompatible before version 1.0.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation
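To confirm that the install succeeded, one option is to query the installed distribution's version. This is a minimal sketch that assumes Python 3.8 or later, where `importlib.metadata` is part of the standard library; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package="tensorflow-data-validation"):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(version if version else "tensorflow-data-validation is not installed")
```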

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

First install Docker and docker-compose by following their respective installation instructions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {35, 36, 37, 38}.

A wheel will be produced under dist/.
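The PYTHON_VERSION value is the interpreter's major and minor version written without a dot. The sketch below derives that value from the running interpreter and checks it against the set listed above. The helper names are hypothetical, and the supported set is copied from this section rather than read from the build scripts.

```python
import sys

# Values accepted for PYTHON_VERSION, as listed in the build instructions.
SUPPORTED_TAGS = {"35", "36", "37", "38"}


def python_version_tag(version_info=None):
    """Return e.g. "37" for Python 3.7 (defaults to the running interpreter)."""
    if version_info is None:
        version_info = sys.version_info
    return f"{version_info[0]}{version_info[1]}"


def is_supported(tag):
    """Return True if `tag` is one of the listed PYTHON_VERSION values."""
    return tag in SUPPORTED_TAGS


tag = python_version_tag()
print(f"PYTHON_VERSION={tag} (listed as supported: {is_supported(tag)})")
```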

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

The TFDV wheel is Python-version dependent. To build a pip package that works for a specific Python version, use that version's Python binary to run:

python setup.py bdist_wheel

You can find the generated .whl file in the dist subdirectory.
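Because each wheel targets one Python version, it can help to check which files in dist match the interpreter you plan to install into. The sketch below reads the tags from the wheel filename, which ends in Python tag, ABI tag, and platform tag. The helper names are hypothetical, and the check assumes CPython wheels tagged `cpXY`.

```python
import glob
import os
import sys


def wheel_tags(filename):
    """Split a wheel filename into its name, version, and compatibility tags."""
    stem = os.path.basename(filename)[: -len(".whl")]
    parts = stem.split("-")
    # The last three fields are always the Python, ABI, and platform tags;
    # an optional build tag may sit between the version and those fields.
    return {
        "name": parts[0],
        "version": parts[1],
        "python": parts[-3],
        "abi": parts[-2],
        "platform": parts[-1],
    }


def wheels_for_interpreter(directory="dist", version_info=None):
    """Return wheels in `directory` whose Python tag matches the interpreter."""
    if version_info is None:
        version_info = sys.version_info
    wanted = f"cp{version_info[0]}{version_info[1]}"
    paths = sorted(glob.glob(os.path.join(directory, "*.whl")))
    return [p for p in paths if wheel_tags(p)["python"] == wanted]


print(wheels_for_interpreter())
```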

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.14.6 (Mojave) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Notable Dependencies

TensorFlow is required.

Apache Beam is required; TFDV uses it to run its computations efficiently in a distributed fashion. By default, Apache Beam runs in local mode, but it can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.
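Switching between local and distributed execution is a matter of which Beam pipeline options are passed to TFDV. The sketch below builds the option flags and hands them to `tfdv.generate_statistics_from_tfrecord`. The helper names, project ID, region, and bucket paths are hypothetical placeholders; Dataflow may need further options (for example, to make TFDV available on the workers) that are not shown here.

```python
def beam_flags(runner="DirectRunner", project=None, region=None, temp_location=None):
    """Build Apache Beam command-line style flags for the chosen runner."""
    flags = [f"--runner={runner}"]
    # The remaining flags are only relevant to distributed runners such as
    # DataflowRunner; they are omitted when left as None.
    if project:
        flags.append(f"--project={project}")
    if region:
        flags.append(f"--region={region}")
    if temp_location:
        flags.append(f"--temp_location={temp_location}")
    return flags


def generate_stats(data_location, output_path, flags):
    """Compute statistics over TFRecord files using the given Beam flags."""
    # Imported lazily so the flag helper works without Beam or TFDV installed.
    from apache_beam.options.pipeline_options import PipelineOptions
    import tensorflow_data_validation as tfdv

    return tfdv.generate_statistics_from_tfrecord(
        data_location=data_location,
        output_path=output_path,
        pipeline_options=PipelineOptions(flags=flags),
    )


# Local mode (the default) versus a hypothetical Dataflow configuration.
local_flags = beam_flags()
dataflow_flags = beam_flags(
    runner="DataflowRunner",
    project="my-gcp-project",          # hypothetical project ID
    region="us-central1",              # hypothetical region
    temp_location="gs://my-bucket/tmp",  # hypothetical bucket
)
```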

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized NumPy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. These combinations are determined by our testing framework; other, untested combinations may also work.

tensorflow-data-validation  apache-beam[gcp]  pyarrow  tensorflow         tensorflow-metadata  tensorflow-transform  tfx-bsl
GitHub master               2.23.0            0.17.0   nightly (1.x/2.x)  0.24.0               0.24.0                0.24.0
0.24.0                      2.23.0            0.17.0   1.15 / 2.3         0.24.0               0.24.0                0.24.0
0.23.0                      2.23.0            0.17.0   1.15 / 2.3         0.23.0               0.23.0                0.23.0
0.22.2                      2.20.0            0.16.0   1.15 / 2.2         0.22.0               0.22.0                0.22.1
0.22.1                      2.20.0            0.16.0   1.15 / 2.2         0.22.0               0.22.0                0.22.1
0.22.0                      2.20.0            0.16.0   1.15 / 2.2         0.22.0               0.22.0                0.22.0
0.21.5                      2.17.0            0.15.0   1.15 / 2.1         0.21.0               0.21.1                0.21.3
0.21.4                      2.17.0            0.15.0   1.15 / 2.1         0.21.0               0.21.1                0.21.3
0.21.2                      2.17.0            0.15.0   1.15 / 2.1         0.21.0               0.21.0                0.21.0
0.21.1                      2.17.0            0.15.0   1.15 / 2.1         0.21.0               0.21.0                0.21.0
0.21.0                      2.17.0            0.15.0   1.15 / 2.1         0.21.0               0.21.0                0.21.0
0.15.0                      2.16.0            0.14.0   1.15 / 2.0         0.15.0               0.15.0                0.15.0
0.14.1                      2.14.0            0.14.0   1.14               0.14.0               0.14.0                n/a
0.14.0                      2.14.0            0.14.0   1.14               0.14.0               0.14.0                n/a
0.13.1                      2.11.0            n/a      1.13               0.12.1               0.13.0                n/a
0.13.0                      2.11.0            n/a      1.13               0.12.1               0.13.0                n/a
0.12.0                      2.10.0            n/a      1.12               0.12.1               0.12.0                n/a
0.11.0                      2.8.0             n/a      1.11               0.9.0                0.11.0                n/a
0.9.0                       2.6.0             n/a      1.9                n/a                  n/a                   n/a
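One way to use the table is to compare an environment's package versions against the row for a given TFDV release. The sketch below encodes only the 0.24.0 and 0.23.0 rows. It treats single values as exact matches and the TensorFlow column as a choice of major.minor series, which is an interpretation of the table rather than a documented rule. The names `COMPATIBLE_VERSIONS` and `mismatches` are hypothetical.

```python
# Two rows of the compatibility table, keyed by TFDV version.
# "apache-beam[gcp]" is listed under its distribution name, "apache-beam".
COMPATIBLE_VERSIONS = {
    "0.24.0": {
        "apache-beam": "2.23.0",
        "pyarrow": "0.17.0",
        "tensorflow": ("1.15", "2.3"),  # either series, per the table
        "tensorflow-metadata": "0.24.0",
        "tensorflow-transform": "0.24.0",
        "tfx-bsl": "0.24.0",
    },
    "0.23.0": {
        "apache-beam": "2.23.0",
        "pyarrow": "0.17.0",
        "tensorflow": ("1.15", "2.3"),
        "tensorflow-metadata": "0.23.0",
        "tensorflow-transform": "0.23.0",
        "tfx-bsl": "0.23.0",
    },
}


def _matches(expected, found):
    """Return True if the found version satisfies the table entry."""
    if found is None:
        return False
    if isinstance(expected, tuple):
        # A tuple lists acceptable major.minor series, e.g. "2.3" allows "2.3.1".
        return any(found == p or found.startswith(p + ".") for p in expected)
    return found == expected


def mismatches(tfdv_version, installed):
    """Map each package that deviates from the table to (expected, found)."""
    expected = COMPATIBLE_VERSIONS[tfdv_version]
    return {
        package: (want, installed.get(package))
        for package, want in expected.items()
        if not _matches(want, installed.get(package))
    }
```

A mismatch does not necessarily mean the combination fails, since untested combinations may also work; it only means the combination is not one of the tested ones.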

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.
