Skip to main content

A library for exploring and validating machine learning data.

Project description

TensorFlow Data Validation

Python PyPI Documentation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets)
  • Automated data-schema generation to describe expectations about data like required values, ranges, and vocabularies
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify anomalies, such as missing features, out-of-range values, or wrong feature types, to name a few.
  • An anomalies viewer so that you can see what features have anomalies and learn more in order to correct them.

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation

Nightly Packages

TFDV also hosts nightly packages on Google Cloud. To install the latest nightly package, please use the following command:

export TFX_DEPENDENCY_SELECTOR=NIGHTLY
pip install --extra-index-url https://pypi-nightly.tensorflow.org/simple tensorflow-data-validation

This will install the nightly packages for the major dependencies of TFDV such as TFX Basic Shared Libraries (TFX-BSL) and TensorFlow Metadata (TFMD).

Sometimes TFDV uses those dependencies' most recent changes, which are not yet released. Because of this, it is safer to use nightly versions of those dependent libraries when using nightly TFDV. Export the TFX_DEPENDENCY_SELECTOR environment variable to do so.

NOTE: These nightly packages are unstable and breakages are likely to happen. The fix could often take a week or more depending on the complexity involved.

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

Please first install docker and docker-compose by following the directions: docker; docker-compose.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {39, 310, 311}.

A wheel will be produced under dist/.

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

TFDV wheel is Python version dependent -- to build the pip package that works for a specific Python version, use that Python binary to run:

python setup.py bdist_wheel

You can find the generated .whl file in the dist subdirectory.

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 12.5 (Monterey) or later.
  • Ubuntu 20.04 or later.

Notable Dependencies

TensorFlow is required.

Apache Beam is required; it's the way that efficient distributed computation is supported. By default, Apache Beam runs in local mode but can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized numpy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

tensorflow-data-validation apache-beam[gcp] pyarrow tensorflow tensorflow-metadata tensorflow-transform tfx-bsl
GitHub master 2.59.0 10.0.1 nightly (2.x) 1.16.0 n/a 1.16.0
1.16.0 2.59.0 10.0.1 2.16 1.16.0 n/a 1.16.0
1.15.1 2.47.0 10.0.0 2.15 1.15.0 n/a 1.15.1
1.15.0 2.47.0 10.0.0 2.15 1.15.0 n/a 1.15.0
1.14.0 2.47.0 10.0.0 2.13 1.14.0 n/a 1.14.0
1.13.0 2.40.0 6.0.0 2.12 1.13.1 n/a 1.13.0
1.12.0 2.40.0 6.0.0 2.11 1.12.0 n/a 1.12.0
1.11.0 2.40.0 6.0.0 1.15 / 2.10 1.11.0 n/a 1.11.0
1.10.0 2.40.0 6.0.0 1.15 / 2.9 1.10.0 n/a 1.10.1
1.9.0 2.38.0 5.0.0 1.15 / 2.9 1.9.0 n/a 1.9.0
1.8.0 2.38.0 5.0.0 1.15 / 2.8 1.8.0 n/a 1.8.0
1.7.0 2.36.0 5.0.0 1.15 / 2.8 1.7.0 n/a 1.7.0
1.6.0 2.35.0 5.0.0 1.15 / 2.7 1.6.0 n/a 1.6.0
1.5.0 2.34.0 5.0.0 1.15 / 2.7 1.5.0 n/a 1.5.0
1.4.0 2.32.0 4.0.1 1.15 / 2.6 1.4.0 n/a 1.4.0
1.3.0 2.32.0 2.0.0 1.15 / 2.6 1.2.0 n/a 1.3.0
1.2.0 2.31.0 2.0.0 1.15 / 2.5 1.2.0 n/a 1.2.0
1.1.1 2.29.0 2.0.0 1.15 / 2.5 1.1.0 n/a 1.1.1
1.1.0 2.29.0 2.0.0 1.15 / 2.5 1.1.0 n/a 1.1.0
1.0.0 2.29.0 2.0.0 1.15 / 2.5 1.0.0 n/a 1.0.0
0.30.0 2.28.0 2.0.0 1.15 / 2.4 0.30.0 n/a 0.30.0
0.29.0 2.28.0 2.0.0 1.15 / 2.4 0.29.0 n/a 0.29.0
0.28.0 2.28.0 2.0.0 1.15 / 2.4 0.28.0 n/a 0.28.1
0.27.0 2.27.0 2.0.0 1.15 / 2.4 0.27.0 n/a 0.27.0
0.26.1 2.28.0 0.17.0 1.15 / 2.3 0.26.0 0.26.0 0.26.0
0.26.0 2.25.0 0.17.0 1.15 / 2.3 0.26.0 0.26.0 0.26.0
0.25.0 2.25.0 0.17.0 1.15 / 2.3 0.25.0 0.25.0 0.25.0
0.24.1 2.24.0 0.17.0 1.15 / 2.3 0.24.0 0.24.1 0.24.1
0.24.0 2.23.0 0.17.0 1.15 / 2.3 0.24.0 0.24.0 0.24.0
0.23.1 2.24.0 0.17.0 1.15 / 2.3 0.23.0 0.23.0 0.23.0
0.23.0 2.23.0 0.17.0 1.15 / 2.3 0.23.0 0.23.0 0.23.0
0.22.2 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.1
0.22.1 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.1
0.22.0 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.0
0.21.5 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.1 0.21.3
0.21.4 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.1 0.21.3
0.21.2 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.21.1 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.21.0 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.15.0 2.16.0 0.14.0 1.15 / 2.0 0.15.0 0.15.0 0.15.0
0.14.1 2.14.0 0.14.0 1.14 0.14.0 0.14.0 n/a
0.14.0 2.14.0 0.14.0 1.14 0.14.0 0.14.0 n/a
0.13.1 2.11.0 n/a 1.13 0.12.1 0.13.0 n/a
0.13.0 2.11.0 n/a 1.13 0.12.1 0.13.0 n/a
0.12.0 2.10.0 n/a 1.12 0.12.1 0.12.0 n/a
0.11.0 2.8.0 n/a 1.11 0.9.0 0.11.0 n/a
0.9.0 2.6.0 n/a 1.9 n/a n/a n/a

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.

Links

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

tensorflow_data_validation-1.16.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.0 MB view details)

Uploaded CPython 3.11 manylinux: glibc 2.17+ x86-64

tensorflow_data_validation-1.16.0-cp311-cp311-macosx_12_0_x86_64.whl (20.2 MB view details)

Uploaded CPython 3.11 macOS 12.0+ x86-64

tensorflow_data_validation-1.16.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.0 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

tensorflow_data_validation-1.16.0-cp310-cp310-macosx_12_0_x86_64.whl (20.2 MB view details)

Uploaded CPython 3.10 macOS 12.0+ x86-64

tensorflow_data_validation-1.16.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.0 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

tensorflow_data_validation-1.16.0-cp39-cp39-macosx_12_0_x86_64.whl (20.2 MB view details)

Uploaded CPython 3.9 macOS 12.0+ x86-64

File details

Details for the file tensorflow_data_validation-1.16.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.16.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 c3d1e784fe0526c9ea205bc305ab52e8dd8486599b75d392185d41eebbb65469
MD5 d897d00e9b2fb4488eb1a9d24b884628
BLAKE2b-256 24130fcf7708ebfeb4627cd371dff44011ae1bb5bd545140ce390e70002f1ed7

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.16.0-cp311-cp311-macosx_12_0_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.16.0-cp311-cp311-macosx_12_0_x86_64.whl
Algorithm Hash digest
SHA256 ad60c94d6d3dfe847da55ede2eefe4a63f75a82a4525a2f85da0ee8f22fbd24c
MD5 46d2aaca7bde41b65f6a865da3c98dbd
BLAKE2b-256 0b815d74114f59715cab0da25727289a823e4bd5035610a18c5e782e28f6e430

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.16.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.16.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 3eb5a20bdecac065702ad38f4a883d89a3dc58d114c5a5973582228635c656d6
MD5 9a038bb1bff5f13d19925c7dbebc372e
BLAKE2b-256 47d2676cc1d467beb723cae244dc374cd12e9d2c5e4bd28c4aadd5207d4cfa82

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.16.0-cp310-cp310-macosx_12_0_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.16.0-cp310-cp310-macosx_12_0_x86_64.whl
Algorithm Hash digest
SHA256 7bccd86d9b603adfe0d1460f6158d7b5d30e3b70979ca62dea2f7de3ef00ad42
MD5 6d9d399a234ac6118c1b221206fd73ee
BLAKE2b-256 1fe6209b95b024f321176f6b30238b3870c5bf57a8c77d1d9630cf0fd860f215

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.16.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.16.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 a96796b83400a950b28824dcde3dd6cb4e23f0157e90d487e3933a39c265e150
MD5 47fb2e3f6c9d09b283a13b60ade614b6
BLAKE2b-256 1e6928af641bb453ac7251036051c9fbd49f8e7c5c5af4c4f82a59ff8554a4dd

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.16.0-cp39-cp39-macosx_12_0_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.16.0-cp39-cp39-macosx_12_0_x86_64.whl
Algorithm Hash digest
SHA256 ce10665b82cb09467e29dd3df205ed60144cbf00dd1818dfef9fac7736c3970b
MD5 153bf99c7158a2bb06b7278530e3e5e0
BLAKE2b-256 f3f0a423eed2ccd5081c8f9bf6ac0c5030cca6f652407a4129896eb8b7f72caf

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page