Skip to main content

A library for exploring and validating machine learning data.

Project description

TensorFlow Data Validation

Python PyPI Documentation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

TF Data Validation includes:

  • Scalable calculation of summary statistics of training and test data.
  • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets)
  • Automated data-schema generation to describe expectations about data like required values, ranges, and vocabularies
  • A schema viewer to help you inspect the schema.
  • Anomaly detection to identify anomalies, such as missing features, out-of-range values, or wrong feature types, to name a few.
  • An anomalies viewer so that you can see what features have anomalies and learn more in order to correct them.

For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.

Installing from PyPI

The recommended way to install TFDV is using the PyPI package:

pip install tensorflow-data-validation

Nightly Packages

TFDV also hosts nightly packages at https://pypi-nightly.tensorflow.org on Google Cloud. To install the latest nightly package, please use the following command:

pip install --extra-index-url https://pypi-nightly.tensorflow.org/simple tensorflow-data-validation

This will install the nightly packages for the major dependencies of TFDV such as TFX Basic Shared Libraries (TFX-BSL) and TensorFlow Metadata (TFMD).

Build with Docker

This is the recommended way to build TFDV under Linux, and is continuously tested at Google.

1. Install Docker

Please first install docker and docker-compose by following the directions: docker; docker-compose.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

Then, run the following at the project root:

sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010

where PYTHON_VERSION is one of {37, 38, 39}.

A wheel will be produced under dist/.

4. Install the pip package

pip install dist/*.whl

Build from source

1. Prerequisites

To compile and use TFDV, you need to set up some prerequisites.

Install NumPy

If NumPy is not installed on your system, install it now by following these directions.

Install Bazel

If Bazel is not installed on your system, install it now by following these directions.

2. Clone the TFDV repository

git clone https://github.com/tensorflow/data-validation
cd data-validation

Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.

3. Build the pip package

TFDV wheel is Python version dependent -- to build the pip package that works for a specific Python version, use that Python binary to run:

python setup.py bdist_wheel

You can find the generated .whl file in the dist subdirectory.

4. Install the pip package

pip install dist/*.whl

Supported platforms

TFDV is tested on the following 64-bit operating systems:

  • macOS 10.14.6 (Mojave) or later.
  • Ubuntu 16.04 or later.
  • Windows 7 or later.

Notable Dependencies

TensorFlow is required.

Apache Beam is required; it's the way that efficient distributed computation is supported. By default, Apache Beam runs in local mode but can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.

Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized numpy functions.

Compatible versions

The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.

tensorflow-data-validation apache-beam[gcp] pyarrow tensorflow tensorflow-metadata tensorflow-transform tfx-bsl
GitHub master 2.38.0 5.0.0 nightly (1.x/2.x) 1.9.0 n/a 1.9.0
1.9.0 2.38.0 5.0.0 1.15 / 2.9 1.9.0 n/a 1.9.0
1.8.0 2.38.0 5.0.0 1.15 / 2.8 1.8.0 n/a 1.8.0
1.7.0 2.36.0 5.0.0 1.15 / 2.8 1.7.0 n/a 1.7.0
1.6.0 2.35.0 5.0.0 1.15 / 2.7 1.6.0 n/a 1.6.0
1.5.0 2.34.0 5.0.0 1.15 / 2.7 1.5.0 n/a 1.5.0
1.4.0 2.32.0 4.0.1 1.15 / 2.6 1.4.0 n/a 1.4.0
1.3.0 2.32.0 2.0.0 1.15 / 2.6 1.2.0 n/a 1.3.0
1.2.0 2.31.0 2.0.0 1.15 / 2.5 1.2.0 n/a 1.2.0
1.1.1 2.29.0 2.0.0 1.15 / 2.5 1.1.0 n/a 1.1.1
1.1.0 2.29.0 2.0.0 1.15 / 2.5 1.1.0 n/a 1.1.0
1.0.0 2.29.0 2.0.0 1.15 / 2.5 1.0.0 n/a 1.0.0
0.30.0 2.28.0 2.0.0 1.15 / 2.4 0.30.0 n/a 0.30.0
0.29.0 2.28.0 2.0.0 1.15 / 2.4 0.29.0 n/a 0.29.0
0.28.0 2.28.0 2.0.0 1.15 / 2.4 0.28.0 n/a 0.28.1
0.27.0 2.27.0 2.0.0 1.15 / 2.4 0.27.0 n/a 0.27.0
0.26.1 2.28.0 0.17.0 1.15 / 2.3 0.26.0 0.26.0 0.26.0
0.26.0 2.25.0 0.17.0 1.15 / 2.3 0.26.0 0.26.0 0.26.0
0.25.0 2.25.0 0.17.0 1.15 / 2.3 0.25.0 0.25.0 0.25.0
0.24.1 2.24.0 0.17.0 1.15 / 2.3 0.24.0 0.24.1 0.24.1
0.24.0 2.23.0 0.17.0 1.15 / 2.3 0.24.0 0.24.0 0.24.0
0.23.1 2.24.0 0.17.0 1.15 / 2.3 0.23.0 0.23.0 0.23.0
0.23.0 2.23.0 0.17.0 1.15 / 2.3 0.23.0 0.23.0 0.23.0
0.22.2 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.1
0.22.1 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.1
0.22.0 2.20.0 0.16.0 1.15 / 2.2 0.22.0 0.22.0 0.22.0
0.21.5 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.1 0.21.3
0.21.4 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.1 0.21.3
0.21.2 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.21.1 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.21.0 2.17.0 0.15.0 1.15 / 2.1 0.21.0 0.21.0 0.21.0
0.15.0 2.16.0 0.14.0 1.15 / 2.0 0.15.0 0.15.0 0.15.0
0.14.1 2.14.0 0.14.0 1.14 0.14.0 0.14.0 n/a
0.14.0 2.14.0 0.14.0 1.14 0.14.0 0.14.0 n/a
0.13.1 2.11.0 n/a 1.13 0.12.1 0.13.0 n/a
0.13.0 2.11.0 n/a 1.13 0.12.1 0.13.0 n/a
0.12.0 2.10.0 n/a 1.12 0.12.1 0.12.0 n/a
0.11.0 2.8.0 n/a 1.11 0.9.0 0.11.0 n/a
0.9.0 2.6.0 n/a 1.9 n/a n/a n/a

Questions

Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.

Links

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

tensorflow_data_validation-1.9.0-cp39-cp39-win_amd64.whl (1.3 MB view details)

Uploaded CPython 3.9 Windows x86-64

tensorflow_data_validation-1.9.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-1.9.0-cp39-cp39-macosx_10_14_x86_64.whl (1.7 MB view details)

Uploaded CPython 3.9 macOS 10.14+ x86-64

tensorflow_data_validation-1.9.0-cp38-cp38-win_amd64.whl (1.3 MB view details)

Uploaded CPython 3.8 Windows x86-64

tensorflow_data_validation-1.9.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-1.9.0-cp38-cp38-macosx_10_9_x86_64.whl (1.7 MB view details)

Uploaded CPython 3.8 macOS 10.9+ x86-64

tensorflow_data_validation-1.9.0-cp37-cp37m-win_amd64.whl (1.3 MB view details)

Uploaded CPython 3.7m Windows x86-64

tensorflow_data_validation-1.9.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.7m manylinux: glibc 2.12+ x86-64

tensorflow_data_validation-1.9.0-cp37-cp37m-macosx_10_9_x86_64.whl (1.7 MB view details)

Uploaded CPython 3.7m macOS 10.9+ x86-64

File details

Details for the file tensorflow_data_validation-1.9.0-cp39-cp39-win_amd64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.9.0-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 339c7b851b60c0b2dc9320d6f950427c3ec2474c9efde89a9436a0b1eda51fe5
MD5 ae18e39f637191587ead587418105e36
BLAKE2b-256 60b3b670c51e29891c33c5e5800b0ab47688da94c9ef515636a1eddd10df60fa

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.9.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.9.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 fb2252032a4766324c21ab2ad4bf034c0f30c24c0964afe7981b20b2f7ca241a
MD5 7d805fb0b32cdc110c660615d0b4af74
BLAKE2b-256 66355f465801b20b005942a56033ea1024f74913054d3a786050cd82282c94e9

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.9.0-cp39-cp39-macosx_10_14_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.9.0-cp39-cp39-macosx_10_14_x86_64.whl
Algorithm Hash digest
SHA256 b9385eee6d603d4bba818db4807ca9f3fdb44bba37a14fac35fb4fd075b5c00d
MD5 6504c0a1bfb53cf6b7dcdf59bcbdb0a4
BLAKE2b-256 c0738d871ecb1a46716a9ea7c7c0afef482c28bf7f50e0d19aed7e146c6c047a

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.9.0-cp38-cp38-win_amd64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.9.0-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 b9972c1f4c13f3c0a093e44df1f9a487ac6b108813792026cd6ba80d2d0aa673
MD5 2c588b849d7035426418c765947b23f1
BLAKE2b-256 b4df5a163f0302f219d8b9fa70318534aa76206efa97a87bb06551a77b86604b

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.9.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.9.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 f419bbb419d780bcd78ebdbcf6c43e70d4d71ac296e6e63149a5fdc630f17f91
MD5 4f80a1c2b5bed9de959b3f3c150ca527
BLAKE2b-256 8de4769e0261993cd31ea65f96bbfe9b1b1f477cd8fd70bf528d308ef53a5734

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.9.0-cp38-cp38-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.9.0-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 f709b32d4de3140c152c66f696803e41d1e8a856de53230e0e5bfbcfdd5aa91e
MD5 0c3e128c9ea24f36899a6c7d7a95e292
BLAKE2b-256 a4068d917a7da286a87c6533827f53b6f06eee4e070f07aed22793616390ebe5

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.9.0-cp37-cp37m-win_amd64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.9.0-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 1a331d2174867860e8ab67bbf7aa42108789e2b23a31f568c74760e939a753bb
MD5 58fee8c41b6a4dc9d11039cc61d5d194
BLAKE2b-256 efbce2fa89447dee70176ed15e6d72d116b93979e97df2e801bff9545bbab02d

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.9.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.9.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 9c74d46fbd61b4d479585c9dbda7d41134038d4793003814bd97b9501a8dc207
MD5 41665f8c3664b82684239c7bcfd6fc14
BLAKE2b-256 0110f5097654cbd02fecfe9f8fe0abfc3204ce9974f8ccca4c04458ae9fde6f4

See more details on using hashes here.

File details

Details for the file tensorflow_data_validation-1.9.0-cp37-cp37m-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for tensorflow_data_validation-1.9.0-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 d7b12cbfcca399243c5de6e1c09308ca79a67a4bcb1b9df02bedc0672609a2c1
MD5 09d566beba3a863c822d747494196094
BLAKE2b-256 6d71bfb9a28875521e44c3712fdaf8f14dfe484db8d22aac9ae0ed93a9029e76

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page