TensorFlow I/O
TensorFlow I/O is a collection of file systems and file formats that are not available in TensorFlow's built-in support. A full list of the file systems and file formats supported by TensorFlow I/O can be found here.
Using tensorflow-io with Keras is straightforward. Below is the Get Started with TensorFlow example, with the data processing step replaced by tensorflow-io:
import tensorflow as tf
import tensorflow_io as tfio

# Read the MNIST data into the IODataset.
d_train = tfio.IODataset.from_mnist(
    'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',
    'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz')

# Shuffle the elements of the dataset.
d_train = d_train.shuffle(buffer_size=1024)

# By default image data is uint8, so convert to float32 using map().
d_train = d_train.map(lambda x, y: (tf.image.convert_image_dtype(x, tf.float32), y))

# Prepare batches of the data, just like any other tf.data.Dataset.
d_train = d_train.batch(32)

# Build the model.
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])

# Compile the model.
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Fit the model.
model.fit(d_train, epochs=5, steps_per_epoch=200)
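As a possible follow-up (not part of the original example), the test split can be loaded the same way and passed to model.evaluate(). The t10k file names below are the standard MNIST test files and are an assumption here, not something taken from the snippet above:
# Load the MNIST test split with the same IODataset API, then evaluate the model.
d_test = tfio.IODataset.from_mnist(
    'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz',
    'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz')
d_test = d_test.map(lambda x, y: (tf.image.convert_image_dtype(x, tf.float32), y))
d_test = d_test.batch(32)
model.evaluate(d_test)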
In the above MNIST example, the URLs to the dataset files are passed directly to the tfio.IODataset.from_mnist API call. This is possible because of the inherent support that tensorflow-io provides for the HTTP file system, which eliminates the need to download and save the datasets to a local directory.
NOTE: Since tensorflow-io can detect and uncompress the MNIST dataset automatically when needed, we can pass the URLs of the compressed (gzip) files to the API call as is.
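To illustrate the HTTP file system support described above, the following minimal sketch reads the raw bytes of one of the remote files directly; it assumes that the http:// scheme becomes available to tf.io.gfile once tensorflow_io has been imported:
import tensorflow as tf
import tensorflow_io as tfio  # importing tensorflow_io registers the additional file systems

# Read the raw (still gzip-compressed) bytes of a remote file over HTTP.
with tf.io.gfile.GFile(
        'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz', 'rb') as f:
    data = f.read()
print(len(data))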
Please check the official documentation for more detailed and interesting usages of the package.
Installation
Python Package
The tensorflow-io Python package can be installed with pip directly using:
$ pip install tensorflow-io
People who are a little more adventurous can also try our nightly binaries:
$ pip install tensorflow-io-nightly
In addition to the pip packages, the docker images can be used to quickly get started.
For stable builds:
$ docker pull tfsigio/tfio:latest
$ docker run -it --rm --name tfio-latest tfsigio/tfio:latest
For nightly builds:
$ docker pull tfsigio/tfio:nightly
$ docker run -it --rm --name tfio-nightly tfsigio/tfio:nightly
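Whichever installation route is used, a quick sanity check (a minimal sketch, not part of the original instructions) is to import both packages and touch an API used above:
import tensorflow as tf
import tensorflow_io as tfio

# Both imports succeeding indicates the install worked; print the TensorFlow
# version to compare against the compatibility table below.
print(tf.__version__)
print(tfio.IODataset)  # the class used in the MNIST example above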
R Package
Once the tensorflow-io Python package has been successfully installed, you can install the development version of the R package from GitHub via the following:
if (!require("remotes")) install.packages("remotes")
remotes::install_github("tensorflow/io", subdir = "R-package")
TensorFlow Version Compatibility
To ensure compatibility with TensorFlow, it is recommended to install a matching version of TensorFlow I/O according to the table below. You can find the list of releases here.
TensorFlow I/O Version | TensorFlow Compatibility | Release Date |
---|---|---|
0.16.0 | 2.3.x | Oct 23, 2020 |
0.15.0 | 2.3.x | Aug 03, 2020 |
0.14.0 | 2.2.x | Jul 08, 2020 |
0.13.0 | 2.2.x | May 10, 2020 |
0.12.0 | 2.1.x | Feb 28, 2020 |
0.11.0 | 2.1.x | Jan 10, 2020 |
0.10.0 | 2.0.x | Dec 05, 2019 |
0.9.1 | 2.0.x | Nov 15, 2019 |
0.9.0 | 2.0.x | Oct 18, 2019 |
0.8.1 | 1.15.x | Nov 15, 2019 |
0.8.0 | 1.15.x | Oct 17, 2019 |
0.7.2 | 1.14.x | Nov 15, 2019 |
0.7.1 | 1.14.x | Oct 18, 2019 |
0.7.0 | 1.14.x | Jul 14, 2019 |
0.6.0 | 1.13.x | May 29, 2019 |
0.5.0 | 1.13.x | Apr 12, 2019 |
0.4.0 | 1.13.x | Mar 01, 2019 |
0.3.0 | 1.12.0 | Feb 15, 2019 |
0.2.0 | 1.12.0 | Jan 29, 2019 |
0.1.0 | 1.12.0 | Dec 16, 2018 |
Development
IDE Setup
For instructions on how to configure Visual Studio Code for developing TensorFlow I/O, please refer to https://github.com/tensorflow/io/blob/master/docs/vscode.md
Lint
TensorFlow I/O's code conforms to Bazel Buildifier, Clang Format, Black, and Pyupgrade. Please use the following command to check the source code and identify lint issues:
$ bazel run //tools/lint:check
For Bazel Buildifier and Clang Format, the following command will automatically identify and fix any lint errors:
$ bazel run //tools/lint:lint
Alternatively, if you only want to run lint checks with individual linters, you can selectively pass black, pyupgrade, bazel, or clang to the above commands.
For example, a black-specific lint check can be done using:
$ bazel run //tools/lint:check -- black
Lint fix using Bazel Buildifier and Clang Format can be done using:
$ bazel run //tools/lint:lint -- bazel clang
A lint check using black and pyupgrade on an individual Python file can be done using:
$ bazel run //tools/lint:check -- black pyupgrade -- tensorflow_io/core/python/ops/version_ops.py
An individual Python file can be lint-fixed with black and pyupgrade using:
$ bazel run //tools/lint:lint -- black pyupgrade -- tensorflow_io/core/python/ops/version_ops.py
Python
macOS
On macOS Catalina or higher, it is possible to build tensorflow-io with the system-provided Python 3 (3.7.3). Both tensorflow and bazel are needed.
NOTE: Xcode installation is needed, as tensorflow-io requires Swift for accessing Apple's native AVFoundation APIs. Also, there is a bug in macOS's native Python 3.7.3 that can be worked around as described in https://github.com/tensorflow/tensorflow/issues/33183#issuecomment-554701214
#!/usr/bin/env bash
# Use the following command to check if Xcode is correctly installed:
xcodebuild -version
# macOS's default python3 is 3.7.3
python3 --version
# Install Bazel version specified in .bazelversion
curl -OL https://github.com/bazelbuild/bazel/releases/download/$(cat .bazelversion)/bazel-$(cat .bazelversion)-installer-darwin-x86_64.sh
sudo bash -x -e bazel-$(cat .bazelversion)-installer-darwin-x86_64.sh
# Install tensorflow and configure bazel
sudo ./configure.sh
# Build shared libraries
bazel build -s --verbose_failures //tensorflow_io/...
# Once build is complete, shared libraries will be available in
# `bazel-bin/tensorflow_io/core/python/ops/` and it is possible
# to run tests with `pytest`, e.g.:
sudo python3 -m pip install pytest
TFIO_DATAPATH=bazel-bin python3 -m pytest -s -v tests/test_serialization_eager.py
NOTE: When running pytest, TFIO_DATAPATH=bazel-bin has to be passed so that Python can utilize the generated shared libraries after the build process.
Troubleshoot
If Xcode is installed but $ xcodebuild -version is not displaying the expected output, you might need to enable the Xcode command line tools with:
$ xcode-select -s /Applications/Xcode.app/Contents/Developer
A terminal restart might be required for the changes to take effect.
Sample output:
$ xcodebuild -version
Xcode 11.6
Build version 11E708
Linux
Development of tensorflow-io on Linux is similar to macOS. The required packages are gcc, g++, git, bazel, and python 3. Newer versions of gcc or python than the default system-installed ones might be required, though.
Ubuntu 18.04/20.04
Ubuntu 18.04/20.04 requires gcc/g++, git, and python 3. The following will install dependencies and build the shared libraries on Ubuntu 18.04/20.04:
#!/usr/bin/env bash
# Install gcc/g++, git, unzip/curl (for bazel), and python3
sudo apt-get -y -qq update
sudo apt-get -y -qq install gcc g++ git unzip curl python3-pip
# Install Bazel version specified in .bazelversion
curl -sSOL https://github.com/bazelbuild/bazel/releases/download/$(cat .bazelversion)/bazel-$(cat .bazelversion)-installer-linux-x86_64.sh
sudo bash -x -e bazel-$(cat .bazelversion)-installer-linux-x86_64.sh
# Upgrade pip
sudo python3 -m pip install -U pip
# Install tensorflow and configure bazel
sudo ./configure.sh
# Build shared libraries
bazel build -s --verbose_failures //tensorflow_io/...
# Once build is complete, shared libraries will be available in
# `bazel-bin/tensorflow_io/core/python/ops/` and it is possible
# to run tests with `pytest`, e.g.:
sudo python3 -m pip install pytest
TFIO_DATAPATH=bazel-bin python3 -m pytest -s -v tests/test_serialization_eager.py
CentOS 8
CentOS 8 requires gcc/g++, git, and python 3. The following will install dependencies and build the shared libraries on CentOS 8:
#!/usr/bin/env bash
# Install gcc/g++, git, unzip/which (for bazel), and python3
sudo yum install -y python3 python3-devel gcc gcc-c++ git unzip which
# Install Bazel version specified in .bazelversion
curl -sSOL https://github.com/bazelbuild/bazel/releases/download/$(cat .bazelversion)/bazel-$(cat .bazelversion)-installer-linux-x86_64.sh
sudo bash -x -e bazel-$(cat .bazelversion)-installer-linux-x86_64.sh
# Upgrade pip
sudo python3 -m pip install -U pip
# Install tensorflow and configure bazel
sudo ./configure.sh
# Build shared libraries
bazel build -s --verbose_failures //tensorflow_io/...
# Once build is complete, shared libraries will be available in
# `bazel-bin/tensorflow_io/core/python/ops/` and it is possible
# to run tests with `pytest`, e.g.:
sudo python3 -m pip install pytest
TFIO_DATAPATH=bazel-bin python3 -m pytest -s -v tests/test_serialization_eager.py
CentOS 7
On CentOS 7, the default python and gcc versions are too old to build tensorflow-io's shared libraries (.so). The gcc provided by Developer Toolset and rh-python36 should be used instead. Also, libstdc++ has to be linked statically to avoid a mismatch between the libstdc++ installed on CentOS and the newer gcc version provided by devtoolset.
The following will install bazel, devtoolset-9, rh-python36, and build the shared libraries:
#!/usr/bin/env bash
# Install centos-release-scl, then install gcc/g++ (devtoolset), git, and python 3
sudo yum install -y centos-release-scl
sudo yum install -y devtoolset-9 git rh-python36
# Install Bazel version specified in .bazelversion
curl -sSOL https://github.com/bazelbuild/bazel/releases/download/$(cat .bazelversion)/bazel-$(cat .bazelversion)-installer-linux-x86_64.sh
sudo bash -x -e bazel-$(cat .bazelversion)-installer-linux-x86_64.sh
# Upgrade pip
scl enable rh-python36 devtoolset-9 \
'python3 -m pip install -U pip'
# Install tensorflow and configure bazel with rh-python36
scl enable rh-python36 devtoolset-9 \
'./configure.sh'
# Build shared libraries
BAZEL_LINKOPTS="-static-libstdc++ -static-libgcc" BAZEL_LINKLIBS="-lm -l%:libstdc++.a" \
scl enable rh-python36 devtoolset-9 \
'bazel build -s --verbose_failures //tensorflow_io/...'
# Once build is complete, shared libraries will be available in
# `bazel-bin/tensorflow_io/core/python/ops/` and it is possible
# to run tests with `pytest`, e.g.:
scl enable rh-python36 devtoolset-9 \
'python3 -m pip install pytest'
TFIO_DATAPATH=bazel-bin \
scl enable rh-python36 devtoolset-9 \
'python3 -m pytest -s -v tests/test_serialization_eager.py'
Python Wheels
After the bazel build is complete, python wheels can be built with the following command:
$ python3 setup.py bdist_wheel --data bazel-bin
The .whl file will be available in the dist directory. Note that the bazel binary directory bazel-bin has to be passed with the --data argument in order for setup.py to locate the necessary shared objects, as bazel-bin is outside of the tensorflow_io package directory.
Alternatively, a source install can be done with:
$ TFIO_DATAPATH=bazel-bin python3 -m pip install .
with TFIO_DATAPATH=bazel-bin passed for the same reason.
Note that installing with -e is different from the above. The command
$ TFIO_DATAPATH=bazel-bin python3 -m pip install -e .
will not install the shared objects automatically, even with TFIO_DATAPATH=bazel-bin. Instead, TFIO_DATAPATH=bazel-bin has to be passed every time the program is run after the install:
$ TFIO_DATAPATH=bazel-bin python3
>>> import tensorflow_io as tfio
>>> ...
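As an alternative to prefixing every command, the variable can also be set from Python before the import. This is a hypothetical convenience sketch; it assumes tensorflow-io reads TFIO_DATAPATH at import time to locate the shared objects under bazel-bin:
import os

# Point tensorflow-io at the bazel output before importing it (editable installs only).
os.environ.setdefault('TFIO_DATAPATH', 'bazel-bin')

import tensorflow_io as tfio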
Docker
For Python development, a reference Dockerfile here can be used to build the TensorFlow I/O package (tensorflow-io) from source. Alternatively, the pre-built devel images can be used as well:
# Pull (if necessary) and start the devel container
$ docker run -it --rm --name tfio-dev --net=host -v ${PWD}:/v -w /v tfsigio/tfio:latest-devel bash
# Inside the docker container, ./configure.sh will install TensorFlow or use existing install
(tfio-dev) root@docker-desktop:/v$ ./configure.sh
# Clean up existing bazel builds (if any)
(tfio-dev) root@docker-desktop:/v$ rm -rf bazel-*
# Build TensorFlow I/O C++. For compilation optimization flags, the default (-march=native)
# optimizes the generated code for your machine's CPU type.
# Reference: https://www.tensorflow.org/install/source#configuration_options
# NOTE: Based on the available resources, please change the number of job workers to:
# -j 4/8/16 to prevent bazel server terminations and resource oriented build errors.
(tfio-dev) root@docker-desktop:/v$ bazel build -j 8 --copt=-msse4.2 --copt=-mavx --compilation_mode=opt --verbose_failures --test_output=errors --crosstool_top=//third_party/toolchains/gcc7_manylinux2010:toolchain //tensorflow_io/...
# Run tests with PyTest, note: some tests require launching additional containers to run (see below)
(tfio-dev) root@docker-desktop:/v$ pytest -s -v tests/
# Build the TensorFlow I/O package
(tfio-dev) root@docker-desktop:/v$ python setup.py bdist_wheel
A package file dist/tensorflow_io-*.whl will be generated after a successful build.
NOTE: When working in the Python development container, the environment variable TFIO_DATAPATH is automatically set to point tensorflow-io to the shared C++ libraries built by Bazel, both for running pytest and for building the bdist_wheel. Python setup.py can also accept --data [path] as an argument, for example: python setup.py --data bazel-bin bdist_wheel.
NOTE: While the tfio-dev container gives developers an easy environment to work with, the released whl packages are built differently due to manylinux2010 requirements. Please check the Build Status and CI section for more details on how the released whl packages are generated.
Starting Test Containers
Some tests require launching a test container before running. In order to run all tests, execute the following commands:
$ bash -x -e tests/test_ignite/start_ignite.sh
$ bash -x -e tests/test_kafka/kafka_test.sh
$ bash -x -e tests/test_kinesis/kinesis_test.sh
R
We provide a reference Dockerfile here so that you can use the R package directly for testing. You can build it via:
$ docker build -t tfio-r-dev -f R-package/scripts/Dockerfile .
Inside the container, you can start your R session, instantiate a SequenceFileDataset from an example Hadoop SequenceFile string.seq, and then use any transformation functions provided by the tfdatasets package on the dataset, like the following:
library(tfio)
dataset <- sequence_file_dataset("R-package/tests/testthat/testdata/string.seq") %>%
dataset_repeat(2)
sess <- tf$Session()
iterator <- make_iterator_one_shot(dataset)
next_batch <- iterator_get_next(iterator)
until_out_of_range({
batch <- sess$run(next_batch)
print(batch)
})
Contributing
TensorFlow I/O is a community-led open source project. As such, the project depends on public contributions, bug fixes, and documentation. Please see the contribution guidelines for a guide on how to contribute.
Build Status and CI
Build | Status |
---|---|
Linux CPU Python 2 | (badge) |
Linux CPU Python 3 | (badge) |
Linux GPU Python 2 | (badge) |
Linux GPU Python 3 | (badge) |
Because of the manylinux2010 requirement, TensorFlow I/O is built with Ubuntu:16.04 + Developer Toolset 7 (GCC 7.3) on Linux. Configuring Ubuntu 16.04 with Developer Toolset 7 is not exactly straightforward. If the system has docker installed, then the following commands will automatically build a manylinux2010-compatible whl package:
#!/usr/bin/env bash
ls dist/*
for f in dist/*.whl; do
docker run -i --rm -v $PWD:/v -w /v --net=host quay.io/pypa/manylinux2010_x86_64 bash -x -e /v/tools/build/auditwheel repair --plat manylinux2010_x86_64 $f
done
sudo chown -R $(id -nu):$(id -ng) .
ls wheelhouse/*
It takes some time to build, but once complete, there will be python 3.5, 3.6, and 3.7 compatible whl packages available in the wheelhouse directory.
On macOS, the same command can be used, though the script expects python in the shell and will only generate a whl package that matches the version of python in the shell. If you want to build a whl package for a specific python version, you have to alias that version of python to python in the shell. See the Auditwheel step in .github/workflows/build.yml for instructions on how to do that.
Note the above command is also the command we use when releasing packages for Linux and macOS.
TensorFlow I/O uses both GitHub Workflows and Google CI (Kokoro) for continuous integration. GitHub Workflows is used for macOS build and test, and Kokoro is used for Linux build and test. Again, because of the manylinux2010 requirement, on Linux whl packages are always built with Ubuntu 16.04 + Developer Toolset 7. Tests are done on a variety of systems with different python versions to ensure good coverage:
Python | Ubuntu 16.04 | Ubuntu 18.04 | macOS + osx9 |
---|---|---|---|
2.7 | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
3.5 | :heavy_check_mark: | N/A | :heavy_check_mark: |
3.6 | N/A | :heavy_check_mark: | :heavy_check_mark: |
3.7 | N/A | :heavy_check_mark: | N/A |
TensorFlow I/O has integrations with many systems and cloud vendors such as Prometheus, Apache Kafka, Apache Ignite, Google Cloud PubSub, AWS Kinesis, Microsoft Azure Storage, Alibaba Cloud OSS, etc.
We tried our best to test against those systems in our continuous integration whenever possible. Some tests, such as Prometheus, Kafka, and Ignite, are done with live systems, meaning we install Prometheus/Kafka/Ignite on the CI machine before the test is run. Some tests, such as Kinesis, PubSub, and Azure Storage, are done through official or non-official emulators. Offline tests are also performed whenever possible, though systems covered through offline tests may not have the same level of coverage as live systems or emulators.
 | Live System | Emulator | CI Integration | Offline |
---|---|---|---|---|
Apache Kafka | :heavy_check_mark: | | :heavy_check_mark: | |
Apache Ignite | :heavy_check_mark: | | :heavy_check_mark: | |
Prometheus | :heavy_check_mark: | | :heavy_check_mark: | |
Google PubSub | | :heavy_check_mark: | :heavy_check_mark: | |
Azure Storage | | :heavy_check_mark: | :heavy_check_mark: | |
AWS Kinesis | | :heavy_check_mark: | :heavy_check_mark: | |
Alibaba Cloud OSS | | | | :heavy_check_mark: |
Google BigTable/BigQuery | | | to be added | |
Note:
- Official PubSub Emulator by Google Cloud for Cloud PubSub.
- Official Azurite Emulator by Azure for Azure Storage.
- Non-official LocalStack emulator by LocalStack for AWS Kinesis.
Community
- SIG IO Google Group and mailing list: io@tensorflow.org
- SIG IO Monthly Meeting Notes
- Gitter room: tensorflow/sig-io
More Information
- Streaming Machine Learning with Tiered Storage and Without a Data Lake - Kai Waehner
- TensorFlow with Apache Arrow Datasets - Bryan Cutler
- How to build a custom Dataset for Tensorflow - Ivelin Ivanov
- TensorFlow on Apache Ignite - Anton Dmitriev
License