TensorFlow on Spark, a scalable system for high-performance machine learning


# tensorspark
Run TensorFlow on Spark in a scalable, fast and compatible way.

Tensorspark lets researchers and programmers write regular TensorFlow programs and run them on Spark's distributed computing framework. The core of Tensorspark is SparkSession, which parallelizes TensorFlow sessions across the executors of a Spark cluster. SparkSession maintains a reliable central parameter server that periodically synchronizes the machine learning model parameters with the worker executors.
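The synchronization scheme described above can be sketched in plain Python. This is a minimal illustration of a central parameter server that periodically averages its parameters with a worker's local copy; the class and method names here are hypothetical, not tensorspark's actual API:

```python
# Minimal sketch of periodic parameter synchronization between a central
# server and a worker. Illustrative names only, not tensorspark's API.

class ParameterServer:
    """Holds the central copy of the model parameters."""

    def __init__(self, params):
        self.params = list(params)

    def push(self, worker_params):
        # Fold a worker's parameters into the central copy by averaging.
        self.params = [(c + w) / 2.0
                       for c, w in zip(self.params, worker_params)]

    def pull(self):
        # Workers refresh their local copy from the server.
        return list(self.params)


def worker_loop(server, num_batches, sync_interval):
    local = server.pull()
    for batch in range(1, num_batches + 1):
        # Stand-in for one gradient-descent step on a mini-batch.
        local = [p + 1.0 for p in local]
        if batch % sync_interval == 0:
            # Every sync_interval batches, reconcile with the server.
            server.push(local)
            local = server.pull()
    return local


server = ParameterServer([0.0, 0.0])
final = worker_loop(server, num_batches=100, sync_interval=10)
```

With several workers pushing concurrently, this averaging style trades some staleness for not blocking every batch on the network, which is the usual motivation for a periodic (rather than per-step) synchronization interval.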

## Programming example
A Tensorspark program is easy to write for anyone already familiar with TensorFlow. A complete MNIST example can be found in src/example/spark_mnist.py.
```
# Initialize the learning model exactly as in plain TensorFlow.
import tensorflow as tf
import tensorspark as sps
x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)
y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)

# Extra information that tells the SparkSession which tensors are fed
# and which variables are synchronized.
feed_name_list = [x.name, y_.name]
param_list = [W, b]

# Initialize the SparkSession and run it with the Spark RDD data
# (sc and image_label_rdd come from the surrounding Spark program).
spark_sess = sps.SparkSession(sc, sess, user='liangfengsid', name='spark_mnist', server_host='localhost', server_port=10080, sync_interval=100, batch_size=100)
spark_sess.run(train_step, feed_rdd=image_label_rdd, feed_name_list=feed_name_list, param_list=param_list, shuffle_within_partition=True)
```
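The `batch_size` and `shuffle_within_partition` arguments control how each Spark partition is cut into mini-batches before being fed to the TensorFlow session. A rough illustration of that batching step in pure Python (the helper name `partition_to_batches` is hypothetical, not part of tensorspark):

```python
import random


def partition_to_batches(rows, batch_size, shuffle=True, seed=0):
    """Cut one partition's rows into mini-batches of batch_size rows.

    Illustrative only: conceptually, each resulting batch is what gets
    fed to sess.run() through a feed_dict keyed by the tensor names in
    feed_name_list. The last batch may be smaller than batch_size.
    """
    rows = list(rows)
    if shuffle:
        # Shuffle within the partition, as shuffle_within_partition=True does.
        random.Random(seed).shuffle(rows)
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


batches = partition_to_batches(range(250), batch_size=100)
# 250 rows with batch_size=100 yield batches of 100, 100 and 50 rows.
```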

## Brief Installation Instructions (Linux or Mac OS)

### Install TensorFlow on each machine
https://www.tensorflow.org/versions/r0.9/get_started/os_setup.html

### Set up the Hadoop and Spark cluster
http://spark.apache.org

### Install TornadoWeb on each machine (optional if Anaconda Python is used)
http://www.tornadoweb.org/en/stable/

### Install Tensorspark
```
$ easy_install tensorspark
```
or download the source from GitHub and build and install it via:
```
$ python setup.py build
$ python setup.py install
```

### Configure the Spark cluster for Tensorspark
In the Spark configuration file, conf/spark-defaults.conf, add the following settings:
```
# The directory in HDFS to store the SparkSession temporary files
spark.hdfs.dir /data
# The directory on the local machine to store the SparkSession temporary files
spark.tmp.dir /tmp
```
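Entries in spark-defaults.conf are whitespace-separated key/value pairs, with `#` starting a comment. A small sketch of how such lines can be parsed (illustrative only, not Spark's own configuration loader):

```python
def parse_spark_conf(text):
    """Parse spark-defaults.conf-style lines into a dict.

    Blank lines and '#' comments are skipped; the key is the first
    whitespace-separated token and the value is the rest of the line.
    """
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf


conf = parse_spark_conf("""
# SparkSession temporary-file locations
spark.hdfs.dir /data
spark.tmp.dir /tmp
""")
```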

### Create the HDFS directory configured in the previous step
```
bin/hadoop fs -mkdir /data
```

### Prepare the MNIST example data and upload it to HDFS
Download the MNIST training data files from this repository under src/MNIST_data/.

Upload them to HDFS:
```
hadoop fs -put MNIST_data/* /data
```
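For reference, the MNIST files use the IDX binary format: two zero bytes, a one-byte dtype code, a one-byte dimension count, then that many big-endian 4-byte dimension sizes. A minimal header parser, demonstrated on a synthetic header (illustrative; the example's own loader in spark_mnist.py may read the files differently):

```python
import struct


def read_idx_header(data):
    """Parse the header of an IDX file (the MNIST on-disk format).

    Bytes 0-3 are the magic number: 0x00, 0x00, dtype code, ndim.
    They are followed by ndim big-endian 4-byte dimension sizes.
    """
    _zero, dtype, ndim = struct.unpack_from('>HBB', data, 0)
    dims = struct.unpack_from('>' + 'I' * ndim, data, 4)
    return dtype, dims


# A synthetic header for a 2x3 matrix of unsigned bytes (dtype code 0x08).
header = struct.pack('>HBBII', 0, 0x08, 2, 2, 3)
dtype, dims = read_idx_header(header)
```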

### Run the MNIST example
From the Tensorspark/src directory, launch the pyspark shell:
```
pyspark --deploy-mode=client
>>> import example.spark_mnist as mnist
>>> mnist.train(sc=sc, user='liangfengsid', name='mnist_try', server_host='localhost', server_port=10080, sync_interval=100, batch_size=100, num_partition=1, num_epoch=2)
```
