tensorspark

Tensorflow on Spark, a scalable system for high-performance machine learning

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

# tensorspark
Running Tensorflow on Spark in the scalable, fast and compatible style

Tensorspark facilitates the researchers and programmer to easily write the regular Tensorflow programs and run Tensorflow on the Spark distributed computing paradigm. Tensorspark is innovated by the SparkSession, which parallelizes the Tensorflow sessions in different executors of Spark. SparkSession maintains a riable central parameter server, which synchronizes the machine learning model parameters periodically with the worker executors.

##Programming example
Tensorspark program is very easy to write if one is already familiar with Tensorflow. An complete example of writing the MNIST program can be checked out in src/example/spark_mnist.py.
```
#initialize the learning model exactly as Tensorflow
import tensorflow as tf
x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)
y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)

#Extra information to notify the SparkSession about the input/output tensor and the variables.
feed_name_list = [x.name, y_.name]
param_list = [W, b]

#Initialize the SparkSession and run it with the Spark RDD data.
spark_sess = sps.SparkSession(sc, sess, user='liangfengsid', name='spark_mnist', server_host='localhost', server_port=10080, sync_interval=100, batch_size=100)
spark_sess.run(train_step, feed_rdd=image_label_rdd, feed_name_list=feed_name_list, param_list=param_list, shuffle_within_partition=True)
```

##Brief Installation Instruction (Linux or Mac OS):

###Install Tensorflow in each computer
https://www.tensorflow.org/versions/r0.9/get_started/os_setup.html

###Setup the Hadoop and Spark cluster
http://spark.apache.org

###Install TornadoWeb in each computer (Optional if the anaconda python is used).
http://www.tornadoweb.org/en/stable/

###Install TensorSpark:
```
$ easy_install tensorspark
```
or download the source at github, compile and install it via:
```
$ python setup.py build
$ python setup.py install
```

###Configure the Spark cluster for Tensorspark
In the Spark configuratino file, conf/spark-defaults.conf, add the following configuration information
```
The directory in HDFS to store the SparkSession temporary files
spark.hdfs.dir /data
The directory in the local computer to store the SparkSession temporary files
spark.tmp.dir /tmp
```

###Create the corresponding directory in HDFS configured in the previous step
```
bin/hadoop fs -mkdir /data
```

###Prepare the MNIST example data and upload them to HDFS
Download the MNIST train data file in this github under: src/MNIST_data/.

Upload them to HDFS:
```
hadoop fs -put MNIST_data/* /data
```

###Run the MNIST example
In the directory of Tensorspark/src, run Spark pyspark via the shell.
```
pyspark --deploy-mode=client
>>>import example.spark_mnist as mnist
>>>mnist.train(sc=sc, user='liangfengsid', name='mnist_try', server_host='localhost', server_port=10080, sync_interval=100, batch_size=100, num_partition=1, num_epoch=2)
```

Project details

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

1.0.6

Oct 27, 2016

1.0.5

Oct 22, 2016

1.0.4

Aug 2, 2016

1.0.3

Aug 2, 2016

This version

1.0.1

Aug 1, 2016

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tensorspark-1.0.1.tar.gz (16.0 kB view hashes)

Uploaded Aug 1, 2016 Source

Built Distribution

tensorspark-1.0.1-py2-none-any.whl (22.8 kB view hashes)

Uploaded Aug 1, 2016 Python 2

Hashes for tensorspark-1.0.1.tar.gz

Hashes for tensorspark-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`c610d8ac893ccf419b11d926d7efa3de5a28ceb71f7bc1dbe4522a0b763663ae`
MD5	`86cdbc9731e21822dc0c2f53afa70729`
BLAKE2b-256	`7d413466c477cf51e748f3d85ed3c33803e35108575c0ada36bd831405149e0a`

Hashes for tensorspark-1.0.1-py2-none-any.whl

Hashes for tensorspark-1.0.1-py2-none-any.whl
Algorithm	Hash digest
SHA256	`5c0d3b64cb0712b63b08d5bb635b8142b5d0336ba76fa57e1bb894ac7af5a1b3`
MD5	`65c436a9bfc27efca8488e3179f39460`
BLAKE2b-256	`08643d5ac609d14ad64720079e82b9890702b80ecef4c27657b3d5a1b4ad607f`