mycloud

Work distribution for small clusters.

This project has been archived.

The maintainers of this project have marked this project as archived. No new releases are expected.

Project description

MyCloud

Leverage small clusters of machines to increase your productivity.

MyCloud requires no prior setup; if you can SSH to your machines, then it will work out of the box. MyCloud currently exports a simple mapreduce API with several common input formats; adding support for your own is easy as well.

Usage

Starting your cluster:

import mycloud

cluster = mycloud.Cluster(['machine1', 'machine2'])

# or use defaults from ~/.config/mycloud
# cluster = mycloud.Cluster()

Map over a list:

result = cluster.map(compute_factors, range(1000))
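
compute_factors is not defined in this README. As a minimal sketch, assuming it is meant to return the prime factors of an integer, a worker of that kind could look like the following. The implementation is an illustration and not part of MyCloud, and the built-in map stands in for cluster.map so the example runs without a cluster.

```python
def compute_factors(n):
    """Return the prime factors of n in ascending order (hypothetical worker)."""
    factors = []
    divisor = 2
    # Trial division: strip out each divisor as many times as it divides n.
    while divisor * divisor <= n:
        while n % divisor == 0:
            factors.append(divisor)
            n //= divisor
        divisor += 1
    # Whatever remains above 1 is itself prime.
    if n > 1:
        factors.append(n)
    return factors


# The built-in map stands in for cluster.map here.
local_result = list(map(compute_factors, range(1000)))
```

Because the worker is a plain top-level function that takes one argument and returns a list, it can be tested locally before being handed to the cluster.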

ClientFS makes accessing local files seamless!

def my_worker(filename):
  do_work(mycloud.fs.FS.open(filename, 'r'))

cluster.map(my_worker, ['client:///my/local/file'])

Use the MapReduce interface to easily handle processing of larger datasets:

from mycloud.mapreduce import MapReduce, group
from mycloud.resource import CSV
input_desc = [CSV('client:///path/to/my_input_%d.csv' % i) for i in range(100)]
output_desc = [CSV('client:///path/to/my_output_file.csv')]

def map_identity(kv_iter, output):
  for k, v in kv_iter:
    output(k, int(v[0]))

def reduce_sum(kv_iter, output):
  for k, values in group(kv_iter):
    output(k, sum(values))

mr = MapReduce(cluster, map_identity, reduce_sum, input_desc, output_desc)

result = mr.run()

for k, v in result[0].reader():
  print k, v
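
To see the dataflow without a cluster, the map and reduce functions above can be exercised locally. The sketch below is a simulation under stated assumptions: group is a stand-in built on itertools.groupby that expects key-sorted input, run_local is a hypothetical helper, and the sample rows are invented. None of it reflects how MyCloud itself schedules or shuffles work.

```python
from itertools import groupby
from operator import itemgetter


def group(kv_iter):
    """Stand-in for mycloud.mapreduce.group: yield (key, values) from key-sorted pairs."""
    for key, pairs in groupby(kv_iter, key=itemgetter(0)):
        yield key, [value for _, value in pairs]


def map_identity(kv_iter, output):
    for k, v in kv_iter:
        output(k, int(v[0]))


def reduce_sum(kv_iter, output):
    for k, values in group(kv_iter):
        output(k, sum(values))


def run_local(mapper, reducer, records):
    """Hypothetical helper: map, sort by key (a local 'shuffle'), then reduce."""
    mapped = []
    mapper(iter(records), lambda k, v: mapped.append((k, v)))
    mapped.sort(key=itemgetter(0))
    reduced = []
    reducer(iter(mapped), lambda k, v: reduced.append((k, v)))
    return reduced


# Made-up sample rows: a key plus a list of string fields.
sample = [("a", ["1"]), ("b", ["2"]), ("a", ["3"]), ("b", ["4"]), ("c", ["5"])]
totals = run_local(map_identity, reduce_sum, sample)
```

Here totals holds one (key, sum) pair per distinct key, which is the shape the reducer writes to its output.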

Performance

Keep in mind that MyCloud is written entirely in Python.

Some simple operations I’ve used it for (6 machines, 96 cores):

Sorting a billion numbers: ~5 minutes
Preprocessing 1.3 million images (resizing and SIFT feature extraction): ~1 hour

Input formats

MyCloud has built-in support for processing the following file types:

LevelDB
CSV
Text (lines)
Zip

Adding support for your own is simple: just write a resource class describing how to get a reader and a writer (see resource.py for details).
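
As an illustration of what "a reader and a writer" can mean, here is a self-contained sketch of a resource for tab-separated files. The class names, the constructor, and the writer interface (add and close) are hypothetical. Only the reader() method name appears in the usage above, and the real base class and required methods are defined in resource.py, which this sketch does not reproduce.

```python
import os
import tempfile


class TSVWriter(object):
    """Hypothetical writer: one tab-separated key/value line per record."""

    def __init__(self, path):
        self._file = open(path, 'w')

    def add(self, key, value):
        self._file.write('%s\t%s\n' % (key, value))

    def close(self):
        self._file.close()


class TSVResource(object):
    """Hypothetical resource describing how to read and write one TSV file."""

    def __init__(self, path):
        self.path = path

    def reader(self):
        # Yield (key, value) string pairs, one per line.
        with open(self.path) as f:
            for line in f:
                key, value = line.rstrip('\n').split('\t', 1)
                yield key, value

    def writer(self):
        return TSVWriter(self.path)


# Round-trip a few records through a temporary file.
path = os.path.join(tempfile.mkdtemp(), 'example.tsv')
resource = TSVResource(path)
w = resource.writer()
w.add('a', 1)
w.add('b', 2)
w.close()
records = list(resource.reader())
```

Values come back as strings, so a map function would convert them as needed, much as map_identity calls int on its input.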

Why?!?

Sometimes you’re developing something in Python (because that’s what you do), and you decide you’d like it to be parallelized. Our current options are multiprocessing (limiting us to a single machine) and Hadoop streaming (limiting us to strings and Hadoop’s input formats).

Also, because I could.

Credits

MyCloud builds on the phenomenally useful cloud serialization, SSH/Paramiko, and LevelDB libraries.

Project details

Release history

0.51 (this version): Jan 15, 2013
0.49: Jan 1, 2013
0.48: Nov 16, 2012

Download files

Source Distribution

mycloud-0.51.tar.gz (12.0 kB)

Uploaded Jan 15, 2013 Source

File details

Details for the file mycloud-0.51.tar.gz.

File metadata

Download URL: mycloud-0.51.tar.gz
Upload date: Jan 15, 2013
Size: 12.0 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for mycloud-0.51.tar.gz
Algorithm	Hash digest
SHA256	`397326a15bef6b4f80d689939c44018eb0444807ee9b95cf46b48a01100d5f25`
MD5	`16cdbe7c0c8c03c32e4a6f1ad0a93870`
BLAKE2b-256	`9b74101e9410700f5c252e0c21ec28ae9fa65fabdf2dd8722f1cf0acc508a0a5`
