
SOIL SDK

The SOIL SDK allows users to develop and test applications that run on top of SOIL, as well as modules and data structures that run inside it.

Quick start

Install

pip install soil-sdk

Authentication

soil login

Data Load

import soil
import numpy as np

# To use data already indexed in SOIL, reference it by its id
data = soil.data(data_id)

# Or use local data, e.g. a numpy array
d = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
# This will upload the data
data = soil.data(d)

Data transformation and data exploration

import soil
from soil.modules.preprocessing import row_filter
from soil.modules.itemsets import frequent_itemsets, hypergraph

from my_favourite_graph_library import draw_graph

...

data = soil.data(d)
rf1 = row_filter(data, age={'gt': 60})
rf2 = row_filter(rf1, diseases={'has': {'code': {'regexp': '401.*'}}})
fis = frequent_itemsets(rf2, min_support=10, max_itemset_size=2)
hg = hypergraph(fis)

subgraph = hg.get_data(center_node='401.09', distance=2)

draw_graph(subgraph)

Alternatively, in dplyr-style pipe syntax:

...
hg = (soil.data(d)
  >> row_filter(age={'gt': 60})
  >> row_filter(diseases={'has': {'code': {'regexp': '401.*'}}})
  >> frequent_itemsets(min_support=10, max_itemset_size=2)
  >> hypergraph())
...

It is possible to mix custom code with pipelines.

import soil
from soil.modules.preprocessing import row_filter
from soil.modules.clustering import nb_clustering
from soil.modules.higher_order import predict
from soil.modules.statistics import statistics
...
@soil.modulify
def merge_clusters(clusters, cluster_ids=()):
  '''
  Merge the clusters in cluster_ids into one.
  '''
  cluster_ids = list(cluster_ids)
  M = clusters.data.M
  # Sum the selected cluster columns into a new one
  M['new'] = M[cluster_ids].sum(axis=1)
  # Drop the columns that were merged
  M = M.drop(columns=cluster_ids)
  clusters.data.M = M
  return clusters

data = soil.data(d)
clusters = nb_clustering(data, num_clusters=4)
merged_clusters = merge_clusters(clusters, ['0', '1'])
assigned = predict(merged_clusters, data, assigments_attribute='assigments')
per_cluster_mean_age = statistics(assigned,
  operations=[{
    'fn': 'mean',
    'partition_variables': ['assigments'],
    'aggregation_variable': 'age'
  }])

print(per_cluster_mean_age)

The same, in dplyr-style pipe syntax:

...
per_cluster_mean_age = (nb_clustering(data, num_clusters=4)
  >> merge_clusters(cluster_ids=['0', '1'])
  >> predict(None, data, assigments_attribute='assigments')
  >> statistics(operations=[{
    'fn': 'mean',
    'partition_variables': ['assigments'],
    'aggregation_variable': 'age'
  }]))
...

Aliases

You can define aliases for your trained models with soil.alias('my_alias', model) so that they can be retrieved from another program. This comes in handy in continuous-learning environments where a new model is produced every hour or day and a separate service makes predictions in real time.

def do_every_hour():
  # Get the old model
  old_model = soil.data('my_model')
  # Retrieve the dataset with an alias we have set before
  dataset = soil.data('my_dataset')
  # Retrieve the data that has arrived in the last hour
  new_data = row_filter(dataset, date={'gte': 'now-1h'})
  # Train the new model
  new_model = a_continuous_training_algorithm(old_model, new_data)
  # Set the alias
  soil.alias('my_model', new_model)
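
On the consumer side, a real-time service can resolve the alias on every request so it always predicts with the latest model. A minimal sketch using the modules from earlier in this README (the handler and the shape of the incoming data are illustrative, not part of the SDK):

import soil
from soil.modules.higher_order import predict

def predict_request_handler(incoming_rows):
  # Resolving the alias returns whatever model was most recently aliased
  model = soil.data('my_model')
  # Upload the incoming rows as a small chunk of data
  data = soil.data(incoming_rows)
  # Run the prediction in the cloud and fetch the result
  assigned = predict(model, data, assigments_attribute='assigments')
  return assigned.get_data()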

Design

The SOIL SDK will contain two parts:

  • SOIL library. Runs computations on the SOIL platform; essentially a wrapper on top of the SOIL REST API.
  • SOIL CLI. A terminal client for operations on the SOIL platform, such as uploading new modules and datasets and monitoring them.

Use cases

The SDK must cover two use cases, which can overlap:

  • Build an app on top of SOIL using algorithms and data from the cloud.
  • Create modules and data structures that will live in the cloud.

Design principles

  • It shouldn't matter where the data comes from: disk, a database, S3, ...
  • Compatibility with common data formats: NumPy, Pandas, and native Python types (lists, dicts, generators).
  • Lazy evaluation. Computation doesn't take place until the data is required, and pipelines run 100% in the cloud. If there are partial results they can be reused when possible (not in the first version). See the first sketch after this list.
  • Type annotations (https://docs.python.org/3/library/typing.html). We will use them; they prevent silly (and non-silly) bugs.
  • The import system is tweaked to dynamically import modules that live in the cloud, so they appear to be imported locally. https://docs.python.org/3/reference/import.html
  • User modules are annotated with @modulify, which uploads them temporarily to the cloud. https://github.com/cloudpipe/cloudpickle
  • We assume the code comes from trusted sources. In time we could add sandboxing: https://bytecodealliance.github.io/wasmtime/security-sandboxing.html
  • The user can upload small chunks of data for testing using soil.data().
  • User modules should be unit-testable with Python's unittest (see the test sketch after this list).
  • Integration tests of custom modules and SOIL applications should also be possible.
  • In time it should be possible to mock SOIL for testing as well.
  • Modules and data structures can be permanently uploaded to the cloud (possibly under a namespace).
  • It should be possible to develop from an interactive notebook.
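
As an illustration of the lazy-evaluation and cloud-import principles, the pipeline below (built from the modules used earlier in this README) only describes the computation; nothing runs until the result is requested. A minimal sketch, assuming data_id refers to a dataset already indexed in SOIL:

import soil
# These imports resolve modules that live in the cloud
from soil.modules.preprocessing import row_filter
from soil.modules.itemsets import frequent_itemsets, hypergraph

# Each step returns a lazy reference; nothing has been computed yet
pipeline = (soil.data(data_id)
  >> row_filter(age={'gt': 60})
  >> frequent_itemsets(min_support=10, max_itemset_size=2)
  >> hypergraph())

# Only this call triggers execution in the cloud and downloads the result
subgraph = pipeline.get_data(center_node='401.09', distance=2)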
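
For the unit-testing principle, a custom module like merge_clusters above could be exercised locally with a small stub. A sketch with unittest, assuming the @modulify-decorated function can still be called directly on local data in tests:

import unittest
import pandas as pd

# Hypothetical module where the merge_clusters function from this README lives
from my_modules import merge_clusters

class FakeData:
  def __init__(self, M):
    self.M = M

class FakeClusters:
  '''Minimal stand-in for the clusters data structure used by merge_clusters.'''
  def __init__(self, M):
    self.data = FakeData(M)

class MergeClustersTest(unittest.TestCase):
  def test_merges_selected_columns(self):
    M = pd.DataFrame({'0': [1, 0], '1': [0, 1], '2': [1, 1]})
    result = merge_clusters(FakeClusters(M), cluster_ids=['0', '1'])
    # The merged column exists and the originals are gone
    self.assertIn('new', list(result.data.M.columns))
    self.assertNotIn('0', list(result.data.M.columns))

if __name__ == '__main__':
  unittest.main()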

Build Documentation

(Still not working properly)

cd docs/website
yarn install
yarn build

Roadmap

MVP

  • Run pipelines - Done
  • Upload modules and data structures to the cloud - Done
  • Upload data
  • soil cli with operations: login, init and run
  • Logging API
  • Documentation

Upcoming

  • Pipeline basic parallelization (using Dask)

More stuff

  • Expose a parallelization API (be able to split modules into tasks)
  • Federated learning API
  • Modulify containers (modules can be Docker containers instead of code)

Similar tools
