
SOIL SDK

The SOIL SDK allows users to develop and test applications that run on top of SOIL, as well as modules and data structures that run inside it.

Quick start

Install

pip install soil-sdk

Authentication

soil login

Data Load

import soil
import numpy as np

# To use data already indexed in SOIL, reference it by its id
data = soil.data(data_id)

# Or use local data, e.g. a numpy array
d = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
# This will upload the data
data = soil.data(d)

Data transformation and data exploration

import soil
from soil.modules.preprocessing import row_filter
from soil.modules.itemsets import frequent_itemsets, hypergraph

from my_favourite_graph_library import draw_graph

...

data = soil.data(d)
rf1 = row_filter(data, age={'gt': 60})
rf2 = row_filter(rf1, diseases={'has': {'code': {'regexp': '401.*'}}})
fis = frequent_itemsets(rf2, min_support=10, max_itemset_size=2)
hg = hypergraph(fis)

subgraph = hg.get_data(center_node='401.09', distance=2)

draw_graph(subgraph)

Alternatively, in dplyr-style pipe syntax:

...
hg = (soil.data(d)
  >> row_filter(age={'gt': 60})
  >> row_filter(diseases={'has': {'code': {'regexp': '401.*'}}})
  >> frequent_itemsets(min_support=10, max_itemset_size=2)
  >> hypergraph())
...

It is possible to mix custom code with pipelines.

import soil
from soil.modules.preprocessing import row_filter
from soil.modules.clustering import nb_clustering
from soil.modules.higher_order import predict
from soil.modules.statistics import statistics
...
@soil.modulify
def merge_clusters(clusters, cluster_ids=()):
  '''
  Merge the clusters in cluster_ids into one.
  '''
  cluster_ids = list(cluster_ids)
  M = clusters.data.M
  # Sum the selected cluster columns into a new one
  M['new'] = M[cluster_ids].sum(axis=1)
  # Drop the columns that were merged
  M = M.drop(columns=cluster_ids)
  clusters.data.M = M
  return clusters

data = soil.data(d)
clusters = nb_clustering(data, num_clusters=4)
merged_clusters = merge_clusters(clusters, ['0', '1'])
assigned = predict(merged_clusters, data, assigments_attribute='assigments')
per_cluster_mean_age = statistics(assigned,
  operations=[{
    'fn': 'mean',
    'partition_variables': ['assigments'],
    'aggregation_variable': 'age'
  }])

print(per_cluster_mean_age)

The same, in dplyr-style pipe syntax:

...
per_cluster_mean_age = (nb_clustering(data, num_clusters=4)
  >> merge_clusters(cluster_ids=['0', '1'])
  >> predict(None, data, assigments_attribute='assigments')
  >> statistics(operations=[{
    'fn': 'mean',
    'partition_variables': ['assigments'],
    'aggregation_variable': 'age'
  }]))
...

Aliases

You can define aliases for your trained models with soil.alias('my_alias', model) so that they can be retrieved from another program. This comes in handy in continuous-learning environments where a new model is produced every hour or day and a separate service makes predictions in real time.

def do_every_hour():
  # Get the old model
  old_model = soil.data('my_model')
  # Retrieve the dataset with an alias we have set before
  dataset = soil.data('my_dataset')
  # Retrieve the data that has arrived in the last hour
  new_data = row_filter(dataset, date={'gte': 'now-1h'})
  # Train the new model
  new_model = a_continuous_training_algorithm(old_model, new_data)
  # Set the alias
  soil.alias('my_model', new_model)
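
On the consumer side, a real-time service can resolve the alias on every request so it always predicts with the latest model. A minimal sketch using the modules from earlier in this README (the handler and the shape of the incoming data are illustrative, not part of the SDK):

import soil
from soil.modules.higher_order import predict

def predict_request_handler(incoming_rows):
  # Resolving the alias returns whatever model was most recently aliased
  model = soil.data('my_model')
  # Upload the incoming rows as a small chunk of data
  data = soil.data(incoming_rows)
  # Run the prediction in the cloud and fetch the result
  assigned = predict(model, data, assigments_attribute='assigments')
  return assigned.get_data()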

Design

The SOIL SDK will contain two parts:

  • SOIL library. Runs computations on the SOIL platform; essentially a wrapper on top of the SOIL REST API.
  • SOIL CLI. A terminal client for operations on the SOIL platform, such as uploading new modules and datasets and monitoring them.

Use cases

The SDK must cover two use cases, which can overlap:

  • Build an app on top of SOIL using algorithms and data from the cloud.
  • Create modules and data structures that will live in the cloud.

Design principles

  • It shouldn't matter where the data comes from: disk, a database, S3, ...
  • Compatibility with common data formats: NumPy, Pandas, and native Python types (lists, dicts, generators).
  • Lazy evaluation. Computation doesn't take place until the data is required, and pipelines run 100% in the cloud. If there are partial results they can be reused when possible (not in the first version). See the first sketch after this list.
  • Type annotations (https://docs.python.org/3/library/typing.html). We will use them; they prevent silly (and non-silly) bugs.
  • The import system is tweaked to dynamically import modules that live in the cloud, so they appear to be imported locally. https://docs.python.org/3/reference/import.html
  • User modules are annotated with @modulify, which uploads them temporarily to the cloud. https://github.com/cloudpipe/cloudpickle
  • We assume the code comes from trusted sources. In time we could add sandboxing: https://bytecodealliance.github.io/wasmtime/security-sandboxing.html
  • The user can upload small chunks of data for testing using soil.data().
  • User modules should be unit-testable with Python's unittest (see the test sketch after this list).
  • Integration tests of custom modules and SOIL applications should also be possible.
  • In time it should be possible to mock SOIL for testing as well.
  • Modules and data structures can be permanently uploaded to the cloud (possibly under a namespace).
  • It should be possible to develop from an interactive notebook.
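
As an illustration of the lazy-evaluation and cloud-import principles, the pipeline below (built from the modules used earlier in this README) only describes the computation; nothing runs until the result is requested. A minimal sketch, assuming data_id refers to a dataset already indexed in SOIL:

import soil
# These imports resolve modules that live in the cloud
from soil.modules.preprocessing import row_filter
from soil.modules.itemsets import frequent_itemsets, hypergraph

# Each step returns a lazy reference; nothing has been computed yet
pipeline = (soil.data(data_id)
  >> row_filter(age={'gt': 60})
  >> frequent_itemsets(min_support=10, max_itemset_size=2)
  >> hypergraph())

# Only this call triggers execution in the cloud and downloads the result
subgraph = pipeline.get_data(center_node='401.09', distance=2)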
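
For the unit-testing principle, a custom module like merge_clusters above could be exercised locally with a small stub. A sketch with unittest, assuming the @modulify-decorated function can still be called directly on local data in tests:

import unittest
import pandas as pd

# Hypothetical module where the merge_clusters function from this README lives
from my_modules import merge_clusters

class FakeData:
  def __init__(self, M):
    self.M = M

class FakeClusters:
  '''Minimal stand-in for the clusters data structure used by merge_clusters.'''
  def __init__(self, M):
    self.data = FakeData(M)

class MergeClustersTest(unittest.TestCase):
  def test_merges_selected_columns(self):
    M = pd.DataFrame({'0': [1, 0], '1': [0, 1], '2': [1, 1]})
    result = merge_clusters(FakeClusters(M), cluster_ids=['0', '1'])
    # The merged column exists and the originals are gone
    self.assertIn('new', list(result.data.M.columns))
    self.assertNotIn('0', list(result.data.M.columns))

if __name__ == '__main__':
  unittest.main()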

Build Documentation

(Still not working properly)

cd docs/website
yarn install
yarn build

Roadmap

MVP

  • Run pipelines - Done
  • Upload modules and data structures to the cloud - Done
  • Upload data
  • soil cli with operations: login, init and run
  • Logging API
  • Documentation

Upcoming

  • Pipeline basic parallelization (using Dask)

More stuff

  • Expose a parallelization API (be able to split modules into tasks)
  • Federated learning API
  • Modulify containers (modules can be Docker containers instead of code)

Similar tools
