SOIL Software Development Kit
The SOIL SDK allows users to develop and test applications that run on top of SOIL, as well as modules and data structures that run inside it.
Quick start
Install
pip install soil-sdk
Authentication
soil login
Data Load
import soil
# To use data already indexed in Soil
data = soil.data(dataId)
import soil
import numpy as np
# Or numpy
d = np.array([[1,2,3,4], [5,6,7,8]])
# This will upload the data
data = soil.data(d)
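The design principles below also list Pandas among the supported formats, so a DataFrame can presumably be passed to soil.data in the same way. A sketch of preparing one (the upload call is commented out because it needs a live SOIL platform; only the frame construction runs locally):

```python
import pandas as pd

# Build a small frame; in practice it could come from disk, a DB, s3, ...
df = pd.DataFrame({
    'age': [65, 42, 71],
    'diseases': [['401.0'], [], ['401.9']],
})

# Hypothetical upload, mirroring the numpy example above:
# data = soil.data(df)
```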
Data transformation and data exploration
import soil
from soil.modules.preprocessing import row_filter
from soil.modules.itemsets import frequent_itemsets, hypergraph
from my_favourite_graph_library import draw_graph
...
data = soil.data(d)
rf1 = row_filter(data, age={'gt': 60})
rf2 = row_filter(rf1, diseases={'has': {'code': {'regexp': '401.*'}}})
fis = frequent_itemsets(rf2, min_support=10, max_itemset_size=2)
hg = hypergraph(fis)
subgraph = hg.get_data(center_node='401.09', distance=2)
draw_graph(subgraph)
Alternative dplyr style:
...
hg = (soil.data(d)
      >> row_filter(age={'gt': 60})
      >> row_filter(diseases={'has': {'code': {'regexp': '401.*'}}})
      >> frequent_itemsets(min_support=10, max_itemset_size=2)
      >> hypergraph())
...
It is possible to mix custom code with pipelines.
import soil
from soil.modules.preprocessing import row_filter
from soil.modules.clustering import nb_clustering
from soil.modules.higher_order import predict
from soil.modules.statistics import statistics
...
@soil.modulify
def merge_clusters(clusters, cluster_ids=None):
    '''
    Merge the clusters in cluster_ids into one.
    '''
    cluster_ids = cluster_ids or []
    M = clusters.data.M
    # Sum the selected cluster columns into a new merged column
    M['new'] = M[cluster_ids].sum(axis=1)
    # Drop the original columns that were merged
    M = M.drop(columns=cluster_ids)
    clusters.data.M = M
    return clusters
data = soil.data(d)
clusters = nb_clustering(data, num_clusters=4)
merged_clusters = merge_clusters(clusters, ['0', '1'])
assigned = predict(merged_clusters, data, assigments_attribute='assigments')
per_cluster_mean_age = statistics(assigned,
    operations=[{
        'fn': 'mean',
        'partition_variables': ['assigments'],
        'aggregation_variable': 'age',
    }])
print(per_cluster_mean_age)
dplyr style:
...
per_cluster_mean_age = (nb_clustering(data, num_clusters=4)
    >> merge_clusters(['0', '1'])
    >> predict(None, data, assigments_attribute='assigments')
    >> statistics(operations=[{
        'fn': 'mean',
        'partition_variables': ['assigments'],
        'aggregation_variable': 'age',
    }]))
...
Aliases
You can define aliases for your trained models with soil.alias('my_alias', model) so they can be called from another program. This comes in handy in continuous learning environments, where a new model is produced every day or hour and another service does predictions in real time.
def do_every_hour():
# Get the old model
old_model = soil.data('my_model')
# Retrieve the dataset with an alias we have set before
dataset = soil.data('my_dataset')
# Retrieve the data that has arrived in the last hour
new_data = row_filter(dataset, date={'gte': 'now-1h'})
# Train the new model
new_model = a_continuous_training_algorithm(old_model, new_data)
# Set the alias
soil.alias('my_model', new_model)
Design
The SOIL SDK contains two parts.
- SOIL library. To run computations on the SOIL platform; essentially a wrapper on top of the SOIL REST API.
- SOIL CLI. A terminal client for operations on the SOIL platform, such as uploading new modules and datasets and monitoring them.
Use cases
The SDK must cover two use cases that can overlap.
- Build an app on top of SOIL using algorithms and data from the cloud.
- Create modules and data structures that will live in the cloud.
Design principles
- It shouldn't matter where the data comes from: disk, a DB, s3, ...
- Compatibility with typical data formats: NumPy, Pandas, and Python built-ins (lists, dicts, generators).
- Lazy evaluation. Computation doesn't take place until the data is required. Pipelines run 100% in the cloud. If there are partial results, they can be reused when possible (not in the first version).
- Type annotations (https://docs.python.org/3/library/typing.html). We will use them; they prevent silly (and non-silly) bugs.
- The import system is tweaked to dynamically import modules that exist in the cloud, so they appear to be imported locally. https://docs.python.org/3/reference/import.html
- User modules are annotated with @modulify, which uploads them temporarily to the cloud. https://github.com/cloudpipe/cloudpickle
- We will assume that the code comes from trusted sources. In time we could use this https://bytecodealliance.github.io/wasmtime/security-sandboxing.html
- The user can upload small chunks of data for testing using soil.data()
- User modules should be testable with unit tests using Python's unittest.
- Integration tests of custom modules and SOIL applications should also be possible.
- In time it should be possible to mock SOIL for testing as well.
- Modules and data structures can be permanently uploaded to the cloud (possibly under a namespace).
- It should be possible to develop from an interactive notebook.
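As a sketch of the unit-testing principle above: assuming a @soil.modulify module wraps an ordinary function, its core logic can be factored into a plain helper and tested locally with unittest, no platform needed (merge_ids and its behaviour are hypothetical, for illustration only):

```python
import unittest

def merge_ids(cluster_ids, all_ids):
    # Hypothetical pure helper: the kind of logic a @soil.modulify
    # module would wrap. Keeps the unmerged ids and appends one
    # combined id for the merged clusters.
    merged = [i for i in all_ids if i not in cluster_ids]
    merged.append('+'.join(cluster_ids))
    return merged

class TestMergeIds(unittest.TestCase):
    def test_merges_two_clusters(self):
        self.assertEqual(merge_ids(['0', '1'], ['0', '1', '2']),
                         ['2', '0+1'])

# Run the suite programmatically (works inside notebooks too).
result = unittest.TextTestRunner().run(
    unittest.defaultTestLoader.loadTestsFromTestCase(TestMergeIds))
```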
Build Documentation
(Still not working properly)
cd docs/website
yarn install
yarn build
Roadmap
MVP
- Run pipelines - Done
- Upload modules and data structures to the cloud - Done
- Upload data
- soil cli with operations: login, init and run
- Logging API
- Documentation
Upcoming
- Pipeline basic parallelization (using Dask)
More stuff
- Expose parallelization API (be able to split modules in tasks)
- Federated learning API
- Modulify containers (modules can be Docker containers instead of code)
Similar tools