# SOIL Software Development Kit
The SOIL SDK allows users to develop and test applications that run on top of SOIL, as well as modules and data structures that run inside it.
## Documentation

The main documentation page is here: https://developer.amalfianalytics.com/
## Quick start

### Install

```shell
pip install soil-sdk
```

### Authentication

```shell
soil login
```
### Data load

```python
import soil

# To use data already indexed in Soil
data = soil.data(dataId)
```

```python
import soil
import numpy as np

# Or use numpy: this will upload the data to Soil
d = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
data = soil.data(d)
```
### Data transformation and data exploration

```python
import soil
from soil.modules.preprocessing import row_filter
from soil.modules.itemsets import frequent_itemsets, hypergraph
from my_favourite_graph_library import draw_graph

...

data = soil.data(d)
rf1 = row_filter(data, age={'gt': 60})
rf2 = row_filter(rf1, diseases={'has': {'code': {'regexp': '401.*'}}})
fis = frequent_itemsets(rf2, min_support=10, max_itemset_size=2)
hg = hypergraph(fis)
subgraph = hg.get_data(center_node='401.09', distance=2)
draw_graph(subgraph)
```
Alternate dplyr style:

```python
...
hg = (soil.data(d) >>
      row_filter(age={'gt': 60}) >>
      row_filter(diseases={'has': {'code': {'regexp': '401.*'}}}) >>
      frequent_itemsets(min_support=10, max_itemset_size=2) >>
      hypergraph())
...
```
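The `>>` composition reads like dplyr's pipe. In Python this style can be achieved by overloading the `__rshift__` operator; the sketch below illustrates the general idea only and is not the SDK's actual implementation (the `Step` class and the toy steps are invented for the example):

```python
class Step:
    """A pipeline step; chaining with >> composes the wrapped functions."""

    def __init__(self, fn):
        self.fn = fn

    def __rshift__(self, other):
        # Feed this step's output into the next step
        return Step(lambda x: other.fn(self.fn(x)))

    def __call__(self, x):
        return self.fn(x)

double = Step(lambda x: x * 2)
increment = Step(lambda x: x + 1)
pipeline = double >> increment
print(pipeline(10))  # 21
```

Note that a multi-line `>>` chain in Python must be wrapped in parentheses (or use line continuations), which is why the pipelines above are parenthesized.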
It is possible to mix custom code with pipelines.
```python
import soil
from soil.modules.preprocessing import row_filter
from soil.modules.clustering import nb_clustering
from soil.modules.higher_order import predict
from soil.modules.statistics import statistics

...

@soil.modulify
def merge_clusters(clusters, cluster_ids=()):
    '''
    Merge the clusters in cluster_ids into one.
    '''
    M = clusters.data.M
    # Sum the selected cluster columns into a new one, then drop the originals
    M['new'] = M[list(cluster_ids)].sum(axis=1)
    M = M.drop(columns=list(cluster_ids))
    clusters.data.M = M
    return clusters

data = soil.data(d)
clusters = nb_clustering(data, num_clusters=4)
merged_clusters = merge_clusters(clusters, ['0', '1'])
assigned = predict(merged_clusters, data, assigments_attribute='assigments')
per_cluster_mean_age = statistics(assigned,
                                  operations=[{
                                      'fn': 'mean',
                                      'partition_variables': ['assigments'],
                                      'aggregation_variable': 'age'
                                  }])
print(per_cluster_mean_age)
```
dplyr style:

```python
...
per_cluster_mean_age = (nb_clustering(data, num_clusters=4) >>
                        merge_clusters(['0', '1']) >>
                        predict(None, data, assigments_attribute='assigments') >>
                        statistics(operations=[{
                            'fn': 'mean',
                            'partition_variables': ['assigments'],
                            'aggregation_variable': 'age'
                        }]))
...
```
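Conceptually, the statistics step above is a grouped aggregation: partition the rows by the assignment variable, then average the aggregation variable inside each partition. In plain Python (independent of Soil, with made-up sample rows) it amounts to:

```python
from collections import defaultdict

rows = [
    {'assigments': '0', 'age': 70},
    {'assigments': '0', 'age': 80},
    {'assigments': '2', 'age': 66},
]

# Partition by the cluster assignment, then average the aggregation variable
groups = defaultdict(list)
for row in rows:
    groups[row['assigments']].append(row['age'])

per_cluster_mean_age = {k: sum(v) / len(v) for k, v in groups.items()}
print(per_cluster_mean_age)  # {'0': 75.0, '2': 66.0}
```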
## Aliases

You can define aliases for your trained models with `soil.alias('my_alias', model)` so they can be called from another program. This comes in handy in continuous learning environments where a new model is produced every day or hour and another service makes predictions in real time.
```python
import soil
from soil.modules.preprocessing import row_filter

def do_every_hour():
    # Get the old model
    old_model = soil.data('my_model')
    # Retrieve the dataset with an alias we have set before
    dataset = soil.data('my_dataset')
    # Retrieve the data that has arrived in the last hour
    new_data = row_filter(dataset, date={'gte': 'now-1h'})
    # Train the new model (a_continuous_training_algorithm is a placeholder
    # for your own training module)
    new_model = a_continuous_training_algorithm(old_model, new_data)
    # Set the alias
    soil.alias('my_model', new_model)
```
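Conceptually, an alias is a mutable name pointing at an immutable result, so retraining simply repoints the name while consumers keep resolving it. A toy model of that idea (invented names, not the SDK's implementation):

```python
# Toy model: aliases map a stable name to the id of the latest result
aliases = {}

def set_alias(name, result_id):
    aliases[name] = result_id

def resolve(name):
    return aliases[name]

set_alias('my_model', 'model-10h00')
set_alias('my_model', 'model-11h00')  # hourly retraining repoints the name
print(resolve('my_model'))  # model-11h00
```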
## Design

The SOIL SDK has two parts:

- SOIL library: runs computations on the SOIL platform. It is essentially a wrapper on top of the SOIL REST API.
- SOIL CLI: a terminal client for operations on the SOIL platform, such as uploading new modules and datasets and monitoring them.
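The shape of such a REST wrapper can be sketched as a client that builds URLs and decodes JSON responses. Everything below (the class name, the endpoint path, the payload) is hypothetical and chosen only for illustration; the transport is injected so the sketch runs without a network:

```python
import json

class SoilClient:
    """Hypothetical sketch of a thin REST wrapper; not the real SOIL client."""

    def __init__(self, base_url, transport):
        self.base_url = base_url
        # transport is a callable(url) -> response body, injected so the
        # sketch can be exercised without real HTTP
        self.transport = transport

    def data(self, data_id):
        # Fetch a result by id from a made-up endpoint
        body = self.transport(f"{self.base_url}/api/v1/results/{data_id}")
        return json.loads(body)

# Exercise the sketch with a stub transport instead of real HTTP
stub = lambda url: json.dumps({'id': url.rsplit('/', 1)[-1], 'type': 'data'})
client = SoilClient('https://soil.example.com', stub)
print(client.data('abc123'))  # {'id': 'abc123', 'type': 'data'}
```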
## Use cases

The SDK must cover two use cases that can overlap:

- Build an app on top of SOIL using algorithms and data from the cloud.
- Create modules and data structures that will live in the cloud.
## Build Documentation

```shell
cd docs/website
yarn install
yarn build
```

Publish a new version:

```shell
yarn run version x.y.z
```

Where x.y.z is the version name in semver.
## Roadmap

### MVP

- Run pipelines - Done
- Upload modules and data structures to the cloud - Done
- Upload data - Done
- soil cli with operations: login, init and run
- Logging API - Done
- Documentation - Done

### Upcoming

- Pipeline basic parallelization

### More stuff

- Expose the parallelization API (be able to split modules into tasks)
- Federated learning API
- Modulify containers (modules can be Docker containers instead of code)