learningOrchestra-python-client
Python client for learningOrchestra.
Installation
Requires Python 3.x
pip install learning-orchestra-client
Usage
Import learning_orchestra_client:
from learning_orchestra_client import *
Create a Context object, passing an IP from your cluster:
cluster_ip = "34.95.222.197"
Context(cluster_ip)
After creating the Context object, you will be able to use learningOrchestra.
Each functionality in learningOrchestra is contained in its own class. Check below for all the available function APIs.
Example
Shown below is an example of using learning-orchestra-client with the Titanic dataset:
from learning_orchestra_client import *

cluster_ip = "34.95.187.26"
Context(cluster_ip)

database_api = DatabaseApi()

print(database_api.create_file(
    "titanic_training",
    "https://filebin.net/rpfdy8clm5984a4c/titanic_training.csv?t=gcnjz1yo"))
print(database_api.create_file(
    "titanic_testing",
    "https://filebin.net/mguee52ke97k0x9h/titanic_testing.csv?t=ub4nc1rc"))
print(database_api.read_resume_files())

projection = Projection()
non_required_columns = ["Name", "Ticket", "Cabin",
                        "Embarked", "Sex", "Initial"]
print(projection.create("titanic_training",
                        "titanic_training_projection",
                        non_required_columns))
print(projection.create("titanic_testing",
                        "titanic_testing_projection",
                        non_required_columns))

data_type_handler = DataTypeHandler()
type_fields = {
    "Age": "number",
    "Fare": "number",
    "Parch": "number",
    "PassengerId": "number",
    "Pclass": "number",
    "SibSp": "number"
}
print(data_type_handler.change_file_type(
    "titanic_testing_projection",
    type_fields))
type_fields["Survived"] = "number"
print(data_type_handler.change_file_type(
    "titanic_training_projection",
    type_fields))

preprocessing_code = '''
from pyspark.ml import Pipeline
from pyspark.sql.functions import (
    mean, col, split,
    regexp_extract, when, lit)
from pyspark.ml.feature import (
    VectorAssembler,
    StringIndexer
)

TRAINING_DF_INDEX = 0
TESTING_DF_INDEX = 1

training_df = training_df.withColumnRenamed('Survived', 'label')
testing_df = testing_df.withColumn('label', lit(0))
datasets_list = [training_df, testing_df]

for index, dataset in enumerate(datasets_list):
    dataset = dataset.withColumn(
        "Initial",
        regexp_extract(col("Name"), "([A-Za-z]+)\.", 1))
    datasets_list[index] = dataset

misspelled_initials = [
    'Mlle', 'Mme', 'Ms', 'Dr',
    'Major', 'Lady', 'Countess',
    'Jonkheer', 'Col', 'Rev',
    'Capt', 'Sir', 'Don'
]
correct_initials = [
    'Miss', 'Miss', 'Miss', 'Mr',
    'Mr', 'Mrs', 'Mrs',
    'Other', 'Other', 'Other',
    'Mr', 'Mr', 'Mr'
]
for index, dataset in enumerate(datasets_list):
    dataset = dataset.replace(misspelled_initials, correct_initials)
    datasets_list[index] = dataset

initials_age = {"Miss": 22,
                "Other": 46,
                "Master": 5,
                "Mr": 33,
                "Mrs": 36}
for index, dataset in enumerate(datasets_list):
    for initial, initial_age in initials_age.items():
        dataset = dataset.withColumn(
            "Age",
            when((dataset["Initial"] == initial) &
                 (dataset["Age"].isNull()), initial_age).otherwise(
                     dataset["Age"]))
    datasets_list[index] = dataset

for index, dataset in enumerate(datasets_list):
    dataset = dataset.na.fill({"Embarked": 'S'})
    datasets_list[index] = dataset

for index, dataset in enumerate(datasets_list):
    dataset = dataset.withColumn("Family_Size", col('SibSp')+col('Parch'))
    dataset = dataset.withColumn('Alone', lit(0))
    dataset = dataset.withColumn(
        "Alone",
        when(dataset["Family_Size"] == 0, 1).otherwise(dataset["Alone"]))
    datasets_list[index] = dataset

text_fields = ["Sex", "Embarked", "Initial"]
for column in text_fields:
    for index, dataset in enumerate(datasets_list):
        dataset = StringIndexer(
            inputCol=column, outputCol=column+"_index").\
            fit(dataset).\
            transform(dataset)
        datasets_list[index] = dataset

non_required_columns = ["Name", "Embarked", "Sex", "Initial"]
for index, dataset in enumerate(datasets_list):
    dataset = dataset.drop(*non_required_columns)
    datasets_list[index] = dataset

training_df = datasets_list[TRAINING_DF_INDEX]
testing_df = datasets_list[TESTING_DF_INDEX]

assembler = VectorAssembler(
    inputCols=training_df.columns[:],
    outputCol="features")
assembler.setHandleInvalid('skip')
features_training = assembler.transform(training_df)
(features_training, features_evaluation) =\
    features_training.randomSplit([0.8, 0.2], seed=33)
features_testing = assembler.transform(testing_df)
'''

model_builder = Model()
print(model_builder.create_model(
    "titanic_training_projection",
    "titanic_testing_projection",
    preprocessing_code,
    ["lr", "dt", "gb", "rf", "nb"]))
Function APIs
Database API
read_resume_files
read_resume_files(pretty_response=True)
- pretty_response: returns an indented string for visualization if True (default); returns a dict if False
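Every call that accepts pretty_response can also return a plain Python structure. A minimal sketch, reusing the cluster IP from the example above:

from learning_orchestra_client import *

Context("34.95.187.26")
database_api = DatabaseApi()

# With pretty_response=False the call returns a dict instead of an
# indented string, which is easier to process programmatically.
files_metadata = database_api.read_resume_files(pretty_response=False)
print(files_metadata)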
read_file
read_file(filename, skip=0, limit=10, query={}, pretty_response=True)
- filename: name of the file
- skip: number of rows to skip in pagination (default: 0)
- limit: number of rows to return in pagination (default: 10; the maximum is 20 rows per request)
- query: query to run in MongoDB (default: empty query)
- pretty_response: returns an indented string for visualization if True (default); returns a dict if False
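A sketch of paginated reading, assuming the titanic_training file created in the example above; the MongoDB-style filter on Pclass is purely illustrative:

from learning_orchestra_client import *

Context("34.95.187.26")
database_api = DatabaseApi()

# Skip the first 10 rows and read the next 10, filtering with a
# MongoDB-style query (here, an illustrative filter on first-class rows).
print(database_api.read_file(
    "titanic_training",
    skip=10,
    limit=10,
    query={"Pclass": 1}))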
create_file
create_file(filename, url, pretty_response=True)
- filename: name of the file to be created
- url: URL of the CSV file
- pretty_response: returns an indented string for visualization if True (default); returns a dict if False
delete_file
delete_file(filename, pretty_response=True)
- filename: name of the file to be deleted
- pretty_response: returns an indented string for visualization if True (default); returns a dict if False
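A minimal sketch of removing a stored file, using one of the file names created in the example above:

from learning_orchestra_client import *

Context("34.95.187.26")
database_api = DatabaseApi()

# Remove the titanic_testing file created earlier.
print(database_api.delete_file("titanic_testing"))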
Projection API
create_projection
create_projection(filename, projection_filename, fields, pretty_response=True)
- filename: name of the source file for the projection
- projection_filename: name of the file that will hold the created projection
- fields: list of fields to project
- pretty_response: returns an indented string for visualization if True (default); returns a dict if False
Data type handler API
change_file_type
change_file_type(filename, fields_dict, pretty_response=True)
- filename: name of the file
- fields_dict: dictionary mapping each field to "number" or "string"
- pretty_response: returns an indented string for visualization if True (default); returns a dict if False
Histogram API
create_histogram
create_histogram(filename, histogram_filename, fields, pretty_response=True)
- filename: name of the source file for the histogram
- histogram_filename: name of the file that will hold the created histogram
- fields: list of fields to build the histogram from
- pretty_response: returns an indented string for visualization if True (default); returns a dict if False
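A sketch of creating a histogram. This page does not name the class that exposes this call, so Histogram below is an assumption made by analogy with DatabaseApi, Projection and DataTypeHandler; the chosen fields are also illustrative:

from learning_orchestra_client import *

Context("34.95.187.26")

# Histogram is an assumed class name, by analogy with the other APIs.
histogram = Histogram()
print(histogram.create_histogram(
    "titanic_training_projection",
    "titanic_training_histogram",
    ["Pclass", "Survived"]))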
t-SNE API
create_image_plot
create_image_plot(tsne_filename, parent_filename, label_name=None, pretty_response=True)
- tsne_filename: name of the file that will hold the created image plot
- parent_filename: name of the source file used to create the image plot
- label_name: label name for datasets with labeled tuples (default: None, for datasets without labeled tuples)
- pretty_response: returns an indented string for visualization if True (default); returns a dict if False
read_image_plot_filenames
read_image_plot_filenames(pretty_response=True)
- pretty_response: returns an indented string for visualization if True (default); returns a dict if False
read_image_plot
read_image_plot(tsne_filename, pretty_response=True)
- tsne_filename: filename of a created image plot
- pretty_response: returns an indented string for visualization if True (default); returns a dict if False
delete_image_plot
delete_image_plot(tsne_filename, pretty_response=True)
- tsne_filename: filename of a created image plot
- pretty_response: returns an indented string for visualization if True (default); returns a dict if False
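A sketch of the full t-SNE image plot life cycle. The class name Tsne is an assumption (this page does not state it), and the file names are illustrative:

from learning_orchestra_client import *

Context("34.95.187.26")

# Tsne is an assumed class name, by analogy with the other APIs.
tsne = Tsne()

# Create a plot from the projection built in the example above,
# using the Survived column as the label.
print(tsne.create_image_plot(
    "titanic_training_tsne_plot",
    "titanic_training_projection",
    label_name="Survived"))

# List, fetch and finally delete the created plot.
print(tsne.read_image_plot_filenames())
image_plot = tsne.read_image_plot("titanic_training_tsne_plot")
print(tsne.delete_image_plot("titanic_training_tsne_plot"))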
PCA API
create_image_plot
create_image_plot(pca_filename, parent_filename, label_name=None, pretty_response=True)
- pca_filename: name of the file that will hold the created image plot
- parent_filename: name of the source file used to create the image plot
- label_name: label name for datasets with labeled tuples (default: None, for datasets without labeled tuples)
- pretty_response: returns an indented string for visualization if True (default); returns a dict if False
read_image_plot_filenames
read_image_plot_filenames(pretty_response=True)
- pretty_response: returns an indented string for visualization if True (default); returns a dict if False
read_image_plot
read_image_plot(pca_filename, pretty_response=True)
- pca_filename: filename of a created image plot
- pretty_response: returns an indented string for visualization if True (default); returns a dict if False
delete_image_plot
delete_image_plot(pca_filename, pretty_response=True)
- pca_filename: filename of a created image plot
- pretty_response: returns an indented string for visualization if True (default); returns a dict if False
Model builder API
create_model
create_model(training_filename, test_filename, preprocessor_code, model_classificator, pretty_response=True)
- training_filename: name of the file to be used in training
- test_filename: name of the file to be used in testing
- preprocessor_code: Python 3 code for the PySpark preprocessing model
- model_classificator: list of initial classifiers to be used in the model
- pretty_response: returns an indented string for visualization if True (default); returns a dict if False
model_classificator
- lr: LogisticRegression
- dt: DecisionTreeClassifier
- rf: RandomForestClassifier
- gb: Gradient-boosted tree classifier
- nb: NaiveBayes

To send a request with the LogisticRegression and NaiveBayes classifiers:

create_model(training_filename, test_filename, preprocessor_code, ["lr", "nb"])
preprocessor_code environment
The Python 3 preprocessing code must use the environment instances as below:
- training_df (instantiated): Spark DataFrame built from the training file
- testing_df (instantiated): Spark DataFrame built from the testing file

The preprocessing code must instantiate the variables below; all of them must be transformed by the PySpark VectorAssembler:

- features_training (not instantiated): Spark DataFrame used to train the model
- features_evaluation (not instantiated): Spark DataFrame used to evaluate the trained model
- features_testing (not instantiated): Spark DataFrame used to test the model

If you don't want to evaluate the model, set features_evaluation to None.
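A minimal sketch of a valid preprocessor_code string under these rules; the column names feature_a and feature_b are placeholders, and the full Titanic example above shows a realistic version:

preprocessing_code = '''
from pyspark.ml.feature import VectorAssembler

# training_df and testing_df are already instantiated by learningOrchestra.
# feature_a and feature_b are placeholder column names.
assembler = VectorAssembler(
    inputCols=["feature_a", "feature_b"],
    outputCol="features")

features_training = assembler.transform(training_df)
features_testing = assembler.transform(testing_df)

# Skip model evaluation by setting features_evaluation to None.
features_evaluation = None
'''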
Handy methods
self.fields_from_dataframe(dataframe, is_string)
This method returns either the string fields or the number fields of a DataFrame as a list of strings.

- dataframe: DataFrame instance
- is_string: Boolean parameter (if True, the method returns the string fields of the DataFrame; otherwise it returns the number fields)
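A sketch of using this helper inside a preprocessor_code string. Since the method is documented as self.fields_from_dataframe, the assumption here is that self is available in the environment where the preprocessing code runs:

preprocessing_code = '''
from pyspark.ml.feature import VectorAssembler

# Collect the numeric columns of the already-instantiated training DataFrame.
numeric_fields = self.fields_from_dataframe(training_df, is_string=False)

assembler = VectorAssembler(
    inputCols=numeric_fields,
    outputCol="features")

features_training = assembler.transform(training_df)
features_testing = assembler.transform(testing_df)
features_evaluation = None
'''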