learningOrchestra python client
learningOrchestra client package
Installation
Ensure that you have Python 3 installed on your machine, then run:

```
pip install learning_orchestra_client
```
Documentation
After installing the package, import all classes:

```python
from learning_orchestra_client import *
```
Create a Context object, passing the IP address of your cluster as the constructor parameter:

```python
cluster_ip = "34.95.222.197"
Context(cluster_ip)
```
Once the Context object has been created, you can use learningOrchestra. Each learningOrchestra functionality is contained in its own class, so after instantiating and configuring Context, instantiate the class of interest and call its methods. All classes and their methods are documented below, followed by an example of a workflow using this package in a Python script.
DatabaseApi
read_resume_files
```python
read_resume_files(pretty_response=True)
```

- `pretty_response`: if `True` (default), returns an indented string for visualization; if `False`, returns a dict
read_file
```python
read_file(filename, skip=0, limit=10, query={}, pretty_response=True)
```

- `filename`: name of the file
- `skip`: number of rows to skip in pagination (default `0`)
- `limit`: number of rows to return in pagination (default `10`, maximum of `20` rows per request)
- `query`: MongoDB query (default: empty query)
- `pretty_response`: if `True` (default), returns an indented string for visualization; if `False`, returns a dict
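As a sketch of how the pagination and query arguments fit together, the snippet below builds a MongoDB-style filter and the matching `read_file` keyword arguments. The field names (`Age`, `Pclass`) are taken from the Titanic example dataset used later in this document; with a configured `Context`, they would be passed as `database_api.read_file("titanic_training", **read_args)`:

```python
# MongoDB-style filter: passengers older than 30 in first class.
# Field names ("Age", "Pclass") come from the Titanic example dataset.
query = {"Age": {"$gt": 30}, "Pclass": 1}

# Second page of results: skip the first 10 rows, return the next 10
# (limit is capped at 20 rows per request).
read_args = {"skip": 10, "limit": 10, "query": query}
print(read_args)
```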
create_file
```python
create_file(filename, url, pretty_response=True)
```

- `filename`: name of the file to be created
- `url`: URL of a CSV file
- `pretty_response`: if `True` (default), returns an indented string for visualization; if `False`, returns a dict
delete_file
```python
delete_file(filename, pretty_response=True)
```

- `filename`: name of the file to be deleted
- `pretty_response`: if `True` (default), returns an indented string for visualization; if `False`, returns a dict
Projection
create_projection
```python
create_projection(filename, projection_filename, fields, pretty_response=True)
```

- `filename`: name of the source file for the projection
- `projection_filename`: name of the file created to hold the projection
- `fields`: list of fields to include in the projection
- `pretty_response`: if `True` (default), returns an indented string for visualization; if `False`, returns a dict
DataTypeHandler
change_file_type
```python
change_file_type(filename, fields_dict, pretty_response=True)
```

- `filename`: name of the file
- `fields_dict`: dictionary mapping each field to `"number"` or `"string"`
- `pretty_response`: if `True` (default), returns an indented string for visualization; if `False`, returns a dict
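For example, a `fields_dict` marking two columns as numeric and one as text could look like the sketch below (the column names are illustrative, borrowed from the Titanic example):

```python
# fields_dict maps each column name to a desired type, using the
# "number" / "string" keys described above.
fields_dict = {
    "Age": "number",
    "Fare": "number",
    "Name": "string",
}
# Every value must be one of the two supported type keys.
assert set(fields_dict.values()) <= {"number", "string"}
print(fields_dict)
```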
Histogram
create_histogram
```python
create_histogram(filename, histogram_filename, fields, pretty_response=True)
```

- `filename`: name of the source file for the histogram
- `histogram_filename`: name of the file created to hold the histogram
- `fields`: list of fields to include in the histogram
- `pretty_response`: if `True` (default), returns an indented string for visualization; if `False`, returns a dict
Tsne
create_image_plot
```python
create_image_plot(tsne_filename, parent_filename,
                  label_name=None, pretty_response=True)
```

- `parent_filename`: name of the source file used to create the image plot
- `tsne_filename`: name of the file created to hold the image plot
- `label_name`: label name for datasets with labeled tuples (default `None`, for datasets without labeled tuples)
- `pretty_response`: if `True` (default), returns an indented string for visualization; if `False`, returns a dict
read_image_plot_filenames
```python
read_image_plot_filenames(pretty_response=True)
```

- `pretty_response`: if `True` (default), returns an indented string for visualization; if `False`, returns a dict
read_image_plot
```python
read_image_plot(tsne_filename, pretty_response=True)
```

- `tsne_filename`: filename of a created image plot
- `pretty_response`: if `True` (default), returns an indented string for visualization; if `False`, returns a dict
delete_image_plot
```python
delete_image_plot(tsne_filename, pretty_response=True)
```

- `tsne_filename`: filename of a created image plot
- `pretty_response`: if `True` (default), returns an indented string for visualization; if `False`, returns a dict
Pca
create_image_plot
```python
create_image_plot(pca_filename, parent_filename,
                  label_name=None, pretty_response=True)
```

- `parent_filename`: name of the source file used to create the image plot
- `pca_filename`: name of the file created to hold the image plot
- `label_name`: label name for datasets with labeled tuples (default `None`, for datasets without labeled tuples)
- `pretty_response`: if `True` (default), returns an indented string for visualization; if `False`, returns a dict
read_image_plot_filenames
```python
read_image_plot_filenames(pretty_response=True)
```

- `pretty_response`: if `True` (default), returns an indented string for visualization; if `False`, returns a dict
read_image_plot
```python
read_image_plot(pca_filename, pretty_response=True)
```

- `pca_filename`: filename of a created image plot
- `pretty_response`: if `True` (default), returns an indented string for visualization; if `False`, returns a dict
delete_image_plot
```python
delete_image_plot(pca_filename, pretty_response=True)
```

- `pca_filename`: filename of a created image plot
- `pretty_response`: if `True` (default), returns an indented string for visualization; if `False`, returns a dict
ModelBuilder
create_model
```python
create_model(training_filename, test_filename, preprocessor_code,
             model_classificator, pretty_response=True)
```

- `training_filename`: name of the file used for training
- `test_filename`: name of the file used for testing
- `preprocessor_code`: Python 3 code for PySpark model preprocessing
- `model_classificator`: list of classifier initials to use in the model
- `pretty_response`: if `True` (default), returns an indented string for visualization; if `False`, returns a dict
model_classificator
- `lr`: LogisticRegression
- `dt`: DecisionTreeClassifier
- `rf`: RandomForestClassifier
- `gb`: Gradient-boosted tree classifier
- `nb`: NaiveBayes
To send a request with the LogisticRegression and NaiveBayes classifiers:

```python
create_model(training_filename, test_filename, preprocessor_code, ["lr", "nb"])
```
preprocessor_code environment
The Python 3 preprocessing code must use the environment instances below:

- `training_df` (instantiated): Spark DataFrame instance for the training filename
- `testing_df` (instantiated): Spark DataFrame instance for the testing filename
The preprocessing code must instantiate the variables below; all instances must be transformed by the PySpark VectorAssembler:

- `features_training` (not instantiated): Spark DataFrame instance used to train the model
- `features_evaluation` (not instantiated): Spark DataFrame instance used to evaluate the trained model's accuracy
- `features_testing` (not instantiated): Spark DataFrame instance used to test the model
If you don't want to evaluate the model, set `features_evaluation` to `None`.
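A minimal `preprocessor_code` sketch that satisfies this contract might look like the string below. It assumes every column is already numeric and that the training label column is named `Survived` (as in the Titanic example in this document); `training_df` and `testing_df` are injected by the environment:

```python
# This string would be submitted as preprocessor_code. The environment
# provides training_df and testing_df; the three features_* variables
# must be left instantiated when the code finishes.
minimal_preprocessor = '''
from pyspark.ml.feature import VectorAssembler

training_df = training_df.withColumnRenamed("Survived", "label")
assembler = VectorAssembler(
    inputCols=[c for c in training_df.columns if c != "label"],
    outputCol="features")
features_training = assembler.transform(training_df)
features_evaluation = None  # skip model evaluation
features_testing = assembler.transform(testing_df)
'''
print(minimal_preprocessor)
```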
Handy methods
```python
self.fields_from_dataframe(dataframe, is_string)
```

This method returns the string or numeric fields of a DataFrame as a list of strings.

- `dataframe`: DataFrame instance
- `is_string`: boolean; if `True`, the method returns the string fields of the DataFrame, otherwise it returns the numeric fields
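As a rough sketch of that behavior, using plain `(name, type)` pairs in place of a real Spark DataFrame schema:

```python
def fields_from_dataframe_sketch(schema, is_string):
    """Approximates the helper: return string field names when
    is_string is True, numeric field names otherwise."""
    if is_string:
        return [name for name, dtype in schema if dtype == "string"]
    return [name for name, dtype in schema if dtype != "string"]

schema = [("Name", "string"), ("Age", "double"), ("Fare", "double")]
print(fields_from_dataframe_sketch(schema, True))   # ['Name']
print(fields_from_dataframe_sketch(schema, False))  # ['Age', 'Fare']
```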
learning_orchestra_client usage example
Below is a Python script that uses the package with the Titanic challenge datasets:
```python
from learning_orchestra_client import *

cluster_ip = "34.95.187.26"
Context(cluster_ip)

database_api = DatabaseApi()
print(database_api.create_file(
    "titanic_training",
    "https://filebin.net/rpfdy8clm5984a4c/titanic_training.csv?t=gcnjz1yo"))
print(database_api.create_file(
    "titanic_testing",
    "https://filebin.net/mguee52ke97k0x9h/titanic_testing.csv?t=ub4nc1rc"))
print(database_api.read_resume_files())

projection = Projection()
non_required_columns = ["Name", "Ticket", "Cabin",
                        "Embarked", "Sex", "Initial"]
print(projection.create_projection("titanic_training",
                                   "titanic_training_projection",
                                   non_required_columns))
print(projection.create_projection("titanic_testing",
                                   "titanic_testing_projection",
                                   non_required_columns))

data_type_handler = DataTypeHandler()
type_fields = {
    "Age": "number",
    "Fare": "number",
    "Parch": "number",
    "PassengerId": "number",
    "Pclass": "number",
    "SibSp": "number"
}
print(data_type_handler.change_file_type(
    "titanic_testing_projection",
    type_fields))
type_fields["Survived"] = "number"
print(data_type_handler.change_file_type(
    "titanic_training_projection",
    type_fields))

preprocessing_code = '''
from pyspark.ml import Pipeline
from pyspark.sql.functions import (
    mean, col, split,
    regexp_extract, when, lit)
from pyspark.ml.feature import (
    VectorAssembler,
    StringIndexer
)

TRAINING_DF_INDEX = 0
TESTING_DF_INDEX = 1

training_df = training_df.withColumnRenamed('Survived', 'label')
testing_df = testing_df.withColumn('label', lit(0))
datasets_list = [training_df, testing_df]

for index, dataset in enumerate(datasets_list):
    dataset = dataset.withColumn(
        "Initial",
        regexp_extract(col("Name"), "([A-Za-z]+)\.", 1))
    datasets_list[index] = dataset

misspelled_initials = [
    'Mlle', 'Mme', 'Ms', 'Dr',
    'Major', 'Lady', 'Countess',
    'Jonkheer', 'Col', 'Rev',
    'Capt', 'Sir', 'Don'
]
correct_initials = [
    'Miss', 'Miss', 'Miss', 'Mr',
    'Mr', 'Mrs', 'Mrs',
    'Other', 'Other', 'Other',
    'Mr', 'Mr', 'Mr'
]
for index, dataset in enumerate(datasets_list):
    dataset = dataset.replace(misspelled_initials, correct_initials)
    datasets_list[index] = dataset

initials_age = {"Miss": 22,
                "Other": 46,
                "Master": 5,
                "Mr": 33,
                "Mrs": 36}
for index, dataset in enumerate(datasets_list):
    for initial, initial_age in initials_age.items():
        dataset = dataset.withColumn(
            "Age",
            when((dataset["Initial"] == initial) &
                 (dataset["Age"].isNull()), initial_age).otherwise(
                dataset["Age"]))
    datasets_list[index] = dataset

for index, dataset in enumerate(datasets_list):
    dataset = dataset.na.fill({"Embarked": 'S'})
    datasets_list[index] = dataset

for index, dataset in enumerate(datasets_list):
    dataset = dataset.withColumn("Family_Size", col('SibSp') + col('Parch'))
    dataset = dataset.withColumn('Alone', lit(0))
    dataset = dataset.withColumn(
        "Alone",
        when(dataset["Family_Size"] == 0, 1).otherwise(dataset["Alone"]))
    datasets_list[index] = dataset

text_fields = ["Sex", "Embarked", "Initial"]
for column in text_fields:
    for index, dataset in enumerate(datasets_list):
        dataset = StringIndexer(
            inputCol=column, outputCol=column + "_index").\
            fit(dataset).\
            transform(dataset)
        datasets_list[index] = dataset

non_required_columns = ["Name", "Embarked", "Sex", "Initial"]
for index, dataset in enumerate(datasets_list):
    dataset = dataset.drop(*non_required_columns)
    datasets_list[index] = dataset

training_df = datasets_list[TRAINING_DF_INDEX]
testing_df = datasets_list[TESTING_DF_INDEX]

assembler = VectorAssembler(
    inputCols=training_df.columns[:],
    outputCol="features")
assembler.setHandleInvalid('skip')
features_training = assembler.transform(training_df)
(features_training, features_evaluation) = \
    features_training.randomSplit([0.8, 0.2], seed=33)
features_testing = assembler.transform(testing_df)
'''

model_builder = ModelBuilder()
print(model_builder.create_model(
    "titanic_training_projection",
    "titanic_testing_projection",
    preprocessing_code,
    ["lr", "dt", "gb", "rf", "nb"]))
```