
A light set of enablers based on Cloudframe's proprietary data science codebase.


The Cloudframe Data Scientist Simple Enabler

At Cloudframe we employ teams of Data Scientists, Data Engineers, and Software Developers. Check us out at http://cloudframe.io

If you're interested in joining our team as a Data Scientist, see the Bid Prediction Repo. There you'll find a fun problem and more info about our evergreen positions for Data Scientists, Data Engineers, and Software Developers.

This package contains convenience functions meant to help a Data Scientist get data into a format that is useful for training models. It is a light version of some of the proprietary enablers we use to deliver data-informed products to our clients.

Installation

pip install datascientist

Dependencies

In addition to the packages below, datascientist requires that you have the credentials and client software needed for the operations you perform. For example, to connect to an Oracle database you must install and configure Oracle Instant Client yourself; this package does not do that for you (see the sketch after this list).

  • pandas
  • numpy
  • psycopg2
  • mysql.connector
  • cx_Oracle
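
For instance, with cx_Oracle the driver must be pointed at your Instant Client installation before anything in datascientist can open a connection. A minimal sketch, where the lib_dir path is a placeholder for wherever you installed Instant Client:

import cx_Oracle

# You, not datascientist, configure the driver. The path below is a
# placeholder for your own Instant Client installation directory.
cx_Oracle.init_oracle_client(lib_dir='/opt/oracle/instantclient')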

Structure

data-scientist/
|
|-- connections/
|   |-- __init__.py
|   |-- connection_convenience.py
|   |-- rsconnect.py
|
|-- workflow/
|   |-- __init__.py
|   |-- tracker.py
|
|-- Manifest.in
|-- README.md
|-- setup.py
|-- bash_profile_example

Usage

connections.connection_convenience

A sample bash profile is provided for reference with values removed. Some of the functions will look for environment variables named according to the conventions there. If a function can't find them, it will prompt you for the appropriate strings. Strings set via prompts are NOT saved, for security reasons. If you set environment variables in a more permanent way, it's up to you to make sure they remain secure.
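
The lookup pattern described above amounts to something like the following sketch. The variable name CF_PG_CONNECTION is a hypothetical placeholder, not the package's actual convention; see bash_profile_example for the real names:

import os
from getpass import getpass

# Prefer the environment variable; fall back to an interactive prompt.
# CF_PG_CONNECTION is a hypothetical name used only for illustration.
conn_string = os.environ.get('CF_PG_CONNECTION') or getpass('Connection string: ')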

This module replicates the functionality of pandas.read_sql() but is a little friendlier: it handles the connection object for you while performing about the same according to %timeit.

import connections.connection_convenience as cc

sql = '''
select * from my_table
where my_field in ('cloud', 'frame');
'''

df = cc.pg2df(sql)

# input at the prompts if necessary

connections.rsconnect

This is a special case of connection_convenience for Redshift with a good deal more functionality. In addition to establishing connections and fetching data, this sub-module can do things like:

  • Infer the schema of your DataFrame
  • CREATE and DROP tables
  • WRITE data to a table
  • Perform an UPSERT operation
  • Get the names of tables in your cluster
  • Et cetera

For example, upsert data or write a new table:

import connections.rsconnect as rs

tname = 'my_table'

fields = rs.infer_schema(df)   # infer a Redshift schema from the DataFrame

# Stage the DataFrame on S3.
bucket, key = rs.df_to_s3(df,
                          bucket = 'my-bucket',
                          key = 'location/on/s3/my-file.csv',
                          primary = 'my_primary_key')

if rs.table_check(tname):
    # The table already exists: merge the staged data on the primary key.
    _ = rs.upsert_table(tname,
                        fields,
                        bucket = bucket,
                        key = key,
                        primary = 'my_primary_key')

else:
    # No such table yet: create it, then load the staged data from S3.
    _ = rs.create_table(tname,
                        fields,
                        primary = 'my_primary_key')
    _ = rs.write_data(tname,
                      bucket,
                      key)

Note also that the function to fetch data is rs.sql_to_df().
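
For example, assuming sql_to_df() accepts a SQL string and returns a DataFrame the way pg2df() does above (the table name here is a placeholder):

import connections.rsconnect as rs

# Assumes sql_to_df() takes a SQL string and returns a DataFrame,
# mirroring pg2df() above; the table name is a placeholder.
df = rs.sql_to_df('select * from my_table;')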

workflow.tracker

The workflow.tracker provides a lightweight tool for tracking a data science workflow. It is intended to help data scientists produce human-readable artifacts and to obviate the need for things like complex naming conventions to keep track of the state of modeling experiments. It also has features to enable reproducibility, iterative improvement, and model deployment in a cloud environment (AWS right now).

The fundamental object of this library is the Project class. A Project is, conceptually, a single effort to build a Machine Learning function to address a particular problem. Individual experiments are conceptualized as 'runs'. A Run covers the data science workflow from data conditioning (post ETL and feature generation) through model validation and testing.
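
To make the Project/Run relationship concrete, here is a sketch of how the two concepts might fit together. The method names (start_run, log_metric, end_run) are hypothetical placeholders, not the tracker's documented API; see the sample notebooks referenced below for the real interface:

import workflow.tracker as wt

# Hypothetical illustration only: the method names below are assumptions,
# not the tracker's documented API.
project = wt.Project('bid-prediction')   # one effort against one problem
run = project.start_run()                # one modeling experiment
run.log_metric('rmse', 0.42)             # record a validation result
run.end_run()                            # close out the experiment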

For more information and to learn how to use the Workflow Tracker, see the sample notebooks in the 'cloud-event-modeling' repository.

