A light set of enablers based on Ephemerai's proprietary data science codebase.
Project description
The Ephemerai Data Scientist Enabler
At Ephemerai we employ teams of Data Scientists, Data Engineers, and Software Developers. Check us out at http://ephemer.ai
If you're interested in joining our team as a Data Scientist, see here: Bid Prediction Repo. There you'll find a fun problem and more info about our evergreen positions for Data Scientists, Data Engineers, and Software Developers.
This package contains convenience functions meant to help a Data Scientist:
- get data into a format that is useful for training models,
- track experiments as a natural workflow, and
- use common cloud resources like AWS S3.
It is a light version of some of the proprietary enablers we use to deliver data-informed products to our clients. The workflow sub-module contains tracker, which is intended to support data science experimentation.
Installation
pip install datascientist
Dependencies
In addition to the following packages, datascientist requires that you have the credentials needed to perform the requested operations. For example, when connecting to a Redshift database you must have valid credentials stored either as environment variables (see the example bash profile) or in an rs_creds.json file located in your home directory; a hypothetical example of that file follows the dependency list below.
- pandas
- numpy
- psycopg2
- PyYAML
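The exact contents of rs_creds.json aren't documented here, so the key names below are assumptions based on a typical Redshift connection; treat this as a sketch, not the package's required schema.

```python
import json
from pathlib import Path

# Hypothetical rs_creds.json contents: every key name here is an
# assumption (host, port, dbname, user, password are the usual fields
# needed to reach a Redshift cluster).
creds = {
    "host": "my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    "port": 5439,           # Redshift's default port
    "dbname": "analytics",
    "user": "my_user",
    "password": "********",
}

# The README says the file lives in the home directory.
Path.home().joinpath("rs_creds.json").write_text(json.dumps(creds, indent=2))
```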
Structure
data-scientist/
|
|-- connections/
| |-- __init__.py
| |-- rsconnect.py
|
|-- workflow/
| |-- __init__.py
| |-- tracker.py
|
|-- special/
| |-- __init__.py
| |-- s3session.py
|
|-- Manifest.in
|-- README.md
|-- setup.py
|-- bash_profile_example
Usage
connections.rsconnect
A set of convenience functions for interacting with a Redshift database. In addition to merely establishing connections and fetching data, this sub-module can do things like:
- Infer the schema of your DataFrame
- CREATE and DROP tables
- WRITE data to a table
- Perform an UPSERT operation
- Get the names of tables in your cluster
- Et cetera
For example, upsert data or write a new table:

```python
import connections.rsconnect as rs

### Store a local DataFrame to S3 as a CSV
bucket, key = rs.df_to_s3(
    df,
    bucket = 'my-bucket',
    key = 'location/on/s3/my-file.csv',
    primary = 'my_primary_key'
)

### If the table exists, perform an upsert operation from the CSV.
### If it doesn't, create a table from the CSV and write the data to it.
tname = 'my_table'
fields = rs.infer_schema(df)

if rs.table_check(tname):
    _ = rs.upsert_table(
        tname,
        fields,
        bucket = bucket,
        key = key,
        primary = 'my_primary_key'
    )
else:
    _ = rs.create_table(
        tname,
        fields,
        primary = 'my_primary_key'
    )
    _ = rs.write_data(
        tname,
        bucket,
        key
    )
```
Note also that the function for fetching data into a DataFrame is rs.sql_to_df().
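For instance, a minimal fetch might look like this; the exact signature of rs.sql_to_df() isn't documented above, so the single-query-string call is an assumption:

```python
import connections.rsconnect as rs

# Assumed usage: rs.sql_to_df() takes a SQL string and returns a pandas
# DataFrame, authenticating via environment variables or ~/rs_creds.json
# as described in the Dependencies section.
df = rs.sql_to_df("SELECT * FROM my_table LIMIT 100")
print(df.head())
```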
workflow.tracker
The workflow.tracker sub-module provides a lightweight tool for tracking a data science workflow. It is intended to help data scientists produce human-readable artifacts and to obviate the need for things like complex naming conventions to keep track of the state of modeling experiments. It also has features that enable reproducibility, iterative improvement, and model deployment in a cloud environment (currently AWS).
The fundamental object of this library is the Project class. Conceptually, a Project is a single effort to build a Machine Learning function that addresses a particular problem. Individual experiments are conceptualized as 'runs': a Run covers the data science workflow from data conditioning (post-ETL and feature generation) through model validation and testing. A hypothetical sketch of that workflow is shown below.
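Since this README doesn't document the Project or Run API, every method name in the following sketch is an assumption about how such a tracker might be driven; consult the sample notebooks mentioned below for the real interface.

```python
from workflow.tracker import Project  # Project is the documented entry point

# Hypothetical workflow: the constructor and method names below are
# assumptions, not the package's actual API.
project = Project("churn-model")

run = project.new_run(notes="baseline logistic regression")  # assumed method
run.log_params({"C": 1.0, "penalty": "l2"})                  # assumed method
run.log_metric("auc", 0.87)                                  # assumed method
run.finish()                                                 # assumed method
```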
For more information and to learn how to use the Workflow Tracker, see the sample notebooks in the 'cloud-event-modeling' repository.
special.s3session
The special.s3session module contains a set of convenience functions for creating an S3 session with credentials, checking a bucket's existence, listing a bucket's objects, and the like.
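The wrappers themselves aren't documented here, but the raw boto3 calls that helpers like these typically build on look like the following sketch (the bucket name is a placeholder):

```python
import boto3
from botocore.exceptions import ClientError

# Create a session and client; credentials come from the environment or
# the standard AWS config files.
session = boto3.session.Session()
s3 = session.client("s3")

try:
    # Check a bucket's existence: head_bucket raises ClientError if the
    # bucket is missing or inaccessible.
    s3.head_bucket(Bucket="my-bucket")

    # List the bucket's objects.
    response = s3.list_objects_v2(Bucket="my-bucket")
    for obj in response.get("Contents", []):
        print(obj["Key"])
except ClientError as err:
    print(f"Bucket check failed: {err}")
```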
File details
Details for the file datascientist-0.2.7.tar.gz.
File metadata
- Download URL: datascientist-0.2.7.tar.gz
- Upload date:
- Size: 18.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | b85618637d2c50cbb148e1eeafc4ac13a6c34ded4e913b3e4e8c01c33b43a0f6
MD5 | 4a9634d00e3fc71547bf8601d97e4b61
BLAKE2b-256 | 8e99460645da24e598fc94fd9e7e01e70628a64b1cf018bb4c95a7f4a197dee8
File details
Details for the file datascientist-0.2.7-py3-none-any.whl.
File metadata
- Download URL: datascientist-0.2.7-py3-none-any.whl
- Upload date:
- Size: 20.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 946ba4af73f561369deabf9922d588f309a232532eec48a369e370c10e97d1f5
MD5 | a5ccc4a01bc5b9ea46f8e7b3ad9f58f8
BLAKE2b-256 | be87c028c4e5180d736d35e6c35f951d9ababb2f177348d86c54cd113df9b67d