
A light set of enablers based on Cloudframe's proprietary data science codebase.

Project description

The Ephemerai Data Scientist Enabler

At Ephemerai we employ teams of Data Scientists, Data Engineers, and Software Developers. Check us out at http://ephemer.ai

If you're interested in joining our team as a Data Scientist, see the Bid Prediction Repo, where you'll find a fun problem and more information about our evergreen positions for Data Scientists, Data Engineers, and Software Developers.

This package contains some convenience functions meant to help a Data Scientist:

  • get data into a format that is useful for training models,
  • track experiments as a natural workflow, and
  • use common cloud resources like AWS S3.

It is a light version of some of the proprietary enablers we use to deliver data-informed products to our clients. The workflow sub-module contains tracker, which is intended to support data science experimentation.

Installation

pip install datascientist

Dependencies

In addition to the packages listed below, datascientist requires that you have the credentials needed to perform the requested operations. For example, when connecting to a Redshift database you must have the correct credentials stored either as environment variables (see the example bash profile) or in an rs_creds.json file located in your home directory (a sketch of such a file follows the dependency list).

  • pandas
  • numpy
  • psycopg2
  • PyYAML
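
The README doesn't spell out the schema of rs_creds.json, so the field names in this sketch are assumptions based on what a typical Redshift connection needs; adjust them to whatever rsconnect actually reads.

import json
import pathlib

# Field names below are assumptions, not the package's documented schema.
creds = {
    'host': 'my-cluster.example.us-east-1.redshift.amazonaws.com',
    'port': 5439,
    'dbname': 'my_database',
    'user': 'my_user',
    'password': 'my_password',
}
pathlib.Path.home().joinpath('rs_creds.json').write_text(json.dumps(creds, indent=2))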

Structure

data-scientist/
|
|-- connections/
|   |-- __init__.py
|   |-- rsconnect.py
|
|-- workflow/
|   |-- __init__.py
|   |-- tracker.py
|
|-- special/
|   |-- __init__.py
|   |-- s3session.py
|
|-- Manifest.in
|-- README.md
|-- setup.py
|-- bash_profile_example

Usage

connections.rsconnect

A set of convenience functions for interacting with a Redshift database. In addition to establishing connections and fetching data, this sub-module can do things like:

  • Infer the schema of your DataFrame
  • CREATE and DROP tables
  • WRITE data to a table
  • Perform an UPSERT operation
  • Get the names of tables in your cluster
  • Et cetera

For example, upsert data or write a new table:

import pandas as pd
import connections.rsconnect as rs

df = pd.DataFrame({'my_primary_key': [1, 2], 'value': ['a', 'b']})  # example data

### Write the DataFrame to S3 as a CSV

bucket, key = rs.df_to_s3(
    df,
    bucket='my-bucket',
    key='location/on/s3/my-file.csv',
    primary='my_primary_key'
)

### If the table exists, perform an upsert operation from the CSV.
### If it doesn't, create the table and then load the CSV into it.

tname = 'my_table'
fields = rs.infer_schema(df)  # infer the Redshift schema from the DataFrame
if rs.table_check(tname):
    _ = rs.upsert_table(
        tname,
        fields,
        bucket=bucket,
        key=key,
        primary='my_primary_key'
    )
else:
    _ = rs.create_table(
        tname,
        fields,
        primary='my_primary_key'
    )
    _ = rs.write_data(
        tname,
        bucket,
        key
    )

Note also that the function to fetch data into a DataFrame is rs.sql_to_df().
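
For instance, pulling a query result into a DataFrame might look like this (the signature of rs.sql_to_df() isn't shown in this README, so the single query-string argument is an assumption):

df = rs.sql_to_df('SELECT * FROM my_table LIMIT 100')  # argument form assumed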

workflow.tracker

The workflow.tracker provides a lightweight tool for tracking a data science workflow. It is intended to help data scientists produce human-readable artifacts and obviate the need for things like complex naming conventions to keep track of the state of modeling experiments. It also has features to enable reproducibility, iterative improvement, and model deployment in a cloud environment (currently AWS).

The fundamental object of this library is the Project class. Conceptually, a Project is a single effort to build a Machine Learning function that addresses a particular problem. Individual experiments are conceptualized as 'runs'. A Run covers the data science workflow from data conditioning (post ETL and feature generation) through model validation and testing.
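
The tracker's API isn't documented in this README; the Project class is named above, but every method and argument in the sketch below is an assumption, included only to illustrate the Project/Run workflow:

from workflow.tracker import Project  # import path assumed from the package layout

# Hypothetical usage; method names are assumptions, not the package's actual API.
project = Project('churn-model')      # one Project per modeling problem
run = project.new_run('baseline')     # one Run per experiment
# ... condition data, train, validate, test ...
run.log_metric('auc', 0.87)           # record results as a human-readable artifact
run.close()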

For more information and to learn how to use the Workflow Tracker, see the sample notebooks in the 'cloud-event-modeling' repository.

special.s3session

The special.s3session module contains a set of convenience functions for creating an S3 session with credentials, checking a bucket's existence, listing a bucket's objects, and the like.
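
The function names in this module aren't shown in the README, but the operations it describes map directly onto boto3; here is a raw-boto3 sketch of the same tasks (the bucket name is a placeholder):

import boto3

# Raw boto3 equivalents of the conveniences described above.
session = boto3.Session()                      # credentials come from the environment
s3 = session.client('s3')
s3.head_bucket(Bucket='my-bucket')             # raises ClientError if the bucket doesn't exist
resp = s3.list_objects_v2(Bucket='my-bucket')  # list the bucket's objects
for obj in resp.get('Contents', []):
    print(obj['Key'])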

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datascientist-0.2.7.tar.gz (18.3 kB)

Built Distribution

datascientist-0.2.7-py3-none-any.whl (20.7 kB)

File details

Details for the file datascientist-0.2.7.tar.gz.

File metadata

  • Download URL: datascientist-0.2.7.tar.gz
  • Upload date:
  • Size: 18.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.3

File hashes

Hashes for datascientist-0.2.7.tar.gz:

  • SHA256: b85618637d2c50cbb148e1eeafc4ac13a6c34ded4e913b3e4e8c01c33b43a0f6
  • MD5: 4a9634d00e3fc71547bf8601d97e4b61
  • BLAKE2b-256: 8e99460645da24e598fc94fd9e7e01e70628a64b1cf018bb4c95a7f4a197dee8


File details

Details for the file datascientist-0.2.7-py3-none-any.whl.

File metadata

  • Download URL: datascientist-0.2.7-py3-none-any.whl
  • Upload date:
  • Size: 20.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.3

File hashes

Hashes for datascientist-0.2.7-py3-none-any.whl:

  • SHA256: 946ba4af73f561369deabf9922d588f309a232532eec48a369e370c10e97d1f5
  • MD5: a5ccc4a01bc5b9ea46f8e7b3ad9f58f8
  • BLAKE2b-256: be87c028c4e5180d736d35e6c35f951d9ababb2f177348d86c54cd113df9b67d

