A data science experiment framework.

Start Data Science

An opinionated, organized way to start and manage data science experiments.

Start Data Science is a template that helps you set up experiments. It brings structure to every stage, from exploratory data analysis (EDA) through feature extraction and modeling to the final outputs, whether they are figures, reports, APIs, or apps.

 

The main components are:

  • A predefined folder structure that keeps your experiment organized

  • A pre-compiled requirements.txt with over 150 commonly used data science libraries

  • Extensible scripts with boilerplate for Streamlit, Flask, FastAPI, and Cortex

  • (Work in progress) A library of common helpers, such as reading from and writing to S3 and methods to clean, transform, and extract features from data

  • (Work in progress) More open source solutions for APIs and apps (e.g. BentoML)

 

This repo aims to provide a comprehensive structure. You are expected to delete the parts you do not need and adjust the DAG to fit your experiment.

 

Getting Started

  1. Install the library
pip install startds
  2. Create your first experiment
startds create <exp_name>

Usage

Available Commands

  • create
startds create <exp_name> [--api flask|fastapi|cortex|all(default)] [--mode eng|ds|all(default)]

Creates a new experiment directory structure, where exp_name is the name of the new experiment. The command creates a new folder named exp_name in the current folder.

Options that can be provided with this command are --api and --mode.

--api accepts one of these values: flask, fastapi, cortex, all (default: all). The command creates boilerplate code for the chosen API tool in your _apis folder.

--mode accepts one of these values: eng, ds, all (default: all). It determines which folders are created in the src folder.

If eng is used as the mode, only the folders specific to engineering operations are created: _apis, _apps, _orchestrate, _tests.

If ds is used as the mode, only the folders specific to data science operations are created: clean, explore, transform, train.

If no options are provided, the full experiment directory will be created.  
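The --mode rules above can be restated as a small lookup. This is only an illustrative sketch: folders_for_mode is a hypothetical helper written for this README and is not part of the startds package, whose internal implementation may differ.

```python
from typing import List

# Folder names are taken from the --mode description in this README.
ENG_FOLDERS = ["_apis", "_apps", "_orchestrate", "_tests"]
DS_FOLDERS = ["clean", "explore", "transform", "train"]


def folders_for_mode(mode: str = "all") -> List[str]:
    """Return the src subfolders that the README says each mode creates.

    Hypothetical helper for illustration; not a startds function.
    """
    if mode == "eng":
        return list(ENG_FOLDERS)
    if mode == "ds":
        return list(DS_FOLDERS)
    if mode == "all":
        return ENG_FOLDERS + DS_FOLDERS
    raise ValueError(f"mode must be 'eng', 'ds' or 'all', got {mode!r}")


if __name__ == "__main__":
    for mode in ("eng", "ds", "all"):
        print(mode, "->", ", ".join(folders_for_mode(mode)))
```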

  • env
startds env [-f path_to_requirements.txt]

This command creates a virtual environment in the directory from which it is run. Run it from the root directory of the experiment you created with startds create.

Use the -f option to specify a custom requirements.txt. Without it, the default requirements.txt at the root of your experiment is used. You can also simply overwrite that default file.

By default, the env command initializes a virtual environment with over 150 of the most commonly used data science libraries. Note that it installs Airflow, which is required to execute the DAG.

To activate the new virtual environment, run

source .venv/bin/activate
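For readers who want to see roughly what such a setup involves, here is a minimal sketch using Python's standard venv module. It is an assumption about the general approach, not the code startds env actually runs. The function names create_env and pip_install_command are hypothetical, and the .venv directory name is inferred from the activate command above.

```python
import os
import subprocess
import venv
from pathlib import Path
from typing import List


def create_env(env_dir: str = ".venv", with_pip: bool = True) -> Path:
    """Create a virtual environment in env_dir using the standard library."""
    venv.EnvBuilder(with_pip=with_pip).create(env_dir)
    return Path(env_dir)


def pip_install_command(env_dir: str = ".venv",
                        requirements: str = "requirements.txt") -> List[str]:
    """Build the command that installs requirements with the environment's own Python."""
    bin_dir = "Scripts" if os.name == "nt" else "bin"
    python = Path(env_dir) / bin_dir / "python"
    return [str(python), "-m", "pip", "install", "-r", str(requirements)]


if __name__ == "__main__":
    # Assumes a requirements.txt exists in the current directory.
    create_env(".venv")
    subprocess.run(pip_install_command(".venv", "requirements.txt"), check=True)
```

Calling the environment's own interpreter with -m pip installs packages into that environment without activating it first.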

 

Running the experiment

python run.py

Runs dag.py, which is configured for Airflow by default. Note: you need Airflow installed, or you can configure your preferred orchestrator instead. The DAG can easily be modified to add or remove steps or to execute individual components.
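If you prefer not to use Airflow, a plain-Python stand-in can run the steps in order. The sketch below is not the generated dag.py or run.py. The step functions are hypothetical placeholders named after the src folders, and run_pipeline is an assumed helper that shows how steps could be removed or run individually.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

State = Dict[str, List[str]]
Step = Tuple[str, Callable[[State], State]]


def _record(name: str) -> Callable[[State], State]:
    """Build a placeholder step that only records that it ran."""
    def step(state: State) -> State:
        state.setdefault("log", []).append(name)
        return state
    return step


# Hypothetical steps; real ones would call code in src/clean, src/transform, src/train.
PIPELINE: List[Step] = [
    ("clean", _record("clean")),
    ("transform", _record("transform")),
    ("train", _record("train")),
]


def run_pipeline(steps: List[Step],
                 only: Optional[Iterable[str]] = None) -> List[str]:
    """Run steps in order; if only is given, run just the named steps."""
    selected = set(only) if only is not None else None
    state: State = {}
    for name, step in steps:
        if selected is not None and name not in selected:
            continue
        state = step(state)
    return state.get("log", [])


if __name__ == "__main__":
    print(run_pipeline(PIPELINE))
    print(run_pipeline(PIPELINE, only=["train"]))
```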

 

Running tests

Write tests in the _tests folder inside the src folder. You can run them with the pytest package. Make sure that pytest is installed, then run

pytest

from the root directory or the _tests directory.
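As an illustration of what a file such as test_clean.py could contain, here is a minimal sketch. The function drop_missing_rows is a hypothetical stand-in defined inline so the example is self-contained. In a real experiment the test would import the function under test from src/clean instead.

```python
# Hypothetical contents for a test file in the _tests folder.


def drop_missing_rows(rows):
    """Keep only the rows in which no value is None (stand-in for real cleaning code)."""
    return [row for row in rows if all(value is not None for value in row.values())]


def test_drop_missing_rows_removes_incomplete_rows():
    rows = [{"a": 1, "b": 2}, {"a": None, "b": 3}]
    assert drop_missing_rows(rows) == [{"a": 1, "b": 2}]


def test_drop_missing_rows_keeps_complete_rows():
    rows = [{"a": 1}, {"a": 2}]
    assert drop_missing_rows(rows) == rows
```

pytest collects functions whose names start with test_ from files named test_*.py, so no extra registration is needed.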

The resulting directory structure

The directory structure of your new project looks like this:

├── README.md          <- The top-level README for developers using this project.
│
├── Dockerfile         <- Dockerfile to create docker images for K8s or other cloud services
|
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
|
├── setup.py           <- makes project pip installable (pip install -e .) to enable imports of sibling modules in src
|
├── run.py             <- Run file that calls an orchestrator or individual .py files in your project
|
├── exp_name           <- namesake folder inside the exp_name root folder that you created
|   |
|   ├── metadata           <- Metadata that needs to be persisted and shared for data sources and models
|   │   └── data.md
|   │   └── models.md
|   |
|   ├── models             <- Trained and serialized models, model predictions, or model summaries
|   |
|   ├── notebooks          <- Folder to keep notebooks in. Import .py modules from src folder
|   │
|   ├── outputs            <- Generated analysis as HTML, PDF, LaTeX, etc.
|   │   └── figures        <- Generated graphics and figures to be used in reporting
|   |
|   ├── src                <- Source code for use in this project.
|   │   ├── __init__.py    <- Makes src a Python module
|   |   |
|   |   ├── _apis          <- Scripts to create APIs for serving models using Flask/FastAPI/others
|   |   |   └── fastapi
|   |   |   |   └── main.py
|   |   |   |   └── Dockerfile
|   |   |   |   └── build.sh
|   |   |   |
|   |   |   └── flask
|   |   |   |   └── main.py
|   |   |   |   └── Dockerfile
|   |   |   |   └── build.sh
|   |   |   |
|   |   |   └── cortex
|   |   |       └── main.py
|   |   |       └── cortex.yaml
|   |   |       └── requirements.txt
|   |   |
|   │   ├── _apps          <- Scripts to create internal ML apps using streamlit, dash etc
|   |   |   └── streamlit
|   |   |       └── main.py
|   |   |       └── Dockerfile
|   |   |       └── build.sh
|   |   |
|   |   ├── _orchestrate       <- Scripts to run different steps of the project using an orchestrator such as airflow
|   │   |   └── airflow
|   |   |       └── dags
|   |   |           └── dag.py
|   |   |
|   │   ├── _tests         <- Scripts to add tests for your experiment
|   │   │   └── test_clean.py
|   |   |
|   │   ├── clean          <- Scripts to connect and clean data
|   │   │   └── clean_data.py
|   |   |   └── connect_data.py
|   |   |
|   │   ├── explore        <- Scripts to create exploratory and results oriented visualizations
|   │   |   └── visualize.py
|   │   |   └── explore.py
|   │   │
|   │   ├── transform      <- Scripts to turn raw data into features for modeling
|   │   │   └── transform_data.py
|   |   |   └── setup_experiment.py
|   │   │
|   │   ├── train          <- Scripts to train models and then use trained models to make predictions
|   │   │   ├── predict_model.py
|   │   │   └── train_model.py
|   |   |   └── model.py
|   │   │

 

Note about importing sibling modules

To enable importing sibling modules when writing code in src, it is best to install the root experiment as a Python package:

pip install -e .

You could also modify sys.path in each file that needs to import a sibling module. Another solution is to run files from the root folder using python3 -m absolute_import_path_to_module.
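The sys.path approach can be sketched as follows. The helper add_to_sys_path is hypothetical and written for this README. The parents[2] index in the usage comment is an assumption based on the directory layout shown above, so adjust it to your own structure.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory) -> str:
    """Insert directory at the front of sys.path unless it is already listed."""
    resolved = str(Path(directory).resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


# Possible use at the top of a file such as src/train/train_model.py.
# Under the layout in this README, parents[2] is assumed to be the folder
# that contains src, which makes "import src.clean.clean_data" resolvable:
#
#     add_to_sys_path(Path(__file__).resolve().parents[2])
```

Installing the experiment with pip install -e . avoids repeating this in every file, which is why it is the recommended option.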

For background on this issue, see "Sibling package imports".

If you have ideas about how to manage this structure better, please let us know.  

Contributing to start-data-science

Feel free to open an issue against this repository or contact us and we'll help point you in the right direction.

License

Released under the MIT license.

Acknowledgements

A huge thanks to the following projects:

Structure / Inspiration:

Django
Cookiecutter Data Science

Integrations:

Streamlit
Cortex
FastAPI
Flask
Airflow

