A data science experiment framework.
Start Data Science
An opinionated, organized way to start and manage data science experiments.
Start Data Science is a template to help you set up experiments. It brings structure to exploratory data analysis (EDA), through to feature extraction, modeling, and resultant outputs whether they're figures, reports, APIs, or apps.
The main components are:
- A pre-defined framework that brings organization to your experiment
- A pre-compiled requirements.txt featuring over 150 commonly used data science libraries
- Extensible scripts with boilerplate for Streamlit, Flask, FastAPI and Cortex
- (Work in progress) A library of common helpers, such as reading from and writing to S3, and methods to clean, transform and extract features from data
- (Work in progress) More open source solutions for APIs and apps (e.g. BentoML)
The idea of this repo is to provide a comprehensive structure. You are expected to delete the portions you do not need and adjust the dag to suit your experiment.
Getting Started
- Install the library
pip install startds
- Create your first experiment
startds create <exp_name>
Usage
Available Commands
- create
startds create <exp_name> [--api flask|fastapi|cortex|all(default)] [--mode eng|ds|all(default)]
Creates a new experiment directory structure, where exp_name is the name of the new experiment you want to create. This will create a new folder named exp_name in the current folder.
Options that can be provided with this command are --api and --mode.
--api can only take one of these values: flask, fastapi, cortex, all (default value is all).
This will create boilerplate code for that specific API tool in your _apis folder.
--mode can only take one of these values: eng, ds, all (default value is all).
This affects which folders are created in the src folder.
If eng is used as the mode, only folders specific to engineering operations will be created: _apis, _apps, _orchestrate, _tests.
If ds is used as the mode, only folders specific to data science operations will be created: clean, explore, transform, train.
If no options are provided, the full experiment directory will be created.
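The effect of --mode on the src folder can be summarized in a few lines of Python. This is only an illustrative sketch of the mapping described above, not the startds implementation; the function name folders_for_mode and the two constants are hypothetical.

```python
from typing import List

# Illustrative only: mirrors the --mode behaviour described above.
# These names are hypothetical and not part of the startds API.
ENG_FOLDERS = ["_apis", "_apps", "_orchestrate", "_tests"]
DS_FOLDERS = ["clean", "explore", "transform", "train"]


def folders_for_mode(mode: str = "all") -> List[str]:
    """Return the src sub-folders expected for a given --mode value."""
    if mode == "eng":
        return list(ENG_FOLDERS)
    if mode == "ds":
        return list(DS_FOLDERS)
    if mode == "all":
        return ENG_FOLDERS + DS_FOLDERS
    raise ValueError(f"mode must be one of eng, ds, all; got {mode!r}")


for mode in ("eng", "ds", "all"):
    print(mode, "->", folders_for_mode(mode))
```

Any value other than eng, ds or all raises a ValueError here, matching the statement that --mode can only take one of those values.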
- env
startds env [-f path_to_requirements.txt]
This command creates a virtual environment in the directory from which it is run. It is recommended to run this command from the root directory of the new experiment you created with startds create.
Option -f can be used to specify your custom requirements.txt. In the absence of this option, the default requirements.txt located at the root directory of your experiment will be used. You can also simply overwrite that default file.
The default env command initializes a virtual environment for the experiment with over 150 of the most commonly used data science libraries. Note that it installs airflow, which is required in order to execute the dag.
To activate the newly created virtual environment, run
source .venv/bin/activate
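For readers curious what an environment step like this involves, here is a rough sketch using only the Python standard library. It is not the startds implementation: the helper names are hypothetical, and the .venv directory name is inferred from the activate command above.

```python
import os
import subprocess
import venv
from pathlib import Path
from typing import List


def venv_python(venv_dir: Path) -> Path:
    """Path to the interpreter inside a virtual environment."""
    if os.name == "nt":  # Windows keeps executables under Scripts
        return venv_dir / "Scripts" / "python.exe"
    return venv_dir / "bin" / "python"


def pip_install_command(venv_dir: Path, requirements: Path) -> List[str]:
    """Build the command that installs a requirements file into the venv."""
    return [str(venv_python(venv_dir)), "-m", "pip", "install", "-r", str(requirements)]


def create_env(venv_dir: Path, requirements: Path) -> None:
    """Create the virtual environment, then install the requirements into it."""
    venv.EnvBuilder(with_pip=True).create(venv_dir)
    subprocess.run(pip_install_command(venv_dir, requirements), check=True)


# Hypothetical usage, run from the experiment's root directory:
# create_env(Path(".venv"), Path("requirements.txt"))
```

Calling the environment's own interpreter with -m pip means the packages land inside the new environment even when it has not been activated in the current shell.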
Running the experiment
python run.py
Runs dag.py, which is configured using airflow by default. Note: you will need airflow installed, or you can configure your preferred orchestrator instead. The dag can be easily modified to add or remove steps, and/or execute individual components.
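To make "add or remove steps" concrete without depending on airflow, the sketch below runs a small pipeline in dependency order using the standard library's graphlib module (Python 3.9+). The step names and the run_pipeline helper are hypothetical and do not reflect the contents of the generated dag.py.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_pipeline(
    steps: Dict[str, Callable[[], None]],
    depends_on: Dict[str, Set[str]],
) -> List[str]:
    """Run each step after the steps it depends on; return the run order."""
    graph = {name: depends_on.get(name, set()) for name in steps}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        steps[name]()
    return order


# Hypothetical steps; each one only records that it ran.
log: List[str] = []
steps = {
    "clean": lambda: log.append("clean"),
    "transform": lambda: log.append("transform"),
    "train": lambda: log.append("train"),
    "predict": lambda: log.append("predict"),
}
depends_on = {
    "transform": {"clean"},
    "train": {"transform"},
    "predict": {"train"},
}

order = run_pipeline(steps, depends_on)
print(order)
```

Removing a step in this sketch means deleting its entry from steps and pointing any dependants at the step before it; an airflow dag expresses the same idea through task dependencies.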
Running tests
Tests are to be written in the _tests folder inside the src folder. The pytest package can be used to run these tests.
Make sure that pytest is installed, then run the following from the root directory or the _tests directory:
pytest
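As a starting point, a test file in _tests might look like the sketch below. The drop_empty_rows function is a hypothetical stand-in for whatever your cleaning code ends up containing; it is defined inline here only so that the example is self-contained.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def drop_empty_rows(rows: List[Row]) -> List[Row]:
    """Keep only rows that have at least one non-empty value."""
    return [
        row for row in rows
        if any(value not in (None, "") for value in row.values())
    ]


# pytest collects functions whose names start with "test_".
def test_drop_empty_rows_removes_blank_records():
    rows = [{"name": "a", "city": None}, {"name": "", "city": None}]
    assert drop_empty_rows(rows) == [{"name": "a", "city": None}]


def test_drop_empty_rows_keeps_complete_records():
    rows = [{"name": "a", "city": "x"}]
    assert drop_empty_rows(rows) == rows
```

In a real experiment the function under test would be imported from the src package rather than defined next to the tests.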
The resulting directory structure
The directory structure of your new project looks like this:
├── README.md <- The top-level README for developers using this project.
│
├── Dockerfile <- Dockerfile to create docker images for K8s or other cloud services
|
├── requirements.txt <- The requirements file for reproducing the analysis environment, e.g.
│ generated with `pip freeze > requirements.txt`
|
├── setup.py <- makes project pip installable (pip install -e .) to enable imports of sibling modules in src
|
├── run.py <- Run file that calls an orchestrator or individual .py files in your project
|
├── exp_name <- namesake folder inside the exp_name root folder that you created
| |
| ├── metadata <- Metadata that needs to be persisted and shared for data sources and models
| │ └── data.md
| │ └── models.md
| |
| ├── models <- Trained and serialized models, model predictions, or model summaries
| |
| ├── notebooks <- Folder to keep notebooks in. Import .py modules from src folder
| │
| ├── outputs <- Generated analysis as HTML, PDF, LaTeX, etc.
| │ └── figures <- Generated graphics and figures to be used in reporting
| |
| ├── src <- Source code for use in this project.
| │ ├── __init__.py <- Makes src a Python module
| | |
| | ├── _apis <- Scripts to create APIs for serving models using Flask/FastAPI/others
| | | └── fastapi
| | | | └── main.py
| | | | └── Dockerfile
| | | | └── build.sh
| | | |
| | | └── flask
| | | | └── main.py
| | | | └── Dockerfile
| | | | └── build.sh
| | | |
| | | └── cortex
| | | └── main.py
| | | └── cortex.yaml
| | | └── requirements.txt
| | |
| │ ├── _apps <- Scripts to create internal ML apps using streamlit, dash etc
| | | └── streamlit
| | | └── main.py
| | | └── Dockerfile
| | | └── build.sh
| | |
| | ├── _orchestrate <- Scripts to run different steps of the project using an orchestrator such as airflow
| │ | └── airflow
| | | └── dags
| | | └── dag.py
| | |
| │ ├── _tests <- Scripts to add tests for your experiment
| │ │ └── test_clean.py
| | |
| │ ├── clean <- Scripts to connect and clean data
| │ │ └── clean_data.py
| | | └── connect_data.py
| | |
| │ ├── explore <- Scripts to create exploratory and results oriented visualizations
| │ | └── visualize.py
| │ | └── explore.py
| │ │
| │ ├── transform <- Scripts to turn raw data into features for modeling
| │ │ └── transform_data.py
| | | └── setup_experiment.py
| │ │
| │ ├── train <- Scripts to train models and then use trained models to make predictions
| │ │ ├── predict_model.py
| │ │ └── train_model.py
| | | └── model.py
| │ │
Note about importing sibling modules
To enable importing sibling modules when writing code in src, it is best to install the root experiment as a Python package:
pip install -e .
Alternatively, you can modify sys.path in each file that needs to import a sibling module.
Another solution is to run files from the root folder using python3 -m absolute_import_path_to_module
For background on this issue, see: Sibling package imports
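The sys.path approach can be sketched as follows. The helper name add_to_sys_path is hypothetical, and the commented usage line assumes the src layout shown in the directory structure above.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory: Path) -> str:
    """Put directory at the front of sys.path if it is not already listed."""
    resolved = str(directory.resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


# Hypothetical usage at the top of a script one level below src
# (for example in the clean folder), before importing from a sibling folder:
# add_to_sys_path(Path(__file__).resolve().parents[1])  # the src folder
```

This has to be repeated in every entry-point script, which is why installing the experiment with pip install -e . is usually the tidier option.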
If you have ideas about how to manage this structure better, please let us know.
Contributing to start-data-science
Feel free to open an issue against this repository or contact us and we'll help point you in the right direction.
License
Released under the MIT license.
Acknowledgements
A huge thanks to the following projects:
Structure / Inspiration:
Django
Cookiecutter Data Science
Integrations: