[alpha] A package that transforms your notebooks and Python files into pipeline steps by standardizing data input/output.

stdflow

Data flow tool that transforms your notebooks and Python files into pipeline steps by standardizing data input/output (for data science projects).

Create clean data flow pipelines just by replacing your pd.read_csv() and df.to_csv() with sf.load() and sf.save().
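
For example, a before/after sketch (the paths and arguments reuse the layout described in "Load and save data" below):

import pandas as pd
import stdflow as sf

# before: plain pandas with hard-coded paths
df = pd.read_csv("./data/twitter/france/step_raw/v_1/countries of the world.csv")
df.to_csv("./data/twitter/france/step_processed/v_1/countries.csv", index=False)

# after: stdflow builds the same paths from structured arguments (and records metadata)
df = sf.load(root="./data", attrs=["twitter", "france"], step="raw",
             version="1", file_name="countries of the world.csv")
sf.save(df, root="./data", attrs=["twitter", "france"], step="processed",
        version="1", file_name="countries.csv")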

(viz_tool.png: screenshot of the pipeline visualization tool)

Pipelines

import os

import stdflow as sf
from stdflow import Step
from stdflow.pipeline import Pipeline

# set a stdflow variable to be used by a pipeline calling this pipeline notebook
root = sf.var("preprocessing_path", "./")

def path(ntb):
    return os.path.join(root, ntb)

files = [
    "1. formatting.ipynb",
    "2. remove_outliers.ipynb",
    "3. missing_values_imputation.ipynb",
    "4. scaling.ipynb",
]

# create a pipeline with 4 steps
ppl = Pipeline([Step(exec_file_path=path(ntb)) for ntb in files])

# add step 5 twice with different parameters
ppl.add_step(
    Step(
        exec_file_path=path("5. merge.ipynb"),
        exec_variables={
            "country": "france",  # stdflow variable in the notebook "5. merge.ipynb" is configurable
        },
    )
)
ppl.add_step(
    Step(
        exec_file_path=path("5. merge.ipynb"),
        exec_variables={
            "country": "spain",
        },
    )
)

# run the pipeline
ppl.run()
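
Inside the called notebook, the injected variable is read back with sf.var, falling back to a default when the notebook is run on its own. A sketch of what the matching line in "5. merge.ipynb" might look like:

# Sketch of the corresponding line inside "5. merge.ipynb"
import stdflow as sf

country = sf.var("country", "france")  # set to "france" or "spain" when run through the pipeline above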

Load and save data

Specify everything

import stdflow as sf
import pandas as pd

# load data from ./data/twitter/france/step_raw/v_1/countries of the world.csv
df = sf.load(
   root="./data", 
   attrs=['twitter', 'france'], # or attrs='twitter/france'
   step='raw', 
   version='1', 
   file_name='countries of the world.csv',
   method=pd.read_csv  # or method='csv'
)

# export data to ./data/twitter/france/step_processed/v_1/countries.csv
sf.save(
   df, 
   root="./data", 
   attrs=['twitter', 'france'], 
   step='processed', 
   version='1', 
   file_name='countries.csv', 
   method=pd.DataFrame.to_csv  # or method='csv', or any function that takes the object to export as its first argument
)

Each time you perform a save, a metadata.json file is created in the export folder. It keeps track of how your data was created, along with other information.
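
A minimal sketch of inspecting that file (the path is assumed to match the save() call above; the metadata schema is stdflow's own, so we only peek at the raw content):

import json

with open("./data/twitter/france/step_processed/v_1/metadata.json") as f:
    metadata = json.load(f)

print(json.dumps(metadata, indent=2)[:500])  # peek at the beginning of the file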

Specify almost nothing

import stdflow as sf

# use package level default values
sf.root = "./data"
sf.attrs = ['twitter', 'france']  # if needed use attrs_in and attrs_out
sf.step_in = 'raw'
sf.step_out = 'processed'

df = sf.load()  
# ! root / attrs / step : taken from the default values set above
# ! version : the latest version is used automatically (default: ":last")
# ! file_name : the only file in the folder is found automatically
# ! method : inferred from the file extension

sf.save(df)
# ! root / attrs / step : taken from the default values set above
# ! version : defaults to a %Y%m%d%H%M timestamp
# ! file_name : reused from the input (because only one file was loaded)
# ! method : inferred from the file name

Note that everything we did at the package level can also be done with the Step class:

from stdflow import Step

step = Step(root="./data", attrs=['twitter', 'france'], step_in='raw', step_out='processed')
# or set after
step.root = "./data"
# ...

df = step.load(version=':last', file_name=":auto", verbose=True)

step.save(df, verbose=True)

Data visualization

import stdflow as sf
sf.save({'what?': "very cool data"}, export_viz_tool=True) # exports viz folder

The viz folder contains an HTML page that loads the metadata.json file from the parent directory (where you exported) and displays the data pipeline.
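
Browsers often block pages opened from file:// URLs from fetching local JSON, so one way to view the page (a sketch, not part of stdflow) is to serve the export folder over HTTP and open the viz folder from there:

# Sketch only: serve the export folder so the viz page can fetch ../metadata.json.
# EXPORT_DIR is assumed to be the folder you saved to; adjust as needed.
import functools
import http.server
import socketserver

EXPORT_DIR = "./data/twitter/france/step_processed/v_1"

handler = functools.partial(http.server.SimpleHTTPRequestHandler, directory=EXPORT_DIR)
with socketserver.TCPServer(("", 8000), handler) as httpd:
    print("Open http://localhost:8000/viz/ in a browser")
    httpd.serve_forever()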

Under the hood

Data Organization

Format

Data folder organization is systematic and is used by the load and save functions. It follows this format (a worked example follows the list below): root_data_folder/attrs_1/attrs_2/.../attrs_n/step_name/version/file_name

where:

  • root_data_folder: the path to the root of your data folder; it is not exported in the metadata
  • attrs: information to classify your dataset (e.g. country, client, ...)
  • step_name: name of the step. The folder name always starts with step_
  • version: version of the step. The folder name always starts with v_
  • file_name: name of the file. Can be anything
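
For illustration only (not stdflow's actual code), here is how those pieces compose into a path, reusing the parameters from the save() example above:

import os

# Illustrative sketch of the layout above; stdflow builds these paths for you.
def build_path(root, attrs, step_name, version, file_name):
    return os.path.join(root, *attrs, f"step_{step_name}", f"v_{version}", file_name)

print(build_path("./data", ["twitter", "france"], "processed", "1", "countries.csv"))
# ./data/twitter/france/step_processed/v_1/countries.csv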

Each folder is the output of a step. It contains a metadata.json file with information about all files in the folder and how they were generated. It can also contain an HTML page (if you set html_export=True in save()) that lets you visualize the pipeline and your metadata.

Pipeline

A pipeline is composed of steps. Each step should export its data using the export_tabular_data function, which performs the export in a standard way. A step can be:

  • a file: Jupyter notebook / Python file
  • a Python function

Recommended steps

You can set up any steps you want. However, just like any tool, there are good, bad, and common ways to use it.

The recommended way to use it is (a sketch follows this list):

  1. Load
    • Use a custom load function to load your raw datasets if needed
    • Fix column names
    • Fix values
      • Except those for which you would like to test multiple methods that impact ML models.
    • Fix column types
  2. Merge
    • Merge data from multiple sources
  3. Transform
    • Pre-processing step along with most plots and analysis
  4. Feature engineering (step that is likely to see many iterations)

    The output of this step goes into the model

    • Create features
    • Fill missing values
  5. Model
    • This step likely contains grid search and therefore outputs multiple resulting datasets
    • Train model
    • Evaluate model (or move this to a separate step)
    • Save model
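
A minimal sketch of how one of these steps might look with stdflow (step names and attrs are illustrative, not requirements):

# e.g. inside the "transform" notebook of the recommended flow above
import stdflow as sf

sf.root = "./data"
sf.attrs = ["twitter", "france"]   # illustrative attrs
sf.step_in = "merge"               # read the output of the previous step
sf.step_out = "transform"          # write this step's output

df = sf.load()
# ... pre-processing, plots, analysis ...
sf.save(df)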

Best Practices:

  • Do not use sf.reset as part of your final code
  • In one step, export to only one path (except for the version), i.e. only one combination of attrs and step_name per step
  • Do not create sub-directories within the export (i.e. the version folder is the last depth). If you need similar operations for different datasets, create pipelines

TODO:

  • add pipelines
  • add excalidraw schema
  • add import/export of other data types: [structured, unstructured, semi-structured]
  • add test loop
  • example architecture with:
    • data
    • pipelines
    • models
    • tests
    • notebooks
    • src
    • config
    • logs
    • reports
    • requirements.txt
    • README.md
    • .gitignore
  • setup pipelines_root, models_root, tests_root, notebooks_root, src_root, config_root, logs_root, reports_root
  • common steps of moving a file / deleting a file (requires pipeline)
  • version :last should use the metadata (datetime in the file and of the file to know which one is the latest)
  • option to delete the previous version when saving
  • set up the situation in which you chain small functions in a directory and each deletes the previous file before creating a new one with another name; in the chain it will appear with different names, showing the process
  • a processing step can delete the loaded files
  • setting export=False? delete_after_n_usage=4?
