
[alpha] A package that transforms your notebooks and Python files into pipeline steps by standardizing the data input / output.


stdflow

README OUTDATED

Data flow tool that transforms your notebooks and Python files into pipeline steps by standardizing the data input / output. [for data science projects]

Data Organization

Format

Data folder organization is systematic and is used by the load and export functions. It follows this format: data_name/attrs_1/attrs_2/.../attrs_n/step_name/{data_name}{country_code}{step_name}{version}{attrs}.csv

where:

  • data_name: name of the dataset
  • step_name: name of the step
  • attrs: additional attributes of the dataset (such as the country)
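For example (following the example paths used below, and keeping in mind the README is marked as outdated), a twitter dataset with the attribute france at the raw step in version v_202108021223 would live under a folder like:

./twitter/france/step_raw/v_202108021223/

with the CSV file name inside it built from the data name, step name, version, and attributes.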

Pipeline

A pipeline is composed of steps. Each step should export its data using the export_tabular_data function, which performs the export in a standard way. A step can be:

  • a file: a Jupyter notebook or a Python file
  • a Python function
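
For instance, a step implemented as a Python function might look like the following minimal sketch (the transformation is omitted and the paths are only illustrative, based on the load/export calls shown below):

import stdflow as sf

def step_clean():
    # load everything exported by the previous step
    dfs = sf.load(path='./twitter/france/', step="raw", version="last")
    # ... clean / transform the dataframes here ...
    # export under the "clean" step; the output folder is derived from the input path
    sf.export(dfs, step="clean")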

How to use

Load from raw data source

import stdflow as sf

# basic use-case
dfs = sf.load(
   path='./twitter/france/')  # recommended is: ./twitter/france/step_raw/v_202108021223  or (v_1 / v_demo / ...)
# or
dfs = sf.load(path='./', attrs=['twitter', 'france'], step=False, version=False)
# or
dfs = sf.load(path='./twitter', attrs=['france'], step=False)

sf.export(dfs, step="loaded")  # export in ./twitter/france/step_loaded/v_202108021223

Load from processed data source

import pandas as pd
import stdflow as sf

dfs = sf.load(
   path='./twitter/france/step_processed/v_2_client_intern/data.csv'
)  # automatically uses the appropriate function if metadata is available; otherwise, uses the default loader for the detected extension
# or
dfs = sf.load(
   path='./twitter/france/step_processed/',
   step=True,  # default is True: meaning it detects it from the path
   version="2_client_intern"  # default is last version
)

sf.load(path='./twitter/france/', file='data.csv', step="processed", version="last")

sf.load(pd.read_csv, path='./twitter/france/', file='data.csv', step="processed", version="last", header=None)
sf.load(pd.read_csv, path='./twitter/france/step_processed/v_12/data.csv', header=None)

# or 
dfs = sf.load(path='./twitter/france/step_processed/', step=True, version="last")  # last version is taken
# version keywords: last, first

Multiple data sources

dfs = sf.load(srcs=['./digimind/india/step_processed', './digimind/indonesia/step_processed'])

or set the elements one by one

sf.step_in = 'clean'
sf.version_in = 1
# ...

sf.step_name = 'preprocess'
sf.version = 1  # default to datetime
sf.attrs = ['india']  # default to []
# ...

attrs adds the attributes to the file name. It is also possible to use out_path; the final out_path is composed of in_path[0] (or out_path, if given) + attrs + step_name + version.

sf.export_tabular_data(dfs, data_path='./digimind/india/processed', step_name='clean', attrs=['india'], version=1)
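
Following the composition rule above (out_path + attrs + step_name + version), this call would write into a folder roughly like:

./digimind/india/processed/india/step_clean/v_1/

(the step_ and v_ prefixes are inferred from the folder examples earlier; the exact naming may differ).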

Data Loader

  • Auto: automatically selects one of the existing loaders based on metadata
  • CSVLoader: loads all CSV files in a folder
  • ExcelLoader: loads all Excel files in a folder
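
When no metadata is available for Auto, you can pass a pandas reader explicitly, as in the examples above; the Excel variant below is a sketch assuming the same pattern works for any pandas reader (sheet_name is a pandas option, not a stdflow one, and the file name is illustrative):

import pandas as pd
import stdflow as sf

# explicit reader: stdflow delegates the actual parsing to pandas
dfs = sf.load(pd.read_excel, path='./twitter/france/', file='data.xlsx',
              step="processed", version="last", sheet_name=0)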

Recommended steps

You can set up any steps you want. However, just like with any tool, there are good, bad, and common ways to use it.

The recommended way to use it is (a minimal sketch of this pipeline follows the list):

  1. Load
    • Use a custom load function to load your raw datasets if needed
    • Fix column names
    • Fix values
      • Except those for which you would like to test multiple methods that impact ML models.
    • Fix column types
  2. Merge
    • Merge data from multiple sources
  3. Transform
    • Pre-processing step along with most plots and analysis
  4. Feature engineering (step that is likely to see many iterations)

    The output of this step goes into the model

    • Create features
    • Fill missing values
  5. Model
    • This step likely contains gridsearch and therefore outputs multiple resulting datasets
    • Train model
    • Evaluate model (or moved to a separate step)
    • Save model
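
As a minimal sketch (dataset names, step names, and the transformations are illustrative), these steps could map onto stdflow calls like this:

# 1_load (notebook or script)
import stdflow as sf

dfs = sf.load(path='./twitter/france/', step="raw", version="last")
# fix column names, values and types here ...
sf.export(dfs, step="loaded")

# 2_merge (separate notebook or script; it would import stdflow itself)
dfs = sf.load(path='./twitter/france/', step="loaded", version="last")
# merge data from multiple sources here ...
sf.export(dfs, step="merged")

# 3_transform, 4_feature_engineering and 5_model follow the same load/export pattern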

Best Practices:

  • Do not use sf.reset as part of your final code
  • Do not export to multiple paths (path + attr_1/attr_2/.../attr_n + step_name) in the same step: only multiple versions
  • Do not set sub-dirs within the export (i.e. the version folder is the last depth). If you need similar operations for different datasets, create pipelines

How the package works

A step is composed of input and output data sources. Data sources are just folders. The format is path + attr_1/attr_2/.../attr_n + step_name + version

where:

  • attrs_1: usually the name of the dataset
  • attrs_2...n: additional attributes of the dataset (such as the country)
  • step_name: name of the step (optional but recommended so that the usage of the package makes sense)
  • version: version of the data (optional but recommended), defaults to the datetime

Each time you load data, the input data sources are saved. This is useful to keep track of the data used in a step. You can reset the loaded data by using sf.reset().
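
For example, while iterating in a notebook you can clear the tracked inputs before re-running a step (keeping in mind the best practice above: do not leave sf.reset in final code):

import stdflow as sf

sf.reset()  # forget previously tracked input data sources
dfs = sf.load(path='./twitter/france/', step="loaded", version="last")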

At export time a file with all details about the input and output data is generated and saved in the output folder.

Metadata

Each folder contains one metadata file listing the details of all files. Note that even though this architecture technically makes it possible to generate files in the same folder from different steps (a future-proofing concern), it is not recommended and you will get warnings.

{
   "files": [
      {
         "name": "file_name",
         "type": "file_type",
         "step": {
            "attrs": [
               "attr_1",
               "attr_2",
               "...",
               "attr_n"
            ],
            "version": "version",
            "step": "step_name"
         },
         "columns": [
            {
               "name": "column_name",
               "type": "column_type",
               "description": "column_description"
            }
         ],
         "input_files": [
               ...
         ]
      },
      {
         ...
      }
   ]
}
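
Since the metadata is plain JSON, a step's lineage can be inspected with the standard library; note that the metadata file name used below (metadata.json) is an assumption, as the README does not specify it:

import json

# hypothetical file name: the README does not state how the metadata file is named
with open('./twitter/france/step_loaded/v_202108021223/metadata.json') as f:
    meta = json.load(f)

# print each file together with its step name and number of input files
for file_info in meta["files"]:
    print(file_info["name"], file_info["step"]["step"], len(file_info.get("input_files", [])))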
