PLPipes - PredictLand Data Science Framework
Introduction
PredictLand, the company behind PLPipes, is a consultancy firm focused on Data Science and related fields (Data Analytics, AI and ML, Big Data, Data Engineering, etc.). We work for customers that range from small companies with just a few employees to big corporations, and that requires us to be very flexible in the way we work, as the platforms, IT systems and tools we can use in every project vary greatly.
More specifically, it is not uncommon for us to work on projects where all the infrastructure we have available to carry out our tasks is just our laptops! Yes, that's it, no fancy environments like Databricks or Snowflake, no cloud instances with hundreds of GBs of RAM, no data automation services such as Azure Data Factory or DBT, sometimes not even a simple database server... It is just our laptops, a git repository and maybe some Excel files with the data.
So that's one way to think about PLPipes: a Data Science framework on the cheap, a poor man's Databricks replacement!
But even if that is true, we prefer to think of PLPipes as a lean and very scalable framework. Something that you can use to train some models from a few CSVs on your laptop, to process terabytes of information on a cloud cluster, to embed in a lambda function, to run some model inside a Docker container, etc.
So, what is exactly PLPipes?
Several things:
- It is a thin layer integrating several technologies so that they can be used easily and efficiently to solve common data science problems.
- It is an automation framework for creating data processing pipelines.
- It is a programming framework for reducing boilerplate, enforcing some best practices and providing support for common tasks.
- It is also a mindset and a way to standardize Data Science project development.
- It is a very customizable framework with sane defaults, so that you can start working on your projects right away without having to perform a complex setup up front.
- It is still a work in progress! Even if the ideas behind PLPipes are not new and we have used/implemented them in different forms and in different projects in the past (or in some cases, just copied them from other 3rd party projects), the framework is still very new and most of it should be considered experimental!
Nomenclature
We use PLPipes to refer to the framework as a whole (projects, code,
conventions, mindset, etc.) and plpipes to refer specifically to the
Python package.
Overview
So, what does a typical PLPipes project look like?
PLPipes projects are organized around actions which can be considered as atomic units of work. Examples of actions are downloading a file, transforming some data or training a model.
Actions are grouped in sequences to create data processing pipelines. Several pipelines can be defined inside one project, and it is even possible to change which actions form a pipeline dynamically depending on the deployment environment, the configuration, command line arguments, etc.
Another key concept of PLPipes is that a relational database is used to pass information between actions (alternatively the file system can be used, but the database is preferred).
The pipelines are launched by the runner, which is
nothing else than a Python script that calls into plpipes and is
able to handle command line arguments, configuration files and
environment variables in a unified way.
In summary, when using PLPipes, instead of a bunch of scripts, each doing something different, we have a set of pipelines built on top of actions that use a relational database to store intermediate data, and we use a standardized Python script to get everything running.
Finally, another key feature of PLPipes is its powerful configuration system.
Project Setup
This chapter describes how to set up a PLPipes project from scratch.
PLPipes is quite configurable and most of its workings can be changed and redefined, but that doesn't preclude it from offering some sane defaults that we advise you to follow.
Specifically, by default, it expects some directory structure and a main script which is used to organize the project operations as described in the following sections:
Directory structure
A PLPipes project is structured in the following directories, which should be created by hand (development of a utility to do it automatically is planned).
- lib (optional): This is where reusable Python modules specific to the project are stored.
- bin: This is the place for the project scripts, though usually it just contains the main script run.py. Other scripts can be placed here, but note that the Actions mechanism available through run.py is the preferred way to organize the project operations.
- actions: Action definitions. See Actions below.
- notebooks (optional): Jupyter notebooks go here.
- config: Configuration files are stored here. See Configuration.
- defaults (optional): Default configuration files go here (the contents of this directory should be committed to git). The semantic distinction between defaults and config is something we are still considering and that may change.
- input (optional): Project input files.
- work: Working directory; intermediate files go here. Also, the default working database is stored here as work/work.duckdb.
- output (optional): Final output files generated by the project can go here.
- venv (optional): Even if plpipes does not depend on it, we recommend using a virtual environment with that name for the project.
The main script
bin/run.py is the main entry point for PLPipes and should be
created by hand with the following content:
#!/usr/bin/env python3
from plpipes.runner import main
main()
Installing plpipes
The Python module plpipes can be installed in two ways.
Installing a packed version
This is the recommended way to install the module when you don't want to contribute to the development of the framework and just want to use it.
Note that, in practice, as plpipes is still in a very early development stage, that may not be a realistic assumption and you may be required to switch to the development version available from git quite soon!
Somehow (!) obtain the module wheel and install it using pip:
pip install /path/to/.../plpipes-0.1-py2.py3-none-any.whl
Hopefully, plpipes will be directly available from PyPI soon!
Installing from git
- Clone the repository outside of your project directory and switch to the develop branch:

  git clone git@github.com:PredictLand/PL-TEC-PLPipes.git
  cd PL-TEC-PLPipes
  git checkout develop

- Add the src subdirectory to the Python search path:

  # Linux and/or bash:
  export PYTHONPATH=path/to/.../PL-TEC-PLPipes/src

  # Windows
  set PYTHONPATH=C:\path\to\...\PL-TEC-PLPipes\src

- Check that it works:

  python -m plpipes -c "print('ok')"
Alternatively you can modify your project main script to append
the src directory to the module search path so that you don't need to
set PYTHONPATH by hand every time you start a new session.
For instance:
from pathlib import Path
import sys
sys.path.append(str(Path.cwd().parent.parent.parent / "PL-TEC-PLPipes/src"))
from plpipes.runner import main
main()
Or you could also set PYTHONPATH from your shell startup script
(~/.profile) or in the Windows registry.
Using PLPipes
PLPipes comprises several modules which can be used together or independently.
Configuration
The configuration module is one of the core components of PLPipes, pervasively used by plpipes itself, so even if you don't want to use it directly in your project, it will still be used internally by the framework.
Configuration data is structured in a global tree-like object which is initialized from data read from several files in sequence and from the command line.
Both YAML and JSON files are supported (though we recommend YAML, as it is usually easier for humans to read).
When the same setting appears in several configuration files, the last one read is the one that prevails.
File structure
The list of files from which the configuration is read is dynamically calculated based on two settings:
- The script "stem": the name of the script being run, without the extension (for instance, the stem for run.py is run). When plpipes is used from a Jupyter notebook, the stem can be passed on the %plpipes line magic: %plpipes foobalizer
- The deployment environment (dev, pre, pro, etc.): this can be set from the command line or using the environment variable PLPIPES_ENV (see Environment variables below). It defaults to dev.
Also, there are two main directories where configuration files are stored:
- default: This directory should contain configuration files that are considered defaults and that are not going to be changed by the project users. We think of it as the place for settings that would otherwise be hard-coded.
- config: This directory contains configuration files which are editable by the project users or where developers can put temporary settings they don't want to push into git.

We are currently considering whether this division makes sense or if we should replace it by something better.
When the PLPipes configuration module is initialized, it looks in those two directories for files whose names follow these rules:
- Base name: the base name is taken as common or the stem, so that, for instance, when loading the configuration from run.py, both common.yaml and run.yaml files would be taken into account.
- Secrets: files with a -secrets postfix are also loaded (for instance, common-secrets.yaml and run-secrets.yaml).
- Environment: files with the deployment environment attached as a postfix are also loaded (run-dev.yaml or run-secrets-dev.yaml).
Additionally, two user-specific configuration files are considered. Those are expected to contain global configuration settings which are not project specific, such as API keys, common database definitions, etc.
~/.config/plpipes/plpipes.yaml
~/.config/plpipes/plpipes-secrets.yaml
Finally, when using the default runner (See Runner below), the user can request additional configuration files to be loaded.
In summary, the full set of files which are considered when, for instance, the run.py script is invoked in the dev environment is as follows (and in this particular order):
~/.config/plpipes/plpipes.json
~/.config/plpipes/plpipes.yaml
~/.config/plpipes/plpipes-secrets.json
~/.config/plpipes/plpipes-secrets.yaml
default/common.json
default/common.yaml
default/common-dev.json
default/common-dev.yaml
default/common-secrets.json
default/common-secrets.yaml
default/common-secrets-dev.json
default/common-secrets-dev.yaml
default/run.json
default/run.yaml
default/run-dev.json
default/run-dev.yaml
default/run-secrets.json
default/run-secrets.yaml
default/run-secrets-dev.json
default/run-secrets-dev.yaml
config/common.json
config/common.yaml
config/common-dev.json
config/common-dev.yaml
config/common-secrets.json
config/common-secrets.yaml
config/common-secrets-dev.json
config/common-secrets-dev.yaml
config/run.json
config/run.yaml
config/run-dev.json
config/run-dev.yaml
config/run-secrets.json
config/run-secrets.yaml
config/run-secrets-dev.json
config/run-secrets-dev.yaml
Automatic configuration
There are some special settings that are automatically set by the framework when the configuration is initialized:
- fs: The file system sub-tree. It contains entries for the main project subdirectories (root, which points to the project root directory, plus bin, lib, config, default, input, work, output and actions).
- env: The deployment environment.
- logging.level: The logging level.
All those entries can be overridden in the configuration files.
Wildcards
In order to simplify the declaration of similar configuration subtrees, a wildcard mechanism is provided.
Entries named * (an asterisk) are copied automatically into sibling
configurations.
For instance, in the following configuration most of the database
connection parameters for input and work instances are obtained
from the * entry.
db:
  instance:
    '*':
      driver: azure_sql
      server: example.database.windows.net
      user: jtravolta
      password: grease78
    input:
      database: data_source
    work:
      database: tempdb
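For illustration, assuming the snippet above has been loaded by the runner, the shared parameters become visible under each instance. A minimal sketch:

from plpipes import cfg

# Entries under '*' are copied into the sibling instances, so both
# "input" and "work" inherit the shared connection parameters:
print(cfg["db.instance.input.driver"])    # azure_sql
print(cfg["db.instance.work.database"])   # tempdb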
Python usage
The configuration is exposed through the plpipes.cfg object.
It works as a dictionary which accepts dotted entries as keys. For instance:
from plpipes import cfg
print(f"Project root dir: {cfg['fs.root']}")
A sub-tree view can be created using the cd method:
cfs = cfg.cd('fs')
print(f"Project root dir: {cfs['root']}")
Most dictionary methods work as expected. For instance, it is possible to mutate the configuration or to set defaults:
cfg["my.conf.key"] = 7
cfg.setdefault("my.other.conf.key", 8)
Note, though, that configuration changes are not written back to disk.
Config Initialization
The method init of the module plpipes.init is the one in charge of populating the cfg object and should be called explicitly in scripts that want to use the configuration module without relying on other parts of the framework.
plpipes.init.init is where the set of files to be loaded based on
the stem and on the deployment environment is calculated and where
they are loaded into the configuration object.
Automatic configuration is also performed by this method.
Note that plpipes.init is a low level package that is not expected
to be used directly from user code. Instead you should use the methods
provided in plpipes.runner which take care of initializing the
environment and also the configuration subsystem.
Database
plpipes provides a simple way to declare and use multiple database
connections and a set of shortcuts for simplifying some procedures
common in a Data Science context (i.e. running a query and getting
back a DataFrame or creating a new table from a DataFrame).
Default database
One of the key points of the framework is that a locally stored DuckDB database is always available for usage with zero setup work.
Also, as with most things in PLPipes, that default database (a.k.a. the work database) is configurable, so, for instance, it can be changed to a PostgreSQL one running in AWS for the production environment, or to a SQLite one because of its GIS support, or whatever.
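For instance, a minimal configuration sketch (the sqlite driver is described below; the file name is just illustrative) redirecting the work database to SQLite could look like this:

db:
  instance:
    work:
      driver: sqlite
      file: work.sqlite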
Database configuration
Database configuration goes under the db.instance sub-tree where the
different database connections can be defined.
For instance, an input database connection backed by a SQL Server database running in Azure can be declared as follows:
db:
  instance:
    input:
      driver: azure_sql
      server: my-sql-server.database.windows.net
      database: customer-db
      user: predictland
The db.instance.*.driver key is used to find out which driver to use
to establish the connection.
The db.instance.*.backend key is used to establish the DataFrame library backend used for the database instance. See Database backends.
The remaining configuration entries are driver specific, as follows:
DuckDB configuration
- driver: duckdb
- file: name of the database file. Defaults to {instance_name}.duckdb.
If the instance is named input or output, the database file is
placed inside the matching directory (for instance,
input/input.duckdb).
Otherwise it is placed in the work directory (example:
work/other.duckdb).
SQLite configuration
- driver: sqlite
- file: database file name.
Works in exactly the same way as DuckDB but using sqlite as the
database file extension.
Spatialite configuration
Spatialite is an extension of SQLite designed to facilitate the manipulation of geographic data.
- driver: spatialite
- file: database file name.
The extension must be installed. If you are using Conda it is
available from the conda-forge repository and can be installed as
follows:
conda install libspatialite -c conda-forge
Note that Spatialite database files also use the sqlite extension.
SQLServer configuration
- driver: sql_server
- server
- database
- user
- password
- encrypt: defaults to true.
- trusted_server_certificate: defaults to true.
- timeout: defaults to 60s.
Also, in order to connect to a SQL Server database, the Microsoft ODBC Driver for SQL Server must be installed.
AzureSQL configuration
- driver: azure_sql
- server: fully qualified server name. It can be seen on the Summary page for the database in the Azure Portal. It usually has a name like foo.database.windows.net.
- database: Database instance name. It is usually (always?) also the name of the Azure resource.
- credentials: Name of the Azure credential group to be used for authentication. See the Azure chapter below.
Example:
db:
  instance:
    input:
      driver: azure_sql
      server: foo.database.windows.net
      database: megadb-2000
      credentials: bar

cloud:
  azure:
    auth:
      bar:
        driver: azure_cli
Again, the SQL-Server ODBC driver must also be installed.
Other databases configuration
Not implemented yet, but just ask for them!!!
Database usage
plpipes.database provides a set of functions for accessing the
databases declared in the configuration.
Most of the functions provided accept an optional db argument, for
selecting the database instance. When db is omitted, work is used
as the default.
For example:
from plpipes.database import query, create_table
df = query("select * from order when date >= :ld", {'ld': '2018-01-01'}, db='input')
create_table('recent_orders', df, db='output')
A list of the most commonly used functions from plpipes.database
follows:
query
query(sql, parameters=None, db='work')
Submits the query to the database and returns a pandas dataframe as the result.
read_table
read_table(table_name, db="work", columns=None)
Reads the contents of the table as a dataframe.
The columns to be loaded can be specified with the columns optional
argument.
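For instance (table and column names are hypothetical):

from plpipes.database import read_table

# Load only two columns of the "customers" table from the input database.
df = read_table("customers", db="input", columns=["id", "name"])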
execute
execute(sql, parameters=None, db='work')
Runs a SQL statement that does not generate a result set.
execute_script
execute_script(sql_script, db='work')
Runs a sequence of SQL statements.
This method is in an unstable state, waiting for a proper implementation to happen :-)
create_table
create_table(table_name, df, db="work", if_exists="replace")

create_table(table_name, sql, parameters=None, db="work", if_exists="replace")
This method can be used to create a new table either from a dataframe or from a SQL query.
copy_table
copy_table(source_table_name, dest_table_name=source_table_name,
source_db="work", dest_db="work", db="work",
if_exists="replace", **kws)
Copies table source_table_name from database source_db into
dest_table_name at database dest_db.
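For instance, a hedged sketch copying a hypothetical table from the work database into the output one:

from plpipes.database import copy_table

# "predictions" is an illustrative table name; it keeps its name in the
# destination database because dest_table_name defaults to the source name.
copy_table("predictions", source_db="work", dest_db="output")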
update_table
update_table(source_table_name, dest_table_name=source_table_name,
source_db="work", dest_db="work", db="work",
key=None, key_dir=">=")
Updates table dest_table_name at database dest_db with the
missing rows from source_table_name at source_db.
key points to a column with monotonic values which is used to
identify the new rows in the source table.
key_dir indicates whether the key column monotony is strictly
ascending (>), ascending (>=), descending (<=) or strictly
descending (<).
For instance, for a date column, whose values always increase but which may have duplicates, the right value is >=. In other words, the operator answers the question "how do the new values compare to the ones already in the table?"
begin
with begin(db='work') as conn:
    df = conn.query(sql1)
    conn.execute(sql2)
    ...
This method returns a database connection with an open transaction.
The transaction is automatically committed when the with block finishes, unless an exception is raised; in that case, a rollback is performed.
Connection class
The connection class is returned by calling begin.
connection(db='work')
Returns a SQLAlchemy connection (created by begin).
Also useful for integrating plpipes with other third party modules
or for using other SQLAlchemy methods not directly wrapped by
plpipes.
Database backends
Besides pandas, which is the de-facto standard for representing tabular data in the Python Data Science context, there are other libraries that for certain problems may be more suitable (for instance, geopandas for the manipulation of geo-referenced data).
PLPipes has a set of pluggable backends controlling how data from the database is serialized/deserialized into the different DataFrame implementations.
So far, backends for pandas and geopandas are provided. Others for polars, spark, vaex or dask will be added as the need arises.
In any case, note that changing the backend usually also requires changing the code that uses the dataframes, as every library provides its own similar but incompatible API.
Every backend may also accept custom keyword arguments. See Backend specifics below.
Picking the backend
For database write operations (i.e. create_table), plpipes can infer which backend to use just by looking at the dataframe object type, so as long as the backend is loaded, plpipes will use the right one automatically.
The function plpipes.database.load_backend can be used to load a
specific backend into a database driver:
plpipes.database.load_backend("geopandas", db="input")
Currently, under the hood, backends are attached to the driver class, so once a backend is loaded for, say, an azure_sql database, every other database using that driver will have the backend available for write operations.
In the case of read operations, there is no way for plpipes to infer
the desired backend and so it must be stated explicitly in one of the
following ways:
- Passing it as an argument to the database read functions (i.e. read_table, query, query_chunked and query_group). For instance:

  df = plpipes.database.query(sql, backend="spark")

- In the database connection configuration. For instance:

  db:
    instance:
      work:
        backend: polars

- Every database driver can set its own default. For instance, currently, the spatialite driver sets geopandas as its default backend.
Read operations transparently call load_backend as needed. The
default backend is also loaded automatically when the database is
initialized.
Backend specifics
pandas backend
This is the default backend.
geopandas backend
The geopandas backend can handle both geopandas and regular
pandas dataframes.
In read operations, the argument geom_col must be used to indicate
which column contains the geometric data.
If the argument is omitted, the backend returns a regular pandas dataframe.
Example:
df = db.query("select * from countries", geom_col="geometry")
In order to read geometric data from the database, the backend may mangle the query to transform the geometric column values into the right format for the geopandas.read_postgis method. Specifically, in the case of Spatialite, it wraps the geometric column in the query as Hex(ST_AsBinary(geom_col)).
Alternatively, and in order to avoid such processing, the wkb_geom_col argument can be used instead. In that case, it is the programmer's responsibility to write a query returning the values in that column in a format supported by geopandas (wkb stands for Well Known Binary).
Actions
Actions are the atomic units of work that when combined allow one to perform the tasks required by the project.
They are defined inside the actions directory in a hierarchical way.
There are several types of actions predefined and also new ones can be added.
Actions are declared with a configuration file with the name of the
action, for instance preprocessor.yaml.
Inside this configuration file the action type must be declared using
the type setting. For instance:
type: python_script
Alternatively, plpipes can autodetect an action type when it finds a
file with the action name and some recognized extension (for example,
model_training.py). In that case the configuration file is not
required.
The list of currently supported action types follows:
python_script
Extension: .py
The python code in the file is executed.
The following objects are directly available in the script:
- plpipes: the main plpipes package.
- cfg: the configuration object.
- action_cfg: the action configuration (read from the action yaml file or from the global configuration).
- db: a shortcut for the plpipes.database package.
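As an illustration, a hypothetical actions/preprocess.py could use those injected objects along these lines (table names are made up):

# actions/preprocess.py -- illustrative sketch, not part of the framework
import logging

raw = db.query("select * from raw_sales", db="input")
clean = raw.dropna()
db.create_table("clean_sales", clean)

logging.info("clean_sales rows: %d (env: %s)", len(clean), cfg["env"])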
sql_script
Extension .sql
Runs the SQL statements in the action file against the work database.
The SQL code is preprocessed using Jinja. That feature can be used, for instance, to set values from the configuration:
CREATE TABLE foo AS
SELECT * FROM bar
WHERE data >= "{{ cfg["data.limits.date.low.cutoff"] }}"
Currently this action type is only supported when work is backed by
a SQLite database.
sql_table_creator
Extension .table.sql
Runs the SQL query in the file and stores the output data frame in a new table with the name of the action.
Jinja is also used to preprocess the SQL statement.
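For instance, a hypothetical actions/recent_orders.table.sql (the table name and configuration key are made up) would create a recent_orders table:

SELECT *
FROM orders
WHERE date >= "{{ cfg["data.limits.date.low.cutoff"] }}"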
prql_script
Extension: .prql
PRQL (Pipelined Relational Query Language) is an alternative query language for relational databases.
This action runs the PRQL sentences in the file against the work
database.
Jinja is used to preprocess the PRQL statement.
Currently this action type is only supported when work is backed by a SQLite database.
prql_table_creator
Runs the PRQL query in the file and stores the output data frame in a new table with the name of the action.
Jinja is also used to preprocess the PRQL statement.
quarto
Extension: .qmd
Processes the file using quarto.
The following configuration options can be used:
- dest:
  - key: any of work, input or output.
  - dir: destination directory where the generated files are stored.
  - file: destination file name. Defaults to the action name with the extension associated to the output format.
  - format: output format.
- The action configuration can also be included directly in the qmd yaml header, under the plpipes branch.
sequence
Runs a set of actions in sequence.
The list of actions to be run are declared as an array under the
sequence setting.
Relative action names (starting by a dot) are also accepted.
Example yaml configuration:
type: sequence
sequence:
- .bar
- miau.gloglo
loop
The loop action is a construct for creating action loops.
It runs a set of subactions in sequence repeatedly according to specified iterators, enabling one to perform repetitive operations following several strategies and with varying parameters.
The configuration specifies the subactions to be executed in the loop and the iterators that control the iterations. These are the accepted keys:
- sequence: Specifies the names of the subactions to be executed in the loop. The subactions will be executed in the order specified.
- iterator: Specifies the iterators to be used for the loop. Each iterator is defined by a key and its corresponding configuration, which includes the type of the iterator and any required parameters. The supported iterator types are:
  - values: Iterates over a list of specific values.
  - configkeys: Iterates over the keys of a specific path in the configuration.
- ignore_errors (optional): If set to true, any errors that occur during an iteration will be logged but will not stop the loop. If not specified or set to false, an error during iteration will raise an exception and halt the loop.
Sample configuration:
loop:
  sequence:
    - subaction1
    - subaction2
    - subaction3
  iterator:
    one:
      type: values
      values:
        - value1
        - value2
        - value3
    two:
      type: configkeys
      path: my_config.path
  ignore_errors: true
Runner
The purpose of the runner is to offer a unified entry point for the project actions and pipelines.
It extracts information from a set of environment variables and also parses command line arguments in a standard way.
Command line arguments
The accepted command line arguments are as follows:
- -d, --debug: Sets the logging level to debug.
- -c file, --config file: Reads configuration settings from the given file.
- -s key=value, --set key=value: Sets the given configuration entry. For instance: -s fs.output=/var/storage/ai-output.
- -S key=value, --set-json key=value: Parses the given value as JSON and sets the related configuration entry.
- -e env, --env env: Defines the deployment environment.
- action1 action2 ...: Set of actions to execute.
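For instance, a typical invocation (the action names are illustrative) could look like this:

python bin/run.py -d -e pre -s fs.output=/tmp/output preprocess train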
Environment variables
The following environment variables can be used to configure the framework:
- PLPIPES_ROOT_DIR: The project root directory.
- PLPIPES_ENV: The deployment environment (usually DEV, PRE or PRO).
- PLPIPES_LOGLEVEL: The default log level (debug, info, warning or error).
Under the hood
The runner has two parts: the little run.py script, which is just a wrapper for plpipes.runner.main, and the latter, which is the real thing!
run.py is required because plpipes uses that program path to
locate the project root directory and the rest of the files.
Custom scripts
Sometimes you may need to create some custom script out of the actions structure. In that case, you can write it as a custom runner in the following manner:
import sys
import plpipes.runner

# Get a preinitialized argument parser
arg_parser = plpipes.runner.arg_parser()
# Add new options to the argument parser if needed
arg_parser.add_argument(...)
# Parse arguments and initialize plpipes
opts = plpipes.runner.parse_args_and_init(arg_parser, sys.argv)
# Your code goes here!!!
For simple cases, where no extra arguments are going to be needed, the
framework also provides a simple_init function:
import plpipes.runner
plpipes.runner.simple_init()
# Your code goes here!!!
Note also that PLPipes uses the script name (actually, the stem) as a key when loading configuration files, so it is possible to have different configurations for different scripts, loaded automatically. See the File structure section under Configuration above.
Logging
The standard Python logging framework is initialized by PLPipes and can be used directly from action code.
If you need some particular configuration not yet supported, just ask for it!
Also, take into account that some Python frameworks (for instance, Tensorflow or OpenVINO) unconditionally change or overload Python logging on their own.
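For instance, inside an action you can just call the standard logging functions:

import logging

# A copy of these messages also ends up in the log file that PLPipes
# creates automatically (see Automatic file logging below).
logging.info("preprocessing started")
logging.debug("this message only shows up when the debug level is enabled")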
Automatic file logging
After initialization, plpipes automatically creates a new file logger which saves a copy of the log in the logs directory. On operating systems supporting symbolic links, it also creates a link named logs\last_log.txt.
Cloud services
PLPipes provides shortcuts for accessing common cloud services.
Azure
Authentication
Package plpipes.cloud.azure.auth provides an easy way to manage
Azure credentials.
API
Credential objects of type azure.identity.MsalCredential can be
retrieved using function credentials as follows:
import plpipes.cloud.azure.auth
cred = plpipes.cloud.azure.auth.credentials("predictland")
Configuration
Authentication accounts are declared in the configuration files and instantiated by the module on demand (which for some kind of authentication methods may require user interaction).
For instance, the following configuration snippet defines the
authorization account predictland.
cloud:
  azure:
    auth:
      predictland:
        type: interactive_browser
        tenant_id: 01234567-89ab-cdef-0123-456789abcdef
        client_id: 01234567-89ab-cdef-0123-456789abcdef
        client_secret: super-super-super-super-secret
        authentication_callback_port: 8283
        username: elvis@predictland.com
        scopes:
          - "https://graph.microsoft.com/.default"
The meaning of every key is as follows:
- type: indicates the type of authentication to be used. It defaults to InteractiveBrowserCredential.
- scopes: the list of scopes (groups of permissions) to be requested. This entry is optional, as most Azure services would re-ask for the credentials with the scopes they need.
Every driver may also accept and/or require additional configuration entries:
interactive_browser:
Launches a browser and lets the user authenticate using her account. Credentials are cached when possible.
- client_id and client_secret: the application credentials, which must be registered in Azure Active Directory (AAD). See Register Application at the MS Learn website.
- tenant_id: the tenant where the application has been registered.
- username: expected user, optional. Note that when Azure shows the login page to the user, it allows her to log in with any account registered in the tenant AD. When this option is used, the framework ensures that the user logs in with the expected one; otherwise it throws an error.
- authentication_callback_port: The framework starts an HTTP server at the given port in order to receive the data from the user browser (afterwards it is stopped). The port must be the same used to register the application in AAD.
az_cli:
Uses Azure command line client (az) for authentication.
Accepted entries are as follows:
- private (defaults to true): whether to use a private az configuration for this login or the global one for the user. In the latter case, the global configuration must be initialized by the user calling az login.
Google Cloud
Note: This is a work in progress
Authentication
API
Credential objects of type google.auth.credentials.Credentials can
be retrieved using function credentials as follows:
import plpipes.cloud.cloud.auth
cred = plpipes.cloud.cloud.auth.credentials("predictland")
Configuration
Authentication accounts are declared in the configuration files and instantiated by the module on demand (which for some kind of authentication methods may require user interaction).
For instance, the following configuration snippet defines the
authorization account predictland.
cloud:
  google:
    auth:
      predictland:
        type: oauth2
        scopes:
          - "https://www.googleapis.com/auth/cloud-platform"
        ...
The meaning of every key is as follows:
type: name of the authentication backend.scopes: list of scope for which access is being requested.
Every backend requires a different set of additional options:
oauth2
-
installed: the additional entries required bygoogle_auth_oauthlib.flow.InstalledAppFlow.from_client_config(client_id,project_id,auth_uri,token_uri,auth_provider_x509_cert_url,client_secretand `redirect_uris').Those options can be retrieved from the JSON file generated by GoogleCloud when a new OAuth2 installed applications is registered (GoogleCloud Console → APIs & Services → Credentials → Create Credentials → OAuth Client ID → Desktop App → Download JSON).
Example:
google:
  auth:
    predictland:
      type: oauth2
      installed:
        client_id: "..."
        project_id: "predictland"
        auth_uri: "https://accounts.google.com/o/oauth2/auth"
        token_uri: "https://oauth2.googleapis.com/token"
        auth_provider_x509_cert_url: "https://www.googleapis.com/oauth2/v1/certs"
        client_secret: "..."
        redirect_uris: ["http://localhost"]
      scopes:
        - https://www.googleapis.com/auth/cloud-platform
Microsoft Graph
Interaction with MS Graph API, which provides access to OneDrive,
SharePoint and Teams resources, is available through the package
plpipes.cloud.azure.graph.
API
graph(account_name): returns an object of class msgraph.core.GraphClient. Note that the Python Azure SDK is still in beta, in a state of flux, and so this method may return objects of a different class in the future.
fs(account_name): returns an object that allows one to access MS Graph resources as a file system.
File-system view
The file system facade class exposes MS Graph resources as a file system.
Resources are exposed under different routes as follows:
- me: Business user OneDrive drive.
- groups: Teams group drives.
The file system objects returned by fs support the following
methods:
- go(path): You can think of this method as a change dir (cd) operation, with the particularity that it also allows one to descend into file-like resources. The returned value is a new file system object with its root at path.
- ls(path): Returns a dictionary of file-name and entry pairs representing the entries under the directory path.
- names(path): Similar to ls but returns only the names of the entries.
- is_file() and is_dir(): Determine whether the current file system object is pointing to a file or a directory respectively.
- get(path="", dest=None, dir=None, name=None): downloads the remote object pointed to by the current file system object. When dest is given, it is used as the local destination path. Alternatively, when dest is not given, the dir and name values (or their defaults) are used to construct the local destination path. name defaults to the remote file name; dir defaults to the working directory (i.e. cfg['fs.work']).
- rget(path="", dest=None, dir=None, name=None): recursively downloads the remote object (typically a directory) to the local file system.
Example usage:
import plpipes.cloud.azure.graph
fs = plpipes.cloud.azure.graph.fs("predictland")
group_drive = fs.go("groups/HAL/General")
group_drive.rget("input-data")
Configuration
Currently, the only supported configuration parameter is credentials, which must be the name of an Azure authentication account defined under cloud.azure.auth. When not given, it defaults to the one of the same name.
cloud:
  azure:
    graph:
      predictland:
        credentials: predictland
OpenAI (ChatGPT)
PLPipes provides a very thin wrapper for the openai package.
Currently, it just automates the authentication side, reading the API key from the configuration and setting it on the client package.
import plpipes.cloud.openai as openai
completion = openai.Completion.create(...)
If used outside actions, it should be taken into account that the PLPipes config subsystem must be initialized before importing plpipes.cloud.openai.
Configuration
cloud:
  openai:
    auth:
      api_key: YOUR-SECRET-KEY-GOES-HERE
Network
Clients
PLPipes provides a set of easy-to-use network clients for downloading data from several institutions.
This is still a work in progress.
Eurostat
U.S. Energy Information Administration (EIA)
An API key must be obtained from https://www.eia.gov/opendata/register.php and added into the configuration as net.client.eia.api_key.
European Central Bank (ECB)
U.S. Bureau of Labor Statistics (BLS)
An API key must be obtained from https://data.bls.gov/registrationEngine/ and added into the configuration as net.client.us_bls.api_key.
Jupyter integration
PLPipes includes an IPython extension which exposes the framework functionality in Jupyter notebooks.
Initialization
The extension is loaded by adding the following lines at the beginning of your notebook:
%load_ext plpipes.jupyter
%plpipes {stem}
Where {stem} is the name used as the main key when looking for
configuration files (defaults to jupyter).
In order to find the project configuration, the extension looks into
the environment variable PLPIPES_ROOT_DIR. If that variable is not
defined then it looks for a config directory in the current working
directory of the IPython kernel (usually the directory from where
jupyter-lab was launched) and walks up the file system until such
directory is found.
Once the extension is loaded and initialized, the features described in the following sections can be used.
Variable, packages and method shortcuts
The following variables and methods are made available in the session:
- cfg: The configuration object.
- input_dir, work_dir and output_dir: pathlib objects pointing to the input, work and output directories. For instance:

  df = pandas.read_csv(input_dir / "data001.csv")

- db: a shortcut for plpipes.database.
- create_table and query: shortcuts for the functions of the same name in plpipes.database.
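For instance, once the extension is initialized, a notebook cell could read an input file and store it in the work database (file and table names are illustrative):

import pandas

df = pandas.read_csv(input_dir / "data001.csv")
create_table("raw_data", df)
query("select count(*) as n from raw_data")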
SQL integration
The IPython SQL extension (see https://pypi.org/project/ipython-sql/)
is automatically loaded and the configured PLPipes work database set
as the default one.
Other databases configured in PLPipes can be selected using a double
at sign (@@) followed by the database name. For instance:
%%sql @@input
select * from customers
limit 100
Packing plpipes
Currently, plpipes is packed with
flit (which can be installed with
the usual pip command: pip install flit).
A python wheel file for plpipes is generated running the following
command from inside plpipes root directory:
flit build
The generated wheel file is placed inside dist. That file is a
standard (pure) Python package that can be installed anywhere. For
instance:
pip install ../PL-TEC-PLPipes/dist/plpipes-0.1-py2.py3-none-any.whl
FAQ
Design
Why is the database used to pass data between actions? Isn't that inefficient?
Usually it is not.
Both SQLite and DuckDB are pretty fast reading and writing data so that the database trip is very rarely the bottleneck.
Actually, if you are able to delegate the data transformation tasks to the database (writing SQL code or using some front-end such as ibis), they will perform way faster than the equivalent pandas code.
Coming back to the why. Using a database has several additional benefits:
- It is quite easy to inspect intermediate data: just point your favorite SQL GUI (for instance, DBeaver) at the database and look at the tables you want to see.
- It allows the programmer to easily add pre- and post-condition checking scripts which unintrusively validate the data before and after every action is run (planned).
- It allows one to switch between functionally equivalent actions easily. For instance, in order to add support for some new algorithm into a project, all that is required is to develop the new model-training action and to plug it into some pipeline.
- It becomes easier for new people to get to work on the project, as they only need to understand the data in the tables they are going to work with.
- It is easy to establish guidelines about documenting the intermediate information structure (something that never happens for in-process pipelines).
How should I break my program into actions?
Well, the truth is we are still learning about the best ways to structure data science projects around actions!
Typically, there are three clear parts in a Data Science project:
- Data preprocessing
- Model training and validation
- Predicting
Though, sometimes, it doesn't make sense to split the training and the prediction stages. For instance, when the model needs to be retrained every time as it happens with time series data.
Then every one of the parts above may be broken into several sub-actions. For instance, as part of the preprocessing we would have a data-retrieving action (maybe composed of several sub-actions as well), and then two more actions for converting from bronze-quality data first to silver and then to gold (see the Medallion architecture).
Then, inside the model training, we could still have some data manipulation actions in order to adapt the generic gold format to the format required by the specific model, then an action that trains and saves the model to disk, and finally some action that calculates some KPIs.
Otherwise, maybe for that particular algorithm it is easier to do the data preparation, training and evaluation in just one action.
Note also that actions are not the only available abstraction to be used with PLPipes. Code can be organized as regular Python modules inside the lib directory and called from multiple actions.

In summary, common sense should be applied. Actions should not be a straitjacket, but just another element in your tool-set!