Standardise model training across libraries
Pytalog
This project focuses on building data catalogs: you define a Jinja-templated YAML file and use it to read in your data. This has a few advantages:
- A single way of reading (and writing) your datasets, without having to know how exactly they are read, allows people to more easily onboard on new projects, since all datasets are defined in the catalog YAML.
- Templating allows you to swap between different environments, configurations, etc. without having to change your catalog file. This setup is hierarchical, allowing top-level config yamls to template lower levels, creating a manageable hierarchy of configuration.
Usage
The Catalog file
You can define a catalog YAML file as a starting point. The structure is as follows:
<dataset name>:
  callable: <python path to class, method on a class, or function. Should produce a pytalog.base.data_sources.DataSource>
  args:
    <name of argument 1>: <value 1>
    <name of argument 2>: <value 2>
  validations:
    - callable: <python path to a ValidatorObject class or function that takes a dataset and kwargs>
      args:
        <name of argument>: <value>
The core consists of:
- The name of the dataset. This will be used later to read in your data.
- The callable. This is a path to a Python importable class or function. When called with the arguments in args, it should return a DataSource.
- The args. These are values that will be passed to the callable to create the final DataSource. Values can be a plain value (like string, bool, etc.) or a complex object. Complex objects follow the same callable-args structure as DataSources, but they don't need to be DataSources: they can be arbitrary Python objects instead.
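As a rough illustration of how a callable-args entry could be turned into an object, here is a minimal sketch using only the standard library. The helpers `resolve_callable` and `build` are hypothetical names chosen for this example and are not part of pytalog; the library's own loading logic may differ.

```python
import importlib
from typing import Any


def resolve_callable(path: str) -> Any:
    """Import the object named by a dotted path such as 'datetime.timedelta'."""
    parts = path.split(".")
    # Try the longest importable module prefix, then walk the remaining attributes.
    for i in range(len(parts), 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:i]))
        except ImportError:
            continue
        for attr in parts[i:]:
            obj = getattr(obj, attr)
        return obj
    raise ImportError(f"Cannot resolve {path!r}")


def build(spec: Any) -> Any:
    """Recursively turn {'callable': ..., 'args': ...} entries into objects."""
    if isinstance(spec, dict) and "callable" in spec:
        kwargs = {name: build(value) for name, value in spec.get("args", {}).items()}
        return resolve_callable(spec["callable"])(**kwargs)
    if isinstance(spec, dict):
        return {key: build(value) for key, value in spec.items()}
    if isinstance(spec, list):
        return [build(item) for item in spec]
    return spec  # plain values (str, bool, float, ...) pass through unchanged


# A nested entry: the 'window' argument is itself a callable-args structure.
spec = {
    "callable": "types.SimpleNamespace",
    "args": {
        "name": "demo",
        "window": {"callable": "datetime.timedelta", "args": {"days": 2}},
    },
}
obj = build(spec)
```

Because arguments are built recursively, a nested entry is instantiated before the callable that receives it, which matches the description of complex objects above.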
DataSources are not the data itself: they are the method for obtaining the data. For instance, a SQL-based DataSource would contain the SQL query, database connection string, etc., but it would only execute the query when you actually call the .read() method on the class. This allows you to make a very lightweight and quickly imported catalog that has access to all the data in your project.
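The lazy behaviour described here can be sketched with a small stand-in class. `LazySqlSource` is a hypothetical example built on the standard library's `sqlite3`; it is not pytalog's SqlSource, and only the idea of deferring work to `.read()` is taken from the text.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class LazySqlSource:
    """Hypothetical source: stores how to fetch rows, not the rows themselves."""

    sql: str
    database: str

    def read(self) -> list[tuple]:
        # The connection is opened and the query executed only here.
        connection = sqlite3.connect(self.database)
        try:
            return connection.execute(self.sql).fetchall()
        finally:
            connection.close()


# Constructing the source is cheap and touches no database.
source = LazySqlSource(sql="select 1 + 1", database=":memory:")

# Only this call runs the query.
rows = source.read()
```

Since construction does no I/O, a catalog holding many such sources can be imported quickly, and the cost is paid only for the datasets that are actually read.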
The data validations are an optional part. If present, these will be executed every time you read the dataset. Consider them a useful tool to preemptively find potential issues with your data before doing anything else. They should all take one dataset as input, and extra keyword arguments (defaults) can be provided using the args part.
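A function-style validation could look roughly like the sketch below. Both `check_min_rows` and `read_with_validations` are hypothetical names for illustration, and the dataset is a plain list of dicts to keep the example dependency-free; how pytalog itself wires validations into a read is not shown in this document.

```python
from typing import Any, Callable


def check_min_rows(data: list[dict], min_rows: int = 1) -> None:
    """Hypothetical validation: fail when the dataset has too few rows."""
    if len(data) < min_rows:
        raise ValueError(f"expected at least {min_rows} rows, got {len(data)}")


def read_with_validations(
    read: Callable[[], Any],
    validations: list[tuple[Callable[..., None], dict]],
) -> Any:
    """Read the data, then run every validation with its extra keyword arguments."""
    data = read()
    for validate, kwargs in validations:
        validate(data, **kwargs)
    return data


rows = read_with_validations(
    read=lambda: [{"x": 1.0, "y": "a"}, {"x": 2.0, "y": "b"}],
    validations=[(check_min_rows, {"min_rows": 2})],
)
```

The keyword arguments paired with each validation play the role of the args part in the catalog entry: they override the function's defaults.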
An example of a catalog is the following:
pandas_sql:
  # please make sure the class is an importable callable
  # this can be a class constructor or a function.
  callable: pytalog.pd.data_sources.SqlSource
  args:
    sql: select * from database.table
    con: http://<your database url>
dataframe:
  # This will create a fixed DataFrame to read using this source.
  callable: pytalog.pd.data_sources.DataFrameSource
  args:
    df:
      # We can instantiate complex objects in the same way
      # DataSources are instantiated:
      callable: pandas.DataFrame
      args:
        data:
          x:
            - 1.0
            - 2.0
          y:
            - "a"
            - "b"
These catalog can be
Reading in a Catalog
Say you have stored a catalog.yaml file like the one above. Then you can create a catalog and read data from it using:
from pytalog.base.catalog import Catalog
catalog = Catalog.from_yaml("catalog.yaml")
# To read a dataset:
data = catalog.read("dataframe")
Adding configuration
It is quite common to have different locations, paths, URLs, etc. depending on a dev/prod environment configuration. You can also imagine other parts of your solution that change where data can be found or how it should be read in, like a country-level configuration. For this, pytalog offers Jinja templating:
# The catalog:
pandas_sql:
  # please make sure the class is an importable callable
  # this can be a class constructor or a function.
  callable: pytalog.pd.data_sources.SqlSource
  args:
    sql: select * from database.table
    con: {{ database_connection }}
dataframe:
  # This will create a fixed DataFrame to read using this source.
  callable: pytalog.pd.data_sources.DataFrameSource
  args:
    df:
      # We can instantiate complex objects in the same way
      # DataSources are instantiated:
      callable: pandas.DataFrame
      args:
        data:
          x:
            - 1.0
            - {{dataframe.x}}
          y:
            - "a"
            - "b"
And in Python:
from pytalog.base.catalog import Catalog

catalog = Catalog.from_yaml(
    "catalog.yaml",
    parameters={
        "database_connection": "http://connection",
        "dataframe": {"x": 10},
    },
)
# To read a dataset:
data = catalog.read("dataframe")
The values you provide in parameters are automatically templated into the catalog file before loading it in. Keep in mind that this method is limited to the complexity that Jinja allows: complex Python objects can't be effectively passed to the parameters argument since they'll be stringified and inserted into the YAML file before being read back in.
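To make the stringification limit concrete, the sketch below uses a tiny regex-based `render` function as a stand-in for Jinja (it is not Jinja and handles only simple `{{ name }}` placeholders). `Connection` is a hypothetical class standing in for any object that only exists in Python.

```python
import re


class Connection:
    """Hypothetical stand-in for an object that only exists in Python."""


def render(template: str, parameters: dict) -> str:
    """Tiny stand-in for Jinja: replace each {{ name }} with str(value)."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda match: str(parameters[match.group(1)]),
        template,
    )


# A plain value survives the trip through text: the YAML reads "x: 10".
simple = render("x: {{ value }}", {"value": 10})

# An object becomes its string form, which the YAML loader sees as plain text.
stringified = render("con: {{ connection }}", {"connection": Connection()})
```

The second result contains only the object's default string representation, so nothing reading the rendered file can recover the original object from it.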
The Configuration object
The parameters argument still leaves you with the responsibility of reading in the parameters, doing any hierarchical processing, and then passing it all on to the Catalog object. To solve this, there is the Configuration object, which you can provide with multiple configuration YAML files and a catalog file. In turn, it provides you with the hierarchically templated configuration object and the resulting Catalog:
# Environment config
env: dev
database_connection: http://dev-database.now

# Base configuration that uses info from the environment config
database_name: database-{{ env }}
dataframe:
  x: 2.5

# The catalog:
pandas_sql:
  # please make sure the class is an importable callable
  # this can be a class constructor or a function.
  callable: pytalog.pd.data_sources.SqlSource
  args:
    sql: select * from {{ database_name }}.table
    con: {{ database_connection }}
dataframe:
  # This will create a fixed DataFrame to read using this source.
  callable: pytalog.pd.data_sources.DataFrameSource
  args:
    df:
      # We can instantiate complex objects in the same way
      # DataSources are instantiated:
      callable: pandas.DataFrame
      args:
        data:
          x:
            - 1.0
            - {{dataframe.x}}
          y:
            - "a"
            - "b"
This can be loaded in using the Configuration object:
from pytalog.base.configuration.config import Configuration
# This will contain 2 objects:
# - a config dictionary, for all (parsed) configuration
# - a Catalog object based on the catalog.yaml
cf = Configuration.from_hierarchical_config(
    parameters_paths=["environment.yaml", "base.yaml"],
    catalog_path="catalog.yaml",
)
catalog = cf.catalog
# To read a dataset:
data = catalog.read("dataframe")
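The hierarchical idea, where earlier configuration files template later ones and the combined result templates the catalog, can be sketched as follows. This is an illustration only: `resolve_layers` is a hypothetical helper, the layers are plain dicts of strings rather than YAML files, and a simple regex replaces Jinja.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def resolve_layers(layers: list[dict[str, str]]) -> dict[str, str]:
    """Render each layer with the values collected from the layers before it."""
    resolved: dict[str, str] = {}
    for layer in layers:
        rendered = {
            key: PLACEHOLDER.sub(lambda m: resolved[m.group(1)], value)
            for key, value in layer.items()
        }
        resolved.update(rendered)
    return resolved


# Mirrors the environment and base configuration shown above.
environment = {"env": "dev", "database_connection": "http://dev-database.now"}
base = {"database_name": "database-{{ env }}"}
config = resolve_layers([environment, base])

# The fully resolved configuration then templates a catalog value.
catalog_sql = PLACEHOLDER.sub(
    lambda m: config[m.group(1)], "select * from {{ database_name }}.table"
)
```

Order matters in this sketch: a layer can only refer to names defined in layers listed before it, which is why the environment file comes first.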
If you need any objects that are hard to instantiate in this way, you can also provide them as parameters to either Catalog's from_yaml or Configuration's from_hierarchical_config using the initialised_parameters argument. This allows you to provide a dictionary of Python objects that will be inserted into the argument list of any callable that requires one of them and doesn't have an argument for it yet at instantiation.
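One way to picture this behaviour is a helper that inspects a callable's signature and fills in only the parameters it accepts and has not yet been given. `call_with_initialised` and `make_source` are hypothetical names for this sketch; pytalog's actual matching rules may differ.

```python
import inspect
from typing import Any, Callable


def call_with_initialised(
    func: Callable[..., Any],
    args: dict[str, Any],
    initialised_parameters: dict[str, Any],
) -> Any:
    """Call func with args, filling still-missing parameters from ready-made objects."""
    accepted = inspect.signature(func).parameters
    extra = {
        name: value
        for name, value in initialised_parameters.items()
        if name in accepted and name not in args
    }
    return func(**args, **extra)


def make_source(sql: str, con: object) -> dict[str, Any]:
    """Hypothetical callable that needs a live connection object."""
    return {"sql": sql, "con": con}


live_connection = object()
source = call_with_initialised(
    make_source,
    args={"sql": "select 1"},
    initialised_parameters={"con": live_connection, "unused": 123},
)
```

Here `con` is injected because `make_source` accepts it and the catalog-style args did not supply it, while `unused` is ignored because the callable has no such parameter.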
I'd recommend using the Configuration object to get started to give you as much flexibility as possible when using your catalog file.