Quality Aware Feature Store.
Simple and scalable feature store with data quality checks.
Feature stores aim to solve the data management problems that arise when building machine learning applications. However, data quality is usually something data teams have to integrate and manage as a separate component. This project joins both concepts: it keeps data quality checks closely coupled with the data transformations, requires a minimal data verification check for every feature, and lets the data and transformation checks evolve over the course of a project.
To that end, qafs depends heavily on pandera to build the data validations.
Features
- Pandas-like API
- Feature information stored in a database along with metadata.
- Dask to process large datasets in a cluster environment.
- Data is stored as time series in Parquet format, on a filesystem or in object storage services.
- Store transformations as features.
Get Started
Install the Python package with pip:
$ pip install qafs
Below is an example of using qafs, in which we create a feature store and register a numbers feature and a squared feature transformation. First we import the packages and create the feature store; for this example we use a SQLite database and persist the features on the filesystem:
import qafs
import pandas as pd
import pandera as pa
from pandera import Check, Column, DataFrameSchema
from pandera import io
fs = qafs.FeatureStore(
    connection_string='sqlite:///test.sqlite',
    url='/tmp/featurestore/example'
)
Features can be stored in namespaces, which helps organize the data. When creating numbers we specify 'example/numbers' to place the feature numbers in the namespace example; we could equally use the arguments name='numbers', namespace='example'. Then we specify the data validation using pandera, declaring that the feature is an integer and that its values should be greater than 0:
fs.create_namespace('example', description='Example datasets')
fs.create_feature(
    'example/numbers',
    description='Timeseries of numbers',
    check=Column(pa.Int, Check.greater_than(0))
)
dts = pd.date_range('2020-01-01', '2021-02-09')
df = pd.DataFrame({'time': dts, 'numbers': list(range(1, len(dts) + 1))})
fs.save_dataframe(df, name='numbers', namespace='example')
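The two addressing styles used above ('example/numbers' versus name='numbers', namespace='example') refer to the same feature. A minimal sketch of that equivalence follows; split_feature_path is a hypothetical helper written for this illustration and is not part of the qafs API.

```python
from typing import Optional, Tuple


def split_feature_path(name: str, namespace: Optional[str] = None) -> Tuple[str, str]:
    """Hypothetical helper: return (namespace, name) for either addressing style."""
    if namespace is None:
        # 'example/numbers' -> namespace 'example', name 'numbers'
        namespace, sep, name = name.partition("/")
        if not sep:
            raise ValueError("expected 'namespace/name' or an explicit namespace")
    return namespace, name


# Both styles resolve to the same (namespace, name) pair.
path_style = split_feature_path("example/numbers")
keyword_style = split_feature_path("numbers", namespace="example")
```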
To register our squared transformation feature we use the fs.transform decorator, which fetches the data from the numbers feature, and we apply the same data validation as for numbers:
@fs.transform(
    'example/squared',
    from_features=['example/numbers'],
    check=Column(pa.Int, Check.greater_than(0))
)
def squared(df):
    return df ** 2
When we fetch our features we should see:
df_query = fs.load_dataframe(
    ['example/numbers', 'example/squared'],
    from_date='2021-01-01',
    to_date='2021-01-31'
)
print(df_query.tail(1))
##----
#             example/numbers  example/squared
# time
# 2021-01-31              397           157609
##----
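The values in that last row can be checked by hand: the example numbers each day starting from 1 on 2020-01-01, so the value for 2021-01-31 is its day offset plus one. The sketch below redoes that arithmetic with the standard library only and does not touch the feature store.

```python
from datetime import date

start = date(2020, 1, 1)   # first date in the example range
last = date(2021, 1, 31)   # last date returned by the query

# Days are numbered from 1, so add 1 to the day offset (2020 is a leap year).
number = (last - start).days + 1
squared = number ** 2

print(number, squared)  # 397 157609
```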
Contributing
Please follow the Contributing guide.
License
This project started with the ByteHub feature store as its base and is released under the same license.