Quality Aware Feature Store.

Simple and scalable feature store with data quality checks.

Feature stores aim to solve the data management problems that arise when building machine learning applications. However, data quality is usually something data teams must integrate and handle as a separate component. This project joins both concepts by keeping data quality closely coupled with data transformations: every feature requires at least a minimal data validation check, and the checks can evolve together with the data and transformations over the course of a project.

For that, qafs has a strong dependency on pandera to build the data validations.

Features

  • Pandas-like API.
  • Feature information stored in a database along with metadata.
  • Dask to process large datasets in a cluster environment.
  • Data is stored as time series in Parquet format, on a filesystem or in object storage services.
  • Transformations stored as features.

Get Started

Install the Python package with pip:

$ pip install qafs

Below is an example of using qafs, in which we create a feature store and register a numbers feature and a squared feature transformation. First we need to import the packages and create the feature store. This example uses a SQLite database and persists the features on the filesystem:

import qafs
import pandas as pd
import pandera as pa
from pandera import Check, Column, DataFrameSchema
from pandera import io


fs = qafs.FeatureStore(
    connection_string='sqlite:///test.sqlite',
    url='/tmp/featurestore/example'
)

Features can be stored in namespaces, which helps organize the data. When creating numbers we pass 'example/numbers' to place the feature numbers in the namespace example; the arguments name='numbers', namespace='example' can be used as well. Then we specify the data validation using pandera, declaring that the feature is an integer and that its values should be greater than 0:

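# Create a namespace to group related features
fs.create_namespace('example', description='Example datasets')
# Register the feature; values must be integers greater than 0
fs.create_feature(
    'example/numbers',
    description='Timeseries of numbers',
    check=Column(pa.Int, Check.greater_than(0))
)


# Build a daily time series: 1, 2, 3, ... starting on 2020-01-01
dts = pd.date_range('2020-01-01', '2021-02-09')
df = pd.DataFrame({'time': dts, 'numbers': list(range(1, len(dts) + 1))})

# Persist the data under example/numbers
fs.save_dataframe(df, name='numbers', namespace='example')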
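
The equivalence between the 'example/numbers' shorthand and the separate name and namespace arguments can be pictured as a simple string split. The sketch below is only an illustration of that idea: split_feature_path is a hypothetical helper, not part of the qafs API, and qafs may resolve paths differently internally.

```python
from typing import Optional, Tuple


def split_feature_path(path: str, namespace: Optional[str] = None) -> Tuple[Optional[str], str]:
    """Split 'namespace/name' into (namespace, name).

    Hypothetical helper for illustration only; not part of the qafs API.
    """
    if "/" in path:
        ns, name = path.split("/", 1)
        return ns, name
    # No slash in the path: fall back to the explicitly supplied namespace, if any.
    return namespace, path


print(split_feature_path("example/numbers"))               # ('example', 'numbers')
print(split_feature_path("numbers", namespace="example"))  # ('example', 'numbers')
```

Both calls identify the same feature, which is why either spelling can be used when creating or saving it.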

To register our squared transformation feature we use the fs.transform decorator, fetching the data from the numbers feature and applying the same data validation as numbers:

@fs.transform(
    'example/squared',
    from_features=['example/numbers'],
    check=Column(pa.Int, Check.greater_than(0))
)
def squared(df):
    return df ** 2
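
To make the idea of coupling a validation check to a transformation more concrete, here is a toy, pure-Python sketch. It is not how qafs is implemented: the transform decorator, the in-memory features dictionary, and the callable check are all assumptions made for illustration, and this version runs the transformation eagerly at registration time, which qafs may not do.

```python
from typing import Callable, Dict, List

# Toy in-memory "store": feature name -> list of values. Not the qafs storage layer.
features: Dict[str, List[int]] = {"example/numbers": [1, 2, 3]}


def transform(name: str, from_features: List[str], check: Callable[[int], bool]):
    """Hypothetical decorator: run func on the source feature, validate, then store."""
    def decorator(func: Callable[[List[int]], List[int]]):
        source = features[from_features[0]]
        result = func(source)
        # Refuse to register output that fails the check.
        bad = [v for v in result if not check(v)]
        if bad:
            raise ValueError(f"{name}: values failed validation: {bad}")
        features[name] = result
        return func
    return decorator


@transform("example/squared", from_features=["example/numbers"],
           check=lambda v: isinstance(v, int) and v > 0)
def squared(values):
    return [v ** 2 for v in values]


print(features["example/squared"])  # [1, 4, 9]
```

The point of the sketch is the design choice: a transformation cannot produce a stored feature without passing through its check, so validation cannot be skipped or forgotten.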

When we fetch our features we should see:

df_query = fs.load_dataframe(
    ['example/numbers', 'example/squared'], 
    from_date='2021-01-01',
    to_date='2021-01-31'
)
print(df_query.tail(1))
##----
#             example/numbers  example/squared
# time                                        
# 2021-01-31              397           157609
##----
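
The values in that last row can be checked by hand, since the numbers feature simply counts days starting at 1 on 2020-01-01. A small standard-library sketch of the arithmetic (independent of qafs):

```python
from datetime import date

start = date(2020, 1, 1)    # first timestamp, carrying the value 1
target = date(2021, 1, 31)  # last row of the query window

# Day count is inclusive of the start date, hence the + 1.
number = (target - start).days + 1
print(number, number ** 2)  # 397 157609
```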

Contributing

Please follow the Contributing guide.

License

GPL-3.0 License

This project started using the bytehub feature store as its base and is under the same license.
