Historical metric store

Driftdb

Driftdb is a historical metric store

from driftdb.connectors import GithubConnector
from github import Github
from google.cloud import bigquery

github_connector = GithubConnector(github_client=Github("gh_token"), github_repository_name="org/repo")

# query is the SQL string that selects your metric
dataframe = bigquery.Client().query(query).to_dataframe()
# {"unique_key": ['2022-01-01_FR', '2022-01-01_GB'...

github_connector.snapshot_table(table_dataframe=dataframe, table_name="revenue")
# '🎉 data/act_metrics_finance/mrr.csv Successfully stored!'
# '💩 Historical data change detected, Ammy was assigned to it'

Purpose

Keeping data non-moving is a journey. In reality, historical data changes, and those changes have many impacts (data integrity and reconciliation, predictive modeling, historical data accuracy). The purpose of this library is:

to snapshot the data, and parse the diff in chunks (schema update, new data collection, data duplication, drift...)
to store it using a connector
to trigger alerts

Getting Started (with Github as a store, it's free)

To get started with Driftdb, follow these steps:

Create a new repository on GitHub called datadrift (or whatever other name you prefer) with a README file.
Generate a personal access token on GitHub that has access to the datadrift repository. You can do this by going to your GitHub settings, selecting "Developer settings", and then "Personal access tokens". Click "Generate new token" and give it the necessary permissions (content and pull requests).
In your data pipelines, wherever relevant, call snapshot_table using:
- a connector (in this example a github connector)
- your table in a dataframe format
- the name of the table: "kpi/my_kpi"

For instance

>>> from driftdb.connectors import GithubConnector
>>> from github import Github
>>> github_connector = GithubConnector(github_client=Github("gh_token"), github_repository_name="org/repo")
>>> github_connector.snapshot_table(table_dataframe=dataframe, table_name="revenue")

That's it! With these steps, you can start using Driftdb to store and track your metrics over time.

Dataframe

Driftdb is based on the standard dataframe format from Pandas. You can use any library to get the data, as long as the resulting dataframe meets the following requirements:

The first column of the dataframe must be unique_key
The first column must contain only unique keys
The second column must be a date (the collection date: the booking_date, the order_date, etc.)
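
The requirements above can be checked before taking a snapshot. The following is a minimal sketch and not part of driftdb: check_snapshot_format is a hypothetical helper, and interpreting "a date" as a pandas datetime64 column is an assumption.

```python
import pandas as pd


def check_snapshot_format(df: pd.DataFrame) -> list[str]:
    """Return a list of problems; an empty list means the format looks fine.

    Hypothetical helper, not a driftdb function.
    """
    if len(df.columns) < 2:
        return ["expected at least two columns"]
    problems = []
    if df.columns[0] != "unique_key":
        problems.append("first column must be named unique_key")
    if df.iloc[:, 0].duplicated().any():
        problems.append("first column contains duplicated keys")
    # Assumption: the date column is stored with a datetime64 dtype.
    if not pd.api.types.is_datetime64_any_dtype(df.iloc[:, 1]):
        problems.append("second column must be a datetime column")
    return problems


good = pd.DataFrame(
    {
        "unique_key": ["2022-01-01_FR", "2022-01-01_GB"],
        "date": pd.to_datetime(["2022-01-01", "2022-01-01"]),
        "value": [10, 20],
    }
)
bad = pd.DataFrame({"unique_key": ["A", "A"], "date": ["2022-01-01", "2022-01-02"]})

good_problems = check_snapshot_format(good)
bad_problems = check_snapshot_format(bad)
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a table in a single pipeline run.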

The granularity of the dataframe depends on every use case:

it can be at very low level (like transaction) or aggregated (like a metric)
it can contain all the dimensions, or none

1st column: Unique key

The unique_key is used to detect a modification in historical data.

If the unique_key column contains duplicated values, driftdb automatically renames them with a -duplicate-n suffix, as in the before and after dataframes below.

  unique_key  value
0          A     10
1          B     20
2          C     30
3          B     40
4          C     50
5          C     60
6          D     70

         unique_key  value
0                A     10
1                B     20
2                C     30
3    B-duplicate-1     40
4    C-duplicate-1     50
5    C-duplicate-2     60
6                D     70
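
The renaming shown in the two dataframes above can be reproduced with plain pandas. This is a sketch under assumptions, not driftdb's own code: rename_duplicate_keys is a hypothetical helper, and how driftdb implements the renaming internally is not documented here.

```python
import pandas as pd


def rename_duplicate_keys(df: pd.DataFrame, key_column: str = "unique_key") -> pd.DataFrame:
    """Suffix repeated keys with -duplicate-n, keeping the first occurrence as is.

    Hypothetical helper, not a driftdb function.
    """
    keys = df[key_column].astype(str)
    # 0 for the first occurrence of a key, 1 for the second, and so on.
    occurrence = df.groupby(key_column).cumcount()
    renamed = df.copy()
    renamed[key_column] = keys.where(
        occurrence == 0, keys + "-duplicate-" + occurrence.astype(str)
    )
    return renamed


df = pd.DataFrame(
    {
        "unique_key": ["A", "B", "C", "B", "C", "C", "D"],
        "value": [10, 20, 30, 40, 50, 60, 70],
    }
)
renamed = rename_duplicate_keys(df)
```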

2nd column: Date

The date column is used to detect added or deleted historical data, and to tell such changes apart from a new batch being collected (which is not a drift).
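
One way to picture this distinction is sketched below. It rests on a loud assumption: that a row counts as part of a new batch when its date is later than the latest date in the previous snapshot. The README does not state the exact rule driftdb applies, and split_added_rows and the column name date are hypothetical.

```python
import pandas as pd


def split_added_rows(
    before: pd.DataFrame,
    after: pd.DataFrame,
    key_column: str = "unique_key",
    date_column: str = "date",
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split newly appearing rows into (new batch, added historical data).

    Hypothetical helper; the cut-off rule is an assumption, not driftdb's documented behavior.
    """
    added = after[~after[key_column].isin(before[key_column])]
    last_collected = before[date_column].max()
    is_new_batch = added[date_column] > last_collected
    return added[is_new_batch], added[~is_new_batch]


before = pd.DataFrame(
    {
        "unique_key": ["2022-01-01_FR", "2022-01-02_FR"],
        "date": pd.to_datetime(["2022-01-01", "2022-01-02"]),
        "value": [10, 20],
    }
)
after = pd.DataFrame(
    {
        "unique_key": ["2022-01-01_FR", "2022-01-02_FR", "2022-01-01_GB", "2022-01-03_FR"],
        "date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-01", "2022-01-03"]),
        "value": [10, 20, 5, 30],
    }
)
new_batch, added_history = split_added_rows(before, after)
```

Here the 2022-01-03 row is a regular new collection, while the 2022-01-01_GB row appears inside a period that had already been collected, so it would be treated as a drift.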

Large Dataset

Partitioning

For tables of more than 1M rows, partitioning is recommended, using the partition_and_snapshot_table function.

>>> from driftdb.connectors.workflow import partition_and_snapshot_table

>>> very_large_dataframe = bigquery.Client().query(query).to_dataframe()
{"unique_key": ['2022-01-01_FR', '2022-01-01_GB'...
>>> connector.partition_and_snapshot_table(table_dataframe=very_large_dataframe, table_name="act_metrics_finance/mrr")
'🎁 Partitionning data/act_metrics_finance/mrr.csv...'
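
To illustrate what partitioning a large table can look like, here is a minimal pandas sketch that splits a dataframe into one chunk per calendar month. The partition key that driftdb actually uses is not documented here, so monthly partitioning, the helper partition_by_month and the column name date are all assumptions.

```python
import pandas as pd


def partition_by_month(df: pd.DataFrame, date_column: str = "date") -> dict[str, pd.DataFrame]:
    """Split a dataframe into one chunk per calendar month, keyed by 'YYYY-MM'.

    Hypothetical helper, not a driftdb function.
    """
    months = df[date_column].dt.to_period("M")
    return {str(month): chunk for month, chunk in df.groupby(months)}


table = pd.DataFrame(
    {
        "unique_key": ["2022-01-01_FR", "2022-01-15_FR", "2022-02-01_FR"],
        "date": pd.to_datetime(["2022-01-01", "2022-01-15", "2022-02-01"]),
        "value": [10, 20, 30],
    }
)
partitions = partition_by_month(table)
```

Each chunk can then be stored and compared on its own, which keeps individual files and diffs small.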

Drift

A drift is a modification of historical data. It can be a modification, addition or deletion in a table that is supposed to be "non-moving data".
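
The three kinds of drift can be made concrete by comparing two snapshots of the same table. The following is a sketch in plain pandas, assuming both snapshots share the same columns; summarize_drift is a hypothetical helper and does not claim to reproduce driftdb's own diff logic.

```python
import pandas as pd


def summarize_drift(before: pd.DataFrame, after: pd.DataFrame, key_column: str = "unique_key"):
    """Return (added rows, deleted rows, keys of modified rows) between two snapshots.

    Hypothetical helper. Assumes identical columns in both snapshots; note that
    NaN values compare as different from each other.
    """
    before_indexed = before.set_index(key_column)
    after_indexed = after.set_index(key_column)
    added = after_indexed.loc[after_indexed.index.difference(before_indexed.index)]
    deleted = before_indexed.loc[before_indexed.index.difference(after_indexed.index)]
    common = before_indexed.index.intersection(after_indexed.index)
    # A row is modified when at least one of its values differs between snapshots.
    changed = (before_indexed.loc[common] != after_indexed.loc[common]).any(axis=1)
    modified_keys = changed[changed].index
    return added, deleted, modified_keys


before = pd.DataFrame({"unique_key": ["A", "B", "C"], "value": [10, 20, 30]})
after = pd.DataFrame({"unique_key": ["A", "B", "D"], "value": [10, 25, 40]})
added, deleted, modified_keys = summarize_drift(before, after)
```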

Drift Evaluator

A drift evaluator is a class that implements the following abstract class:

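from abc import ABC, abstractmethod
from typing import TypedDict

class DriftEvaluation(TypedDict):
    should_alert: bool
    message: str

class BaseDriftEvaluator(ABC):
    @staticmethod
    @abstractmethod
    def compute_drift_evaluation(
        # DriftEvaluatorContext is described under Custom Drift Evaluator
        data_drift_context: "DriftEvaluatorContext",
    ) -> DriftEvaluation:
        pass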

Default Drift Evaluator

The default drift evaluator will return should_alert = False

Alert Drift Evaluator

The Alert drift evaluator returns should_alert = True for all drifts, with a message summarizing the drift. For example:

Drift detected:
- 🆕 0 addition
- ♻️ 2 modifications
- 🗑️ 0 deletion

To use the AlertDriftEvaluator, add it when you call snapshot_table like this:

from driftdb.drift_evaluator.drift_evaluators import AlertDriftEvaluator

connector.snapshot_table(table_dataframe, table_name, drift_evaluator=AlertDriftEvaluator)

Custom Drift Evaluator

You can provide a custom evaluator: a class whose compute_drift_evaluation method receives a DriftEvaluatorContext with the following properties:

from typing import TypedDict

import pandas as pd

class DriftSummary(TypedDict):
    added_rows: pd.DataFrame
    deleted_rows: pd.DataFrame
    modified_rows_unique_keys: pd.Index
    modified_patterns: pd.DataFrame

class DriftEvaluatorContext(TypedDict):
    before: pd.DataFrame
    after: pd.DataFrame
    summary: DriftSummary

Then implement your class, and use it in snapshot_table.

class MyDriftEvaluator(BaseDriftEvaluator):
    @staticmethod
    def compute_drift_evaluation(
        data_drift_context: DriftEvaluatorContext,
    ) -> DriftEvaluation:
        # do what you want; there_is_something_I_dont_like stands for your own condition
        if there_is_something_I_dont_like:
            return {"should_alert": True, "message": "No this should not happen"}
        return {"should_alert": False, "message": ""}
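
As a concrete illustration of the template above, here is a sketch of an evaluator that alerts only when historical rows were deleted. To keep it runnable without driftdb installed, the types are redeclared locally as stand-ins mirroring the definitions in this README; in real use you would import them from driftdb instead. DeletionDriftEvaluator is a hypothetical name.

```python
from abc import ABC, abstractmethod
from typing import TypedDict

import pandas as pd


# Local stand-ins mirroring the README definitions (not imported from driftdb).
class DriftSummary(TypedDict):
    added_rows: pd.DataFrame
    deleted_rows: pd.DataFrame
    modified_rows_unique_keys: pd.Index
    modified_patterns: pd.DataFrame


class DriftEvaluatorContext(TypedDict):
    before: pd.DataFrame
    after: pd.DataFrame
    summary: DriftSummary


class DriftEvaluation(TypedDict):
    should_alert: bool
    message: str


class BaseDriftEvaluator(ABC):
    @staticmethod
    @abstractmethod
    def compute_drift_evaluation(data_drift_context: DriftEvaluatorContext) -> DriftEvaluation:
        pass


class DeletionDriftEvaluator(BaseDriftEvaluator):
    """Hypothetical evaluator: alert only when historical rows disappeared."""

    @staticmethod
    def compute_drift_evaluation(data_drift_context: DriftEvaluatorContext) -> DriftEvaluation:
        deleted = data_drift_context["summary"]["deleted_rows"]
        if len(deleted) > 0:
            return {"should_alert": True, "message": f"{len(deleted)} historical row(s) deleted"}
        return {"should_alert": False, "message": ""}


# Hand-built context for demonstration: row B was deleted between snapshots.
before = pd.DataFrame({"unique_key": ["A", "B"], "value": [10, 20]})
after = before.iloc[:1]
empty = before.iloc[0:0]
context: DriftEvaluatorContext = {
    "before": before,
    "after": after,
    "summary": {
        "added_rows": empty,
        "deleted_rows": before.iloc[1:],
        "modified_rows_unique_keys": pd.Index([]),
        "modified_patterns": empty,
    },
}
evaluation = DeletionDriftEvaluator.compute_drift_evaluation(context)
```

Following the AlertDriftEvaluator example earlier, such a class would be passed as the drift_evaluator argument of snapshot_table.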

CLI

Instead of storing data on GitHub, you can store it locally and explore it with the CLI.

Getting started

From dbt snapshot (dbt >= 1.6)

pip install driftdb

driftdb dbt snapshot
driftdb start

From generated seeds

pip install driftdb

driftdb seed create
driftdb seed update

driftdb start

Features

Metrics

Load a csv

driftdb load-csv path/to/csv

Data visualization

driftdb start

Start driftdb and navigate to localhost:9741/tables to visualize, in a waterfall chart, how a metric evolved over a given period.

Release history

0.1.4

Feb 2, 2024

0.1.3a13 pre-release

Feb 2, 2024

0.1.3a12 pre-release

Feb 2, 2024

0.1.3a10 pre-release

Feb 2, 2024

0.1.3a9 pre-release

Feb 1, 2024

0.1.3a8 pre-release

Jan 23, 2024

0.1.3a7 pre-release

Jan 23, 2024

0.1.3a6 pre-release

Jan 23, 2024

0.1.3a5 pre-release

Jan 23, 2024

0.1.3a4 pre-release

Jan 23, 2024

0.1.3a3 pre-release

Jan 23, 2024

0.1.3a2 pre-release

Jan 23, 2024

0.1.3a1 pre-release

Jan 23, 2024

0.1.2

Jan 23, 2024

0.1.1

Jan 22, 2024

0.1.0

Jan 22, 2024

0.0.10

Jan 10, 2024

0.0.10a5 pre-release

Jan 9, 2024

0.0.10a4 pre-release

Jan 3, 2024

0.0.10a3 pre-release

Jan 3, 2024

0.0.10a2 pre-release

Jan 3, 2024

0.0.10a1 pre-release

Dec 26, 2023

0.0.9a2 pre-release

Dec 19, 2023

0.0.9a1 pre-release

Dec 19, 2023

0.0.8

Dec 18, 2023

0.0.8a3 pre-release

Nov 30, 2023

0.0.8a2 pre-release

Nov 30, 2023

This version

0.0.8a1 pre-release

Nov 30, 2023

0.0.7

Nov 30, 2023

0.0.6

Nov 30, 2023

0.0.6a5 pre-release

Nov 29, 2023

0.0.6a4 pre-release

Nov 29, 2023

0.0.6a1 pre-release

Nov 23, 2023

0.0.5

Nov 23, 2023

0.0.5a5 pre-release

Nov 22, 2023

0.0.5a4 pre-release

Nov 22, 2023

0.0.5a3 pre-release

Nov 22, 2023

0.0.5a2 pre-release

Nov 22, 2023

0.0.5a1 pre-release

Nov 21, 2023

0.0.4

Nov 20, 2023

0.0.4a1 pre-release

Nov 20, 2023

0.0.3

Nov 20, 2023

0.0.3a2 pre-release

Nov 20, 2023

0.0.3a1 pre-release

Nov 20, 2023

0.0.2

Nov 17, 2023

0.0.2a2 pre-release

Nov 17, 2023

0.0.2a1 pre-release

Nov 17, 2023

0.0.1

Nov 17, 2023

0.0.1a18 pre-release

Nov 9, 2023

0.0.1a17 pre-release

Nov 9, 2023

0.0.1a16 pre-release

Nov 9, 2023

0.0.1a15 pre-release

Nov 9, 2023

0.0.1a14 pre-release

Nov 9, 2023

0.0.1a12 pre-release

Oct 31, 2023

0.0.1a11 pre-release

Oct 31, 2023

0.0.1a10 pre-release

Oct 31, 2023

0.0.1a9 pre-release

Oct 25, 2023

0.0.1a8 pre-release

Oct 20, 2023

0.0.1a7 pre-release

Oct 19, 2023

0.0.1a6 pre-release

Oct 19, 2023

0.0.1a5 pre-release

Oct 19, 2023

0.0.1a4 pre-release

Oct 19, 2023

0.0.1a3 pre-release

Oct 19, 2023

0.0.1a0 pre-release

Oct 19, 2023

Download files

Source Distribution

driftdb-0.0.8a1.tar.gz (39.9 MB)

Uploaded Nov 30, 2023 Source

Built Distribution

driftdb-0.0.8a1-py3-none-any.whl (40.0 MB)

Uploaded Nov 30, 2023 Python 3

Hashes for driftdb-0.0.8a1.tar.gz

Algorithm	Hash digest
SHA256	`746a94d24ada70a1045af3271a4794ce4c0398f8ad7e202242a8979ad018fc73`
MD5	`54ae8b6b2c0867b3a553c21974c731f2`
BLAKE2b-256	`c6218a41eb6fc93be6c3ecf3f71b44115ed6a141c34dc682d0bf37829ce59afe`

Hashes for driftdb-0.0.8a1-py3-none-any.whl

Algorithm	Hash digest
SHA256	`f5600bb63967fb35fec29eef9d27b6c983c58546520e6597360d3f78a56f5f90`
MD5	`f3f62c50521e16b1ed20ef882e2fe976`
BLAKE2b-256	`64b6a658c77735eb8d7f81fe8fe3bf6a49c37c2a4f84e4101bf8b8f193b6d63a`
