
Git-based metric store

Project description

Datagit


Datagit is a Git-based metric store library.

>>> from datagit import github_connector
>>> from github import Github
>>> from google.cloud import bigquery

>>> dataframe = bigquery.Client().query(query).to_dataframe()
{"unique_key": ['2022-01-01_FR', '2022-01-01_GB'...
>>> github_connector.store_metric(ghClient=Github("Token"), dataframe=dataframe, filepath="Samox/datagit/data/act_metrics_finance/mrr.csv", assignees=["Samox"])
'🎉 data/act_metrics_finance/mrr.csv Successfully stored!'
'💩 Historical data change detected, Samox was assigned to it'

Purpose

Non-moving data is an ideal; in reality, the data moves, or drifts. The purpose of this library is:

  • to parse, sort, and sanitize a metric dataset
  • to convert it to CSV
  • to store it in a GitHub repository, with clean commits for new data or drifting data.

Getting Started

To get started with Datagit, follow these steps:

  1. Create a new repository on GitHub called datagit (or any other name you prefer) with a README file.
  2. Generate a personal access token on GitHub that has access to the datagit repository. You can do this by going to your GitHub settings, selecting "Developer settings", and then "Personal access tokens". Click "Generate new token" and give it the necessary permissions (content and pull requests).
  3. In your data pipelines, when relevant, call store_metric with the following parameters:
    • a GitHub client built with your token: Github("Token")
    • your metric as a dataframe
    • the path of the metric as a CSV file: "your_orga/your_repo/path/to/your.csv"
    • the owner(s) of the metric (assignees)

For instance:

>>> from datagit import github_connector
>>> from github import Github
>>> github_connector.store_metric(ghClient=Github("Token"), dataframe=dataframe, filepath="Samox/datagit/data/act_metrics_finance/mrr.csv", assignees=["Samox"])

That's it! With these steps, you can start using Datagit to store and track your metrics over time.

Example

>>> githubToken = "github_pat****"
>>> githubRepo = "ReplaceOrgaName/ReplaceRepoName"
>>> import pandas as pd
>>> from datetime import datetime
>>> dataframe = pd.DataFrame({'unique_key': ['a', 'b', 'c'], 'date': [datetime(2023,9,1), datetime(2023,9,1), datetime(2023,9,1)], 'amount': [1001, 1002, 1003], 'is_active': [True, False, True]})
>>> from github import Github
>>> from datagit import github_connector
>>> github_connector.store_metric(ghClient=Github(githubToken), dataframe=dataframe, filepath=githubRepo + "/data/act_metrics_finance/mrr.csv")

Dataset

Datagit is based on the standard pandas dataframe format. You can use any library to get the data, as long as the result fits the following requirements (a quick check is sketched after the lists below):

  1. The first column of the dataframe must be unique_key
  2. The first column must contain only unique values
  3. The second column must be a date

The granularity of the dataframe depends on the use case:

  • it can be very low level (like transactions) or aggregated (like a metric)
  • it can contain all the dimensions, or none
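
These requirements can be checked with plain pandas before calling store_metric; the assertions below are only an illustration, not part of the datagit API:

>>> import pandas as pd

>>> # dataframe is the metric you are about to store
>>> assert dataframe.columns[0] == "unique_key"
>>> assert dataframe["unique_key"].is_unique
>>> assert pd.api.types.is_datetime64_any_dtype(dataframe[dataframe.columns[1]])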

1st column: Unique key

The unique_key is used to detect modifications in historical data.

If you have duplicated keys, datagit will automatically rename them with a -duplicate-n suffix (a pandas sketch of this renaming follows the tables):

Before:

  unique_key  value
0          A     10
1          B     20
2          C     30
3          B     40
4          C     50
5          C     60
6          D     70

After:

         unique_key  value
0                A     10
1                B     20
2                C     30
3    B-duplicate-1     40
4    C-duplicate-1     50
5    C-duplicate-2     60
6                D     70
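
The renaming behaviour can be reproduced with plain pandas; the sketch below only illustrates the idea and is not the library's internal implementation:

>>> import pandas as pd

>>> def rename_duplicates(df: pd.DataFrame) -> pd.DataFrame:
...     # Number repeated keys: 0 for the first occurrence, then 1, 2, ...
...     occurrence = df.groupby("unique_key").cumcount()
...     df = df.copy()
...     df.loc[occurrence > 0, "unique_key"] = (
...         df["unique_key"] + "-duplicate-" + occurrence.astype(str)
...     )
...     return df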

2nd column: Date

The date column is used to detect new or deleted historical data.

Query Builder

Datagit provides a simple query builder to store a table:

>>> from datagit import query_builder
>>> query = query_builder.build_query(table_id="my_table", unique_key_columns=["organisation_id", "date_month"], date="date_month")
'SELECT CONCAT(organisation_id, '__', date_month) AS unique_key, date_month as date, * FROM my_table WHERE TRUE ORDER BY 1'
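
The generated query can then feed the same flow as in the earlier examples; the BigQuery client, token, and file path below are placeholders:

>>> from google.cloud import bigquery
>>> from github import Github
>>> from datagit import github_connector, query_builder

>>> query = query_builder.build_query(table_id="my_table", unique_key_columns=["organisation_id", "date_month"], date="date_month")
>>> dataframe = bigquery.Client().query(query).to_dataframe()
>>> github_connector.store_metric(ghClient=Github("Token"), dataframe=dataframe, filepath="your_orga/your_repo/data/my_table.csv")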

More examples here

Large Dataset

Partitioning

For datasets with more than 1M rows, partitioning is recommended, using the partition_and_store_table function.

>>> from datagit import github_connector
>>> from github import Github
>>> from google.cloud import bigquery

>>> very_large_dataframe = bigquery.Client().query(query).to_dataframe()
{"unique_key": ['2022-01-01_FR', '2022-01-01_GB'...
>>> github_connector.partition_and_store_table(ghClient=Github("Token"), dataframe=very_large_dataframe, filepath="Samox/datagit/data/act_metrics_finance/mrr.csv")
'🎁 Partitionning data/act_metrics_finance/mrr.csv...'

Drift

A drift is a change in historical data. It can be a modification, addition, or deletion in a table that is supposed to contain "non-moving data".

Drift evaluator

When a drift is detected, the default behaviour is to trigger an alert and prompt the user to explain the drift before merging it into the dataset. A custom function can also be used to decide whether an alert should be triggered or whether the drift should be merged automatically.

Default drift evaluator

The default drift evaluator opens a pull request with a message containing the number of additions, modifications, and deletions in the drift.

Custom drift evaluator

You can provide a custom evaluator, which is a function with the following properties (a sketch follows the list):

  • parameters:
    • data_drift_context: a dictionary with:
      • computed_dataframe (the up-to-date metric)
      • reported_dataframe (the metric already reported)
  • return value:
    • a dictionary containing:
      • "should_alert": bool; if True a pull request will be opened, if False the drift will be merged
      • "message": str; the message to display in the pull request, or the body of the drift commit

No alert drift evaluator

If you just want to store the metric in a git branch, this drift evaluator merges the drift into the reported branch without any alert.
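
Following the same contract, a no-alert evaluator is simply a function that never asks for review; this is a sketch, not necessarily the library's built-in implementation:

>>> def no_alert_drift_evaluator(data_drift_context):
...     # Always merge: never open a pull request, just record the drift.
...     return {"should_alert": False, "message": "Drift merged automatically"}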

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datagit-0.18.6.tar.gz (15.9 kB)

Uploaded Source

Built Distribution

datagit-0.18.6-py3-none-any.whl (17.4 kB)

Uploaded Python 3

File details

Details for the file datagit-0.18.6.tar.gz.

File metadata

  • Download URL: datagit-0.18.6.tar.gz
  • Upload date:
  • Size: 15.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/6.0.0 pkginfo/1.9.6 requests/2.29.0 requests-toolbelt/0.9.1 tqdm/4.65.0 CPython/3.11.3

File hashes

Hashes for datagit-0.18.6.tar.gz
Algorithm Hash digest
SHA256 67a747ef344a7e4bf542b4a325afacbea319c2d089cf7f4a67ed041e98c2ae8a
MD5 e05b8a4753ac140b0db562b2200957d9
BLAKE2b-256 e214de29bb3d95099062352478c04439394cd4830e3a9f64be008aa7219584e1

See more details on using hashes here.

File details

Details for the file datagit-0.18.6-py3-none-any.whl.

File metadata

  • Download URL: datagit-0.18.6-py3-none-any.whl
  • Upload date:
  • Size: 17.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/6.0.0 pkginfo/1.9.6 requests/2.29.0 requests-toolbelt/0.9.1 tqdm/4.65.0 CPython/3.11.3

File hashes

Hashes for datagit-0.18.6-py3-none-any.whl
Algorithm Hash digest
SHA256 b04a08f3535a328bf34af046a61ddc36521ae5a9a93dc1b751c86c1879cbf648
MD5 683b3443743f2b0b90bbe4d9ecbbbce0
BLAKE2b-256 2b74729e1e89adcd6b4efe24d1918763c8dc2e3a1c86e9e237597ef7fd744057

See more details on using hashes here.
