DataCockpit is a package to analyze the usage and quality metrics of your database.
Project description
DataCockpit
DataCockpit is a Python toolkit that leverages the data, usage logs, and associated meta-data within a data lake and provisions two kinds of auxiliary information:
- quality – the validity and appropriateness of data required to perform certain analytic tasks.
- usage – the historical utilization characteristics of data across multiple users.
Architecture
Install
DataCockpit is written in Python 3. Please ensure you have a Python 3 environment already installed.
Installing DataCockpit is as simple as making microwave popcorn! Just pip install datacockpit
and then sit back and let it do the work while you put your feet up.
How to Use
from datacockpit import DataCockpit
# Setup database connection with SQL engine and log-file CSV
dcp_obj = DataCockpit(engine=your_sqlalchemy_engine,
logs_path="your/logs/path.csv")
# Compute and persist quality & usage metrics
dcp_obj.compute_quality(levels=None, metrics=None)
dcp_obj.compute_usage(levels=None, metrics=None)
- First, we initialize a SQLAlchemy
engine
that is used to connect to the database. - Then we create an object,
dcp_obj
, of theDataCockpit
class, by passing theengine
andlogs_path
as arguments.logs_path
points to the location where the historical usage logs (SQL queries and metadata such as the user who ran them and the timestamp) are saved in a CSV file. - The next two lines call methods to compute and persist quality and usage with parameters to support different
levels
(e.g., ['attribute', 'record', 'all']) andmetrics
(e.g., ['correctness', 'completeness', 'all']). - The
compute_
commands persist the computed metrics to database tables. - You can retrieve the computed metrics for use in downstream applications through the below
get_
commands.
# Retrieve computed information for use in downstream applications
quality_info = dcp_obj.get_quality()
usage_info = dcp_obj.get_usage()
Depending on your data and query patterns, the get_quality()
and get_usage()
functions will return the following quality and usage information.
Data Quality Information
Attribute metrics table (attribute_metrics):
Completeness
is the percentage of non-missing values for an attribute.Correctness
is the percentage of correct values for an attribute based on pre-defined constraints.Objectivity
is the amount of distortion in the data distribution.Uniqueness
is the percentage of unique values for an attribute.
Record metrics table (record_metrics):
Completeness
is the percentage of non-missing values in each dataset record.Correctness
is the percentage of correct values in each dataset record.Uniqueness
is the percentage of unique values in each dataset record.
Data Usage Information
The usage metrics are fairly self explanatory. The SQL queries are parsed to get the Metadata Table that shows usage statistics for every attribute in the datasets (analogous to tables). The Aggregate Table and the Dataset Usage Tables are rolled up from the Metadata Table. Other analyses such as timeseries analyses are shown in the Jupyter notebooks in the assets/notebooks
directory. It runs on a sample usage file available in assets/data/query_logs.csv
.
Metadata table (dcp_metadata):
Aggregate table (dcp_aggr):
Data usage (dcp_dataset_usage)
Build
DataCockpit can be installed as a Python package and imported in your own awesome applications!
- DataCockpit is written in Python 3. Please ensure you have a Python 3 environment already installed.
- Clone this repository (review branch) and enter (
cd
) into it. - Create a new virtual environment,
virtualenv --python=python3 venv
- Activate it using,
source venv/bin/activate
(MacOSX/Linux) orvenv\Scripts\activate.bat
(Windows) - Install dependencies,
python -m pip install -r requirements.txt
- <make your changes>
- Bump up the version in setup.py and create a Python distributable,
python setup.py sdist
- This will create a new file inside datacockpit-..*.tar.gz inside the
dist
directory. - Install the above file in your Python environment using,
python -m pip install <PATH-TO-datacockpit-*.*.*.tar.gz>
- Verify by opening your Python console and importing it:
>>> from datacockpit import DataCockpit
- Enjoy, DataCockpit is now available for use as a Python package!
Analysis
Look at assets for examples of how to use the metrics obtained, such as visualizing the temporal trends in data or finding the most critical attributes.
Credits
DataCockpit was created by Arpit Narechania, Fan Du, Atanu R. Sinha, Ryan A. Rossi, Jane Hoffswell, Shunan Guo, Eunyee Koh, Surya Chakraborty, Shivam Agarwal, Shamkant B. Navathe, and Alex Endert.
License
The software is available under the MIT License.
Contact
If you have any questions, feel free to open an issue or contact Arpit Narechania, Surya Chakraborty (chakraborty [at] gatech.edu), or Shivam Agarwal (s.agarwal [at] gatech.edu).
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file datacockpit-0.2.0.tar.gz
.
File metadata
- Download URL: datacockpit-0.2.0.tar.gz
- Upload date:
- Size: 15.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.6
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | c63dcc9d38e7bc9eeb271649dbfe902519e3e70a33e360f9895ccdcf05c2b1b9 |
|
MD5 | 0bd72848e679f4a3390c33f172abf9fd |
|
BLAKE2b-256 | 326ce4be5d6a62bf3738fa8f0134316565d0a831e24e18df8fa93687f3b2dbb1 |