
DataSae


Data Quality Framework provided by Jabar Digital Service

Configuration Files

https://github.com/jabardigitalservice/DataSae/blob/46ef80072b98ca949084b4e1ae50bcf23d07d646/tests/data/config.json#L1-L183

https://github.com/jabardigitalservice/DataSae/blob/46ef80072b98ca949084b4e1ae50bcf23d07d646/tests/data/config.yaml#L1-L120

Checker for Data Quality

[!NOTE]
You can use DataSae's column functions, organized by data type, to add column-level data quality checks to the config file.

pip install 'DataSae[converter,gsheet,s3,sql]'

Python Code

from datasae.converter import Config

# From JSON
config = Config('DataSae/tests/data/config.json')

# From YAML
config = Config('DataSae/tests/data/config.yaml')

# Check all data qualities on configuration
config.checker  # dict result

# Check data quality by config name
config('test_local').checker  # list of dict result
config('test_gsheet').checker  # list of dict result
config('test_s3').checker  # list of dict result
config('test_mariadb_or_mysql').checker  # list of dict result
config('test_postgresql').checker  # list of dict result

Example results: https://github.com/jabardigitalservice/DataSae/blob/46ef80072b98ca949084b4e1ae50bcf23d07d646/tests/data/checker.json#L1-L432
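The checker results are plain Python structures (a dict keyed by config name, or a list of dicts per config, as noted above), so they can be filtered or serialized with the standard library. A sketch of post-processing such a result; the keys used here are hypothetical, so see the linked checker.json for the actual shape:

```python
import json

# Hypothetical checker output, shaped like the documented returns
# (dict of config name -> list of per-check dicts); the real keys
# may differ -- see the linked checker.json for the actual shape.
checker_result = {
    "test_local": [
        {"column": "age", "rule": "is_in_range", "score": 1.0},
        {"column": "name", "rule": "not_null", "score": 0.9},
    ]
}

# Collect every check that scored below a perfect 1.0
failing = [
    check
    for checks in checker_result.values()
    for check in checks
    if check["score"] < 1.0
]
print(json.dumps(failing, indent=2))
```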

Command Line Interface (CLI)

datasae --help
 
 Usage: datasae [OPTIONS] FILE_PATH
 
 Checker command.
 Creates checker result based on the configuration provided in the checker section of the data source's configuration file.
╭─ Arguments ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    file_path      TEXT  The source path of the .json or .yaml file [default: None] [required]                                    │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --config-name                       TEXT  If the config name is not set, it will create all of the checker results [default: None] │
│ --yaml-display    --json-display          [default: yaml-display]                                                                  │
│ --save-to-file-path                 TEXT  [default: None]                                                                          │
│ --help                                    Show this message and exit.                                                              │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Example commands:

datasae DataSae/tests/data/config.yaml # Check all data qualities on configuration
datasae DataSae/tests/data/config.yaml --config-name test_local # Check data quality by config name

[!TIP] We also have an example of a DataSae implementation in Apache Airflow, but for now it is for private use only. Internal developers can see it at this git repository.

Converter from Any Data Source to Pandas's DataFrame

[!NOTE]
Currently supports converting from CSV, JSON, Parquet, Excel, Google Spreadsheet, and SQL.

pip install 'DataSae[converter]'

Local Computer

from datasae.converter import Config

# From JSON
config = Config('DataSae/tests/data/config.json')

# From YAML
config = Config('DataSae/tests/data/config.yaml')

# Local computer file to DataFrame
local = config('test_local')

df = local('path/file_name.csv', sep=',')
df = local('path/file_name.json')
df = local('path/file_name.parquet')
df = local('path/file_name.xlsx', sheet_name='Sheet1')

df = local('path/file_name.csv')  # Default: sep = ','
df = local('path/file_name.json')
df = local('path/file_name.parquet')
df = local('path/file_name.xlsx')  # Default: sheet_name = 'Sheet1'
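The sep and sheet_name keyword arguments mirror pandas.read_csv and pandas.read_excel, which suggests the local converter delegates to the matching pandas reader and returns an ordinary DataFrame. A pure-pandas illustration of that behavior, with in-memory text standing in for path/file_name.csv (the column names are made up):

```python
import io
import pandas as pd

# In-memory CSV standing in for path/file_name.csv
csv_text = "name,age\nAsep,30\nEuis,25\n"

# Same default sep=',' as the converter examples above
df = pd.read_csv(io.StringIO(csv_text), sep=",")

print(df.shape)          # (2, 2)
print(list(df.columns))  # ['name', 'age']
```

Because the return value is a plain DataFrame, everything downstream (filtering, joins, the checker functions) works on it directly.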

Google Spreadsheet

https://github.com/jabardigitalservice/DataSae/blob/4308324d066c6627936773ab2d5b990adaa60100/tests/data/creds.json#L1-L12

pip install 'DataSae[converter,gsheet]'
from datasae.converter import Config

# From JSON
config = Config('DataSae/tests/data/config.json')

# From YAML
config = Config('DataSae/tests/data/config.yaml')

# Google Spreadsheet to DataFrame
gsheet = config('test_gsheet')
df = gsheet('Sheet1')
df = gsheet('Sheet1', 'gsheet_id')

S3

pip install 'DataSae[converter,s3]'
from datasae.converter import Config

# From JSON
config = Config('DataSae/tests/data/config.json')

# From YAML
config = Config('DataSae/tests/data/config.yaml')

# S3 object to DataFrame
s3 = config('test_s3')

df = s3('path/file_name.csv', sep=',')
df = s3('path/file_name.json')
df = s3('path/file_name.parquet')
df = s3('path/file_name.xlsx', sheet_name='Sheet1')

df = s3('path/file_name.csv', 'bucket_name')  # Default: sep = ','
df = s3('path/file_name.json', 'bucket_name')
df = s3('path/file_name.parquet', 'bucket_name')
df = s3('path/file_name.xlsx', 'bucket_name')  # Default: sheet_name = 'Sheet1'
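The S3 converter mirrors the local one: presumably the object is fetched from the configured (or explicitly named) bucket and its bytes are handed to the matching pandas reader. The same effect can be approximated with an in-memory byte buffer standing in for the downloaded object (the boto3 fetch itself is omitted):

```python
import io
import pandas as pd

# Bytes as they might arrive from an S3 GetObject call
s3_object_bytes = b"name,age\nAsep,30\nEuis,25\n"

# Wrapping the bytes in BytesIO lets pandas treat the object like a file
df = pd.read_csv(io.BytesIO(s3_object_bytes), sep=",")
print(len(df))  # 2
```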

SQL

pip install 'DataSae[converter,sql]'

[!IMPORTANT] For macOS users: if pip install fails on mysqlclient, run the following and then retry the installation.

brew install mysql

MariaDB or MySQL

from datasae.converter import Config

# From JSON
config = Config('DataSae/tests/data/config.json')

# From YAML
config = Config('DataSae/tests/data/config.yaml')

# MariaDB or MySQL to DataFrame
mariadb_or_mysql = config('test_mariadb_or_mysql')
df = mariadb_or_mysql('select 1 column_name from schema_name.table_name;')
df = mariadb_or_mysql('path/file_name.sql')

PostgreSQL

from datasae.converter import Config

# From JSON
config = Config('DataSae/tests/data/config.json')

# From YAML
config = Config('DataSae/tests/data/config.yaml')

# PostgreSQL to DataFrame
postgresql = config('test_postgresql')
df = postgresql('select 1 column_name from schema_name.table_name;')
df = postgresql('path/file_name.sql')
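Both SQL converters accept either a literal query string or a path to a .sql file, and the query-string case presumably boils down to pandas.read_sql against a live connection. A self-contained sketch of that pattern, using an in-memory SQLite database in place of the configured MariaDB/MySQL/PostgreSQL server:

```python
import sqlite3
import pandas as pd

# SQLite stands in here for the configured database server
con = sqlite3.connect(":memory:")
con.execute("create table table_name (column_name integer)")
con.execute("insert into table_name values (1)")

# pandas runs the query over the DBAPI connection and builds a DataFrame
df = pd.read_sql("select column_name from table_name", con)
print(df["column_name"].tolist())  # [1]
```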
