
Data Profiler Library


DataProfileViewerAKP

DataProfileViewerAKP is a Spark-compatible library that helps profile the data in a DataFrame.

The current version returns the following attributes in the result set:

Result Set

Column: column name from the supplied DataFrame
DataType: data type of the column, as inferred with "inferSchema"
Count: total number of rows
NullCount: total number of null rows
NullPercentage: percentage of null values
EmptyCount: total number of empty rows ('')
BlankCount: total number of blank rows (' ')
MaxLength: maximum length of data in the column
MinLength: minimum length of data in the column
AvgLength: average length of data in the column
DistinctCount: distinct count in the column (values that appear once or more)
UniqueCount: unique count in the column (values that appear exactly once)
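
To make these definitions concrete, here is a minimal PySpark sketch that computes a few of the metrics by hand for a single column. The tiny sample DataFrame and the "city" column are hypothetical; the library computes all of these for you in a single call.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("profile_sketch").getOrCreate()

# Hypothetical sample: one empty, one blank, one null, one repeated value
df = spark.createDataFrame(
    [("Pune",), ("Mumbai",), ("",), (" ",), (None,), ("Pune",)], ["city"]
)
col = "city"

total_count    = df.count()
null_count     = df.filter(F.col(col).isNull()).count()   # NullCount
empty_count    = df.filter(F.col(col) == '').count()      # EmptyCount ('')
blank_count    = df.filter(F.col(col) == ' ').count()     # BlankCount (' ')
distinct_count = df.select(col).distinct().count()        # appear once or more
unique_count   = (df.groupBy(col).count()                 # appear exactly once
                    .filter(F.col("count") == 1).count())
null_percentage = 100.0 * null_count / total_count

# Length statistics (nulls are ignored by max/min/avg)
length_stats = df.select(
    F.max(F.length(col)).alias("MaxLength"),
    F.min(F.length(col)).alias("MinLength"),
    F.avg(F.length(col)).alias("AvgLength"),
).first()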

Installation and Driver Code

To install, run:

pip install DataProfileViewerAKP

Driver Code:

from DataProfileViewerAKP import DataProfileViewerAKP

# Read the source data, letting Spark infer the schema
df = spark.read.format('csv') \
    .option("header", True) \
    .option("inferSchema", True) \
    .load(Path)

# Build the profile and display it (display() is available in
# Databricks notebooks; use .show() in plain PySpark)
profile = DataProfileViewerAKP.get_data_profile(spark, df)
profile.display()
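
Since get_data_profile returns a Spark DataFrame, the profile can be persisted like any other DataFrame. A hypothetical example (the output path is illustrative):

# Hypothetical: write the profile out as a single CSV for review
profile.coalesce(1).write.mode("overwrite") \
    .option("header", True) \
    .csv("/tmp/data_profile_output")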

Required Libraries:

import datetime, time
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType
from pyspark import StorageLevel

You can also define your own spark session:

conf = SparkConf()
# Hive settings: enable vectorized execution and stats-based optimizations
conf.set('hive.vectorized.execution.enabled', 'true')
conf.set('hive.cbo.enable', 'true')
conf.set('hive.compute.query.using.stats', 'true')
conf.set('hive.stats.fetch.column.stats', 'true')
conf.set('hive.stats.fetch.partition.stats', 'true')
# Clean up checkpoint files once they are no longer referenced
conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')

spark = SparkSession.builder \
    .appName("profile_driver_program") \
    .config(conf=conf) \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql('set hive.exec.dynamic.partition=true')
spark.sql('set hive.exec.dynamic.partition.mode=nonstrict')

Authors

Abhijeet Kasab (Azure Data Engineer)

Optimizations

The input DataFrame is cached in memory because the profiler runs multiple operations on the same DataFrame.
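
A minimal sketch of this pattern, assuming an existing DataFrame df (the "city" column is hypothetical):

from pyspark import StorageLevel
from pyspark.sql import functions as F

# Persist once so the many per-column aggregations reuse the
# materialized data instead of recomputing the source read
df.persist(StorageLevel.MEMORY_AND_DISK)
try:
    total = df.count()                                 # materializes the cache
    nulls = df.filter(F.col("city").isNull()).count()  # served from cache
finally:
    df.unpersist()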

Download files


Source Distribution

DataProfileViewerAKP-0.1.9.tar.gz (3.1 kB)

Built Distribution

DataProfileViewerAKP-0.1.9-py3-none-any.whl (4.3 kB)
