Data Profiler Library
Project description
DataProfileViewerAKP
This is a Spark-compatible library that helps with profiling data.
The current version returns a result set with the following attributes:
Result Set
Column: Column name from the supplied DataFrame
DataType: Data type of the column, as detected with "inferSchema"
Count: Total number of rows
NullCount: Total number of null rows
NullPercentage: Percentage of null values
EmptyCount: Total number of empty rows ('')
BlankCount: Total number of blank rows (' ')
MaxLength: Maximum length of data in the column
MinLength: Minimum length of data in the column
AvgLength: Average length of data in the column
DistinctCount: Number of distinct values in the column (values appearing once or more)
UniqueCount: Number of unique values in the column (values appearing exactly once)
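The DistinctCount/UniqueCount distinction (and the empty vs. blank convention) can be illustrated on a small plain-Python list. This is only a sketch of the definitions above, not the library's code, and whether the library includes empty/blank strings in its distinct counts is an assumption here:

```python
from collections import Counter

# Sample column values: None is a null, '' is empty, ' ' is blank.
values = ['a', 'b', 'a', '', ' ', None, 'c']

non_null = [v for v in values if v is not None]
counts = Counter(non_null)

null_count = sum(v is None for v in values)          # rows that are null
empty_count = sum(v == '' for v in non_null)         # rows that are ''
blank_count = sum(v == ' ' for v in non_null)        # rows that are ' '
distinct_count = len(counts)                         # values appearing once or more
unique_count = sum(c == 1 for c in counts.values())  # values appearing exactly once

# 'a' appears twice, so it counts toward DistinctCount but not UniqueCount.
print(null_count, distinct_count, unique_count)
```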
Installation and Driver Code
To install run:
pip install DataProfileViewerAKP
Driver Code:

from DataProfileViewerAKP import DataProfileViewerAKP

df = spark.read.format('csv') \
    .option("header", True) \
    .option("inferSchema", True) \
    .load(Path)  # Path: location of your input file

result = DataProfileViewerAKP.get_data_profile(spark, df)
result.display()
Required Libraries:
import datetime, time
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType
from pyspark import StorageLevel
You can also define your own Spark session:

conf = SparkConf()
conf.set('hive.vectorized.execution.enabled', 'true')
conf.set('hive.cbo.enable', 'true')
conf.set('hive.compute.query.using.stats', 'true')
conf.set('hive.stats.fetch.column.stats', 'true')
conf.set('hive.stats.fetch.partition.stats', 'true')
conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')

spark = SparkSession.builder \
    .appName("profile_driver_program") \
    .config(conf=conf) \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql('set hive.exec.dynamic.partition=true')
spark.sql('set hive.exec.dynamic.partition.mode=nonstrict')
Authors
Abhijeet Kasab (Azure Data Engineer)
Optimizations
The DataFrame is cached in memory, since the profiler runs multiple operations against the same DataFrame.
Hashes for DataProfileViewerAKP-0.1.9.tar.gz

Algorithm | Hash digest
---|---
SHA256 | 9de03c390de7fe5f0dfc3fe06a328180fe0f2a4e8ef0de11d67a7671f9465491
MD5 | 72a78f81d76972c69884699f8fc721cd
BLAKE2b-256 | b051422b24f3b67913bfc5ce11e0868cbed8ec6a234f28c1526efdc0639af68e

Hashes for DataProfileViewerAKP-0.1.9-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 416c017029c81c5e5936161441e3a6bdc5627fe329b8375d6fe64939e1fb0fac
MD5 | 74a112ae97958eb9ca0bf2686b22d7bd
BLAKE2b-256 | 8b09cef3bb89b22322416cd87a463f96237ac9ef65c75049c4d344b7b1bbb43a