
Data Profiler Library


DataProfileViewerAKP

DataProfileViewerAKP is a Spark-compatible library that helps profile the data in a DataFrame.

The current version returns the following attributes in the result set:

Result Set

Column: column name from the supplied DataFrame
DataType: data type of the column, as inferred with "inferSchema"
Count: total number of rows
NullCount: total number of null rows
NullPercentage: percentage of null values
EmptyCount: total number of empty rows ('')
BlankCount: total number of blank rows (' ')
MaxLength: maximum length of data in the column
MinLength: minimum length of data in the column
AvgLength: average length of data in the column
DistinctCount: distinct count in the column (values that appear once or more)
UniqueCount: unique count in the column (values that appear exactly once)
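
To make these definitions concrete, here is a minimal PySpark sketch that computes a few of the metrics by hand for a single column. The tiny sample DataFrame and the "city" column are hypothetical; the library computes all of these for you in a single call.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("profile_sketch").getOrCreate()

# Hypothetical sample: one empty, one blank, one null, one repeated value
df = spark.createDataFrame(
    [("Pune",), ("Mumbai",), ("",), (" ",), (None,), ("Pune",)], ["city"]
)
col = "city"

total_count    = df.count()
null_count     = df.filter(F.col(col).isNull()).count()   # NullCount
empty_count    = df.filter(F.col(col) == '').count()      # EmptyCount ('')
blank_count    = df.filter(F.col(col) == ' ').count()     # BlankCount (' ')
distinct_count = df.select(col).distinct().count()        # appear once or more
unique_count   = (df.groupBy(col).count()                 # appear exactly once
                    .filter(F.col("count") == 1).count())
null_percentage = 100.0 * null_count / total_count

# Length statistics (nulls are ignored by max/min/avg)
length_stats = df.select(
    F.max(F.length(col)).alias("MaxLength"),
    F.min(F.length(col)).alias("MinLength"),
    F.avg(F.length(col)).alias("AvgLength"),
).first()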

Installation and Driver Code

To install, run:

pip install DataProfileViewerAKP

Driver Code:

from DataProfileViewerAKP import DataProfileViewerAKP

# Read the source data, letting Spark infer the schema
df = spark.read.format('csv') \
    .option("header", True) \
    .option("inferSchema", True) \
    .load(Path)

# Build the profile and display it (display() is available in
# Databricks notebooks; use .show() in plain PySpark)
profile = DataProfileViewerAKP.get_data_profile(spark, df)
profile.display()
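
Since get_data_profile returns a Spark DataFrame, the profile can be persisted like any other DataFrame. A hypothetical example (the output path is illustrative):

# Hypothetical: write the profile out as a single CSV for review
profile.coalesce(1).write.mode("overwrite") \
    .option("header", True) \
    .csv("/tmp/data_profile_output")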

Required Libraries:

import datetime, time
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType
from pyspark import StorageLevel

You can also define your own spark session:

conf = SparkConf()
# Hive settings: enable vectorized execution and stats-based optimizations
conf.set('hive.vectorized.execution.enabled', 'true')
conf.set('hive.cbo.enable', 'true')
conf.set('hive.compute.query.using.stats', 'true')
conf.set('hive.stats.fetch.column.stats', 'true')
conf.set('hive.stats.fetch.partition.stats', 'true')
# Clean up checkpoint files once they are no longer referenced
conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')

spark = SparkSession.builder \
    .appName("profile_driver_program") \
    .config(conf=conf) \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql('set hive.exec.dynamic.partition=true')
spark.sql('set hive.exec.dynamic.partition.mode=nonstrict')

Authors

Abhijeet Kasab (Azure Data Engineer)

Optimizations

The input DataFrame is cached in memory because the profiler runs multiple operations on the same DataFrame.
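
A minimal sketch of this pattern, assuming an existing DataFrame df (the "city" column is hypothetical):

from pyspark import StorageLevel
from pyspark.sql import functions as F

# Persist once so the many per-column aggregations reuse the
# materialized data instead of recomputing the source read
df.persist(StorageLevel.MEMORY_AND_DISK)
try:
    total = df.count()                                 # materializes the cache
    nulls = df.filter(F.col("city").isNull()).count()  # served from cache
finally:
    df.unpersist()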

Download files


Source Distribution

DataProfileViewerAKP-0.1.9.tar.gz (3.1 kB)

Built Distribution

DataProfileViewerAKP-0.1.9-py3-none-any.whl (4.3 kB)
