
This library takes a DataFrame and returns another one containing a detailed profile of each column.

Project description

DataProfile Library

Description

The main purpose of this library is to generate a data profile of a dataset. It takes a DataFrame as input and analyzes basic aspects such as the number of unique records, duplicates, and null or empty values, among others. Each of these aspects is analyzed separately for every column of the DataFrame.

Principal Features

  • Count: Count the number of records. Returns a numeric value.

  • Count Distinct: Count the number of distinct records. Returns a numeric value.

  • Unique: Count the unique records. Returns a numeric value.

  • ID Probability: Estimate the probability that the column is an ID. Evaluates the data type, the name of the column, the number of unique values, and the number of empty and null records. Returns a percentage.

  • Email Probability: Estimate the probability that the column contains emails. Counts the number of "@" symbols and valid domains. Returns a percentage.

  • Duplicate: Count the duplicate records per column. Returns a numeric value.

  • Numeric: Determine whether the data type is numeric. Returns True only if all records in the column are numeric.

  • Letter: Determine whether the data type is a string. Returns True only if all records in the column are strings.

  • Bool: Determine whether the data type is boolean. Returns True only if all records in the column are booleans.

  • Empty: Count the number of empty records per column. Returns a numeric value.

  • Zero: Count the number of zeros per column. Returns a numeric value.

  • Null: Count the number of null records per column. Returns a numeric value.
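The counting features above can be illustrated with a short, dependency-free sketch over a single column held as a plain Python list. This is not the library's implementation. The helper names `is_null` and `profile_column` are hypothetical, and the exact definitions are assumptions: nulls are `None` or NaN, "unique" means values that occur exactly once, and "duplicate" means records that repeat an earlier value.

```python
from collections import Counter
import math


def is_null(value):
    """Treat None and float NaN as null (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def profile_column(values):
    """Return basic counts for one column given as a list of values."""
    non_null = [v for v in values if not is_null(v)]
    frequencies = Counter(non_null)
    return {
        # Total number of records, nulls included.
        "count": len(values),
        # Number of different non-null values.
        "count_distinct": len(frequencies),
        # Values that occur exactly once.
        "unique": sum(1 for n in frequencies.values() if n == 1),
        # Records that repeat an earlier value.
        "duplicate": len(non_null) - len(frequencies),
        # Strings that are empty or whitespace only.
        "empty": sum(1 for v in non_null if isinstance(v, str) and v.strip() == ""),
        # Numeric zeros; booleans are excluded on purpose.
        "zero": sum(
            1
            for v in non_null
            if isinstance(v, (int, float)) and not isinstance(v, bool) and v == 0
        ),
        "null": len(values) - len(non_null),
    }


column = ["a", "b", "a", "", None, 0, 0, "c"]
result = profile_column(column)
print(result)
```

The actual library may define these metrics differently, for example by counting nulls as a distinct value, so treat the numbers from this sketch as an illustration of the idea rather than a prediction of `dataprofile` output.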

Requirements

  • pandas
  • NumPy
  • PrettyTable

Functions

  • dataprofile(DF): This is the main function. It takes a DataFrame as input and returns another one with all the features described above.
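To make the "one table in, one summary row per column out" shape concrete, here is a minimal plain-Python sketch that computes the type flags and a rough email percentage. It is only an illustration under stated assumptions, not the code behind `dataprofile`. The names `profile_table`, `all_of_type`, `email_probability`, and `EMAIL_PATTERN` are hypothetical, and the email check is a simple pattern match rather than real domain validation.

```python
import re

# Hypothetical pattern: exactly one "@" followed by a dotted domain.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[A-Za-z]{2,}$")


def all_of_type(values, types, exclude=()):
    """True only if the column is non-empty and every value matches `types`."""
    return bool(values) and all(
        isinstance(v, types) and not isinstance(v, exclude) for v in values
    )


def email_probability(values):
    """Percentage of values that look like an email address."""
    if not values:
        return 0.0
    hits = sum(1 for v in values if isinstance(v, str) and EMAIL_PATTERN.match(v))
    return 100.0 * hits / len(values)


def profile_table(table):
    """Take {column name: list of values}; return one summary row per column."""
    return [
        {
            "column": name,
            # bool is a subclass of int in Python, so exclude it explicitly.
            "numeric": all_of_type(values, (int, float), exclude=bool),
            "letter": all_of_type(values, str),
            "bool": all_of_type(values, bool),
            "email_probability": email_probability(values),
        }
        for name, values in table.items()
    ]


table = {
    "id": [1, 2, 3],
    "contact": ["ana@example.com", "bob@example.org", "n/a", "carl@example"],
    "active": [True, False, True],
}
rows = profile_table(table)
for row in rows:
    print(row)
```

Each flag is True only when every record in the column has the matching type, which mirrors the "only if all records" wording of the Numeric, Letter, and Bool features.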

How to Start

  1. Install the library using pip:

    pip install dataprofile
    
  2. Import the dataprofile library:

    import dataprofile as dp
    
  3. Create or import a DataFrame. This example uses read_csv from Pandas to load a CSV file into a DataFrame:

    import pandas as pd
    
    def read_csv(file_path):
        # Load a comma-separated file encoded as Latin-1 into a DataFrame.
        return pd.read_csv(file_path, sep=",", encoding="latin-1")
    
    df = read_csv('base-primer-relev-dispositivos.csv')
    
  4. Call the dataprofile function on the DataFrame and print the resulting profile:

    print(dp.dataprofile(df))
    

Download files

Source Distribution: dataprofile-1.0.2.tar.gz (6.4 kB)

Built Distribution: dataprofile-1.0.2-py3-none-any.whl (7.1 kB, Python 3)

