This library takes a DataFrame and returns another DataFrame with a detailed profile of each column.
DataProfile Library
Description
The main purpose of this library is to generate a data profile of a dataset. It takes a DataFrame as input and, for each column, analyzes basic aspects such as the number of unique records, duplicates, and null or empty values, among others.
Main Features
- Count: Count the number of records. Returns a numeric value.
- Count Distinct: Count the number of distinct records. Returns a numeric value.
- Unique: Count the unique records. Returns a numeric value.
- ID Probability: Calculate the probability that the column is an ID. Evaluates the data type, the name of the column, the number of unique IDs, the amount of empty and null records, and estimates a probability. Returns a percent.
- Email Probability: Find the probability that the column contains emails. Counts the number of "@" symbols and valid domains, then estimates a probability. Returns a percent.
- Duplicate: Count the duplicate records per column. Returns a numeric value.
- Numeric: Determine whether the data type is numeric. Returns True only if all records in the column are numeric.
- Letter: Determine whether the data type is a string. Returns True only if all records in the column are strings.
- Bool: Determine whether the data type is boolean. Returns True only if all records in the column are booleans.
- Empty: Count the number of empty records per column. Returns a numeric value.
- Zero: Count the number of zeros per column. Returns a numeric value.
- Null: Count the number of null records per column. Returns a numeric value.
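The count-style features above can be approximated with plain Pandas calls. The following is a minimal sketch, not the library's actual implementation: the helper names column_metrics and email_share are hypothetical, and the readings of Count (all records, nulls included), Unique (values that occur exactly once), Zero (values that parse as the number 0), and the email pattern are assumptions made for illustration.

```python
import pandas as pd

# Assumed, deliberately simple email shape: something@something.tld
EMAIL_PATTERN = r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}"


def column_metrics(col: pd.Series) -> dict:
    """Hypothetical helper: count-style metrics for a single column."""
    non_null = col.dropna()
    value_counts = non_null.value_counts()
    as_text = non_null.astype(str).str.strip()
    as_number = pd.to_numeric(non_null, errors="coerce")
    return {
        "count": int(len(col)),                         # assumption: nulls included
        "count_distinct": int(non_null.nunique()),
        "unique": int((value_counts == 1).sum()),       # assumption: occurs exactly once
        "duplicate": int(non_null.duplicated().sum()),  # repeats after the first occurrence
        "empty": int((as_text == "").sum()),            # blank or whitespace-only text
        "zero": int((as_number == 0).sum()),            # assumption: "0" counts as zero
        "null": int(col.isna().sum()),
    }


def email_share(col: pd.Series) -> float:
    """Hypothetical helper: fraction of non-null values that look like emails."""
    text = col.dropna().astype(str).str.strip()
    if text.empty:
        return 0.0
    return float(text.str.fullmatch(EMAIL_PATTERN).mean())


emails = pd.Series(["a@x.com", "b@y.org", "a@x.com", "", None, "0"])
print(column_metrics(emails))
print(email_share(emails))
```

A real ID or email probability would likely weigh several signals together (column name, data type, share of nulls), as the feature descriptions suggest; the single ratio here only shows the general idea.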
Install Requires
- Pandas
- NumPy
- PrettyTable
Functions
- dataprofile(DF): This is the main function. It takes a DataFrame as input and returns another one with all the features described above.
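To make the DataFrame-in, DataFrame-out shape concrete, here is a small stand-in written with Pandas only. It is a sketch under assumptions, not the code behind dataprofile: the name profile_sketch, the choice of one result row per input column, and the use of dtype checks for the Numeric and Bool flags are all hypothetical, and the real function may lay out its result differently.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def profile_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: one row per input column, one column per metric."""
    rows = {}
    for name in df.columns:
        col = df[name]
        rows[name] = {
            "count": len(col),
            "count_distinct": col.nunique(),
            "null": int(col.isna().sum()),
            # Pandas treats bool as a numeric dtype, so exclude it explicitly.
            "numeric": is_numeric_dtype(col) and not is_bool_dtype(col),
            "bool": is_bool_dtype(col),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


sample = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "flag": [True, False, True],
        "name": ["a", None, "a"],
    }
)
print(profile_sketch(sample))
```

Checking the column dtype is only an approximation of "all records are numeric": a column that mixes numbers and text is stored with a generic dtype and would be flagged as non-numeric as a whole.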
How to Start
- Install the library using pip:

    pip install dataprofile

- Import the dataprofile library:

    import dataprofile as dp

- Create or import a DataFrame. This example uses read_csv from Pandas to load a CSV file into a DataFrame:

    import pandas as pd

    # Read a comma-separated file encoded as Latin-1 into a DataFrame
    def READ_CSV(file_path):
        return pd.read_csv(file_path, sep=",", encoding='latin-1')

    FILE = READ_CSV('base-primer-relev-dispositivos.csv')

- Use the dataprofile function on the DataFrame:

    print(dp.dataprofile(FILE))