What is in your data? Detect schema, statistics and entities in almost any file.
Data Profiler | What's in your data?
The DataProfiler is a Python library designed to make data analysis, monitoring and sensitive data detection easy.
Loading data with a single command, the library automatically formats and loads files into a DataFrame. Profiling the data, the library identifies the schema, statistics, entities (PII / NPI), and more. Data profiles can then be used in downstream applications or reports.
Getting started only takes a few lines of code (CSV example):
import json
from dataprofiler import Data, Profiler
data = Data("your_file.csv") # Auto-Detect & Load: CSV, AVRO, Parquet, JSON, Text
print(data.data.head(5)) # Access data directly via a compatible Pandas DataFrame
profile = Profiler(data) # Calculate Statistics, Entity Recognition, etc.
readable_report = profile.report(report_options={"output_format":"compact"})
print(json.dumps(readable_report, indent=4))
Note: The Data Profiler comes with a pre-trained deep learning model, used to efficiently identify sensitive data (PII / NPI). If desired, it's easy to add new entities to the existing pre-trained model or insert an entire new pipeline for entity recognition.
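For instance, the pre-trained labeler can also be invoked directly on raw text. The following is a minimal sketch; the labeler_type value and the predict call follow the library's DataLabeler API, so verify the exact parameters against the docs:
import dataprofiler as dp
# Load the pre-trained unstructured data labeler
# (assumes dp.DataLabeler accepts a labeler_type argument)
data_labeler = dp.DataLabeler(labeler_type='unstructured')
# Predict entity labels for a raw text sample
predictions = data_labeler.predict(['My SSN is 123-45-6789'])
print(predictions)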
For API documentation, visit the documentation page.
If you have suggestions or find a bug, please open an issue.
Install
To install the full package from PyPI: pip install DataProfiler[ml]
If the ML requirements are too strict (say, you don't want to install TensorFlow), you can install a slimmer package. The slimmer package disables the default sensitive data detection / entity recognition (labeler).
Install from PyPI: pip install DataProfiler
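If you use the full package but want to skip entity recognition for a particular run, the labeler can also be turned off through profiler options. This is a rough sketch assuming the ProfilerOptions API and its data_labeler.is_enabled setting; the keyword argument name may differ between versions:
import dataprofiler as dp
data = dp.Data("your_file.csv")
# Assumed API: disable the data labeler via ProfilerOptions
profile_options = dp.ProfilerOptions()
profile_options.set({"data_labeler.is_enabled": False})
profile = dp.Profiler(data, profiler_options=profile_options)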
What is a Data Profile?
In the case of this library, a data profile is a dictionary containing statistics and predictions about the underlying dataset. There are "global statistics" (global_stats), which contain dataset-level data, and there are "column/row level statistics" (data_stats), where each column is a new key-value entry.
The format for a profile is below:
"global_stats": {
"samples_used": int,
"column_count": int,
"row_count": int,
"row_has_null_ratio": float,
"row_is_null_ratio": float,
"unique_row_ratio": float,
"duplicate_row_count": int,
"file_type": string,
"encoding": string,
},
"data_stats": {
<column name>: {
"column_name": string,
"data_type": string,
"data_label": string,
"categorical": bool,
"order": string,
"samples": list(str),
"statistics": {
"sample_size": int,
"null_count": int,
"null_types": list(string),
"null_types_index": {
string: list(int)
},
"data_type_representation": [string, list(string)],
"min": [null, float],
"max": [null, float],
"mean": float,
"variance": float,
"stddev": float,
"histogram": {
"bin_counts": list(int),
"bin_edges": list(float),
},
"quantiles": {
int: float
},
"vocab": list(char),
"avg_predictions": dict(float),
"data_label_representation": dict(float),
"categories": list(str),
"unique_count": int,
"unique_ratio": float,
"precision": {
'min': int,
'max': int,
'mean': float,
'var': float,
'std': float,
'sample_size': int,
'margin_of_error': float,
'confidence_level': float
},
"times": dict(float),
"format": string
}
}
}
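Since the report is a plain dictionary, individual statistics can be pulled out with ordinary key access. A short sketch using the keys listed above (your_file.csv is a placeholder):
import json
from dataprofiler import Data, Profiler

data = Data("your_file.csv")
report = Profiler(data).report(report_options={"output_format": "compact"})

# Dataset-level statistics live under "global_stats"
print(report["global_stats"]["row_count"])

# Per-column statistics live under "data_stats", keyed by column name
for column_name, column_stats in report["data_stats"].items():
    print(column_name,
          column_stats["data_type"],
          column_stats["statistics"]["null_count"])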
Support
Supported Data Formats
- Any delimited file (CSV, TSV, etc.)
- JSON object
- Avro file
- Parquet file
- Pandas DataFrame
Data Types
Data Types are determined at the column level for structured data.
- Int
- Float
- String
- DateTime
Data Labels
Data Labels are determined per cell for structured data (column/row when the profiler is used) or at the character level for unstructured data.
- UNKNOWN
- ADDRESS
- BAN (bank account number, 10-18 digits)
- CREDIT_CARD
- EMAIL_ADDRESS
- UUID
- HASH_OR_KEY (md5, sha1, sha256, random hash, etc.)
- IPV4
- IPV6
- MAC_ADDRESS
- PERSON
- PHONE_NUMBER
- SSN
- URL
- US_STATE
- DRIVERS_LICENSE
- DATE
- TIME
- DATETIME
- INTEGER
- FLOAT
- QUANTITY
- ORDINAL
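As a sketch of how these labels surface in practice, the per-column data_label field in a report can be used to flag potentially sensitive columns (the label names come from the list above; your_file.csv and the chosen label set are placeholders):
from dataprofiler import Data, Profiler

SENSITIVE_LABELS = {"SSN", "CREDIT_CARD", "PHONE_NUMBER", "EMAIL_ADDRESS"}

report = Profiler(Data("your_file.csv")).report(
    report_options={"output_format": "compact"})

# Flag columns whose predicted label falls in the sensitive set
for column_name, stats in report["data_stats"].items():
    if stats["data_label"] in SENSITIVE_LABELS:
        print(f"Column {column_name!r} may contain {stats['data_label']} data")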
Get Started
Load a File
The Data Profiler can profile the following data/file types:
- CSV file (or any delimited file)
- JSON object
- Avro file
- Parquet file
- Pandas DataFrame
The profiler should automatically identify the file type and load the data into a Data class. Along with other attributes, the Data class enables the data to be accessed via a valid Pandas DataFrame.
# Load a csv file, return a CSVData object
csv_data = Data('your_file.csv')
# Print the first 10 rows of the csv file
print(csv_data.data.head(10))
# Load a parquet file, return a ParquetData object
parquet_data = Data('your_file.parquet')
# Sort the data by the name column
parquet_data.data.sort_values(by='name', inplace=True)
# Print the sorted first 10 rows of the parquet data
print(parquet_data.data.head(10))
If the file type is not automatically identified (rare), you can specify it explicitly; see the section Specifying a Filetype or Delimiter. A rough sketch is shown below.
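The CSVData reader path and the options keys here are taken from the library's data_readers module and should be treated as assumptions to verify against the docs:
from dataprofiler.data_readers.csv_data import CSVData

# Explicitly load as CSV with a specified delimiter
# (module path and "delimiter" option key are assumptions)
csv_data = CSVData("your_file.csv", options={"delimiter": ","})
print(csv_data.data.head(5))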
Profile a File
This example uses a CSV file, but JSON, Avro, and Parquet files work the same way.
import json
from dataprofiler import Data, Profiler
# Load file (CSV should be automatically identified)
data = Data("your_file.csv")
# Profile the dataset
profile = Profiler(data)
# Generate a report and use json to prettify.
report = profile.report(report_options={"output_format":"pretty"})
# Print the report
print(json.dumps(report, indent=4))
Updating Profiles
Currently, the data profiler is equipped to update its profile in batches.
import json
from dataprofiler import Data, Profiler
# Load and profile a CSV file
data = Data("your_file.csv")
profile = Profiler(data)
# Update the profile with new data:
new_data = Data("new_data.csv")
profile.update_profile(new_data)
# Print the report using json to prettify.
report = profile.report(report_options={"output_format":"pretty"})
print(json.dumps(report, indent=4))
Note that if the data you update the profile with contains integer indices that overlap with the indices of the data originally profiled, the overlapping indices will be "shifted" to uninhabited values when null rows are calculated, so that null counts and ratios remain accurate.
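For example, a profile can be grown incrementally over many batch files with repeated update_profile calls (the file names here are hypothetical):
from dataprofiler import Data, Profiler

# Hypothetical batch files sharing one schema
batch_files = ["batch_1.csv", "batch_2.csv", "batch_3.csv"]

# Profile the first batch, then fold in the rest
profile = Profiler(Data(batch_files[0]))
for path in batch_files[1:]:
    profile.update_profile(Data(path))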
Merging Profiles
If you have two files with the same schema (but different data), it is possible to merge the two profiles together via an addition operator.
This also enables profiles to be determined in a distributed manner.
import json
from dataprofiler import Data, Profiler
# Load a CSV file with a schema
data1 = Data("file_a.csv")
profile1 = Profiler(data1)
# Load another CSV file with the same schema
data2 = Data("file_b.csv")
profile2 = Profiler(data2)
profile3 = profile1 + profile2
# Print the report using json to prettify.
report = profile3.report(report_options={"output_format":"pretty"})
print(json.dumps(report, indent=4))
Note that if the merged profiles have overlapping integer indices, the overlapping indices will be "shifted" to uninhabited values when null rows are calculated, so that null counts and ratios remain accurate.
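Because merging is plain addition, any number of independently computed profiles (for example, one per worker) can be folded together. A small sketch with hypothetical shard files:
from functools import reduce
from dataprofiler import Data, Profiler

# Hypothetical shards of the same schema, profiled independently
shard_files = ["shard_a.csv", "shard_b.csv", "shard_c.csv"]
shard_profiles = [Profiler(Data(path)) for path in shard_files]

# Fold the profiles together with the documented + operator
merged_profile = reduce(lambda a, b: a + b, shard_profiles)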
Profile a Pandas DataFrame
import pandas as pd
import dataprofiler as dp
import json
my_dataframe = pd.DataFrame([[1, 2.0],[1, 2.2],[-1, 3]])
profile = dp.Profiler(my_dataframe)
# Print the report using json to prettify.
report = profile.report(report_options={"output_format":"pretty"})
print(json.dumps(report, indent=4))
# Read a specified column; in this case it is labeled 0:
print(json.dumps(report["data_stats"][0], indent=4))
Visit the documentation page for additional examples and API details.
References
Sensitive Data Detection with High-Throughput Neural Network Models for Financial Institutions
Anh Truong, Austin Walters, Jeremy Goodsitt
The AAAI-21 Workshop on Knowledge Discovery from Unstructured Data in Financial Services, 2020. https://arxiv.org/abs/2012.09597