
What is in your data? Detect schema, statistics and entities in almost any file.


Data Profiler | What's in your data?

The DataProfiler is a Python library designed to make data analysis, monitoring and sensitive data detection easy.

Loading Data: with a single command, the library automatically detects the format and loads files into a DataFrame. Profiling the Data: the library identifies the schema, statistics, entities (PII / NPI), and more. Data Profiles can then be used in downstream applications or reports.

Getting started only takes a few lines of code (CSV example):

import json
from dataprofiler import Data, Profiler

data = Data("your_file.csv") # Auto-Detect & Load: CSV, AVRO, Parquet, JSON, Text

print(data.data.head(5)) # Access data directly via a compatible Pandas DataFrame

profile = Profiler(data) # Calculate Statistics, Entity Recognition, etc

readable_report = profile.report(report_options={"output_format": "compact"})

print(json.dumps(readable_report, indent=4))

Note: The Data Profiler comes with a pre-trained deep learning model, used to efficiently identify sensitive data (PII / NPI). If desired, it's easy to add new entities to the existing pre-trained model or insert an entire new pipeline for entity recognition.
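For instance, the pre-trained labeler can also be run directly, outside of a profile. A minimal sketch (the DataLabeler class and labeler_type argument follow the library's documented usage; treat exact names as assumptions if your installed version differs):

import dataprofiler as dp

# Load the pre-trained labeler bundled with the [ml] install
# ('structured' labels individual cells; 'unstructured' labels
# characters within free text)
data = dp.Data("your_file.csv")
data_labeler = dp.DataLabeler(labeler_type='structured')

# Predict entity labels (PERSON, PHONE_NUMBER, etc.) for the data
model_predictions = data_labeler.predict(data)
print(model_predictions)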

For API documentation, visit the documentation page.

If you have suggestions or find a bug, please open an issue.


Install

To install the full package from PyPI: pip install DataProfiler[ml]

If the ML requirements are too strict (say, you don't want to install TensorFlow), you can install a slimmer package, which disables the default sensitive data detection / entity recognition (labeler).

To install the slim package from PyPI: pip install DataProfiler
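With the slim package, you may want to disable the labeler explicitly when profiling. A sketch assuming the ProfilerOptions interface described in the docs (the exact option path is an assumption and may differ by version):

import dataprofiler as dp

# Turn off the (unavailable) data labeler before profiling
profile_options = dp.ProfilerOptions()
profile_options.set({'data_labeler.is_enabled': False})

data = dp.Data("your_file.csv")
profile = dp.Profiler(data, options=profile_options)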


What is a Data Profile?

In the case of this library, a data profile is a dictionary containing statistics and predictions about the underlying dataset. There are "global statistics" (global_stats), which contain dataset-level data, and "column/row level statistics" (data_stats), where each column is a new key-value entry.

The format for a structured profile is below:

"global_stats": {
    "samples_used": int,
    "column_count": int,
    "row_count": int,
    "row_has_null_ratio": float,
    "row_is_null_ratio": float,    
    "unique_row_ratio": float,
    "duplicate_row_count": int,
    "file_type": string,
    "encoding": string,
},
"data_stats": {
    <column name>: {
        "column_name": string,
        "data_type": string,
        "data_label": string,
        "categorical": bool,
        "order": string,
	"samples": list(str),
        "statistics": {
            "sample_size": int,
            "null_count": int,
            "null_types": list(string),
            "null_types_index": {
                string: list(int)
            },
            "data_type_representation": [string, list(string)],
            "min": [null, float],
            "max": [null, float],
            "mean": float,
            "variance": float,
            "stddev": float,
            "histogram": { 
                "bin_counts": list(int),
		"bin_edges": list(float),
            },
            "quantiles": {
                int: float
            }
            "vocab": list(char),
            "avg_predictions": dict(float), 
            "data_label_representation": dict(float),
            "categories": list(str),
            "unique_count": int,
            "unique_ratio": float,
            "precision": {
	        'min': int,
		'max': int,
		'mean': float,
		'var': float,
		'std': float,
		'sample_size': int,
		'margin_of_error': float,
		'confidence_level': float		
	    },
            "times": dict(float),
            "format": string
        }
    }
}
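Since a report is a plain dictionary, its fields can be read directly. A small sketch of pulling values out of a structured report (your_file.csv is illustrative, and this assumes data_stats is keyed by column name as in the schema above):

import json
from dataprofiler import Data, Profiler

data = Data("your_file.csv")
profile = Profiler(data)
report = profile.report(report_options={"output_format": "compact"})

# Dataset-level values live under "global_stats"
print("rows:", report["global_stats"]["row_count"])

# Per-column values live under "data_stats"
for column, stats in report["data_stats"].items():
    print(column, stats["data_type"], stats["data_label"])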

The format for an unstructured profile is below:

{
    "global_stats": {
        "samples_used": int,
        "empty_line_count": int,
        "file_type": string,
        "encoding": string
    },
    "data_stats": {
        "data_label": {
            "entity_counts": {
                "word_level": dict(int),
                "true_char_level": dict(int),
                "postprocess_char_level": dict(int)
            },
            "times": dict(float)
        },
        "statistics": {
            "vocab": list(char),
            "words": list(string),
            "word_count": dict(int),
            "times": dict(float)
        }
    }
}
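The unstructured report can be read the same way; for example, word frequencies and character-level entity counts (text_file.txt is illustrative, and the key paths follow the schema above):

import dataprofiler as dp

my_text = dp.Data('text_file.txt')
profile = dp.Profiler(my_text)
report = profile.report(report_options={"output_format": "compact"})

# Word frequencies from the statistics block
print(report["data_stats"]["statistics"]["word_count"])

# Character-level entity counts from the data_label block
print(report["data_stats"]["data_label"]["entity_counts"]["postprocess_char_level"])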

Support

Supported Data Formats

  • Any delimited file (CSV, TSV, etc.)
  • JSON object
  • Avro file
  • Parquet file
  • Text file
  • Pandas DataFrame

Data Types

Data Types are determined at the column level for structured data:

  • Int
  • Float
  • String
  • DateTime

Data Labels

Data Labels are determined per cell for structured data (column/row when the profiler is used) or at the character level for unstructured data.

  • UNKNOWN
  • ADDRESS
  • BAN (bank account number, 10-18 digits)
  • CREDIT_CARD
  • EMAIL_ADDRESS
  • UUID
  • HASH_OR_KEY (md5, sha1, sha256, random hash, etc.)
  • IPV4
  • IPV6
  • MAC_ADDRESS
  • PERSON
  • PHONE_NUMBER
  • SSN
  • URL
  • US_STATE
  • DRIVERS_LICENSE
  • DATE
  • TIME
  • DATETIME
  • INTEGER
  • FLOAT
  • QUANTITY
  • ORDINAL

Get Started

Load a File

The Data Profiler can profile the following data/file types:

  • CSV file (or any delimited file)
  • JSON object
  • Avro file
  • Parquet file
  • Text file
  • Pandas DataFrame

The profiler should automatically identify the file type and load the data into a Data class.

Along with other attributes, the Data class enables data to be accessed via a valid Pandas DataFrame.

# Load a csv file, return a CSVData object
csv_data = Data('your_file.csv') 

# Print the first 10 rows of the csv file
print(csv_data.data.head(10))

# Load a parquet file, return a ParquetData object
parquet_data = Data('your_file.parquet')

# Sort the data by the name column
parquet_data.data.sort_values(by='name', inplace=True)

# Print the sorted first 10 rows of the parquet data
print(parquet_data.data.head(10))

If the file type is not automatically identified (rare), you can specify it explicitly; see the section Specifying a Filetype or Delimiter.
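As a sketch of overriding detection (this assumes the Data class accepts data_type and options keyword arguments, per that section of the docs):

from dataprofiler import Data

# Force the loader to treat the file as delimited text
# and use a pipe as the delimiter
data = Data("your_file.psv", data_type="csv",
            options={"delimiter": "|"})
print(data.data.head(5))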

Profile a File

This example uses a CSV file, but JSON, Avro, Parquet, and text files work the same way.

import json
from dataprofiler import Data, Profiler

# Load file (CSV should be automatically identified)
data = Data("your_file.csv") 

# Profile the dataset
profile = Profiler(data)

# Generate a report and use json to prettify.
report = profile.report(report_options={"output_format": "pretty"})

# Print the report
print(json.dumps(report, indent=4))

Updating Profiles

Currently, the data profiler is equipped to update its profile in batches.

import json
from dataprofiler import Data, Profiler

# Load and profile a CSV file
data = Data("your_file.csv")
profile = Profiler(data)

# Update the profile with new data:
new_data = Data("new_data.csv")
profile.update_profile(new_data)

# Print the report using json to prettify.
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

Note that if the new data contains integer indices that overlap with those of the originally profiled data, the overlapping indices are "shifted" to unoccupied values when null rows are calculated, so that null counts and ratios remain accurate.
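Batch updates pair naturally with chunked loading when a file is too large to read at once. A sketch assuming update_profile also accepts a Pandas DataFrame (the file name and chunk size are illustrative):

import pandas as pd
from dataprofiler import Profiler

# Profile a large CSV in 100,000-row batches
chunks = pd.read_csv("large_file.csv", chunksize=100000)
profile = Profiler(next(chunks))     # profile the first batch
for chunk in chunks:
    profile.update_profile(chunk)    # fold in each subsequent batch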

Merging Profiles

If you have two files with the same schema (but different data), it is possible to merge the two profiles together via an addition operator.

This also enables profiles to be determined in a distributed manner.

import json
from dataprofiler import Data, Profiler

# Load a CSV file with a schema
data1 = Data("file_a.csv")
profile1 = Profiler(data1)

# Load another CSV file with the same schema
data2 = Data("file_b.csv")
profile2 = Profiler(data2)

profile3 = profile1 + profile2

# Print the report using json to prettify.
report = profile3.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

Note that if the merged profiles have overlapping integer indices, the overlapping indices are "shifted" to unoccupied values when null rows are calculated, so that null counts and ratios remain accurate.
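Because merging is just addition, profiles built independently (for example, one per file or per worker) can be combined with a reduction. A short sketch (file names are illustrative):

from functools import reduce
from dataprofiler import Data, Profiler

files = ["file_a.csv", "file_b.csv", "file_c.csv"]
profiles = [Profiler(Data(f)) for f in files]

# Fold the per-file profiles into a single combined profile
combined = reduce(lambda a, b: a + b, profiles)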

Profile a Pandas DataFrame

import pandas as pd
import dataprofiler as dp
import json

my_dataframe = pd.DataFrame([[1, 2.0],[1, 2.2],[-1, 3]])
profile = dp.Profiler(my_dataframe)

# print the report using json to prettify.
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

# read a specified column, in this case it is labeled 0:
print(json.dumps(report["data_stats"][0], indent=4))

Unstructured profiler

In addition to the structured profiler, DataProfiler provides unstructured profiling for the TextData object or strings. The unstructured profiler also works with list(string), pd.Series(string), or pd.DataFrame(string) when the profiler_type option is set to 'unstructured'. Below is an example of the unstructured profiler with a text file.

import dataprofiler as dp
import json

my_text = dp.Data('text_file.txt')
profile = dp.Profiler(my_text)

# print the report using json to prettify.
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

Below is another example of the unstructured profiler, this time with a pd.Series of strings and the profiler option profiler_type='unstructured':

import dataprofiler as dp
import pandas as pd
import json

text_data = pd.Series(['first string', 'second string'])
profile = dp.Profiler(text_data, profiler_type='unstructured')

# print the report using json to prettify.
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))

Visit the documentation page for additional examples and API details.

References

Sensitive Data Detection with High-Throughput Neural Network Models for Financial Institutions
Authors: Anh Truong, Austin Walters, Jeremy Goodsitt
2020 https://arxiv.org/abs/2012.09597
The AAAI-21 Workshop on Knowledge Discovery from Unstructured Data in Financial Services
