
A package for profiling data and creating data dictionaries.

Project description

DataDictionary Overview

In any data environment, the introduction of new data brings questions about its contents. Discovery and documentation are critical, and a lot can be learned about a new data set through basic data profiling. This Python package reads data from a file, a directory of files, or a pandas DataFrame and creates a standardized output (an Excel workbook) that provides insight into each file's contents. The output gives any data architect, data engineer, business analyst, or data analyst the information they need to make effective and efficient early decisions about the value of, and potential issues in, a new data set.

The file processor can read any text or Excel file that pandas can open. Any argument accepted by pandas.read_csv() or pandas.read_excel() is valid and is forwarded to direct the file processor.
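
For example, a pipe-delimited text file could be profiled by forwarding pandas.read_csv() arguments (a sketch; the paths and argument values are illustrative):

import datadictionary

profiler = datadictionary.ProfileData()
# sep and encoding are standard pandas.read_csv() arguments,
# forwarded by the file processor when reading the file
profiler.process_file('./data/orders.txt', dest_dir='./profiles/', sep='|', encoding='utf-8')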

Contents

  • Components of the Output File
  • Get Started
    • Installation
    • Using DataDictionary
  • DataDictionary Class and Methods

Components of the Output File

The output is an XLSX Microsoft Excel workbook with information spread across multiple worksheets. Each bullet below corresponds to a worksheet in the output file.

Data Types

This worksheet is a great starting point for a data catalogue or data dictionary. At a glance, this report provides:

  • The original field name
    • Presented as the name appears in the sample file
  • A 'clean' version of the field name
    • These values are optimized for legibility and compatibility with RDBMS standards (See Column Name Cleanup)
  • The presumed data type
  • The minimum and maximum lengths for text fields
  • The minimum and maximum values for integer fields
  • The precision and scale for decimal fields
  • Potential ID flag
    • Based on the field name, a guess is made as to whether it is an ID field (see the sketch after this list)
  • Potential PII flag
    • First, the field name is examined to guess whether the field contains PII
    • Second, the values in the field are examined to guess whether the contents contain PII
      • If even a single value in a field is presumed to be PII, the entire field is flagged
  • Nullable flag
    • Set to 1 (True) if one or more values are found to be NULL
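
As a rough illustration of these name-based checks (the function names and hint list below are hypothetical, not the package's internal logic):

import re

def potential_id(colname):
    # flag names such as 'id', 'user_id', or 'order_key'
    return bool(re.search(r'(^|_)(id|key)($|_)', colname.lower()))

PII_NAME_HINTS = ('ssn', 'email', 'phone', 'birth', 'address')

def potential_pii_by_name(colname):
    # flag names containing common PII hints
    return any(hint in colname.lower() for hint in PII_NAME_HINTS)

print(potential_id('customer_id'))          # True
print(potential_pii_by_name('email_addr'))  # True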

Column Name Cleanup

The cleansing process uses two steps:

  1. Remove unwanted characters
  2. Replace unwanted characters

When processing a dataframe, file, or directory of files, the behavior of both steps can be controlled through parameters (the sketch after Table 1 illustrates the process):

  • colname_chars_remove
    • Full customization
  • colname_chars_replace_underscore
    • Characters replaced with an underscore (see the first row of Table 1)
  • colname_chars_replace_custom
    • Partial customization (see Table 1)
Target Character(s)     Replacement Character(s)   Affected by Custom Parameter
\/()[]{},.!?:;-^~`\s+   _                          Yes, characters can be added
#                       num                        Yes, replacement can be changed
$                       usd                        Yes, replacement can be changed
%                       pct                        Yes, replacement can be changed
&                       and                        Yes, replacement can be changed
|                       or                         Yes, replacement can be changed
@                       at                         Yes, replacement can be changed
+                       plus                       Yes, replacement can be changed
*                       times                      Yes, replacement can be changed
=                       equals                     Yes, replacement can be changed
<                       lt                         Yes, replacement can be changed
>                       gt                         Yes, replacement can be changed
Custom                  Custom                     Yes, target and replacement characters can be added

Table 1
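
A minimal sketch of the two-step cleanup using the defaults from Table 1 (the helper name and implementation are illustrative, not the package's internal code):

import re

DEFAULT_REPLACEMENTS = {'#': 'num', '$': 'usd', '%': 'pct', '&': 'and',
                        '|': 'or', '@': 'at', '+': 'plus', '*': 'times',
                        '=': 'equals', '<': 'lt', '>': 'gt'}

def clean_column_name(name, chars_remove='', custom=None):
    replacements = {**DEFAULT_REPLACEMENTS, **(custom or {})}  # colname_chars_replace_custom
    for ch in chars_remove:                 # step 1: remove unwanted characters
        name = name.replace(ch, '')
    for ch, repl in replacements.items():   # step 2: replace unwanted characters
        name = name.replace(ch, repl)
    # collapse the underscore-target characters from Table 1's first row
    return re.sub(r'[\\/()\[\]{},.!?:;\-^~`\s]+', '_', name).strip('_')

print(clean_column_name('Sales % (Q1)'))  # Sales_pct_Q1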

Data Type Detection

The best source of information for data types is a physical data model or the DDL used to create the source table. More often than not, that information is either unavailable or takes a prohibitively long time to obtain from the data owner. This feature is meant as a helpful suggestion based on the observed records in the sample data. Data types are simplified to avoid prescriptive output (see Table 2).

Data Types in Output

  • date/datetime
  • decimal
  • integer
  • decimal or integer**
  • text
  • N/A*

Table 2

* N/A is assigned to fields that contain only NULL values; no data type can be suggested.
** 'decimal or integer' is assigned to fields that may contain integer values in the source file but in which NULL values were detected during processing; pandas converts such columns to the float data type, so with ambiguous data a loose suggestion is made (see the sketch below).
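
A quick demonstration of the pandas behavior behind the ** footnote:

import pandas as pd

# an integer column containing a NULL is upcast to float64 by pandas,
# so the profiler cannot tell whether the source values were integers
df = pd.DataFrame({'qty': [1, 2, None]})
print(df['qty'].dtype)  # float64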

Text Value Distribution

For each text field, a distribution of values is appended to this worksheet, with the count of NULL values appearing at the top.
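
The worksheet layout is the package's own; an equivalent distribution can be sketched in pandas, where dropna=False keeps the NULL count in the result:

import pandas as pd

df = pd.DataFrame({'city': ['NYC', 'NYC', None, 'LA']})
# include the NULL count alongside the value counts
print(df['city'].value_counts(dropna=False))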

Numeric Value Distribution

The output on this worksheet comes from the pandas DataFrame.describe() method. It shows the distribution of numeric fields and excludes potential ID fields.
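
For reference, a minimal demonstration of describe(), which reports the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column (the exclusion of potential ID fields is the package's own addition):

import pandas as pd

df = pd.DataFrame({'amount': [10.0, 12.5, 9.75, 14.0]})
print(df.describe())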

Potential Primary Keys

This is one of the less developed features; however, it can be useful for highlighting fields with heterogeneous data that may be, or be part of, the natural key for the given sample data.
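
As a hypothetical illustration of the idea: a column whose distinct-value count equals the row count is unique in the sample and therefore a candidate key.

import pandas as pd

df = pd.DataFrame({'order_id': [1, 2, 3], 'status': ['open', 'open', 'closed']})
# columns whose distinct count equals the row count are unique in this sample
candidates = [col for col in df.columns if df[col].nunique(dropna=False) == len(df)]
print(candidates)  # ['order_id']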

Sample Data

This optional worksheet writes a number of records from the file to a Sample_Data sheet (controlled by the sample_data parameter).

Get Started

Installation

pip install datadictionary

Using DataDictionary

import pandas as pd

import datadictionary

profiler = datadictionary.ProfileData()

# profile a single file
profiler.process_file('./tests/test1.csv', dest_dir='./tests/', colname_chars_remove=r'aeiou')

# profile a directory of files
profiler.process_directory('./tests/', dest_dir='./tests/', contain='test1', not_contain='test2')

# profile a pandas DataFrame (any existing DataFrame works here)
sample_df = pd.DataFrame({'id': [1, 2], 'name': ['a', 'b']})
profiler.process_dataframe(dest_dir='./tests/', dataframe=sample_df, dataframe_name='sample_df')

DataDictionary Class and Methods

class datadictionary.ProfileData()
process_file(file_path=filepath, dest_dir=filepath, **kwargs)
file_path: path to the file to be profiled
dest_dir: directory for profile to be written
**kwargs includes:

  • colname_chars_replace_underscore: string of invalid characters to be replaced with an underscore
  • colname_chars_replace_custom: dict of characters and their replacement value
  • colname_chars_remove: string of characters to be removed
  • sample_data: None or integer > 0, default 500; number of records to include in a sample_data sheet in the output file. If None is passed, the sheet is omitted from the output file.
  • interpret_date_timestamp: boolean, default False; attempt to convert string fields to date or timestamp
  • interpret_date_timestamp_errors: string, default "raise"; options are "raise", "ignore", and "coerce". "raise" raises an error on values that cannot be converted, "ignore" returns the input data without raising, and "coerce" returns NaT for values that cannot be converted. See the example after this list.
  • pandas.read_csv() or pandas.read_excel() arguments
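
For example, the documented parameters can be combined with a forwarded pandas.read_csv() argument (paths are illustrative; profiler is the ProfileData instance from the usage example above):

profiler.process_file(
    './tests/test1.csv',
    dest_dir='./tests/',
    interpret_date_timestamp=True,             # try to parse string fields as dates
    interpret_date_timestamp_errors='coerce',  # unparseable values become NaT
    sep=',',                                   # forwarded to pandas.read_csv()
)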

process_directory(source_dir=filepath, dest_dir=filepath, **kwargs)
source_dir: path to the directory of files to be profiled
dest_dir: directory for profile to be written
**kwargs includes:

  • colname_chars_replace_underscore: string of invalid characters to be replaced with an underscore
  • colname_chars_replace_custom: dict of characters and their replacement value
  • colname_chars_remove: string of characters to be removed
  • contain: only process files whose names contain this string (as used in the usage example above)
  • not_contain: skip files whose names contain this string (as used in the usage example above)
  • sample_data: None or integer > 0, default 500; number of records to include in a sample_data sheet in the output file. If None is passed, the sheet is omitted from the output file.
  • interpret_date_timestamp: boolean, default False; attempt to convert string fields to date or timestamp
  • interpret_date_timestamp_errors: string, default "raise"; options are "raise", "ignore", and "coerce". "raise" raises an error on values that cannot be converted, "ignore" returns the input data without raising, and "coerce" returns NaT for values that cannot be converted.
  • pandas.read_csv() or pandas.read_excel() arguments

process_dataframe(dest_dir=filepath, dataframe=DataFrame, dataframe_name=string, **kwargs)
dest_dir: directory for profile to be written
dataframe: pandas DataFrame to be profiled
dataframe_name: name used for the DataFrame in the output file
**kwargs includes:

  • colname_chars_replace_underscore: string of invalid characters to be replaced with an underscore
  • colname_chars_replace_custom: dict of characters and their replacement value
  • colname_chars_remove: string of characters to be removed
  • sample_data: None or integer > 0, default 500; number of records to include in a sample_data sheet in the output file. If None is passed, the sheet is omitted from the output file.
  • interpret_date_timestamp: boolean, default False; attempt to convert string fields to date or timestamp
  • interpret_date_timestamp_errors: string, default "raise"; options are "raise", "ignore", and "coerce". "raise" raises an error on values that cannot be converted, "ignore" returns the input data without raising, and "coerce" returns NaT for values that cannot be converted.


