Package to help with normalizing data needed for the platform!

Project description

General Information

The DataNormalizer package helps data scientists validate and normalise data before importing it. The library can currently validate columns and check datasets against a set of rules, such as custom data types.

import DataNormalizer

Setting variables

The DataNormalizer library might need the details of the target app. These settings are set through properties.

import json
import pandas
import Clappform

Clappform.Auth(baseURL="https://dev.clappform.com/", username="user@email.com", password="password")
Normalise = DataNormalizer.Normalise()
Normalise.app_data = Clappform.App("appname").ReadOne(extended=True)
Normalise.dataframe = pandas.read_excel("../data.xlsx")
with open("../rules.json") as rules_file:
    Normalise.rules = json.load(rules_file)

checkRules

Checks the custom rules against your dataframe. Requires dataframe and rules. Returns a dataframe.

Normalise = DataNormalizer.Normalise()
Normalise.dataframe = pandas.read_excel("../data.xlsx")
with open("../rules.json") as rules_file:
    Normalise.rules = json.load(rules_file)
result = Normalise.checkRules()

Rules are defined in a JSON file. Each column has its own rule; rules without a column name are treated as global rules.

[
{
    "reset_coverage":"True",
    "action": "np.nan",
    "verbose": "to_file"
},
{ 
    "column": "city",
    "check_coverage": "10",
    "selection": [ "Aa en Hunze", "Aalsmeer", "Aalten", "Achtkarspelen"]
},
{
    "column": "postalCode",
    "type": "postal_code"
}
]
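
The split between global and column rules can be sketched in a few lines. This is only an illustration, not DataNormalizer's own code: `split_rules` is a hypothetical helper, and the one assumption taken from the text is that a rule without a `column` key is global.

```python
import json

# The example rules from above, inlined so the sketch is self-contained.
RULES_JSON = """
[
  {"reset_coverage": "True", "action": "np.nan", "verbose": "to_file"},
  {"column": "city", "check_coverage": "10",
   "selection": ["Aa en Hunze", "Aalsmeer", "Aalten", "Achtkarspelen"]},
  {"column": "postalCode", "type": "postal_code"}
]
"""


def split_rules(rules):
    """Separate global rules from column rules (hypothetical helper)."""
    global_rules = [rule for rule in rules if "column" not in rule]
    column_rules = [rule for rule in rules if "column" in rule]
    return global_rules, column_rules


rules = json.loads(RULES_JSON)
global_rules, column_rules = split_rules(rules)
print("global:", global_rules)
print("columns:", [rule["column"] for rule in column_rules])
```

Note that every value in the example file is a string, including numbers and booleans such as `"10"` and `"True"`.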

Supported keys are

key	value	explanation	scope
verbose	to_file / silent...	How do you want to be notified of errors?	Column / Global
column	gemeente	On which column does this rule apply	Column
type	postal_code / int / string...	What should the values of this column be	Column
action	np.nan / drop	What to do with the value if incorrect	Column / Global
selection	[ "Aa en Hunze", "Aalsmeer", "Aalten"]	The values must be one of these values	Column
one_hot_encoding	"prefix", or "" for no prefix	Concat pandas dummies on the dataframe	Column
normalise	["remove_special"] or ["remove_special", "capitalize"]	Apply one or more normalisation rules	Column
normalise_columns	["remove_special"] or ["remove_special", "capitalize"]	Apply one or more normalisation rules on all columns (global rule) or on a single column	Column / Global
concat	{"name": "uniqueID", "columns": ["col1", "col2"]}	Concatenate columns together (for unique ID generation)	Global
operator	{"name": "divide", "columns": ["A", "B"], "type": "divide"}	Apply operator on two columns, result to new column	Global
drop_duplicates	["col1", "col2"]	Drop duplicates based on a subset of column values	Global
shared_colname_drop	"anything"	If several columns share the same name, keep only one	Global
timestamp	"%Y-%m-%d"	Map values to a datetime object, skip the rule immediately if one value fails	Column
range_time	["2017-12-01 00:00:00", "2012-12-01 00:00:00"]	Converts data to timestamp and checks range, combining with timestamp improves speed	Column
fillna	value	Fill every NaN value with this value	Column / Global
range	["-inf", 2010] / [2010, "inf"]	Number range, left <= value <= right	Column
mapping	{"bad": "0","moderate": "1"}	Map row values to something else	Column / Global
column_mapping	{"postcode": "postal_code","stad": "city"}	Map column names to something else	Column / Global
regex	^[1-9][0-9]?$|^100$	Column value should match this regex	Column
check_coverage	50	Check only a sample of the column; the value is the sample size as a percentage	Column / Global
reset_coverage	True / False	If an error is found in the sample, fall back to 100%	Column / Global
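
One way to read `check_coverage` and `reset_coverage` together is: validate a random sample first, and only validate the whole column when the sample contains an error. The sketch below illustrates that reading with pandas. The function name, its parameters and the fixed random seed are assumptions made for this example, and it takes numbers and booleans rather than the string values used in the rules file; the library may implement sampling differently.

```python
import pandas as pd


def find_invalid(column, is_valid, check_coverage=100, reset_coverage=True, seed=0):
    """Return the index labels of invalid values, checking a sample first."""
    fraction = check_coverage / 100
    sample = column.sample(frac=fraction, random_state=seed)
    invalid = sample[~sample.map(is_valid).astype(bool)]
    # Fall back to the full column when the sample reveals an error.
    if reset_coverage and fraction < 1 and len(invalid) > 0:
        invalid = column[~column.map(is_valid).astype(bool)]
    return sorted(invalid.index)


allowed = {"Aalsmeer", "Aalten"}
cities = pd.Series(["Aalsmeer", "Aalten", "Unknown", "Aalsmeer"])

# With 100% coverage every value is checked.
print(find_invalid(cities, lambda value: value in allowed))
```

With a coverage below 100 and `reset_coverage` disabled, errors outside the sample go unnoticed, which is the trade-off for the faster check.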

Supported values for types

type	explanation
int	Accepts ints; floats have their decimal part removed
positive-int	Same as int, but only positive values and zero
negative-int	Same as int, but only negative values
string	Accepts characters
float	Accepts decimal numbers
boolean	Converts to lowercase and accepts true / false
postal_code	Removes special characters, accepts 1111ab and 1111AB
street	Accepts letters, spaces and '. Makes the first character and any character after ' uppercase
latitude / longitude	Accepts the 32.111111 format
letters	Only accepts letters
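
As an illustration of what a type check such as `postal_code` involves, here is a minimal sketch using the standard `re` module. The helper name `clean_postal_code` and the exact patterns are assumptions based on the description above (strip special characters, accept four digits followed by two letters in either case); the library's own check may differ.

```python
import re


def clean_postal_code(value):
    """Strip special characters and return the code, or None if it is invalid."""
    cleaned = re.sub(r"[^0-9A-Za-z]", "", str(value))
    # Accept four digits followed by two letters, e.g. 1111ab or 1111AB.
    if re.fullmatch(r"[0-9]{4}[A-Za-z]{2}", cleaned):
        return cleaned
    return None


for raw in ["1111 ab", "1111-AB", "111ab"]:
    print(repr(raw), "->", clean_postal_code(raw))
```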

Normalise options for columns and rows

value	explanation
capitalize	Make first character uppercase
lowercase	Make whole string lowercase
uppercase	Make whole string uppercase
remove_special	Remove special characters
remove_whitespace	Remove whitespace
spaces_to_underscore	Make every space an underscore _
spaces_to_hyphen	Make every space a hyphen -
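
The options above are applied in the order they are listed in the rule. The following sketch shows one plausible plain-Python equivalent of each option. The `NORMALISERS` table and the `normalise` function are hypothetical names, and the definition of a "special character" (anything that is not a letter, digit, underscore or whitespace) is an assumption.

```python
import re

# One plain-Python equivalent per option name (assumed behaviour).
NORMALISERS = {
    "capitalize": lambda text: text[:1].upper() + text[1:],
    "lowercase": str.lower,
    "uppercase": str.upper,
    "remove_special": lambda text: re.sub(r"[^\w\s]", "", text),
    "remove_whitespace": lambda text: re.sub(r"\s+", "", text),
    "spaces_to_underscore": lambda text: text.replace(" ", "_"),
    "spaces_to_hyphen": lambda text: text.replace(" ", "-"),
}


def normalise(value, options):
    """Apply the named options to a string, in the order given."""
    for option in options:
        value = NORMALISERS[option](value)
    return value


print(normalise("aa-en hunze!", ["remove_special", "capitalize"]))
```

Because the options run in sequence, `["remove_special", "capitalize"]` and `["capitalize", "remove_special"]` can give different results when the value starts with a special character.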

Operator options

operator	explanation
divide	Divide two columns on row level and put result to new column
multiply	Multiply two columns on row level and put result to new column
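
In pandas terms, an operator rule amounts to a row-wise calculation on two columns, stored under a new column name. The sketch below assumes the rule shape shown in the keys table (`name`, `columns`, `type`); `apply_operator` is a hypothetical helper, not part of the library.

```python
import pandas as pd


def apply_operator(df, rule):
    """Return a copy of df with the operator result in a new column."""
    left, right = rule["columns"]
    if rule["type"] == "divide":
        result = df[left] / df[right]
    elif rule["type"] == "multiply":
        result = df[left] * df[right]
    else:
        raise ValueError(f"unsupported operator: {rule['type']}")
    return df.assign(**{rule["name"]: result})


df = pd.DataFrame({"A": [10, 9], "B": [2, 3]})
divided = apply_operator(df, {"name": "ratio", "columns": ["A", "B"], "type": "divide"})
multiplied = apply_operator(df, {"name": "product", "columns": ["A", "B"], "type": "multiply"})
print(divided)
print(multiplied)
```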

Fillna options

action	explanation	value
fillna	Fill NaN values with something else	The value to fill NaN with
fillna_diffcol	Fill NaN with the value of a different column on that row	Name of the other column
fillna_mean	Fill NaN with the mean of the column	Ignored
fillna_median	Fill NaN with the median of the column	Ignored
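
Each of these options has a direct counterpart in pandas. The sketch below shows the equivalent `fillna` calls on a small example frame; the column names `price` and `fallback` are made up for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [1.0, np.nan, 3.0, 8.0],
    "fallback": [10.0, 20.0, 30.0, 40.0],
})

# fillna: replace NaN with a fixed value
filled_value = df["price"].fillna(0)
# fillna_diffcol: replace NaN with the value of another column on the same row
filled_diffcol = df["price"].fillna(df["fallback"])
# fillna_mean / fillna_median: replace NaN with a statistic of the column itself
filled_mean = df["price"].fillna(df["price"].mean())
filled_median = df["price"].fillna(df["price"].median())

print(pd.DataFrame({
    "value": filled_value,
    "diffcol": filled_diffcol,
    "mean": filled_mean,
    "median": filled_median,
}))
```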

Supported values for action

action	explanation
np.nan	Replaces mismatches with np.nan
drop	Drop the row
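
The difference between the two actions can be shown with a `selection` rule on a small frame. This is a sketch of the described behaviour in plain pandas, not the library's implementation; the example data is made up.

```python
import pandas as pd

selection = ["Aa en Hunze", "Aalsmeer", "Aalten"]
df = pd.DataFrame({"city": ["Aalsmeer", "Amsterdam", "Aalten"], "id": [1, 2, 3]})
matches = df["city"].isin(selection)

# action "np.nan": keep the row, but replace the mismatching value with NaN
as_nan = df.assign(city=df["city"].where(matches))

# action "drop": remove the whole row
dropped = df[matches]

print(as_nan)
print(dropped)
```

With `np.nan` the row count stays the same and the gap can be filled later with one of the fillna options; with `drop` the other columns of the mismatching row are lost as well.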

Supported values for verbose

value	explanation
to_console	(DEFAULT) Print errors to the console
to_file	Write errors to a file named in the timestamp.txt format
silent	Don't print errors

obtainKeys

Finds the keys needed for the app. Requires app_data. Returns the keys.

Validate = DataNormalizer.Validate()
Validate.app_data = Clappform.App("appname").ReadOne(extended=True)
Validate.obtainKeys()

matchKeys

Finds missing keys. Requires app_data and dataframe. Returns the missing and additional keys.

Validate = DataNormalizer.Validate()
Validate.app_data = Clappform.App("appname").ReadOne(extended=True)
Validate.dataframe = pandas.read_excel("../data.xlsx")
Validate.matchKeys()
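
Conceptually, finding missing and additional keys is a set comparison between the keys the app expects and the columns in the dataframe. The sketch below shows that comparison on plain lists; `match_keys` is a hypothetical stand-in and does not reflect how `matchKeys` reads `app_data`.

```python
def match_keys(app_keys, dataframe_columns):
    """Return (missing, additional) keys as sorted lists."""
    app_keys = set(app_keys)
    dataframe_columns = set(dataframe_columns)
    missing = sorted(app_keys - dataframe_columns)      # expected by the app, absent in the data
    additional = sorted(dataframe_columns - app_keys)   # present in the data, unknown to the app
    return missing, additional


missing, additional = match_keys(["city", "postal_code"], ["city", "postcode", "stad"])
print("missing:", missing)
print("additional:", additional)
```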

fixMismatch

Suggests changes to your dataset based on missing keys. Requires app_data and dataframe. Lowering the strictness increases the number of matches with possible keys. Interaction happens via the terminal.

Validate = DataNormalizer.Validate()
Validate.app_data = Clappform.App("appname").ReadOne(extended=True)
Validate.dataframe = pandas.read_excel("../data.xlsx")
Validate.fixMismatch(strictness=0.8)
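
To give a feel for how a strictness threshold affects matching, the sketch below uses `difflib.get_close_matches` from the standard library, with the strictness passed as the similarity cutoff. Whether `fixMismatch` uses this technique is not stated here, so treat `suggest_renames` purely as a hypothetical illustration of fuzzy matching between unknown columns and missing keys.

```python
import difflib


def suggest_renames(missing_keys, extra_columns, strictness=0.8):
    """Map each unknown column to the missing keys that resemble it."""
    suggestions = {}
    for column in extra_columns:
        close = difflib.get_close_matches(column, missing_keys, n=3, cutoff=strictness)
        if close:
            suggestions[column] = close
    return suggestions


missing_keys = ["postal_code", "city"]
extra_columns = ["postcode", "stad"]

strict = suggest_renames(missing_keys, extra_columns, strictness=0.8)
loose = suggest_renames(missing_keys, extra_columns, strictness=0.5)
print("strictness 0.8:", strict)
print("strictness 0.5:", loose)
```

A lower strictness produces more candidate matches, at the cost of more false positives that have to be rejected by hand.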

Project details

Release history

version	date
1.3.4 (this version)	Jan 11, 2023
1.3.3	Jan 6, 2023
1.3.2	Dec 28, 2022
1.3.1	Dec 14, 2022
1.3	Dec 14, 2022
1.2	Dec 13, 2022
1.1.4	Dec 5, 2022
1.1.3	Nov 28, 2022
1.1.2	Nov 22, 2022
1.1.1	Nov 15, 2022
1.1.0	Nov 14, 2022
1.0.3	Nov 10, 2022
1.0.2	Nov 10, 2022
1.0.1	Nov 9, 2022
1.0.0	Nov 2, 2022
0.0.2.dev8 (pre-release)	Oct 11, 2022
0.0.2.dev7 (pre-release)	Oct 11, 2022
0.0.2.dev6 (pre-release)	Oct 11, 2022
0.0.2.dev5 (pre-release)	Oct 11, 2022
0.0.2.dev4 (pre-release)	Oct 11, 2022
0.0.2.dev3 (pre-release)	Oct 11, 2022
0.0.2.dev2 (pre-release)	Oct 10, 2022
0.0.2.dev1 (pre-release)	Oct 10, 2022
0.0.2.dev0 (pre-release)	Oct 10, 2022
0.0.1.dev2 (pre-release)	Oct 4, 2022
0.0.1.dev1 (pre-release)	Oct 3, 2022

