pydit-jceresearch

Data cleansing tools for Internal Auditors

These details have not been verified by PyPI

Project links

Documentation

Project description

Introduction to Pydit

Pydit is a library of data wrangling tools built for internal auditors and our typical use cases, which are explained below.

This library is also a learning exercise for me on how to create a package, build documentation and tests, and publish it. Code quality varies, and because of its main use case I don't commit to keeping backward compatibility (see below). So, use it at your own peril! If, despite all that, you wish to contribute, feel free to get in touch.

Shout out: Pydit takes ideas (and some code) from Pyjanitor, an awesome library. Check it out!

Why a dedicated library for auditors?

The problem Pydit tries to solve is that cleanup and check snippets (e.g. extracting duplicates) are important for our work and start to crop up everywhere. They are often pasted from the internet, or from whatever version was last used in another script, with no consistency or tests.

On the other hand, libraries like Pyjanitor do a great job but a) require an installation that is often not allowed in your environment, b) tend to be compact and terse (and use method chaining), and c) are difficult to verify given the high overall complexity of the library.

For internal audit tests, what we really need is very verbose, easy to understand code and outputs, so that the test is almost self-explanatory and easy to review. Most of the time, performance is secondary: the code only needs to run a few times over the duration of the audit.

This leads to Pydit following these principles:

Functions should be self-standing with minimal imports/dependencies.

The auditor should be able to import any individual module and use only those functions in the audit test. That makes the test easier to understand, document and peer-review. It also reduces dependencies on future versions of Pydit. Typically, we need to file the code exactly as it was run during the audit.

Functions include verbose logging to explain what is going on. This is another feature specifically useful for the Internal Audit use case.
Focus on documentation, tests and simple code; performance is a lesser concern.
No method chaining, in the interest of source code readability.

Pyjanitor is great and its chaining approach is elegant and compact. Definitely one to have in the toolbox. However, when documenting an audit test, I have found it better to check and show all the intermediate steps and results.
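To make the contrast concrete, here is a minimal sketch in plain pandas (it is not Pydit code). The sample data, the column names and the threshold of 100 are made up for illustration. Each step is kept in its own variable and logged, so a reviewer can see the row counts at every stage.

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("audit_test")

# Hypothetical sample data, for illustration only
df = pd.DataFrame(
    {
        "invoice_id": [1, 2, 3, 4],
        "amount": [50.0, 250.0, None, 900.0],
    }
)
logger.info("Loaded %d rows", len(df))

# Step 1: drop rows with no amount, keeping the intermediate result
df_with_amount = df.dropna(subset=["amount"])
logger.info("Rows with an amount: %d", len(df_with_amount))

# Step 2: keep invoices above an (assumed) threshold of 100
df_large = df_with_amount[df_with_amount["amount"] > 100]
logger.info("Rows above threshold: %d", len(df_large))

# The chained equivalent is shorter but hides the intermediate counts
df_chained = df.dropna(subset=["amount"]).query("amount > 100")
```

Both routes give the same final rows; the step-by-step version simply leaves more evidence behind for the workpapers.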

The default behaviour is to return a new or a transformed copy of the object and not mutate the input object(s). The "inplace=True" option should be available if feasible.
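As a rough illustration of this convention, the sketch below uses a hypothetical helper, `tidy_column_names`, which is not part of Pydit. It works on a copy by default and only touches the caller's DataFrame when `inplace=True` is passed. Returning the object in both cases is a choice made for this sketch, not a statement about how Pydit behaves.

```python
import pandas as pd


def tidy_column_names(df, inplace=False):
    """Hypothetical helper (not part of Pydit): trim and lower-case column names."""
    # Work on the caller's object only when explicitly asked to
    target = df if inplace else df.copy()
    target.columns = [
        str(col).strip().lower().replace(" ", "_") for col in target.columns
    ]
    return target


df_raw = pd.DataFrame(
    {" Customer ID ": [1, 2], "Last Update": ["2023-01-01", "2023-02-01"]}
)

# Default: a transformed copy is returned and the input is left untouched
df_clean = tidy_column_names(df_raw)
untouched = list(df_raw.columns)

# Opt-in: the input itself is modified
tidy_column_names(df_raw, inplace=True)

print(untouched)
print(list(df_clean.columns))
print(list(df_raw.columns))
```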

Quick start

import pandas as pd
from pydit import start_logging_info  # sets up nice logging params with rotation
from pydit import profile_dataframe  # runs a few descriptive analyses on a df
from pydit import cleanup_column_names  # opinionated cleanup of column names
from pydit import check_duplicates  # reports on and extracts duplicates (used below)


logger = start_logging_info()
logger.info("Started")

The logger feature is used extensively by default, aiming to generate a human readable audit log to be included in workpapers.
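Pydit's own `start_logging_info` is not reproduced here. To give a rough idea of what a rotating, human readable audit log can look like using only the standard library, the sketch below defines a hypothetical `start_audit_logging` helper. The logger name, file name, size limit and format string are all assumptions.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def start_audit_logging(path, max_bytes=1_000_000, backups=3):
    """Hypothetical helper: INFO-level logger writing to a rotating file."""
    logger = logging.getLogger("audit")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate lines if called twice
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


# A temporary folder is used here; in practice this would be the audit folder
log_path = os.path.join(tempfile.mkdtemp(), "audit.log")
logger = start_audit_logging(log_path)
logger.info("Started")
logger.info("Loaded %d rows from %s", 120, "mydata.xlsx")
```

Each line in the resulting file carries a timestamp, a level and a plain-English message, which is the kind of output that can be pasted into a workpaper.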

I recommend importing individual functions, so that you can copy them locally to your project folder and just change the import statement to point to the local module. That way you freeze the version and reduce dependencies.

df = pd.read_excel("mydata.xlsx")

df_profile = profile_dataframe(df)  # will return a df with summary statistics

# you may realise the columns from Excel are all over the place with cases and
# special chars

cleanup_column_names(df, inplace=True)  # much better!!!

df_deduped = check_duplicates(
    df,
    columns=["customer_id", "last_update_date"],
    ascending=[True, False],
    keep="first",
    indicator=True,
    also_return_non_duplicates=True,
)

# You will get a nice output with the report on duplicates. The pre-sort
# (descending by date) means the last modification entry is retained, and
# the non-duplicates are returned too.
# It also adds a boolean column flagging the rows that had a duplicate removed.
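
For readers who want to see the idea behind that call without Pydit, the sketch below reproduces the sort-then-keep-first logic in plain pandas. It is an approximation and not Pydit's implementation. It assumes duplicates are judged on `customer_id` alone (which is how the comment above reads), and the sample data and the `_duplicate` column name are invented, so the real function's output may differ.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame(
    {
        "customer_id": [2, 1, 1, 3, 2],
        "last_update_date": pd.to_datetime(
            ["2023-01-05", "2023-01-01", "2023-03-01", "2023-02-01", "2023-01-09"]
        ),
    }
)

# Pre-sort: customer ascending, most recent update first within each customer
df_sorted = df.sort_values(
    ["customer_id", "last_update_date"], ascending=[True, False]
)

# Flag every customer_id that appears more than once
has_duplicate = df_sorted.duplicated(subset=["customer_id"], keep=False)

# Keep the first row per customer, i.e. the latest update
df_deduped = df_sorted.drop_duplicates(subset=["customer_id"], keep="first").copy()
df_deduped["_duplicate"] = has_duplicate.loc[df_deduped.index]

print(df_deduped)
```

Customers 1 and 2 each keep only their latest row and are flagged, while customer 3 passes through unflagged.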

Requires

Python >= 3.10
Pandas >= 1.5.0
Numpy >= 1.24
openpyxl
Matplotlib (for the occasional plot, e.g. Benford)

Installation

pip install pydit-jceresearch

(not available in anaconda yet)

Documentation

Documentation can be found here

Dev Install

git clone https://github.com/jceresearch/pydit.git
cd pydit
pip install -e .

This project uses:

pylint for linting
black for style
pytest for testing
sphinx for documentation on Read the Docs (RTD)
myst_parser, which is also a requirement for RTD
poetry for packaging

Project details


Release history

0.1.8 (Oct 24, 2024)
0.1.7 (Aug 19, 2024)
0.1.6 (Mar 25, 2024)
0.1.5 (Dec 11, 2023)
0.1.4 (Nov 9, 2023)
0.1.3 (Oct 4, 2023)
0.1.2 (Aug 31, 2023)
0.1.1 (Aug 26, 2023)
0.0.17 (Jul 22, 2023)
0.0.16 (Jun 25, 2023)
0.0.15 (May 29, 2023), this version
0.0.14 (May 20, 2023)
0.0.13 (Apr 9, 2023)
0.0.12 (Mar 27, 2023)
0.0.11 (Jan 14, 2023)
0.0.10 (Oct 29, 2022)
0.0.9 (Aug 20, 2022)
0.0.8 (Jul 16, 2022)
0.0.7 (Jul 3, 2022)
0.0.6 (Jun 19, 2022)
0.0.5 (Jun 12, 2022)
0.0.4 (May 16, 2022)
0.0.3 (May 15, 2022)
0.0.2 (May 15, 2022)
0.0.1 (May 14, 2022)

Download files

Download the file for your platform.

Source Distribution

pydit_jceresearch-0.0.15.tar.gz (44.6 kB)

Uploaded May 29, 2023 Source

Built Distribution

pydit_jceresearch-0.0.15-py3-none-any.whl (54.5 kB)

Uploaded May 29, 2023 Python 3

File details

Details for the file pydit_jceresearch-0.0.15.tar.gz.

File metadata

Download URL: pydit_jceresearch-0.0.15.tar.gz
Upload date: May 29, 2023
Size: 44.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.5.1 CPython/3.10.11 Linux/5.15.0-1037-azure

File hashes

Hashes for pydit_jceresearch-0.0.15.tar.gz
Algorithm	Hash digest
SHA256	`0caf677b56a0d4741614c0c26536b62638d23684dabe5ba34893ff2016c62b0f`
MD5	`cdeab85a53b524bd805a3f4ea6fce5ac`
BLAKE2b-256	`f4894c7129d5597ff4390d303b57ddc5b4eea3d0609b2ff811290f3d559c80ba`


File details

Details for the file pydit_jceresearch-0.0.15-py3-none-any.whl.

File metadata

Download URL: pydit_jceresearch-0.0.15-py3-none-any.whl
Upload date: May 29, 2023
Size: 54.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.5.1 CPython/3.10.11 Linux/5.15.0-1037-azure

File hashes

Hashes for pydit_jceresearch-0.0.15-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2746e5353905c2de6570361c87c8ef216c5e46859b1ef2403e77f29eeee9d8e9`
MD5	`098f5fbdaff4a905aa620c0debd6e94f`
BLAKE2b-256	`652e326b589662e211c18046a1e889aedd5e41830b42dc7a3a563eb76399cb2e`