Tools for data cleansing specifically for Internal Auditors

These details have not been verified by PyPI

Project description

Introduction to Pydit

Pydit is a library of data wrangling tools aimed at internal auditors
and built specifically for our use cases.

This library is also a learning exercise for me on how to create a package, build documentation and tests, and publish it.
Code quality varies, and I don't commit to keeping backward compatibility (see below for how I use it), so use it at your own peril!
If, despite all that, you wish to contribute, feel free to get in touch.

Shout out: Pydit takes ideas (and some code) from Pyjanitor, an awesome library.
Check it out!

Why a dedicated library for auditors?

The problem Pydit tries to solve is that a big part of our audit tests consists of basic data quality checks (e.g. finding duplicates or blanks), as these may flag potential fraud or systemic errors.

But to do those checks I often end up pasting snippets from the internet or reusing code from previous audits, with no consistency and no testing.

What I really need is:

a) code that is easy to review, both the source and its execution (even for non-programmers)

b) portable, minimal dependencies, pure Python, ideally a drop-in module.

c) performance is ultimately secondary to readability and repeatability.

Pydit follows these principles:

Functions should be self-standing with minimal imports/dependencies.

The auditor should be able to import or copy and paste only a specific module into the project to perform a particular audit test. That makes it easier to understand, customise and review. Plus, it removes dependencies on future versions of Pydit. In any case, we need to keep on file the actual code used to perform the test.

Functions should include verbose logging, short of debug level.
Focus on documentation, tests and simple code, with less concern for performance.
No method chaining, in the interest of source code readability.

While libraries like Pyjanitor are excellent and their method chaining approach is elegant, my experience has been that the good old "step by step" approach works better for documenting the test and explaining it to reviewers. Plus, Pyjanitor adds those methods directly to pandas objects, which adds some complexity/coupling to the code, and it has more library dependencies than Pydit, given its quite extensive functionality.

Functions return a new, transformed copy of the object; the code does not mutate the input object(s).
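
A minimal sketch of what these principles might look like in practice. The function `flag_blanks` and the sample column names are hypothetical and not part of Pydit; the sketch only illustrates a self-standing function that logs what it does and returns a new copy instead of mutating its input.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def flag_blanks(df, column):
    """Return a copy of df with a boolean column flagging blank values.

    Hypothetical example, not a Pydit function.
    """
    result = df.copy()  # work on a copy so the caller's DataFrame is untouched
    as_text = result[column].astype("string").str.strip()
    # a value counts as blank if it is missing or contains only whitespace
    is_blank = (as_text.fillna("") == "").astype(bool)
    result[column + "_is_blank"] = is_blank
    logger.info(
        "Column %s: %d blank value(s) out of %d rows",
        column,
        int(is_blank.sum()),
        len(result),
    )
    return result


raw = pd.DataFrame({"customer_id": [1, 2, 3], "email": ["a@x.com", "  ", None]})
step1 = flag_blanks(raw, "email")  # one explicit step per line, no chaining
```

Each step produces a named intermediate DataFrame, which a reviewer can inspect on its own, while `raw` stays exactly as it was loaded.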

Quick start

import pandas as pd
from pydit import start_logging_info  # sets up nice logging params with rotation
from pydit import profile_dataframe  # runs a few descriptive analyses on a df
from pydit import cleanup_column_names  # opinionated cleanup of column names
from pydit import check_duplicates  # reports duplicates, used further below


logger = start_logging_info()
logger.info("Started")

The logger feature is used extensively by default, aiming to generate a human readable audit log to be included in workpapers.
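
For readers who want a comparable setup without Pydit, here is a rough sketch using only the standard library. It is an assumption about what a rotating, human-readable audit log could look like, not the actual implementation of `start_logging_info`; the function name `start_audit_logging`, the default file name and the size limits are made up for illustration.

```python
import logging
from logging.handlers import RotatingFileHandler


def start_audit_logging(path="audit.log", max_bytes=1_000_000, backups=3):
    """Hypothetical helper: INFO-level logger writing to a rotating file."""
    logger = logging.getLogger("audit")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Calling `start_audit_logging()` once at the top of a script and then `logger.info(...)` at each step would give a timestamped text file that can be attached to the workpapers.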

I recommend importing individual functions so you can copy them locally to your project folder and just change the import statement to point to the local module. That way you freeze the version and reduce dependencies.

df = pd.read_excel("mydata.xlsx")

df_profile = profile_dataframe(df)  # will return a df with summary statistics

# you may realise the columns from Excel are all over the place with cases and
# special chars

df_clean = cleanup_column_names(df)

df_deduped = check_duplicates(
    df_clean,
    columns=["customer_id", "last_update_date"],
    ascending=[True, False],
    keep="first",
    indicator=True,
    also_return_non_duplicates=True,
)

# you will get a nice output with the report on duplicates, retaining the last
# modification entry (via the pre-sort descending by date) and returning
# the non-duplicates.
# It also adds a boolean column flagging the rows that had a duplicate removed.
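
To make the logic behind that last step easier to review, the same idea can be sketched in plain pandas. This illustrates the sort-then-flag approach described in the comments and is not Pydit's implementation; the sample data, the `_duplicate` column name and the choice to judge duplicates on `customer_id` alone are assumptions.

```python
import pandas as pd

df_clean = pd.DataFrame(
    {
        "customer_id": [1, 1, 2],
        "last_update_date": pd.to_datetime(
            ["2024-01-01", "2024-03-01", "2024-02-01"]
        ),
    }
)

# sort so that, per customer, the most recent update comes first
df_sorted = df_clean.sort_values(
    ["customer_id", "last_update_date"], ascending=[True, False]
)

# flag every row after the first one for each customer_id
df_flagged = df_sorted.copy()
df_flagged["_duplicate"] = df_sorted.duplicated(subset=["customer_id"], keep="first")

# keep only the latest record per customer
df_latest = df_flagged[~df_flagged["_duplicate"]]
```

Because the sort puts the newest record first, `keep="first"` retains the last modification for each customer, and the flag column shows which rows would be dropped.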

Requires

python >=3.14
pandas
numpy
matplotlib (for the occasional plot, e.g. Benford)

Installation

pip install pydit-jceresearch

(not available in anaconda yet)

Documentation

Documentation can be found here

Dev Install

git clone https://github.com/jceresearch/pydit.git
cd pydit
pip install -e .

This project uses:

pylint for linting
ruff for style
pytest for testing
sphinx for documentation on Read the Docs (RTD)
myst_parser is a requirement for RTD too

Project details

Release history

Version	Release date
0.2.0 (this version)	Feb 15, 2026
0.1.8	Oct 24, 2024
0.1.7	Aug 19, 2024
0.1.6	Mar 25, 2024
0.1.5	Dec 11, 2023
0.1.4	Nov 9, 2023
0.1.3	Oct 4, 2023
0.1.2	Aug 31, 2023
0.1.1	Aug 26, 2023
0.0.17	Jul 22, 2023
0.0.16	Jun 25, 2023
0.0.15	May 29, 2023
0.0.14	May 20, 2023
0.0.13	Apr 9, 2023
0.0.12	Mar 27, 2023
0.0.11	Jan 14, 2023
0.0.10	Oct 29, 2022
0.0.9	Aug 20, 2022
0.0.8	Jul 16, 2022
0.0.7	Jul 3, 2022
0.0.6	Jun 19, 2022
0.0.5	Jun 12, 2022
0.0.4	May 16, 2022
0.0.3	May 15, 2022
0.0.2	May 15, 2022
0.0.1	May 14, 2022

Download files

Source Distribution

pydit_jceresearch-0.2.0.tar.gz (181.5 kB)

Uploaded Feb 15, 2026 Source

Built Distribution

pydit_jceresearch-0.2.0-py3-none-any.whl (61.1 kB)

Uploaded Feb 15, 2026 Python 3

File details

Details for the file pydit_jceresearch-0.2.0.tar.gz.

File metadata

Download URL: pydit_jceresearch-0.2.0.tar.gz
Upload date: Feb 15, 2026
Size: 181.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.10.2 (uv publish, run in CI on Ubuntu 24.04 noble)

File hashes

Hashes for pydit_jceresearch-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`eb9223eafbb3b17273c04983199095600b0bb69f5476405f68ccf05931a451bc`
MD5	`e056ddad7de38232b924d5e5459397d7`
BLAKE2b-256	`3ade58b268169387d0fd6cf666fe56e8b84765540c671698a7e38f918dd5ffea`

File details

Details for the file pydit_jceresearch-0.2.0-py3-none-any.whl.

File metadata

Download URL: pydit_jceresearch-0.2.0-py3-none-any.whl
Upload date: Feb 15, 2026
Size: 61.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.10.2 (uv publish, run in CI on Ubuntu 24.04 noble)

File hashes

Hashes for pydit_jceresearch-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e3e56b043bde5ca5fedd83eccc27a100af6e18617a5d97618acfc9b19d6b1798`
MD5	`7ad770c6548780af30c66d27c4b1c90a`
BLAKE2b-256	`b8ba1abc14b20eae53148e436dd5114873ef9433d3c338a0678efaced6b6d293`
