Tools for cleaning pandas DataFrames
pyjanitor is a Python implementation of the R package janitor, and
provides a clean API for cleaning data.
- Installation: conda install -c conda-forge pyjanitor. More installation options are listed further down.
- Check out the collection of general functions.
Originally a port of the R package,
pyjanitor has evolved from a set of convenient data cleaning routines
into an experiment with the
method chaining paradigm.
Data preprocessing usually consists of a series of steps that transform raw data into an understandable, usable format. These steps must run in a particular order to succeed. We take a base data file as the starting point and perform actions on it, such as removing null or empty rows, replacing null values with other values, adding, renaming, or removing columns, and filtering rows. More formally, these steps, together with their relationships and dependencies, are commonly described as a Directed Acyclic Graph (DAG).
The pandas API has been invaluable for the Python data science ecosystem,
and implements method chaining of a subset of methods as part of the API.
For example, resetting indexes (.reset_index()),
dropping null values (.dropna()), and more,
are accomplished via the appropriate
pd.DataFrame method calls.
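The two pandas methods named above can be chained as in the following minimal sketch. The DataFrame contents and the variable names raw and cleaned are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data: two of the four rows contain a null value.
raw = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", None, "Pune"],
        "sales": [10.0, np.nan, 7.0, 3.0],
    }
)

cleaned = (
    raw
    .dropna()                # drop every row that contains a null value
    .reset_index(drop=True)  # renumber the remaining rows from 0
)

print(cleaned)
```

Each method returns a new DataFrame, which is what allows the next call to be attached directly to the previous one.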
Inspired by the ease-of-use
and expressiveness of the dplyr package
of the R statistical language ecosystem,
we have evolved
pyjanitor into a language
for expressing the data processing DAG for pandas users.
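To illustrate the idea of writing a processing pipeline as one chain of named steps, here is a rough sketch in plain pandas. The helper functions remove_columns and rename_column are hypothetical stand-ins defined in the snippet itself, attached with DataFrame.pipe; they are not taken from pyjanitor's source, and the data is invented.

```python
import pandas as pd


def remove_columns(df, names):
    """Hypothetical helper: return df without the listed columns."""
    return df.drop(columns=names)


def rename_column(df, old, new):
    """Hypothetical helper: return df with one column renamed."""
    return df.rename(columns={old: new})


# Illustrative data only.
sales = {
    "month": ["Jan", "Feb", "Mar"],
    "region1": [150.0, 200.0, 300.0],
    "region2": [180.0, 250.0, None],
    "region3": [400.0, 500.0, 600.0],
}

df = (
    pd.DataFrame(sales)
    .pipe(remove_columns, ["region1"])        # step 1: drop a column
    .dropna(subset=["region2", "region3"])    # step 2: drop rows with nulls
    .pipe(rename_column, "region2", "north")  # step 3: rename columns
    .pipe(rename_column, "region3", "south")
    .assign(total=lambda d: d["north"] + d["south"])  # step 4: add a column
)

print(df)
```

Reading the chain top to bottom gives the order in which the steps run, so the code doubles as a description of the pipeline.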
pyjanitor is currently installable from PyPI:
pip install pyjanitor
pyjanitor can also be installed with the conda package manager:
conda install pyjanitor -c conda-forge
pyjanitor can also be installed with the pipenv environment manager. This requires enabling prerelease dependencies:
pipenv install --pre pyjanitor
pyjanitor requires Python 3.6+.
Current functionality includes:
- Cleaning column names (multi-indexes are possible!)
- Removing empty rows and columns
- Identifying duplicate entries
- Encoding columns as categorical
- Splitting your data into features and targets (for machine learning)
- Adding, removing, and renaming columns
- Coalescing multiple columns into a single column
- Date conversions (from MATLAB, Excel, Unix) to Python datetime format
- Expanding a single column that has delimited, categorical values into dummy-encoded variables
- Concatenating and deconcatenating columns, based on a delimiter
- Syntactic sugar for filtering the dataframe based on queries on a column
- Experimental submodules for finance, biology, chemistry, engineering, and pyspark
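As a rough illustration of what a few of the operations listed above involve, the sketch below performs them in plain pandas: cleaning column names, removing empty rows and columns, converting Unix timestamps, and expanding a delimited column into dummy variables. It is an assumption-laden approximation for orientation only, not pyjanitor's implementation, and the column names and data are invented.

```python
import numpy as np
import pandas as pd

# Illustrative data: one fully empty row and one fully empty column.
raw = pd.DataFrame(
    {
        "Sample ID": [1, 2, np.nan, 3],
        "Tags Used": ["a,b", "b", np.nan, "a,c"],
        "Unix Time": [0, 86400, np.nan, 172800],
        "Unused": [np.nan] * 4,
    }
)

cleaned = (
    raw
    # Clean column names: lowercase, spaces replaced by underscores.
    .rename(columns=lambda name: name.strip().lower().replace(" ", "_"))
    # Remove rows, then columns, that contain only nulls.
    .dropna(axis="index", how="all")
    .dropna(axis="columns", how="all")
    # Convert seconds since the Unix epoch to datetime values.
    .assign(unix_time=lambda d: pd.to_datetime(d["unix_time"], unit="s"))
)

# Expand the comma-delimited column into one 0/1 column per category.
dummies = cleaned["tags_used"].str.get_dummies(sep=",")

print(cleaned)
print(dummies)
```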