Tools for cleaning pandas DataFrames
Project description
pyjanitor
is a Python implementation of the R package janitor
, and
provides a clean API for cleaning data.
Quick start
- Installation:
conda install -c conda-forge pyjanitor
. Read more installation instructions here. - Check out the collection of general functions.
Why janitor?
Originally a port of the R package,
pyjanitor
has evolved from a set of convenient data cleaning routines
into an experiment with the method chaining
paradigm.
Data preprocessing usually consists of a series of steps that involve transforming raw data into an understandable/usable format. These series of steps need to be run in a certain sequence to achieve success. We take a base data file as the starting point, and perform actions on it, such as removing null/empty rows, replacing them with other values, adding/renaming/removing columns of data, filtering rows and others. More formally, these steps along with their relationships and dependencies are commonly referred to as a Directed Acyclic Graph (DAG).
The pandas
API has been invaluable for the Python data science ecosystem,
and implements method chaining of a subset of methods as part of the API.
For example, resetting indexes (.reset_index()
),
dropping null values (.dropna()
), and more,
are accomplished via the appropriate pd.DataFrame
method calls.
Inspired by the ease-of-use
and expressiveness of the dplyr
package
of the R statistical language ecosystem,
we have evolved pyjanitor
into a language
for expressing the data processing DAG for pandas
users.
Installation
pyjanitor
is currently installable from PyPI:
pip install pyjanitor
pyjanitor
also can be installed by the conda package manager:
conda install pyjanitor -c conda-forge
pyjanitor
can be installed by the pipenv environment manager too. This requires enabling prerelease dependencies:
pipenv install --pre pyjanitor
pyjanitor
requires Python 3.6+.
Functionality
Current functionality includes:
- Cleaning columns name (multi-indexes are possible!)
- Removing empty rows and columns
- Identifying duplicate entries
- Encoding columns as categorical
- Splitting your data into features and targets (for machine learning)
- Adding, removing, and renaming columns
- Coalesce multiple columns into a single column
- Date conversions (from matlab, excel, unix) to Python datetime format
- Expand a single column that has delimited, categorical values into dummy-encoded variables
- Concatenating and deconcatenating columns, based on a delimiter
- Syntactic sugar for filtering the dataframe based on queries on a column
- Experimental submodules for finance, biology, chemistry, engineering, and pyspark
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for pyjanitor-0.29.1-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 9eeb9bb43fbc545170ae757c2d2dd14ce9ed1778b191621abe78062362191ad3 |
|
MD5 | ad4238f085fec2ad41701d929d702701 |
|
BLAKE2b-256 | 4a237e3e399ed87724da29c6fc6fec43e6de7fb7fd8ff0ac8b1e5ed054cd305c |