Tools for cleaning pandas DataFrames

Project description

pyjanitor is a Python implementation of the R package janitor, and provides a clean API for cleaning data.

Why janitor?

Originally a port of the R package, pyjanitor has evolved from a set of convenient data cleaning routines into an experiment with the method chaining paradigm.

Data preprocessing usually consists of a series of steps that transform raw data into an understandable, usable format. These steps must run in a particular order to succeed. We take a base data file as the starting point and perform actions on it, such as removing null or empty rows, replacing null values with other values, adding, renaming, or removing columns, and filtering rows. More formally, these steps, together with their relationships and dependencies, form what is commonly called a Directed Acyclic Graph (DAG).

The pandas API has been invaluable for the Python data science ecosystem, and it supports method chaining for a subset of its methods. For example, resetting indexes (.reset_index()), dropping null values (.dropna()), and more are accomplished via the appropriate pd.DataFrame method calls.
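As a minimal sketch of the chaining style described above, the snippet below drops rows with null values and then resets the index in a single expression. The sample data and the variable names (raw, cleaned) are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data with one row that contains nulls.
raw = pd.DataFrame(
    {
        "name": ["Ada", None, "Grace"],
        "score": [90.0, None, 85.0],
    }
)

# Each method returns a new DataFrame, so the calls can be chained.
cleaned = (
    raw
    .dropna()                 # drop rows containing any null value
    .reset_index(drop=True)   # renumber the remaining rows from 0
)
```

Because each call returns a DataFrame, the order of the steps reads top to bottom, which is the property pyjanitor builds on.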

Inspired by the ease-of-use and expressiveness of the dplyr package of the R statistical language ecosystem, we have evolved pyjanitor into a language for expressing the data processing DAG for pandas users.

Functionality

Current functionality includes:

  • Cleaning column names (multi-indexes are possible!)

  • Removing empty rows and columns

  • Identifying duplicate entries

  • Encoding columns as categorical

  • Splitting your data into features and targets (for machine learning)

  • Adding, removing, and renaming columns

  • Coalescing multiple columns into a single column

  • Date conversions (from MATLAB, Excel, Unix) to Python datetime format

  • Expanding a single column that has delimited, categorical values into dummy-encoded variables

  • Concatenating and deconcatenating columns, based on a delimiter

  • Syntactic sugar for filtering the dataframe based on queries on a column

  • Experimental submodules for finance, biology, chemistry, engineering, and pyspark
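To make two of the items above concrete, here is a sketch of what "coalescing" and "expanding a delimited column into dummy-encoded variables" mean, written with plain pandas calls rather than pyjanitor's own functions (whose signatures are not shown in this description). The column names and data are hypothetical.

```python
import pandas as pd

# Hypothetical data: two partially filled phone columns and a
# pipe-delimited categorical column.
df = pd.DataFrame(
    {
        "phone_home": ["555-0100", None, None],
        "phone_work": [None, "555-0199", None],
        "tags": ["a|b", "b", "a|c"],
    }
)

# Coalesce: take the first non-null value across the two phone columns.
df["phone"] = df["phone_home"].combine_first(df["phone_work"])

# Expand the delimited column into one indicator column per category.
dummies = df["tags"].str.get_dummies(sep="|")
result = pd.concat([df, dummies], axis=1)
```

This is only an approximation of the behaviour in plain pandas; pyjanitor wraps operations like these as chainable DataFrame methods.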

Download files

Source Distribution

pyjanitor-0.20.10.tar.gz (82.9 kB view details)

Uploaded Source

Built Distribution

pyjanitor-0.20.10-py3-none-any.whl (78.7 kB view details)

Uploaded Python 3

File details

Details for the file pyjanitor-0.20.10.tar.gz.

File metadata

  • Download URL: pyjanitor-0.20.10.tar.gz
  • Upload date:
  • Size: 82.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0.post20200814 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.8

File hashes

Hashes for pyjanitor-0.20.10.tar.gz
Algorithm Hash digest
SHA256 cde75bd4b67baedfdcf2c354a151bceecefb5f3b6501990c8777a5575c3a3f35
MD5 1b89939cce9e7f069021610e65a31e37
BLAKE2b-256 9d0f58d6fcb0db5f9e8a931f68ce07dc59734dff23f4b705c5ae09ff3a07e1b2

See more details on using hashes here.
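One way to check a downloaded file against a published SHA256 digest is sketched below using only the Python standard library. The helper names (sha256_of_file, matches_published_hash) are hypothetical and not part of pyjanitor or any packaging tool.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, calling matches_published_hash with the path to a local copy of pyjanitor-0.20.10.tar.gz and the SHA256 value listed above should return True if the download is intact.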

File details

Details for the file pyjanitor-0.20.10-py3-none-any.whl.

File metadata

  • Download URL: pyjanitor-0.20.10-py3-none-any.whl
  • Upload date:
  • Size: 78.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0.post20200814 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.8

File hashes

Hashes for pyjanitor-0.20.10-py3-none-any.whl
Algorithm Hash digest
SHA256 52b29b85d1bac4816577f9323774b51e4d41ae85687e998e4f3a4ed452cb500b
MD5 d201f353bd59b4ac1da1c768f5726dcb
BLAKE2b-256 7a9bc8206d9f045568bdec6ef9d66aa82e92b53414299ea678cc0f7c1f0a80d2
