
A package designed to syntactically mimic the tidyr R package


tidypython - A simple Python package designed to syntactically mimic the tidyr package in R.

Install the package with pip:

pip install tidypython

Load the functions into your script with:

from tidypython import *

You will then have access to the functions:

gather(*args, **kwargs)
spread(*args, **kwargs)
separate(*args, **kwargs)

The syntax is designed to resemble that of the R package tidyr as closely as possible. Not all of the functionality is implemented yet, but the basics are there.

All of these functions are designed to work with the >> operator from the dplython package.

This package is currently in development. Please visit the GitHub page if you'd like to contribute: https://github.com/durrantmm/tidypython

Examples

First, import the dplython package:

from dplython import *

Make sure that your dataframe is a DplyFrame object. You can convert it by calling:

df = DplyFrame(df)

Or you can read in your file directly using the readpy package:

from readpy import *
df = read_tsv("myfile.tsv")

gather()

The gather() command implements the pandas melt() function using the tidyr syntax.

You can use the gather command as:

df >> gather(X.key, X.value, X.column1, X.column2...)

X.key and X.value are used to determine the new names of the key and value columns that will be created.

By default, the columns you list (column1, column2, and any others that follow) supply the keys and the values. All unspecified columns are kept as index columns. Alternatively, you can use the syntax:

df >> gather(X.key, X.value, X.column1, X.column2, exclude=True)

This makes column1 and column2 the index columns, and all other unspecified columns are gathered into the key and value columns.
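The two call styles above correspond to two ways of calling pandas melt(), which the text names as the underlying function. The following is a minimal sketch in plain pandas, not tidypython's actual implementation. The small frame and the variable names are illustrative assumptions, with values taken from the mtcars rows shown in this document.

```python
import pandas as pd

# Small stand-in for the mtcars table (hypothetical two-row subset).
df = pd.DataFrame({
    "name": ["Mazda RX4", "Datsun 710"],
    "mpg": [21.0, 22.8],
    "cyl": [6, 4],
})

# Inclusion: the listed columns are gathered; every other column is an identifier.
gathered_cols = ["mpg", "cyl"]
id_cols = [c for c in df.columns if c not in gathered_cols]
by_inclusion = df.melt(id_vars=id_cols, value_vars=gathered_cols,
                       var_name="info", value_name="val")

# Exclusion: the listed columns are identifiers; every other column is gathered.
excluded_cols = ["name"]
by_exclusion = df.melt(id_vars=excluded_cols,
                       var_name="info", value_name="val")

print(by_inclusion)
print(by_inclusion.equals(by_exclusion))
```

Both calls should produce the same long table here, because naming the columns to gather and naming the columns to keep describe the same partition of the frame.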

Using the mtcars data as an example:

>>> mtcars = read_tsv('mtcars.tsv')
>>> print(mtcars >> head())
                name   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2

Let's first gather all of the columns of interest by inclusion:

>>> mtcars_gathered_inclusion = mtcars >> \
... gather(X.info, X.val, X.mpg, X.cyl, X.disp, X.hp, X.drat, X.wt, X.qsec, X.vs, X.am, X.gear, X.carb)
>>> print(mtcars_gathered_inclusion >> head())
                name info   val
0          Mazda RX4  mpg  21.0
1      Mazda RX4 Wag  mpg  21.0
2         Datsun 710  mpg  22.8
3     Hornet 4 Drive  mpg  21.4
4  Hornet Sportabout  mpg  18.7
>>> print(mtcars_gathered_inclusion >> tail())
               name  info  val
347    Lotus Europa  carb  2.0
348  Ford Pantera L  carb  4.0
349    Ferrari Dino  carb  6.0
350   Maserati Bora  carb  8.0
351      Volvo 142E  carb  2.0

Now we can do it by exclusion, which is much shorter in this case:

>>> mtcars_gathered_exclusion = mtcars >> \
... gather(X.info, X.val, X.name, exclude=True)
>>> print(mtcars_gathered_exclusion >> head())
                name info   val
0          Mazda RX4  mpg  21.0
1      Mazda RX4 Wag  mpg  21.0
2         Datsun 710  mpg  22.8
3     Hornet 4 Drive  mpg  21.4
4  Hornet Sportabout  mpg  18.7
>>> print(mtcars_gathered_exclusion >> tail())
               name  info  val
347    Lotus Europa  carb  2.0
348  Ford Pantera L  carb  4.0
349    Ferrari Dino  carb  6.0
350   Maserati Bora  carb  8.0
351      Volvo 142E  carb  2.0

You can see that it functions very much in the same manner as the tidyr::gather function.

spread()

The spread() command implements the pandas pivot() function using the tidyr syntax.

You can use the spread command as:

df >> spread(X.key, X.value)

X.key and X.value specify the existing columns that are pivoted, and all other unused columns are assumed to be the index.

Using the mtcars data as an example:

>>> mtcars = read_tsv('mtcars.tsv')
>>> print(mtcars >> head())
                name   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2

Let's first gather all of the columns of interest by exclusion:

>>> mtcars_gathered_exclusion = mtcars >> \
... gather(X.info, X.val, X.name, exclude=True)
>>> print(mtcars_gathered_exclusion >> head())
                name info   val
0          Mazda RX4  mpg  21.0
1      Mazda RX4 Wag  mpg  21.0
2         Datsun 710  mpg  22.8
3     Hornet 4 Drive  mpg  21.4
4  Hornet Sportabout  mpg  18.7

We can then pivot the info and val columns out by using the spread function:

>>> print(mtcars_gathered_exclusion >> spread(X.info, X.val) >> head())
                 name   mpg  cyl   disp     hp  drat     wt   qsec   vs   am  gear   carb
0         AMC Javelin  15.2  8.0  304.0  150.0  3.15  3.435  17.30  0.0  0.0   3.0    2.0
1  Cadillac Fleetwood  10.4  8.0  472.0  205.0  2.93  5.250  17.98  0.0  0.0   3.0    4.0
2          Camaro Z28  13.3  8.0  350.0  245.0  3.73  3.840  15.41  0.0  0.0   3.0    4.0
3   Chrysler Imperial  14.7  8.0  440.0  230.0  3.23  5.345  17.42  0.0  0.0   3.0    4.0
4          Datsun 710  22.8  4.0  108.0   93.0  3.85  2.320  18.61  1.0  1.0   4.0    1.0

You can see that it functions very much in the same manner as the tidyr::spread function. As currently implemented, the order of the columns will be preserved in Python >= 3.6, but the order of the index will not.
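For comparison, the same reshaping can be sketched directly with pandas pivot(), the function the text says spread() builds on. This is an illustration in plain pandas under assumed names (long_df, wide), not the package's own code. Plain pivot() sorts the resulting columns as well as the index, so its column order may differ from what spread() returns.

```python
import pandas as pd

# Long-format stand-in for a gathered table (hypothetical subset of mtcars).
long_df = pd.DataFrame({
    "name": ["Mazda RX4", "Datsun 710", "Mazda RX4", "Datsun 710"],
    "info": ["mpg", "mpg", "cyl", "cyl"],
    "val": [21.0, 22.8, 6.0, 4.0],
})

# Pivot the key column out into one column per distinct key.
wide = long_df.pivot(index="name", columns="info", values="val")

# Turn the index back into an ordinary column and drop the columns' label.
wide = wide.reset_index()
wide.columns.name = None

print(wide)
```

Because the gathered val column holds floats, integer-like fields such as cyl come back as floats, which matches the 8.0 and 4.0 values in the spread() output above.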

separate()

The separate() command doesn't have a direct parallel in other python packages that I am aware of.

You can use the separate command as:

df >> separate(X.column, into, sep=myseparator)

X.column is the column that you want to split, into is a list of the new column names for the split columns, and sep is a regular expression used to split X.column. By default, the split pattern is [^\w]+.
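To see what the default pattern does to a typical value, here is a small sketch using Python's re module. It assumes the pattern is applied the way re.split applies it, which the source does not confirm; the variable names are illustrative.

```python
import re

# Default separator pattern quoted in the text: one or more non-word characters.
DEFAULT_SEP = r"[^\w]+"

value = "Datsun 710|22.8|4"

# Spaces, pipes, and the decimal point are all non-word characters,
# so the default pattern splits on every one of them.
default_parts = re.split(DEFAULT_SEP, value)
print(default_parts)

# An explicit, escaped pipe splits only on '|'.
pipe_parts = re.split(r"\|", value)
print(pipe_parts)
```

Under this assumption the default pattern yields five pieces for this value, while the escaped pipe yields three, which is why an explicit sep is worth passing when values contain spaces or decimals.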

Let's say that our mtcars dataframe only included a name, mpg, and cyl joined in a single column by the separator '|':

>>> print(mtcars_messy >> head())
                       name  
0          Mazda RX4|21.0|6
1      Mazda RX4 Wag|21.0|6
2         Datsun 710|22.8|4
3     Hornet 4 Drive|21.4|6
4  Hornet Sportabout|18.7|8

You could separate this column into three columns using the command:

>>> mtcars_clean = mtcars_messy >> separate(X.name, ['name', 'mpg', 'cyl'], sep=r'\|')
>>> print(mtcars_clean >> head())
                name   mpg cyl
0          Mazda RX4  21.0   6
1      Mazda RX4 Wag  21.0   6
2         Datsun 710  22.8   4
3     Hornet 4 Drive  21.4   6
4  Hornet Sportabout  18.7   8

Note that all of the columns remain strings after separating. No warning is given if the number of column names specified does not match the number of pieces produced by the split.
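Since the separated columns stay strings and a mismatch in the number of pieces is not reported, it can help to check and convert explicitly. The sketch below uses plain pandas (str.split with expand=True) as a stand-in for separate(); the frame, the variable names, and the check itself are assumptions for illustration, not part of tidypython.

```python
import pandas as pd

# Hypothetical messy frame with three fields joined by '|'.
messy = pd.DataFrame({"name": ["Mazda RX4|21.0|6", "Datsun 710|22.8|4"]})
new_names = ["name", "mpg", "cyl"]

# A single-character pattern is treated as a literal string by pandas.
parts = messy["name"].str.split("|", expand=True)

# Explicit guard for the mismatch that would otherwise pass silently.
if parts.shape[1] != len(new_names):
    raise ValueError(
        f"expected {len(new_names)} pieces, got {parts.shape[1]}"
    )
parts.columns = new_names

# Every piece is still a string, so convert the numeric columns by hand.
clean = parts.astype({"mpg": float, "cyl": int})

print(clean)
print(clean.dtypes)
```

The same conversion step applies to the output of separate(): cast the new columns to numeric types before doing arithmetic on them.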
