Skip to main content

ph - the tabular data shell tool

Project description

ph - the tabular data shell tool

Spoiler: Working with tabular data in the command line is difficult. ph makes it easy:

$ pip install ph
$ cat iris.csv | ph columns 4 150 | ph head 15 | ph tail 5 | ph tabulate --headers
      4    150
--  ---  -----
 0  3.7    5.4
 1  3.4    4.8
 2  3      4.8
 3  3      4.3
 4  4      5.8
$ cat iris.csv | ph describe
              150           4      setosa  versicolor   virginica
count  150.000000  150.000000  150.000000  150.000000  150.000000
mean     5.843333    3.057333    3.758000    1.199333    1.000000
std      0.828066    0.435866    1.765298    0.762238    0.819232
min      4.300000    2.000000    1.000000    0.100000    0.000000
25%      5.100000    2.800000    1.600000    0.300000    0.000000
50%      5.800000    3.000000    4.350000    1.300000    1.000000
75%      6.400000    3.300000    5.100000    1.800000    2.000000
max      7.900000    4.400000    6.900000    2.500000    2.000000

Raison d'être

Using the pipeline in Linux is nothing short of a dream in the life of the computer super user.

However the pipe is clearly most suited for a stream of lines of textual data, and not when the stream is actually tabular data.

Tabular data is much more complex to work with due to its dual indexing and the fact that we often read horizontally and often read vertically.

The defacto format for tabular data is csv (which is not perfect in any sense of the word), and the defacto tool for working with tabular data in Python is Pandas.

This is a shell utility ph that reads tabular data from standard in and allows you to perform a pandas function on the data, before writing it to standard out in csv format.

The goal is to create a tool which makes it nicer to work with tabular data in a pipeline.

Getting started

If you have installed ph[data], you can try out the examples above using

ph dataset boston | ph describe

Available datasets are from scikit-learn.datasets

Toy datasets:

  • boston
  • iris
  • diabetes
  • digits
  • linnerud
  • wine
  • breast_cancer

Real world:

  • olivetti_faces
  • 20newsgroups
  • 20newsgroups_vectorized
  • lfw_people
  • lfw_pairs
  • covtype
  • rcv1
  • kddcup99
  • california_housing

Example usage

Suppose you have a csv file a.csv that looks like this:

x,y
3,8
4,9
5,10
6,11
7,12
8,13

Transpose:

$ cat a.csv | ph transpose
0,1,2,3,4,5
3,4,5,6,7,8
8,9,10,11,12,13

median (as well as many others, e.g. abs, corr, count, cov, cummax, cumsum, diff, max, product, quantile, rank, round, sum, std, var etc.):

$ cat a.csv | ph median
0
5.5
10.5

Use ph help to list all commands

Using head and tail works approximately as the normal shell equivalents, however they will preserve the header if there is one, e.g.

$ cat a.csv | ph head 7 | ph tail 3
x,y
6,11
7,12
8,13

If the csv file contains a column, e.g. named x containing timestamps, it can be parsed as such with ph date x:

$ cat a.csv | ph date x
x,y
1970-01-01 00:00:00.000000003,8
1970-01-01 00:00:00.000000004,9
1970-01-01 00:00:00.000000005,10
1970-01-01 00:00:00.000000006,11
1970-01-01 00:00:00.000000007,12
1970-01-01 00:00:00.000000008,13

The normal Pandas describe is of course available:

$ cat a.csv | ph describe
              x          y
count  6.000000   6.000000
mean   5.500000  10.500000
std    1.870829   1.870829
min    3.000000   8.000000
25%    4.250000   9.250000
50%    5.500000  10.500000
75%    6.750000  11.750000
max    8.000000  13.000000

Print the column names:

$ cat a.csv | ph columns
x
y

Selecting only certain columns, e.g. a and b

$ cat a.csv | ph columns x y
x,y
3,8
4,9
5,10
6,11
7,12
8,13

Rename:

$ cat a.csv | ph rename x a | ph rename y b
a,b
3,8
4,9
5,10
6,11
7,12
8,13

You can sum two columns x and y and place the result in column z using eval (from pandas.DataFrame.eval).

$ cat a.csv | ph eval "z = x + y"
x,y,z
3,8,11
4,9,13
5,10,15
6,11,17
7,12,19
8,13,21
$ cat a.csv | ph eval "z = x**2 + y"
x,y,z
3,8,17
4,9,25
5,10,35
6,11,47
7,12,61
8,13,77

If you only want the sum of two columns, then, you can pipe the last two using

$ cat a.csv | ph eval "z = x + y" | ph columns z
z
11
13
15
17
19
21

You can normalize a column using ph normalize col.

$ cat a.csv | ph eval "z = x * y" | ph normalize z
x,y,z
3,8,0.0
4,9,0.15
5,10,0.325
6,11,0.525
7,12,0.75
8,13,1.0

We can query data using ph query expr.

$ cat a.csv | ph query "x > 5"
x,y
6,11
7,12
8,13

Warning: Calling the following command might be illegal in Norway. Reader beware!

$ ph open csv 'http://bit.ly/2cLzoxH' | ph query "country == 'Norway'" | ph tabulate --headers
    country      year          pop  continent      lifeExp    gdpPercap
--  ---------  ------  -----------  -----------  ---------  -----------
 0  Norway       1952  3.32773e+06  Europe          72.67       10095.4
 1  Norway       1957  3.49194e+06  Europe          73.44       11654
 2  Norway       1962  3.63892e+06  Europe          73.47       13450.4
 3  Norway       1967  3.78602e+06  Europe          74.08       16361.9
 4  Norway       1972  3.933e+06    Europe          74.34       18965.1
 5  Norway       1977  4.04320e+06  Europe          75.37       23311.3
 6  Norway       1982  4.11479e+06  Europe          75.97       26298.6
 7  Norway       1987  4.18615e+06  Europe          75.89       31541
 8  Norway       1992  4.28636e+06  Europe          77.32       33965.7
 9  Norway       1997  4.40567e+06  Europe          78.32       41283.2
10  Norway       2002  4.53559e+06  Europe          79.05       44684
11  Norway       2007  4.62793e+06  Europe          80.196      49357.2

Tabulate

The amazing tabulate tool comes from the Python package tabulate on PyPI.

The tabulate command takes arguments --headers to toggle printing of header row, --format=[grid,...] to modify the table style and --noindex to remove the running index (leftmost column in the example above).

Among the supported format styles are

  • plain, simple,
  • grid, fancy_grid, pretty,
  • github, rst, mediawiki, html, latex,
  • ... (See full list at the project homepage at python-tabulate.)

Plotting data

You can plot data using ph plot [index].

$ ph open parquet 1A_2019.parquet | ph columns Time Value | ph plot Time

This will take the columns Time and Value from the timeseries provided by the given parquet file and plot the Value series using Time as index.

The following example plots the life expectancy in Norway using year as index:

$ ph open csv http://bit.ly/2cLzoxH  | ph query "country == 'Norway'" | ph appendstr year -01-01 | ph columns year lifeExp | ph plot year

life-expectancy over time

The strange ph appendstr year -01-01 turns the items 1956 into "1956-01-01" and 2005 into "2005-01-01". These are necessary to make pandas to interpret 1956 as a year and not as a millisecond.

The command ph appendstr col str [newcol] takes a string and appends it to a column, overwriting the original column, or writing it to newcol if provided.

Working with different file types

Pandas supports reading a multitude of readers.

To read an Excel file and pipe the stream, you can use ph open.

The syntax of ph open is ph open ftype fname, where fname is the file you want to stream and ftype is the type of the file.

A list of all available formats is given below.

$ ph open xls a.xlsx
x,y
3,8
4,9
5,10
6,11
7,12
8,13
  • csv / tsv (the latter for tab-separated values)
  • fwf (fixed-width file format)
  • json
  • html
  • clipboard (pastes tab-separated content from clipboard)
  • xls
  • odf
  • hdf5
  • feather
  • parquet
  • orc
  • stata
  • sas
  • spss
  • pickle
  • sql
  • gbq / google / bigquery

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ph-0.0.22.tar.gz (27.6 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page