ph - the tabular data shell tool
Project description
ph - the tabular data shell tool
Spoiler: Working with tabular data in the command line is difficult. ph
makes
it easy:
$ pip install ph
$ cat iris.csv | ph columns 4 150 | ph head 15 | ph tail 5 | ph tabulate --headers
4 150
-- --- -----
0 3.7 5.4
1 3.4 4.8
2 3 4.8
3 3 4.3
4 4 5.8
$ cat iris.csv | ph describe
150 4 setosa versicolor virginica
count 150.000000 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333 1.000000
std 0.828066 0.435866 1.765298 0.762238 0.819232
min 4.300000 2.000000 1.000000 0.100000 0.000000
25% 5.100000 2.800000 1.600000 0.300000 0.000000
50% 5.800000 3.000000 4.350000 1.300000 1.000000
75% 6.400000 3.300000 5.100000 1.800000 2.000000
max 7.900000 4.400000 6.900000 2.500000 2.000000
Raison d'être
Using the pipeline in Linux is nothing short of a dream in the life of the computer super user.
However the pipe is clearly most suited for a stream of lines of textual data, and not when the stream is actually tabular data.
Tabular data is much more complex to work with due to its dual indexing and the fact that we often read horizontally and often read vertically.
The defacto format for tabular data is csv
(which is not perfect in any sense
of the word), and the defacto tool for working with tabular data in Python is
Pandas.
This is a shell utility ph
that reads tabular data from standard in and allows
you to perform a pandas function on the data, before writing it to standard out
in csv
format.
The goal is to create a tool which makes it nicer to work with tabular data in a pipeline.
Example usage
Suppose you have a csv file a.csv
that looks like this:
x,y
3,8
4,9
5,10
6,11
7,12
8,13
Transpose:
$ cat a.csv | ph transpose
0,1,2,3,4,5
3,4,5,6,7,8
8,9,10,11,12,13
median
(as well as many others, e.g. abs
, corr
, count
, cov
, cummax
,
cumsum
, diff
, max
, product
, quantile
, rank
, round
, sum
, std
,
var
etc.):
$ cat a.csv | ph median
0
5.5
10.5
Use ph help
to list all commands
Using head
and tail
works approximately as the normal shell equivalents,
however they will preserve the header if there is one, e.g.
$ cat a.csv | ph head 7 | ph tail 3
x,y
6,11
7,12
8,13
If the csv
file contains a column, e.g. named x
containing timestamps, it
can be parsed as such with ph date x
:
$ cat a.csv | ph date x
x,y
1970-01-01 00:00:00.000000003,8
1970-01-01 00:00:00.000000004,9
1970-01-01 00:00:00.000000005,10
1970-01-01 00:00:00.000000006,11
1970-01-01 00:00:00.000000007,12
1970-01-01 00:00:00.000000008,13
The normal Pandas describe
is of course available:
$ cat a.csv | ph describe
x y
count 6.000000 6.000000
mean 5.500000 10.500000
std 1.870829 1.870829
min 3.000000 8.000000
25% 4.250000 9.250000
50% 5.500000 10.500000
75% 6.750000 11.750000
max 8.000000 13.000000
Print the column names:
$ cat a.csv | ph columns
x
y
Selecting only certain columns, e.g. a
and b
$ cat a.csv | ph columns x y
x,y
3,8
4,9
5,10
6,11
7,12
8,13
Rename:
$ cat a.csv | ph rename x a | ph rename y b
a,b
3,8
4,9
5,10
6,11
7,12
8,13
You can sum two columns x
and y
and place the result in column z
using
eval
(from
pandas.DataFrame.eval
).
$ cat a.csv | ph eval "z = x + y"
x,y,z
3,8,11
4,9,13
5,10,15
6,11,17
7,12,19
8,13,21
$ cat a.csv | ph eval "z = x**2 + y"
x,y,z
3,8,17
4,9,25
5,10,35
6,11,47
7,12,61
8,13,77
If you only want the sum of two columns, then, you can pipe the last two using
$ cat a.csv | ph eval "z = x + y" | ph columns z
z
11
13
15
17
19
21
You can normalize a column using ph normalize col
.
$ cat a.csv | ph eval "z = x * y" | ph normalize z
x,y,z
3,8,0.0
4,9,0.15
5,10,0.325
6,11,0.525
7,12,0.75
8,13,1.0
We can query data using ph query expr
.
$ cat a.csv | ph query "x > 5"
x,y
6,11
7,12
8,13
Warning: Calling the following command might be illegal in Norway. Reader beware!
$ ph open csv 'http://bit.ly/2cLzoxH' | ph query "country == 'Norway'" | ph tabulate --headers
country year pop continent lifeExp gdpPercap
-- --------- ------ ----------- ----------- --------- -----------
0 Norway 1952 3.32773e+06 Europe 72.67 10095.4
1 Norway 1957 3.49194e+06 Europe 73.44 11654
2 Norway 1962 3.63892e+06 Europe 73.47 13450.4
3 Norway 1967 3.78602e+06 Europe 74.08 16361.9
4 Norway 1972 3.933e+06 Europe 74.34 18965.1
5 Norway 1977 4.04320e+06 Europe 75.37 23311.3
6 Norway 1982 4.11479e+06 Europe 75.97 26298.6
7 Norway 1987 4.18615e+06 Europe 75.89 31541
8 Norway 1992 4.28636e+06 Europe 77.32 33965.7
9 Norway 1997 4.40567e+06 Europe 78.32 41283.2
10 Norway 2002 4.53559e+06 Europe 79.05 44684
11 Norway 2007 4.62793e+06 Europe 80.196 49357.2
Tabulate
The amazing tabulate tool comes from the Python package tabulate on PyPI.
The tabulate
command takes arguments --headers
to toggle printing of header
row, --format=[grid,...]
to modify the table style and --noindex
to remove
the running index (leftmost column in the example above).
Among the supported format styles are
plain
,simple
,grid
,fancy_grid
,pretty
,github
,rst
,mediawiki
,html
,latex
,- ... (See full list at the project homepage at python-tabulate.)
Plotting data
You can plot data using ph plot [index]
.
$ ph open parquet 1A_2019.parquet | ph columns Time Value | ph plot Time
This will take the columns Time
and Value
from the timeseries provided by
the given parquet
file and plot the Value
series using Time
as index.
The following example plots the life expectancy in Norway using year
as index:
$ ph open csv http://bit.ly/2cLzoxH | ph query "country == 'Norway'" | ph appendstr year -01-01 | ph columns year lifeExp | ph plot year
The strange ph appendstr year -01-01
turns the items 1956
into
"1956-01-01"
and 2005
into "2005-01-01"
. These are necessary to make
pandas to interpret 1956
as a year and not as a millisecond.
The command ph appendstr col str [newcol]
takes a string and appends it to a
column, overwriting the original column, or writing it to newcol
if provided.
Working with different file types
Pandas supports reading a multitude of readers.
To read an Excel file and pipe the stream, you can use ph open
.
The syntax of ph open
is ph open ftype fname
, where fname
is the
file you want to stream and ftype
is the type of the file.
A list of all available formats is given below.
$ ph open xls a.xlsx
x,y
3,8
4,9
5,10
6,11
7,12
8,13
csv
/tsv
(the latter for tab-separated values)fwf
(fixed-width file format)json
html
clipboard
(pastes tab-separated content from clipboard)xls
odf
hdf5
feather
parquet
orc
stata
sas
spss
pickle
sql
gbq
/google
/bigquery
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.