Skip to main content

A command line visualization utility for SQL using Pandas library in Python.

Project description

GitHub Page
============

Pandas-via-psql (ppsqlviz) is a command line visualization utility for SQL using Pandas library in Python.
Please visit the GitHub page [ppsqlviz](http://vatsan.github.io/pandas_via_psql/) for a complete tutorial.

PSQL + Pandas Awesomeness
==========================

[Pandas](http://pandas.pydata.org/) is a popular library in Python that is commonly used for data analysis and it provides Python equivalent of the R dataframe that is fundamental to data analysis. Some engineers and data scientists however are increasingly adopting SQL based libraries for building large scale machine learning algorithms. [MADlib](http://madlib.net) is one such library for scalable, parallel, in-database machine learning.

While there are commercial tools to visualize data that reside in databases (example: Tableau), often what's missing in a Big Data scientist's arsenal is a command line tool to be able to quickly visualize the output of a SQL query, without having to switch to a commercial tool or have to use a wrapper to a SQL engine. The pandas_via_psql (ppsqlviz) will show you how simple it is to redirect the output of a SQL query to some boilerplate Pandas's plotting functions, to quickly visualize the data from the command line.

Pre-Requisites
==============

ppsqlviz depends on the Pandas python library. You should also have [PSQL](http://www.postgresql.org/docs/8.1/static/app-psql.html) or a similar SQL command line interface to connect to your database and also ensure that you have password-less access to your remote database (set up SSH keys appropriately).

I recommend you download [Anaconda Python](https://store.continuum.io/cshop/anaconda/) from the nice folks at [Continuum Analytics](http://continuum.io/). It's got most of the essential Python scientific computing libraries pre-packaged and with [conda](http://bokeh.pydata.org/) you can save a lot of pain in installing python libraries. It also makes creating and managing virtual environments a piece of cake!

Installation
=============

You can install install ppsqlviz through pip

```
pip install ppsqlviz
```

This will install the dependent library (Pandas) if you don't already have that. I strongly encourage you use Anaconda Python to avoid going down the rabbit hole of PyData stack dependency nightmares.


Datasets Used
==============

For this demo, I'm using two publicly available datasets.
* [The UCI wine quality dataset](http://archive.ics.uci.edu/ml/datasets/Wine+Quality) - Here is a sampling of rows from this dataset:

```
alcohol | mmalic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280 | proline | quality
---------+-------------+------+-------------------+-----------+---------------+------------+----------------------+-----------------+-----------------+-------+-------+---------+---------
1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065
1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050
1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185
1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480
1 | 13.24 | 2.59 | 2.87 | 21 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735
1 | 14.2 | 1.76 | 2.45 | 15.2 | 112 | 3.27 | 3.39 | 0.34 | 1.97 | 6.75 | 1.05 | 2.85 | 1450
1 | 14.39 | 1.87 | 2.45 | 14.6 | 96 | 2.5 | 2.52 | 0.3 | 1.98 | 5.25 | 1.02 | 3.58 | 1290
1 | 14.06 | 2.15 | 2.61 | 17.6 | 121 | 2.6 | 2.51 | 0.31 | 1.25 | 5.05 | 1.06 | 3.58 | 1295
1 | 14.83 | 1.64 | 2.17 | 14 | 97 | 2.8 | 2.98 | 0.29 | 1.98 | 5.2 | 1.08 | 2.85 | 1045
1 | 13.86 | 1.35 | 2.27 | 16 | 98 | 2.98 | 3.15 | 0.22 | 1.85 | 7.22 | 1.01 | 3.55 | 1045
1 | 14.1 | 2.16 | 2.3 | 18 | 105 | 2.95 | 3.32 | 0.22 | 2.38 | 5.75 | 1.25 | 3.17 | 1510
1 | 14.12 | 1.48 | 2.32 | 16.8 | 95 | 2.2 | 2.43 | 0.26 | 1.57 | 5 | 1.17 | 2.82 | 1280
1 | 13.75 | 1.73 | 2.41 | 16 | 89 | 2.6 | 2.76 | 0.29 | 1.81 | 5.6 | 1.15 | 2.9 | 1320
1 | 14.75 | 1.73 | 2.39 | 11.4 | 91 | 3.1 | 3.69 | 0.43 | 2.81 | 5.4 | 1.25 | 2.73 | 1150
1 | 14.38 | 1.87 | 2.38 | 12 | 102 | 3.3 | 3.64 | 0.29 | 2.96 | 7.5 | 1.2 | 3 | 1547
1 | 13.63 | 1.81 | 2.7 | 17.2 | 112 | 2.85 | 2.91 | 0.3 | 1.46 | 7.3 | 1.28 | 2.88 | 1310

```

* [The S&P daily prices dataset](http://finance.yahoo.com/q/hp?s=%5EGSPC+Historical+Prices) - Here is a sampling of rows from this dataset:

```
dt | open | high | low | close | volume | adj_close
------------+---------+---------+---------+---------+------------+-----------
2013-09-27 | 1695.52 | 1695.52 | 1687.11 | 1691.75 | 2951700000 | 1691.75
2012-04-23 | 1378.53 | 1378.53 | 1358.79 | 1366.94 | 3654860000 | 1366.94
2012-01-18 | 1293.65 | 1308.11 | 1290.99 | 1308.04 | 4096160000 | 1308.04
2011-09-07 | 1165.85 | 1198.62 | 1165.85 | 1198.62 | 4441040000 | 1198.62
2011-06-03 | 1312.94 | 1312.94 | 1297.9 | 1300.16 | 3505030000 | 1300.16
2011-03-31 | 1327.44 | 1329.77 | 1325.03 | 1325.83 | 3566270000 | 1325.83
2010-12-28 | 1259.1 | 1259.9 | 1256.22 | 1258.51 | 2478450000 | 1258.51
2010-09-23 | 1131.1 | 1136.77 | 1122.79 | 1124.83 | 3847850000 | 1124.83
2010-07-21 | 1086.67 | 1088.96 | 1065.25 | 1069.59 | 4747180000 | 1069.59
2010-05-13 | 1170.04 | 1173.57 | 1156.14 | 1157.44 | 4870640000 | 1157.44
2010-03-10 | 1140.22 | 1148.26 | 1140.09 | 1145.61 | 5469120000 | 1145.61
2009-12-04 | 1100.43 | 1119.13 | 1096.52 | 1105.98 | 5781140000 | 1105.98
2009-07-24 | 972.16 | 979.79 | 965.95 | 979.26 | 4458300000 | 979.26
2009-02-09 | 868.24 | 875.01 | 861.65 | 869.89 | 5574370000 | 869.89
2008-11-05 | 1001.84 | 1001.84 | 949.86 | 952.77 | 5426640000 | 952.77
2008-09-02 | 1287.83 | 1303.04 | 1272.2 | 1277.58 | 4783560000 | 1277.58
2008-04-30 | 1391.22 | 1404.57 | 1384.25 | 1385.59 | 4508890000 | 1385.59
2008-01-25 | 1357.32 | 1368.56 | 1327.5 | 1330.61 | 4882250000 | 1330.61
2007-09-14 | 1483.95 | 1485.99 | 1473.18 | 1484.25 | 2641740000 | 1484.25
```

Usage
======
Invoke Pandas plotting functions by piping in the output from a psql query.
You can re-use this boiler-plate code for Scatter Plots, Box Plots, Histograms and Time Series Plots on your tables.


Scatter Matrix
===============
This is pretty useful when you are interested in analyzing the correlation between a bunch of features in a dataset, particularly in their correlation with the target attribute/label. You might then perform feature selection based on a visual output of the correlations.

Here is how the scatter matrix can be created on the UCI Wine Quality Dataset
```
home$ psql -d vatsandb -h dca -U gpadmin -c 'select * from wine;' | python -m 'ppsqlviz.plotter' scatter
```
Here is the output ![Scatter Matrix of all features from the Wine Quality Dataset]
(https://raw.githubusercontent.com/vatsan/pandas_via_psql/master/plots/scatter_matrix.png)


Hexbin Plots
=============
Scatter plots sometimes may not reveal the underlying relationship between the dimensions when multiple points overlap.
For this reason, it is better to look at a 2-d histogram or a hex-bin plot. We can tap into `matplotlib's` hexbin plot for this.

You could invoke it from your command line like so:
```
home$ psql -d vatsandb -h dca -U gpadmin -c 'select ash, flavanoids from wine;' | python -m 'ppsqlviz.plotter' hexbin
```
Here is the output ![Hexbin plot of Ash vs. Flavanoids from Wine Quality Dataset]
(https://raw.githubusercontent.com/vatsan/pandas_via_psql/master/plots/hexbin.png)


Histogram Plot
==============
To get a quick glimpse of the distribution of the data in your columns, a histogram plot of all columns is quite useful.

You could invoke it from your command line like so:
```
home$ psql -d vatsandb -h dca -U gpadmin -c 'select ash, flavanoids, hue, proline from wine;' | python -m 'ppsqlviz.plotter' hist
```
Here is the output
![Histogram Plots of some features from the Wine Quality Dataset](https://raw.githubusercontent.com/vatsan/pandas_via_psql/master/plots/histogram.png)

Density Plot
=============
In place of binning your data, you might consider plotting the density directly.

You could invoke it from your command line like so:
```
home$ psql -d vatsandb -h dca -U gpadmin -c 'select ash, flavanoids, hue, proline from wine;' | python -m 'ppsqlviz.plotter' density
```
Here is the output ![Density Plots of some features from the Wine Quality Dataset](https://raw.github.com/vatsan/pandas_via_psql/master/plots/density.png)

Box Plot
=========
Box plots are useful in visually getting a feel for the quartile ranges of numerical columns in your dataset. You could invoke it from your command line like so:

```
home$ psql -d vatsandb -h dca -U gpadmin -c 'select ash, flavanoids, hue, proline from wine;' | python -m 'ppsqlviz.plotter' box
```
Here is the output ![Box Plot of some features from the Wine Quality Dataset](https://raw.githubusercontent.com/vatsan/pandas_via_psql/master/plots/boxplot.png)

Time Series Plot
=================
Again, Pandas has an impressive collection of functions for time series analysis but to quickly visualize a time series, you can run the following from your command line:
```
home$ psql -d vatsandb -h dca -U gpadmin -c 'select dt, high, low from sandp_prices where dt > 1998 order by dt;' | python -m 'ppsqlviz.plotter' tseries
```
Here is the output ![Time Series Plotting of S&P](https://raw.githubusercontent.com/vatsan/pandas_via_psql/master/plots/time_series.png)

Bar Plot
==================
Bar plots are typically used to plot binned data, where the data is binned according to user specified bins. This support is provided in pandas-via-psql. The data table is expected to comprise of two array columns of the same length, one each for the x and y axes. You can plot a bar plot by running the following from your command line:

```
home$ psql -d <dbname> -h <hostname> -U gpadmin -c 'select x*10 as binCenter, random()*100 as count from generate_series(1, 100) x;' | python -m 'ppsqlviz.plotter' bar
```
The first column always has to be the x axis (bin center).
Here's the output ![Bar Plot](https://raw.githubusercontent.com/vatsan/pandas_via_psql/master/plots/barplot.png)

Image Rendering
===================
Pandas also has a great set of tools for viewing images: grayscale or RGB, which can be quite handy when working on image processing or computer vision in SQL. For example, to check a binary mask after thresholding or the weights output by a deep learning algorithm, it is much easier to visualize an image than to interpret a table of intensity values.
To view an image whose intensity values are stored in a table, simply select the height and width of the image (number of rows & columns) followed by a vector of intensity values ordered by row, then column. For example, to view this 270x360 pixel grayscale image, you can run the following from your command line:

```
home$ psql -d vatsandb -h dca -U gpadmin -c 'select 270 as rows, 360 as cols, intensity_values from sample_image;' | python -m 'ppsqlviz.plotter' image
```

Here is the output ![Sample Grayscale image](https://raw.githubusercontent.com/vatsan/pandas_via_psql/master/plots/YosemiteGrayscale.png)

Similarly, to view an RGB image, provide the image height and width followed by a vector of intensity values ordered by row, then column, then color. To view a sample RGB image you can run the following from your comman line:

```
home$ psql -d vatsandb -h dca -U gpadmin -c 'select max(row)+1, max(col)+1, array[array_agg(red_intensity order by row,col), array_agg(green_intensity order by row,col), array_agg(blue_intensity order by row,col)] from (select * from sample_RGB_image order by row,col)t;' | python -m 'ppsqlviz.plotter' imageRGB
```

Here is the output ![Sample RGB image](https://raw.githubusercontent.com/vatsan/pandas_via_psql/master/plots/YosemiteRGB.jpeg)


Author
=======

Please email questions and feedback to [Srivatsan Ramanujam](https://github.com/vatsan/) at vatsan.cs@utexas.edu

Contributors
==============

Thanks to [Ailey Crow](https://github.com/ailey) and [Gautam Muralidhar](https://github.com/gautamsm) for their contributions.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ppsqlviz-1.0.tar.gz (9.6 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page