Skip to main content

A package for quick, scrappy analyses with pandas and SQL

Project description

siuba

scrappy data analysis, with seamless support for pandas and SQL

CI Documentation Status Binder

siuba (小巴) is a port of dplyr and other R libraries. It supports a tabular data analysis workflow centered on 5 common actions:

  • select() - keep certain columns of data.
  • filter() - keep certain rows of data.
  • mutate() - create or modify an existing column of data.
  • summarize() - reduce one or more columns down to a single number.
  • arrange() - reorder the rows of data.

These actions can be preceded by a group_by(), which causes them to be applied individually to grouped rows of data. Moreover, many SQL concepts, such as distinct(), count(), and joins are implemented. Inputs to these functions can be a pandas DataFrame or SQL connection (currently postgres, redshift, or sqlite).

For more on the rationale behind tools like dplyr, see this tidyverse paper. For examples of siuba in action, see the siuba guide.

Installation

pip install siuba

Examples

See the siuba guide or this live analysis for a full introduction.

Basic use

The code below uses the example DataFrame mtcars, to get the average horsepower (hp) per cylinder.

from siuba import group_by, summarize, _
from siuba.data import mtcars

(mtcars
  >> group_by(_.cyl)
  >> summarize(avg_hp = _.hp.mean())
  )
Out[1]: 
   cyl      avg_hp
0    4   82.636364
1    6  122.285714
2    8  209.214286

There are three key concepts in this example:

concept example meaning
verb group_by(...) a function that operates on a table, like a DataFrame or SQL table
siu expression _.hp.mean() an expression created with siuba._, that represents actions you want to perform
pipe mtcars >> group_by(...) a syntax that allows you to chain verbs with the >> operator

See the siuba guide overview for a full introduction.

What is a siu expression (e.g. _.cyl == 4)?

A siu expression is a way of specifying what action you want to perform. This allows siuba verbs to decide how to execute the action, depending on whether your data is a local DataFrame or remote table.

from siuba import _

_.cyl == 4
Out[2]:
█─==
├─█─.
│ ├─_
│ └─'cyl'
└─4

You can also think of siu expressions as a shorthand for a lambda function.

from siuba import _

# lambda approach
mtcars[lambda _: _.cyl == 4]

# siu expression approach
mtcars[_.cyl == 4]
Out[3]: 
     mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
2   22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
7   24.4    4  146.7   62  3.69  3.190  20.00   1   0     4     2
..   ...  ...    ...  ...   ...    ...    ...  ..  ..   ...   ...
27  30.4    4   95.1  113  3.77  1.513  16.90   1   1     5     2
31  21.4    4  121.0  109  4.11  2.780  18.60   1   1     4     2

[11 rows x 11 columns]

See the siuba guide or read more about lazy expressions.

Using with a SQL database

A killer feature of siuba is that the same analysis code can be run on a local DataFrame, or a SQL source.

In the code below, we set up an example database.

# Setup example data ----
from sqlalchemy import create_engine
from siuba.data import mtcars

# copy pandas DataFrame to sqlite
engine = create_engine("sqlite:///:memory:")
mtcars.to_sql("mtcars", engine, if_exists = "replace")

Next, we use the code from the first example, except now executed a SQL table.

# Demo SQL analysis with siuba ----
from siuba import _, tbl, group_by, summarize, filter

# connect with siuba
tbl_mtcars = tbl(engine, "mtcars")

(tbl_mtcars
  >> group_by(_.cyl)
  >> summarize(avg_hp = _.hp.mean())
  )
Out[4]: 
# Source: lazy query
# DB Conn: Engine(sqlite:///:memory:)
# Preview:
   cyl      avg_hp
0    4   82.636364
1    6  122.285714
2    8  209.214286
# .. may have more rows

See the querying SQL introduction here.

Example notebooks

Below are some examples I've kept as I've worked on siuba. For the most up to date explanations, see the siuba guide

Testing

Tests are done using pytest. They can be run using the following.

# start postgres db
docker-compose up
pytest siuba

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

siuba-0.4.4.tar.gz (170.5 kB view details)

Uploaded Source

Built Distribution

siuba-0.4.4-py3-none-any.whl (208.6 kB view details)

Uploaded Python 3

File details

Details for the file siuba-0.4.4.tar.gz.

File metadata

  • Download URL: siuba-0.4.4.tar.gz
  • Upload date:
  • Size: 170.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for siuba-0.4.4.tar.gz
Algorithm Hash digest
SHA256 73fb5e3934e45f2083cf0cc362c761fac2f1ae6f918746442edd8f493f009387
MD5 6d2b4d9aa9922c0ef7decc8594d89a6b
BLAKE2b-256 4cc35203adb162baea4eebe869bffde230f806b612e36a6e3bf4049c3452c786

See more details on using hashes here.

File details

Details for the file siuba-0.4.4-py3-none-any.whl.

File metadata

  • Download URL: siuba-0.4.4-py3-none-any.whl
  • Upload date:
  • Size: 208.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for siuba-0.4.4-py3-none-any.whl
Algorithm Hash digest
SHA256 07a6b2a02f39e53a8fdb1f1a3a7c49bce182346901e99df1d073e5427cf9e9dc
MD5 0a05413635f481da7e39a5e1be482fbb
BLAKE2b-256 bea959a1d3ae43ce39b3b1addf670c304312b732fd5e29c6f96132c66019c801

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page