A package for quick, scrappy analyses with pandas and SQL
Project description
siuba
scrappy data analysis, with seamless support for pandas and SQL
siuba (小巴) is a port of dplyr and other R libraries. It supports a tabular data analysis workflow centered on 5 common actions:
select()
- keep certain columns of data.filter()
- keep certain rows of data.mutate()
- create or modify an existing column of data.summarize()
- reduce one or more columns down to a single number.arrange()
- reorder the rows of data.
These actions can be preceded by a group_by()
, which causes them to be applied individually to grouped rows of data. Moreover, many SQL concepts, such as distinct()
, count()
, and joins are implemented.
Inputs to these functions can be a pandas DataFrame
or SQL connection (currently postgres, redshift, or sqlite).
For more on the rationale behind tools like dplyr, see this tidyverse paper. For examples of siuba in action, see the siuba guide.
Installation
pip install siuba
Examples
See the siuba guide or this live analysis for a full introduction.
Basic use
The code below uses the example DataFrame mtcars
, to get the average horsepower (hp) per cylinder.
from siuba import group_by, summarize, _
from siuba.data import mtcars
(mtcars
>> group_by(_.cyl)
>> summarize(avg_hp = _.hp.mean())
)
Out[1]:
cyl avg_hp
0 4 82.636364
1 6 122.285714
2 8 209.214286
There are three key concepts in this example:
concept | example | meaning |
---|---|---|
verb | group_by(...) |
a function that operates on a table, like a DataFrame or SQL table |
siu expression | _.hp.mean() |
an expression created with siuba._ , that represents actions you want to perform |
pipe | mtcars >> group_by(...) |
a syntax that allows you to chain verbs with the >> operator |
See the siuba guide overview for a full introduction.
What is a siu expression (e.g. _.cyl == 4
)?
A siu expression is a way of specifying what action you want to perform. This allows siuba verbs to decide how to execute the action, depending on whether your data is a local DataFrame or remote table.
from siuba import _
_.cyl == 4
Out[2]:
█─==
├─█─.
│ ├─_
│ └─'cyl'
└─4
You can also think of siu expressions as a shorthand for a lambda function.
from siuba import _
# lambda approach
mtcars[lambda _: _.cyl == 4]
# siu expression approach
mtcars[_.cyl == 4]
Out[3]:
mpg cyl disp hp drat wt qsec vs am gear carb
2 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
7 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
.. ... ... ... ... ... ... ... .. .. ... ...
27 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
31 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
[11 rows x 11 columns]
See the siuba guide or read more about lazy expressions.
Using with a SQL database
A killer feature of siuba is that the same analysis code can be run on a local DataFrame, or a SQL source.
In the code below, we set up an example database.
# Setup example data ----
from sqlalchemy import create_engine
from siuba.data import mtcars
# copy pandas DataFrame to sqlite
engine = create_engine("sqlite:///:memory:")
mtcars.to_sql("mtcars", engine, if_exists = "replace")
Next, we use the code from the first example, except now executed a SQL table.
# Demo SQL analysis with siuba ----
from siuba import _, tbl, group_by, summarize, filter
# connect with siuba
tbl_mtcars = tbl(engine, "mtcars")
(tbl_mtcars
>> group_by(_.cyl)
>> summarize(avg_hp = _.hp.mean())
)
Out[4]:
# Source: lazy query
# DB Conn: Engine(sqlite:///:memory:)
# Preview:
cyl avg_hp
0 4 82.636364
1 6 122.285714
2 8 209.214286
# .. may have more rows
See the querying SQL introduction here.
Example notebooks
Below are some examples I've kept as I've worked on siuba. For the most up to date explanations, see the siuba guide
- siu expressions
- dplyr style pandas
- sql using dplyr style
- tidytuesday examples
- tidytuesday is a weekly R data analysis project. In order to kick the tires on siuba, I've been using it to complete the assignments. More specifically, I've been porting Dave Robinson's tidytuesday analyses to use siuba.
Testing
Tests are done using pytest. They can be run using the following.
# start postgres db
docker-compose up
pytest siuba
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file siuba-0.4.4.tar.gz
.
File metadata
- Download URL: siuba-0.4.4.tar.gz
- Upload date:
- Size: 170.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.18
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 73fb5e3934e45f2083cf0cc362c761fac2f1ae6f918746442edd8f493f009387 |
|
MD5 | 6d2b4d9aa9922c0ef7decc8594d89a6b |
|
BLAKE2b-256 | 4cc35203adb162baea4eebe869bffde230f806b612e36a6e3bf4049c3452c786 |
File details
Details for the file siuba-0.4.4-py3-none-any.whl
.
File metadata
- Download URL: siuba-0.4.4-py3-none-any.whl
- Upload date:
- Size: 208.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.18
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 07a6b2a02f39e53a8fdb1f1a3a7c49bce182346901e99df1d073e5427cf9e9dc |
|
MD5 | 0a05413635f481da7e39a5e1be482fbb |
|
BLAKE2b-256 | bea959a1d3ae43ce39b3b1addf670c304312b732fd5e29c6f96132c66019c801 |