
Run pandas-profiling HTML reports for a given list of database tables.


Poliwhirl

Are you working with an unfamiliar database and feeling confused?

Do you find yourself running SELECT * FROM TABLE LIMIT 100 queries to remind yourself of the possible values of key fields?

Do these values slowly change over the course of years, with no clear documentation from upstream data producers?

If you answered "yes" to any of the above questions, this package may be for you.

Image of Poliwhirl pokemon

What it does

Polywhirl helps you orient yourself in an unfamiliar database by generating HTML profiling reports (via pandas-profiling) for the key tables you specify. It saves all of these outputs to a single directory, which you can index locally with Spotlight or Alfred, or deploy to an internal static website for your team.
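To make the "one directory of HTML reports" idea concrete, here is a minimal sketch of how a single table's report could be written. It is not polywhirl's actual code: the helper names and the <schema>.<table>.html naming scheme are assumptions made for illustration. Only ProfileReport and its to_file method come from pandas-profiling.

```python
from pathlib import Path


def report_path(output_dir, schema, table):
    """Return the HTML report path for one table inside a shared directory."""
    # Hypothetical naming scheme: <output_dir>/<schema>.<table>.html
    return Path(output_dir) / f"{schema}.{table}.html"


def write_report(df, output_dir, schema, table):
    """Profile a DataFrame and save the result as a standalone HTML file."""
    # Imported lazily so report_path() works without pandas-profiling installed.
    from pandas_profiling import ProfileReport

    path = report_path(output_dir, schema, table)
    path.parent.mkdir(parents=True, exist_ok=True)
    ProfileReport(df, title=f"{schema}.{table}").to_file(output_file=str(path))
    return path
```

Keeping every report in one flat directory is what makes the output easy to index with a desktop search tool or to serve as static files.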

Installing

You can install this package via pip:

pip install polywhirl

Features and usage

Polywhirl takes a single argument: a YAML file describing the structure of the database you'd like to profile. The format of this file approximately follows that of dbt's schema.yml. A template file, tables.yml, is provided, but you'll need to fill in the schemas and tables specific to your own database.

polywhirl tables.yml
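As a rough illustration of what such a file might contain once parsed, the sketch below walks a nested structure and lists every table to profile. The key names (sources, schemas, tables) are assumptions loosely modelled on dbt's layout, not polywhirl's confirmed format. Check the bundled tables.yml template for the real keys.

```python
# Hypothetical parsed form of tables.yml (what a YAML loader might return).
# The key names are assumptions; the real template may differ.
config = {
    "sources": [
        {
            "name": "redshift",
            "schemas": [
                {"name": "analytics", "tables": [{"name": "orders"}, {"name": "users"}]},
            ],
        },
        {
            "name": "bigquery",
            "schemas": [
                {"name": "events", "tables": [{"name": "pageviews"}]},
            ],
        },
    ]
}


def iter_tables(config):
    """Yield (connection, schema, table) for every table listed in the config."""
    for source in config.get("sources", []):
        for schema in source.get("schemas", []):
            for table in schema.get("tables", []):
                yield source["name"], schema["name"], table["name"]


for connection, schema, table in iter_tables(config):
    print(f"{connection}: {schema}.{table}")
```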

Polywhirl currently supports these connections:

  1. BigQuery (use name: bigquery in tables.yml)
  2. Redshift (use name: redshift in tables.yml)

For the sake of performance, polywhirl pulls a random sample of 10,000 rows from each table. For Redshift, you can also define a sortkey for each table, which polywhirl uses to limit the data to the most recent 90 days. This improves performance on large tables.
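The sampling behaviour described above could be expressed as a query builder along these lines. This is a sketch under stated assumptions, not polywhirl's actual SQL: the function name and the exact query shape are hypothetical, and the sortkey is assumed to be a date or timestamp column.

```python
SAMPLE_ROWS = 10_000  # sample size mentioned in the README
RECENT_DAYS = 90      # recency window applied when a sortkey is given


def build_sample_query(connection, schema, table, sortkey=None):
    """Build a SQL string that pulls a random sample from one table.

    Identifiers are interpolated directly, so this is only suitable for
    trusted configuration values, never for user-supplied input.
    """
    if connection == "redshift":
        query = f"SELECT * FROM {schema}.{table}"
        if sortkey:
            # Filtering on the sortkey lets Redshift skip older blocks.
            query += f" WHERE {sortkey} >= DATEADD(day, -{RECENT_DAYS}, GETDATE())"
        return query + f" ORDER BY RANDOM() LIMIT {SAMPLE_ROWS}"
    if connection == "bigquery":
        # No recency filter here, so this still scans the whole table.
        return f"SELECT * FROM `{schema}.{table}` ORDER BY RAND() LIMIT {SAMPLE_ROWS}"
    raise ValueError(f"unsupported connection: {connection}")


print(build_sample_query("redshift", "analytics", "orders", sortkey="created_at"))
```

Without a sortkey, the Redshift branch falls back to sampling across the full table, which is simpler but slower on large tables.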

BigQuery credentials are handled by pandas.read_gbq(), which relies on the pandas-gbq package.
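For readers who have not used it, a call to pandas.read_gbq looks roughly like the sketch below. The wrapper function and its reader parameter are hypothetical and exist only so the call can be exercised without network access. Note also that recent pandas releases deprecate pandas.read_gbq in favour of pandas_gbq.read_gbq, so the import may need adjusting on a newer install.

```python
def read_bigquery_sample(query, project_id=None, reader=None):
    """Run a query against BigQuery and return the result as a DataFrame.

    `reader` is a hypothetical injection point for testing; by default the
    function uses pandas.read_gbq, which resolves credentials via pandas-gbq.
    """
    if reader is None:
        # Imported lazily so the module loads without pandas installed.
        import pandas

        reader = pandas.read_gbq
    return reader(query, project_id=project_id)
```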

Redshift credentials are requested on first run, then stored locally in your system keychain using the keyring package.
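The "ask once, then remember" flow can be sketched with keyring's get_password and set_password functions. The function name, the service label "polywhirl-redshift", and the backend and prompt parameters are all assumptions made for illustration. Polywhirl's own storage keys may differ.

```python
import getpass


def get_redshift_password(username, service="polywhirl-redshift",
                          backend=None, prompt=getpass.getpass):
    """Return the stored password, prompting and saving it on first use."""
    if backend is None:
        # Imported lazily; keyring talks to the system keychain.
        import keyring

        backend = keyring
    password = backend.get_password(service, username)
    if password is None:
        # First run: ask the user, then store the answer for later runs.
        password = prompt(f"Redshift password for {username}: ")
        backend.set_password(service, username, password)
    return password
```

On later runs the lookup succeeds and no prompt is shown, so the password never needs to live in the YAML file or in shell history.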

FAQ

Do you realize you misspelled Poliwhirl?

Yes, I noticed this as I was writing this README, but it's too late: the name has grown on me.

Todo

I would like to get to the following in the future, so feel free to send PRs my way on any of these:

  • Add equivalent of Redshift sortkey to BigQuery logic, to prevent unnecessary full table scans
  • Compile .html outputs into a searchable static website
  • Automated tests (w/ pytest)
