lazy-rdatasets

Convenience interface of Rdatasets for lazy data scientists.

Key features

  • Search and filter datasets by text, data types, number of samples, and number of features/parameters
  • Randomly select a dataset
  • Produce basic visualizations in a few lines of code

Install

Use pip:

pip install lazy-rdatasets

Usage

List all available datasets

from lazyrdatasets import LazyRdatasets

rd = LazyRdatasets()
rd
   Package  Item           Title                              Rows  Cols  ...
0  AER      Affairs        Fair's Extramarital Affairs Data    601     9  ...
1  AER      ArgentinaCPI   Consumer Price Index in Argentina    80     2  ...
2  AER      BankWages      Bank Wages                          474     4  ...
3  AER      BenderlyZwick  Benderly and Zwick Data: Infla...    31     5  ...
4  AER      BondYield      Bond Yield Data                      60     2  ...
...

Finding datasets

1. Exact match on the package and item names

This is almost the same as statsmodels.datasets.get_rdataset.

from lazyrdatasets import LazyRdatasets

rd = LazyRdatasets.find(package="palmerpenguins", item="penguins", exact=True)

2. Search for a string in dataset titles

Find datasets that have "penguin" in their title:

from lazyrdatasets import LazyRdatasets

rd = LazyRdatasets.find(title="penguin")
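As a rough illustration of what a title search amounts to, the sketch below filters a tiny in-memory catalog with a case-insensitive substring test. It is not the library's implementation: CATALOG and search_title are hypothetical names, the rows are copied from the listings in this README, and case-insensitive matching is an assumption (it is consistent with "Penguins Data" appearing in the results for "penguin").

```python
# Hypothetical miniature catalog; rows copied from the listings in this README.
CATALOG = [
    {"Package": "AER", "Item": "Affairs", "Title": "Fair's Extramarital Affairs Data"},
    {"Package": "AER", "Item": "BankWages", "Title": "Bank Wages"},
    {"Package": "bayesrules", "Item": "penguins_bayes", "Title": "Penguins Data"},
    {"Package": "modeldata", "Item": "penguins", "Title": "Palmer Station penguin data"},
]


def search_title(catalog, text):
    """Return rows whose Title contains `text`, ignoring case (an assumption)."""
    needle = text.lower()
    return [row for row in catalog if needle in row["Title"].lower()]


matches = search_title(CATALOG, "penguin")
for row in matches:
    print(row["Package"], row["Item"])
```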

3. Filter datasets that contain particular types of variables

Find datasets that have categorical variables:

from lazyrdatasets import LazyRdatasets

rd = LazyRdatasets.find(categorical=True)

Find datasets that have only one numeric variable and at least 100 samples:

from lazyrdatasets import LazyRdatasets

rd = LazyRdatasets.find(numeric=True, nmin=100, pmax=1)
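The size filters can be pictured as inclusive range checks on the Rows and Cols columns of the catalog. The sketch below is a guess at the semantics, not the library's code: filter_by_size is a hypothetical helper, nmax and pmin are hypothetical parameters added for symmetry, and treating p as the raw column count is an assumption. The rows are copied from the catalog listing under "List all available datasets".

```python
# Hypothetical miniature catalog; Rows and Cols copied from the catalog listing.
CATALOG = [
    {"Item": "Affairs", "Rows": 601, "Cols": 9},
    {"Item": "ArgentinaCPI", "Rows": 80, "Cols": 2},
    {"Item": "BankWages", "Rows": 474, "Cols": 4},
    {"Item": "BenderlyZwick", "Rows": 31, "Cols": 5},
    {"Item": "BondYield", "Rows": 60, "Cols": 2},
]


def in_range(value, low, high):
    """Inclusive range check; a bound of None means 'no limit'."""
    return (low is None or value >= low) and (high is None or value <= high)


def filter_by_size(catalog, nmin=None, nmax=None, pmin=None, pmax=None):
    """Keep rows with nmin <= Rows <= nmax and pmin <= Cols <= pmax."""
    return [
        row
        for row in catalog
        if in_range(row["Rows"], nmin, nmax) and in_range(row["Cols"], pmin, pmax)
    ]


# At least 100 samples and at most 5 columns.
selected = filter_by_size(CATALOG, nmin=100, pmax=5)
print([row["Item"] for row in selected])
```

With inclusive bounds, nmin=100 corresponds to "at least 100 samples", which matches the wording of the example above.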

Getting a single dataset from Rdatasets

Below, rd is the output of the following code block:

from lazyrdatasets import LazyRdatasets

rd = LazyRdatasets.find(title="penguin")

1. Show the list of datasets that matched the conditions

rd  # In Jupyter, this shows the same result as below.
rd.catalog  # Show the repr of DataFrame
      Package          Item                     Title   ...
567   bayesrules       penguins_bayes           Penguins Data
1694  heplots          peng                     Size measurements ...
2116  modeldata        penguins                 Palmer Station penguin data
2559  palmerpenguins   penguins                 Size measurements for  ...
2560  palmerpenguins   penguins_raw (penguins)  Penguin size, clutch, ...

2. Pick the first dataset in the list

rd.first
#️⃣ Index  : 567
📦 Package: bayesrules
📄 Item   : penguins_bayes
📚 Title  : Penguins Data
📐 Shape  : (344, 10)
  ⚖️ Binary   : 2
  🔤 Character: 0
  🧮 Factor   : 4
  🔘 Logical  : 0
  🔢 Numeric  : 5
🔗 CSV: https://vincentarelbundock.github.io/Rdatasets/csv/bayesrules/penguins_bayes.csv
🔗 Doc: https://vincentarelbundock.github.io/Rdatasets/doc/bayesrules/penguins_bayes.html
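The CSV and Doc links in the summary appear to follow a regular pattern built from the package and item names. The sketch below reconstructs that pattern; it is inferred only from the URLs shown in this README, rdatasets_urls is a hypothetical helper, and items whose catalog name carries extra decoration (such as "penguins_raw (penguins)") may not fit it.

```python
BASE_URL = "https://vincentarelbundock.github.io/Rdatasets"


def rdatasets_urls(package, item):
    """Build the CSV and documentation URLs for a dataset (pattern is inferred)."""
    return {
        "csv": f"{BASE_URL}/csv/{package}/{item}.csv",
        "doc": f"{BASE_URL}/doc/{package}/{item}.html",
    }


urls = rdatasets_urls("bayesrules", "penguins_bayes")
print(urls["csv"])
print(urls["doc"])
```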

3. Access by index or position in the catalog

rd[2559]  # Get the Dataset with its index 2559 (pandas.DataFrame.loc), or ...
rd.at(3)  # Get the Dataset at position 3 in the catalog (pandas.DataFrame.iloc)
#️⃣ Index  : 2559
📦 Package: palmerpenguins
📄 Item   : penguins
📚 Title  : Size measurements for adult foraging penguins near Palmer Station, Antarctica
📐 Shape  : (344, 9)
  ⚖️ Binary   : 1
  🔤 Character: 0
  🧮 Factor   : 3
  🔘 Logical  : 0
  🔢 Numeric  : 5
🔗 CSV: https://vincentarelbundock.github.io/Rdatasets/csv/palmerpenguins/penguins.csv
🔗 Doc: https://vincentarelbundock.github.io/Rdatasets/doc/palmerpenguins/penguins.html
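The difference between the two access styles is the same as between label-based and position-based lookup in pandas. The sketch below mimics it with a plain dictionary; get_by_label and get_by_position are hypothetical helpers, and the labels are the index values of the five search results shown earlier.

```python
# Index label -> (package, item), in the order shown in the search results.
CATALOG = {
    567: ("bayesrules", "penguins_bayes"),
    1694: ("heplots", "peng"),
    2116: ("modeldata", "penguins"),
    2559: ("palmerpenguins", "penguins"),
    2560: ("palmerpenguins", "penguins_raw (penguins)"),
}


def get_by_label(catalog, label):
    """Look up an entry by its index label (comparable to DataFrame.loc)."""
    return catalog[label]


def get_by_position(catalog, position):
    """Look up an entry by its 0-based position (comparable to DataFrame.iloc)."""
    label = list(catalog)[position]  # dicts keep insertion order
    return catalog[label]


# Label 2559 sits at position 3, so both lookups return the same entry.
print(get_by_label(CATALOG, 2559))
print(get_by_position(CATALOG, 3))
```

Labels stay attached to a dataset regardless of the search, while positions depend on which rows the search returned.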

Getting the dataframe

ds = rd[2559]
ds.data  # -> pandas.DataFrame
   rownames species     island  bill_length_mm  bill_depth_mm ...
0         1  Adelie  Torgersen            39.1           18.7
1         2  Adelie  Torgersen            39.5           17.4
2         3  Adelie  Torgersen            40.3           18.0
3         4  Adelie  Torgersen             NaN            NaN
4         5  Adelie  Torgersen            36.7           19.3
...

Quicklook

A selected dataset can be inspected quickly with:

ds.quicklook()
Missing Values  Categorical Variables  Numeric Variables (p≧1)                 Numerical Variables (p≧3)
fig_missing     fig_categorical        fig_numeric                             fig_pca
Heatmap         Bar plots              Scatter matrix (p≧2) / histogram (p=1)  PCA projection (p≧3)

This is of course not the best visualization, but might be helpful to get an overview of the dataset.
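For a text-only counterpart to the missing-values panel, a per-column count of missing entries conveys similar information. The sketch below is independent of quicklook and does not describe how it works: count_missing is a hypothetical helper, and ROWS reproduces the first five rows of the penguins data shown under "Getting the dataframe".

```python
import math

NAN = float("nan")

# First five rows of the penguins data shown above (rownames column omitted).
ROWS = [
    {"species": "Adelie", "island": "Torgersen", "bill_length_mm": 39.1, "bill_depth_mm": 18.7},
    {"species": "Adelie", "island": "Torgersen", "bill_length_mm": 39.5, "bill_depth_mm": 17.4},
    {"species": "Adelie", "island": "Torgersen", "bill_length_mm": 40.3, "bill_depth_mm": 18.0},
    {"species": "Adelie", "island": "Torgersen", "bill_length_mm": NAN, "bill_depth_mm": NAN},
    {"species": "Adelie", "island": "Torgersen", "bill_length_mm": 36.7, "bill_depth_mm": 19.3},
]


def count_missing(rows):
    """Count NaN entries per column across a list of row dictionaries."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            is_missing = isinstance(value, float) and math.isnan(value)
            counts[column] = counts.get(column, 0) + int(is_missing)
    return counts


missing = count_missing(ROWS)
print(missing)
```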
