Convenience interface of Rdatasets for lazy data scientists.
lazy-rdatasets
Convenience interface of Rdatasets for lazy data scientists.
Key features
- Search and filter datasets by text, data types, number of samples, and number of features/parameters
- Randomly select a dataset
- Create basic visualizations in a few lines of code
Install
Use pip:
pip install lazy-rdatasets
Usage
List all available datasets
from lazyrdatasets import LazyRdatasets
rd = LazyRdatasets()
rd
Package Item Title Rows Cols ...
0 AER Affairs Fair's Extramarital Affairs Data 601 9 ...
1 AER ArgentinaCPI Consumer Price Index in Argentina 80 2 ...
2 AER BankWages Bank Wages 474 4 ...
3 AER BenderlyZwick Benderly and Zwick Data: Infla... 31 5 ...
4 AER BondYield Bond Yield Data 60 2 ...
...
Finding datasets
1. Exact match on the package and item names
This is almost the same as statsmodels.datasets.get_rdataset.
from lazyrdatasets import LazyRdatasets
rd = LazyRdatasets.find(package="palmerpenguins", item="penguins", exact=True)
2. Search for a string in dataset titles
Find datasets that have "penguin" in their title:
from lazyrdatasets import LazyRdatasets
rd = LazyRdatasets.find(title="penguin")
3. Filter datasets that contain particular types of variables
Find datasets that have categorical variables:
from lazyrdatasets import LazyRdatasets
rd = LazyRdatasets.find(categorical=True)
Find datasets that have only one numeric variable and at least 100 samples:
from lazyrdatasets import LazyRdatasets
rd = LazyRdatasets.find(numeric=True, nmin=100, pmax=1)
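The filters above can be read as row selections on the catalog table. The sketch below is not the library's implementation: it re-creates the idea in plain Python on four of the catalog entries listed earlier (the entry with a truncated title is left out). It assumes that `nmin` bounds the number of rows from below and `pmax` bounds the number of columns from above; the function name `filter_catalog` and the dictionary layout are hypothetical.

```python
# Illustrative only: a plain-Python stand-in for catalog filtering.
# Rows are copied from the catalog listing shown earlier in this document.
CATALOG = [
    {"Package": "AER", "Item": "Affairs",
     "Title": "Fair's Extramarital Affairs Data", "Rows": 601, "Cols": 9},
    {"Package": "AER", "Item": "ArgentinaCPI",
     "Title": "Consumer Price Index in Argentina", "Rows": 80, "Cols": 2},
    {"Package": "AER", "Item": "BankWages",
     "Title": "Bank Wages", "Rows": 474, "Cols": 4},
    {"Package": "AER", "Item": "BondYield",
     "Title": "Bond Yield Data", "Rows": 60, "Cols": 2},
]


def filter_catalog(catalog, title=None, nmin=None, nmax=None, pmin=None, pmax=None):
    """Keep entries whose title contains `title` (case-insensitive) and
    whose row count (n) and column count (p) fall within the given bounds."""
    matches = []
    for entry in catalog:
        if title is not None and title.lower() not in entry["Title"].lower():
            continue
        if nmin is not None and entry["Rows"] < nmin:
            continue
        if nmax is not None and entry["Rows"] > nmax:
            continue
        if pmin is not None and entry["Cols"] < pmin:
            continue
        if pmax is not None and entry["Cols"] > pmax:
            continue
        matches.append(entry)
    return matches


# Text search on the title.
wage_items = [e["Item"] for e in filter_catalog(CATALOG, title="wage")]
# At least 70 rows and at most 2 columns.
narrow_items = [e["Item"] for e in filter_catalog(CATALOG, nmin=70, pmax=2)]
print(wage_items)
print(narrow_items)
```

Treating every unset argument as "no constraint" lets the conditions be combined freely, which matches how the keyword arguments are mixed in the examples above.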
Getting a single dataset from Rdatasets
Below, rd is the output of the following code block:
from lazyrdatasets import LazyRdatasets
rd = LazyRdatasets.find(title="penguin")
1. Show the list of datasets that matched the conditions
rd # In Jupyter, this shows the same result as below.
rd.catalog # Show the repr of DataFrame
Package Item Title ...
567 bayesrules penguins_bayes Penguins Data
1694 heplots peng Size measurements ...
2116 modeldata penguins Palmer Station penguin data
2559 palmerpenguins penguins Size measurements for ...
2560 palmerpenguins penguins_raw (penguins) Penguin size, clutch, ...
2. Pick the first dataset in the list
rd.first
#️⃣ Index : 567
📦 Package: bayesrules
📄 Item : penguins_bayes
📚 Title : Penguins Data
📐 Shape : (344, 10)
⚖️ Binary : 2
🔤 Character: 0
🧮 Factor : 4
🔘 Logical : 0
🔢 Numeric : 5
🔗 CSV: https://vincentarelbundock.github.io/Rdatasets/csv/bayesrules/penguins_bayes.csv
🔗 Doc: https://vincentarelbundock.github.io/Rdatasets/doc/bayesrules/penguins_bayes.html
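The CSV and Doc links printed in the dataset summaries in this document share a common layout: `csv/<package>/<item>.csv` and `doc/<package>/<item>.html` under one base URL. A minimal sketch of that pattern follows; it is inferred from the links shown here rather than from a documented guarantee, and the helper name `rdatasets_urls` is hypothetical.

```python
# Base URL as it appears in the summaries printed in this document.
BASE_URL = "https://vincentarelbundock.github.io/Rdatasets"


def rdatasets_urls(package, item):
    """Build the CSV and documentation URLs for one dataset.

    Assumption: every dataset follows the same path layout as the
    examples shown in this document.
    """
    return {
        "csv": f"{BASE_URL}/csv/{package}/{item}.csv",
        "doc": f"{BASE_URL}/doc/{package}/{item}.html",
    }


urls = rdatasets_urls("bayesrules", "penguins_bayes")
print(urls["csv"])
print(urls["doc"])
```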
3. Access by index and position in the catalog
rd[2559] # Get the Dataset with its index 2559 (pandas.DataFrame.loc), or ...
rd.at(3) # Get the Dataset at position 3 in the catalog (pandas.DataFrame.iloc)
#️⃣ Index : 2559
📦 Package: palmerpenguins
📄 Item : penguins
📚 Title : Size measurements for adult foraging penguins near Palmer Station, Antarctica
📐 Shape : (344, 9)
⚖️ Binary : 1
🔤 Character: 0
🧮 Factor : 3
🔘 Logical : 0
🔢 Numeric : 5
🔗 CSV: https://vincentarelbundock.github.io/Rdatasets/csv/palmerpenguins/penguins.csv
🔗 Doc: https://vincentarelbundock.github.io/Rdatasets/doc/palmerpenguins/penguins.html
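The difference between the two access styles is the difference between a label and a position. The sketch below illustrates it in plain Python with the index labels from the "penguin" search result; the helper names are hypothetical and the code only mimics the lookup semantics, not the library itself.

```python
# Index labels of the five search results, in catalog order.
INDEX_LABELS = [567, 1694, 2116, 2559, 2560]


def label_at(position):
    """Zero-based positional lookup, analogous to rd.at(position)."""
    return INDEX_LABELS[position]


def position_of(label):
    """Label-based lookup, analogous to rd[label]; fails for unknown labels."""
    return INDEX_LABELS.index(label)


print(label_at(3))        # 2559
print(position_of(2559))  # 3
print(3 in INDEX_LABELS)  # False: 3 is a valid position but not a label
```

Because the labels come from the full catalog, they stay stable across searches, whereas positions depend on which rows matched.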
Getting the dataframe
ds = rd[2559]
ds.data # -> pandas.DataFrame
rownames species island bill_length_mm bill_depth_mm ...
0 1 Adelie Torgersen 39.1 18.7
1 2 Adelie Torgersen 39.5 17.4
2 3 Adelie Torgersen 40.3 18.0
3 4 Adelie Torgersen NaN NaN
4 5 Adelie Torgersen 36.7 19.3
...
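The NaN entries in the preview mark missing measurements. As a rough, pandas-free illustration of how such gaps can be tallied per column, the sketch below parses the five preview rows with the standard `csv` module. It covers only the columns visible above and assumes that missing values are written as `NA` in the CSV file; the helper name `count_missing` is hypothetical.

```python
import csv
import io

# The five preview rows, restricted to the columns visible in the output.
# Assumption: missing values appear as "NA" in the raw CSV.
PREVIEW_CSV = """rownames,species,island,bill_length_mm,bill_depth_mm
1,Adelie,Torgersen,39.1,18.7
2,Adelie,Torgersen,39.5,17.4
3,Adelie,Torgersen,40.3,18.0
4,Adelie,Torgersen,NA,NA
5,Adelie,Torgersen,36.7,19.3
"""


def count_missing(csv_text, missing_token="NA"):
    """Count, for each column, how many cells equal the missing-value token."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if value == missing_token:
                counts[name] += 1
    return counts


missing = count_missing(PREVIEW_CSV)
print(missing)
```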
Quicklook
A selected dataset can be inspected at a glance with:
ds.quicklook()
| Missing Values | Categorical Variables | Numeric Variables (p≧1) | Numeric Variables (p≧3) |
|---|---|---|---|
| Heatmap | Bar plots | Scatter matrix (p≧2) / histogram (p=1) | PCA projection (p≧3) |
These are not the best possible visualizations, but they can help you get an overview of the dataset.
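The thresholds in the table can be summarized as a small decision rule on p, the number of numeric variables. The sketch below is one reading of that table, not the code behind `quicklook`; the function name `numeric_panels` is hypothetical.

```python
def numeric_panels(p):
    """Return the numeric-variable panels the table lists for p numeric columns."""
    panels = []
    if p == 1:
        panels.append("histogram")
    elif p >= 2:
        panels.append("scatter matrix")
    if p >= 3:
        panels.append("PCA projection")
    return panels


for p in (0, 1, 2, 3):
    print(p, numeric_panels(p))
```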