Example datasets from statistics packages.

ExampleDatasets Python package

Python package for (obtaining) example datasets.

Currently, this package contains only the datasets metadata; the datasets themselves are downloaded from the repository Rdatasets, [VAB1].

This package follows the design of the Raku package with the same name; see [AAr1].


Usage examples

Setup

Here we load the Python packages time and pandas, and then this package:

import time
import pandas
from ExampleDatasets import *

Get a dataset by using an identifier

Here we get a dataset by using an identifier and display part of the obtained dataset:

tbl = example_dataset(itemSpec = 'Baumann')
tbl.head()

##    Unnamed: 0  group  pretest.1  ...  post.test.1  post.test.2  post.test.3
## 0           1  Basal          4  ...            5            4           41
## 1           2  Basal          6  ...            9            5           41
## 2           3  Basal          9  ...            5            3           43
## 3           4  Basal         12  ...            8            5           46
## 4           5  Basal         16  ...           10            9           46
## 
## [5 rows x 7 columns]

Here we summarize the dataset obtained above:

tbl.describe()

##        Unnamed: 0  pretest.1  pretest.2  post.test.1  post.test.2  post.test.3
## count   66.000000  66.000000  66.000000    66.000000    66.000000    66.000000
## mean    33.500000   9.787879   5.106061     8.075758     6.712121    44.015152
## std     19.196354   3.020520   2.212752     3.393707     2.635644     6.643661
## min      1.000000   4.000000   1.000000     1.000000     0.000000    30.000000
## 25%     17.250000   8.000000   3.250000     5.000000     5.000000    40.000000
## 50%     33.500000   9.000000   5.000000     8.000000     6.000000    45.000000
## 75%     49.750000  12.000000   6.000000    11.000000     8.000000    49.000000
## max     66.000000  16.000000  13.000000    15.000000    13.000000    57.000000

Remark: The values for the arguments itemSpec and packageSpec correspond to the values of the columns “Item” and “Package”, respectively, in the metadata dataset from the GitHub repository “Rdatasets”, [VAB1]. See the datasets metadata sub-section below.
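Because different R packages can carry items with similar names, combining both specifications pins down a single dataset. A minimal sketch of that disambiguation, using a toy metadata frame (the rows here are illustrative, not the real Rdatasets metadata):

```python
import pandas

# Toy metadata frame mimicking the "Package" and "Item" columns of the
# Rdatasets metadata (hypothetical rows, for illustration only)
dfMetaToy = pandas.DataFrame({
    "Package": ["COUNT", "datasets", "COUNT"],
    "Item": ["titanic", "Titanic", "titanicgrp"],
})

# An item specification alone can be ambiguous across packages;
# adding a package specification pins down a single row:
match = dfMetaToy[(dfMetaToy.Item == "titanic") & (dfMetaToy.Package == "COUNT")]
print(match)
```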

Get a dataset by using a URL

Here we find the URLs of datasets whose titles match a regex:

dfMeta = load_datasets_metadata()
print(dfMeta[dfMeta.Title.str.contains('^tita')][["Package", "Item", "CSV"]].to_string())

##     Package        Item                                                                      CSV
## 288   COUNT     titanic     https://vincentarelbundock.github.io/Rdatasets/csv/COUNT/titanic.csv
## 289   COUNT  titanicgrp  https://vincentarelbundock.github.io/Rdatasets/csv/COUNT/titanicgrp.csv
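Note that str.contains interprets its pattern as a regular expression by default, and accepts a case=False argument for case-insensitive matching. A small sketch with toy titles (not the real metadata):

```python
import pandas

# str.contains treats the pattern as a regex by default; case=False makes
# the match case-insensitive (toy titles, for illustration only)
titles = pandas.Series(["titanic", "Titanic", "TitanicSurvival", "cement"])
matches = titles[titles.str.contains("^tita", case=False)]
print(matches)
```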

Here we get a dataset through pandas by using a URL and display the head of the obtained dataset:

url = 'https://raw.githubusercontent.com/antononcube/Raku-Data-Reshapers/main/resources/dfTitanic.csv'
tbl2 = pandas.read_csv(url)
tbl2.head()

##    id passengerClass  passengerAge passengerSex passengerSurvival
## 0   1            1st            30       female          survived
## 1   2            1st             0         male          survived
## 2   3            1st             0       female              died
## 3   4            1st            30         male              died
## 4   5            1st            20       female              died
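Since pandas.read_csv accepts any file-like object, the same call works offline with an in-memory CSV, which can be handy for experimenting without network access. A sketch with toy data (not the real dfTitanic.csv):

```python
import io
import pandas

# pandas.read_csv accepts any file-like object, so an in-memory CSV behaves
# just like one fetched from a URL (toy data, for illustration only)
csv_text = "id,passengerClass,passengerSex\n1,1st,female\n2,1st,male\n"
tblToy = pandas.read_csv(io.StringIO(csv_text))
print(tblToy.shape)
```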

Datasets metadata

Here we:

1. Get the dataset of the datasets metadata
2. Keep only the columns “Item”, “Title”, “Rows”, and “Cols”
3. Filter it to contain only datasets with 13 rows
4. Display the result

tblMeta = load_datasets_metadata()
tblMeta = tblMeta[["Item", "Title", "Rows", "Cols"]]
tblMeta = tblMeta[tblMeta["Rows"] == 13]
tblMeta

##             Item                                              Title  Rows  Cols
## 805   Snow.pumps  John Snow's Map and Data on the 1854 London Ch...    13     4
## 820          BCG                                   BCG Vaccine Data    13     7
## 935       cement                    Heat Evolved by Setting Cements    13     5
## 1354    kootenay  Waterflow Measurements of Kootenay River in Li...    13     2
## 1644  Newhouse77  Medical-Care Expenditure: A Cross-National Sur...    13     5
## 1735      Saxony                                 Families in Saxony    13     2

Keeping downloaded data

By default the data is obtained over the web from Rdatasets, but example_dataset has an option to keep the data “locally.” (The data is saved in XDG_DATA_HOME, see [SS1].)
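Per the XDG Base Directory specification, XDG_DATA_HOME defaults to ~/.local/share when the environment variable is unset. A stdlib-only sketch of that resolution (the exact sub-directory ExampleDatasets uses inside it is not shown here):

```python
import os
from pathlib import Path

# Resolve XDG_DATA_HOME with its spec-mandated default of ~/.local/share
data_home = Path(os.environ.get("XDG_DATA_HOME") or Path.home() / ".local" / "share")
print(data_home)
```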

This can be demonstrated with the following timings of a dataset with ~1300 rows:

startTime = time.time()
data = example_dataset(itemSpec = 'titanic', packageSpec = 'COUNT', keep = True)
endTime = time.time()
print("Getting the data the first time took " + str( endTime - startTime ) + " seconds")

## Getting the data the first time took 0.33232 seconds

startTime = time.time()
data = example_dataset(itemSpec = 'titanic', packageSpec = 'COUNT', keep = True)
endTime = time.time()
print("Getting the data the second time took " + str( endTime - startTime ) + " seconds")

## Getting the data the second time took 0.01386 seconds
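The speed-up comes from the usual download-once pattern: check for a local copy first, and fetch over the web only when it is absent. A minimal sketch of that pattern (cached_fetch and fake_download are hypothetical helpers, not the package's actual implementation):

```python
import os
import tempfile

def cached_fetch(path, fetch):
    """Return the file's contents, calling fetch() only when no local copy exists."""
    if os.path.exists(path):
        with open(path) as f:
            return f.read()
    data = fetch()
    with open(path, "w") as f:
        f.write(data)
    return data

# Count how many times the (stand-in) download actually runs
calls = {"n": 0}
def fake_download():
    calls["n"] += 1
    return "id,age\n1,30\n"

with tempfile.TemporaryDirectory() as tmp:
    p = os.path.join(tmp, "titanic.csv")
    first = cached_fetch(p, fake_download)   # downloads and saves
    second = cached_fetch(p, fake_download)  # served from the local copy
```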

References

Functions, packages, repositories

[AAf1] Anton Antonov, ExampleDataset, (2020), Wolfram Function Repository.

[AAr1] Anton Antonov, Data::ExampleDatasets Raku package, (2021), GitHub/antononcube.

[VAB1] Vincent Arel-Bundock, Rdatasets, (2020), GitHub/vincentarelbundock.

[SS1] Scott Stevenson, xdg Python package, (2016-2021), PyPI.org.

Interactive interfaces

[AAi1] Anton Antonov, Example datasets recommender interface, (2021), Shinyapps.io.
