
ExampleDatasets Python package

Python package for obtaining example datasets from statistics packages.

Currently, this package contains only datasets metadata; the datasets themselves are downloaded from the Rdatasets repository, [VAB1].

This package follows the design of the Raku package with the same name; see [AAr1].


Usage examples

Setup
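
The package can be installed from PyPI in the usual way (assuming a standard pip setup):

pip install ExampleDatasets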

Here we load the Python packages time and pandas, together with this package:

import time
import pandas
from ExampleDatasets import *

Get a dataset by using an identifier

Here we get a dataset by using an identifier and display part of the obtained dataset:

tbl = example_dataset(itemSpec = 'Baumann')
tbl.head()

##    Unnamed: 0  group  pretest.1  ...  post.test.1  post.test.2  post.test.3
## 0           1  Basal          4  ...            5            4           41
## 1           2  Basal          6  ...            9            5           41
## 2           3  Basal          9  ...            5            3           43
## 3           4  Basal         12  ...            8            5           46
## 4           5  Basal         16  ...           10            9           46
##
## [5 rows x 7 columns]

Here we summarize the dataset obtained above:

tbl.describe()

##        Unnamed: 0  pretest.1  pretest.2  post.test.1  post.test.2  post.test.3
## count   66.000000  66.000000  66.000000    66.000000    66.000000    66.000000
## mean    33.500000   9.787879   5.106061     8.075758     6.712121    44.015152
## std     19.196354   3.020520   2.212752     3.393707     2.635644     6.643661
## min      1.000000   4.000000   1.000000     1.000000     0.000000    30.000000
## 25%     17.250000   8.000000   3.250000     5.000000     5.000000    40.000000
## 50%     33.500000   9.000000   5.000000     8.000000     6.000000    45.000000
## 75%     49.750000  12.000000   6.000000    11.000000     8.000000    49.000000
## max     66.000000  16.000000  13.000000    15.000000    13.000000    57.000000

Remark: The values for the arguments itemSpec and packageSpec correspond to the values of the columns “Item” and “Package”, respectively, in the metadata dataset from the GitHub repository “Rdatasets”, [VAB1]. See the datasets metadata sub-section below.
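
For example, the two arguments can be combined; here is a minimal sketch that fetches the “titanic” item of the “COUNT” package (the same dataset used in the timing examples below):

# Item 'titanic' from the R package 'COUNT' (see the metadata table below)
tblTitanic = example_dataset(itemSpec = 'titanic', packageSpec = 'COUNT')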

Get a dataset by using a URL

Here we find the URLs of datasets whose titles match a regex:

dfMeta = load_datasets_metadata()
print(dfMeta[dfMeta.Title.str.contains('^tita')][["Package", "Item", "CSV"]].to_string())

##     Package        Item                                                                      CSV
## 288   COUNT     titanic     https://vincentarelbundock.github.io/Rdatasets/csv/COUNT/titanic.csv
## 289   COUNT  titanicgrp  https://vincentarelbundock.github.io/Rdatasets/csv/COUNT/titanicgrp.csv

Here we get a dataset through pandas by using a URL and display the head of the obtained dataset:

url = 'https://raw.githubusercontent.com/antononcube/Raku-Data-Reshapers/main/resources/dfTitanic.csv'
tbl2 = pandas.read_csv(url)
tbl2.head()

##    id passengerClass  passengerAge passengerSex passengerSurvival
## 0   1            1st            30       female          survived
## 1   2            1st             0         male          survived
## 2   3            1st             0       female              died
## 3   4            1st            30         male              died
## 4   5            1st            20       female              died

Datasets metadata

Here we:

1. Get the datasets metadata dataset
2. Keep only the columns “Item”, “Title”, “Rows”, and “Cols”
3. Filter it to have only datasets with 13 rows
4. Display the result

tblMeta = load_datasets_metadata()
tblMeta = tblMeta[["Item", "Title", "Rows", "Cols"]]
tblMeta = tblMeta[tblMeta["Rows"] == 13]
tblMeta

##             Item                                              Title  Rows  Cols
## 805   Snow.pumps  John Snow's Map and Data on the 1854 London Ch...    13     4
## 820          BCG                                   BCG Vaccine Data    13     7
## 935       cement                    Heat Evolved by Setting Cements    13     5
## 1354    kootenay  Waterflow Measurements of Kootenay River in Li...    13     2
## 1644  Newhouse77  Medical-Care Expenditure: A Cross-National Sur...    13     5
## 1735      Saxony                                 Families in Saxony    13     2

Keeping downloaded data

By default the data is obtained over the web from Rdatasets, but example_dataset has an option to keep the data locally. (The data is saved under XDG_DATA_HOME; see [SS1].)
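
Here is a minimal sketch for locating that directory, assuming the xdg package [SS1] is installed (the sub-directory ExampleDatasets uses within it is an assumption not shown here):

from xdg import xdg_data_home

# Base directory for user-specific data files per the XDG specification;
# the downloaded CSV files are kept under this path
print(xdg_data_home())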

This can be demonstrated with the following timings for retrieving a dataset with ~1300 rows:

startTime = time.time()
data = example_dataset(itemSpec = 'titanic', packageSpec = 'COUNT', keep = True)
endTime = time.time()
print("Getting the data the first time took " + str(endTime - startTime) + " seconds")

## Getting the data the first time took 0.33232 seconds

startTime = time.time()
data = example_dataset(itemSpec = 'titanic', packageSpec = 'COUNT', keep = True)
endTime = time.time()
print("Getting the data the second time took " + str(endTime - startTime) + " seconds")

## Getting the data the second time took 0.01386 seconds

References

Functions, packages, repositories

[AAf1] Anton Antonov, ExampleDataset, (2020), Wolfram Function Repository.

[AAr1] Anton Antonov, Data::ExampleDatasets Raku package, (2021), GitHub/antononcube.

[VAB1] Vincent Arel-Bundock, Rdatasets, (2020), GitHub/vincentarelbundock.

[SS1] Scott Stevenson, xdg Python package, (2016-2021), PyPI.org.

Interactive interfaces

[AAi1] Anton Antonov, Example datasets recommender interface, (2021), Shinyapps.io.
