
An interface for visualizing and analyzing the see19 dataset

Project description

see19 Guide

A dataset and interface for visualizing and analyzing the epidemiology of Coronavirus Disease 2019 aka COVID19 aka C19

Current with version 0.4.0

Analysis

Please read my various deep dives with see19 exploring different aspects of COVID19.

How Effective Is Social Distancing?

What Factors Are Correlated With COVID19 Fatality Rates?

The COVID Dragons

Contents

  1. Purpose
  2. Getting Started
  3. the Data
    3.1 Data Sources
    3.2 Dataset Characteristics
    3.3 The Testset
    3.4 Disclaimer
  4. the CaseStudy Interface
    4.1 Basics
    4.2 Filtering
    4.3 Smoothing
    4.4 Available Factors
    4.5 Additional Flags
    4.6 RayStudy v BaseStudy
    4.7 Chart Objects
  5. compchart - Visualizing Regional Impacts
    5.1 Daily Fatalities Comparison - Italy
    5.2 Daily Fatalities Comparison - 10 Most Impacted Regions
    5.3 Varying the Categories
  6. compchart4D - Visualizing Factors in 4D
    6.1 From 3D to 4D
    6.2 More on the X-Axis
    6.3 How Far Can We Take It?
  7. heatmap - Visualizing with Color Maps
    7.1 Count Category v Single Factor
    7.2 Count Category v Multiple Factors
  8. barcharts - Comparing Regional Factors
  9. ScatterFlow for Large Sets
    9.1 substrinscat - for Strindex Sub-Categories
    9.2 scatterflow

1. Purpose

See19 is the single most comprehensive international COVID-19 dataset available.

Ease of use is paramount; thus, all data from all sources have been compiled into a single structure that is readily consumed and manipulated in the ubiquitous csv format.

Along with the root data, a module is included with analysis and visualization tools.

2. Getting Started

See19 is a dataset and a python package.

The dataset can be accessed directly here. Files are timestamped with creation date.

The package can be installed via pip.

pip install see19

3. the Data

3.1 Data Sources
3.2 Dataset Characteristics
3.3 The Testset
3.4 Disclaimer

The See19 dataset aggregates global data on COVID19 in various regions, as available data allows, and marries that data with available datasets on exogenous regional factors that might impact the epidemiology of the virus.

The dataset is compiled using Selenium, Django, SQLite, and Pandas.

COVID19 Data Characteristics:

  • Cumulative Cases for each region on each date
  • Cumulative Fatalities for each region on each date
  • State / Provincial-level data available for:
    • Australia
    • Brazil
    • Canada
    • China
    • Italy
    • United States
  • Country-level available for all other regions

Factor Data Characteristics available for most regions:

  • Longitude / Latitude
    • I just wrote a script that searched the region name on this website and pulled the coordinates from the resulting url
  • Population
  • Population demographic segmentation
  • Land Density
  • City Density (typically the density of the largest city in the region)
  • Climate Characteristics including:
    • Average daily temperature
    • Average daily dewpoint temperature
    • Average daily relative humidity (derived from temperature and dewpoint temperature; see the sketch below)
    • Total daily UV-B Radiation
  • Air quality measures
  • Historical Health Outcomes
  • Travel Popularity
  • Social Distancing Implementation

Updated each morning.
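As noted in the climate characteristics above, relative humidity is derived from temperature and dewpoint temperature. The guide does not document the exact formula used; one common approximation is the Magnus formula, sketched below for illustration only.

import numpy as np

def relative_humidity(temp_c, dewpoint_c):
    # Approximate relative humidity (%) from air temperature and dewpoint (Celsius)
    # using the Magnus formula. Illustrative only; the dataset's own derivation
    # may use different constants or input units.
    a, b = 17.625, 243.04
    gamma = lambda t: np.exp(a * t / (b + t))
    return 100.0 * gamma(dewpoint_c) / gamma(temp_c)

relative_humidity(25.0, 15.0)   # roughly 54%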

3.1 Data Sources

COVID Case, Fatality, and Testing Data:

Other Data:

  • Longitude & Latitude
    • I just wrote a script that searched each region name on this site
    • Any errors were fixed manually
  • Population, Demographics, and Density from SEDAC
    • Matched to regional case data by name, often manually
  • Climate Data from European Centre for Medium-Range Weather Forecasts
    • Climate data pulled from nearest matching longitude & latitude coordinate in the dataset
  • Air Quality Data from the World Air Quality Project
    • Air quality data recorded at city-level, with limited number of cities available
    • City data is aggregated to the regional or country-level
    • So, where a region has multiple cities reporting AQ data, the regional value is an aggregate of those cities
    • Where a region has only a single city, that city represents the whole region
    • Where a region has no reporting cities, no air quality data is available
  • Social Distancing Stringency Index and Policy Indicators via Oxford Covid Government Response Tracker
  • Google Mobility Data
  • Apple Mobility Index
  • GDP Per Capita via the OECD and WorldBank
    • utilizing real 2016 Purchasing Power Parity figures indexed to 2015 US dollars
  • Causes of Death
  • Travel Popularity
    • An even messier hodgepodge of data pulled from the World Tourism Organization via indexmundi
    • State/Provincial data were derived from the country-level and other various sources in an ad-hoc fashion
    • Good travel data is surprisingly difficult to come by. There are a number of services that offer data on flight statistics; however, they are prohibitively expensive

3.2 Dataset Characteristics

With see19 installed, we can download the dataset via get_baseframe

import numpy as np
import pandas as pd
# from see19 import get_baseframe
from casestudy.see19.see19 import get_baseframe
bf = get_baseframe()

The dataset is arranged such that each row is a unique entry for each region_id on each date

All other columns are the value of that particular factor in that particular region on that particular date

bf.head(3)
region_id country_id region_code region_name country_code country date cases deaths tests ... genito childbirth perinatal congenital other external visitors travel_year gdp gdp_year
0 282 110 ABR Abruzzo ITA Italy 2020-01-01 NaN NaN NaN ... 442.0 1.0 16.0 19.0 384.0 2059 181458.0 2017.0 4.560860e+10 2016.0
1 282 110 ABR Abruzzo ITA Italy 2020-01-02 NaN NaN NaN ... 442.0 1.0 16.0 19.0 384.0 2059 181458.0 2017.0 4.560860e+10 2016.0
2 282 110 ABR Abruzzo ITA Italy 2020-01-03 NaN NaN NaN ... 442.0 1.0 16.0 19.0 384.0 2059 181458.0 2017.0 4.560860e+10 2016.0

3 rows × 132 columns

This could perhaps be more appropriately structured as a multi-index frame; however, I find such indexes cumbersome to work with.

'There are {} unique regions in the dataset'.format(bf.region_id.unique().size)
'There are 325 unique regions in the dataset'

Australia, Brazil, Canada, China, Italy, and the US have state/provincial level data.

For example, regions within Italy and Brazil are as follows:

bf[bf.country.isin(['Italy', 'Brazil'])].region_name.unique()
array(['Abruzzo', 'Acre', 'Alagoas', 'Amapa', 'Amazonas', 'Bahia',
       'Basilicata', 'Calabria', 'Campania', 'Ceara', 'Distrito Federal',
       'Emilia-Romagna', 'Espirito Santo', 'Friuli Venezia Giulia',
       'Goias', 'Lazio', 'Liguria', 'Lombardia', 'Maranhao', 'Marche',
       'Mato Grosso', 'Mato Grosso Do Sul', 'Minas Gerais', 'Molise',
       'P.A. Bolzano', 'P.A. Trento', 'Para', 'Paraiba', 'Parana',
       'Pernambuco', 'Piaui', 'Piemonte', 'Puglia', 'Rio De Janeiro',
       'Rio Grande Do Norte', 'Rio Grande Do Sul', 'Rondonia', 'Roraima',
       'Santa Catarina', 'Sao Paulo', 'Sardegna', 'Sergipe', 'Sicilia',
       'Tocantins', 'Toscana', 'Umbria', "Valle d'Aosta", 'Veneto'],
      dtype=object)
'Each region has {} dates in the dataset'.format(bf.date.unique().size)
'Each region has 202 dates in the dataset'
"""Thus, there are {:,.0f} rows in the dataset, with one row for each unique `region_id`-`date` combination""" \
.format(bf.date.shape[0])
'Thus, there are 65,650 rows in the dataset, with one row for each unique `region_id`-`date` combination'
"""There are currently {} columns in the dataset, most of which are observable factors""".format(bf.columns.size)
'There are currently 132 columns in the dataset, most of which are observable factors'

The factors can be seen as split between two types:

  • Time-static factors, i.e. do not change by the date.

    • population, density, population demographic ranges, cause of death outcomes, travel popularity
  • Time-dynamic factors, i.e. change with each date.

    • fatalities, climate, pollution, mobility, and the Oxford stringency index

They can be found as follows:

ny = bf[bf.region_name == 'New York']

static = []
dynamic = []
for col in ny.columns:
    if ny[col].unique().size > 1:
        dynamic.append(col)
    else:
        static.append(col)

bold = '\033[1m'
end = '\033[0m'
print ('{}***STATIC***{}\n'.format(bold, end), static)
print ('\n')
print ('{}***DYNAMIC***{}\n'.format(bold, end), dynamic)
***STATIC***
 ['region_id', 'country_id', 'region_code', 'region_name', 'country_code', 'country', 'population', 'land_KM2', 'land_dens', 'city_KM2', 'city_dens', 'A00_04B', 'A05_09B', 'A10_14B', 'A15_19B', 'A20_24B', 'A25_29B', 'A30_34B', 'A35_39B', 'A40_44B', 'A45_49B', 'A50_54B', 'A55_59B', 'A60_64B', 'A65_69B', 'A70_74B', 'A75_79B', 'A80_84B', 'A09UNDERB', 'A14UNDERB', 'A19UNDERB', 'A24UNDERB', 'A29UNDERB', 'A34UNDERB', 'A65PLUSB', 'A70PLUSB', 'A75PLUSB', 'A80PLUSB', 'A85PLUSB', 'A05_19B', 'A05_24B', 'A05_29B', 'A05_34B', 'A15_24B', 'A15_29B', 'A15_34B', 'A20_29B', 'A20_34B', 'A35_54B', 'A40_54B', 'A45_54B', 'A35_64B', 'A40_64B', 'A45_64B', 'pm10', 'precipitation', 'wd', 'uvi', 'aqi', 'pol', 'mepaqi', 'pm1', 'e3', 'e4', 'h4', 'h5', 'transit_apple', 'walking_apple', 'year', 'neoplasms', 'blood', 'endo', 'mental', 'nervous', 'circul', 'infectious', 'respir', 'digest', 'skin', 'musculo', 'genito', 'childbirth', 'perinatal', 'congenital', 'other', 'external', 'visitors', 'travel_year', 'gdp', 'gdp_year']


***DYNAMIC***
 ['date', 'cases', 'deaths', 'tests', 'co', 'dew', 'humidity', 'no2', 'o3', 'pm25', 'pressure', 'so2', 'temperature', 'wind gust', 'wind speed', 'wind-gust', 'wind-speed', 'temp', 'dewpoint', 'uvb', 'rhum', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'e1', 'e2', 'h1', 'h2', 'h3', 'strindex', 'retail_n_rec', 'groc_n_pharm', 'parks', 'transit', 'workplaces', 'residential', 'driving_apple']
'The entire set has {:,.0f} different data points'.format(bf.size)
'The entire set has 8,665,800 different data points'

3.3 The Testset

A separate dataset, referred to as the testset, is housed in the see19 repo in the testset folder. The testset will include new data (either additional factors or new regions) that has not yet been incorporated in the see19 interface. The goal is to integrate the new data into the interface over time. The testset will be updated concurrently with the main dataset on an ad-hoc basis.

The existing see19 package is NOT compatible with the testset, HOWEVER you can download the testset via get_baseframe by setting test=True.

See the readme for additional data currently available in the testset.

bf_test = get_baseframe(test=True)

3.4 Disclaimer

I have said before and it bears repeating: This is an imperfect dataset. Specific problems are highlighted here.

GENERAL ISSUES

  • Not all factors have available measurements for each region or each date.

    • These are typically expressed as NaN
  • Some factors are available at regional levels while others are not

    • Measurements for a region are often compared to measurements for entire countries. This isn't necessarily problematic ... for geographically large and populous countries like the US, it is likely better to compare state-level data to other, smaller countries.
    • State-level measurements are often estimated by mixing separate data sources. For instance, visitor data for the provinces of Brazil was estimated by taking the country-level data from the World Tourism Organization and weighting it by each province's proportionate share of visitor travel, using separate data from the Brazilian government.
  • Some data is outdated.

    • GDP data lags significantly, particularly for large groups of countries, so 2016 figures have been used, presuming that the relative mix among countries has remained constant

DENSITY

Population density is oft-cited as a potential explanatory factor in COVID19 infection rates. And I couldn't agree more that it is important to consider. However, the study of density suffers from many issues.

  • Density is highly variable within regions, and case and fatality rates have been highly variable within regions and across densities. In New York City, for example, some of the least dense areas have had the highest infection rates.

  • With only regional data available, the most rigorous option is to simply use the density of the region as a whole. However, this is often a poor reflection of reality. New York State has significant land mass, despite most of its population residing on a tiny island at its southeastern edge.

  • To account for this, See19 includes a factor city_dens. city_dens is the density of the largest city in the region, so:

    • for New York State, city_dens is the density of New York City,
    • for Taiwan, city_dens is the density of Taipei,
    • for Japan, city_dens is the density of Tokyo, and so on.

    This approach results in its own issues. For instance, at present, for all of Russia, city_dens reflects the density of Moscow.

Other geographic measurements, such as temperature and UV-B radiation, suffer from similar issues.

The only true way to address these shortcomings is for daily case and fatality statistics to be released at the county-level (or equivalent) in every country around the globe.

CASE DATA

Aside from just the difficulties of aggregating data, there are well-documented issues with the underlying case and fatality counts as well.

  • Confirmed cases are likely well below actual cases given up to 50% of all COVID19 cases may be asymptomatic and limited testing in the early stages led to many symptomatic cases going unreported.

  • The rapid improvement in testing likely exaggerated the growth of infections over time

  • Fatalities were unreported at peak periods due to lack of health care capacity

  • Fatalities have been retroactively added to the data without adjusting back to the days the fatalities actually occurred, so for regions like Hubei and New York State, there are massive spikes in fatalities that don't reflect the actual experience.

  • China has been heavily criticized for under-reporting and late reporting, and it recently added a ~20% increase in cumulative fatalities on a single day in March. For these reasons, throughout this tutorial, you will see that China is often excluded from the dataset.

TESTING

Testing statistics are still a bit of a mess internationally. For instance, many European countries only report cumulative test counts on a weekly basis, and many have only begun reporting in the very recent past. Different methods of interpolation are available in the CaseStudy interface.

  • Brazil is not currently included in the tests data. Brazil test counts are currently available only at the country level, whereas case and fatality data is available at a regional level. Methods are being considered to allocate aggregate tests among the regions (perhaps simply as a percentage of population or case counts).

4. the CaseStudy Interface

4.1 Basics
4.2 Filtering
4.3 Smoothing
4.4 Available Factors
4.5 Additional Flags
4.6 RayStudy v BaseStudy
4.7 Chart Objects

See19 visualization and data analysis is performed via the CaseStudy class. CaseStudy provides attributes and methods for filtering, manipulating, appending, and visualizing data in the baseframe.

CaseStudy can be accessed directly from the see19 module. To initialize, simply pass the baseframe.

# from see19 import CaseStudy
from casestudy.see19.see19 import CaseStudy
casestudy = CaseStudy(bf)

4.1 Basics

The original baseframe can be accessed via the baseframe attribute

casestudy.baseframe.head(2)
region_id country_id region_code region_name country_code country date cases deaths tests ... genito childbirth perinatal congenital other external visitors travel_year gdp gdp_year
0 282 110 ABR Abruzzo ITA Italy 2020-01-01 NaN NaN NaN ... 442.0 1.0 16.0 19.0 384.0 2059 181458.0 2017.0 4.560860e+10 2016.0
1 282 110 ABR Abruzzo ITA Italy 2020-01-02 NaN NaN NaN ... 442.0 1.0 16.0 19.0 384.0 2059 181458.0 2017.0 4.560860e+10 2016.0

2 rows × 132 columns

CaseStudy automatically computes different adjustments including:

  1. Daily new cases, fatalities, and tests (called count_types)
  2. Daily Moving Average (DMA) for new and cumulative count_types
  3. Population and density adjustments for new and cumulative count_types
  4. Daily growth or change in 1. thru 3. above

These adjustments are referred to as count_categories. Additional adjustments are available via kwargs to be discussed below.
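For intuition, the sketch below approximates a few of these adjustments directly on the baseframe with pandas. The column names mirror the CaseStudy naming conventions, but this is illustrative only and not the package's actual implementation.

import pandas as pd

# Illustrative sketch only -- not see19's internal implementation
df = bf.sort_values(['region_id', 'date']).copy()

# 1. daily new counts: day-over-day difference of the cumulative series
df['cases_new'] = df.groupby('region_id')['cases'].diff()

# 2. daily moving average of the new counts (window length is illustrative)
df['cases_new_dma'] = (df.groupby('region_id')['cases_new']
                         .transform(lambda s: s.rolling(3, min_periods=1).mean()))

# 3. population adjustment of the cumulative counts
df['cases_per_1M'] = df['cases'] / df['population'] * 1e6

# 4. daily growth of the new counts
df['growth_cases_new'] = df.groupby('region_id')['cases_new'].pct_change() + 1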

Adjustments are added to the dataset by calling the make method. The amended dataset is then accessible via the df attribute.

casestudy.make()

The amended dataframe can be accessed via the df attribute:

casestudy.df.head(2)
region_id country_id region_code region_name country_code country date cases deaths tests ... growth_cases_per_person_per_city_KM2 growth_deaths_per_1K growth_deaths_per_1M growth_deaths_per_person_per_land_KM2 growth_deaths_per_person_per_city_KM2 growth_tests_per_1K growth_tests_per_1M growth_tests_per_person_per_land_KM2 growth_tests_per_person_per_city_KM2 days
43906 32 110 TRE P.A. Trento ITA Italy 2020-03-13 216.699585 1.87999 803.712436 ... 1.523364 2.0 2.0 2.0 2.0 1.426644 1.426644 1.426644 1.426644 0 days
43907 32 110 TRE P.A. Trento ITA Italy 2020-03-14 273.865733 1.87999 955.714788 ... 1.263804 1.0 1.0 1.0 1.0 1.189125 1.189125 1.189125 1.189125 1 days

2 rows × 140 columns

NOTE: Ray and Numba are utilized to significantly improve the speed of make. Ray is not compatible with Windows. CaseStudy will attempt to detect incompatibility and revert to a single-process method where applicable.

More in Section 4.6

For ease of selection, CaseStudy has a number of class attributes with different groupings of count categories: BASECOUNT_CATS, PER_CATS, LOGNAT_CATS, LOG_CATS, ALL_CATS, DMA_COUNT_CATS, PER_COUNT_CATS.

DMA_COUNT_CATS is shown as an example:

CaseStudy.DMA_COUNT_CATS[:10]
['cases_dma',
 'cases_new_dma',
 'deaths_dma',
 'deaths_new_dma',
 'tests_dma',
 'tests_new_dma',
 'cases_dma_per_1K',
 'cases_dma_per_1M',
 'cases_dma_per_person_per_land_KM2',
 'cases_dma_per_person_per_city_KM2']

Both the log10 and natural log of each of 1. thru 3. above are available for presentation purposes. Simply provide log=True and/or lognat=True.

casestudy.log = True
casestudy.lognat = True
casestudy.make()
casestudy.df[['region_name', 'date'] + [col for col in casestudy.df if 'log' in col]].head(2)
region_name date cases_dma_log cases_new_log cases_new_dma_log deaths_dma_log deaths_new_log deaths_new_dma_log tests_dma_log tests_new_log ... growth_cases_per_person_per_land_KM2_lognat growth_cases_per_person_per_city_KM2_lognat growth_deaths_per_1K_lognat growth_deaths_per_1M_lognat growth_deaths_per_person_per_land_KM2_lognat growth_deaths_per_person_per_city_KM2_lognat growth_tests_per_1K_lognat growth_tests_per_1M_lognat growth_tests_per_person_per_land_KM2_lognat growth_tests_per_person_per_city_KM2_lognat
43906 P.A. Trento 2020-03-13 2.186879 1.871859 1.691872 -0.026874 -0.026874 -0.202966 2.794193 2.380851 ... -1.014299 -1.014299 0.890089 2.152714 0.867427 0.867427 4.976355 1.050782 1.304384 1.304384
43907 P.A. Trento 2020-03-14 2.324156 1.757139 1.757139 0.194974 NaN -0.202966 2.888888 2.181850 ... 2.104604 2.104604 1.000000 1.000000 1.000000 1.000000 1.389530 1.023559 1.113758 1.113758

2 rows × 242 columns

'In total, there are {} different `count_categories` to choose from.'.format(len(CaseStudy.ALL_COUNT_CATS))
'In total, there are 180 different `count_categories` to choose from.'

4.2 Filtering

Thankfully, casestudy.df can be limited to specific count categories via the count_categories attribute:

casestudy.count_categories = ['tests_new_dma_per_person_per_land_KM2']
casestudy.make()
casestudy.df.head(2)
region_id country_id region_code region_name country_code country date cases deaths tests population land_KM2 land_dens city_KM2 city_dens tests_new_dma_per_person_per_land_KM2 days
43906 32 110 TRE P.A. Trento ITA Italy 2020-03-13 216.699585 1.87999 803.712436 515201.0 2938.79544 175.310262 2938.79544 175.310262 0.807438 0 days
43907 32 110 TRE P.A. Trento ITA Italy 2020-03-14 273.865733 1.87999 955.714788 515201.0 2938.79544 175.310262 2938.79544 175.310262 0.865241 1 days

When passing kwargs to CaseStudy at initialization, most kwargs will accept either a string for a single category or a list (or other iterable) for multiple. When assigning to an instance attribute, an iterable must be passed.

casestudy = CaseStudy(bf, count_categories='tests_new_dma_per_person_per_land_KM2')
casestudy.make()
casestudy.df[['region_name', 'date', 'tests_new_dma_per_person_per_land_KM2']].head(2)
region_name date tests_new_dma_per_person_per_land_KM2
43906 P.A. Trento 2020-03-13 0.807438
43907 P.A. Trento 2020-03-14 0.865241
casestudy.count_categories = ['deaths_new_dma_per_person_per_land_KM2', 'growth_cases_new_per_1M']
casestudy.make()
casestudy.df.head(2)
region_id country_id region_code region_name country_code country date cases deaths tests population land_KM2 land_dens city_KM2 city_dens deaths_new_dma_per_person_per_land_KM2 growth_cases_new_per_1M days
43906 32 110 TRE P.A. Trento ITA Italy 2020-03-13 216.699585 1.87999 803.712436 515201.0 2938.79544 175.310262 2938.79544 175.310262 0.003575 1.866667 0 days
43907 32 110 TRE P.A. Trento ITA Italy 2020-03-14 273.865733 1.87999 955.714788 515201.0 2938.79544 175.310262 2938.79544 175.310262 0.003575 0.767857 1 days

CaseStudy can further filter the baseframe as follows:

  • regions to limit the frame to certain regions
  • countries to limit the frame to certain countries
  • exclude_regions to exclude certain regions
  • exclude_countries to exclude certain countries

Specific regions can be included or excluded by providing the region_name, region_code, or region_id. Specific countries can be included or excluded by providing the country, country_code, or country_id.

Each of the four parameters can accept a single region as a str object or multiple regions via several common iterables.

Below we select three regions:

regions = ['New York', 'FL', 35]
casestudy = CaseStudy(
    bf, regions=regions, count_categories=CaseStudy.BASECOUNT_CATS, 
)
casestudy.make()

We can see that all three regions are indeed in the object by grouping:

pd.concat([df_group.iloc[:1] for region_id, df_group in casestudy.df.groupby('region_id')]).head(3)
region_id country_id region_code region_name country_code country date cases deaths tests ... cases_dma cases_new cases_new_dma deaths_dma deaths_new deaths_new_dma tests_dma tests_new tests_new_dma days
53399 35 110 SIC Sicilia ITA Italy 2020-03-12 102.712067 2.000000 973.321711 ... 77.406196 28.580749 15.778955 0.666667 2.000000 0.666667 796.493912 186.492921 140.803254 0 days
17846 64 236 FL Florida USA United States of America (the) 2020-03-11 28.000000 2.526828 329.000000 ... 21.666667 9.000000 3.666667 0.842276 2.526828 0.842276 242.666667 88.000000 64.666667 0 days
40070 75 236 NY New York USA United States of America (the) 2020-03-15 729.000000 3.143533 6916.080830 ... 558.000000 205.000000 171.000000 1.047844 3.143533 1.047844 5149.016931 2583.035500 2170.676861 0 days

3 rows × 25 columns

The region and country filters are important mechanisms for isolating data.

Here, we focus on US regions only, but exclude some of the most impacted ones:

casestudy.countries = ['USA']
casestudy.excluded_regions = ['NY', 'NJ']
casestudy.regions = None
casestudy.make()

Because certain regions were assigned in the previous CaseStudy instantiation, we must set regions=None above in order to include ALL the regions of the baseframe.

And below we can see that we have various US states in the dataset and that neither New York nor New Jersey is included.

casestudy.df.region_name.unique()
array(['Alabama', 'Wyoming', 'Alaska', 'Arkansas', 'Delaware', 'Idaho',
       'Maine', 'Mississippi', 'Montana', 'New Mexico', 'North Dakota',
       'South Dakota', 'West Virginia', 'Michigan', 'Vermont', 'Georgia',
       'Colorado', 'Florida', 'Oregon', 'Texas', 'Illinois',
       'Pennsylvania', 'Iowa', 'Maryland', 'North Carolina', 'Washington',
       'California', 'Massachusetts', 'Oklahoma', 'Arizona',
       'Connecticut', 'Minnesota', 'Virginia', 'New Hampshire', 'Hawaii',
       'Nevada', 'Indiana', 'Kentucky', 'District of Columbia',
       'Missouri', 'Louisiana', 'Ohio', 'Wisconsin', 'Kansas', 'Utah',
       'Tennessee', 'South Carolina', 'Nebraska'], dtype=object)
pd.concat([df_group.iloc[:1] for region_id, df_group in casestudy.df.groupby('region_id')]).head(3)
region_id country_id region_code region_name country_code country date cases deaths tests ... cases_dma cases_new cases_new_dma deaths_dma deaths_new deaths_new_dma tests_dma tests_new tests_new_dma days
691 44 236 AL Alabama USA United States of America (the) 2020-03-26 558.514091 1.26695 10468.861581 ... 369.399307 246.143562 124.727455 0.422317 1.26695 0.422317 7859.521030 3287.002892 1929.975539 0 days
64339 48 236 WY Wyoming USA United States of America (the) 2020-04-13 316.114653 1.00000 9715.352851 ... 305.385913 16.093110 8.429724 0.333333 1.00000 0.333333 9166.923029 822.644733 529.424828 0 days
1094 49 236 AK Alaska USA United States of America (the) 2020-03-25 53.977249 1.00000 3783.772189 ... 42.839087 7.711036 8.567817 0.333333 1.00000 0.333333 2745.528371 1496.950677 539.260259 0 days

3 rows × 25 columns

casestudy.df[casestudy.df.region_name.isin(['NY', 'NJ'])]
region_id country_id region_code region_name country_code country date cases deaths tests ... cases_dma cases_new cases_new_dma deaths_dma deaths_new deaths_new_dma tests_dma tests_new tests_new_dma days

0 rows × 25 columns

Limiting data via different start and tail hurdles

Parameters exist that allow you to filter the dataset such that regions and days appear only if they meet certain criteria.

start_factor and start_hurdle provide the ability to effectively crop the beginning of a region's period of data.

tail_factor and tail_hurdle do the same for the end of a region's period.

start_factor and tail_factor accept any dynamic factor in the dataset (including date).

The hurdle is the level of the specified factor the region must reach to be included. For instance, if start_factor=cases_new_per_1M and start_hurdle=100, each region's first row in casestudy.df will be the day that the region met or exceeded 100 new cases per 1 million people.

These options are a convenient way to compare regions that have been impacted to a similar extent or, perhaps, to fairly compare regions that were impacted at different times.

The default parameters for start_factor and start_hurdle limit the data to regions with at least one cumulative fatality.

NOTE: a days column is added to casestudy.df. This is a count of the number of days from the current date back to the first date in the casestudy. When a start_factor is provided, this is the first date that the start_hurdle is met. When start_factor is not provided, this is the first date in the dataset.

Examples are shown below.

casestudy = CaseStudy(
    bf, regions='Spain', count_categories=CaseStudy.BASECOUNT_CATS, 
    start_factor='cases', start_hurdle=1000
)
casestudy.make()
casestudy.df.head(2)
region_id country_id region_code region_name country_code country date cases deaths tests ... cases_dma cases_new cases_new_dma deaths_dma deaths_new deaths_new_dma tests_dma tests_new tests_new_dma days
55820 491 209 ESP Spain ESP Spain 2020-03-09 1057.840245 27.344784 NaN ... 738.089217 394.348647 221.163866 17.904323 10.742594 7.487262 NaN NaN NaN 0 days
55821 491 209 ESP Spain ESP Spain 2020-03-10 1671.052390 34.180981 NaN ... 1130.794744 613.212146 392.705527 26.042652 6.836196 8.138329 NaN NaN NaN 1 days

2 rows × 25 columns

casestudy = CaseStudy(
    bf, countries='Sweden', 
    count_categories='deaths_new', start_factor='deaths_new', start_hurdle=100
)
casestudy.make()
casestudy.df.head(2)
region_id country_id region_code region_name country_code country date cases deaths tests population land_KM2 land_dens city_KM2 city_dens deaths_new days
56656 495 214 SWE Sweden SWE Sweden 2020-04-06 7438.936775 675.770207 NaN 9415570.0 415314.854224 22.67092 2150.411192 4378.497486 107.669886 0 days
56657 495 214 SWE Sweden SWE Sweden 2020-04-07 7941.679240 837.275037 NaN 9415570.0 415314.854224 22.67092 2150.411192 4378.497486 161.504829 1 days

To see the earliest dates in the dataframe, prior to any deaths being recorded, set start_factor to ''.

casestudy.countries = None
casestudy.regions = ['RJ']
casestudy.count_categories = ['tests_new_dma']
casestudy.factors = ['temp', 'strindex']
casestudy.start_factor = ''
casestudy.make()
casestudy.df.head(2)
region_id country_id region_code region_name country_code country date cases deaths tests population land_KM2 land_dens city_KM2 city_dens tests_new_dma temp strindex days
48480 557 31 RJ Rio De Janeiro BRA Brazil 2020-01-01 NaN NaN NaN 15962668.0 42269.311478 377.642016 2203.766328 7243.357792 NaN 294.134674 0.0 0 days
48481 557 31 RJ Rio De Janeiro BRA Brazil 2020-01-02 NaN NaN NaN 15962668.0 42269.311478 377.642016 2203.766328 7243.357792 NaN 294.375153 0.0 1 days
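The examples above all use start_factor and start_hurdle. The tail_factor and tail_hurdle kwargs are not demonstrated in this guide; a hedged sketch, assuming they simply mirror the start_* kwargs and crop the end of each region's period at the hurdle:

# Hedged sketch only: the exact tail_hurdle semantics follow the description above
casestudy = CaseStudy(
    bf, regions='Spain', count_categories='deaths_new',
    start_factor='deaths', start_hurdle=1,
    tail_factor='deaths', tail_hurdle=10000,
)
casestudy.make()
casestudy.df.tail(2)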

4.3 Smoothing

Smoothing is applied in two ways within the make method.

The first addresses NaN values within the count_type time-series. Sometimes there are artifacts and one-offs within the set. Other times, as with test counts in many regions, the count is only updated periodically and NaNs fill the gaps.

In these instances, make interpolates between the real values to fill in the gaps. The default method is linear interpolation, but this can be overridden by providing interpolation_method (see the Pandas docs for options).
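For intuition, this is the kind of gap-filling pandas performs (illustrative only; make applies it internally per region, and interpolation_method maps to the pandas method argument):

import numpy as np
import pandas as pd

# a weekly-reported series with NaN gaps, filled by linear interpolation
weekly = pd.Series([100.0, np.nan, np.nan, np.nan, np.nan, np.nan, 170.0])
weekly.interpolate(method='linear')
# 100.0, 111.67, 123.33, 135.0, 146.67, 158.33, 170.0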

For instance, below we see Spain's testing data:

casestudy = CaseStudy(bf, regions='Spain')
casestudy.make()
casestudy.df.tests.tail(20)
55934    3.619554e+06
55935    3.644458e+06
55936    3.673778e+06
55937    3.703099e+06
55938    3.732419e+06
55939    3.761740e+06
55940    3.791060e+06
55941    3.820381e+06
55942    3.849701e+06
55943    3.881696e+06
55944    3.913690e+06
55945    3.945685e+06
55946    3.977680e+06
55947    4.009675e+06
55948    4.041669e+06
55949    4.073664e+06
55950    4.073664e+06
55951    4.073664e+06
55952    4.073664e+06
55953    4.073664e+06
Name: tests, dtype: float64

But when we set interpolate=False, we can see that Spain in fact updates its testing counts only weekly.

casestudy = CaseStudy(bf, regions='Spain', interpolate=False)
casestudy.make()
casestudy.df.tests.tail(20)

55934          NaN
55935    3644458.0
55936          NaN
55937          NaN
55938          NaN
55939          NaN
55940          NaN
55941          NaN
55942    3849701.0
55943          NaN
55944          NaN
55945          NaN
55946          NaN
55947          NaN
55948          NaN
55949    4073664.0
55950          NaN
55951          NaN
55952          NaN
55953          NaN
Name: tests, dtype: float64

The second approach is new in 0.3.6. CaseStudy automatically applies smoothing to negative values and large outliers in the main count_types (cases, deaths, and tests).

Many regions have chosen to "adjust" or "catch up" their case or fatality counts, not by adjusting the actual dates on which the outcomes occurred, but instead by lumping them onto a seemingly random reporting date. This creates strange artifacts in the time series.

For example, Spain has a dip into negative daily counts in late April 2020:

casestudy = CaseStudy(bf, regions='Spain', smooth=False)
casestudy.make()
casestudy.compchart.make(x_category='date', y_category='deaths_new', figsize=(8,4))
(figure: Daily Deaths)

With smooth=True (the default setting), this deep negative value is redistributed through prior dates according to the distribution of counts up to the date with the negative value.

This is a somewhat naive approach, but it has the benefit of maintaining a consistent shape in the time series.
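A toy illustration of the redistribution idea (not the package's actual implementation):

import numpy as np

# Hypothetical sketch: the negative daily value is zeroed out and the shortfall
# is spread across prior days in proportion to their share of the counts to date,
# preserving the cumulative total
daily = np.array([10., 40., 50., -20.])
shortfall = daily[-1]
weights = daily[:-1] / daily[:-1].sum()
smoothed = np.append(daily[:-1] + shortfall * weights, 0.0)
print(smoothed)   # [ 8. 32. 40.  0.] -- same cumulative total as before

Returning to the guide's example, with smooth=True: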

casestudy = CaseStudy(bf, regions='Spain', smooth=True)
casestudy.make()
casestudy.compchart.make(x_category='date', y_category='deaths_new', figsize=(8,4))
(figure: Daily Deaths)

The same adjustment is made for VERY large increases in counts relative to the cumulative total and to the daily rate. For example, see New York below:

casestudy = CaseStudy(bf, regions='NY', smooth=False)
casestudy.make()
casestudy.compchart.make(x_category='date', y_category='deaths_new', figsize=(8,4))
(figure: Daily Deaths)

casestudy = CaseStudy(bf, regions='NY', smooth=True)
casestudy.make()
casestudy.compchart.make(x_category='date', y_category='deaths_new', figsize=(8,4))
(figure: Daily Deaths)

4.4 Available Factors

The remaining columns in the baseframe can be included in a CaseStudy instance on an opt-in basis via the factors attribute:

casestudy = CaseStudy(bf, count_categories='cases_new_per_person_per_land_KM2', factors=['no2', 'strindex'])
casestudy.make()
casestudy.df.head(2)
region_id country_id region_code region_name country_code country date cases deaths tests population land_KM2 land_dens city_KM2 city_dens cases_new_per_person_per_land_KM2 no2 strindex days
43905 32 110 TRE P.A. Trento ITA Italy 2020-03-12 131.523112 1.096661 652.429603 515201.0 2938.79544 175.310262 2938.79544 175.310262 0.210345 NaN 85.19 0 days
43906 32 110 TRE P.A. Trento ITA Italy 2020-03-13 200.357639 2.193322 930.784897 515201.0 2938.79544 175.310262 2938.79544 175.310262 0.392644 NaN 85.19 1 days

For convenience, a number of factor groupings can be accessed via CaseStudy attributes:

  • GMOBIS, AMOBIS, CAUSES, MAJOR_CAUSES, POLLUTS, TEMP_MSMTS, MSMTS
    • various groupings for factor data
    • GMOBIS refer to Google Mobility data.
    • AMOBIS refer to Apple Mobility data.
  • STRINDEX_CATS, CONTAIN_CATS, ECON_CATS, HEALTH_CATS
    • groupings for the Oxford Stringency Index
print (CaseStudy.MSMTS)
print (CaseStudy.MAJOR_CAUSES)
['uvb', 'rhum', 'temp', 'dewpoint']
['circul', 'infectious', 'respir', 'endo']

Different demographic population age groupings can be accessed as well:

  • ALL_RANGES - all the possible demographic age ranges
  • RANGES - a dictionary of various groupings of age ranges
from see19 import RANGES
RANGES.keys()
dict_keys(['UNDERS', 'OVERS', 'SCHOOL_GOERS', 'Y_MILLS', 'MILLS', 'MID', 'MID_PLUS'])
overs = RANGES['OVERS']['ranges']
casestudy = CaseStudy(bf, regions='Lombardia', count_categories='deaths_new_per_person_per_land_KM2', factors=overs)
casestudy.make()
casestudy.df.head(2)
region_id country_id region_code region_name country_code country date cases deaths tests ... A70PLUSB A75PLUSB A80PLUSB A85PLUSB A65PLUSB_% A70PLUSB_% A75PLUSB_% A80PLUSB_% A85PLUSB_% days
31566 36 110 LOM Lombardia ITA Italy 2020-02-24 216.225177 6.0 943.732875 ... 1490749.0 963768.0 0.0 0.0 0.208224 0.154784 0.100068 0.0 0.0 0 days
31567 36 110 LOM Lombardia ITA Italy 2020-02-25 301.709549 9.0 2386.747531 ... 1490749.0 963768.0 0.0 0.0 0.208224 0.154784 0.100068 0.0 0.0 1 days

2 rows × 27 columns

casestudy = CaseStudy(bf, regions='LOM', count_categories='deaths_new_per_person_per_land_KM2', factors=CaseStudy.MAJOR_CAUSES)
casestudy.make()
casestudy.df.head(2)
region_id country_id region_code region_name country_code country date cases deaths tests ... deaths_new_per_person_per_land_KM2 circul infectious respir endo circul_% infectious_% respir_% endo_% days
31566 36 110 LOM Lombardia ITA Italy 2020-02-24 216.225177 6.0 943.732875 ... NaN 74695 4630 20185 6566.0 0.007756 0.000481 0.002096 0.000682 0 days
31567 36 110 LOM Lombardia ITA Italy 2020-02-25 301.709549 9.0 2386.747531 ... 0.00507 74695 4630 20185 6566.0 0.007756 0.000481 0.002096 0.000682 1 days

2 rows × 25 columns

Some factors are only available at a country level.

By setting country_level=True, casestudy will aggregate most data among the subregions up to the country level to allow for proper comparison across the broad range of countries.

The Oxford Stringency Index and its derivatives are one such data group available only at the country level.

casestudy = CaseStudy(bf, 
    count_categories='deaths_new_per_person_per_land_KM2', 
    factors='strindex',
    country_level=True,
)
casestudy.make()
casestudy.df.tail(2)
/Users/spindicate/Documents/programming/zooscraper/casestudy/see19/see19/study/ray.py:16: UserWarning: smoothing is unavailable when country_level=True
  super().__init__(*args, **kwargs)



region_id country_id region_code region_name country_code country date cases deaths tests population land_KM2 land_dens city_KM2 city_dens deaths_new_per_person_per_land_KM2 strindex days
36560 id_for_USA 236 USA name_for_USA USA United States of America (the) 2020-07-19 3725463.0 131737.0 45313502.0 307692971.0 9.087502e+06 33.858916 710152.024025 433.277609 15.446448 68.98 144 days
36561 id_for_USA 236 USA name_for_USA USA United States of America (the) 2020-07-20 3782891.0 132095.0 46043131.0 307692971.0 9.087502e+06 33.858916 710152.024025 433.277609 10.573286 68.98 145 days

Above you can see that all US states have been aggregated into a single region with a placeholder region_id (id_for_USA).

With respect to the STRINDEX_CATS subgroups, if all the required categories are provided, CaseStudy will sum the individual category values.

For example, if CONTAIN_CATS are provided, the aggregate of the eight categories will be included in the c_sum column.

Note if all five h indicators are provided, CaseStudy will also tabulate a key3_sum, which aggregates the scores on the h1, h2, and h3 indicators.

casestudy = CaseStudy(bf, 
    count_categories='deaths_new_per_person_per_land_KM2', 
    factors=CaseStudy.CONTAIN_CATS,
    country_level=True,
)
casestudy.make()
casestudy.df.tail(2)
/Users/spindicate/Documents/programming/zooscraper/casestudy/see19/see19/study/ray.py:16: UserWarning: smoothing is unavailable when country_level=True
  super().__init__(*args, **kwargs)



region_id country_id region_code region_name country_code country date cases deaths tests ... c1 c2 c3 c4 c5 c6 c7 c8 c_sum days
36560 id_for_USA 236 USA name_for_USA USA United States of America (the) 2020-07-19 3725463.0 131737.0 45313502.0 ... 3.0 2.0 2.0 4.0 1.0 2.0 2.0 3.0 19.0 144 days
36561 id_for_USA 236 USA name_for_USA USA United States of America (the) 2020-07-20 3782891.0 132095.0 46043131.0 ... 3.0 2.0 2.0 4.0 1.0 2.0 2.0 3.0 19.0 145 days

2 rows × 26 columns

Additional computations can be added for each factor via the factor_dmas attribute.

The attribute is a dictionary of the form str(factor_name): int(dma).

When provided, CaseStudy will automatically add _dma, _growth, and _growth_dma computations

casestudy = CaseStudy(bf, count_categories='deaths_new_dma_per_1M', 
    factors=['temp', 'c1', 'strindex'], 
    factor_dmas={'temp': 7, 'c1': 14},
    country_level=True,
)
casestudy.make()
casestudy.df.head(2)
/Users/spindicate/Documents/programming/zooscraper/casestudy/see19/see19/study/ray.py:16: UserWarning: smoothing is unavailable when country_level=True
  super().__init__(*args, **kwargs)



region_id country_id region_code region_name country_code country date cases deaths tests ... temp c1 strindex temp_dma temp_growth temp_growth_dma c1_dma c1_growth c1_growth_dma days
81 293 1 AFG Afghanistan AFG Afghanistan 2020-03-22 40.0 1.0 NaN ... 10.778741 3.0 41.67 7.908977 1.067747 1.384819 1.928571 1.0 NaN 0 days
82 293 1 AFG Afghanistan AFG Afghanistan 2020-03-23 40.0 1.0 NaN ... 8.560785 3.0 41.67 8.784692 0.794229 1.150845 2.142857 1.0 NaN 1 days

2 rows × 26 columns

NOTE: When country_level=True, smooth is currently NOT available (as per the warning above), and Ray multi-processing is also NOT available.

To provide a single dma for all the factors submitted, build the dictionary ahead of time:

factor_dmas = {msmt: 14 for msmt in CaseStudy.MSMTS}
casestudy = CaseStudy(
    bf, count_categories='tests_new_per_1M', 
    factors=CaseStudy.MSMTS, factor_dmas=factor_dmas
)
casestudy.make()
casestudy.df.head(2)
region_id country_id region_code region_name country_code country date cases deaths tests ... rhum_dma rhum_growth rhum_growth_dma temp_dma temp_growth temp_growth_dma dewpoint_dma dewpoint_growth dewpoint_growth_dma days
43905 32 110 TRE P.A. Trento ITA Italy 2020-03-12 131.523112 1.096661 652.429603 ... 90.025840 1.050915 0.996733 3.513738 0.959184 1.105750 -3.142554 1.896068 -0.635699 0 days
43906 32 110 TRE P.A. Trento ITA Italy 2020-03-13 200.357639 2.193322 930.784897 ... 89.967379 0.995192 1.001809 3.242550 1.053689 1.114479 -3.447804 1.026207 -0.735813 1 days

2 rows × 33 columns

Other factors are adjusted to population. These factors are appended with _% and can be seen via the pop_cats attribute.

These are typically time-static factors.

casestudy = CaseStudy(bf, count_categories='deaths_new_dma_per_1M', factors=['visitors', 'gdp', 'A65PLUSB' ])
print (casestudy.pop_cats)
casestudy.make()
casestudy.df[['region_name', 'date', 'visitors_%', 'gdp_%', 'A65PLUSB_%']].head(2)
['A65PLUSB', 'visitors', 'gdp']



region_name date visitors_% gdp_% A65PLUSB_%
43905 P.A. Trento 2020-03-12 19.864474 54504.746691 0.203018
43906 P.A. Trento 2020-03-13 19.864474 54504.746691 0.203018

4.5 Additional Flags

There are several additional flags and methods that will be touched on only briefly here; you are encouraged to read the analysis pages to see them in action. A hedged usage sketch follows the list below.

  • world_averages: when set to True, averages each date in the dataset across all the regions, to provide a per_region statistic for each factor

  • favor_earlier: when set to True, scales any selected rows such that values earlier in the dataset receive more weight than later ones. A new column is added with the _earlier suffix. This is helpful when attempting to study the impacts of early moves to, say, social distance. Factors are selected by passing a list to the factors_to_favor_earlier parameter.
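A minimal usage sketch, assuming both flags are ordinary CaseStudy kwargs as described above (the output columns, such as the _earlier suffix, follow the description and are not verified here):

# Hedged sketch only: kwargs are taken from the descriptions above
casestudy = CaseStudy(
    bf,
    count_categories='deaths_new_dma_per_1M',
    factors=['strindex'],
    world_averages=True,
    favor_earlier=True,
    factors_to_favor_earlier=['strindex'],
)
casestudy.make()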

4.6 RayStudy v BaseStudy

The default implementation of make utilizes both Ray and Numba to significantly improve the performance.

Ray is a 3rd party multi-processing package. For see19 purposes, Ray's key feature is the ability to share (albeit read-only) large objects among different live processes. Python's standard multi-processing module does not allow for simple access to the baseframe and, therefore, did not provide any performance benefits.

Numba provides just-in-time compiling of certain numpy implementations. The custom Numba function typically provides 10x speed improvement versus the same built-in Pandas method.
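For readers unfamiliar with Numba, the general pattern looks like this (a generic illustration, not see19's internal code):

import numpy as np
from numba import njit

@njit
def rolling_mean(values, window):
    # simple trailing moving average compiled to machine code by Numba
    out = np.empty(values.size)
    for i in range(values.size):
        lo = max(0, i - window + 1)
        out[i] = values[lo:i + 1].mean()
    return out

# the first call triggers compilation; subsequent calls run the compiled version
rolling_mean(np.arange(10.0), 3)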

Ray is not compatible with Windows. CaseStudy will attempt to detect incompatibility and revert to a single-process method where necessary.

To support this, a root BaseStudy implementation provides single-process functionality, and a RayStudy child implements the Ray functionality. CaseStudy inherits from the appropriate class automatically based on the operating system.

You can see which class is inherited as per below (this is on a Macbook)

CaseStudy.__bases__
(casestudy.see19.see19.study.ray.RayStudy,)

To use the non-Ray implementation, you can either import BaseStudy directly or set use_ray=False on CaseStudy.

We can see both approaches provide similar results below.

# from see19.study.base import BaseStudy
from casestudy.see19.see19.study.base import BaseStudy
from datetime import datetime as dt
def clockwrap(func):
    # Times a single call of `func` and returns the elapsed time as a timedelta.
    # Note that wrapper() is invoked immediately, so clockwrap(f) returns the
    # duration rather than a decorated function.
    def wrapper(*args, **kwargs):
        start = dt.now()
        func()
        end = dt.now()

        return end - start

    return wrapper()
casestudy = BaseStudy(bf)
dur1 = clockwrap(casestudy.make)
print (dur1)
/Users/spindicate/Documents/programming/envs/zooenv/lib/python3.7/site-packages/ipykernel_launcher.py:1: UserWarning: It looks like you called BaseStudy directly. This is not recommended. Ray provides significant performance improvements and certain BaseStudy methods are not optimized.
  """Entry point for launching an IPython kernel.





0:00:28.674439
casestudy = CaseStudy(bf, use_ray=False)
dur2 = clockwrap(casestudy.make)
print (dur2)
/Users/spindicate/Documents/programming/envs/zooenv/lib/python3.7/site-packages/ipykernel_launcher.py:1: UserWarning: use_ray set to False. This is not recommended. Ray provides significant performance improvements and certain BaseStudy methods are not optimized.
  """Entry point for launching an IPython kernel.





0:00:27.573194

Now we'll compare that with the default Ray implementation on an 8-core MacBook Pro.

casestudy = CaseStudy(bf)
dur3 = clockwrap(casestudy.make)
print (dur3)


0:00:06.225569
diff = 1 - dur3 / (np.mean([dur1, dur2]))
print ('You can see that the Ray implementation is \033[4m\033[1m{:.2%}\033[0m faster.'.format(diff))
You can see that the Ray implementation is 77.86% faster.

Note: Both Numba and Ray perform compilation and caching on the first call of a function. Thus, the first call to the make() method in a session will incur an additional delay (while many functions are compiled and cached). All subsequent calls will see the significant performance improvements.

4.7 Chart Objects

Each casestudy object currently contains 6 different chart objects that provide visual tools for analyzing, assessing, and comparing COVID-19's impact across different regions and factors. Each chart is created via matplotlib. Details of each chart object are provided in the sections that follow.

The chart classes can be found in the charts module, along with the BaseChart root, which provides common functionality.

compchart from CompChart2D
compchart4d from CompChart4D
heatmap from HeatMap
barcharts from BarCharts
scatterflow from ScatterFlow
substrinscat from SubStrindexScatter

Each chart has been designed to align closely with the CaseStudy functionality and with the underlying functionality of matplotlib.

For instance, each chart is called via the make method.

casestudy.regions = ['NY', 'NJ']
casestudy.make()
leg = {'fontsize': 12, 'handlelength': 1}
casestudy.compchart.make(x_category='days', y_category='cases', figsize=(8,4), legend_params=leg)
Cumulative Cases

png

Each chart object is automatically updated on each make call, so any changes to the casestudy object will also be reflected in the charts.

casestudy.regions = ['AB', 'ON']
casestudy.make()
casestudy.compchart.make(x_category='days', y_category='cases', figsize=(8,4), legend_params=leg)
Cumulative Cases

png

Note: a prior version of see19 implemented compchart using Bokeh. That chart is deprecated and has been replaced with a matplotlib version, but is still available under CompChart2DBokeh.

5. compchart - Visualizing Regional Impacts

5.1 Daily Fatalities Comparison - Italy
5.2 Daily Fatalities Comparison - 5 Most Impacted Regions
5.3 Varying the Categories

The compchart attribute is an instance of the CompChart2D class and provides standard line graphs comparing regions on the categories provided to x_category & y_category. Time-series plots are supported when x_category='date'.

Charts are available in multi-line format with optional overlay of a second factor on a separate y-axis.

5.1 Daily Fatalities Comparison - Italy

We will illustrate with an example, focusing on only the three most impacted regions in Italy.

itaregs = bf[bf['country'] == 'Italy'] \
    .sort_values(by='deaths', ascending=False).region_name.unique().tolist()[:3]

casestudy = CaseStudy(bf, regions=itaregs, start_hurdle=3, start_factor='deaths', smooth=False)
casestudy.make()

When CaseStudy is instantiated, compchart is also instantiated with its own attributes.

print (casestudy.compchart)
<casestudy.see19.see19.charts.CompChart2D object at 0x32dee3950>

In particular, all of the available categories are automatically provided labels via the labels attribute. A selection is shown below for illustration purposes.

for k,v in casestudy.compchart.labels.items():
    print ('{}: {}'.format(k, v))
    if k == 'temp':
        break
cases_dma: Cumulative Cases (3DMA)
cases_new: Daily Cases
cases_new_dma: Daily Cases (3DMA)
deaths_dma: Cumulative Deaths (3DMA)
deaths_new: Daily Deaths
deaths_new_dma: Daily Deaths (3DMA)
tests_dma: Cumulative Tests (3DMA)
tests_new: Daily Tests
tests_new_dma: Daily Tests (3DMA)
cases: Cumulative Cases
deaths: Cumulative Deaths
tests: Cumulative Tests
cases_dma_per_1K: Cumulative Cases per 1K (3DMA)
cases_dma_per_1M: Cumulative Cases per 1M (3DMA)
cases_dma_per_person_per_land_KM2: Cumulative Cases / Person / Land KM² (3DMA)
cases_dma_per_person_per_city_KM2: Cumulative Cases / Person / City KM² (3DMA)
cases_new_per_1K: Daily Cases per 1K
cases_new_per_1M: Daily Cases per 1M
cases_new_per_person_per_land_KM2: Daily Cases / Person / Land KM²
cases_new_per_person_per_city_KM2: Daily Cases / Person / City KM²
cases_new_dma_per_1K: Daily Cases per 1K (3DMA)
cases_new_dma_per_1M: Daily Cases per 1M (3DMA)
cases_new_dma_per_person_per_land_KM2: Daily Cases / Person / Land KM² (3DMA)
cases_new_dma_per_person_per_city_KM2: Daily Cases / Person / City KM² (3DMA)
deaths_dma_per_1K: Cumulative Deaths per 1K (3DMA)
deaths_dma_per_1M: Cumulative Deaths per 1M (3DMA)
deaths_dma_per_person_per_land_KM2: Cumulative Deaths / Person / Land KM² (3DMA)
deaths_dma_per_person_per_city_KM2: Cumulative Deaths / Person / City KM² (3DMA)
deaths_new_per_1K: Daily Deaths per 1K
deaths_new_per_1M: Daily Deaths per 1M
deaths_new_per_person_per_land_KM2: Daily Deaths / Person / Land KM²
deaths_new_per_person_per_city_KM2: Daily Deaths / Person / City KM²
deaths_new_dma_per_1K: Daily Deaths per 1K (3DMA)
deaths_new_dma_per_1M: Daily Deaths per 1M (3DMA)
deaths_new_dma_per_person_per_land_KM2: Daily Deaths / Person / Land KM² (3DMA)
deaths_new_dma_per_person_per_city_KM2: Daily Deaths / Person / City KM² (3DMA)
tests_dma_per_1K: Cumulative Tests per 1K (3DMA)
tests_dma_per_1M: Cumulative Tests per 1M (3DMA)
tests_dma_per_person_per_land_KM2: Cumulative Tests / Person / Land KM² (3DMA)
tests_dma_per_person_per_city_KM2: Cumulative Tests / Person / City KM² (3DMA)
tests_new_per_1K: Daily Tests per 1K
tests_new_per_1M: Daily Tests per 1M
tests_new_per_person_per_land_KM2: Daily Tests / Person / Land KM²
tests_new_per_person_per_city_KM2: Daily Tests / Person / City KM²
tests_new_dma_per_1K: Daily Tests per 1K (3DMA)
tests_new_dma_per_1M: Daily Tests per 1M (3DMA)
tests_new_dma_per_person_per_land_KM2: Daily Tests / Person / Land KM² (3DMA)
tests_new_dma_per_person_per_city_KM2: Daily Tests / Person / City KM² (3DMA)
cases_per_1K: Cumulative Cases per 1K
cases_per_1M: Cumulative Cases per 1M
cases_per_person_per_land_KM2: Cumulative Cases / Person / Land KM²
cases_per_person_per_city_KM2: Cumulative Cases / Person / City KM²
deaths_per_1K: Cumulative Deaths per 1K
deaths_per_1M: Cumulative Deaths per 1M
deaths_per_person_per_land_KM2: Cumulative Deaths / Person / Land KM²
deaths_per_person_per_city_KM2: Cumulative Deaths / Person / City KM²
tests_per_1K: Cumulative Tests per 1K
tests_per_1M: Cumulative Tests per 1M
tests_per_person_per_land_KM2: Cumulative Tests / Person / Land KM²
tests_per_person_per_city_KM2: Cumulative Tests / Person / City KM²
cases_dma_lognat: Cumulative Cases (3DMA)
(Natural Log)
cases_new_lognat: Daily Cases
(Natural Log)
cases_new_dma_lognat: Daily Cases (3DMA)
(Natural Log)
deaths_dma_lognat: Cumulative Deaths (3DMA)
(Natural Log)
deaths_new_lognat: Daily Deaths
(Natural Log)
deaths_new_dma_lognat: Daily Deaths (3DMA)
(Natural Log)
tests_dma_lognat: Cumulative Tests (3DMA)
(Natural Log)
tests_new_lognat: Daily Tests
(Natural Log)
tests_new_dma_lognat: Daily Tests (3DMA)
(Natural Log)
cases_lognat: Cumulative Cases
(Natural Log)
deaths_lognat: Cumulative Deaths
(Natural Log)
tests_lognat: Cumulative Tests
(Natural Log)
cases_dma_per_1K_lognat: Cumulative Cases per 1K (3DMA)
(Natural Log)
cases_dma_per_1M_lognat: Cumulative Cases per 1M (3DMA)
(Natural Log)
cases_dma_per_person_per_land_KM2_lognat: Cumulative Cases / Person / Land KM² (3DMA)
(Natural Log)
cases_dma_per_person_per_city_KM2_lognat: Cumulative Cases / Person / City KM² (3DMA)
(Natural Log)
cases_new_per_1K_lognat: Daily Cases per 1K
(Natural Log)
cases_new_per_1M_lognat: Daily Cases per 1M
(Natural Log)
cases_new_per_person_per_land_KM2_lognat: Daily Cases / Person / Land KM²
(Natural Log)
cases_new_per_person_per_city_KM2_lognat: Daily Cases / Person / City KM²
(Natural Log)
cases_new_dma_per_1K_lognat: Daily Cases per 1K (3DMA)
(Natural Log)
cases_new_dma_per_1M_lognat: Daily Cases per 1M (3DMA)
(Natural Log)
cases_new_dma_per_person_per_land_KM2_lognat: Daily Cases / Person / Land KM² (3DMA)
(Natural Log)
cases_new_dma_per_person_per_city_KM2_lognat: Daily Cases / Person / City KM² (3DMA)
(Natural Log)
deaths_dma_per_1K_lognat: Cumulative Deaths per 1K (3DMA)
(Natural Log)
deaths_dma_per_1M_lognat: Cumulative Deaths per 1M (3DMA)
(Natural Log)
deaths_dma_per_person_per_land_KM2_lognat: Cumulative Deaths / Person / Land KM² (3DMA)
(Natural Log)
deaths_dma_per_person_per_city_KM2_lognat: Cumulative Deaths / Person / City KM² (3DMA)
(Natural Log)
deaths_new_per_1K_lognat: Daily Deaths per 1K
(Natural Log)
deaths_new_per_1M_lognat: Daily Deaths per 1M
(Natural Log)
deaths_new_per_person_per_land_KM2_lognat: Daily Deaths / Person / Land KM²
(Natural Log)
deaths_new_per_person_per_city_KM2_lognat: Daily Deaths / Person / City KM²
(Natural Log)
deaths_new_dma_per_1K_lognat: Daily Deaths per 1K (3DMA)
(Natural Log)
deaths_new_dma_per_1M_lognat: Daily Deaths per 1M (3DMA)
(Natural Log)
deaths_new_dma_per_person_per_land_KM2_lognat: Daily Deaths / Person / Land KM² (3DMA)
(Natural Log)
deaths_new_dma_per_person_per_city_KM2_lognat: Daily Deaths / Person / City KM² (3DMA)
(Natural Log)
tests_dma_per_1K_lognat: Cumulative Tests per 1K (3DMA)
(Natural Log)
tests_dma_per_1M_lognat: Cumulative Tests per 1M (3DMA)
(Natural Log)
tests_dma_per_person_per_land_KM2_lognat: Cumulative Tests / Person / Land KM² (3DMA)
(Natural Log)
tests_dma_per_person_per_city_KM2_lognat: Cumulative Tests / Person / City KM² (3DMA)
(Natural Log)
tests_new_per_1K_lognat: Daily Tests per 1K
(Natural Log)
tests_new_per_1M_lognat: Daily Tests per 1M
(Natural Log)
tests_new_per_person_per_land_KM2_lognat: Daily Tests / Person / Land KM²
(Natural Log)
tests_new_per_person_per_city_KM2_lognat: Daily Tests / Person / City KM²
(Natural Log)
tests_new_dma_per_1K_lognat: Daily Tests per 1K (3DMA)
(Natural Log)
tests_new_dma_per_1M_lognat: Daily Tests per 1M (3DMA)
(Natural Log)
tests_new_dma_per_person_per_land_KM2_lognat: Daily Tests / Person / Land KM² (3DMA)
(Natural Log)
tests_new_dma_per_person_per_city_KM2_lognat: Daily Tests / Person / City KM² (3DMA)
(Natural Log)
cases_per_1K_lognat: Cumulative Cases per 1K
(Natural Log)
cases_per_1M_lognat: Cumulative Cases per 1M
(Natural Log)
cases_per_person_per_land_KM2_lognat: Cumulative Cases / Person / Land KM²
(Natural Log)
cases_per_person_per_city_KM2_lognat: Cumulative Cases / Person / City KM²
(Natural Log)
deaths_per_1K_lognat: Cumulative Deaths per 1K
(Natural Log)
deaths_per_1M_lognat: Cumulative Deaths per 1M
(Natural Log)
deaths_per_person_per_land_KM2_lognat: Cumulative Deaths / Person / Land KM²
(Natural Log)
deaths_per_person_per_city_KM2_lognat: Cumulative Deaths / Person / City KM²
(Natural Log)
tests_per_1K_lognat: Cumulative Tests per 1K
(Natural Log)
tests_per_1M_lognat: Cumulative Tests per 1M
(Natural Log)
tests_per_person_per_land_KM2_lognat: Cumulative Tests / Person / Land KM²
(Natural Log)
tests_per_person_per_city_KM2_lognat: Cumulative Tests / Person / City KM²
(Natural Log)
cases_dma_log: Cumulative Cases (3DMA)
(Log Base 10)
cases_new_log: Daily Cases
(Log Base 10)
cases_new_dma_log: Daily Cases (3DMA)
(Log Base 10)
deaths_dma_log: Cumulative Deaths (3DMA)
(Log Base 10)
deaths_new_log: Daily Deaths
(Log Base 10)
deaths_new_dma_log: Daily Deaths (3DMA)
(Log Base 10)
tests_dma_log: Cumulative Tests (3DMA)
(Log Base 10)
tests_new_log: Daily Tests
(Log Base 10)
tests_new_dma_log: Daily Tests (3DMA)
(Log Base 10)
cases_log: Cumulative Cases
(Log Base 10)
deaths_log: Cumulative Deaths
(Log Base 10)
tests_log: Cumulative Tests
(Log Base 10)
cases_dma_per_1K_log: Cumulative Cases per 1K (3DMA)
(Log Base 10)
cases_dma_per_1M_log: Cumulative Cases per 1M (3DMA)
(Log Base 10)
cases_dma_per_person_per_land_KM2_log: Cumulative Cases / Person / Land KM² (3DMA)
(Log Base 10)
cases_dma_per_person_per_city_KM2_log: Cumulative Cases / Person / City KM² (3DMA)
(Log Base 10)
cases_new_per_1K_log: Daily Cases per 1K
(Log Base 10)
cases_new_per_1M_log: Daily Cases per 1M
(Log Base 10)
cases_new_per_person_per_land_KM2_log: Daily Cases / Person / Land KM²
(Log Base 10)
cases_new_per_person_per_city_KM2_log: Daily Cases / Person / City KM²
(Log Base 10)
cases_new_dma_per_1K_log: Daily Cases per 1K (3DMA)
(Log Base 10)
cases_new_dma_per_1M_log: Daily Cases per 1M (3DMA)
(Log Base 10)
cases_new_dma_per_person_per_land_KM2_log: Daily Cases / Person / Land KM² (3DMA)
(Log Base 10)
cases_new_dma_per_person_per_city_KM2_log: Daily Cases / Person / City KM² (3DMA)
(Log Base 10)
deaths_dma_per_1K_log: Cumulative Deaths per 1K (3DMA)
(Log Base 10)
deaths_dma_per_1M_log: Cumulative Deaths per 1M (3DMA)
(Log Base 10)
deaths_dma_per_person_per_land_KM2_log: Cumulative Deaths / Person / Land KM² (3DMA)
(Log Base 10)
deaths_dma_per_person_per_city_KM2_log: Cumulative Deaths / Person / City KM² (3DMA)
(Log Base 10)
deaths_new_per_1K_log: Daily Deaths per 1K
(Log Base 10)
deaths_new_per_1M_log: Daily Deaths per 1M
(Log Base 10)
deaths_new_per_person_per_land_KM2_log: Daily Deaths / Person / Land KM²
(Log Base 10)
deaths_new_per_person_per_city_KM2_log: Daily Deaths / Person / City KM²
(Log Base 10)
deaths_new_dma_per_1K_log: Daily Deaths per 1K (3DMA)
(Log Base 10)
deaths_new_dma_per_1M_log: Daily Deaths per 1M (3DMA)
(Log Base 10)
deaths_new_dma_per_person_per_land_KM2_log: Daily Deaths / Person / Land KM² (3DMA)
(Log Base 10)
deaths_new_dma_per_person_per_city_KM2_log: Daily Deaths / Person / City KM² (3DMA)
(Log Base 10)
tests_dma_per_1K_log: Cumulative Tests per 1K (3DMA)
(Log Base 10)
tests_dma_per_1M_log: Cumulative Tests per 1M (3DMA)
(Log Base 10)
tests_dma_per_person_per_land_KM2_log: Cumulative Tests / Person / Land KM² (3DMA)
(Log Base 10)
tests_dma_per_person_per_city_KM2_log: Cumulative Tests / Person / City KM² (3DMA)
(Log Base 10)
tests_new_per_1K_log: Daily Tests per 1K
(Log Base 10)
tests_new_per_1M_log: Daily Tests per 1M
(Log Base 10)
tests_new_per_person_per_land_KM2_log: Daily Tests / Person / Land KM²
(Log Base 10)
tests_new_per_person_per_city_KM2_log: Daily Tests / Person / City KM²
(Log Base 10)
tests_new_dma_per_1K_log: Daily Tests per 1K (3DMA)
(Log Base 10)
tests_new_dma_per_1M_log: Daily Tests per 1M (3DMA)
(Log Base 10)
tests_new_dma_per_person_per_land_KM2_log: Daily Tests / Person / Land KM² (3DMA)
(Log Base 10)
tests_new_dma_per_person_per_city_KM2_log: Daily Tests / Person / City KM² (3DMA)
(Log Base 10)
cases_per_1K_log: Cumulative Cases per 1K
(Log Base 10)
cases_per_1M_log: Cumulative Cases per 1M
(Log Base 10)
cases_per_person_per_land_KM2_log: Cumulative Cases / Person / Land KM²
(Log Base 10)
cases_per_person_per_city_KM2_log: Cumulative Cases / Person / City KM²
(Log Base 10)
deaths_per_1K_log: Cumulative Deaths per 1K
(Log Base 10)
deaths_per_1M_log: Cumulative Deaths per 1M
(Log Base 10)
deaths_per_person_per_land_KM2_log: Cumulative Deaths / Person / Land KM²
(Log Base 10)
deaths_per_person_per_city_KM2_log: Cumulative Deaths / Person / City KM²
(Log Base 10)
tests_per_1K_log: Cumulative Tests per 1K
(Log Base 10)
tests_per_1M_log: Cumulative Tests per 1M
(Log Base 10)
tests_per_person_per_land_KM2_log: Cumulative Tests / Person / Land KM²
(Log Base 10)
tests_per_person_per_city_KM2_log: Cumulative Tests / Person / City KM²
(Log Base 10)
: January 2020
population: Population
land_dens: Density of Land Area
city_dens: Population Density of Largest City
uvb: UV-B Radiation in J / M²
rhum: Relative Humidity
strindex: Oxford Stringency Index
visitors: Annual Visitors
visitors_%: Annual Visitors as % of Population
gdp: Gross Domestic Product
gdp_%: Gross Domestic Product per Capita
retail_n_rec: Change in Retail n Recreation Mobility
transit: Change in Transit Mobility
workplaces: Change in WorkPlace Mobility
residential: Change in Residential Mobility
parks: Change in Parks Mobility
groc_n_pharm: Change in Grocery & Pharmacy Mobility
transit_apple: Change in Transit Mobility - Apple
driving_apple: Change in Driving Mobility - Apple
walking_apple: Change in Walking Mobility - Apple
c1: School Closing
c2: Workplace Closing
c3: Cancel Public Events
c4: Restrictions on Gatherings
c5: Close Public Transport
c6: Stay-at-Home Requirements
c7: Restrictions on Internal Movement
c8: International Travel Controls
e1: Income Support
e2: Debt / Contract Relief
e3: Fiscal Measures
e4: International Support
h1: Public Information Campaigns
h2: Testing Policy
h3: Contact Tracing
h4: Emergency Investment in Health Care
h5: Investment in Vaccines
key3_sum: Sum of Key 3 Categories
key3_sum_earlier: Sum of Key 3 Oxford Stingency Factor Weighted to Earlier Dates
make_sum: Custom Stringency Aggregate
neoplasms: NeoPlasms Fatalities
blood: Blood-based Fatalities
endo: Endocrine Fatalities
mental: Mental Fatalities
nervous: Nervous System Fatalities
circul: Circulatory Fatalities
infectious: Infectious Fatalities
respir: Respiratory Fatalities
digest: Digestive Fatalities
skin: Skin-related Fatalities
musculo: Musculo-skeletal Fatalities
genito: Genitourinary Fatalities
childbirth: Maternal and Childbirth Fatalities
perinatal: Perinatal Fatalities
congenital: Congenital Fatalities
other: Other Fatalities
external: External Fatalities
date: Date
temp: Temperature (°C)

make()

Similar to the main casestudy object, charts are rendered with the make method.

x_category and y_category accept any column header in casestudy.df.

make accepts many optional kwargs. Every effort is made to align these options with matplotlib standards; appropriate options can be found via the matplotlib API. These kwargs, and many others, are shared among ALL of the different see19 Chart Classes. For example:

kwargs = {
    'x_category': 'days',
    'y_category': 'cases_new',
    'width': 12,
    'height': 8,
    'title': {'t': 'Most Impacted Regions in Italy', 'fontsize': 24, 'weight': 'demi'},
    'line_params': {'lw': 4},
    'legend_params': {'fontsize': 14, 'handlelength': 1},
    'xlabel_params': {'fontsize': 18, 'labelpad': 10},
    'ylabel_params': {'fontsize': 18, 'labelpad': 10},
    'xtick_params': {'labelsize': 14},
    'ytick_params': {'labelsize': 14},
    'colors': ['red', 'green', 'blue']
}

casestudy.compchart.make(**kwargs)
Daily Cases

png

An optional regions parameter exists that allows you to further reduce the number of regions presented in the chart. regions accepts a list of region_id, region_code, or region_name in any combination.

Below, we also show that a matplotlib colormap can be provided via palette_base and that the x-axis label can be removed by setting xlabel=False.

kwargs = {
    'regions': ['LOM', 'EMI'],
    'x_category': 'date',
    'y_category': 'deaths_new',
    'width': 12,
    'height': 8,
    'title': {'t': 'Lombardia v Emilia-Romagna', 'fontsize': 24, 'weight': 'demi'},
    'line_params': {'lw': 6},
    'legend_params': {'fontsize': 14, 'handlelength': 1},
    'xlabel': False,
    'ylabel_params': {'fontsize': 18, 'labelpad': 10},
    'xtick_params': {'labelsize': 14},
    'ytick_params': {'labelsize': 14},
    'palette_base': 'Accent',
}

casestudy.compchart.make(**kwargs)
Daily Deaths

png

5.2 Daily Fatalities Comparison - 5 Most Impacted Regions

Now we'll look at daily fatalities in the 5 most impacted regions globally in terms of total fatalities.

regions = list(bf.sort_values(by='deaths', ascending=False).region_name.unique())[:5]
casestudy = CaseStudy(bf, regions=regions, start_hurdle=3, start_factor='deaths', count_dma=21, log=True)
casestudy.make()
title='5 Most Impacted Regions'

kwargs = {
    'x_category': 'days',
    'y_category': 'deaths_new',
    'width': 12,
    'height': 8,
    'title': {'t': title, 'fontsize': 24, 'weight': 'demi'},
    'line_params': {'lw': 3},
    'legend_params': {'fontsize': 14},
    'xlabel_params': {'fontsize': 18, 'labelpad': 10},
    'ylabel_params': {'fontsize': 18, 'labelpad': 10},
    'xtick_params': {'labelsize': 14},
    'ytick_params': {'labelsize': 14},
    'palette_base': 'Accent',
}
p = casestudy.compchart.make(**kwargs)
Daily Deaths

png

There are major outliers, particularly in the early days, that make the graph difficult to read. The log-adjusted categories come in handy here.

Below we also demonstrate that the regions parameter can be provided to each make call to further reduce the regions covered in the chart (for convenience).

kwargs['y_category']= 'deaths_new_dma_per_1M_log'
kwargs['ylabel_params']= {'fontsize': 18, 'labelpad': 10}
kwargs['regions'] = ['France', 'India', 'United Kingdom']

p = casestudy.compchart.make(**kwargs)
Daily Deaths per 1M (21DMA)
(Log Base 10)

png

5.3 Varying the Categories

Oxford Stringency Index

compchart can be used to compare any category or factor in casestudy.df with days or date on the x-axis.

The below chart compares the Oxford Stringency Index for each selected region.

regions = ['Germany', 'Spain', 'Taiwan']

casestudy = CaseStudy(
    bf, count_categories='cases_new_per_1M', regions=regions, 
    start_factor='', factors=['strindex']
)
casestudy.make()
kwargs = {
    'x_category': 'date',
    'y_category': 'strindex',
    'width': 12,
    'height': 8,
    'line_params': {'lw': 3},
    'legend_params': {'fontsize': 14},
    'xlabel_params': {'fontsize': 18, 'labelpad': 10},
    'ylabel_params': {'fontsize': 18, 'labelpad': 10},
    'xtick_params': {'labelsize': 14},
    'ytick_params': {'labelsize': 14},
    'palette_base': 'Accent',
}
p = casestudy.compchart.make(**kwargs)
Oxford Stringency Index

png

These graphs work best as time-series but the x_category can also be any other category in casestudy.df. Below we can see that in New York, positive cases have steadily declined even as testing has increased. Texas and Arizona have not had the same success.

regions = ['New York', 'Texas', 'Arizona']

casestudy = CaseStudy(bf, regions=regions, count_dma=21)
casestudy.make()
kwargs = {
    'x_category': 'tests_new_dma_per_1M',
    'y_category': 'cases_new_dma_per_1M',
    'width': 12,
    'height': 8,
    'line_params': {'lw': 3},
    'legend_params': {'fontsize': 14},
    'xlabel_params': {'fontsize': 18, 'labelpad': 10},
    'ylabel_params': {'fontsize': 18, 'labelpad': 10},
    'xtick_params': {'labelsize': 14},
    'ytick_params': {'labelsize': 14},
    'palette_base': 'Accent',
}
p = casestudy.compchart.make(**kwargs)
Daily Cases per 1M (21DMA)

png

Saving Files

All chart instances in see19 have a save_file option. Simply set that option to True and provide a filename, and the file will be saved to your location of choice.
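
For example (the exact keyword for the output name, shown here as filename, is an assumption rather than a verified part of the API):

kwargs = {
    'x_category': 'days',
    'y_category': 'cases',
    'save_file': True,
    'filename': 'italy_daily_cases.png',   # hypothetical output path
}
casestudy.compchart.make(**kwargs)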

6. compchart4D - Visualizing Factors in 4D

6.1 From 3D to 4D
6.2 More on the X-Axis
6.3 How Far Can We Take It?

3D charts with color-mapping can be used to explore the impact of various factors in different regions at different times.

Such '4D' maps are often criticized for lack of readability, but they have been a valuable tool for recognizing patterns.

These charts are available in CaseStudy via the compchart4d attribute, which is an instance of the CompChart4D class. The 3D representation shows the count_category for each region on the z-axis, with each day from the start_hurdle on the y-axis and the individual regions separated along the x-axis.

The 3D chart is a cute trick, but the real power is derived from the color mapping (the color_category parameter), which maps the color of each 3D bar to the factor one wants to investigate.

The CompChart4D object utilizes matplotlib for chart creation.
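
The underlying idea can be sketched with plain matplotlib (a generic illustration with random data, not the CompChart4D implementation): 3D bars whose heights carry the count category and whose colors carry a fourth variable.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D  # noqa: registers the 3d projection

n_regions, n_days = 5, 30
z = np.random.rand(n_regions, n_days).cumsum(axis=1)       # stand-in for a count category
factor = np.random.uniform(15, 30, (n_regions, n_days))    # stand-in for e.g. temperature

xs, ys = np.meshgrid(np.arange(n_regions), np.arange(n_days), indexing='ij')
norm = plt.Normalize(factor.min(), factor.max())
colors = cm.coolwarm(norm(factor.ravel()))                  # map the 4th dimension to color

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.bar3d(xs.ravel(), ys.ravel(), np.zeros(z.size), 0.8, 0.8, z.ravel(), color=colors)
sm = cm.ScalarMappable(norm=norm, cmap='coolwarm')
sm.set_array([])
fig.colorbar(sm, ax=ax, label='factor')
plt.show()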

6.1 From 3D to 4D

Most Impacted Regions - Brazil

First, we get region names from the baseframe, sorting as required.

Then we create the casestudy instance, including several factors that we'll cover in our analysis.

from casestudy.see19.see19 import CaseStudy
regions = bf[bf['country'] == 'Brazil'] \
    .sort_values(by='population', ascending=False) \
    .region_name.unique().tolist()[:20]

factor_dmas={'temp': 3}

casestudy = CaseStudy(
    bf, count_dma=5, 
    factors=['temp', 'c1', 'A65PLUSB', 'A75PLUSB'], factor_dmas=factor_dmas,
    regions=regions, start_hurdle=10, start_factor='cases', lognat=True,
)
casestudy.make()

4D charts are customizable in precisely the same way as CompChart2D, sharing many of the same keywords. compchart4d also utilizes a few unique keywords of its own, as shown below:

  • z_category is utilized to determine the z-axis (vertical). The x- and y-axes are automatically set to regions and days.
  • comp_size will further trim the number of regions by ranking them on the comp_category.
  • A separate rank_category can be provided for this ranking if preferred.
kwargs = {
    'title': {'s': 'Most Impacted Regions in Brazil', 'x': .47, 'y': .74, 'fontsize': 24, 'rotation': -9, 'weight': 'demi'},
    'ylabel_params': {'fontsize': 18, 'labelpad': 12},
    'zlabel_params': {'fontsize': 18, 'labelpad': 10},
    'xtick_params': {'labelsize': 18},
    'ytick_params': {'labelsize': 12},
    'tight': True, 'comp_size': 10,
}
p = casestudy.compchart4d.make(z_category='deaths_new_dma_per_1M', **kwargs)

png

df_chart: for most charts, the casestudy dataframe is morphed for presentation purposes. This morphed data is available via the df_chart attribute.

casestudy.compchart4d.df_chart.head()
region_id region_name region_code country date days deaths_new_dma_per_1M
10585 566 Ceara CE Brazil 2020-03-22 6 days 0.000000
10586 566 Ceara CE Brazil 2020-03-23 7 days 0.000000
10587 566 Ceara CE Brazil 2020-03-24 8 days 0.000000
10588 566 Ceara CE Brazil 2020-03-25 9 days 0.000000
10589 566 Ceara CE Brazil 2020-03-26 10 days 0.169566

Adding a Color Factor

By adding the color_category parameter, we can see the impact, if any, of an exogenous factor on the comp_category over time.

We will start with A65PLUSB_%. As this is a time-static factor, the color for each region will be the same regardless of the day.

You must provide additional options to position the color bar.

kwargs = {
    **kwargs,
    'color_category': 'A65PLUSB_%', 
    'xy_cbar': (0.09, .225), 'wh_cbar': (.015, 14),
    'cblabel_params': {'labelpad': -55},
}
p = casestudy.compchart4d.make(z_category='deaths_new_dma_per_1M', **kwargs)

png

Now we'll use temp, which is a time-dynamic factor and will provide a different color for each region on each day.

kwargs = {**kwargs, 
    'color_category': 'temp',
}
p = casestudy.compchart4d.make(z_category='deaths_new_dma_per_1M', **kwargs)

png

Fixing the Color Range

NOTE: The range of colors is automatically set by make. This can be somewhat misleading when:

  1. comparing multiple charts
  2. a single chart has values in a narrow range. In the above example, for instance, temperatures range only between 18°C and 28°C and, yet, the color map runs across almost the entire red-blue spectrum.

Thus, there is a color_interval option that allows you to fix the color interval. color_interval expects a tuple, where the first item is the low-end of the range and the second item is the high-end.

Fixing the color interval provides a very different picture of Brazil's impacted regions.

kwargs = {**kwargs, 'color_interval': (20,30)}
p = casestudy.compchart4d.make(z_category='deaths_new_dma_per_1M', **kwargs)

png

6.2 More on the X-Axis

Top 30 US States

Now we investigate the Top 30 most impacted US states.

regions = bf[bf['country_code'] == 'USA'] \
    .sort_values('cases', ascending=False) \
    .region_name.unique().tolist()[:50]
countries = 'USA'
casestudy = CaseStudy(
    bf, regions=regions, countries=countries, count_dma=14,
    factors=['temp', 'uvb', 'rhum', 'A65PLUSB', 'A75PLUSB', 'A05_24B'], factor_dmas={'temp': 14, 'uvb': 14},
    start_hurdle=10, start_factor='cases', 
)
casestudy.make()

Here 4 charts are prepared in quick succession.

Additional options are shown for editing the background grey and removing gridlines.

NOTE: CompChart4D automatically sorts the regions on the x-axis such that the regions with the greatest z-axis values are furthest away. This improves readability.

kwargs = {
    'regions': '',
    'ylabel_params': {'fontsize': 18, 'labelpad': 12},
    'zlabel_params': {'fontsize': 18, 'labelpad': 10},
    'xtick_params': {'labelsize': 12},
    'ytick_params': {'labelsize': 12},
    'ztick_params': {'labelsize': 12},
    'xy_cbar': (0.09, .225), 'wh_cbar': (.01, 20),
    'title': {'s': 'Most Impacted States in US', 'x': .47, 'y': .74, 'fontsize': 24, 'rotation': -9, 'weight': 'demi'},
    'cblabel_params': {'labelpad': -55},
    'color_category': 'temp_dma', 'color_interval': (20,30),
    'tight': True,
    'comp_size': 30,
    'rank_category': 'deaths_new_dma_per_1M',    
}

p = casestudy.compchart4d.make(z_category='deaths_new_dma_per_1M', **kwargs)
p = casestudy.compchart4d.make(z_category='deaths_new_dma_per_person_per_city_KM2', **kwargs)

kwargs['color_category'] = 'uvb_dma'
kwargs['color_interval'] = ()
kwargs['gridlines'] = False

p = casestudy.compchart4d.make(z_category='deaths_new_dma_per_1M', **kwargs)
p = casestudy.compchart4d.make(z_category='deaths_new_dma_per_person_per_city_KM2', **kwargs)

png

png

png

png

6.3 How Far Can We Take It?

101 Most Impacted Regions Globally

I acknowledge that using the chart in this way stretches its value; however, it has been a great way for me to consider trends globally. Try not to look at each individual region ... look at it more like a scatter plot and see what patterns you can identify, if any.

NOTE: If the number of regions exceeds 100, the region labels are removed automatically.

First, we sort the regions in the baseframe to find the 101 most populous.

Then, those regions are ranked on the comp_category.

compsize = 102
regions = bf[~(bf['country'] == 'China')].sort_values(by='population', ascending=False).region_name.unique().tolist()[:compsize]

factors = ['temp']
factor_dmas = {'temp': 7}

casestudy = CaseStudy(
    bf, regions=regions, factors=factors, factor_dmas=factor_dmas,
    start_hurdle=10, start_factor='cases', count_dma=3, lognat=True
)
casestudy.make()
kwargs = {
    'ylabel_params': {'fontsize': 18, 'labelpad': 12},
    'zlabel_params': {'fontsize': 18, 'labelpad': 10},
    'xtick_params': {'labelsize': 12},
    'ytick_params': {'labelsize': 12},
    'ztick_params': {'labelsize': 12},
    'xy_cbar': (0.09, .225), 'wh_cbar': (.01, 20),
    'title': {'s': 'Most Impacted Regions Totally', 'x': .47, 'y': .74, 'fontsize': 24, 'rotation': -9, 'weight': 'demi'},
    'cblabel_params': {'labelpad': -55},
    'color_category': 'temp_dma', 'color_interval': (20,30),
    'tight': True,
    'comp_size': 102,
    'rank_category': 'deaths_new_dma_per_1M', 
}

p = casestudy.compchart4d.make(z_category='deaths_new_dma_per_1M', **kwargs)

png

Now, if temperature did for some reason impact the fatality rate associated with COVID19, we would expect regions at the far end of the x-axis to tend toward the blue end of the color spectrum and regions at the near end of the x-axis to tend towards red.

We would also expect regions with higher peaks to have more blue bars at the near end of the y-axis, i.e. at times earlier in the outbreak.

7. heatmap - Visualizing with Color Maps

7.1 Count Category v Single Factor
7.2 Count Category v Multiple Factors

Hexbins?

See19 utilizes matplotlib's hexbin functionality to generate HeatMap-style charts to investigate the impact of different factors on COVID19 virulence.

This is a bit of a repurposing, or bastardization, of hexbin's intended usage. hexbin is more commonly used as a 2D histogram for very large datasets, counting the appearance of datapoints within a range of certain (x, y) coordinates (called bins) and then mapping a color scheme to the range of counts.

For our purposes, the use of hexbin is a stylistic choice, as the patterns it develops are more interesting and a bit more revealing than a scatter plot. The intention is for each bin to contain only one datapoint, with the color mapped to either the x-axis values or a 3rd dimension of values.
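
A generic matplotlib sketch of that repurposing (hypothetical data and parameters, not the see19 internals):

import numpy as np
import matplotlib.pyplot as plt

# one point per region (hypothetical values)
x = np.random.lognormal(size=200)    # e.g. max daily fatalities per 1M
y = np.random.uniform(10, 35, 200)   # e.g. average temperature
c = np.random.uniform(0, 100, 200)   # e.g. stringency index

fig, ax = plt.subplots()
# gridsize is set high enough that most bins hold a single point;
# C colors each bin by the third variable instead of by point count
hb = ax.hexbin(x, y, C=c, gridsize=40, cmap='RdPu')
fig.colorbar(hb, ax=ax, label='third dimension')
plt.show()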

Structure

As with previous charts, heatmaps are available in CaseStudy via the heatmap attribute, which is in turn an instance of the HeatMap class.

Charts are generated via the make method, which further morphs casestudy.df to arrange data for visualization.

Average over Time v Daily Points

All of the analysis to this point has considered each daily datapoint for each region separately. heatmap is different. heatmap takes (at this point) a simple mean of the x_category and y_category in question. This is a sufficient method to explore potential relationships, but true time series analysis must also be considered to project COVID19 virulence forward.

While an average is used, the timing of that average can still have an impact on the relevance of the analysis. At this stage, heatmap can take the daily moving average from the date of the peak of the x_category or from the date the region clears the start_hurdle.

This option is controlled by the x_start and color_start parameters of the make method.
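
To make the reduction concrete, here is a rough pandas sketch of the idea, assuming a tidy casestudy.df; this is illustrative only and not the HeatMap internals:

import pandas as pd

def one_point_per_region(df, x_cat, y_cat, y_start='max'):
    rows = []
    for region, grp in df.groupby('region_name'):
        grp = grp.sort_values('date').reset_index(drop=True)
        peak = grp[x_cat].idxmax()                  # day the x_category peaks
        anchor = peak if y_start == 'max' else 0    # or the day the start_hurdle was cleared
        rows.append({
            'region_name': region,
            x_cat: grp.loc[peak, x_cat],            # the peak value itself
            y_cat: grp.loc[anchor, y_cat],          # the (moving-average) factor on that day
        })
    return pd.DataFrame(rows)

# e.g. one_point_per_region(casestudy.df, 'deaths_new_dma_per_1M', 'temp_dma')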

For this analysis we need a large dataset, so we will start with the top 250 regions in terms of population and add many different factors.

excluded_countries = ['China']
excluded_regions = []

frame_filter = (~bf['country'].isin(excluded_countries)) & (~bf['region_name'].isin(excluded_regions))
regions = bf[frame_filter] \
    .sort_values('population', ascending=False) \
    .region_name.unique().tolist()[:250]

factors_with_dmas = CaseStudy.MSMTS + ['strindex']
factor_dmas = {factor: 28 for factor in factors_with_dmas}
factor_dmas['strindex'] = 14
factors = factors_with_dmas + CaseStudy.MAJOR_CAUSES + ['visitors', 'A75PLUSB', 'A65PLUSB', 'gdp']

casestudy = CaseStudy(
    bf, regions=regions, count_dma=14, factors=factors, 
    factor_dmas=factor_dmas, start_hurdle=1, start_factor='deaths', log=True, lognat=True,
)
casestudy.make()

7.1 Count Category v Single Factor

heatmap takes a similar set of options as compchart and compchart4d. The biggest difference in approach relates to text annotations:
  • In compchart and compchart4d, specific parameters for title, subtitle, etc. generate text boxes for specific purposes.
  • In heatmap, this is replaced in favor of a more flexible approach of ad-hoc text annotations via the annotations parameter.
  • heatmap has tended to require lengthier notations / explanations, so this approach seemed more appropriate.

In addition to the standard count categories, heatmap's axes can also plot factors: the x_category and y_category parameters accept count categories and factor columns alike.

The below chart is plotted on a linear scale of daily fatalities. It hints at a potential relationship between fatalities and temperature for the most impacted regions; however, the scaling is negatively impacted by a handful of outliers.

NOTE: color_category is not provided; therefore, the color map is a function of the x_category values (on the x-axis).

Max Fatalities v Temperature

title = 'Max Daily Fatalities v Temperature by Region'
subtitle = '*Average temperature for two weeks prior to day of 3rd fatality'
note = '**{} Regions considered excluding mainland China'.format(casestudy.df.region_id.unique().shape[0])
kwargs = {
    'x_category': 'deaths_new_dma_per_1M',
    'y_category': 'temp_dma',
    'annotations': [
        [0, 1.09, title, {'color': 'black', 'fontsize': 16, 'ha': 'left', 'va': 'center',}],
        [0, 1.05, subtitle, {'color': 'black', 'fontsize': 12, 'ha': 'left', 'va': 'center', 'style': 'italic'}],
        [0, 1.01, note, {'color': 'black', 'fontsize': 12, 'ha': 'left', 'va': 'center', 'style': 'italic'}],
    ],
    'xtick_params': {'size': 12},
    'ytick_params': {'size': 12},
    'xlabel_params': {'size': 12},
    'ylabel_params': {'size': 16},
    'width': 12, 'height': 8,
}
plt = casestudy.heatmap.make(**kwargs)

png

The root data for the chart is available via the df_chart attribute.

casestudy.heatmap.df_chart.head()
region_id region_name temp_dma deaths_new_dma_per_1M
9 52 Idaho 20.192015 0.428860
69 312 Bahrain 33.111273 0.274820
48 98 Nebraska 26.321220 0.240344
214 563 Mato Grosso Do Sul 23.137148 0.224056
219 568 Sergipe 26.239815 0.215220

Natural Log of Max Fatalities v Temperature

By taking the natural log of the fatality rate, we can rescale the figure to reveal a (potentially) clearer relationship.

Viewers often struggle to understand the scaling of a natural log, so an hlines option has been provided that will create horizontal lines at the y-values input. hlines requires a list of y-values.

Text annotations are then included to inform of the unscaled comp_category value at each hline.

We also make use of the x_start parameter, which determines whether the 28DMA is taken on the day the start_hurdle is cleared or on the day of the peak fatality rate for each region.

title = 'Max Daily Fatalities v Temperature by Region'
kwargs = {
    'x_category': 'deaths_new_dma_per_1M_log',
    'y_category': 'temp_dma',
    'x_start': 'start_hurdle',
    'annotations': [
        [0, 1.09, title, {'color': 'black', 'fontsize': 16, 'ha': 'left', 'va': 'center',}],
        [0, 1.05, subtitle, {'color': 'black', 'fontsize': 12, 'ha': 'left', 'va': 'center', 'style': 'italic'}],
        [0, 1.01, note, {'color': 'black', 'fontsize': 12, 'ha': 'left', 'va': 'center', 'style': 'italic'}],
    ],
    'xtick_params': {'size': 12},
    'ytick_params': {'size': 12},
    'xlabel_params': {'size': 12, 'labelpad': 10},
    'ylabel_params': {'size': 16},
    'width': 12, 'height': 8,
}
plt = casestudy.heatmap.make(**kwargs)

png

As with the other chart instances, a chart-specific dataframe can be accessed for heatmap via the df_chart attribute.

casestudy.heatmap.df_chart.head(4)
region_id region_name temp_dma deaths_new_dma_per_1M_log
9 52 Idaho 20.192015 -0.367684
69 312 Bahrain 33.111273 -0.560952
48 98 Nebraska 26.321220 -0.619168
214 563 Mato Grosso Do Sul 23.137148 -0.649644

Lognat of Max Daily New Fatalities and UVB Radiation

title = 'Max Daily Fatalities v UVB Radiation by Region'
subtitle = '*Color-mapped by average daily uvb radiation for two weeks prior to the day of max fatalities'
kwargs = {
    'x_category': 'cases_new_dma_per_person_per_city_KM2_log',
    'y_category': 'uvb_dma',
    'x_start': 'max',
    'annotations': [
        [0, 1.09,  title, {'color': 'black', 'fontsize': 16, 'ha': 'left', 'va': 'center',}],
        [0, 1.05, subtitle, {'color': 'black', 'fontsize': 12, 'ha': 'left', 'va': 'center', 'style': 'italic'}],
    ],
    'xtick_params': {'size': 12},
    'ytick_params': {'size': 12},
    'xlabel_params': {'size': 12, 'labelpad': 10},
    'ylabel_params': {'size': 16},
    'width': 12, 'height': 8,
}
plt = casestudy.heatmap.make(**kwargs)

png

7.2 Count Category v Multiple Factors (with one factor color-mapped)

The heatmap is made all the more powerful when a second factor is used to map the color space of the chart.

This is done via the color_category parameter, which can be adapted via the color_start parameter so that the average is taken on the day the start_hurdle is cleared or on the day of the max count category.

title = 'Max Daily Fatalities v UVB Radiation v Oxford Stringency Index'
subtitle = '*Average UVB radiation and Oxford Stringency Index for two weeks prior to day of 1st fatality'
kwargs = {
    'x_category': 'cases_new_dma_per_1M_lognat',
    'color_category': 'strindex_dma',
    'color_start': 'start_hurdle',
    'y_category': 'uvb_dma',
    'annotations': [
        [0, 1.09, title, {'color': 'black', 'fontsize': 16, 'ha': 'left', 'va': 'center',}],
        [0, 1.05, subtitle, {'color': 'black', 'fontsize': 12, 'ha': 'left', 'va': 'center', 'style': 'italic'}],
    ],
    'xtick_params': {'size': 12},
    'ytick_params': {'size': 12},
    'xlabel_params': {'size': 12, 'labelpad': 10},
    'ylabel_params': {'size': 16},
    'width': 12, 'height': 8,
}
plt = casestudy.heatmap.make(**kwargs)

png

The heatmap approach is even better suited to time-static variables like demographic age ranges, given they are not susceptible to issues around averages over time.

Below we compare A75PLUSB_% against the average strindex for the 14 days prior to the max fatality rate.

We can see that social distancing stringency was quite common across the spectrum and that population age was a much more important variable impacting fatalities.

title = 'Max Daily Fatalities v UVB Radiation v Oxford Stringency Index'
subtitle = '*Average UVB radiation and Oxford Stringency Index for two weeks prior to day of 1st fatality'
note = '**Excludes mainland China'

kwargs = {
    'x_category': 'deaths_new_dma_per_person_per_city_KM2_lognat',
    'y_category': 'A75PLUSB_%',
    'color_category': 'strindex_dma',
    'color_start': 'max',
    'annotations': [
        [0, 1.095, title, {'color': 'black', 'fontsize': 16, 'ha': 'left', 'va': 'center',}],
        [0, 1.055, subtitle, {'color': 'black', 'fontsize': 12, 'ha': 'left', 'va': 'center', 'style': 'italic'}],
        [0, 1.015, note, {'color': 'black', 'fontsize': 12, 'ha': 'left', 'va': 'center', 'style': 'italic'}],
    ],
    'xtick_params': {'size': 12},
    'ytick_params': {'size': 12},
    'xlabel_params': {'size': 12, 'labelpad': 10},
    'ylabel_params': {'size': 16},
    'width': 12, 'height': 8,
}
plt = casestudy.heatmap.make(**kwargs)

png

8. barcharts - Comparing Regional Factors

A barcharts attribute is available (via the BarCharts class) as another handy feature for comparing the impact in different regions across different categories.

The object plots a single category on a single plot comparing multiple regions. You can provide multiple categories and multiple subplots will be returned!

The barcharts object utilizes matplotlib.

First, instantiate the casestudy. We will consider several of the more successful Asian regions alongside some notable hotspots.

dragons = ['Hong Kong', 'Taiwan', 'Korea, South', 'Japan']
notables = [ 'Texas', 'New York', 'Lombardia', 'Sao Paulo']
regions = notables + dragons

factors_with_dmas = ['uvb', 'temp'] + CaseStudy.STRINDEX_CATS
factor_dmas = {factor: 28 for factor in factors_with_dmas}
mobi_dmas = {'transit': 28, 'retail_n_rec': 28, 'parks': 28, 'workplaces': 28}
factors = factors_with_dmas + CaseStudy.GMOBIS + ['A15_34B', 'A65PLUSB'] \
    + ['visitors', 'gdp'] + CaseStudy.MAJOR_CAUSES

casestudy = CaseStudy(
    bf, regions=regions, count_dma=21, factors=factors, factor_dmas=factor_dmas, 
    mobi_dmas=mobi_dmas, start_hurdle=1, start_factor='deaths',
    favor_earlier=True, factors_to_favor_earlier='key3_sum',
)
casestudy.make()

barcharts accepts any category in the see19 dataset. bar_colors provides different coloring of the groups in the chart, and you can further highlight certain regions via feature_regions. Below we see a stark difference among the regions selected.

factors1 = ['cases_per_1M', 'deaths_per_1M']
kwargs = {'categories': factors1, 'height': 5, 'bar_colors': ['#3D7068', '#D4AFB9', '#529FD7']}
kwargs['feature_regions'] = ['HKG', 'TWN', 'KOR']
plt = casestudy.barcharts.make(**kwargs)

png

Once again, the chart data is available via df_chart:

casestudy.barcharts.df_chart
region_code NY SP LOM TX JPN KOR HKG TWN
region_id 75 556 36 67 429 433 353 497
region_code NY SP LOM TX JPN KOR HKG TWN
cases 407326 416434 95548 332434 25706 13816 1655 451
deaths 25056 19788 16796 4020 988 296 10 7
tests 5.16481e+06 1.15885e+06 724365 2.98455e+06 639821 1.44335e+06 442256 79506
population 1.93781e+07 4.1142e+07 9.63118e+06 2.51456e+07 1.28057e+08 4.79908e+07 7.02728e+06 2.25314e+07
city_dens 13978.1 8184.1 2316.88 924.007 8440.43 5032.81 9261.85 7919.49
cases_per_1M 21019.9 10121.9 9920.7 13220.4 200.738 287.889 235.511 20.0165
deaths_per_1M 1293.01 480.969 1743.92 159.869 7.71529 6.16785 1.42303 0.310678

barcharts can compare daily case and fatality rates. When a daily figure is selected, barcharts will find the maximum value in the time-series.
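
Roughly, that reduction works like this quick pandas sketch (not the BarCharts internals; column names assumed from casestudy.df):

max_by_region = (
    casestudy.df
    .groupby('region_name')['deaths_new_dma_per_1M']
    .max()                           # the peak of the daily series for each region
    .sort_values(ascending=False)
)
print(max_by_region.head())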

factors2 = ['deaths_new_dma_per_1M', 'deaths_new_dma_per_person_per_city_KM2']
kwargs = {'categories': factors2, 'height': 5, 'bar_colors': ['#3D7068', '#D4AFB9', '#529FD7']}
kwargs['feature_regions'] = ['HKG', 'TWN', 'KOR']
plt = casestudy.barcharts.make(**kwargs)

png

As a matter of convenience, barcharts will automatically structure a subplot grid for any number of categories greater than 2.

factors = [
    'strindex_dma', 'tests_new_dma_per_1M', 
    'population', 'city_dens', 
    'A15_34B_%', 'A65PLUSB_%', 
    'temp_dma', 'uvb_dma',
    'circul_%', 'endo_%',
    'visitors_%'
]
factors = factors1 + factors2 + factors
kwargs = {'categories': factors, 'height': 50, 'bar_colors': ['#3D7068', '#D4AFB9', '#529FD7']}
kwargs['title'] = {'t': 'COVID Dragons v Other Regions', 'y': .895, 'fontsize': 20, 'fontweight': 'demi'}
kwargs['feature_regions'] = ['HKG', 'TWN', 'KOR']
plt = casestudy.barcharts.make(**kwargs)

png

9. Scatterflow for Large Sets

9.1 SubStrindexScatter
9.2 ScatterFlow

The plots explored above have limitations when investigating a large set of subjects. Multi-line plots tend to become unreadable when using more than, say, 5 lines, and bar charts have dimensionality limitations, etc.

The scatterflow and substrinscat charts were created to improve visualization in this case.

9.1 substrinscat - for Strindex Sub-Categories

We will start with substrinscat, which is a more specific case of a scatterflow that focuses on the Oxford Stringency Index (you can think of it as being short for "Sub-Strindex Category Scatterflow").

We can generate a single substrinscat for one region that shows each stringency indicator. The value of the indicator is denoted by the color at each point.

The strindex and its subcategories are tracked at the country level, so we will instantiate a casestudy setting the country_level flag to True. This aggregates all the see19 data up from the province/state level to the country level (where province/state data exists). As previously noted, smoothing is not available when country_level=True.

NOTE: we will also instantiate with start_factor=''. This creates a dataset beginning on 2020-01-01.

factors = CaseStudy.STRINDEX_CATS
factor_dmas = {factor: 28 for factor in factors}

countries = ['United States of America (the)', 'Canada', 'Mexico', 'Brazil', 'Australia', 'Russia',
 'Italy', 'Germany', 'Spain', 'Singapore', 'Japan', 'Hong Kong', 'TWN', 'KOR', 'Malaysia'
]
custom_sum = ['h1', 'h2', 'h3', 'c1', 'c8']
casestudy = CaseStudy(
    bf, countries=countries, count_dma=21, factors=factors, factor_dmas=factor_dmas, 
    start_hurdle=1, start_factor='', lognat=True, country_level=True, custom_sum=custom_sum,
)
casestudy.make()
/Users/spindicate/Documents/programming/zooscraper/casestudy/see19/see19/study/ray.py:16: UserWarning: smoothing is unavailable when country_level=True
  super().__init__(*args, **kwargs)




First, we'll demonstrate a single region, using Japan.

kwargs = {
    'regions': 'Japan', 'width': 6, 'height': 4.5, 
    'title': {'t': 'Japan Stringency Categories', 'x': .57, 'y': 1.07, 'fontsize': 20},
    'xlabel_params': {'fontsize': 18, 'labelpad': 12},
    'cblabel_params': {'fontsize': 14, 'labelpad': 6},
    'palette_base': 'RdPu',
    'xy_cbar': (1.05, .15), 'wh_cbar': (.35, .5),
}
plt = casestudy.substrinscat.make(**kwargs)

png

The single plot above expands to multi-plot simply by adding more regions.

kwargs = {
    'regions': ['name_for_USA', 'Hong Kong', 'Taiwan', 'Korea, South', 'Malaysia'], 
    'width': 14, 'height': 8,
    'palette_base': 'RdPu',
    'xy_cbar': (1.05, .3), 'wh_cbar': (.35, .5),
    'xy_legend': (-.04, .49),
    'legend': {'title': {'fontsize': 12}, 'text': {'fontsize': 12}},
}
plt = casestudy.substrinscat.make(**kwargs)

png

And the plot automatically rescales based on the number of regions considered:

kwargs = {
    'width': 20, 'height': 18, 
    'palette_base': 'RdPu',
    'xy_cbar': (1.05, .3), 'wh_cbar': (.35, .5),
    'xy_legend': (-.04, .51),
    'legend': {'title': {'fontsize': 12}, 'text': {'fontsize': 12}},
}
plt = casestudy.substrinscat.make(**kwargs)

png

9.2 scatterflow

ScatterFlow, available as the scatterflow attribute, is a generalization of the SubStrindexScatter chart. It is best suited for comparing many regions along a single dimension. For example, we can compare countries on the core Oxford Stringency Index:

kwargs = {
    'y_category': 'strindex',
    'title': {'t': 'Oxford Stringency Index Over Time', 'y': 0.94, 'fontsize': 16},
    'width': 8, 'height': 6,
    'xy_cbar': (.7, .24), 'wh_cbar': (.35, 1),
    'palette_base': 'Blues',
    'xlabel_params': {'fontsize': 15, 'labelpad': 12},
}

plt = casestudy.scatterflow.make(**kwargs)

png

We can see very clearly the trends in stringency across the different regions above and quickly isolate the outliers.

Scatterflow accepts any category in the see19 database.

Here we show the sum of the Key3 strindex subcategories.

kwargs = {
    'y_category': 'key3_sum',
    'title': {
        't': 'The Key 3: Information, Contact Tracing, and Testing Over Time',
        'fontsize': 16,
        'y': 0.94
    },
    'xlabel_params': {'fontsize': 14},
    'width': 8, 'height': 6,
    'xy_cbar': (.7, .24), 'wh_cbar': (.35, 1),
    'palette_base': 'Blues'
}
plt = casestudy.scatterflow.make(**kwargs)

png

And below we compare US states on new fatalities.

First, we select the 25 most impacted states in terms of total fatalities. Then, we instantiate a new CaseStudy.

region_ids = bf[bf.country_code == 'USA'].groupby('region_id').deaths.max().sort_values(ascending=False).index.values[:25]
casestudy = CaseStudy(bf, regions=region_ids, count_dma=3,
    start_factor='date', start_hurdle=dt(2020, 3, 1)
)
casestudy.make()
kwargs = {
    'y_category': 'deaths_new_dma_per_1M',
    'title': {
        't': 'Daily Fatalities in US States',
        'fontsize': 16,
        'y': 0.94
    },
    'marker': 's',
    'ms': 225,
    'xlabel_params': {'fontsize': 14},
    'width': 8, 'height': 6,
    'xy_cbar': (.7, .24), 'wh_cbar': (.35, 1),
    'palette_base': 'RdYlGn_r'
}
casestudy.scatterflow.make(**kwargs)

png

