
An interface for visualizing and analyzing the see19 dataset

Project description

see19 Guide

A dataset and interface for visualizing and analyzing the epidemiology of Coronavirus Disease 2019 aka COVID19 aka C19

Current with version 0.4.0

Analysis

Please read my various deep dives with see19 exploring different aspects of COVID19.

How Effective Is Social Distancing?

What Factors Are Correlated With COVID19 Fatality Rates?

The COVID Dragons

Contents

  1. Purpose
  2. Getting Started
  3. the Data
    3.1 Data Sources
    3.2 Dataset Characteristics
    3.3 The Testset
    3.4 Disclaimer
  4. the CaseStudy Interface
    4.1 Basics
    4.2 Filtering
    4.3 Smoothing
    4.4 Available Factors
    4.5 Additional Flags
    4.6 RayStudy v BaseStudy
    4.7 Chart Objects
  5. compchart - Visualizing Regional Impacts
    5.1 Daily Fatalities Comparison - Italy
    5.2 Daily Fatalities Comparison - 10 Most Impacted Regions
    5.3 Varying the Categories
  6. compchart4D - Visualizing Factors in 4D
    6.1 From 3D to 4D
    6.2 More on the X-Axis
    6.3 How Far Can We Take It?
  7. heatmap - Visualizing with Color Maps
    7.1 Count Category v Single Factor
    7.2 Count Category v Multiple Factors
  8. barcharts - Comparing Regional Factors
  9. ScatterFlow for Large Sets
    9.1 substrinscat - for Strindex Sub-Categories
    9.2 scatterflow

1. Purpose

See19 is the single most comprehensive international COVID-19 dataset available.

Ease of use is paramount; thus, all data from all sources have been compiled into a single structure that is readily consumed and manipulated in the ubiquitous csv format.

Along with the root data, a module is included with analysis and visualization tools.

2. Getting Started

See19 is a dataset and a python package.

The dataset can be accessed directly here. Files are timestamped with creation date.

The package can be installed via pip.

pip install see19

3. the Data

3.1 Data Sources
3.2 Dataset Characteristics
3.3 The Testset
3.4 Disclaimer

The See19 dataset aggregates global data on COVID19 in various regions, as available data allows, and marries that data with available datasets on exogenous regional factors that might impact the epidemiology of the virus.

The dataset is compiled using Selenium, Django, SQLite, and Pandas.

COVID19 Data Characteristics:

  • Cumulative Cases for each region on each date
  • Cumulative Fatalities for each region on each date
  • State / Provincial-level data available for:
    • Australia
    • Brazil
    • Canada
    • China
    • Italy
    • United States
  • Country-level available for all other regions

Factor Data Characteristics available for most regions:

  • Longitude / Latitude
    • I just wrote a script that searched the region name on this website and pulled the coordinates from the resulting url
  • Population
  • Population demographic segmentation
  • Land Density
  • City Density (typically the density of the largest city in the region)
  • Climate Characteristics including:
    • Average daily temperature
    • Average daily dewpoint temperature
    • Average daily relative humidity (derived from temperature and dewpoint temperature; see the sketch below)
    • Total daily UV-B Radiation
  • Air quality measures
  • Historical Health Outcomes
  • Travel Popularity
  • Social Distancing Implementation

Updated each morning.
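As noted in the climate characteristics above, relative humidity is derived from temperature and dewpoint temperature. The guide does not document the exact formula used; one common approximation is the Magnus formula, sketched below for illustration only.

import numpy as np

def relative_humidity(temp_c, dewpoint_c):
    # Approximate relative humidity (%) from air temperature and dewpoint (Celsius)
    # using the Magnus formula. Illustrative only; the dataset's own derivation
    # may use different constants or input units.
    a, b = 17.625, 243.04
    gamma = lambda t: np.exp(a * t / (b + t))
    return 100.0 * gamma(dewpoint_c) / gamma(temp_c)

relative_humidity(25.0, 15.0)   # roughly 54%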

3.1 Data Sources

COVID Case, Fatality, and Testing Data:

Other Data:

  • Longitude & Latitude
    • I just wrote a script that searched each region name on this site
    • Any errors were fixed manually
  • Population, Demographics, and Density from SEDAC
    • Matched to regional case data by name, often manually
  • Climate Data from European Centre for Medium-Range Weather Forecasts
    • Climate data pulled from nearest matching longitude & latitude coordinate in the dataset
  • Air Quality Data from the World Air Quality Project
    • Air quality data recorded at city-level, with limited number of cities available
    • City data is aggregated to the regional or country-level
    • So, where a region has multiple cities reporting AQ data, the regional value is an aggregate of those cities
    • Where a region has only a single city, that city represents the whole region
    • Where a region has no reporting cities, no air quality data is available
  • Social Distancing Stringency Index and Policy Indicators via Oxford Covid Government Response Tracker
  • Google Mobility Data
  • Apple Mobility Index
  • GDP Per Capita via the OECD and WorldBank
    • utilizing real 2016 Purchasing Power Parity figures indexed to 2015 US dollars
  • Causes of Death
  • Travel Popularity
    • An even messier hodgepodge of data pulled from the World Tourism Organization via indexmundi
    • State/Provincial data were derived from the country-level and other various sources in an ad-hoc fashion
    • Good travel data is surprisingly difficult to come by. There are a number of services that offer data on flight statistics; however, they are prohibitively expensive

3.2 Dataset Characteristics

With see19 installed, we can download the dataset via get_baseframe

import numpy as np
import pandas as pd
# from see19 import get_baseframe
from casestudy.see19.see19 import get_baseframe
bf = get_baseframe()

The dataset is arranged such that each row is a unique entry for each region_id on each date

All other columns are the value of that particular factor in that particular region on that particular date

bf.head(3)
region_id country_id region_code region_name country_code country date cases deaths tests ... genito childbirth perinatal congenital other external visitors travel_year gdp gdp_year
0 282 110 ABR Abruzzo ITA Italy 2020-01-01 NaN NaN NaN ... 442.0 1.0 16.0 19.0 384.0 2059 181458.0 2017.0 4.560860e+10 2016.0
1 282 110 ABR Abruzzo ITA Italy 2020-01-02 NaN NaN NaN ... 442.0 1.0 16.0 19.0 384.0 2059 181458.0 2017.0 4.560860e+10 2016.0
2 282 110 ABR Abruzzo ITA Italy 2020-01-03 NaN NaN NaN ... 442.0 1.0 16.0 19.0 384.0 2059 181458.0 2017.0 4.560860e+10 2016.0

3 rows × 132 columns

This could perhaps be more appropriately structured as a multi-index frame; however, I find such indexes cumbersome to work with.

'There are {} unique regions in the dataset'.format(bf.region_id.unique().size)
'There are 325 unique regions in the dataset'

Australia, Brazil, Canada, China, Italy, and the US have state/provincial level data.

For example, regions within Italy and Brazil are as follows:

bf[bf.country.isin(['Italy', 'Brazil'])].region_name.unique()
array(['Abruzzo', 'Acre', 'Alagoas', 'Amapa', 'Amazonas', 'Bahia',
       'Basilicata', 'Calabria', 'Campania', 'Ceara', 'Distrito Federal',
       'Emilia-Romagna', 'Espirito Santo', 'Friuli Venezia Giulia',
       'Goias', 'Lazio', 'Liguria', 'Lombardia', 'Maranhao', 'Marche',
       'Mato Grosso', 'Mato Grosso Do Sul', 'Minas Gerais', 'Molise',
       'P.A. Bolzano', 'P.A. Trento', 'Para', 'Paraiba', 'Parana',
       'Pernambuco', 'Piaui', 'Piemonte', 'Puglia', 'Rio De Janeiro',
       'Rio Grande Do Norte', 'Rio Grande Do Sul', 'Rondonia', 'Roraima',
       'Santa Catarina', 'Sao Paulo', 'Sardegna', 'Sergipe', 'Sicilia',
       'Tocantins', 'Toscana', 'Umbria', "Valle d'Aosta", 'Veneto'],
      dtype=object)
'Each region has {} dates in the dataset'.format(bf.date.unique().size)
'Each region has 202 dates in the dataset'
"""Thus, there are {:,.0f} rows in the dataset, with one row for each unique `region_id`-`date` combination""" \
.format(bf.date.shape[0])
'Thus, there are 65,650 rows in the dataset, with one row for each unique `region_id`-`date` combination'
"""There are currently {} columns in the dataset, most of which are observable factors""".format(bf.columns.size)
'There are currently 132 columns in the dataset, most of which are observable factors'

The factors can be seen as split between two types:

  • Time-static factors, i.e. do not change by the date.

    • population, density, population demographic ranges, cause of death outcomes, travel popularity
  • Time-dynamic factors, i.e. change with each date.

    • fatalities, climate, pollution, mobility, and the Oxford stringency index

They can be found as follows:

ny = bf[bf.region_name == 'New York']

static = []
dynamic = []
for col in ny.columns:
    if ny[col].unique().size > 1:
        dynamic.append(col)
    else:
        static.append(col)

bold = '\033[1m'
end = '\033[0m'
print ('{}***STATIC***{}\n'.format(bold, end), static)
print ('\n')
print ('{}***DYNAMIC***{}\n'.format(bold, end), dynamic)
***STATIC***
 ['region_id', 'country_id', 'region_code', 'region_name', 'country_code', 'country', 'population', 'land_KM2', 'land_dens', 'city_KM2', 'city_dens', 'A00_04B', 'A05_09B', 'A10_14B', 'A15_19B', 'A20_24B', 'A25_29B', 'A30_34B', 'A35_39B', 'A40_44B', 'A45_49B', 'A50_54B', 'A55_59B', 'A60_64B', 'A65_69B', 'A70_74B', 'A75_79B', 'A80_84B', 'A09UNDERB', 'A14UNDERB', 'A19UNDERB', 'A24UNDERB', 'A29UNDERB', 'A34UNDERB', 'A65PLUSB', 'A70PLUSB', 'A75PLUSB', 'A80PLUSB', 'A85PLUSB', 'A05_19B', 'A05_24B', 'A05_29B', 'A05_34B', 'A15_24B', 'A15_29B', 'A15_34B', 'A20_29B', 'A20_34B', 'A35_54B', 'A40_54B', 'A45_54B', 'A35_64B', 'A40_64B', 'A45_64B', 'pm10', 'precipitation', 'wd', 'uvi', 'aqi', 'pol', 'mepaqi', 'pm1', 'e3', 'e4', 'h4', 'h5', 'transit_apple', 'walking_apple', 'year', 'neoplasms', 'blood', 'endo', 'mental', 'nervous', 'circul', 'infectious', 'respir', 'digest', 'skin', 'musculo', 'genito', 'childbirth', 'perinatal', 'congenital', 'other', 'external', 'visitors', 'travel_year', 'gdp', 'gdp_year']


***DYNAMIC***
 ['date', 'cases', 'deaths', 'tests', 'co', 'dew', 'humidity', 'no2', 'o3', 'pm25', 'pressure', 'so2', 'temperature', 'wind gust', 'wind speed', 'wind-gust', 'wind-speed', 'temp', 'dewpoint', 'uvb', 'rhum', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'e1', 'e2', 'h1', 'h2', 'h3', 'strindex', 'retail_n_rec', 'groc_n_pharm', 'parks', 'transit', 'workplaces', 'residential', 'driving_apple']
'The entire set has {:,.0f} different data points'.format(bf.size)
'The entire set has 8,665,800 different data points'

3.3 The Testset

A separate dataset, referred to as the testset, is housed in the see19 repo in the testset folder. The testset will include new data (either additional factors or new regions) that has not yet been incorporated in the see19 interface. The goal is to integrate the new data into the interface over time. The testset will be updated concurrently with the main dataset on an ad-hoc basis.

The existing see19 package is NOT compatible with the testset, HOWEVER you can download the testset via get_baseframe by setting test=True.

See the readme for additional data currently available in the testset.

bf_test = get_baseframe(test=True)

3.4 Disclaimer

I have said before and it bears repeating: This is an imperfect dataset. Specific problems are highlighted here.

GENERAL ISSUES

  • Not all factors have available measurements for each region or each date.

    • These are typically expressed as NaN
  • Some factors are available at regional levels while others are not

    • Measurements for a region are often compared to measurements for entire countries. This isn't necessarily problematic ... for geographically large and populous countries like the US, it is likely better to compare state-level data to other, smaller countries.
    • State-level measurements are often estimated by mixing separate data sources. For instance, visitor data for the provinces of Brazil was estimated by taking the country-level data from the World Tourism Organization and weighting it by each province's proportionate share of visitor travel, using separate data from the Brazilian government.
  • Some data is outdated.

    • GDP data lags significantly, particularly for large groups of countries, so 2016 figures have been used, presuming that the relative mix among countries has remained constant

DENSITY

Population density is oft-cited as a potential explanatory factor in COVID19 infection rates. And I couldn't agree more that it is important to consider. However, the study of density suffers from many issues.

  • Density is highly variable within regions, and case and fatality rates have been highly variable within regions and across densities. In New York City, for example, some of the least dense areas have had the highest infection rates.

  • With only regional data available, the most rigorous option is to simply use the density of the region as a whole. However, this is often a poor reflection of reality. New York State has significant land mass, despite most of its population residing on a tiny island at its southeastern edge.

  • To account for this, See19 includes a factor city_dens. city_dens is the density of the largest city in the region, so:

    • for New York State, city_dens is the density of New York City,
    • for Taiwan, city_dens is the density of Taipei,
    • for Japan, city_dens is the density of Tokyo, and so on.

    This approach results in its own issues. For instance, at present, for all of Russia, city_dens reflects the density of Moscow.

Other geographic measurements, such as temperature and UV-B radiation, suffer from similar issues.

The only true way to address these shortcomings is for daily case and fatality statistics to be released at the county-level (or equivalent) in every country around the globe.

CASE DATA

Aside from just the difficulties of aggregating data, there are well-documented issues with the underlying case and fatality counts as well.

  • Confirmed cases are likely well below actual cases given up to 50% of all COVID19 cases may be asymptomatic and limited testing in the early stages led to many symptomatic cases going unreported.

  • The rapid improvement in testing likely exaggerated the growth of infections over time

  • Fatalities were unreported at peak periods due to lack of health care capacity

  • Fatalities have been retroactively added to the data without adjusting back to the days the fatalities actually occurred, so for regions like Hubei and New York State, there are massive spikes in fatalities that don't reflect the actual experience.

  • China has been heavily criticized for under-reporting and late reporting, and it recently added a ~20% increase in cumulative fatalities on a single day in March. For these reasons, throughout this tutorial, you will see that China is often excluded from the dataset.

TESTING

Testing statistics are still a bit of a mess internationally. For instance, many European countries only report cumulative test counts on a weekly basis, and many have only begun reporting in the very recent past. Different methods of interpolation are available in the CaseStudy interface.

  • Brazil is not currently included in the tests data. Brazil test counts are currently available only at the country level, whereas case and fatality data is available at a regional level. Methods are being considered to allocate aggregate tests among the regions (perhaps simply as a percentage of population or case counts).

4. the CaseStudy Interface

4.1 Basics
4.2 Filtering
4.3 Smoothing
4.4 Available Factors
4.5 Additional Flags
4.6 RayStudy v BaseStudy
4.7 Chart Objects

See19 visualization and data analysis is performed via the CaseStudy class. CaseStudy provides attributes and methods for filtering, manipulating, appending, and visualizing data in the baseframe.

CaseStudy can be accessed directly from the see19 module. To initialize, simply pass the baseframe.

# from see19 import CaseStudy
from casestudy.see19.see19 import CaseStudy
casestudy = CaseStudy(bf)

4.1 Basics

The original baseframe can be accessed via the baseframe attribute

casestudy.baseframe.head(2)
region_id country_id region_code region_name country_code country date cases deaths tests ... genito childbirth perinatal congenital other external visitors travel_year gdp gdp_year
0 282 110 ABR Abruzzo ITA Italy 2020-01-01 NaN NaN NaN ... 442.0 1.0 16.0 19.0 384.0 2059 181458.0 2017.0 4.560860e+10 2016.0
1 282 110 ABR Abruzzo ITA Italy 2020-01-02 NaN NaN NaN ... 442.0 1.0 16.0 19.0 384.0 2059 181458.0 2017.0 4.560860e+10 2016.0

2 rows × 132 columns

CaseStudy automatically computes different adjustments including:

  1. Daily new cases, fatalities, and tests (called count_types)
  2. Daily Moving Average (DMA) for new and cumulative count_types
  3. Population and density adjustments for new and cumulative count_types
  4. Daily growth or change in 1. thru 3. above

These adjustments are referred to as count_categories. Additional adjustments are available via kwargs to be discussed below.
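For intuition, the sketch below approximates a few of these adjustments directly on the baseframe with pandas. The column names mirror the CaseStudy naming conventions, but this is illustrative only and not the package's actual implementation.

import pandas as pd

# Illustrative sketch only -- not see19's internal implementation
df = bf.sort_values(['region_id', 'date']).copy()

# 1. daily new counts: day-over-day difference of the cumulative series
df['cases_new'] = df.groupby('region_id')['cases'].diff()

# 2. daily moving average of the new counts (window length is illustrative)
df['cases_new_dma'] = (df.groupby('region_id')['cases_new']
                         .transform(lambda s: s.rolling(3, min_periods=1).mean()))

# 3. population adjustment of the cumulative counts
df['cases_per_1M'] = df['cases'] / df['population'] * 1e6

# 4. daily growth of the new counts
df['growth_cases_new'] = df.groupby('region_id')['cases_new'].pct_change() + 1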

Adjustments are added to the dataset by calling the make method. The amended dataset is then accessible via the df attribute.

casestudy.make()

The amended dataframe can be accessed via the df attribute:

casestudy.df.head(2)
region_id country_id region_code region_name country_code country date cases deaths tests ... growth_cases_per_person_per_city_KM2 growth_deaths_per_1K growth_deaths_per_1M growth_deaths_per_person_per_land_KM2 growth_deaths_per_person_per_city_KM2 growth_tests_per_1K growth_tests_per_1M growth_tests_per_person_per_land_KM2 growth_tests_per_person_per_city_KM2 days
43906 32 110 TRE P.A. Trento ITA Italy 2020-03-13 216.699585 1.87999 803.712436 ... 1.523364 2.0 2.0 2.0 2.0 1.426644 1.426644 1.426644 1.426644 0 days
43907 32 110 TRE P.A. Trento ITA Italy 2020-03-14 273.865733 1.87999 955.714788 ... 1.263804 1.0 1.0 1.0 1.0 1.189125 1.189125 1.189125 1.189125 1 days

2 rows × 140 columns

NOTE: Ray and Numba are utilized to significantly improve the speed of make. Ray is not compatible with Windows. CaseStudy will attempt to detect incompatibility and revert to a single-process method where applicable.

More in Section 4.6

For ease of selection, CaseStudy has a number of class attributes with different groupings of count categories: BASECOUNT_CATS, PER_CATS, LOGNAT_CATS, LOG_CATS, ALL_CATS, DMA_COUNT_CATS, PER_COUNT_CATS.

DMA_COUNT_CATS is shown as an example:

CaseStudy.DMA_COUNT_CATS[:10]
['cases_dma',
 'cases_new_dma',
 'deaths_dma',
 'deaths_new_dma',
 'tests_dma',
 'tests_new_dma',
 'cases_dma_per_1K',
 'cases_dma_per_1M',
 'cases_dma_per_person_per_land_KM2',
 'cases_dma_per_person_per_city_KM2']

Both the log10 and natural log of each of 1. thru 3. above are available for presentation purposes. Simply provide log=True and/or lognat=True.

casestudy.log = True
casestudy.lognat = True
casestudy.make()
casestudy.df[['region_name', 'date'] + [col for col in casestudy.df if 'log' in col]].head(2)
region_name date cases_dma_log cases_new_log cases_new_dma_log deaths_dma_log deaths_new_log deaths_new_dma_log tests_dma_log tests_new_log ... growth_cases_per_person_per_land_KM2_lognat growth_cases_per_person_per_city_KM2_lognat growth_deaths_per_1K_lognat growth_deaths_per_1M_lognat growth_deaths_per_person_per_land_KM2_lognat growth_deaths_per_person_per_city_KM2_lognat growth_tests_per_1K_lognat growth_tests_per_1M_lognat growth_tests_per_person_per_land_KM2_lognat growth_tests_per_person_per_city_KM2_lognat
43906 P.A. Trento 2020-03-13 2.186879 1.871859 1.691872 -0.026874 -0.026874 -0.202966 2.794193 2.380851 ... -1.014299 -1.014299 0.890089 2.152714 0.867427 0.867427 4.976355 1.050782 1.304384 1.304384
43907 P.A. Trento 2020-03-14 2.324156 1.757139 1.757139 0.194974 NaN -0.202966 2.888888 2.181850 ... 2.104604 2.104604 1.000000 1.000000 1.000000 1.000000 1.389530 1.023559 1.113758 1.113758

2 rows × 242 columns

'In total, there are {} different `count_categories` to choose from.'.format(len(CaseStudy.ALL_COUNT_CATS))
'In total, there are 180 different `count_categories` to choose from.'

4.2 Filtering

Thankfully, casestudy.df can be limited to specific count categories via the count_categories attribute:

casestudy.count_categories = ['tests_new_dma_per_person_per_land_KM2']
casestudy.make()
casestudy.df.head(2)
region_id country_id region_code region_name country_code country date cases deaths tests population land_KM2 land_dens city_KM2 city_dens tests_new_dma_per_person_per_land_KM2 days
43906 32 110 TRE P.A. Trento ITA Italy 2020-03-13 216.699585 1.87999 803.712436 515201.0 2938.79544 175.310262 2938.79544 175.310262 0.807438 0 days
43907 32 110 TRE P.A. Trento ITA Italy 2020-03-14 273.865733 1.87999 955.714788 515201.0 2938.79544 175.310262 2938.79544 175.310262 0.865241 1 days

When passing kwargs to CaseStudy at initialization, most kwargs will accept either a string for a single category or a list (or other iterable) for multiple. When assigning to an instance attribute, an iterable must be passed.

casestudy = CaseStudy(bf, count_categories='tests_new_dma_per_person_per_land_KM2')
casestudy.make()
casestudy.df[['region_name', 'date', 'tests_new_dma_per_person_per_land_KM2']].head(2)
region_name date tests_new_dma_per_person_per_land_KM2
43906 P.A. Trento 2020-03-13 0.807438
43907 P.A. Trento 2020-03-14 0.865241
casestudy.count_categories = ['deaths_new_dma_per_person_per_land_KM2', 'growth_cases_new_per_1M']
casestudy.make()
casestudy.df.head(2)
region_id country_id region_code region_name country_code country date cases deaths tests population land_KM2 land_dens city_KM2 city_dens deaths_new_dma_per_person_per_land_KM2 growth_cases_new_per_1M days
43906 32 110 TRE P.A. Trento ITA Italy 2020-03-13 216.699585 1.87999 803.712436 515201.0 2938.79544 175.310262 2938.79544 175.310262 0.003575 1.866667 0 days
43907 32 110 TRE P.A. Trento ITA Italy 2020-03-14 273.865733 1.87999 955.714788 515201.0 2938.79544 175.310262 2938.79544 175.310262 0.003575 0.767857 1 days

CaseStudy can further filter the baseframe as follows:

  • regions to limit the frame to certain regions
  • countries to limit the frame to certain countries
  • exclude_regions to exclude certain regions
  • exclude_countries to exclude certain countries

Specific regions can be included or excluded by providing the region_name, region_code, or region_id. Specific countries can be included or excluded by providing the country, country_code, or country_id.

Each of the four parameters can accept a single region as a str object or multiple regions via several common iterables.

Below we select three regions:

regions = ['New York', 'FL', 35]
casestudy = CaseStudy(
    bf, regions=regions, count_categories=CaseStudy.BASECOUNT_CATS, 
)
casestudy.make()

We can see that all three regions are indeed in the object by grouping:

pd.concat([df_group.iloc[:1] for region_id, df_group in casestudy.df.groupby('region_id')]).head(3)
region_id country_id region_code region_name country_code country date cases deaths tests ... cases_dma cases_new cases_new_dma deaths_dma deaths_new deaths_new_dma tests_dma tests_new tests_new_dma days
53399 35 110 SIC Sicilia ITA Italy 2020-03-12 102.712067 2.000000 973.321711 ... 77.406196 28.580749 15.778955 0.666667 2.000000 0.666667 796.493912 186.492921 140.803254 0 days
17846 64 236 FL Florida USA United States of America (the) 2020-03-11 28.000000 2.526828 329.000000 ... 21.666667 9.000000 3.666667 0.842276 2.526828 0.842276 242.666667 88.000000 64.666667 0 days
40070 75 236 NY New York USA United States of America (the) 2020-03-15 729.000000 3.143533 6916.080830 ... 558.000000 205.000000 171.000000 1.047844 3.143533 1.047844 5149.016931 2583.035500 2170.676861 0 days

3 rows × 25 columns

The region and country filters are important mechanisms for isolating data.

Here, we focus on US regions only, but exclude some of the most impacted ones:

casestudy.countries = ['USA']
casestudy.excluded_regions = ['NY', 'NJ']
casestudy.regions = None
casestudy.make()

Because certain regions were assigned in the previous CaseStudy instantiation, we must set regions=None above in order to include ALL the regions of the baseframe.

And below we can see that we have various US states in the dataset and that neither New York nor New Jersey is included.

casestudy.df.region_name.unique()
array(['Alabama', 'Wyoming', 'Alaska', 'Arkansas', 'Delaware', 'Idaho',
       'Maine', 'Mississippi', 'Montana', 'New Mexico', 'North Dakota',
       'South Dakota', 'West Virginia', 'Michigan', 'Vermont', 'Georgia',
       'Colorado', 'Florida', 'Oregon', 'Texas', 'Illinois',
       'Pennsylvania', 'Iowa', 'Maryland', 'North Carolina', 'Washington',
       'California', 'Massachusetts', 'Oklahoma', 'Arizona',
       'Connecticut', 'Minnesota', 'Virginia', 'New Hampshire', 'Hawaii',
       'Nevada', 'Indiana', 'Kentucky', 'District of Columbia',
       'Missouri', 'Louisiana', 'Ohio', 'Wisconsin', 'Kansas', 'Utah',
       'Tennessee', 'South Carolina', 'Nebraska'], dtype=object)
pd.concat([df_group.iloc[:1] for region_id, df_group in casestudy.df.groupby('region_id')]).head(3)
region_id country_id region_code region_name country_code country date cases deaths tests ... cases_dma cases_new cases_new_dma deaths_dma deaths_new deaths_new_dma tests_dma tests_new tests_new_dma days
691 44 236 AL Alabama USA United States of America (the) 2020-03-26 558.514091 1.26695 10468.861581 ... 369.399307 246.143562 124.727455 0.422317 1.26695 0.422317 7859.521030 3287.002892 1929.975539 0 days
64339 48 236 WY Wyoming USA United States of America (the) 2020-04-13 316.114653 1.00000 9715.352851 ... 305.385913 16.093110 8.429724 0.333333 1.00000 0.333333 9166.923029 822.644733 529.424828 0 days
1094 49 236 AK Alaska USA United States of America (the) 2020-03-25 53.977249 1.00000 3783.772189 ... 42.839087 7.711036 8.567817 0.333333 1.00000 0.333333 2745.528371 1496.950677 539.260259 0 days

3 rows × 25 columns

casestudy.df[casestudy.df.region_name.isin(['NY', 'NJ'])]
region_id country_id region_code region_name country_code country date cases deaths tests ... cases_dma cases_new cases_new_dma deaths_dma deaths_new deaths_new_dma tests_dma tests_new tests_new_dma days

0 rows × 25 columns

Limiting data via different start and tail hurdles

Parameters exist that allow you to filter the dataset such that regions and days appear only if they meet certain criteria.

start_factor and start_hurdle provide the ability to effectively crop the beginning of a region's period of data.

tail_factor and tail_hurdle do the same for the end of a region's period.

start_factor and tail_factor accept any dynamic factor in the dataset (including date).

The hurdle is the level of the specified factor the region must reach to be included. For instance, if start_factor=cases_new_per_1M and start_hurdle=100, each region's first row in casestudy.df will be the day that the region met or exceeded 100 new cases per 1 million people.

These options are a convenient way to compare regions that have been impacted to a similar extent or, perhaps, to fairly compare regions that were impacted at different times.

The default parameters for start_factor and start_hurdle limit the data to regions with at least one cumulative fatality.

NOTE: a days column is added to casestudy.df. This is a count of the number of days from the current date back to the first date in the casestudy. When a start_factor is provided, this is the first date that the start_hurdle is met. When start_factor is not provided, this is the first date in the dataset.

Examples are shown below.

casestudy = CaseStudy(
    bf, regions='Spain', count_categories=CaseStudy.BASECOUNT_CATS, 
    start_factor='cases', start_hurdle=1000
)
casestudy.make()
casestudy.df.head(2)
region_id country_id region_code region_name country_code country date cases deaths tests ... cases_dma cases_new cases_new_dma deaths_dma deaths_new deaths_new_dma tests_dma tests_new tests_new_dma days
55820 491 209 ESP Spain ESP Spain 2020-03-09 1057.840245 27.344784 NaN ... 738.089217 394.348647 221.163866 17.904323 10.742594 7.487262 NaN NaN NaN 0 days
55821 491 209 ESP Spain ESP Spain 2020-03-10 1671.052390 34.180981 NaN ... 1130.794744 613.212146 392.705527 26.042652 6.836196 8.138329 NaN NaN NaN 1 days

2 rows × 25 columns

casestudy = CaseStudy(
    bf, countries='Sweden', 
    count_categories='deaths_new', start_factor='deaths_new', start_hurdle=100
)
casestudy.make()
casestudy.df.head(2)
region_id country_id region_code region_name country_code country date cases deaths tests population land_KM2 land_dens city_KM2 city_dens deaths_new days
56656 495 214 SWE Sweden SWE Sweden 2020-04-06 7438.936775 675.770207 NaN 9415570.0 415314.854224 22.67092 2150.411192 4378.497486 107.669886 0 days
56657 495 214 SWE Sweden SWE Sweden 2020-04-07 7941.679240 837.275037 NaN 9415570.0 415314.854224 22.67092 2150.411192 4378.497486 161.504829 1 days

To see the earliest dates in the dataframe, prior to any deaths being recorded, set start_factor to ''.

casestudy.countries = None
casestudy.regions = ['RJ']
casestudy.count_categories = ['tests_new_dma']
casestudy.factors = ['temp', 'strindex']
casestudy.start_factor = ''
casestudy.make()
casestudy.df.head(2)
region_id country_id region_code region_name country_code country date cases deaths tests population land_KM2 land_dens city_KM2 city_dens tests_new_dma temp strindex days
48480 557 31 RJ Rio De Janeiro BRA Brazil 2020-01-01 NaN NaN NaN 15962668.0 42269.311478 377.642016 2203.766328 7243.357792 NaN 294.134674 0.0 0 days
48481 557 31 RJ Rio De Janeiro BRA Brazil 2020-01-02 NaN NaN NaN 15962668.0 42269.311478 377.642016 2203.766328 7243.357792 NaN 294.375153 0.0 1 days
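The examples above all use start_factor and start_hurdle. The tail_factor and tail_hurdle kwargs are not demonstrated in this guide; a hedged sketch, assuming they simply mirror the start_* kwargs and crop the end of each region's period at the hurdle:

# Hedged sketch only: the exact tail_hurdle semantics follow the description above
casestudy = CaseStudy(
    bf, regions='Spain', count_categories='deaths_new',
    start_factor='deaths', start_hurdle=1,
    tail_factor='deaths', tail_hurdle=10000,
)
casestudy.make()
casestudy.df.tail(2)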

4.3 Smoothing

Smoothing is applied in two ways within the make method.

The first addresses NaN values within the count_type time-series. Sometimes there are artifacts and one-offs within the set. Other times, as with test counts in many regions, the count is only updated periodically and NaNs fill the gaps.

In these instances, make interpolates between the real values to fill in the gaps. The default method is linear interpolation, but this can be overridden by providing interpolation_method (see the Pandas docs for options).
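For intuition, this is the kind of gap-filling pandas performs (illustrative only; make applies it internally per region, and interpolation_method maps to the pandas method argument):

import numpy as np
import pandas as pd

# a weekly-reported series with NaN gaps, filled by linear interpolation
weekly = pd.Series([100.0, np.nan, np.nan, np.nan, np.nan, np.nan, 170.0])
weekly.interpolate(method='linear')
# 100.0, 111.67, 123.33, 135.0, 146.67, 158.33, 170.0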

For instance, below we see Spain's testing data:

casestudy = CaseStudy(bf, regions='Spain')
casestudy.make()
casestudy.df.tests.tail(20)
55934    3.619554e+06
55935    3.644458e+06
55936    3.673778e+06
55937    3.703099e+06
55938    3.732419e+06
55939    3.761740e+06
55940    3.791060e+06
55941    3.820381e+06
55942    3.849701e+06
55943    3.881696e+06
55944    3.913690e+06
55945    3.945685e+06
55946    3.977680e+06
55947    4.009675e+06
55948    4.041669e+06
55949    4.073664e+06
55950    4.073664e+06
55951    4.073664e+06
55952    4.073664e+06
55953    4.073664e+06
Name: tests, dtype: float64

But when we set interpolate=False, we can see that Spain in fact updates its testing counts only weekly.

casestudy = CaseStudy(bf, regions='Spain', interpolate=False)
casestudy.make()
casestudy.df.tests.tail(20)

55934          NaN
55935    3644458.0
55936          NaN
55937          NaN
55938          NaN
55939          NaN
55940          NaN
55941          NaN
55942    3849701.0
55943          NaN
55944          NaN
55945          NaN
55946          NaN
55947          NaN
55948          NaN
55949    4073664.0
55950          NaN
55951          NaN
55952          NaN
55953          NaN
Name: tests, dtype: float64

The second approach is new in 0.3.6. CaseStudy automatically applies smoothing to negative values and large outliers in the main count_types (cases, deaths, and tests).

Many regions have chosen to "adjust" or "catch up" their case or fatality counts, not by adjusting the actual dates on which the outcomes occurred, but instead by lumping them onto a seemingly random reporting date. This creates strange artifacts in the time series.

For example, Spain has a dip into negative daily counts in late April 2020:

casestudy = CaseStudy(bf, regions='Spain', smooth=False)
casestudy.make()
casestudy.compchart.make(x_category='date', y_category='deaths_new', figsize=(8,4))
(figure: Daily Deaths)

With smooth=True (the default setting), this deep negative value is redistributed through prior dates according to the distribution of counts up to the date with the negative value.

This is a somewhat naive approach, but it has the benefit of maintaining a consistent shape in the time series.
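A toy illustration of the redistribution idea (not the package's actual implementation):

import numpy as np

# Hypothetical sketch: the negative daily value is zeroed out and the shortfall
# is spread across prior days in proportion to their share of the counts to date,
# preserving the cumulative total
daily = np.array([10., 40., 50., -20.])
shortfall = daily[-1]
weights = daily[:-1] / daily[:-1].sum()
smoothed = np.append(daily[:-1] + shortfall * weights, 0.0)
print(smoothed)   # [ 8. 32. 40.  0.] -- same cumulative total as before

Returning to the guide's example, with smooth=True: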

casestudy = CaseStudy(bf, regions='Spain', smooth=True)
casestudy.make()
casestudy.compchart.make(x_category='date', y_category='deaths_new', figsize=(8,4))
(figure: Daily Deaths)

The same adjustment is made for VERY large increases in counts relative to the cumulative total and to the daily rate. For example, see New York below:

casestudy = CaseStudy(bf, regions='NY', smooth=False)
casestudy.make()
casestudy.compchart.make(x_category='date', y_category='deaths_new', figsize=(8,4))
(figure: Daily Deaths)

casestudy = CaseStudy(bf, regions='NY', smooth=True)
casestudy.make()
casestudy.compchart.make(x_category='date', y_category='deaths_new', figsize=(8,4))
(figure: Daily Deaths)

4.4 Available Factors

The remaining columns in the baseframe can be included in a CaseStudy instance on an opt-in basis via the factors attribute:

casestudy = CaseStudy(bf, count_categories='cases_new_per_person_per_land_KM2', factors=['no2', 'strindex'])
casestudy.make()
casestudy.df.head(2)
region_id country_id region_code region_name country_code country date cases deaths tests population land_KM2 land_dens city_KM2 city_dens cases_new_per_person_per_land_KM2 no2 strindex days
43905 32 110 TRE P.A. Trento ITA Italy 2020-03-12 131.523112 1.096661 652.429603 515201.0 2938.79544 175.310262 2938.79544 175.310262 0.210345 NaN 85.19 0 days
43906 32 110 TRE P.A. Trento ITA Italy 2020-03-13 200.357639 2.193322 930.784897 515201.0 2938.79544 175.310262 2938.79544 175.310262 0.392644 NaN 85.19 1 days

For convenience, a number of factor groupings can be accessed via CaseStudy attributes:

  • GMOBIS, AMOBIS, CAUSES, MAJOR_CAUSES, POLLUTS, TEMP_MSMTS, MSMTS
    • various groupings for factor data
    • GMOBIS refer to Google Mobility data.
    • AMOBIS refer to Apple Mobility data.
  • STRINDEX_CATS, CONTAIN_CATS, ECON_CATS, HEALTH_CATS
    • groupings for the Oxford Stringency Index
print (CaseStudy.MSMTS)
print (CaseStudy.MAJOR_CAUSES)
['uvb', 'rhum', 'temp', 'dewpoint']
['circul', 'infectious', 'respir', 'endo']

Different demographic population age groupings can be accessed as well:

  • ALL_RANGES - all the possible demographic age ranges
  • RANGES - a dictionary of various groupings of age ranges
from see19 import RANGES
RANGES.keys()
dict_keys(['UNDERS', 'OVERS', 'SCHOOL_GOERS', 'Y_MILLS', 'MILLS', 'MID', 'MID_PLUS'])
overs = RANGES['OVERS']['ranges']
casestudy = CaseStudy(bf, regions='Lombardia', count_categories='deaths_new_per_person_per_land_KM2', factors=overs)
casestudy.make()
casestudy.df.head(2)
region_id country_id region_code region_name country_code country date cases deaths tests ... A70PLUSB A75PLUSB A80PLUSB A85PLUSB A65PLUSB_% A70PLUSB_% A75PLUSB_% A80PLUSB_% A85PLUSB_% days
31566 36 110 LOM Lombardia ITA Italy 2020-02-24 216.225177 6.0 943.732875 ... 1490749.0 963768.0 0.0 0.0 0.208224 0.154784 0.100068 0.0 0.0 0 days
31567 36 110 LOM Lombardia ITA Italy 2020-02-25 301.709549 9.0 2386.747531 ... 1490749.0 963768.0 0.0 0.0 0.208224 0.154784 0.100068 0.0 0.0 1 days

2 rows × 27 columns

casestudy = CaseStudy(bf, regions='LOM', count_categories='deaths_new_per_person_per_land_KM2', factors=CaseStudy.MAJOR_CAUSES)
casestudy.make()
casestudy.df.head(2)
region_id country_id region_code region_name country_code country date cases deaths tests ... deaths_new_per_person_per_land_KM2 circul infectious respir endo circul_% infectious_% respir_% endo_% days
31566 36 110 LOM Lombardia ITA Italy 2020-02-24 216.225177 6.0 943.732875 ... NaN 74695 4630 20185 6566.0 0.007756 0.000481 0.002096 0.000682 0 days
31567 36 110 LOM Lombardia ITA Italy 2020-02-25 301.709549 9.0 2386.747531 ... 0.00507 74695 4630 20185 6566.0 0.007756 0.000481 0.002096 0.000682 1 days

2 rows × 25 columns

Some factors are only available at a country level.

By setting country_level=True, casestudy will aggregate most data among the subregions up to the country level to allow for proper comparison across the broad range of countries.

The Oxford Stringency Index and its derivatives are one such data group available only at the country level.

casestudy = CaseStudy(bf, 
    count_categories='deaths_new_per_person_per_land_KM2', 
    factors='strindex',
    country_level=True,
)
casestudy.make()
casestudy.df.tail(2)
/Users/spindicate/Documents/programming/zooscraper/casestudy/see19/see19/study/ray.py:16: UserWarning: smoothing is unavailable when country_level=True
  super().__init__(*args, **kwargs)



region_id country_id region_code region_name country_code country date cases deaths tests population land_KM2 land_dens city_KM2 city_dens deaths_new_per_person_per_land_KM2 strindex days
36560 id_for_USA 236 USA name_for_USA USA United States of America (the) 2020-07-19 3725463.0 131737.0 45313502.0 307692971.0 9.087502e+06 33.858916 710152.024025 433.277609 15.446448 68.98 144 days
36561 id_for_USA 236 USA name_for_USA USA United States of America (the) 2020-07-20 3782891.0 132095.0 46043131.0 307692971.0 9.087502e+06 33.858916 710152.024025 433.277609 10.573286 68.98 145 days

Above you can see that all US states have been aggregated into a single region with a placeholder region_id (id_for_USA).

With respect to the STRINDEX_CATS subgroups, if all the required categories are provided, CaseStudy will sum the individual category values.

For example, if CONTAIN_CATS are provided, the aggregate of the eight categories will be included in the c_sum column.

Note if all five h indicators are provided, CaseStudy will also tabulate a key3_sum, which aggregates the scores on the h1, h2, and h3 indicators.

casestudy = CaseStudy(bf, 
    count_categories='deaths_new_per_person_per_land_KM2', 
    factors=CaseStudy.CONTAIN_CATS,
    country_level=True,
)
casestudy.make()
casestudy.df.tail(2)
/Users/spindicate/Documents/programming/zooscraper/casestudy/see19/see19/study/ray.py:16: UserWarning: smoothing is unavailable when country_level=True
  super().__init__(*args, **kwargs)



region_id country_id region_code region_name country_code country date cases deaths tests ... c1 c2 c3 c4 c5 c6 c7 c8 c_sum days
36560 id_for_USA 236 USA name_for_USA USA United States of America (the) 2020-07-19 3725463.0 131737.0 45313502.0 ... 3.0 2.0 2.0 4.0 1.0 2.0 2.0 3.0 19.0 144 days
36561 id_for_USA 236 USA name_for_USA USA United States of America (the) 2020-07-20 3782891.0 132095.0 46043131.0 ... 3.0 2.0 2.0 4.0 1.0 2.0 2.0 3.0 19.0 145 days

2 rows × 26 columns

Additional computations can be added for each factor via the factor_dmas attribute.

The attribute is a dictionary of the form str(factor_name): int(dma).

When provided, CaseStudy will automatically add _dma, _growth, and _growth_dma computations

casestudy = CaseStudy(bf, count_categories='deaths_new_dma_per_1M', 
    factors=['temp', 'c1', 'strindex'], 
    factor_dmas={'temp': 7, 'c1': 14},
    country_level=True,
)
casestudy.make()
casestudy.df.head(2)
/Users/spindicate/Documents/programming/zooscraper/casestudy/see19/see19/study/ray.py:16: UserWarning: smoothing is unavailable when country_level=True
  super().__init__(*args, **kwargs)



region_id country_id region_code region_name country_code country date cases deaths tests ... temp c1 strindex temp_dma temp_growth temp_growth_dma c1_dma c1_growth c1_growth_dma days
81 293 1 AFG Afghanistan AFG Afghanistan 2020-03-22 40.0 1.0 NaN ... 10.778741 3.0 41.67 7.908977 1.067747 1.384819 1.928571 1.0 NaN 0 days
82 293 1 AFG Afghanistan AFG Afghanistan 2020-03-23 40.0 1.0 NaN ... 8.560785 3.0 41.67 8.784692 0.794229 1.150845 2.142857 1.0 NaN 1 days

2 rows × 26 columns

NOTE: When country_level=True, smooth is currently NOT available (as per the warning above), and Ray multi-processing is also NOT available.

To provide a single dma for all the factors submitted, build the dictionary ahead of time:

factor_dmas = {msmt: 14 for msmt in CaseStudy.MSMTS}
casestudy = CaseStudy(
    bf, count_categories='tests_new_per_1M', 
    factors=CaseStudy.MSMTS, factor_dmas=factor_dmas
)
casestudy.make()
casestudy.df.head(2)
region_id country_id region_code region_name country_code country date cases deaths tests ... rhum_dma rhum_growth rhum_growth_dma temp_dma temp_growth temp_growth_dma dewpoint_dma dewpoint_growth dewpoint_growth_dma days
43905 32 110 TRE P.A. Trento ITA Italy 2020-03-12 131.523112 1.096661 652.429603 ... 90.025840 1.050915 0.996733 3.513738 0.959184 1.105750 -3.142554 1.896068 -0.635699 0 days
43906 32 110 TRE P.A. Trento ITA Italy 2020-03-13 200.357639 2.193322 930.784897 ... 89.967379 0.995192 1.001809 3.242550 1.053689 1.114479 -3.447804 1.026207 -0.735813 1 days

2 rows × 33 columns

Other factors are adjusted to population. These factors are appended with _% and can be seen via the pop_cats attribute.

These are typically time-static factors.

casestudy = CaseStudy(bf, count_categories='deaths_new_dma_per_1M', factors=['visitors', 'gdp', 'A65PLUSB' ])
print (casestudy.pop_cats)
casestudy.make()
casestudy.df[['region_name', 'date', 'visitors_%', 'gdp_%', 'A65PLUSB_%']].head(2)
['A65PLUSB', 'visitors', 'gdp']



region_name date visitors_% gdp_% A65PLUSB_%
43905 P.A. Trento 2020-03-12 19.864474 54504.746691 0.203018
43906 P.A. Trento 2020-03-13 19.864474 54504.746691 0.203018

4.5 Additional Flags

There are several additional flags and methods that will be touched on only briefly here; you are encouraged to read the analysis pages to see them in action. A hedged usage sketch follows the list below.

  • world_averages: when set to True, averages each date in the dataset across all the regions, to provide a per_region statistic for each factor

  • favor_earlier: when set to True, scales any selected rows such that values earlier in the dataset receive more weight than later ones. A new column is added with the _earlier suffix. This is helpful when attempting to study the impacts of early moves to, say, social distance. Factors are selected by passing a list to the factors_to_favor_earlier parameter.
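A minimal usage sketch, assuming both flags are ordinary CaseStudy kwargs as described above (the output columns, such as the _earlier suffix, follow the description and are not verified here):

# Hedged sketch only: kwargs are taken from the descriptions above
casestudy = CaseStudy(
    bf,
    count_categories='deaths_new_dma_per_1M',
    factors=['strindex'],
    world_averages=True,
    favor_earlier=True,
    factors_to_favor_earlier=['strindex'],
)
casestudy.make()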

4.6 RayStudy v BaseStudy

The default implementation of make utilizes both Ray and Numba to significantly improve the performance.

Ray is a 3rd party multi-processing package. For see19 purposes, Ray's key feature is the ability to share (albeit read-only) large objects among different live processes. Python's standard multi-processing module does not allow for simple access to the baseframe and, therefore, did not provide any performance benefits.

Numba provides just-in-time compiling of certain numpy implementations. The custom Numba function typically provides 10x speed improvement versus the same built-in Pandas method.
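For readers unfamiliar with Numba, the general pattern looks like this (a generic illustration, not see19's internal code):

import numpy as np
from numba import njit

@njit
def rolling_mean(values, window):
    # simple trailing moving average compiled to machine code by Numba
    out = np.empty(values.size)
    for i in range(values.size):
        lo = max(0, i - window + 1)
        out[i] = values[lo:i + 1].mean()
    return out

# the first call triggers compilation; subsequent calls run the compiled version
rolling_mean(np.arange(10.0), 3)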

Ray is not compatible with Windows. CaseStudy will attempt to detect incompatibility and revert to a single-process method where necessary.

To support this, a root BaseStudy implementation provides single-process functionality, and a RayStudy child implements the Ray functionality. CaseStudy inherits from the appropriate class automatically based on the operating system.

You can see which class is inherited as per below (this is on a Macbook)

CaseStudy.__bases__
(casestudy.see19.see19.study.ray.RayStudy,)

To use the non-Ray implementation, you can either import BaseStudy directly or set use_ray=False on CaseStudy.

We can see both approaches provide similar results below.

# from see19.study.base import BaseStudy
from casestudy.see19.see19.study.base import BaseStudy
from datetime import datetime as dt
def clockwrap(func):
    # Times a single call of `func` and returns the elapsed time as a timedelta.
    # Note that wrapper() is invoked immediately, so clockwrap(f) returns the
    # duration rather than a decorated function.
    def wrapper(*args, **kwargs):
        start = dt.now()
        func()
        end = dt.now()

        return end - start

    return wrapper()
casestudy = BaseStudy(bf)
dur1 = clockwrap(casestudy.make)
print (dur1)
/Users/spindicate/Documents/programming/envs/zooenv/lib/python3.7/site-packages/ipykernel_launcher.py:1: UserWarning: It looks like you called BaseStudy directly. This is not recommended. Ray provides significant performance improvements and certain BaseStudy methods are not optimized.
  """Entry point for launching an IPython kernel.





0:00:28.674439
casestudy = CaseStudy(bf, use_ray=False)
dur2 = clockwrap(casestudy.make)
print (dur2)
/Users/spindicate/Documents/programming/envs/zooenv/lib/python3.7/site-packages/ipykernel_launcher.py:1: UserWarning: use_ray set to False. This is not recommended. Ray provides significant performance improvements and certain BaseStudy methods are not optimized.
  """Entry point for launching an IPython kernel.





0:00:27.573194

Now we'll compare that with the default Ray implementation on an 8-core MacBook Pro.

casestudy = CaseStudy(bf)
dur3 = clockwrap(casestudy.make)
print (dur3)


0:00:06.225569
diff = 1 - dur3 / (np.mean([dur1, dur2]))
print ('You can see that the Ray implementation is \033[4m\033[1m{:.2%}\033[0m faster.'.format(diff))
You can see that the Ray implementation is 77.86% faster.

Note: Both Numba and Ray perform compilation and caching on the first call of a function. Thus, the first call to the make() method in a session will incur an additional delay (while many functions are compiled and cached). All subsequent calls will see the significant performance improvements.

4.7 Chart Objects

Each casestudy object currently contains 6 different chart objects that provide visual tools for analyzing, assessing, and comparing COVID-19's impact across different regions and factors. Each chart is created via matplotlib. Details of each chart object are provided in the sections that follow.

The chart classes can be found in the charts module, along with the BaseChart root, which provides common functionality.

compchart from CompChart2D
compchart4d from CompChart4D
heatmap from HeatMap
barcharts from BarCharts
scatterflow from ScatterFlow
substrinscat from SubStrindexScatter

Each chart has been designed to align closely with the CaseStudy functionality and with the underlying functionality of matplotlib.

For instance, each chart is called via the make method.

casestudy.regions = ['NY', 'NJ']
casestudy.make()
leg = {'fontsize': 12, 'handlelength': 1}
casestudy.compchart.make(x_category='days', y_category='cases', figsize=(8,4), legend_params=leg)
Cumulative Cases

png

Each chart object is automatically updated on each make call, so any changes to the casestudy object will also be reflected in the charts.

casestudy.regions = ['AB', 'ON']
casestudy.make()
casestudy.compchart.make(x_category='days', y_category='cases', figsize=(8,4), legend_params=leg)
Cumulative Cases

png

Note: a prior version of see19 implemented compchart using Bokeh. That chart is deprecated and has been replaced with a matplotlib version, but is still available under CompChart2DBokeh.

5. compchart - Visualizing Regional Impacts

5.1 Daily Fatalities Comparison - Italy
5.2 Daily Fatalities Comparison - 5 Most Impacted Regions
5.3 Varying the Categories

The compchart attribute is an instance of the CompChart2D class and provides standard line graphs comparing regions on the categories provided to x_category & y_category. Time-series plots are supported when x_category='date'.

Charts are available in multi-line format with optional overlay of a second factor on a separate y-axis.

5.1 Daily Fatalities Comparison - Italy

We will illustrate with an example, focusing on only the three most impacted regions in Italy.

itaregs = bf[bf['country'] == 'Italy'] \
    .sort_values(by='deaths', ascending=False).region_name.unique().tolist()[:3]

casestudy = CaseStudy(bf, regions=itaregs, start_hurdle=3, start_factor='deaths', smooth=False)
casestudy.make()

When CaseStudy is instantiated, compchart is also instantiated with its own attributes.

print (casestudy.compchart)
<casestudy.see19.see19.charts.CompChart2D object at 0x32dee3950>

In particular, all of the available categories are automatically provided labels via the labels attribute. A selection is shown below for illustration purposes.

for k,v in casestudy.compchart.labels.items():
    print ('{}: {}'.format(k, v))
    if k == 'temp':
        break
cases_dma: Cumulative Cases (3DMA)
cases_new: Daily Cases
cases_new_dma: Daily Cases (3DMA)
deaths_dma: Cumulative Deaths (3DMA)
deaths_new: Daily Deaths
deaths_new_dma: Daily Deaths (3DMA)
tests_dma: Cumulative Tests (3DMA)
tests_new: Daily Tests
tests_new_dma: Daily Tests (3DMA)
cases: Cumulative Cases
deaths: Cumulative Deaths
tests: Cumulative Tests
cases_dma_per_1K: Cumulative Cases per 1K (3DMA)
cases_dma_per_1M: Cumulative Cases per 1M (3DMA)
cases_dma_per_person_per_land_KM2: Cumulative Cases / Person / Land KM² (3DMA)
cases_dma_per_person_per_city_KM2: Cumulative Cases / Person / City KM² (3DMA)
cases_new_per_1K: Daily Cases per 1K
cases_new_per_1M: Daily Cases per 1M
cases_new_per_person_per_land_KM2: Daily Cases / Person / Land KM²
cases_new_per_person_per_city_KM2: Daily Cases / Person / City KM²
cases_new_dma_per_1K: Daily Cases per 1K (3DMA)
cases_new_dma_per_1M: Daily Cases per 1M (3DMA)
cases_new_dma_per_person_per_land_KM2: Daily Cases / Person / Land KM² (3DMA)
cases_new_dma_per_person_per_city_KM2: Daily Cases / Person / City KM² (3DMA)
deaths_dma_per_1K: Cumulative Deaths per 1K (3DMA)
deaths_dma_per_1M: Cumulative Deaths per 1M (3DMA)
deaths_dma_per_person_per_land_KM2: Cumulative Deaths / Person / Land KM² (3DMA)
deaths_dma_per_person_per_city_KM2: Cumulative Deaths / Person / City KM² (3DMA)
deaths_new_per_1K: Daily Deaths per 1K
deaths_new_per_1M: Daily Deaths per 1M
deaths_new_per_person_per_land_KM2: Daily Deaths / Person / Land KM²
deaths_new_per_person_per_city_KM2: Daily Deaths / Person / City KM²
deaths_new_dma_per_1K: Daily Deaths per 1K (3DMA)
deaths_new_dma_per_1M: Daily Deaths per 1M (3DMA)
deaths_new_dma_per_person_per_land_KM2: Daily Deaths / Person / Land KM² (3DMA)
deaths_new_dma_per_person_per_city_KM2: Daily Deaths / Person / City KM² (3DMA)
tests_dma_per_1K: Cumulative Tests per 1K (3DMA)
tests_dma_per_1M: Cumulative Tests per 1M (3DMA)
tests_dma_per_person_per_land_KM2: Cumulative Tests / Person / Land KM² (3DMA)
tests_dma_per_person_per_city_KM2: Cumulative Tests / Person / City KM² (3DMA)
tests_new_per_1K: Daily Tests per 1K
tests_new_per_1M: Daily Tests per 1M
tests_new_per_person_per_land_KM2: Daily Tests / Person / Land KM²
tests_new_per_person_per_city_KM2: Daily Tests / Person / City KM²
tests_new_dma_per_1K: Daily Tests per 1K (3DMA)
tests_new_dma_per_1M: Daily Tests per 1M (3DMA)
tests_new_dma_per_person_per_land_KM2: Daily Tests / Person / Land KM² (3DMA)
tests_new_dma_per_person_per_city_KM2: Daily Tests / Person / City KM² (3DMA)
cases_per_1K: Cumulative Cases per 1K
cases_per_1M: Cumulative Cases per 1M
cases_per_person_per_land_KM2: Cumulative Cases / Person / Land KM²
cases_per_person_per_city_KM2: Cumulative Cases / Person / City KM²
deaths_per_1K: Cumulative Deaths per 1K
deaths_per_1M: Cumulative Deaths per 1M
deaths_per_person_per_land_KM2: Cumulative Deaths / Person / Land KM²
deaths_per_person_per_city_KM2: Cumulative Deaths / Person / City KM²
tests_per_1K: Cumulative Tests per 1K
tests_per_1M: Cumulative Tests per 1M
tests_per_person_per_land_KM2: Cumulative Tests / Person / Land KM²
tests_per_person_per_city_KM2: Cumulative Tests / Person / City KM²
cases_dma_lognat: Cumulative Cases (3DMA)
(Natural Log)
cases_new_lognat: Daily Cases
(Natural Log)
cases_new_dma_lognat: Daily Cases (3DMA)
(Natural Log)
deaths_dma_lognat: Cumulative Deaths (3DMA)
(Natural Log)
deaths_new_lognat: Daily Deaths
(Natural Log)
deaths_new_dma_lognat: Daily Deaths (3DMA)
(Natural Log)
tests_dma_lognat: Cumulative Tests (3DMA)
(Natural Log)
tests_new_lognat: Daily Tests
(Natural Log)
tests_new_dma_lognat: Daily Tests (3DMA)
(Natural Log)
cases_lognat: Cumulative Cases
(Natural Log)
deaths_lognat: Cumulative Deaths
(Natural Log)
tests_lognat: Cumulative Tests
(Natural Log)
cases_dma_per_1K_lognat: Cumulative Cases per 1K (3DMA)
(Natural Log)
cases_dma_per_1M_lognat: Cumulative Cases per 1M (3DMA)
(Natural Log)
cases_dma_per_person_per_land_KM2_lognat: Cumulative Cases / Person / Land KM² (3DMA)
(Natural Log)
cases_dma_per_person_per_city_KM2_lognat: Cumulative Cases / Person / City KM² (3DMA)
(Natural Log)
cases_new_per_1K_lognat: Daily Cases per 1K
(Natural Log)
cases_new_per_1M_lognat: Daily Cases per 1M
(Natural Log)
cases_new_per_person_per_land_KM2_lognat: Daily Cases / Person / Land KM²
(Natural Log)
cases_new_per_person_per_city_KM2_lognat: Daily Cases / Person / City KM²
(Natural Log)
cases_new_dma_per_1K_lognat: Daily Cases per 1K (3DMA)
(Natural Log)
cases_new_dma_per_1M_lognat: Daily Cases per 1M (3DMA)
(Natural Log)
cases_new_dma_per_person_per_land_KM2_lognat: Daily Cases / Person / Land KM² (3DMA)
(Natural Log)
cases_new_dma_per_person_per_city_KM2_lognat: Daily Cases / Person / City KM² (3DMA)
(Natural Log)
deaths_dma_per_1K_lognat: Cumulative Deaths per 1K (3DMA)
(Natural Log)
deaths_dma_per_1M_lognat: Cumulative Deaths per 1M (3DMA)
(Natural Log)
deaths_dma_per_person_per_land_KM2_lognat: Cumulative Deaths / Person / Land KM² (3DMA)
(Natural Log)
deaths_dma_per_person_per_city_KM2_lognat: Cumulative Deaths / Person / City KM² (3DMA)
(Natural Log)
deaths_new_per_1K_lognat: Daily Deaths per 1K
(Natural Log)
deaths_new_per_1M_lognat: Daily Deaths per 1M
(Natural Log)
deaths_new_per_person_per_land_KM2_lognat: Daily Deaths / Person / Land KM²
(Natural Log)
deaths_new_per_person_per_city_KM2_lognat: Daily Deaths / Person / City KM²
(Natural Log)
deaths_new_dma_per_1K_lognat: Daily Deaths per 1K (3DMA)
(Natural Log)
deaths_new_dma_per_1M_lognat: Daily Deaths per 1M (3DMA)
(Natural Log)
deaths_new_dma_per_person_per_land_KM2_lognat: Daily Deaths / Person / Land KM² (3DMA)
(Natural Log)
deaths_new_dma_per_person_per_city_KM2_lognat: Daily Deaths / Person / City KM² (3DMA)
(Natural Log)
tests_dma_per_1K_lognat: Cumulative Tests per 1K (3DMA)
(Natural Log)
tests_dma_per_1M_lognat: Cumulative Tests per 1M (3DMA)
(Natural Log)
tests_dma_per_person_per_land_KM2_lognat: Cumulative Tests / Person / Land KM² (3DMA)
(Natural Log)
tests_dma_per_person_per_city_KM2_lognat: Cumulative Tests / Person / City KM² (3DMA)
(Natural Log)
tests_new_per_1K_lognat: Daily Tests per 1K
(Natural Log)
tests_new_per_1M_lognat: Daily Tests per 1M
(Natural Log)
tests_new_per_person_per_land_KM2_lognat: Daily Tests / Person / Land KM²
(Natural Log)
tests_new_per_person_per_city_KM2_lognat: Daily Tests / Person / City KM²
(Natural Log)
tests_new_dma_per_1K_lognat: Daily Tests per 1K (3DMA)
(Natural Log)
tests_new_dma_per_1M_lognat: Daily Tests per 1M (3DMA)
(Natural Log)
tests_new_dma_per_person_per_land_KM2_lognat: Daily Tests / Person / Land KM² (3DMA)
(Natural Log)
tests_new_dma_per_person_per_city_KM2_lognat: Daily Tests / Person / City KM² (3DMA)
(Natural Log)
cases_per_1K_lognat: Cumulative Cases per 1K
(Natural Log)
cases_per_1M_lognat: Cumulative Cases per 1M
(Natural Log)
cases_per_person_per_land_KM2_lognat: Cumulative Cases / Person / Land KM²
(Natural Log)
cases_per_person_per_city_KM2_lognat: Cumulative Cases / Person / City KM²
(Natural Log)
deaths_per_1K_lognat: Cumulative Deaths per 1K
(Natural Log)
deaths_per_1M_lognat: Cumulative Deaths per 1M
(Natural Log)
deaths_per_person_per_land_KM2_lognat: Cumulative Deaths / Person / Land KM²
(Natural Log)
deaths_per_person_per_city_KM2_lognat: Cumulative Deaths / Person / City KM²
(Natural Log)
tests_per_1K_lognat: Cumulative Tests per 1K
(Natural Log)
tests_per_1M_lognat: Cumulative Tests per 1M
(Natural Log)
tests_per_person_per_land_KM2_lognat: Cumulative Tests / Person / Land KM²
(Natural Log)
tests_per_person_per_city_KM2_lognat: Cumulative Tests / Person / City KM²
(Natural Log)
cases_dma_log: Cumulative Cases (3DMA)
(Log Base 10)
cases_new_log: Daily Cases
(Log Base 10)
cases_new_dma_log: Daily Cases (3DMA)
(Log Base 10)
deaths_dma_log: Cumulative Deaths (3DMA)
(Log Base 10)
deaths_new_log: Daily Deaths
(Log Base 10)
deaths_new_dma_log: Daily Deaths (3DMA)
(Log Base 10)
tests_dma_log: Cumulative Tests (3DMA)
(Log Base 10)
tests_new_log: Daily Tests
(Log Base 10)
tests_new_dma_log: Daily Tests (3DMA)
(Log Base 10)
cases_log: Cumulative Cases
(Log Base 10)
deaths_log: Cumulative Deaths
(Log Base 10)
tests_log: Cumulative Tests
(Log Base 10)
cases_dma_per_1K_log: Cumulative Cases per 1K (3DMA)
(Log Base 10)
cases_dma_per_1M_log: Cumulative Cases per 1M (3DMA)
(Log Base 10)
cases_dma_per_person_per_land_KM2_log: Cumulative Cases / Person / Land KM² (3DMA)
(Log Base 10)
cases_dma_per_person_per_city_KM2_log: Cumulative Cases / Person / City KM² (3DMA)
(Log Base 10)
cases_new_per_1K_log: Daily Cases per 1K
(Log Base 10)
cases_new_per_1M_log: Daily Cases per 1M
(Log Base 10)
cases_new_per_person_per_land_KM2_log: Daily Cases / Person / Land KM²
(Log Base 10)
cases_new_per_person_per_city_KM2_log: Daily Cases / Person / City KM²
(Log Base 10)
cases_new_dma_per_1K_log: Daily Cases per 1K (3DMA)
(Log Base 10)
cases_new_dma_per_1M_log: Daily Cases per 1M (3DMA)
(Log Base 10)
cases_new_dma_per_person_per_land_KM2_log: Daily Cases / Person / Land KM² (3DMA)
(Log Base 10)
cases_new_dma_per_person_per_city_KM2_log: Daily Cases / Person / City KM² (3DMA)
(Log Base 10)
deaths_dma_per_1K_log: Cumulative Deaths per 1K (3DMA)
(Log Base 10)
deaths_dma_per_1M_log: Cumulative Deaths per 1M (3DMA)
(Log Base 10)
deaths_dma_per_person_per_land_KM2_log: Cumulative Deaths / Person / Land KM² (3DMA)
(Log Base 10)
deaths_dma_per_person_per_city_KM2_log: Cumulative Deaths / Person / City KM² (3DMA)
(Log Base 10)
deaths_new_per_1K_log: Daily Deaths per 1K
(Log Base 10)
deaths_new_per_1M_log: Daily Deaths per 1M
(Log Base 10)
deaths_new_per_person_per_land_KM2_log: Daily Deaths / Person / Land KM²
(Log Base 10)
deaths_new_per_person_per_city_KM2_log: Daily Deaths / Person / City KM²
(Log Base 10)
deaths_new_dma_per_1K_log: Daily Deaths per 1K (3DMA)
(Log Base 10)
deaths_new_dma_per_1M_log: Daily Deaths per 1M (3DMA)
(Log Base 10)
deaths_new_dma_per_person_per_land_KM2_log: Daily Deaths / Person / Land KM² (3DMA)
(Log Base 10)
deaths_new_dma_per_person_per_city_KM2_log: Daily Deaths / Person / City KM² (3DMA)
(Log Base 10)
tests_dma_per_1K_log: Cumulative Tests per 1K (3DMA)
(Log Base 10)
tests_dma_per_1M_log: Cumulative Tests per 1M (3DMA)
(Log Base 10)
tests_dma_per_person_per_land_KM2_log: Cumulative Tests / Person / Land KM² (3DMA)
(Log Base 10)
tests_dma_per_person_per_city_KM2_log: Cumulative Tests / Person / City KM² (3DMA)
(Log Base 10)
tests_new_per_1K_log: Daily Tests per 1K
(Log Base 10)
tests_new_per_1M_log: Daily Tests per 1M
(Log Base 10)
tests_new_per_person_per_land_KM2_log: Daily Tests / Person / Land KM²
(Log Base 10)
tests_new_per_person_per_city_KM2_log: Daily Tests / Person / City KM²
(Log Base 10)
tests_new_dma_per_1K_log: Daily Tests per 1K (3DMA)
(Log Base 10)
tests_new_dma_per_1M_log: Daily Tests per 1M (3DMA)
(Log Base 10)
tests_new_dma_per_person_per_land_KM2_log: Daily Tests / Person / Land KM² (3DMA)
(Log Base 10)
tests_new_dma_per_person_per_city_KM2_log: Daily Tests / Person / City KM² (3DMA)
(Log Base 10)
cases_per_1K_log: Cumulative Cases per 1K
(Log Base 10)
cases_per_1M_log: Cumulative Cases per 1M
(Log Base 10)
cases_per_person_per_land_KM2_log: Cumulative Cases / Person / Land KM²
(Log Base 10)
cases_per_person_per_city_KM2_log: Cumulative Cases / Person / City KM²
(Log Base 10)
deaths_per_1K_log: Cumulative Deaths per 1K
(Log Base 10)
deaths_per_1M_log: Cumulative Deaths per 1M
(Log Base 10)
deaths_per_person_per_land_KM2_log: Cumulative Deaths / Person / Land KM²
(Log Base 10)
deaths_per_person_per_city_KM2_log: Cumulative Deaths / Person / City KM²
(Log Base 10)
tests_per_1K_log: Cumulative Tests per 1K
(Log Base 10)
tests_per_1M_log: Cumulative Tests per 1M
(Log Base 10)
tests_per_person_per_land_KM2_log: Cumulative Tests / Person / Land KM²
(Log Base 10)
tests_per_person_per_city_KM2_log: Cumulative Tests / Person / City KM²
(Log Base 10)
: January 2020
population: Population
land_dens: Density of Land Area
city_dens: Population Density of Largest City
uvb: UV-B Radiation in J / M²
rhum: Relative Humidity
strindex: Oxford Stringency Index
visitors: Annual Visitors
visitors_%: Annual Visitors as % of Population
gdp: Gross Domestic Product
gdp_%: Gross Domestic Product per Capita
retail_n_rec: Change in Retail n Recreation Mobility
transit: Change in Transit Mobility
workplaces: Change in WorkPlace Mobility
residential: Change in Residential Mobility
parks: Change in Parks Mobility
groc_n_pharm: Change in Grocery & Pharmacy Mobility
transit_apple: Change in Transit Mobility - Apple
driving_apple: Change in Driving Mobility - Apple
walking_apple: Change in Walking Mobility - Apple
c1: School Closing
c2: Workplace Closing
c3: Cancel Public Events
c4: Restrictions on Gatherings
c5: Close Public Transport
c6: Stay-at-Home Requirements
c7: Restrictions on Internal Movement
c8: International Travel Controls
e1: Income Support
e2: Debt / Contract Relief
e3: Fiscal Measures
e4: International Support
h1: Public Information Campaigns
h2: Testing Policy
h3: Contact Tracing
h4: Emergency Investment in Health Care
h5: Investment in Vaccines
key3_sum: Sum of Key 3 Categories
key3_sum_earlier: Sum of Key 3 Oxford Stingency Factor Weighted to Earlier Dates
make_sum: Custom Stringency Aggregate
neoplasms: NeoPlasms Fatalities
blood: Blood-based Fatalities
endo: Endocrine Fatalities
mental: Mental Fatalities
nervous: Nervous System Fatalities
circul: Circulatory Fatalities
infectious: Infectious Fatalities
respir: Respiratory Fatalities
digest: Digestive Fatalities
skin: Skin-related Fatalities
musculo: Musculo-skeletal Fatalities
genito: Genitourinary Fatalities
childbirth: Maternal and Childbirth Fatalities
perinatal: Perinatal Fatalities
congenital: Congenital Fatalities
other: Other Fatalities
external: External Fatalities
date: Date
temp: Temperature (°C)

make()

Similar to the main casestudy object, charts are rendered with the make method.

x_category and y_category accept any column header in casestudy.df.

make accepts many optional kwargs. Every effort is made to align these options with matplotlib standards; appropriate options can be found via the matplotlib API. These kwargs, and many others, are shared among ALL of the different see19 Chart Classes. For example:

kwargs = {
    'x_category': 'days',
    'y_category': 'cases_new',
    'width': 12,
    'height': 8,
    'title': {'t': 'Most Impacted Regions in Italy', 'fontsize': 24, 'weight': 'demi'},
    'line_params': {'lw': 4},
    'legend_params': {'fontsize': 14, 'handlelength': 1},
    'xlabel_params': {'fontsize': 18, 'labelpad': 10},
    'ylabel_params': {'fontsize': 18, 'labelpad': 10},
    'xtick_params': {'labelsize': 14},
    'ytick_params': {'labelsize': 14},
    'colors': ['red', 'green', 'blue']
}

casestudy.compchart.make(**kwargs)
Daily Cases

png

An optional regions parameter exists that allows you to further reduce the number of regions presented in the chart. regions accepts a list of region_id, region_code, or region_name in any combination.

Below, we also show that a matplotlib colormap can be provided via palette_base and that the x-axis label can be removed by setting xlabel=False.

kwargs = {
    'regions': ['LOM', 'EMI'],
    'x_category': 'date',
    'y_category': 'deaths_new',
    'width': 12,
    'height': 8,
    'title': {'t': 'Lombardia v Emilia-Romagna', 'fontsize': 24, 'weight': 'demi'},
    'line_params': {'lw': 6},
    'legend_params': {'fontsize': 14, 'handlelength': 1},
    'xlabel': False,
    'ylabel_params': {'fontsize': 18, 'labelpad': 10},
    'xtick_params': {'labelsize': 14},
    'ytick_params': {'labelsize': 14},
    'palette_base': 'Accent',
}

casestudy.compchart.make(**kwargs)
Daily Deaths

png

5.2 Daily Fatalities Comparison - 5 Most Impacted Regions

Now we'll look at daily fatalities in the 5 most impacted regions globally in terms of total fatalities.

regions = list(bf.sort_values(by='deaths', ascending=False).region_name.unique())[:5]
casestudy = CaseStudy(bf, regions=regions, start_hurdle=3, start_factor='deaths', count_dma=21, log=True)
casestudy.make()
title='5 Most Impacted Regions'

kwargs = {
    'x_category': 'days',
    'y_category': 'deaths_new',
    'width': 12,
    'height': 8,
    'title': {'t': title, 'fontsize': 24, 'weight': 'demi'},
    'line_params': {'lw': 3},
    'legend_params': {'fontsize': 14},
    'xlabel_params': {'fontsize': 18, 'labelpad': 10},
    'ylabel_params': {'fontsize': 18, 'labelpad': 10},
    'xtick_params': {'labelsize': 14},
    'ytick_params': {'labelsize': 14},
    'palette_base': 'Accent',
}
p = casestudy.compchart.make(**kwargs)
Daily Deaths

png

There are major outliers, particularly in the early days, that make the graph difficult to read. The log-adjusted categories come in handy here.

Below we also demonstrate that the regions parameter can be provided to each make call to further reduce the regions covered in the chart (for convenience).

kwargs['y_category']= 'deaths_new_dma_per_1M_log'
kwargs['ylabel_params']= {'fontsize': 18, 'labelpad': 10}
kwargs['regions'] = ['France', 'India', 'United Kingdom']

p = casestudy.compchart.make(**kwargs)
Daily Deaths per 1M (21DMA)
(Log Base 10)

png

5.3 Varying the Categories

Oxford Stringency Index

compchart can be used to compare any category or factor in casestudy.df with days or date on the x-axis.

The below chart compares the Oxford Stringency Index for each selected region.

regions = ['Germany', 'Spain', 'Taiwan']

casestudy = CaseStudy(
    bf, count_categories='cases_new_per_1M', regions=regions, 
    start_factor='', factors=['strindex']
)
casestudy.make()
kwargs = {
    'x_category': 'date',
    'y_category': 'strindex',
    'width': 12,
    'height': 8,
    'line_params': {'lw': 3},
    'legend_params': {'fontsize': 14},
    'xlabel_params': {'fontsize': 18, 'labelpad': 10},
    'ylabel_params': {'fontsize': 18, 'labelpad': 10},
    'xtick_params': {'labelsize': 14},
    'ytick_params': {'labelsize': 14},
    'palette_base': 'Accent',
}
p = casestudy.compchart.make(**kwargs)
Oxford Stringency Index

png

These graphs work best as time-series but the x_category can also be any other category in casestudy.df. Below we can see that in New York, positive cases have steadily declined even as testing has increased. Texas and Arizona have not had the same success.

regions = ['New York', 'Texas', 'Arizona']

casestudy = CaseStudy(bf, regions=regions, count_dma=21)
casestudy.make()
kwargs = {
    'x_category': 'tests_new_dma_per_1M',
    'y_category': 'cases_new_dma_per_1M',
    'width': 12,
    'height': 8,
    'line_params': {'lw': 3},
    'legend_params': {'fontsize': 14},
    'xlabel_params': {'fontsize': 18, 'labelpad': 10},
    'ylabel_params': {'fontsize': 18, 'labelpad': 10},
    'xtick_params': {'labelsize': 14},
    'ytick_params': {'labelsize': 14},
    'palette_base': 'Accent',
}
p = casestudy.compchart.make(**kwargs)
Daily Cases per 1M (21DMA)

png

Saving Files

All chart instances in see19 have a save_file option. Simply set that option to True and provide a filename, and the file will be saved to your location of choice.
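
For example (the exact keyword for the output name, shown here as filename, is an assumption rather than a verified part of the API):

kwargs = {
    'x_category': 'days',
    'y_category': 'cases',
    'save_file': True,
    'filename': 'italy_daily_cases.png',   # hypothetical output path
}
casestudy.compchart.make(**kwargs)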

6. compchart4D - Visualizing Factors in 4D

6.1 From 3D to 4D
6.2 More on the X-Axis
6.3 How Far Can We Take It?

3D charts with color-mapping can be used to explore the impact of various factors in different regions at different times.

Such '4D' maps are often criticized for lack of readability, but they have been a valuable tool for recognizing patterns.

These charts are available in CaseStudy via the compchart4d attribute, which is an instance of the CompChart4D class. The 3D representation shows the count_category for each region on the z-axis, with each day from the start_hurdle on the y-axis and the individual regions separated along the x-axis.

The 3D chart is a cute trick, but the real power is derived from the color mapping (the color_category parameter), which maps the color of each 3D bar to the factor one wants to investigate.

The CompChart4D object utilizes matplotlib for chart creation.
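
The underlying idea can be sketched with plain matplotlib (a generic illustration with random data, not the CompChart4D implementation): 3D bars whose heights carry the count category and whose colors carry a fourth variable.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D  # noqa: registers the 3d projection

n_regions, n_days = 5, 30
z = np.random.rand(n_regions, n_days).cumsum(axis=1)       # stand-in for a count category
factor = np.random.uniform(15, 30, (n_regions, n_days))    # stand-in for e.g. temperature

xs, ys = np.meshgrid(np.arange(n_regions), np.arange(n_days), indexing='ij')
norm = plt.Normalize(factor.min(), factor.max())
colors = cm.coolwarm(norm(factor.ravel()))                  # map the 4th dimension to color

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.bar3d(xs.ravel(), ys.ravel(), np.zeros(z.size), 0.8, 0.8, z.ravel(), color=colors)
sm = cm.ScalarMappable(norm=norm, cmap='coolwarm')
sm.set_array([])
fig.colorbar(sm, ax=ax, label='factor')
plt.show()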

6.1 From 3D to 4D

Most Impacted Regions - Brazil

First, we get region names from the baseframe, sorting as required.

Then we create the casestudy instance, including several factors that we'll cover in our analysis.

from casestudy.see19.see19 import CaseStudy
regions = bf[bf['country'] == 'Brazil'] \
    .sort_values(by='population', ascending=False) \
    .region_name.unique().tolist()[:20]

factor_dmas={'temp': 3}

casestudy = CaseStudy(
    bf, count_dma=5, 
    factors=['temp', 'c1', 'A65PLUSB', 'A75PLUSB'], factor_dmas=factor_dmas,
    regions=regions, start_hurdle=10, start_factor='cases', lognat=True,
)
casestudy.make()

4D charts are customizable in precisely the same way as CompChart2D, sharing many of the same keywords. compchart4d also utilizes a few unique keywords of its own, as shown below:

  • z_category is utilized to determine the z-axis (vertical). The x- and y-axes are automatically set to regions and days.
  • comp_size will further trim the number of regions by ranking them on the comp_category.
  • A separate rank_category can be provided for this ranking if preferred.
kwargs = {
    'title': {'s': 'Most Impacted Regions in Brazil', 'x': .47, 'y': .74, 'fontsize': 24, 'rotation': -9, 'weight': 'demi'},
    'ylabel_params': {'fontsize': 18, 'labelpad': 12},
    'zlabel_params': {'fontsize': 18, 'labelpad': 10},
    'xtick_params': {'labelsize': 18},
    'ytick_params': {'labelsize': 12},
    'tight': True, 'comp_size': 10,
}
p = casestudy.compchart4d.make(z_category='deaths_new_dma_per_1M', **kwargs)

png

df_chart: for most charts, the casestudy dataframe is morphed for presentation purposes. This morphed data is available via the df_chart attribute.

casestudy.compchart4d.df_chart.head()
region_id region_name region_code country date days deaths_new_dma_per_1M
10585 566 Ceara CE Brazil 2020-03-22 6 days 0.000000
10586 566 Ceara CE Brazil 2020-03-23 7 days 0.000000
10587 566 Ceara CE Brazil 2020-03-24 8 days 0.000000
10588 566 Ceara CE Brazil 2020-03-25 9 days 0.000000
10589 566 Ceara CE Brazil 2020-03-26 10 days 0.169566

Adding a Color Factor

By adding the color_category parameter, we can see the impact, if any, of an exogenous factor on the comp_category over time.

We will start with A65PLUSB_%. As this is a time-static factor, the color for each region will be the same regardless of the day.

You must provide additional options to position the color bar.

kwargs = {
    **kwargs,
    'color_category': 'A65PLUSB_%', 
    'xy_cbar': (0.09, .225), 'wh_cbar': (.015, 14),
    'cblabel_params': {'labelpad': -55},
}
p = casestudy.compchart4d.make(z_category='deaths_new_dma_per_1M', **kwargs)

png

Now we'll use temp, which is a time-dynamic factor and will provide a different color for each region on each day.

kwargs = {**kwargs, 
    'color_category': 'temp',
}
p = casestudy.compchart4d.make(z_category='deaths_new_dma_per_1M', **kwargs)

png

Fixing the Color Range

NOTE: The range of colors is automatically set by make. This can be somewhat misleading when:

  1. comparing multiple charts
  2. a single chart has values in a narrow range. In the above example, for instance, temperatures range only between 18°C and 28°C and, yet, the color map runs across almost the entire red-blue spectrum.

Thus, there is a color_interval option that allows you to fix the color interval. color_interval expects a tuple, where the first item is the low-end of the range and the second item is the high-end.

Fixing the color interval provides a very different picture of Brazil's impacted regions.

kwargs = {**kwargs, 'color_interval': (20,30)}
p = casestudy.compchart4d.make(z_category='deaths_new_dma_per_1M', **kwargs)

png

6.2 More on the X-Axis

Top 30 US States

Now we investigate the Top 30 most impacted US states.

regions = bf[bf['country_code'] == 'USA'] \
    .sort_values('cases', ascending=False) \
    .region_name.unique().tolist()[:50]
countries = 'USA'
casestudy = CaseStudy(
    bf, regions=regions, countries=countries, count_dma=14,
    factors=['temp', 'uvb', 'rhum', 'A65PLUSB', 'A75PLUSB', 'A05_24B'], factor_dmas={'temp': 14, 'uvb': 14},
    start_hurdle=10, start_factor='cases', 
)
casestudy.make()

Here 4 charts are prepared in quick succession.

Additional options are shown for editing the background grey and removing gridlines.

NOTE: CompChart4D automatically sorts the regions on the x-axis such that the regions with the greatest z-axis values are furthest away. This improves readability.

kwargs = {
    'regions': '',
    'ylabel_params': {'fontsize': 18, 'labelpad': 12},
    'zlabel_params': {'fontsize': 18, 'labelpad': 10},
    'xtick_params': {'labelsize': 12},
    'ytick_params': {'labelsize': 12},
    'ztick_params': {'labelsize': 12},
    'xy_cbar': (0.09, .225), 'wh_cbar': (.01, 20),
    'title': {'s': 'Most Impacted States in US', 'x': .47, 'y': .74, 'fontsize': 24, 'rotation': -9, 'weight': 'demi'},
    'cblabel_params': {'labelpad': -55},
    'color_category': 'temp_dma', 'color_interval': (20,30),
    'tight': True,
    'comp_size': 30,
    'rank_category': 'deaths_new_dma_per_1M',    
}

p = casestudy.compchart4d.make(z_category='deaths_new_dma_per_1M', **kwargs)
p = casestudy.compchart4d.make(z_category='deaths_new_dma_per_person_per_city_KM2', **kwargs)

kwargs['color_category'] = 'uvb_dma'
kwargs['color_interval'] = ()
kwargs['gridlines'] = False

p = casestudy.compchart4d.make(z_category='deaths_new_dma_per_1M', **kwargs)
p = casestudy.compchart4d.make(z_category='deaths_new_dma_per_person_per_city_KM2', **kwargs)

png

png

png

png

6.3 How Far Can We Take It?

101 Most Impacted Regions Globally

I acknowledge that using the chart in this way stretches its value; however, it has been a great way for me to consider trends globally. Try not to look at each individual region ... look at it more like a scatter plot and see what patterns you can identify, if any.

NOTE: If the number of regions exceeds 100, the region labels are removed automatically.

First, we sort the regions in the baseframe to find the 101 most populous.

Then, those regions are ranked on the comp_category.

compsize = 102
regions = bf[~(bf['country'] == 'China')].sort_values(by='population', ascending=False).region_name.unique().tolist()[:compsize]

factors = ['temp']
factor_dmas = {'temp': 7}

casestudy = CaseStudy(
    bf, regions=regions, factors=factors, factor_dmas=factor_dmas,
    start_hurdle=10, start_factor='cases', count_dma=3, lognat=True
)
casestudy.make()
kwargs = {
    'ylabel_params': {'fontsize': 18, 'labelpad': 12},
    'zlabel_params': {'fontsize': 18, 'labelpad': 10},
    'xtick_params': {'labelsize': 12},
    'ytick_params': {'labelsize': 12},
    'ztick_params': {'labelsize': 12},
    'xy_cbar': (0.09, .225), 'wh_cbar': (.01, 20),
    'title': {'s': 'Most Impacted Regions Totally', 'x': .47, 'y': .74, 'fontsize': 24, 'rotation': -9, 'weight': 'demi'},
    'cblabel_params': {'labelpad': -55},
    'color_category': 'temp_dma', 'color_interval': (20,30),
    'tight': True,
    'comp_size': 102,
    'rank_category': 'deaths_new_dma_per_1M', 
}

p = casestudy.compchart4d.make(z_category='deaths_new_dma_per_1M', **kwargs)

png

Now, if temperature did for some reason impact the fatality rate associated with COVID19, we would expect regions at the far end of the x-axis to tend toward the blue end of the color spectrum and regions at the near end of the x-axis to tend towards red.

We would also expect regions with higher peaks to have more blue bars at the near end of the y-axis, i.e. at times earlier in the outbreak.

7. heatmap - Visualizing with Color Maps

7.1 Count Category v Single Factor
7.2 Count Category v Multiple Factors

Hexbins?

See19 utilizes matplotlib's hexbin functionality to generate HeatMap-style charts to investigate the impact of different factors on COVID19 virulence.

This is a bit of a repurposing, or bastardization, of hexbin's intended usage. hexbin is more commonly used as a 2D histogram for very large datasets, counting the appearance of datapoints within a range of certain (x, y) coordinates (called bins) and then mapping a color scheme to the range of counts.

For our purposes, the use of hexbin is a stylistic choice, as the patterns it develops are more interesting and a bit more revealing than a scatter plot. The intention is for each bin to contain only one datapoint, with the color mapped to either the x-axis values or a 3rd dimension of values.
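
A generic matplotlib sketch of that repurposing (hypothetical data and parameters, not the see19 internals):

import numpy as np
import matplotlib.pyplot as plt

# one point per region (hypothetical values)
x = np.random.lognormal(size=200)    # e.g. max daily fatalities per 1M
y = np.random.uniform(10, 35, 200)   # e.g. average temperature
c = np.random.uniform(0, 100, 200)   # e.g. stringency index

fig, ax = plt.subplots()
# gridsize is set high enough that most bins hold a single point;
# C colors each bin by the third variable instead of by point count
hb = ax.hexbin(x, y, C=c, gridsize=40, cmap='RdPu')
fig.colorbar(hb, ax=ax, label='third dimension')
plt.show()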

Structure

As with previous charts, heatmaps are available in CaseStudy via the heatmap attribute, which is in turn an instance of the HeatMap class.

Charts are generated via the make method, which further morphs casestudy.df to arrange data for visualization.

Average over Time v Daily Points

All of the analysis to this point has considered each daily datapoint for each region separately. heatmap is different. heatmap takes (at this point) a simple mean of the x_category and y_category in question. This is a sufficient method to explore potential relationships, but true time series analysis must also be considered to project COVID19 virulence forward.

While an average is used, the timing of that average can still have an impact on the relevance of the analysis. At this stage, heatmap can take the daily moving average from the date of the peak of the x_category or from the date the region clears the start_hurdle.

This option is controlled by the x_start and color_start parameters of the make method.
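
To make the reduction concrete, here is a rough pandas sketch of the idea, assuming a tidy casestudy.df; this is illustrative only and not the HeatMap internals:

import pandas as pd

def one_point_per_region(df, x_cat, y_cat, y_start='max'):
    rows = []
    for region, grp in df.groupby('region_name'):
        grp = grp.sort_values('date').reset_index(drop=True)
        peak = grp[x_cat].idxmax()                  # day the x_category peaks
        anchor = peak if y_start == 'max' else 0    # or the day the start_hurdle was cleared
        rows.append({
            'region_name': region,
            x_cat: grp.loc[peak, x_cat],            # the peak value itself
            y_cat: grp.loc[anchor, y_cat],          # the (moving-average) factor on that day
        })
    return pd.DataFrame(rows)

# e.g. one_point_per_region(casestudy.df, 'deaths_new_dma_per_1M', 'temp_dma')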

For this analysis we need a large dataset, so we will start with the top 250 regions in terms of population and add many different factors.

excluded_countries = ['China']
excluded_regions = []

frame_filter = (~bf['country'].isin(excluded_countries)) & (~bf['region_name'].isin(excluded_regions))
regions = bf[frame_filter] \
    .sort_values('population', ascending=False) \
    .region_name.unique().tolist()[:250]

factors_with_dmas = CaseStudy.MSMTS + ['strindex']
factor_dmas = {factor: 28 for factor in factors_with_dmas}
factor_dmas['strindex'] = 14
factors = factors_with_dmas + CaseStudy.MAJOR_CAUSES + ['visitors', 'A75PLUSB', 'A65PLUSB', 'gdp']

casestudy = CaseStudy(
    bf, regions=regions, count_dma=14, factors=factors, 
    factor_dmas=factor_dmas, start_hurdle=1, start_factor='deaths', log=True, lognat=True,
)
casestudy.make()

7.1 Count Category v Single Factor

heatmap takes a similar set of options as compchart and compchart4d. The biggest difference in approach relates to text annotations:
  • In compchart and compchart4d, specific parameters for title, subtitle, etc. generate text boxes for specific purposes.
  • In heatmap, this is replaced in favor of a more flexible approach of ad-hoc text annotations via the annotations parameter.
  • heatmap has tended to require lengthier notations / explanations, so this approach seemed more appropriate.

In addition to the standard count categories, heatmap's axes can also plot factors: the x_category and y_category parameters accept count categories and factor columns alike.

The below chart is plotted on a linear scale of daily fatalities. It hints at a potential relationship between fatalities and temperature for the most impacted regions; however, the scaling is negatively impacted by a handful of outliers.

NOTE: color_category is not provided; therefore, the color map is a function of the x_category values (on the x-axis).

Max Fatalities v Temperature

title = 'Max Daily Fatalities v Temperature by Region'
subtitle = '*Average temperature for two weeks prior to day of 3rd fatality'
note = '**{} Regions considered excluding mainland China'.format(casestudy.df.region_id.unique().shape[0])
kwargs = {
    'x_category': 'deaths_new_dma_per_1M',
    'y_category': 'temp_dma',
    'annotations': [
        [0, 1.09, title, {'color': 'black', 'fontsize': 16, 'ha': 'left', 'va': 'center',}],
        [0, 1.05, subtitle, {'color': 'black', 'fontsize': 12, 'ha': 'left', 'va': 'center', 'style': 'italic'}],
        [0, 1.01, note, {'color': 'black', 'fontsize': 12, 'ha': 'left', 'va': 'center', 'style': 'italic'}],
    ],
    'xtick_params': {'size': 12},
    'ytick_params': {'size': 12},
    'xlabel_params': {'size': 12},
    'ylabel_params': {'size': 16},
    'width': 12, 'height': 8,
}
plt = casestudy.heatmap.make(**kwargs)

png

The root data for the chart is available via the df_chart attribute.

casestudy.heatmap.df_chart.head()
region_id region_name temp_dma deaths_new_dma_per_1M
9 52 Idaho 20.192015 0.428860
69 312 Bahrain 33.111273 0.274820
48 98 Nebraska 26.321220 0.240344
214 563 Mato Grosso Do Sul 23.137148 0.224056
219 568 Sergipe 26.239815 0.215220

Natural Log of Max Fatalities v Temperature

By taking the natural log of the fatality rate, we can rescale the figure to reveal a (potentially) clearer relationship.

Viewers often struggle to understand the scaling of a natural log, so an hlines option has been provided that will create horizontal lines at the y-values input. hlines requires a list of y-values.

Text annotations are then included to inform of the unscaled comp_category value at each hline.

We also make use of the x_start parameter, which determines whether the 28DMA is taken on the day the start_hurdle is cleared or on the day of the peak fatality rate for each region.

title = 'Max Daily Fatalities v Temperature by Region'
kwargs = {
    'x_category': 'deaths_new_dma_per_1M_log',
    'y_category': 'temp_dma',
    'x_start': 'start_hurdle',
    'annotations': [
        [0, 1.09, title, {'color': 'black', 'fontsize': 16, 'ha': 'left', 'va': 'center',}],
        [0, 1.05, subtitle, {'color': 'black', 'fontsize': 12, 'ha': 'left', 'va': 'center', 'style': 'italic'}],
        [0, 1.01, note, {'color': 'black', 'fontsize': 12, 'ha': 'left', 'va': 'center', 'style': 'italic'}],
    ],
    'xtick_params': {'size': 12},
    'ytick_params': {'size': 12},
    'xlabel_params': {'size': 12, 'labelpad': 10},
    'ylabel_params': {'size': 16},
    'width': 12, 'height': 8,
}
plt = casestudy.heatmap.make(**kwargs)

png

As with the other chart instances, a chart-specific dataframe can be accessed for heatmap via the df_chart attribute.

casestudy.heatmap.df_chart.head(4)
region_id region_name temp_dma deaths_new_dma_per_1M_log
9 52 Idaho 20.192015 -0.367684
69 312 Bahrain 33.111273 -0.560952
48 98 Nebraska 26.321220 -0.619168
214 563 Mato Grosso Do Sul 23.137148 -0.649644

Lognat of Max Daily New Fatalities and UVB Radiation

title = 'Max Daily Fatalities v UVB Radiation by Region'
subtitle = '*Color-mapped by average daily uvb radiation for two weeks prior to the day of max fatalities'
kwargs = {
    'x_category': 'cases_new_dma_per_person_per_city_KM2_log',
    'y_category': 'uvb_dma',
    'x_start': 'max',
    'annotations': [
        [0, 1.09,  title, {'color': 'black', 'fontsize': 16, 'ha': 'left', 'va': 'center',}],
        [0, 1.05, subtitle, {'color': 'black', 'fontsize': 12, 'ha': 'left', 'va': 'center', 'style': 'italic'}],
    ],
    'xtick_params': {'size': 12},
    'ytick_params': {'size': 12},
    'xlabel_params': {'size': 12, 'labelpad': 10},
    'ylabel_params': {'size': 16},
    'width': 12, 'height': 8,
}
plt = casestudy.heatmap.make(**kwargs)

png

7.2 Count Category v Multiple Factors (with one factor color-mapped)

The heatmap is made all the more powerful when a second factor is used to map the color space of the chart.

This is done via the color_category parameter, which can be adapted via the color_start parameter so that the average is taken on the day the start_hurdle is cleared or on the day of the max count category.

title = 'Max Daily Fatalities v UVB Radiation v Oxford Stringency Index'
subtitle = '*Average UVB radiation and Oxford Stringency Index for two weeks prior to day of 1st fatality'
kwargs = {
    'x_category': 'cases_new_dma_per_1M_lognat',
    'color_category': 'strindex_dma',
    'color_start': 'start_hurdle',
    'y_category': 'uvb_dma',
    'annotations': [
        [0, 1.09, title, {'color': 'black', 'fontsize': 16, 'ha': 'left', 'va': 'center',}],
        [0, 1.05, subtitle, {'color': 'black', 'fontsize': 12, 'ha': 'left', 'va': 'center', 'style': 'italic'}],
    ],
    'xtick_params': {'size': 12},
    'ytick_params': {'size': 12},
    'xlabel_params': {'size': 12, 'labelpad': 10},
    'ylabel_params': {'size': 16},
    'width': 12, 'height': 8,
}
plt = casestudy.heatmap.make(**kwargs)

png

The heatmap approach is even better suited to time-static variables like demographic age ranges, given they are not susceptible to issues around averages over time.

Below we compare A75PLUSB_% against the average strindex for the 14 days prior to the max fatality rate.

We can see that social distancing stringency was quite common across the spectrum and that population age was a much more important variable impacting fatalities.

title = 'Max Daily Fatalities v UVB Radiation v Oxford Stringency Index'
subtitle = '*Average UVB radiation and Oxford Stringency Index for two weeks prior to day of 1st fatality'
note = '**Excludes mainland China'

kwargs = {
    'x_category': 'deaths_new_dma_per_person_per_city_KM2_lognat',
    'y_category': 'A75PLUSB_%',
    'color_category': 'strindex_dma',
    'color_start': 'max',
    'annotations': [
        [0, 1.095, title, {'color': 'black', 'fontsize': 16, 'ha': 'left', 'va': 'center',}],
        [0, 1.055, subtitle, {'color': 'black', 'fontsize': 12, 'ha': 'left', 'va': 'center', 'style': 'italic'}],
        [0, 1.015, note, {'color': 'black', 'fontsize': 12, 'ha': 'left', 'va': 'center', 'style': 'italic'}],
    ],
    'xtick_params': {'size': 12},
    'ytick_params': {'size': 12},
    'xlabel_params': {'size': 12, 'labelpad': 10},
    'ylabel_params': {'size': 16},
    'width': 12, 'height': 8,
}
plt = casestudy.heatmap.make(**kwargs)

png

8. barcharts - Comparing Regional Factors

A barcharts attribute is available (via the BarCharts class) as another handy feature for comparing the impact in different regions across different categories.

The object plots a single category on a single plot comparing multiple regions. You can provide multiple categories and multiple subplots will be returned!

The barcharts object utilizes matplotlib.

First, instantiate the casestudy. We will consider several of the more successful Asian regions alongside some notable hotspots.

dragons = ['Hong Kong', 'Taiwan', 'Korea, South', 'Japan']
notables = [ 'Texas', 'New York', 'Lombardia', 'Sao Paulo']
regions = notables + dragons

factors_with_dmas = ['uvb', 'temp'] + CaseStudy.STRINDEX_CATS
factor_dmas = {factor: 28 for factor in factors_with_dmas}
mobi_dmas = {'transit': 28, 'retail_n_rec': 28, 'parks': 28, 'workplaces': 28}
factors = factors_with_dmas + CaseStudy.GMOBIS + ['A15_34B', 'A65PLUSB'] \
    + ['visitors', 'gdp'] + CaseStudy.MAJOR_CAUSES

casestudy = CaseStudy(
    bf, regions=regions, count_dma=21, factors=factors, factor_dmas=factor_dmas, 
    mobi_dmas=mobi_dmas, start_hurdle=1, start_factor='deaths',
    favor_earlier=True, factors_to_favor_earlier='key3_sum',
)
casestudy.make()

barcharts accepts any category in the see19 dataset. bar_colors provides different coloring of the groups in the chart, and you can further highlight certain regions via feature_regions. Below we see a stark difference among the regions selected.

factors1 = ['cases_per_1M', 'deaths_per_1M']
kwargs = {'categories': factors1, 'height': 5, 'bar_colors': ['#3D7068', '#D4AFB9', '#529FD7']}
kwargs['feature_regions'] = ['HKG', 'TWN', 'KOR']
plt = casestudy.barcharts.make(**kwargs)

png

Once again, the chart data is available via df_chart:

casestudy.barcharts.df_chart
region_code NY SP LOM TX JPN KOR HKG TWN
region_id 75 556 36 67 429 433 353 497
region_code NY SP LOM TX JPN KOR HKG TWN
cases 407326 416434 95548 332434 25706 13816 1655 451
deaths 25056 19788 16796 4020 988 296 10 7
tests 5.16481e+06 1.15885e+06 724365 2.98455e+06 639821 1.44335e+06 442256 79506
population 1.93781e+07 4.1142e+07 9.63118e+06 2.51456e+07 1.28057e+08 4.79908e+07 7.02728e+06 2.25314e+07
city_dens 13978.1 8184.1 2316.88 924.007 8440.43 5032.81 9261.85 7919.49
cases_per_1M 21019.9 10121.9 9920.7 13220.4 200.738 287.889 235.511 20.0165
deaths_per_1M 1293.01 480.969 1743.92 159.869 7.71529 6.16785 1.42303 0.310678

barcharts can compare daily case and fatality rates. When a daily figure is selected, barcharts will find the maximum value in the time-series.
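
Roughly, that reduction works like this quick pandas sketch (not the BarCharts internals; column names assumed from casestudy.df):

max_by_region = (
    casestudy.df
    .groupby('region_name')['deaths_new_dma_per_1M']
    .max()                           # the peak of the daily series for each region
    .sort_values(ascending=False)
)
print(max_by_region.head())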

factors2 = ['deaths_new_dma_per_1M', 'deaths_new_dma_per_person_per_city_KM2']
kwargs = {'categories': factors2, 'height': 5, 'bar_colors': ['#3D7068', '#D4AFB9', '#529FD7']}
kwargs['feature_regions'] = ['HKG', 'TWN', 'KOR']
plt = casestudy.barcharts.make(**kwargs)

png

As a matter of convenience, barcharts will automatically structure a subplot grid for any number of categories greater than 2.

factors = [
    'strindex_dma', 'tests_new_dma_per_1M', 
    'population', 'city_dens', 
    'A15_34B_%', 'A65PLUSB_%', 
    'temp_dma', 'uvb_dma',
    'circul_%', 'endo_%',
    'visitors_%'
]
factors = factors1 + factors2 + factors
kwargs = {'categories': factors, 'height': 50, 'bar_colors': ['#3D7068', '#D4AFB9', '#529FD7']}
kwargs['title'] = {'t': 'COVID Dragons v Other Regions', 'y': .895, 'fontsize': 20, 'fontweight': 'demi'}
kwargs['feature_regions'] = ['HKG', 'TWN', 'KOR']
plt = casestudy.barcharts.make(**kwargs)

png

9. Scatterflow for Large Sets

9.1 SubStrindexScatter
9.2 ScatterFlow

The plots explored above have limitations when investigating a large set of subjects. Multi-line plots tend to become unreadable when using more than, say, 5 lines, and bar charts have dimensionality limitations, etc.

The scatterflow and substrinscat charts were created to improve visualization in this case.

9.1 substrinscat - for Strindex Sub-Categories

We will start with substrinscat, which is a more specific case of a scatterflow that focuses on the Oxford Stringency Index (you can think of it as being short for "Sub-Strindex Category Scatterflow").

We can generate a single substrinscat for one region that shows each stringency indicator. The value of the indicator is denoted by the color at each point.

The strindex and its subcategories are tracked at the country level, so we will instantiate a casestudy setting the country_level flag to True. This aggregates all the see19 data up from the province/state level to the country level (where province/state data exists). As previously noted, smoothing is not available when country_level=True.

NOTE: we will also instantiate with start_factor=''. This creates a dataset beginning on 2020-01-01.

factors = CaseStudy.STRINDEX_CATS
factor_dmas = {factor: 28 for factor in factors}

countries = ['United States of America (the)', 'Canada', 'Mexico', 'Brazil', 'Australia', 'Russia',
 'Italy', 'Germany', 'Spain', 'Singapore', 'Japan', 'Hong Kong', 'TWN', 'KOR', 'Malaysia'
]
custom_sum = ['h1', 'h2', 'h3', 'c1', 'c8']
casestudy = CaseStudy(
    bf, countries=countries, count_dma=21, factors=factors, factor_dmas=factor_dmas, 
    start_hurdle=1, start_factor='', lognat=True, country_level=True, custom_sum=custom_sum,
)
casestudy.make()
/Users/spindicate/Documents/programming/zooscraper/casestudy/see19/see19/study/ray.py:16: UserWarning: smoothing is unavailable when country_level=True
  super().__init__(*args, **kwargs)




First, we'll demonstrate a single region, using Japan.

kwargs = {
    'regions': 'Japan', 'width': 6, 'height': 4.5, 
    'title': {'t': 'Japan Stringency Categories', 'x': .57, 'y': 1.07, 'fontsize': 20},
    'xlabel_params': {'fontsize': 18, 'labelpad': 12},
    'cblabel_params': {'fontsize': 14, 'labelpad': 6},
    'palette_base': 'RdPu',
    'xy_cbar': (1.05, .15), 'wh_cbar': (.35, .5),
}
plt = casestudy.substrinscat.make(**kwargs)

png

The single plot above expands to multi-plot simply by adding more regions.

kwargs = {
    'regions': ['name_for_USA', 'Hong Kong', 'Taiwan', 'Korea, South', 'Malaysia'], 
    'width': 14, 'height': 8,
    'palette_base': 'RdPu',
    'xy_cbar': (1.05, .3), 'wh_cbar': (.35, .5),
    'xy_legend': (-.04, .49),
    'legend': {'title': {'fontsize': 12}, 'text': {'fontsize': 12}},
}
plt = casestudy.substrinscat.make(**kwargs)

png

And the plot automatically rescales based on the number of regions considered:

kwargs = {
    'width': 20, 'height': 18, 
    'palette_base': 'RdPu',
    'xy_cbar': (1.05, .3), 'wh_cbar': (.35, .5),
    'xy_legend': (-.04, .51),
    'legend': {'title': {'fontsize': 12}, 'text': {'fontsize': 12}},
}
plt = casestudy.substrinscat.make(**kwargs)

png

9.2 scatterflow

ScatterFlow, available as the scatterflow attribute, is a generalization of the SubStrindexScatter chart. It is best suited for comparing many regions along a single dimension. For example, we can compare countries on the core Oxford Stringency Index:

kwargs = {
    'y_category': 'strindex',
    'title': {'t': 'Oxford Stringency Index Over Time', 'y': 0.94, 'fontsize': 16},
    'width': 8, 'height': 6,
    'xy_cbar': (.7, .24), 'wh_cbar': (.35, 1),
    'palette_base': 'Blues',
    'xlabel_params': {'fontsize': 15, 'labelpad': 12},
}

plt = casestudy.scatterflow.make(**kwargs)

png

We can see very clearly the trends in stringency across the different regions above and quickly isolate the outliers.

Scatterflow accepts any category in the see19 database.

Here we show the sum of the Key3 strindex subcategories.

kwargs = {
    'y_category': 'key3_sum',
    'title': {
        't': 'The Key 3: Information, Contact Tracing, and Testing Over Time',
        'fontsize': 16,
        'y': 0.94
    },
    'xlabel_params': {'fontsize': 14},
    'width': 8, 'height': 6,
    'xy_cbar': (.7, .24), 'wh_cbar': (.35, 1),
    'palette_base': 'Blues'
}
plt = casestudy.scatterflow.make(**kwargs)

png

And below we compare US states on new fatalities.

First, we select the 25 most impacted states in terms of total fatalities. Then, we instantiate a new CaseStudy.

region_ids = bf[bf.country_code == 'USA'].groupby('region_id').deaths.max().sort_values(ascending=False).index.values[:25]
casestudy = CaseStudy(bf, regions=region_ids, count_dma=3,
    start_factor='date', start_hurdle=dt(2020, 3, 1)
)
casestudy.make()
kwargs = {
    'y_category': 'deaths_new_dma_per_1M',
    'title': {
        't': 'Daily Fatalities in US States',
        'fontsize': 16,
        'y': 0.94
    },
    'marker': 's',
    'ms': 225,
    'xlabel_params': {'fontsize': 14},
    'width': 8, 'height': 6,
    'xy_cbar': (.7, .24), 'wh_cbar': (.35, 1),
    'palette_base': 'RdYlGn_r'
}
casestudy.scatterflow.make(**kwargs)

png

