

tapyoca

A medley of small projects

parquet_deformations

I'm calling these Parquet deformations but purists would lynch me.

Really, I just wanted to transform one word into another word, gradually, as I've seen in some of Escher's work, so I looked it up and saw that it's called parquet deformations. The math looked enticing, but I had no time for that, so I did it the first way I could think of: mapping pixels to pixels (in some fashion -- nearest neighbors is the method that yields the nicest results, under the pixel-level restriction).

Of course, this can be applied to any image (which will be transformed to B/W -- not even gray, I mean actual B/W), and there are several ways you can perform the parquet (I like the gif rendering).

The main function (exposed as a script) is mk_deformation_image. All you need is to specify two images (or words). If you want, of course, you can specify:

  • n_steps: Number of steps from start to end image
  • save_to_file: path of the file to save to (if not given, will just return the image object)
  • kind: 'gif', 'horizontal_stack', or 'vertical_stack'
  • coordinate_mapping_maker: A function that will return the mapping between start and end. This function should return a pair (from_coord, to_coord) of aligned matrices whose 2 columns are the (x, y) coordinates, and whose rows represent aligned positions that should be mapped (see the sketch below).
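For illustration, here's a minimal sketch of what such a coordinate_mapping_maker could look like. The signature is an assumption (check the source for the actual calling convention); it implements the nearest-neighbor pairing that the default mapping is based on:

import numpy as np
from scipy.spatial import cKDTree  # assumes scipy is available

def nearest_neighbor_mapping(start_im, end_im, threshold=128):
    # black-pixel coordinates, flipped from (row, col) to (x, y)
    from_coord = np.argwhere(np.asarray(start_im.convert('L')) < threshold)[:, ::-1]
    to_pool = np.argwhere(np.asarray(end_im.convert('L')) < threshold)[:, ::-1]
    # pair each start pixel with its nearest end pixel
    _, idx = cKDTree(to_pool).query(from_coord)
    return from_coord, to_pool[idx]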

Examples

Two words...

fit_to_size = 400
start_im = image_of_text('sensor').rotate(90, expand=1)
end_im = image_of_text('meaning').rotate(90, expand=1)
start_and_end_image(start_im, end_im)

png

im = mk_deformation_image(start_im, end_im, 15, kind='h').resize((500,200))
im

png

im = mk_deformation_image(start_im.transpose(4), end_im.transpose(4), 5, kind='v').resize((200,200))  # 4 == Image.ROTATE_270
im

png

f = 'sensor_meaning_knn.gif'
mk_deformation_image(start_im.transpose(4), end_im.transpose(4), n_steps=20, save_to_file=f)
display_gif(f)
f = 'sensor_meaning_scan.gif'
mk_deformation_image(start_im.transpose(4), end_im.transpose(4), n_steps=20, save_to_file=f, 
                     coordinate_mapping_maker='scan')
display_gif(f)
f = 'sensor_meaning_random.gif'
mk_deformation_image(start_im.transpose(4), end_im.transpose(4), n_steps=20, save_to_file=f, 
                     coordinate_mapping_maker='random')
display_gif(f)

From a list of words

start_words = ['sensor', 'vibration', 'temperature']
end_words = ['sense', 'meaning', 'detection']
start_im, end_im = make_start_and_end_images_with_words(
    start_words, end_words, perm=True, repeat=2, size=150)
start_and_end_image(start_im, end_im).resize((600, 200))

png

im = mk_deformation_image(start_im, end_im, 5)
im

png

f = 'bunch_of_words.gif'
mk_deformation_image(start_im, end_im, n_steps=20, save_to_file=f)
display_gif(f)

From files

start_im = Image.open('sensor_strip_01.png')
end_im = Image.open('sense_strip_01.png')
start_and_end_image(start_im.resize((200, 500)), end_im.resize((200, 500)))

png

im = mk_deformation_image(start_im, end_im, 7)
im

png

f = 'medley.gif'
mk_deformation_image(start_im, end_im, n_steps=20, save_to_file=f)
display_gif(f)
mk_deformation_image(start_im, end_im, n_steps=20, save_to_file=f, coordinate_mapping_maker='scan')
display_gif(f)

an image and some text

start_im = 'img/waveform_01.png'  # will first look for a file, and if not found, consider the input as text
end_im = 'makes sense'

mk_gif_of_deformations(start_im, end_im, n_steps=20,
                       save_to_file='image_and_text.gif')
display_gif('image_and_text.gif')  

demonys

What do we think about other peoples?

This project is meant to get an idea of what people think of the peoples of different nations, as seen by what they ask google about them.

Here I use python code to acquire, clean up, and analyze the data.

Demonym

If you're like me and enjoy the false and fleeting impression of superiority that comes when you know a word someone else doesn't -- if you're like me and go to parties for the sole purpose of seeking victims to get a one-up on -- here's a cool word to add to your arsenal:

demonym: a noun used to denote the natives or inhabitants of a particular country, state, city, etc. "he struggled for the correct demonym for the people of Manchester"

Back-story of this analysis

During a discussion (about traveling in Europe) someone said "why are the swiss so miserable". Now, I wouldn't say that the swiss were especially miserable (a couple of ex-girlfriends aside), but to be fair he was contrasting with Italians, so perhaps he has a point. I apologize if you are swiss, or one of the two ex-girlfriends -- nothing personal, this is all for effect.

We googled "why are the swiss so ", and sure enough, "why are the swiss so miserable" came up as one of the suggestions. So we got curious and started googling other peoples: the French, the Germans, etc.

That's the back-story of this analysis. This analysis is meant to get an idea of what we think of peoples from other countries. Of course, one can rightfully critique the approach I'll take to gauge "what we think" -- all three of these words should, but will not, be defined. I'm just going to see what google's current auto-suggest comes back with when I enter "why are the X so " (where X will be a noun that denotes the natives or inhabitants of a particular country; a demonym if you will).

Warning

Again, word of warning: all data and analyses are biased. Take everything you'll read here (and to be fair, what you read anywhere) with a grain of salt. For simplicity I'll say things like "what we think of..." or "who do we most...", etc. But I don't really mean that.


The results

In a nutshell

Below are listed 73 demonyms, along with the words extracted from the very first google suggestion when you type

why are the DEMONYM so

afghan    	                eyes beautiful
albanian  	                     beautiful
american  	          girl dolls expensive
australian	                          tall
belgian   	                    fries good
bhutanese 	                         happy
brazilian 	              good at football
british   	     full of grief and despair
bulgarian 	              properties cheap
burmese   	             cats affectionate
cambodian 	                   cows skinny
canadian  	                          nice
chinese   	                       healthy
colombian 	                  avocados big
cuban     	                   cigars good
czech     	                          tall
dominican 	  republic and haiti different
egyptian  	                gods important
english   	                      reserved
eritrean  	                     beautiful
ethiopian 	                     beautiful
filipino  	                         proud
finn      	               shoes expensive
french    	                       healthy
german    	                          tall
greek     	                gods messed up
haitian   	                parents strict
hungarian 	                    words long
indian    	            tv debates chaotic
indonesian	                         smart
iranian   	                     beautiful
israeli   	           startups successful
italian   	                         short
jamaican  	                sprinters fast
japanese  	                        polite
kenyan    	                  runners good
lebanese  	                          rich
malagasy  	                    names long
malaysian 	                   drivers bad
maltese   	                          rude
mongolian 	                  horses small
moroccan  	                rugs expensive
nepalese  	                     beautiful
nigerian  	                          tall
north korean	                      hats big
norwegian 	                 flights cheap
pakistani 	                          fair
peruvian  	               blueberries big
pole      	                  vaulters hot
portuguese	                         short
puerto rican	       and cuban flags similar
romanian  	                     beautiful
russian   	                  good at math
samoan    	                           big
saudi     	                      arrogant
scottish  	                        bitter
senegalese	                          tall
serbian   	                          tall
singaporean	                          rude
somali    	                parents strict
south african	                     plugs big
south korean	                          tall
sri lankan	                          dark
sudanese  	                          tall
swiss     	        good at making watches
syrian    	                families large
taiwanese 	                        pretty
thai      	                        pretty
tongan    	                           big
ukrainian 	                     beautiful
vietnamese	        fiercely nationalistic
welsh     	                          dark
zambian   	                emeralds cheap

Notes:

  • The queries actually have a space after the "so", which matters so as to omit suggestions containing words that start with so.
  • Only the tail of the suggestion is shown -- minus the prefix (why are the DEMONYM or why are DEMONYM) as well as the so, wherever it lands in the suggestion. For example, the first suggestion for the american demonym was "why are american girl dolls so expensive", which results in the "girl dolls expensive" association (see the sketch after this list).
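
Here's a hypothetical helper reproducing that extraction:

import re

def association(suggestion, demonym):
    # strip the "why are (the) DEMONYM" prefix, then drop the "so"
    tail = re.sub(rf'^why are (the )?{demonym}\s*', '', suggestion)
    return re.sub(r'\bso\b\s*', '', tail).strip()

assert association('why are american girl dolls so expensive', 'american') == 'girl dolls expensive'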

Who do we most talk/ask about?

The original list contained 217 demonyms, but many of these yielded no suggestions (to the specific query format I used, that is). Only 73 demonyms gave me at least one suggestion. But within those, the number of suggestions ranges between 1 and 20 (which is probably the default maximum number of suggestions for the API I used). So, pretending that the number of suggestions is an indicator of how much we have to say, or how many different opinions we have, of each of the covered nationalities, here are the top 15 demonyms people talk about, with the corresponding number of suggestions (a proxy for "the number of different things people ask about the said nationality").

french         20
singaporean    20
german         20
british        20
swiss          20
english        19
italian        18
cuban          18
canadian       18
welsh          18
australian     17
maltese        16
american       16
japanese       14
scottish       14

Who do we least talk/ask about?

Conversely, here are the 19 demonyms that came back with only one suggestion.

somali          1
bhutanese       1
syrian          1
tongan          1
cambodian       1
malagasy        1
saudi           1
serbian         1
czech           1
eritrean        1
finn            1
puerto rican    1
pole            1
haitian         1
hungarian       1
peruvian        1
moroccan        1
mongolian       1
zambian         1

What do we think about people?

Why are the French so...

How would you (if you're (un)lucky enough to know the French) finish this sentence? You might even have several opinions about the French, and any other group of people you've rubbed shoulders with. What words would your palette contain to describe different nationalities? What words would others (at least those that ask questions to google) use?

Well, here's what my auto-suggest search gave me: a set of 357 unique words and expressions to describe the 72 nationalities. So there's a long tail of words used for only one nationality. But some words occur for more than one nationality. Here are the top 12 words/expressions used to describe people of the world.

beautiful         11
tall              11
short              9
names long         8
proud              8
parents strict     8
smart              8
nice               7
boring             6
rich               5
dark               5
successful         5

Who is beautiful? Who is tall? Who is short? Who is smart?

beautiful      : albanian, eritrean, ethiopian, filipino, iranian, lebanese, nepalese, pakistani, romanian, ukrainian, vietnamese
tall           : australian, czech, german, nigerian, pakistani, samoan, senegalese, serbian, south korean, sudanese, taiwanese
short          : filipino, indonesian, italian, maltese, nepalese, pakistani, portuguese, singaporean, welsh
names long     : indian, malagasy, nigerian, portuguese, russian, sri lankan, thai, welsh
proud          : albanian, ethiopian, filipino, iranian, lebanese, portuguese, scottish, welsh
parents strict : albanian, ethiopian, haitian, indian, lebanese, pakistani, somali, sri lankan
smart          : indonesian, iranian, lebanese, pakistani, romanian, singaporean, taiwanese, vietnamese
nice           : canadian, english, filipino, nepalese, portuguese, taiwanese, thai
boring         : british, english, french, german, singaporean, swiss
rich           : lebanese, pakistani, singaporean, taiwanese, vietnamese
dark           : filipino, senegalese, sri lankan, vietnamese, welsh
successful     : chinese, english, japanese, lebanese, swiss

How did I do it?

I scraped a list of (country, demonym) pairs from a table in http://www.geography-site.co.uk/pages/countries/demonyms.html.

Then I diagnosed these and manually made a mapping to simplify some "complex" entries, such as mapping "Irishman or Irishwoman or Irish" to "Irish".

Using the google suggest API (http://suggestqueries.google.com/complete/search?client=chrome&q=), I requested the suggestions for the why are the $demonym so query pattern, with $demonym running through all 217 demonyms from the list above, storing the results whenever they were non-empty.
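
For illustration, a minimal sketch of that request step (the actual code is in data_acquisition.py, mentioned below):

import json
import urllib.request
from urllib.parse import quote

def suggestions_for(demonym):
    q = quote(f'why are the {demonym} so ')  # note the trailing space (see Notes above)
    url = f'http://suggestqueries.google.com/complete/search?client=chrome&q={q}'
    with urllib.request.urlopen(url) as resp:
        payload = json.loads(resp.read().decode('utf-8'))
    return payload[1]  # the second element is the list of suggestion strings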

Then, it was just a matter of pulling this data into memory, formatting it a bit, and creating a pandas dataframe that I could then interrogate.
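
Something like this (with a tiny stub standing in for the stored data):

import pandas as pd

# `results` stands for the stored {demonym: [suggestion, ...]} mapping
results = {
    'swiss': ['why are the swiss so good at making watches'],
    'french': ['why are the french so healthy'],
}
df = pd.DataFrame(
    [{'demonym': d, 'suggestion': s} for d, suggs in results.items() for s in suggs]
)
# e.g. number of suggestions per demonym (the "who do we most ask about?" table)
df.groupby('demonym').size().sort_values(ascending=False)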

Resources you can find here

The code to do this analysis yourself, from scratch, is here: data_acquisition.py.

The jupyter notebook I actually used when I developed this: 01 - Demonyms and adjectives - why are the french so....ipynb

Note you'll need to pip install py2store if you haven't already.

In the data folder you'll find

  • country_demonym.p: A pickle of a dataframe of countries and corresponding demonyms
  • country_demonym.xlsx: The same as above, but in excel form
  • demonym_suggested_characteristics.p: A pickle of 73 demonyms and auto-suggestion information, including characteristics.
  • what_we_think_about_demonyns.xlsx: An excel containing various statistics about demonyms and their (perceived) characteristics

Agglutinations

Inspired by a tweet from Raymond Hettinger this morning:

Resist the urge to elide the underscore in multiword function or method names

So I wondered...

Gluglus

The gluglu of a word is the number of partitions you can make of that word into words (of length at least 2, so no using a or i). (No, "gluglu" isn't an actual term -- unless everyone starts using it from now on. But it was inspired by an actual linguistic term.)

For example, the gluglu of newspaper is 4:

newspaper
    new spa per
    news pa per
    news paper

Every (valid) word has gluglu at least 1.
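
This isn't the code that produced the numbers below, but a minimal sketch of how one could count gluglus, assuming words is a set of valid words of length at least 2 (like the dictionary described in the Details section):

from functools import lru_cache

def gluglu(word, words):
    """Number of ways to partition `word` into words from `words`."""
    @lru_cache(maxsize=None)
    def count(s):
        if not s:
            return 1  # the empty remainder counts as one completed partition
        return sum(count(s[i:]) for i in range(2, len(s) + 1) if s[:i] in words)
    return count(word)

# the newspaper example above:
assert gluglu('newspaper', {'new', 'news', 'spa', 'pa', 'per', 'paper', 'newspaper'}) == 4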

How many standard library names have gluglus of at least 2?

108

Here's the list of all of them.

The winner has a gluglu of 6 (not 7, because formatannotationrelativeto itself isn't in the dictionary):

formatannotationrelativeto
	for mat an not at ion relative to
	for mat annotation relative to
	form at an not at ion relative to
	form at annotation relative to
	format an not at ion relative to
	format annotation relative to

Details

Dictionary

Really, it depends on what dictionary we use. Here, I used a very conservative one: the intersection of two lists, the corncob and the google10000 word lists. Additionally, of those, I only kept the words that had at least 2 letters and contained only letters (no hyphens or disturbing diacritics).

Diacritics. Look it up. Impress your next nerd date.

I'm left with 8116 words. You can find them here.
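
For reference, a sketch of that dictionary construction (the file names are hypothetical local copies of the two word lists):

def load_words(path):
    with open(path) as fp:
        return {w.strip().lower() for w in fp}

words = {
    w for w in load_words('corncob.txt') & load_words('google10000.txt')
    if len(w) >= 2 and w.isalpha()
}
len(words)  # 8116 with the lists I used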

Standard Lib Names

Surprisingly, that was the hardest part. I know I'm missing some, but that's enough rabbit-holing.

What I did (modulo some exceptions I won't look into) was to walk the standard lib modules (even that list wasn't a given!), extracting (recursively) the names of any (non-underscored) attributes that were modules or callables, as well as extracting the arguments of these callables (when they had signatures). A rough sketch of that walk is below.
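
Here's a rough, flat (non-recursive) approximation of it, assuming Python 3.10+ for sys.stdlib_module_names:

import importlib
import inspect
import sys

def stdlib_names():
    names = set()
    for mod_name in sorted(sys.stdlib_module_names):
        if mod_name in {'antigravity', 'this'}:  # importing these has side effects
            continue
        try:
            mod = importlib.import_module(mod_name)
        except Exception:  # some modules are platform-specific or refuse to import
            continue
        for attr in dir(mod):
            if attr.startswith('_'):
                continue
            names.add(attr)
            obj = getattr(mod, attr, None)
            if callable(obj):
                try:
                    # also harvest argument names, when a signature is available
                    names.update(inspect.signature(obj).parameters)
                except (TypeError, ValueError):
                    pass
    return names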

You can find the code I used to extract these names here and the actual list there.

covid

Bar Chart Races (applied to covid-19 spread)

This module shows how to make these:

The script

If you just want to run this as a script to get the job done, you have one here: https://raw.githubusercontent.com/thorwhalen/tapyoca/master/covid/covid_bar_chart_race.py

Run like this

$ python covid_bar_chart_race.py -h
usage: covid_bar_chart_race.py [-h] {mk-and-save-covid-data,update-covid-data,instructions-to-make-bar-chart-race} ...

positional arguments:
  {mk-and-save-covid-data,update-covid-data,instructions-to-make-bar-chart-race}
    mk-and-save-covid-data
                        :param data_sources: Dirpath or py2store Store where the data is :param kinds: The kinds of data you want to compute and save :param
                        skip_first_days: :param verbose: :return:
    update-covid-data   update the coronavirus data
    instructions-to-make-bar-chart-race

optional arguments:
  -h, --help            show this help message and exit

The jupyter notebook

The notebook (the .ipynb file) shows you how to do it step by step in case you want to reuse the methods for other stuff.

Getting and preparing the data

Corona virus data here: https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset (direct download: https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset/download). It's currently updated daily, so download a fresh copy if you want.

Population data here: http://api.worldbank.org/v2/en/indicator/SP.POP.TOTL?downloadformat=csv

It comes in the form of a zip file (currently named novel-corona-virus-2019-dataset.zip) with several .csv files in it. We use py2store (to install: pip install py2store; project lives here: https://github.com/i2mint/py2store) to access and pre-process it. This allows us to not have to unzip the file and replace the older folder every time we download a new one. It also gives us the csvs as pandas.DataFrames directly.

import os
import pandas as pd
from io import BytesIO
from py2store import kv_wrap, ZipReader  # google it and pip install it
from py2store.caching import mk_cached_store
from py2store import QuickPickleStore
from py2store.sources import FuncReader

def country_flag_image_url():
    import pandas as pd
    return pd.read_csv(
        'https://raw.githubusercontent.com/i2mint/examples/master/data/country_flag_image_url.csv')

def kaggle_coronavirus_dataset():
    import kaggle
    from io import BytesIO
    # didn't find the pure binary download function, so using temp dir to emulate
    from tempfile import mkdtemp  
    download_dir = mkdtemp()
    filename = 'novel-corona-virus-2019-dataset.zip'
    zip_file = os.path.join(download_dir, filename)
    
    dataset = 'sudalairajkumar/novel-corona-virus-2019-dataset'
    kaggle.api.dataset_download_files(dataset, download_dir)
    with open(zip_file, 'rb') as fp:
        b = fp.read()
    return BytesIO(b)

def city_population_in_time():
    import pandas as pd
    return pd.read_csv(
        'https://gist.githubusercontent.com/johnburnmurdoch/'
        '4199dbe55095c3e13de8d5b2e5e5307a/raw/fa018b25c24b7b5f47fd0568937ff6c04e384786/city_populations'
    )

def country_flag_image_url_prep(df: pd.DataFrame):
    # delete the region col (we don't need it)
    del df['region']
    # rewriting a few (not all) of the country names to match those found in kaggle covid data
    # Note: The list is not complete! Add to it as needed
    old_and_new = [('USA', 'US'), 
                   ('Iran, Islamic Rep.', 'Iran'), 
                   ('UK', 'United Kingdom'), 
                   ('Korea, Rep.', 'Korea, South')]
    for old, new in old_and_new:
        df['country'] = df['country'].replace(old, new)

    return df


@kv_wrap.outcoming_vals(lambda x: pd.read_csv(BytesIO(x)))  # this is to format the data as a dataframe
class ZippedCsvs(ZipReader):
    pass
# equivalent to ZippedCsvs = kv_wrap.outcoming_vals(lambda x: pd.read_csv(BytesIO(x)))(ZipReader)

# Enter here the place you want to cache your data
my_local_cache = os.path.expanduser('~/ddir/my_sources')

CachedFuncReader = mk_cached_store(FuncReader, QuickPickleStore(my_local_cache))
data_sources = CachedFuncReader([country_flag_image_url, 
                                 kaggle_coronavirus_dataset, 
                                 city_population_in_time])
list(data_sources)
['country_flag_image_url',
 'kaggle_coronavirus_dataset',
 'city_population_in_time']
covid_datasets = ZippedCsvs(data_sources['kaggle_coronavirus_dataset'])
list(covid_datasets)
['COVID19_line_list_data.csv',
 'COVID19_open_line_list.csv',
 'covid_19_data.csv',
 'time_series_covid_19_confirmed.csv',
 'time_series_covid_19_confirmed_US.csv',
 'time_series_covid_19_deaths.csv',
 'time_series_covid_19_deaths_US.csv',
 'time_series_covid_19_recovered.csv']
covid_datasets['time_series_covid_19_confirmed.csv'].head()
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 3/24/20 3/25/20 3/26/20 3/27/20 3/28/20 3/29/20 3/30/20 3/31/20 4/1/20 4/2/20
0 NaN Afghanistan 33.0000 65.0000 0 0 0 0 0 0 ... 74 84 94 110 110 120 170 174 237 273
1 NaN Albania 41.1533 20.1683 0 0 0 0 0 0 ... 123 146 174 186 197 212 223 243 259 277
2 NaN Algeria 28.0339 1.6596 0 0 0 0 0 0 ... 264 302 367 409 454 511 584 716 847 986
3 NaN Andorra 42.5063 1.5218 0 0 0 0 0 0 ... 164 188 224 267 308 334 370 376 390 428
4 NaN Angola -11.2027 17.8739 0 0 0 0 0 0 ... 3 3 4 4 5 7 7 7 8 8

5 rows × 76 columns

country_flag_image_url = data_sources['country_flag_image_url']
country_flag_image_url.head()
country region flag_image_url
0 Angola Africa https://www.countryflags.io/ao/flat/64.png
1 Burundi Africa https://www.countryflags.io/bi/flat/64.png
2 Benin Africa https://www.countryflags.io/bj/flat/64.png
3 Burkina Faso Africa https://www.countryflags.io/bf/flat/64.png
4 Botswana Africa https://www.countryflags.io/bw/flat/64.png
from IPython.display import Image
flag_image_url_of_country = country_flag_image_url.set_index('country')['flag_image_url']
Image(url=flag_image_url_of_country['Australia'])

Update coronavirus data

# To update the coronavirus data:
def update_covid_data(data_sources):
    """update the coronavirus data"""
    if 'kaggle_coronavirus_dataset' in data_sources._caching_store:
        del data_sources._caching_store['kaggle_coronavirus_dataset']  # delete the cached item
    _ = data_sources['kaggle_coronavirus_dataset']

# update_covid_data(data_sources)  # uncomment here when you want to update

Prepare data for flourish upload

import re

def print_if_verbose(verbose, *args, **kwargs):
    if verbose:
        print(*args, **kwargs)
        
def country_data_for_data_kind(data_sources, kind='confirmed', skip_first_days=0, verbose=False):
    """kind can be 'confirmed', 'deaths', 'confirmed_US', 'confirmed_US', 'recovered'"""
    
    covid_datasets = ZippedCsvs(data_sources['kaggle_coronavirus_dataset'])
    
    df = covid_datasets[f'time_series_covid_19_{kind}.csv']
    # df = s['time_series_covid_19_deaths.csv']
    if 'Province/State' in df.columns:
        df.loc[df['Province/State'].isna(), 'Province/State'] = 'n/a'  # to avoid problems arising from NaNs

    print_if_verbose(verbose, f"Before data shape: {df.shape}")

    # drop some columns we don't need
    p = re.compile(r'\d+/\d+/\d+')  # matches date columns like 1/22/20

    assert all(isinstance(x, str) for x in df.columns)
    date_cols = [x for x in df.columns if p.match(x)]
    if not kind.endswith('US'):
        df = df.loc[:, ['Country/Region'] + date_cols]
        # group countries and sum up the contributions of their states/regions/parts
        df['country'] = df.pop('Country/Region')
        df = df.groupby('country').sum()
    else:
        df = df.loc[:, ['Province_State'] + date_cols]
        df['state'] = df.pop('Province_State')
        df = df.groupby('state').sum()

    
    print_if_verbose(verbose, f"After data shape: {df.shape}")
    df = df.iloc[:, skip_first_days:]
    
    if not kind.endswith('US'):
        # Joining with the country image urls and saving as an xls
        country_image_url = country_flag_image_url_prep(data_sources['country_flag_image_url'])
        t = df.copy()
        t.columns = [str(x)[:10] for x in t.columns]
        t = t.reset_index(drop=False)
        t = country_image_url.merge(t, how='outer')
        t = t.set_index('country')
        df = t
    else:    
        pass

    return df


def mk_and_save_country_data_for_data_kind(data_sources, kind='confirmed', skip_first_days=0, verbose=False):
    t = country_data_for_data_kind(data_sources, kind, skip_first_days, verbose)
    filepath = f'country_covid_{kind}.xlsx'
    t.to_excel(filepath)
    print_if_verbose(verbose, f"Was saved here: {filepath}")
# for kind in ['confirmed', 'deaths', 'recovered', 'confirmed_US', 'deaths_US']:
for kind in ['confirmed', 'deaths', 'recovered', 'confirmed_US', 'deaths_US']:
    mk_and_save_country_data_for_data_kind(data_sources, kind=kind, skip_first_days=39, verbose=True)
Before data shape: (262, 79)
After data shape: (183, 75)
Was saved here: country_covid_confirmed.xlsx
Before data shape: (262, 79)
After data shape: (183, 75)
Was saved here: country_covid_deaths.xlsx
Before data shape: (248, 79)
After data shape: (183, 75)
Was saved here: country_covid_recovered.xlsx
Before data shape: (3253, 86)
After data shape: (58, 75)
Was saved here: country_covid_confirmed_US.xlsx
Before data shape: (3253, 87)
After data shape: (58, 75)
Was saved here: country_covid_deaths_US.xlsx

Upload to Flourish, tune, and publish

Go to https://public.flourish.studio/, get a free account, and play.

Go to https://app.flourish.studio/templates

Choose "Bar chart race". At the time of writing this, it was here: https://app.flourish.studio/visualisation/1706060/

... and then play with the settings

Discussion of the methods

from py2store import *
from IPython.display import Image

country flags images

The manual data prep looks something like this.

import pandas as pd

# get the csv data from the url
country_image_url_source = \
    'https://raw.githubusercontent.com/i2mint/examples/master/data/country_flag_image_url.csv'
country_image_url = pd.read_csv(country_image_url_source)

# delete the region col (we don't need it)
del country_image_url['region']

# rewriting a few (not all) of the country names to match those found in kaggle covid data
# Note: The list is not complete! Add to it as needed
# TODO: (Wishful) Using a general smart soft-matching algorithm to do this automatically.
# TODO:    This could use edit-distance, synonyms, acronym generation, etc.
old_and_new = [('USA', 'US'), 
               ('Iran, Islamic Rep.', 'Iran'), 
               ('UK', 'United Kingdom'), 
               ('Korea, Rep.', 'Korea, South')]
for old, new in old_and_new:
    country_image_url['country'] = country_image_url['country'].replace(old, new)

image_url_of_country = country_image_url.set_index('country')['flag_image_url']

country_image_url.head()
country flag_image_url
0 Angola https://www.countryflags.io/ao/flat/64.png
1 Burundi https://www.countryflags.io/bi/flat/64.png
2 Benin https://www.countryflags.io/bj/flat/64.png
3 Burkina Faso https://www.countryflags.io/bf/flat/64.png
4 Botswana https://www.countryflags.io/bw/flat/64.png
Image(url=image_url_of_country['Australia'])

Caching the flag images data

Downloading our data sources every time we need them is not sustainable. What if they're big? What if you're offline or have slow internet (yes, dear future reader, even in the US, during coronavirus times!)?

Caching. A "cache aside" read-cache. That's the word. py2store has tools for that (most of which are in caching.py).

So let's say we're going to have a local folder where we'll store various data we download. The principle is as follows:

from py2store.caching import mk_cached_store

class TheSource(dict): ...
the_cache = {}
TheCacheSource = mk_cached_store(TheSource, the_cache)

the_source = TheSource({'green': 'eggs', 'and': 'ham'})

the_cached_source = TheCacheSource(the_source)
print(f"the_cache: {the_cache}")
print(f"Getting green...")
the_cached_source['green']
print(f"the_cache: {the_cache}")
print("... so the next time the_cached_source will get it's green from that the_cache")
the_cache: {}
Getting green...
the_cache: {'green': 'eggs'}
... so the next time the_cached_source will get its green from the_cache

But now, you'll notice a slight problem ahead. What exactly does our source store (or rather reader) look like? In its raw form it would take urls as its keys, and the response of a request as its values. That store wouldn't have an __iter__ for sure (unless you're Google). But more to the point here, the mk_cached_store tool uses the same key for the source and the cache, and we can't just use the url as is to be a local file path.

There are many ways we could solve this. One way is to add a key map layer on the cache store, so externally it speaks the url key language, but internally it will map that url to a valid local file path. We've been there, we got the T-shirt!

But what we're going to do is a bit different: We're going to do the key mapping in the source store itself. It seems to make more sense in our context: We have a data source of name: data pairs, and if we impose that the name should be a valid file name, we don't need to have a key map in the cache store.
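
For the record, that key map layer could be as simple as the following (a hypothetical helper -- it's not what we do below):

from urllib.parse import quote

def url_to_filename(url):
    # percent-encode everything (including '/') so the url becomes a single valid file name
    return quote(url, safe='')

# url_to_filename('https://example.com/a.csv') -> 'https%3A%2F%2Fexample.com%2Fa.csv'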

So let's start by building this store (we'll call it FuncReader). First, we define the functions that get us the data we want.

def country_flag_image_url():
    import pandas as pd
    return pd.read_csv(
        'https://raw.githubusercontent.com/i2mint/examples/master/data/country_flag_image_url.csv')

def kaggle_coronavirus_dataset():
    import os
    import kaggle
    from io import BytesIO
    # didn't find the pure binary download function, so using temp dir to emulate
    from tempfile import mkdtemp
    download_dir = mkdtemp()
    filename = 'novel-corona-virus-2019-dataset.zip'
    zip_file = os.path.join(download_dir, filename)
    
    dataset = 'sudalairajkumar/novel-corona-virus-2019-dataset'
    kaggle.api.dataset_download_files(dataset, download_dir)
    with open(zip_file, 'rb') as fp:
        b = fp.read()
    return BytesIO(b)

def city_population_in_time():
    import pandas as pd
    return pd.read_csv(
        'https://gist.githubusercontent.com/johnburnmurdoch/'
        '4199dbe55095c3e13de8d5b2e5e5307a/raw/fa018b25c24b7b5f47fd0568937ff6c04e384786/city_populations'
    )

Now we can make a store that simply uses these function names as the keys, and their returned value as the values.

from py2store.base import KvReader
from functools import lru_cache

class FuncReader(KvReader):
    _getitem_cache_size = 999
    def __init__(self, funcs):
        # TODO: assert no free arguments (arguments are allowed but must all have defaults)
        self.funcs = funcs
        self._func_of_name = {func.__name__: func for func in funcs}

    def __contains__(self, k):
        return k in self._func_of_name
    
    def __iter__(self):
        yield from self._func_of_name
        
    def __len__(self):
        return len(self._func_of_name)

    @lru_cache(maxsize=_getitem_cache_size)
    def __getitem__(self, k):
        return self._func_of_name[k]()  # call the func

    def __hash__(self):
        # constant hash: lru_cache needs a hashable instance to cache on (self, k)
        return 1
    
data_sources = FuncReader([country_flag_image_url, kaggle_coronavirus_dataset, city_population_in_time])
list(data_sources)
['country_flag_image_url',
 'kaggle_coronavirus_dataset',
 'city_population_in_time']
data_sources['country_flag_image_url']
country region flag_image_url
0 Angola Africa https://www.countryflags.io/ao/flat/64.png
1 Burundi Africa https://www.countryflags.io/bi/flat/64.png
2 Benin Africa https://www.countryflags.io/bj/flat/64.png
3 Burkina Faso Africa https://www.countryflags.io/bf/flat/64.png
4 Botswana Africa https://www.countryflags.io/bw/flat/64.png
... ... ... ...
210 Solomon Islands Oceania https://www.countryflags.io/sb/flat/64.png
211 Tonga Oceania https://www.countryflags.io/to/flat/64.png
212 Tuvalu Oceania https://www.countryflags.io/tv/flat/64.png
213 Vanuatu Oceania https://www.countryflags.io/vu/flat/64.png
214 Samoa Oceania https://www.countryflags.io/ws/flat/64.png

215 rows × 3 columns

data_sources['city_population_in_time']
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
name group year value subGroup city_id lastValue lat lon
0 Agra India 1575 200.0 India Agra - India 200.0 27.18333 78.01667
1 Agra India 1576 212.0 India Agra - India 200.0 27.18333 78.01667
2 Agra India 1577 224.0 India Agra - India 212.0 27.18333 78.01667
3 Agra India 1578 236.0 India Agra - India 224.0 27.18333 78.01667
4 Agra India 1579 248.0 India Agra - India 236.0 27.18333 78.01667
... ... ... ... ... ... ... ... ... ...
6247 Vijayanagar India 1561 480.0 India Vijayanagar - India 480.0 15.33500 76.46200
6248 Vijayanagar India 1562 480.0 India Vijayanagar - India 480.0 15.33500 76.46200
6249 Vijayanagar India 1563 480.0 India Vijayanagar - India 480.0 15.33500 76.46200
6250 Vijayanagar India 1564 480.0 India Vijayanagar - India 480.0 15.33500 76.46200
6251 Vijayanagar India 1565 480.0 India Vijayanagar - India 480.0 15.33500 76.46200

6252 rows × 9 columns

But we wanted this all to be cached locally, right? So a few more lines to do that!

import os
from py2store.caching import mk_cached_store
from py2store import QuickPickleStore

my_local_cache = os.path.expanduser('~/ddir/my_sources')

CachedFuncReader = mk_cached_store(FuncReader, QuickPickleStore(my_local_cache))
data_sources = CachedFuncReader([country_flag_image_url, kaggle_coronavirus_dataset, city_population_in_time])
list(data_sources)
['country_flag_image_url',
 'kaggle_coronavirus_dataset',
 'city_population_in_time']
data_sources['country_flag_image_url']
country region flag_image_url
0 Angola Africa https://www.countryflags.io/ao/flat/64.png
1 Burundi Africa https://www.countryflags.io/bi/flat/64.png
2 Benin Africa https://www.countryflags.io/bj/flat/64.png
3 Burkina Faso Africa https://www.countryflags.io/bf/flat/64.png
4 Botswana Africa https://www.countryflags.io/bw/flat/64.png
... ... ... ...
210 Solomon Islands Oceania https://www.countryflags.io/sb/flat/64.png
211 Tonga Oceania https://www.countryflags.io/to/flat/64.png
212 Tuvalu Oceania https://www.countryflags.io/tv/flat/64.png
213 Vanuatu Oceania https://www.countryflags.io/vu/flat/64.png
214 Samoa Oceania https://www.countryflags.io/ws/flat/64.png

215 rows × 3 columns

data_sources['city_population_in_time']
name group year value subGroup city_id lastValue lat lon
0 Agra India 1575 200.0 India Agra - India 200.0 27.18333 78.01667
1 Agra India 1576 212.0 India Agra - India 200.0 27.18333 78.01667
2 Agra India 1577 224.0 India Agra - India 212.0 27.18333 78.01667
3 Agra India 1578 236.0 India Agra - India 224.0 27.18333 78.01667
4 Agra India 1579 248.0 India Agra - India 236.0 27.18333 78.01667
... ... ... ... ... ... ... ... ... ...
6247 Vijayanagar India 1561 480.0 India Vijayanagar - India 480.0 15.33500 76.46200
6248 Vijayanagar India 1562 480.0 India Vijayanagar - India 480.0 15.33500 76.46200
6249 Vijayanagar India 1563 480.0 India Vijayanagar - India 480.0 15.33500 76.46200
6250 Vijayanagar India 1564 480.0 India Vijayanagar - India 480.0 15.33500 76.46200
6251 Vijayanagar India 1565 480.0 India Vijayanagar - India 480.0 15.33500 76.46200

6252 rows × 9 columns

z = ZippedCsvs(data_sources['kaggle_coronavirus_dataset'])
list(z)
