Haggle

A simple facade to Kaggle data.

Haggle: /ˈhaɡəl/

  • an instance of intense argument (as in bargaining)
  • wrangle (over a price, terms of an agreement, etc.)
  • rhymes with Kaggle and is not taken on PyPI (well, now it is)

Essentially, instantiate a KaggleDatasets object, and from it...

  • search for datasets from the Python console (so much better than having pictures on the Kaggle website, right?)
  • download what you want and start using...
  • ... oh, and it automatically caches the data zip to a local directory
  • ... oh, and all the while it pretends to be a humble dict with owner/dataset keys, and that's the coolest bit.

Install

pip install haggle

You'll need a Kaggle API token to use this

If you already have one, you can probably just start using haggle.

If you don't, go get one! See the Kaggle API instructions for details; they essentially say:

API credentials

To use the Kaggle API, sign up for a Kaggle account at https://www.kaggle.com. Then go to the 'Account' tab of your user profile (https://www.kaggle.com/<username>/account) and select 'Create API Token'. This will trigger the download of kaggle.json, a file containing your API credentials. Place this file in the location ~/.kaggle/kaggle.json (on Windows in the location C:\Users\<Windows-username>\.kaggle\kaggle.json - you can check the exact location, sans drive, with echo %HOMEPATH%). You can define a shell environment variable KAGGLE_CONFIG_DIR to change this location to $KAGGLE_CONFIG_DIR/kaggle.json (on Windows it will be %KAGGLE_CONFIG_DIR%\kaggle.json).

For your security, ensure that other users of your computer do not have read access to your credentials. On Unix-based systems you can do this with the following command:

chmod 600 ~/.kaggle/kaggle.json

You can also choose to export your Kaggle username and token to the environment:

export KAGGLE_USERNAME=datadinosaur
export KAGGLE_KEY=xxxxxxxxxxxxxx

In addition, you can export any other configuration value that would normally be in $HOME/.kaggle/kaggle.json in the format 'KAGGLE_<VARIABLE>' (note uppercase).
For example, if the file had the variable "proxy" you would export KAGGLE_PROXY and it would be discovered by the client.
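To check which credentials would be picked up before you start, here is a minimal sketch of the lookup described above. The function name find_kaggle_credentials is hypothetical (it is not part of haggle or the kaggle client), and the order of precedence (environment variables first, then kaggle.json) is an assumption made for illustration.

```python
import json
import os
from pathlib import Path


def find_kaggle_credentials(env=None):
    """Return (username, key) if credentials can be found, else None.

    Hypothetical helper: the precedence used here is an assumption,
    not a statement about how the kaggle client resolves credentials.
    """
    env = os.environ if env is None else env

    # 1. Credentials exported directly to the environment.
    username, key = env.get("KAGGLE_USERNAME"), env.get("KAGGLE_KEY")
    if username and key:
        return username, key

    # 2. kaggle.json in KAGGLE_CONFIG_DIR, falling back to ~/.kaggle.
    config_dir = Path(env.get("KAGGLE_CONFIG_DIR") or Path.home() / ".kaggle")
    config_file = config_dir / "kaggle.json"
    if config_file.is_file():
        config = json.loads(config_file.read_text())
        return config.get("username"), config.get("key")

    return None
```

Calling find_kaggle_credentials() with no arguments inspects your real environment; passing a plain dict lets you try it out without touching os.environ.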

Simple example

from py2store.ext.kaggle import KaggleDatasets

rootdir = '/D/Dropbox/_odata/kaggle/zips'  # define where you want the data to be cached/downloaded

s = KaggleDatasets(rootdir)  # make an instance

if 'rtatman/english-word-frequency' in s:
    del s['rtatman/english-word-frequency']  # just to prepare for the demo
list(s)  # see what you have locally
['uciml/human-activity-recognition-with-smartphones',
 'sitsawek/phonetics-articles-on-plos']

Let's search for something (you can also search on Kaggle; I was kidding about it being lame!)

results = s.search('word frequency')
print(f"{len(results)=}")
list(results)[:10]
len(results)=180

['rtatman/english-word-frequency',
 'yekenot/fasttext-crawl-300d-2m',
 'rtatman/japanese-lemma-frequency',
 'rtatman/glove-global-vectors-for-word-representation',
 'averkij/lingtrain-hungarian-word-frequency',
 'lukevanhaezebrouck/subtlex-word-frequency',
 'facebook/fatsttext-common-crawl',
 'facebook/fasttext-wikinews',
 'facebook/fasttext-english-word-vectors-including-subwords',
 'kushtej/kannada-word-frequency']

Chosen what you want? Good, now do this:

v = s['rtatman/english-word-frequency']
type(v)
py2store.slib.s_zipfile.ZipReader

Okay, let's slow down a moment. What happened? What's this ZipReader thingy?

Well, what happened is that this downloaded the zip file of the data for you and saved it in ROOTDIR/rtatman/english-word-frequency.zip. Don't believe me? Go have a look.
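The key-to-file correspondence described above can be sketched as a pair of small functions. The names zip_path_for and key_for are hypothetical and only illustrate the ROOTDIR/owner/dataset.zip layout; they are not haggle's own code.

```python
from pathlib import Path


def zip_path_for(rootdir, key):
    """Map an 'owner/dataset' key to ROOTDIR/owner/dataset.zip."""
    owner, dataset = key.split("/")  # raises ValueError if the key is malformed
    return Path(rootdir) / owner / f"{dataset}.zip"


def key_for(rootdir, zip_path):
    """Inverse mapping: strip the rootdir and the .zip extension."""
    relative = Path(zip_path).relative_to(rootdir)
    return relative.with_suffix("").as_posix()


path = zip_path_for("/data/zips", "rtatman/english-word-frequency")
print(path)
print(key_for("/data/zips", path))
```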

But then it also returns this object called ZipReader that points to it.

If you don't like it, you don't have to use it. But I think you should like it.

Look at what it can do!

List the files in the zip (okay, there's just one here, so it's a bit boring)

list(v)
['unigram_freq.csv']

Retrieve the data for any given file of the zip without ever having to unzip it!

Oh, and still pretending to be a dict.

b = v['unigram_freq.csv']
print(f"b is a {type(b)} and has {len(b)} bytes")
b is a <class 'bytes'> and has 4956252 bytes

Now the data is given in bytes by default, since that's the basis of everything.
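If you're curious how bytes can come out of a zip without unzipping it to disk, here is a rough sketch using only the standard library's zipfile module. This is not necessarily how ZipReader is implemented; list_members and read_member are hypothetical helpers, and the tiny in-memory zip just stands in for a downloaded dataset.

```python
import io
import zipfile


def list_members(zip_source):
    """Return the names of the files stored in the archive."""
    with zipfile.ZipFile(zip_source) as zf:
        return zf.namelist()


def read_member(zip_source, name):
    """Return the bytes of one archived file; nothing is extracted to disk."""
    with zipfile.ZipFile(zip_source) as zf:
        return zf.read(name)


# Build a small zip in memory to stand in for a downloaded dataset.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("unigram_freq.csv", "word,count\nthe,23135851162\n")

print(list_members(buffer))
print(read_member(buffer, "unigram_freq.csv"))
```

Both helpers also accept a path to a zip file on disk, since zipfile.ZipFile takes either a filename or a file-like object.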

From there you can go everywhere. Here for example, say we'd like to go to pandas.DataFrame...

import pandas as pd
from io import BytesIO

df = pd.read_csv(BytesIO(b))
df.shape
(333333, 2)
print(df.head(7).to_string())
  word        count
0  the  23135851162
1   of  13151942776
2  and  12997637966
3   to  12136980858
4    a   9081174698
5   in   8469404971
6  for   5933321709
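pandas is only one destination. As a small sketch of another route, assuming the bytes are UTF-8 encoded CSV, the standard library's csv module can parse them too. The helper name rows_from_csv_bytes and the two-row sample are made up for illustration.

```python
import csv
import io


def rows_from_csv_bytes(data, encoding="utf-8"):
    """Decode CSV bytes and return a list of dicts, one per row."""
    text = io.StringIO(data.decode(encoding), newline="")
    return list(csv.DictReader(text))


# A tiny sample in the same shape as unigram_freq.csv.
sample = b"word,count\nthe,23135851162\nof,13151942776\n"
rows = rows_from_csv_bytes(sample)
counts = {row["word"]: int(row["count"]) for row in rows}
print(counts)
```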

And as mentioned, it caches the data on your local drive. That is, it downloads the zip once, so that the next time you ask for s['rtatman/english-word-frequency'], it'll be faster to get those bytes.

See, let's list the contents of s again and see that we now have that 'rtatman/english-word-frequency' key we didn't have before.

list(s)
['uciml/human-activity-recognition-with-smartphones',
 'rtatman/english-word-frequency',
 'sitsawek/phonetics-articles-on-plos']
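The download-once-then-reuse behavior can be sketched as a dict-like class over a directory. This is an illustration only, not haggle's implementation: ZipCache and the fetch callable are hypothetical names, fake_fetch stands in for the real Kaggle download, and the sketch returns raw zip bytes rather than a ZipReader.

```python
import tempfile
from collections.abc import MutableMapping
from pathlib import Path


class ZipCache(MutableMapping):
    """Dict-like view of ROOTDIR/owner/dataset.zip files; fetches missing keys."""

    def __init__(self, rootdir, fetch):
        self.rootdir = Path(rootdir)
        self.fetch = fetch  # hypothetical callable: key -> zip bytes

    def _path(self, key):
        owner, dataset = key.split("/")
        return self.rootdir / owner / f"{dataset}.zip"

    def __getitem__(self, key):
        path = self._path(key)
        if not path.is_file():  # cache miss: fetch once and store on disk
            self[key] = self.fetch(key)
        return path.read_bytes()

    def __setitem__(self, key, data):
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def __delitem__(self, key):
        path = self._path(key)
        if not path.is_file():
            raise KeyError(key)
        path.unlink()

    def __contains__(self, key):
        # Only look at what is on disk; never trigger a download.
        return self._path(key).is_file()

    def __iter__(self):
        for path in sorted(self.rootdir.glob("*/*.zip")):
            yield path.relative_to(self.rootdir).with_suffix("").as_posix()

    def __len__(self):
        return sum(1 for _ in self)


calls = []


def fake_fetch(key):
    """Stand-in for a real download; records each call."""
    calls.append(key)
    return b"zip bytes for " + key.encode()


with tempfile.TemporaryDirectory() as rootdir:
    s = ZipCache(rootdir, fake_fetch)
    first = s["rtatman/english-word-frequency"]   # fetched and written to disk
    second = s["rtatman/english-word-frequency"]  # served from the local copy
    keys_after_fetch = list(s)
    del s["rtatman/english-word-frequency"]
    keys_after_delete = list(s)

print(calls, keys_after_fetch, keys_after_delete)
```

Overriding __contains__ matters here: the default from MutableMapping would call __getitem__, so a plain membership check would trigger a download.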

Conclusion

This is awesome.

You get any dataset you want by just doing s['owner/dataset'], and start using it right away (or later), and the next time you ask for it, it'll be there at your fingertips.


F.A.Q.

What if I don't want a zip file anymore?

Just delete it, like you do with any file you don't want anymore. You know the one.

Or... you can be cool and do del s['owner/dataset'] for that key (note a key doesn't include the rootdir or the .zip extension), just like you would with a... dict, once again.

Do you have any Jupyter notebooks demoing this?

Sure, you can find some on GitHub.
