
Python library for data.world


A Python library for working with data.world datasets.

This library makes it easy for data.world users to pull and work with data stored on data.world. It also provides convenient wrappers for data.world APIs, allowing users to create and update datasets, add and modify files, and even build entire apps on top of data.world.

Quick start

Install

You can install it using pip directly from PyPI:

pip install datadotworld

Optionally, you can install the library including pandas support:

pip install datadotworld[pandas]

If you use conda to manage your Python distribution, you can install from the community-maintained [conda-forge](https://conda-forge.github.io/) channel:

conda install -c conda-forge datadotworld-py

Configure

This library requires a data.world API authentication token to work.

You can obtain your authentication token on data.world once you enable Python under Integrations > Python.

To configure the library, run the following command:

dw configure

Alternatively, tokens can be provided via the DW_AUTH_TOKEN environment variable. On macOS or Unix machines, run the following, replacing <YOUR_TOKEN> with the token obtained earlier:

export DW_AUTH_TOKEN=<YOUR_TOKEN>
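
A script can check that the token is present before it calls the library. The following is a minimal sketch using only the standard library; the helper name get_dw_token is hypothetical and not part of datadotworld.

```python
import os


def get_dw_token(environ=None):
    """Return the token stored in DW_AUTH_TOKEN, or raise if it is missing.

    get_dw_token is a hypothetical helper, not part of datadotworld.
    """
    # Allow a plain dict to be passed in so the check can be tested.
    environ = os.environ if environ is None else environ
    token = environ.get("DW_AUTH_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "DW_AUTH_TOKEN is not set; run `dw configure` or export the variable."
        )
    return token
```

Calling get_dw_token() at the top of a script fails early with a readable message instead of failing later on the first API call.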

Load a dataset

The load_dataset() function facilitates maintaining copies of datasets on the local filesystem. It will download a given dataset’s datapackage and store it under ~/.dw/cache. When used subsequently, load_dataset() will use the copy stored on disk and will work offline, unless it’s called with force_update=True or auto_update=True. force_update=True will overwrite your local copy unconditionally. auto_update=True will only overwrite your local copy if a newer version of the dataset is available on data.world.
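
The caching rules above can be summarized as a small decision function. This is only a model of the behavior described in the previous paragraph, written for illustration; should_download and its parameters are hypothetical names and do not reflect how the library is implemented internally.

```python
def should_download(has_local_copy, force_update=False, auto_update=False,
                    remote_is_newer=False):
    """Model of the described caching rules (not the library's own code)."""
    if not has_local_copy:
        return True             # nothing cached yet, so a download is needed
    if force_update:
        return True             # overwrite the local copy unconditionally
    if auto_update:
        return remote_is_newer  # overwrite only if data.world has a newer version
    return False                # reuse the cached copy, which also works offline
```

With the library itself, the two flags are passed directly, for example dw.load_dataset(key, auto_update=True) or dw.load_dataset(key, force_update=True).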

Once loaded, a dataset (data and metadata) can be conveniently accessed via the object returned by load_dataset().

Start by importing the datadotworld module:

import datadotworld as dw

Then invoke the load_dataset() function to download a dataset and work with it locally. For example:

intro_dataset = dw.load_dataset('jonloyens/an-intro-to-dataworld-dataset')

Dataset objects allow access to data via three different properties: raw_data, tables and dataframes. Each of these properties is a mapping (dict) whose values are of type bytes, list and pandas.DataFrame, respectively. Values are lazy loaded and cached once loaded. Their keys are the names of the files contained in the dataset.

For example:

>>> intro_dataset.dataframes
LazyLoadedDict({
    'changelog': LazyLoadedValue(<pandas.DataFrame>),
    'datadotworldbballstats': LazyLoadedValue(<pandas.DataFrame>),
    'datadotworldbballteam': LazyLoadedValue(<pandas.DataFrame>)})

IMPORTANT: Not all files in a dataset are tabular, therefore some will be exposed via raw_data only.

Tables are lists of rows, each represented by a mapping (dict) of column names to their respective values.

For example:

>>> stats_table = intro_dataset.tables['datadotworldbballstats']
>>> stats_table[0]
OrderedDict([('Name', 'Jon'),
             ('PointsPerGame', Decimal('20.4')),
             ('AssistsPerGame', Decimal('1.3'))])
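
Because a table is a plain list of row mappings, it can be processed with ordinary Python. The sketch below works offline on two sample rows copied from this page; column_mean and top_by are hypothetical helper names, not part of datadotworld.

```python
from collections import OrderedDict
from decimal import Decimal

# Sample rows shaped like the table output above (values copied from this page).
stats_table = [
    OrderedDict([("Name", "Jon"),
                 ("PointsPerGame", Decimal("20.4")),
                 ("AssistsPerGame", Decimal("1.3"))]),
    OrderedDict([("Name", "Rob"),
                 ("PointsPerGame", Decimal("15.5")),
                 ("AssistsPerGame", Decimal("8.0"))]),
]


def column_mean(rows, column):
    """Average a numeric column across a list of row mappings."""
    values = [row[column] for row in rows]
    return sum(values) / len(values)


def top_by(rows, column):
    """Return the row with the largest value in the given column."""
    return max(rows, key=lambda row: row[column])


print(column_mean(stats_table, "PointsPerGame"))
print(top_by(stats_table, "AssistsPerGame")["Name"])
```

Numeric values arrive as Decimal, so arithmetic stays exact; convert with float() only if a downstream library requires it.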

You can also review the metadata associated with a file or the entire dataset, using the describe function. For example:

>>> intro_dataset.describe()
{'homepage': 'https://data.world/jonloyens/an-intro-to-dataworld-dataset',
 'name': 'jonloyens_an-intro-to-dataworld-dataset',
 'resources': [{'format': 'csv',
   'name': 'changelog',
   'path': 'data/ChangeLog.csv'},
  {'format': 'csv',
   'name': 'datadotworldbballstats',
   'path': 'data/DataDotWorldBBallStats.csv'},
  {'format': 'csv',
   'name': 'datadotworldbballteam',
   'path': 'data/DataDotWorldBBallTeam.csv'}]}
>>> intro_dataset.describe('datadotworldbballstats')
{'format': 'csv',
 'name': 'datadotworldbballstats',
 'path': 'data/DataDotWorldBBallStats.csv',
 'schema': {'fields': [{'name': 'Name', 'title': 'Name', 'type': 'string'},
                       {'name': 'PointsPerGame',
                        'title': 'PointsPerGame',
                        'type': 'number'},
                       {'name': 'AssistsPerGame',
                        'title': 'AssistsPerGame',
                        'type': 'number'}]}}
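
The metadata returned by describe() is a regular dict, so column names and types can be pulled out of the schema directly. The sketch below runs on a copy of the output shown above; column_types and numeric_columns are hypothetical helper names.

```python
# Copy of the describe('datadotworldbballstats') output shown above.
resource = {
    "format": "csv",
    "name": "datadotworldbballstats",
    "path": "data/DataDotWorldBBallStats.csv",
    "schema": {"fields": [
        {"name": "Name", "title": "Name", "type": "string"},
        {"name": "PointsPerGame", "title": "PointsPerGame", "type": "number"},
        {"name": "AssistsPerGame", "title": "AssistsPerGame", "type": "number"},
    ]},
}


def column_types(resource_metadata):
    """Map each column name to its declared type; empty if there is no schema."""
    fields = resource_metadata.get("schema", {}).get("fields", [])
    return {field["name"]: field["type"] for field in fields}


def numeric_columns(resource_metadata):
    """List the columns whose declared type is 'number', in schema order."""
    return [name for name, kind in column_types(resource_metadata).items()
            if kind == "number"]


print(column_types(resource))
print(numeric_columns(resource))
```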

Query a dataset

The query() function allows datasets to be queried live using SQL or SPARQL query languages.

To query a dataset, invoke the query() function. For example:

results = dw.query('jonloyens/an-intro-to-dataworld-dataset', 'SELECT * FROM DataDotWorldBBallStats')

Query result objects allow access to the data via raw_data, table and dataframe properties, of type json, list and pandas.DataFrame, respectively.

For example:

>>> results.dataframe
      Name  PointsPerGame  AssistsPerGame
0      Jon           20.4             1.3
1      Rob           15.5             8.0
2   Sharon           30.1            11.2
3     Alex            8.2             0.5
4  Rebecca           12.3            17.0
5   Ariane           18.1             3.0
6    Bryon           16.0             8.5
7     Matt           13.0             2.1

Tables are lists of rows, each represented by a mapping (dict) of column names to their respective values. For example:

>>> results.table[0]
OrderedDict([('Name', 'Jon'),
             ('PointsPerGame', Decimal('20.4')),
             ('AssistsPerGame', Decimal('1.3'))])

To query using SPARQL, invoke query() with query_type='sparql'. Otherwise, the query is assumed to be SQL.
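
A SPARQL call might look like the sketch below. The wrapper run_sparql, its query_fn parameter and the generic triple pattern in EXAMPLE_SPARQL are assumptions made for illustration; query_fn exists only so the wrapper can be exercised without network access, and by default it falls back to dw.query.

```python
# A generic SPARQL pattern used only as a placeholder query.
EXAMPLE_SPARQL = "SELECT * WHERE { ?s ?p ?o } LIMIT 10"


def run_sparql(dataset_key, sparql_text, query_fn=None):
    """Run a SPARQL query against a dataset.

    run_sparql is a hypothetical wrapper; query_fn is injectable so the
    sketch can be tested offline.
    """
    if query_fn is None:
        # Imported lazily; requires datadotworld and a configured token.
        import datadotworld as dw
        query_fn = dw.query
    return query_fn(dataset_key, sparql_text, query_type="sparql")
```

For example, run_sparql('jonloyens/an-intro-to-dataworld-dataset', EXAMPLE_SPARQL) would return a query result object with the same raw_data, table and dataframe properties described above.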

Just like in the dataset case, you can view the metadata associated with a query result using the describe() function.

For example:

>>> results.describe()
{'fields': [{'name': 'Name', 'type': 'string'},
            {'name': 'PointsPerGame', 'type': 'number'},
            {'name': 'AssistsPerGame', 'type': 'number'}]}

Work with files

The open_remote_file() function allows you to write data to or read data from a file in a data.world dataset.

Writing files

The object that is returned from the open_remote_file() call is similar to a file handle that would be used to write to a local file - it has a write() method, and contents sent to that method will be written to the file remotely.

>>> import datadotworld as dw
>>>
>>> with dw.open_remote_file('username/test-dataset', 'test.txt') as w:
...   w.write("this is a test.")
>>>

Of course, writing a text file isn’t the primary use case for data.world - you want to write your data! The object returned by open_remote_file() should be usable anywhere you could normally use a local file handle in write mode, so you can use it to serialize the contents of a pandas DataFrame to a CSV file…

>>> import pandas as pd
>>> df = pd.DataFrame({'foo':[1,2,3,4],'bar':['a','b','c','d']})
>>> with dw.open_remote_file('username/test-dataset', 'dataframe.csv') as w:
...   df.to_csv(w, index=False)

Or, to write a series of dict objects as a JSON Lines file…

>>> import json
>>> with dw.open_remote_file('username/test-dataset', 'test.jsonl') as w:
...   json.dump({'foo':42, 'bar':"A"}, w)
...   w.write("\n")  # JSON Lines expects one record per line
...   json.dump({'foo':13, 'bar':"B"}, w)
...   w.write("\n")
>>>
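
JSON Lines expects each record on its own line, so a newline has to follow every serialized object. The sketch below shows this with an in-memory buffer; write_jsonl is a hypothetical helper, and the assumption is that the handle returned by open_remote_file() in text write mode could be passed in place of the buffer.

```python
import io
import json


def write_jsonl(records, handle):
    """Write each record as one JSON document per line to a text handle."""
    for record in records:
        handle.write(json.dumps(record))
        handle.write("\n")  # the newline is what separates JSON Lines records


# Demonstrate with an in-memory text buffer instead of a remote file.
buffer = io.StringIO()
write_jsonl([{"foo": 42, "bar": "A"}, {"foo": 13, "bar": "B"}], buffer)
lines = buffer.getvalue().splitlines()
print(lines)
```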

Or to write a series of dict objects as a CSV…

>>> import csv
>>> with dw.open_remote_file('username/test-dataset', 'test.csv') as w:
...   csvw = csv.DictWriter(w, fieldnames=['foo', 'bar'])
...   csvw.writeheader()
...   csvw.writerow({'foo':42, 'bar':"A"})
...   csvw.writerow({'foo':13, 'bar':"B"})
>>>

And finally, you can write binary data by streaming bytes or bytearray objects, if you open the file in binary mode…

>>> with dw.open_remote_file('username/test-dataset', 'test.txt', mode='wb') as w:
...   w.write(bytes([100,97,116,97,46,119,111,114,108,100]))

Reading files

You can also read data from a file in a similar fashion:

>>> with dw.open_remote_file('username/test-dataset', 'test.txt', mode='r') as r:
...   print(r.read())

Reading from the file into common parsing libraries works naturally, too - when opened in ‘r’ mode, the file object acts as an Iterator of the lines in the file:

>>> with dw.open_remote_file('username/test-dataset', 'test.txt', mode='r') as r:
...   csvr = csv.DictReader(r)
...   for row in csvr:
...      print(row['column a'], row['column b'])
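
csv.DictReader only needs an iterable that yields lines, which is why a file object acting as an iterator of lines can be passed straight to it. The sketch below shows the same pattern offline with an in-memory buffer; read_rows and the sample contents are made up for illustration.

```python
import csv
import io


def read_rows(line_iterable):
    """Parse CSV rows from any iterable of text lines."""
    return list(csv.DictReader(line_iterable))


# Stand-in for a remote file opened in 'r' mode: an iterator of lines.
sample = io.StringIO("column a,column b\n1,x\n2,y\n")
rows = read_rows(sample)
for row in rows:
    print(row["column a"], row["column b"])
```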

Reading binary files works naturally, too - when opened in ‘rb’ mode, read() returns the contents of the file as a byte array, and the file object acts as an iterator of bytes:

>>> with dw.open_remote_file('username/test-dataset', 'test', mode='rb') as r:
...   data = r.read()  # avoid shadowing the built-in bytes type

Additional API Features

For a complete list of available API operations, see the official documentation.

Python wrappers are implemented by the ApiClient class. To obtain an instance, call api_client(). For example:

client = dw.api_client()

The client currently implements the following functions:

  • create_dataset
  • update_dataset
  • replace_dataset
  • get_dataset
  • delete_dataset
  • add_files_via_url
  • append_records
  • upload_files
  • upload_file
  • delete_files
  • sync_files
  • download_dataset
  • download_file
  • get_user_data
  • fetch_contributing_datasets
  • fetch_liked_datasets
  • fetch_datasets
  • fetch_contributing_projects
  • fetch_liked_projects
  • fetch_projects
  • get_project
  • create_project
  • update_project
  • replace_project
  • add_linked_dataset
  • remove_linked_dataset
  • delete_project
  • get_insight
  • get_insights_for_project
  • create_insight
  • replace_insight
  • update_insight
  • delete_insight
  • search_resources
  • create_new_tables
  • create_new_connections

For a few examples of what the ApiClient can be used for, see below.

Add files from URL

The add_files_via_url() function can be used to add files to a dataset from a URL. Specify the files as a dictionary in which each key is the desired file name and each value is a dictionary containing url, description and labels.

For example:

>>> client = dw.api_client()
>>> client.add_files_via_url('username/test-dataset', files={'sample.xls': {'url':'http://www.sample.com/sample.xls', 'description': 'sample doc', 'labels': ['raw data']}})
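
When several files are added at once, the files dictionary can be assembled from a simple list. The helper build_files_payload below is hypothetical; it only builds the structure shown in the example above and does not call the API.

```python
def build_files_payload(entries):
    """Build the `files` mapping from (name, url, description, labels) tuples.

    build_files_payload is a hypothetical helper, not part of datadotworld.
    """
    files = {}
    for name, url, description, labels in entries:
        files[name] = {
            "url": url,
            "description": description,
            "labels": list(labels),
        }
    return files


payload = build_files_payload([
    ("sample.xls", "http://www.sample.com/sample.xls", "sample doc", ["raw data"]),
])
print(payload)
```

The result can then be passed as client.add_files_via_url('username/test-dataset', files=payload).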

Append records to stream

The append_records() function allows you to append JSON data to a data stream associated with a dataset. Streams do not need to be created in advance; a stream is created automatically the first time its streamId is used in an append operation.

For example:

>>> client = dw.api_client()
>>> client.append_records('username/test-dataset','streamId', {'data': 'data'})

Contents of a stream will appear as part of the respective dataset as a .jsonl file.
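
To append several records to one stream, the call can be wrapped in a loop. In the sketch below, append_all and RecordingClient are hypothetical names: RecordingClient is an offline stand-in that mimics only the append_records call shape used above, so the loop can be tried without a token. A real client from dw.api_client() would be passed instead.

```python
def append_all(client, dataset_key, stream_id, records):
    """Append each record to the stream and return how many were sent."""
    count = 0
    for record in records:
        client.append_records(dataset_key, stream_id, record)
        count += 1
    return count


class RecordingClient:
    """Offline stand-in that records calls instead of contacting data.world."""

    def __init__(self):
        self.calls = []

    def append_records(self, dataset_key, stream_id, body):
        self.calls.append((dataset_key, stream_id, body))


fake = RecordingClient()
sent = append_all(fake, "username/test-dataset", "streamId",
                  [{"data": "first"}, {"data": "second"}])
print(sent, fake.calls)
```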

You can find out more about these functions using help(client).
