Core data types used by OWID for managing data.

owid-catalog

A Pythonic API for working with OWID's data catalog.

Status: experimental, APIs likely to change

Quickstart

Install with pip install owid-catalog. Then you can begin exploring the experimental data catalog:

from owid import catalog

# look for Covid-19 data, return a data frame of matches
catalog.find('covid')

# load Covid-19 data from the Our World In Data namespace as a data frame
df = catalog.find('covid', namespace='owid').load()

Development

You need Python 3.9+, poetry and make installed. Clone the repo, then you can simply run:

# run all unit tests and CI checks
make test

# watch for changes, then run all checks
make watch

Data types

Catalog

A catalog is an arbitrarily deep folder structure containing datasets. It can be local on disk, or remote.
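
As a rough illustration of that folder structure, the sketch below walks a local directory tree and treats any folder holding an index.json file as a dataset. It uses only the standard library and is not the library's own discovery code; the helper name find_dataset_dirs and the example paths are assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path


def find_dataset_dirs(root: Path) -> list[Path]:
    """Return every folder under `root` that contains an index.json file.

    Hypothetical helper: it only mirrors the layout described in this README.
    """
    return sorted(p.parent for p in root.rglob("index.json"))


# Build a small, nested example catalog in a temporary directory.
root = Path(tempfile.mkdtemp())
for rel in ["gapminder/2020/population", "owid/covid"]:
    ds_dir = root / rel
    ds_dir.mkdir(parents=True)
    (ds_dir / "index.json").write_text(json.dumps({"title": rel}))

# Datasets can sit at any depth below the catalog root.
found = [p.relative_to(root).as_posix() for p in find_dataset_dirs(root)]
print(found)
```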

Load the remote catalog

from owid.catalog import RemoteCatalog

# find the default OWID catalog and fetch the catalog index over HTTPS
cat = RemoteCatalog()

# get a list of matching tables in different datasets
matches = cat.find('population')

# fetch a data frame for a specific match over HTTPS
t = cat.find_one('population', namespace='gapminder')

Datasets

A dataset is a folder of tables, together with metadata about the overall collection.

  • Metadata about the dataset lives in index.json
  • All tables in the folder must share a common format (CSV or Feather)
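
To make the layout concrete, here is a minimal sketch of what such a folder might look like on disk, built with the standard library only. The keys written to index.json, the file names, and the helper table_formats are assumptions for illustration; the actual schema is whatever the library writes.

```python
import json
import tempfile
from pathlib import Path

DATA_SUFFIXES = {".csv", ".feather"}


def table_formats(ds_dir: Path) -> set[str]:
    """Collect the data-file formats present in a dataset folder."""
    return {p.suffix.lstrip(".") for p in ds_dir.iterdir() if p.suffix in DATA_SUFFIXES}


ds_dir = Path(tempfile.mkdtemp()) / "my_data"
ds_dir.mkdir()

# Dataset-level metadata lives in index.json (the key used here is illustrative).
(ds_dir / "index.json").write_text(json.dumps({"title": "Very Important Dataset"}))

# Two tables, both stored as CSV, so the folder has one common format.
(ds_dir / "gdp.csv").write_text("country,gdp\nAU,1\nSE,2\n")
(ds_dir / "population.csv").write_text("country,population\nAU,25\nSE,10\n")

formats = table_formats(ds_dir)
assert len(formats) == 1, "all tables in a dataset should share one format"
print(formats)
```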

Create a new dataset

from owid.catalog import Dataset

# make a folder and an empty index.json file
ds = Dataset.create('/tmp/my_data')
# choose CSV instead of feather for files
ds = Dataset.create('/tmp/my_data', format='csv')

Add a table to a dataset

# serialize a table using the table's name and the dataset's default format (feather)
# (e.g. /tmp/my_data/my_table.feather)
ds.add(table)

Remove a table from a dataset

ds.remove('table_name')

Access a table

# load a table including metadata into memory
t = ds['my_table']

List tables

# the length is the number of tables discovered on disk
assert len(ds) > 0
# iterate over the tables discovered on disk
for table in ds:
    do_something(table)

Add metadata

# you need to manually save your changes
ds.title = "Very Important Dataset"
ds.description = "This dataset is a composite of blah blah blah..."
ds.save()

Copy a dataset

# copying a dataset copies all its files to a new location
ds_new = ds.copy('/tmp/new_data_path')

# copying a dataset is identical to copying its folder, so this works too
import shutil

shutil.copytree('/tmp/old_data', '/tmp/new_data_path')
ds_new = Dataset('/tmp/new_data_path')

Tables

Tables are essentially pandas DataFrames with metadata attached. All operations on them occur in memory, except for loading from and saving to disk. On disk, each table is represented by a tabular file (feather or CSV) and a JSON metadata file.

Make a new table

from owid.catalog import Table

# same API as DataFrames
t = Table({
    'gdp': [1, 2, 3],
    'country': ['AU', 'SE', 'CH']
}).set_index('country')

Add metadata about the whole table

t.title = 'Very important data'

Add metadata about a field

from owid.catalog.meta import Source

# metadata set on a column applies to that field only
t.gdp.description = 'GDP measured in 2011 international $'
t.gdp.sources = [
    Source(title='World Bank', url='https://www.worldbank.org/en/home')
]

Add metadata about all fields at once

from owid.catalog.meta import License, Source

# sources and licenses are actually stored at the field level
t.sources = [
    Source(title='World Bank', url='https://www.worldbank.org/en/home')
]
t.licenses = [
    License('CC-BY-SA-4.0', url='https://creativecommons.org/licenses/by-sa/4.0/')
]
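
The idea that table-level sources are really stored per field can be pictured with a toy model. The sketch below is not the library's implementation: SketchTable and FieldMeta are hypothetical stand-ins, and sources are plain strings rather than Source objects, so the real Table may differ in detail.

```python
from dataclasses import dataclass, field


@dataclass
class FieldMeta:
    """Hypothetical per-field metadata record."""
    description: str = ""
    sources: list[str] = field(default_factory=list)


class SketchTable:
    """Toy stand-in for a table: metadata lives only on its fields."""

    def __init__(self, columns: list[str]) -> None:
        self.fields = {name: FieldMeta() for name in columns}

    @property
    def sources(self) -> list[str]:
        # Table-level view: the distinct sources found across all fields.
        seen: list[str] = []
        for meta in self.fields.values():
            for s in meta.sources:
                if s not in seen:
                    seen.append(s)
        return seen

    @sources.setter
    def sources(self, value: list[str]) -> None:
        # Assigning at table level writes a copy to every field.
        for meta in self.fields.values():
            meta.sources = list(value)


t = SketchTable(["gdp", "population"])
t.sources = ["World Bank"]
print(t.fields["gdp"].sources, t.sources)
```

Under this model there is no separate table-level store, so a source assigned to a single field later would also show up in the table-level view.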

Save a table to disk

# save to /tmp/my_table.feather + /tmp/my_table.meta.json
t.to_feather('/tmp/my_table.feather')

# save to /tmp/my_table.csv + /tmp/my_table.meta.json
t.to_csv('/tmp/my_table.csv')
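
The naming convention in the comments above (data file plus a .meta.json file with the same stem) can be sketched with pathlib. The helper metadata_path is a hypothetical name used here for illustration and is not part of the library's API.

```python
from pathlib import Path


def metadata_path(data_path: str) -> Path:
    """Map a data file to its metadata file, e.g. my_table.csv -> my_table.meta.json."""
    return Path(data_path).with_suffix(".meta.json")


feather_meta = metadata_path("/tmp/my_table.feather")
csv_meta = metadata_path("/tmp/my_table.csv")
print(feather_meta.name, csv_meta.name)
```

Note that both formats map to the same metadata file name, which matches the two comments above.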

Load a table from disk

These work like the normal pandas readers, but if a my_table.meta.json file sits alongside the data file, its metadata is read as well. Otherwise the table is assumed to have no metadata:

t = Table.read_feather('/tmp/my_table.feather')

t = Table.read_csv('/tmp/my_table.csv')
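
The fallback described above (use the metadata file if it exists, otherwise assume no metadata) might look roughly like the following. This is a standard-library sketch rather than the code behind Table.read_csv; the helper read_rows_with_metadata and the file names are assumptions for this example.

```python
import csv
import json
import tempfile
from pathlib import Path


def read_rows_with_metadata(csv_path: Path) -> tuple[list[dict], dict]:
    """Read CSV rows plus the neighbouring .meta.json file if present, else {}."""
    with csv_path.open(newline="") as f:
        rows = list(csv.DictReader(f))
    meta_path = csv_path.with_suffix(".meta.json")
    meta = json.loads(meta_path.read_text()) if meta_path.exists() else {}
    return rows, meta


tmp = Path(tempfile.mkdtemp())

# A table saved together with its metadata file.
(tmp / "with_meta.csv").write_text("country,gdp\nAU,1\n")
(tmp / "with_meta.meta.json").write_text(json.dumps({"title": "Very important data"}))

# A plain CSV with no metadata file next to it.
(tmp / "plain.csv").write_text("country,gdp\nSE,2\n")

rows_a, meta_a = read_rows_with_metadata(tmp / "with_meta.csv")
rows_b, meta_b = read_rows_with_metadata(tmp / "plain.csv")
print(meta_a, meta_b)
```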

Changelog

  • v0.2.4
    • Update the default catalog URL to use a CDN
  • v0.2.3
    • Fix methods for finding and loading data from a LocalCatalog
  • v0.2.2
    • Repack frames to compact dtypes on Table.to_feather()
  • v0.2.1
    • Fix key typo used in version check
  • v0.2.0
    • Copy dataset metadata into tables, to make tables more traceable
    • Add API versioning, and a requirement to update if your version of this library is too old
  • v0.1.1
    • Add support for Python 3.8
  • v0.1.0
    • Initial release, including searching and fetching data from a remote catalog
