Core data types used by OWID for managing data.

owid-catalog

A Pythonic API for working with OWID's data catalog.

Status: experimental, APIs likely to change

Overview

Our World in Data is building a new data catalog, with the goal of making our datasets reproducible and transparent to the general public. That project is our etl, which will contain the recipes for every dataset we republish.

This library allows you to query our data catalog programmatically, and get back data in the form of Pandas data frames, perfect for data pipelines or Jupyter notebook explorations.

graph TB
    etl -->|reads| walden[upstream datasets]
    etl -->|generates| s3[data catalog]
    catalog[owid-catalog-py] -->|queries| s3

We would love feedback on how we can make this library and the overall data catalog better. Feel free to send us an email at info@ourworldindata.org, or start a discussion on GitHub.

Quickstart

Install with pip install owid-catalog. Then you can begin exploring the experimental data catalog:

from owid import catalog

# look for Covid-19 data, return a data frame of matches
catalog.find('covid')

# load Covid-19 data from the Our World in Data namespace as a data frame
df = catalog.find('covid', namespace='owid').load()

# load data from a channel other than the default `garden`
lung_cancer_tables = catalog.find('lung_cancer_deaths_per_100000_men', channels=['open_numbers'])
df = lung_cancer_tables.iloc[0].load()
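
To make the filtering in these calls concrete, here is a minimal, library-free sketch. It is not the library's implementation: the `INDEX` records, their field names, and the substring-matching rule are assumptions chosen for illustration, and the real catalog index has its own schema.

```python
from typing import Iterable, List, Optional

# Hypothetical index records (assumed fields: table, namespace, channel).
INDEX = [
    {"table": "covid_cases", "namespace": "owid", "channel": "garden"},
    {"table": "covid_tests", "namespace": "who", "channel": "garden"},
    {
        "table": "lung_cancer_deaths_per_100000_men",
        "namespace": "open_numbers",
        "channel": "open_numbers",
    },
]


def find(
    pattern: str,
    namespace: Optional[str] = None,
    channels: Iterable[str] = ("garden",),
    index: List[dict] = INDEX,
) -> List[dict]:
    """Return records whose table name contains `pattern` (assumed rule)."""
    allowed = set(channels)
    return [
        rec
        for rec in index
        if pattern in rec["table"]
        and rec["channel"] in allowed
        and (namespace is None or rec["namespace"] == namespace)
    ]


print(find("covid"))                                   # both covid tables
print(find("covid", namespace="owid"))                 # only the owid table
print(find("lung_cancer"))                             # [] (not in `garden`)
print(find("lung_cancer", channels=["open_numbers"]))  # one match
```

The point of the sketch is that a search restricted to the default channel can come back empty even when the table exists elsewhere, which is why the last Quickstart call passes `channels` explicitly.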

Development

You need Python 3.8+, poetry and make installed. Clone the repo, then you can simply run:

# run all unit tests and CI checks
make test

# watch for changes, then run all checks
make watch

Data types

Catalog

A catalog is an arbitrarily deep folder structure containing datasets. It can be local on disk, or remote.
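
As a rough illustration of what "arbitrarily deep" means for a catalog on disk, the sketch below walks a directory tree and treats any folder holding an index.json as a dataset. The helper name `discover_datasets`, the folder names, and the index.json contents are hypothetical; this is not how the library itself discovers datasets.

```python
import json
import os
import tempfile
from typing import List


def discover_datasets(root: str) -> List[str]:
    """Return relative paths of all folders under `root` holding an index.json."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "index.json" in filenames:
            found.append(os.path.relpath(dirpath, root))
    return sorted(found)


# Build a throwaway catalog with datasets at different depths.
root = tempfile.mkdtemp()
for rel in ("garden/owid/latest/covid", "meadow/gapminder/population"):
    path = os.path.join(root, *rel.split("/"))
    os.makedirs(path)
    with open(os.path.join(path, "index.json"), "w") as f:
        json.dump({"title": rel.rsplit("/", 1)[-1]}, f)  # assumed field

datasets = discover_datasets(root)
print(datasets)  # two dataset folders, found at depth 4 and depth 3
```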

Load the remote catalog

from owid.catalog import RemoteCatalog

# find the default OWID catalog and fetch the catalog index over HTTPS
cat = RemoteCatalog()

# get a list of matching tables in different datasets
matches = cat.find('population')

# fetch a data frame for a specific match over HTTPS
t = cat.find_one('population', namespace='gapminder')

# load channels other than `garden`
cat = RemoteCatalog(channels=('garden', 'meadow', 'open_numbers'))

Datasets

A dataset is a folder of tables containing metadata about the overall collection.

- Metadata about the dataset lives in index.json
- All tables in the folder must share a common format (CSV or Feather)
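
The two rules above can be pictured with a plain folder on disk. The sketch below builds one with the standard library and checks that every table file shares a single format. The helper `table_formats`, the file names, and the index.json contents are assumptions for illustration, not the library's own validation.

```python
import csv
import json
import os
import tempfile
from typing import Set


def table_formats(folder: str) -> Set[str]:
    """Collect the extensions of data files, skipping JSON metadata files."""
    return {
        os.path.splitext(name)[1].lstrip(".")
        for name in os.listdir(folder)
        if name != "index.json" and not name.endswith(".meta.json")
    }


# A dataset folder: one index.json plus tables in a single format.
folder = tempfile.mkdtemp()
with open(os.path.join(folder, "index.json"), "w") as f:
    json.dump({"title": "Very Important Dataset"}, f)  # assumed field
for name in ("gdp", "population"):
    with open(os.path.join(folder, name + ".csv"), "w", newline="") as f:
        csv.writer(f).writerows([["country", "value"], ["AU", 1]])

formats = table_formats(folder)
print(formats)  # {'csv'}
```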

Create a new dataset

from owid.catalog import Dataset

# make a folder and an empty index.json file
ds = Dataset.create('/tmp/my_data')

# choose CSV instead of feather for files
ds = Dataset.create('/tmp/my_data', format='csv')

Add a table to a dataset

# serialize a table using the table's name and the dataset's default format (feather)
# (e.g. /tmp/my_data/my_table.feather)
ds.add(table)

Remove a table from a dataset

ds.remove('table_name')

Access a table

# load a table including metadata into memory
t = ds['my_table']

List tables

# the length is the number of tables discovered on disk
assert len(ds) > 0

# iterate over the tables discovered on disk
for table in ds:
    do_something(table)

Add metadata

# you need to manually save your changes
ds.title = "Very Important Dataset"
ds.description = "This dataset is a composite of blah blah blah..."
ds.save()

Copy a dataset

# copying a dataset copies all its files to a new location
ds_new = ds.copy('/tmp/new_data_path')

# copying a dataset is identical to copying its folder, so this works too
import shutil

shutil.copytree('/tmp/old_data', '/tmp/new_data_path')
ds_new = Dataset('/tmp/new_data_path')
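
Because a dataset is just a folder, the folder-copy route can be tried without the library at all. The following sketch uses only the standard library. The file names and index.json contents are made up for the example.

```python
import json
import os
import shutil
import tempfile

# Build a throwaway dataset folder: an index.json plus one table file.
src = os.path.join(tempfile.mkdtemp(), "old_data")
os.makedirs(src)
with open(os.path.join(src, "index.json"), "w") as f:
    json.dump({"title": "Very Important Dataset"}, f)  # assumed field
with open(os.path.join(src, "my_table.csv"), "w") as f:
    f.write("country,gdp\nAU,1\n")

# copytree creates the destination itself, so it must not exist beforehand.
dst = os.path.join(tempfile.mkdtemp(), "new_data_path")
shutil.copytree(src, dst)

copied = sorted(os.listdir(dst))
print(copied)  # ['index.json', 'my_table.csv']
```

Note that `shutil.copytree` raises `FileExistsError` if the destination already exists, unless `dirs_exist_ok=True` is passed (Python 3.8+).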

Tables

Tables are essentially pandas DataFrames but with metadata. All operations on them occur in-memory, except for loading from and saving to disk. On disk, they are represented by a tabular file (feather or CSV) and a JSON metadata file.

Each column of a Table has VariableMeta metadata attached, including its type, description, and unit. Be careful when manipulating columns, as not all operations are currently supported. Supported: adding a column, renaming columns. Not supported: direct assignment to t.columns = ... or to index names t.columns.index = ....
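
A toy model can show why going through a dedicated rename operation keeps metadata attached to a column. This is only a sketch under assumptions: `ToyTable` and `ColumnMeta` are hypothetical stand-ins and say nothing about how Table and VariableMeta are actually implemented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ColumnMeta:
    """Hypothetical stand-in for per-column metadata."""
    description: Optional[str] = None
    unit: Optional[str] = None


@dataclass
class ToyTable:
    columns: Dict[str, List] = field(default_factory=dict)
    meta: Dict[str, ColumnMeta] = field(default_factory=dict)

    def add_column(self, name: str, values: List) -> None:
        # A new column starts with empty metadata.
        self.columns[name] = list(values)
        self.meta[name] = ColumnMeta()

    def rename(self, mapping: Dict[str, str]) -> None:
        # Re-key data and metadata together so they stay in sync.
        self.columns = {mapping.get(k, k): v for k, v in self.columns.items()}
        self.meta = {mapping.get(k, k): m for k, m in self.meta.items()}


t = ToyTable()
t.add_column("gdp", [1, 2, 3])
t.meta["gdp"].description = "GDP measured in 2011 international $"
t.rename({"gdp": "gdp_ppp"})
print(t.meta["gdp_ppp"].description)  # GDP measured in 2011 international $
```

In this toy model, overwriting the column names directly would leave `meta` keyed by the old names, which is the kind of mismatch a restricted set of supported operations avoids.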

Make a new table

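from owid.catalog import Table
from owid.catalog.meta import License, Source  # used in the metadata examples below

# same API as DataFrames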
t = Table({
    'gdp': [1, 2, 3],
    'country': ['AU', 'SE', 'CH']
}).set_index('country')

Add metadata about the whole table

t.title = 'Very important data'

Add metadata about a field

t.gdp.description = 'GDP measured in 2011 international $'
t.gdp.sources = [
    Source(title='World Bank', url='https://www.worldbank.org/en/home')
]

Add metadata about all fields at once

# sources and licenses are actually stored at the field level
t.sources = [
    Source(title='World Bank', url='https://www.worldbank.org/en/home')
]
t.licenses = [
    License('CC-BY-SA-4.0', url='https://creativecommons.org/licenses/by-sa/4.0/')
]
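
The comment about field-level storage can be illustrated with a small toy class in which a table-level assignment is copied to every field. `ToyTable`, `FieldMeta`, and the merge rule used when reading back are assumptions for the sketch and do not describe the library's behaviour.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FieldMeta:
    """Hypothetical per-field metadata holding only sources."""
    sources: List[str] = field(default_factory=list)


class ToyTable:
    def __init__(self, column_names: List[str]) -> None:
        self.fields: Dict[str, FieldMeta] = {n: FieldMeta() for n in column_names}

    @property
    def sources(self) -> List[str]:
        # Reading back: merge field-level sources, keeping first-seen order.
        merged: List[str] = []
        for meta in self.fields.values():
            for source in meta.sources:
                if source not in merged:
                    merged.append(source)
        return merged

    @sources.setter
    def sources(self, value: List[str]) -> None:
        # Writing: every field gets its own copy of the list.
        for meta in self.fields.values():
            meta.sources = list(value)


t = ToyTable(["gdp", "population"])
t.sources = ["World Bank"]
t.fields["gdp"].sources.append("IMF")  # only affects the gdp field

print(t.fields["population"].sources)  # ['World Bank']
print(t.sources)                       # ['World Bank', 'IMF']
```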

Save a table to disk

# save to /tmp/my_table.feather + /tmp/my_table.meta.json
t.to_feather('/tmp/my_table.feather')

# save to /tmp/my_table.csv + /tmp/my_table.meta.json
t.to_csv('/tmp/my_table.csv')

Load a table from disk

These work like normal pandas DataFrames, but if there is also a my_table.meta.json file, then metadata will also get read. Otherwise it will be assumed that the data has no metadata:

t = Table.read_feather('/tmp/my_table.feather')

t = Table.read_csv('/tmp/my_table.csv')
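
The sidecar convention described above (a my_table.meta.json next to the data file, with no metadata assumed when it is missing) can be sketched with the standard library alone. The helpers `meta_path`, `write_table`, and `read_table` are hypothetical names for illustration and are not part of owid-catalog.

```python
import csv
import json
import os
import tempfile
from typing import List, Tuple


def meta_path(data_path: str) -> str:
    """/tmp/my_table.csv -> /tmp/my_table.meta.json"""
    return os.path.splitext(data_path)[0] + ".meta.json"


def write_table(path: str, rows: List[list], metadata: dict) -> None:
    # Write the data file and its JSON sidecar side by side.
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    with open(meta_path(path), "w") as f:
        json.dump(metadata, f)


def read_table(path: str) -> Tuple[List[list], dict]:
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    metadata: dict = {}
    sidecar = meta_path(path)
    if os.path.exists(sidecar):  # a missing sidecar means "no metadata"
        with open(sidecar) as f:
            metadata = json.load(f)
    return rows, metadata


path = os.path.join(tempfile.mkdtemp(), "my_table.csv")
write_table(path, [["country", "gdp"], ["AU", 1]], {"title": "Very important data"})

rows, metadata = read_table(path)
print(metadata)  # {'title': 'Very important data'}

os.remove(meta_path(path))
_, metadata_without_sidecar = read_table(path)
print(metadata_without_sidecar)  # {}
```

Values read back through the `csv` module are strings, so the sketch does not preserve column types the way a feather file would.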

Changelog

dev
v0.3.6
- Fixed tons of bugs
- processing.py module with pandas-like functions that propagate metadata
- Support for Dynamic YAML files
- Support for R2 alongside S3
v0.3.5
- Remove catalog.frames; use owid-repack package instead
- Relax dependency constraints
- Add optional channel argument to DatasetMeta
- Stop supporting metadata in Parquet format, load JSON sidecar instead
- Fix errors when creating new Table columns
v0.3.4
- Bump pyarrow dependency to enable Python 3.11 support
v0.3.3
- Add more arguments to Table.__init__ that are often used in ETL
- Add Dataset.update_metadata function for updating metadata from YAML file
- Python 3.11 support via update of pyarrow dependency
v0.3.2
- Fix a bug in Catalog.__getitem__()
- Replace mypy type checker by pyright
v0.3.1
- Sort imports with isort
- Change black line length to 120
- Add grapher channel
- Support path-based indexing into catalogs
v0.3.0
- Update OWID_CATALOG_VERSION to 3
- Support multiple formats per table
- Support reading and writing parquet files with embedded metadata
- Optional repack argument when adding tables to dataset
- Underscore |
- Get version field from DatasetMeta init
- Resolve collisions of underscore_table function
- Convert version to str and load json dimensions
v0.2.9
- Allow multiple channels in catalog.find function
v0.2.8
- Update OWID_CATALOG_VERSION to 2
v0.2.7
- Split datasets into channels (garden, meadow, open_numbers, ...) and make garden default one
- Add .find_latest method to Catalog
v0.2.6
- Add flag is_public for public/private datasets
- Enforce snake_case for table, dataset and variable short names
- Add fields published_by and published_at to Source
- Added a list of supported and unsupported operations on columns
- Updated pyarrow
v0.2.5
- Fix ability to load remote CSV tables
v0.2.4
- Update the default catalog URL to use a CDN
v0.2.3
- Fix methods for finding and loading data from a LocalCatalog
v0.2.2
- Repack frames to compact dtypes on Table.to_feather()
v0.2.1
- Fix key typo used in version check
v0.2.0
- Copy dataset metadata into tables, to make tables more traceable
- Add API versioning, and a requirement to update if your version of this library is too old
v0.1.1
- Add support for Python 3.8
v0.1.0
- Initial release, including searching and fetching data from a remote catalog

Release history

- 0.3.11 (May 17, 2024)
- 0.3.10 (May 17, 2024)
- 0.3.9 (Jan 26, 2024)
- 0.3.8 (Oct 17, 2023)
- 0.3.7 (Oct 16, 2023)
- 0.3.6 (Sep 28, 2023), this version
- 0.3.5 (Jul 20, 2023)
- 0.3.4 (Dec 22, 2022)
- 0.3.2 (Sep 24, 2022)
- 0.3.1 (Sep 24, 2022)
- 0.3.0 (Aug 10, 2022)
- 0.2.9 (May 12, 2022)
- 0.2.8 (May 11, 2022)
- 0.2.7 (May 4, 2022)
- 0.2.6 (Apr 20, 2022)
- 0.2.5 (Jan 27, 2022)
- 0.2.4 (Dec 13, 2021)
- 0.2.3 (Nov 9, 2021)
- 0.2.2 (Oct 31, 2021)
- 0.2.1 (Oct 26, 2021)
- 0.2.0 (Oct 25, 2021)
- 0.1.1 (Oct 22, 2021)
- 0.1.0 (Oct 22, 2021)

