
Core data types used by OWID for managing data.

Project description


owid-catalog

A Pythonic API for working with OWID's data catalog.

Status: experimental, APIs likely to change

Overview

Our World in Data is building a new data catalog, with the goal of making our datasets reproducible and transparent to the general public. That project is our etl, which going forward will contain the recipes for all the datasets we republish.

This library allows you to query our data catalog programmatically, and get back data in the form of Pandas data frames, perfect for data pipelines or Jupyter notebook explorations.

graph TB

etl -->|reads| snapshot[upstream datasets]
etl -->|generates| s3[data catalog]
catalog[owid-catalog-py] -->|queries| s3

We would love feedback on how we can make this library and the overall data catalog better. Feel free to send us an email at info@ourworldindata.org, or start a discussion on GitHub.

Quickstart

Install with pip install owid-catalog. Then you can get data in two different ways.

Charts catalog

This API attempts to give you exactly the data you see in a chart on our site.

from owid.catalog import charts

# get the data for one chart by URL
df = charts.get_data('https://ourworldindata.org/grapher/life-expectancy')

Notice that the last part of the URL is the chart's slug, i.e. its identifier; in this case, life-expectancy. Using the slug alone also works.

df = charts.get_data('life-expectancy')
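As a pure-Python illustration (this helper is not part of the library), the slug can be extracted from a full grapher URL like so:

```python
from urllib.parse import urlparse

def chart_slug(url: str) -> str:
    """Return the last path segment of a grapher chart URL, i.e. its slug."""
    return urlparse(url).path.rstrip("/").split("/")[-1]

slug = chart_slug("https://ourworldindata.org/grapher/life-expectancy")
# slug == "life-expectancy", which can be passed to charts.get_data()
```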

Data science API

We also curate much more data than is available on our site. To access that in efficient binary (Feather) format, use our data science API.

This API is designed for use in Jupyter notebooks.

from owid import catalog

# look for Covid-19 data, return a data frame of matches
catalog.find('covid')

# load Covid-19 data from the Our World in Data namespace as a data frame
df = catalog.find('covid', namespace='owid').load()

# search is case-insensitive and supports regex by default
catalog.find(table='gdp.*capita')

# use fuzzy search for typo-tolerant matching (sorted by relevance)
catalog.find(table='forest area', fuzzy=True)
catalog.find(dataset='wrld bank', fuzzy=True, threshold=60)
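The fuzzy threshold is a 0-100 similarity score. Under the hood the library uses rapidfuzz, but the idea can be sketched with the standard library's difflib (an illustration only, not the library's actual scoring):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> int:
    """A rough 0-100 similarity score, analogous in spirit to the fuzzy threshold."""
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)

# a typo like 'wrld bank' still scores well above a threshold of 60
similarity("wrld bank", "world bank")
```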

There may be multiple versions of the same dataset in the catalog, each with a unique path. To reliably load the same dataset again later, record its path and fetch it directly:

from owid import catalog

path = 'garden/ihme_gbd/2023-05-15/gbd_mental_health_prevalence_rate/gbd_mental_health_prevalence_rate'

rc = catalog.RemoteCatalog()
df = rc[path]
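A sketch of what such a path encodes, assuming the five-part layout channel/namespace/version/dataset/table that the example path above follows:

```python
path = "garden/ihme_gbd/2023-05-15/gbd_mental_health_prevalence_rate/gbd_mental_health_prevalence_rate"

# unpack the path components; the final segment names a table within the dataset
channel, namespace, version, dataset, table = path.split("/")
# channel == "garden", namespace == "ihme_gbd", version == "2023-05-15"
```

Pinning a full path like this (rather than re-running find()) keeps a pipeline stable even after newer versions of the dataset are published.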

Development

You need Python 3.10+, uv and make installed. Clone the repo, then you can simply run:

# run all unit tests and CI checks
make test

# watch for changes, then run all checks
make watch

Changelog

  • v0.4.4
    • Enhanced find() with better search capabilities:
      • Case-insensitive search by default (use case=True for case-sensitive)
      • Regex support enabled by default for table and dataset parameters
      • New fuzzy search with fuzzy=True - typo-tolerant matching sorted by relevance
      • Configurable fuzzy threshold (0-100) to control match strictness
    • New dependency: rapidfuzz for fuzzy string matching
  • v0.4.3
    • Fixed minor bugs
  • v0.4.0
    • Highlights
      • Support for Python 3.10-3.13 (was 3.11-3.13)
      • Drop support for Python 3.9 (breaking change)
    • Others
      • Deprecate Walden.
      • Dependencies: Change rdata for pyreadr.
      • Support: indicator dimensions.
      • Support: MDIMs.
      • Switched from Poetry to UV package manager.
      • New decorator @keep_metadata to propagate metadata in pandas functions.
    • Fixes: Table.apply, groupby.apply, metadata propagation, type hinting, etc.
  • v0.3.11
    • Add support for Python 3.12 in pyproject.toml
  • v0.3.10
    • Add experimental chart data API in owid.catalog.charts
  • v0.3.9
    • Switch from isort, black & flake8 to ruff
  • v0.3.8
    • Pin dataclasses-json==0.5.8 to fix error with python3.9
  • v0.3.7
    • Fix bugs.
    • Improve metadata propagation.
    • Improve metadata YAML file handling, to have common definitions.
    • Remove DatasetMeta.origins.
  • v0.3.6
    • Fixed tons of bugs
    • processing.py module with pandas-like functions that propagate metadata
    • Support for Dynamic YAML files
    • Support for R2 alongside S3
  • v0.3.5
    • Remove catalog.frames; use owid-repack package instead
    • Relax dependency constraints
    • Add optional channel argument to DatasetMeta
    • Stop supporting metadata in Parquet format, load JSON sidecar instead
    • Fix errors when creating new Table columns
  • v0.3.4
    • Bump pyarrow dependency to enable Python 3.11 support
  • v0.3.3
    • Add more arguments to Table.__init__ that are often used in ETL
    • Add Dataset.update_metadata function for updating metadata from YAML file
    • Python 3.11 support via update of pyarrow dependency
  • v0.3.2
    • Fix a bug in Catalog.__getitem__()
    • Replace mypy type checker by pyright
  • v0.3.1
    • Sort imports with isort
    • Change black line length to 120
    • Add grapher channel
    • Support path-based indexing into catalogs
  • v0.3.0
    • Update OWID_CATALOG_VERSION to 3
    • Support multiple formats per table
    • Support reading and writing parquet files with embedded metadata
    • Optional repack argument when adding tables to dataset
    • Underscore |
    • Get version field from DatasetMeta init
    • Resolve collisions of underscore_table function
    • Convert version to str and load json dimensions
  • v0.2.9
    • Allow multiple channels in catalog.find function
  • v0.2.8
    • Update OWID_CATALOG_VERSION to 2
  • v0.2.7
    • Split datasets into channels (garden, meadow, open_numbers, ...) and make garden default one
    • Add .find_latest method to Catalog
  • v0.2.6
    • Add flag is_public for public/private datasets
    • Enforce snake_case for table, dataset and variable short names
    • Add fields published_by and published_at to Source
    • Added a list of supported and unsupported operations on columns
    • Updated pyarrow
  • v0.2.5
    • Fix ability to load remote CSV tables
  • v0.2.4
    • Update the default catalog URL to use a CDN
  • v0.2.3
    • Fix methods for finding and loading data from a LocalCatalog
  • v0.2.2
    • Repack frames to compact dtypes on Table.to_feather()
  • v0.2.1
    • Fix key typo used in version check
  • v0.2.0
    • Copy dataset metadata into tables, to make tables more traceable
    • Add API versioning, and a requirement to update if your version of this library is too old
  • v0.1.1
    • Add support for Python 3.8
  • v0.1.0
    • Initial release, including searching and fetching data from a remote catalog

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

owid_catalog-0.4.5.tar.gz (217.6 kB)

Uploaded Source

Built Distribution


owid_catalog-0.4.5-py3-none-any.whl (86.2 kB)

Uploaded Python 3

File details

Details for the file owid_catalog-0.4.5.tar.gz.

File metadata

  • Download URL: owid_catalog-0.4.5.tar.gz
  • Upload date:
  • Size: 217.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for owid_catalog-0.4.5.tar.gz:

  • SHA256: d1ca1227e81465be3ae46e5acbaad960dc9b79859c8f23f1b126d6c5015a5662
  • MD5: f77519596c38dea0a47a4c6e1f0a1e00
  • BLAKE2b-256: 1fbfac4ee07f8b5835f2189e8e731f4170835289ebc8512c1438d265a930a724

See more details on using hashes here.

File details

Details for the file owid_catalog-0.4.5-py3-none-any.whl.

File metadata

  • Download URL: owid_catalog-0.4.5-py3-none-any.whl
  • Upload date:
  • Size: 86.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for owid_catalog-0.4.5-py3-none-any.whl:

  • SHA256: a4f781f40aff532663939b10abc5bda3995817e1bf7c8f08444285791461b4fb
  • MD5: 73eae4f4e0bd6b5eafad163cada89e4c
  • BLAKE2b-256: dec61c59d80fab044f7fa6784dd93a7f4c49f9079fb5db98bb7a9c71d8c5660c

