Core data types used by OWID for managing data.
owid-catalog
A Pythonic API for working with OWID's data catalog.
Status: experimental, APIs likely to change
Overview
Our World in Data is building a new data catalog, with the goal of making our datasets reproducible and transparent to the general public. That project is our etl, which going forward will contain the recipes for all the datasets we republish.
This library allows you to query our data catalog programmatically, and get back data in the form of Pandas data frames, perfect for data pipelines or Jupyter notebook explorations.
graph TB
etl -->|reads| snapshot[upstream datasets]
etl -->|generates| s3[data catalog]
catalog[owid-catalog-py] -->|queries| s3
We would love feedback on how we can make this library and the overall data catalog better. Feel free to send us an email at info@ourworldindata.org, or start a discussion on GitHub.
Quickstart
Install with pip install owid-catalog. Then you can get data in two different ways.
Charts catalog
This API attempts to give you exactly the data you see in a chart on our site.
from owid.catalog import charts
# get the data for one chart by URL
df = charts.get_data('https://ourworldindata.org/grapher/life-expectancy')
Notice that the last part of the URL is the chart's slug, its identifier, in this case life-expectancy. Using the slug alone also works.
df = charts.get_data('life-expectancy')
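To make the relationship between a chart URL and its slug concrete, here is a minimal sketch using only the standard library. The helper name chart_slug is hypothetical and is not part of owid-catalog; it only illustrates that the slug is the last path segment of the URL.

```python
from urllib.parse import urlparse


def chart_slug(url_or_slug: str) -> str:
    """Return the slug for a grapher URL, or the input unchanged if it is already a slug.

    Hypothetical helper for illustration; owid-catalog does this for you.
    """
    parsed = urlparse(url_or_slug)
    if not parsed.scheme:
        # No scheme such as "https": treat the input as a bare slug.
        return url_or_slug
    # Take the last non-empty path segment; the query string is ignored.
    return parsed.path.rstrip("/").rsplit("/", 1)[-1]


print(chart_slug("https://ourworldindata.org/grapher/life-expectancy"))
print(chart_slug("life-expectancy"))
```

Both calls print the same slug, which is why either form can be passed to charts.get_data.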
Data science API
We also curate much more data than is available on our site. To access it in an efficient binary (Feather) format, use our data science API.
This API is designed for use in Jupyter notebooks.
from owid import catalog
# look for Covid-19 data, return a data frame of matches
catalog.find('covid')
# load Covid-19 data from the Our World in Data namespace as a data frame
df = catalog.find('covid', namespace='owid').load()
# search is case-insensitive and supports regex by default
catalog.find(table='gdp.*capita')
# use fuzzy search for typo-tolerant matching (sorted by relevance)
catalog.find(table='forest area', fuzzy=True)
catalog.find(dataset='wrld bank', fuzzy=True, threshold=60)
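The two search modes above can be illustrated offline with the standard library. This is a rough sketch of the ideas, not the library's implementation: the table names are invented, difflib stands in for rapidfuzz, and the default threshold of 70 is an arbitrary choice for the example.

```python
import re
from difflib import SequenceMatcher

# Invented table names, used only to demonstrate the matching behaviour.
TABLE_NAMES = [
    "GDP_per_capita",
    "gdp_total",
    "maddison_gdp_per_capita",
    "forest_area",
    "population",
]


def regex_find(pattern, names):
    """Keep names where the regex matches anywhere, ignoring case."""
    rx = re.compile(pattern, flags=re.IGNORECASE)
    return [name for name in names if rx.search(name)]


def _normalize(text):
    # Lowercase and treat underscores as spaces so "forest area" ~ "forest_area".
    return text.lower().replace("_", " ")


def fuzzy_score(query, name):
    """Similarity on a 0-100 scale (difflib ratio, not rapidfuzz's scorer)."""
    return 100 * SequenceMatcher(None, _normalize(query), _normalize(name)).ratio()


def fuzzy_find(query, names, threshold=70):
    """Return names scoring at least `threshold`, best match first."""
    scored = [(fuzzy_score(query, name), name) for name in names]
    kept = [pair for pair in scored if pair[0] >= threshold]
    return [name for _, name in sorted(kept, key=lambda pair: pair[0], reverse=True)]


print(regex_find(r"gdp.*capita", TABLE_NAMES))
print(fuzzy_find("forest area", TABLE_NAMES))
print(fuzzy_find("wrld bank", ["world_bank", "ihme_gbd", "faostat"], threshold=60))
```

Lowering the threshold admits looser matches, while raising it towards 100 approaches exact matching.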
There may be multiple versions of the same dataset in a catalog, and each will have a unique path. To easily load the same dataset again, you should record its path and load it this way:
from owid import catalog
path = 'garden/ihme_gbd/2023-05-15/gbd_mental_health_prevalence_rate/gbd_mental_health_prevalence_rate'
rc = catalog.RemoteCatalog()
df = rc[path]
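When recording paths, it can help to keep their parts separate. The sketch below assumes a path has five segments in the order channel/namespace/version/dataset/table, which matches the example above but is an assumption here rather than a documented guarantee. CatalogPath and latest_path are hypothetical names, not part of owid-catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogPath:
    """Parts of a catalog path (assumed five-segment layout)."""

    channel: str
    namespace: str
    version: str
    dataset: str
    table: str

    @classmethod
    def parse(cls, path: str) -> "CatalogPath":
        parts = path.strip("/").split("/")
        if len(parts) != 5:
            raise ValueError(f"expected 5 path segments, got {len(parts)}: {path!r}")
        return cls(*parts)


def latest_path(paths):
    """Return the path with the greatest version string.

    Assumes versions are ISO dates (YYYY-MM-DD), which sort correctly as text.
    """
    return max(paths, key=lambda p: CatalogPath.parse(p).version)


recorded = "garden/ihme_gbd/2023-05-15/gbd_mental_health_prevalence_rate/gbd_mental_health_prevalence_rate"
# Hypothetical older version of the same table, invented for this example.
older = "garden/ihme_gbd/2020-01-01/gbd_mental_health_prevalence_rate/gbd_mental_health_prevalence_rate"

print(CatalogPath.parse(recorded))
print(latest_path([older, recorded]))
```

Pinning the full path, version included, is what keeps a notebook reproducible when newer versions of the dataset are published.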
Development
You need Python 3.10+, uv and make installed. Clone the repo, then you can simply run:
# run all unit tests and CI checks
make test
# watch for changes, then run all checks
make watch
Changelog
- v0.4.4
  - Enhanced find() with better search capabilities:
    - Case-insensitive search by default (use case=True for case-sensitive)
    - Regex support enabled by default for table and dataset parameters
    - New fuzzy search with fuzzy=True - typo-tolerant matching sorted by relevance
    - Configurable fuzzy threshold (0-100) to control match strictness
  - New dependency: rapidfuzz for fuzzy string matching
- v0.4.3
  - Fixed minor bugs
- v0.4.0
  - Highlights
    - Support for Python 3.10-3.13 (was 3.11-3.13)
    - Drop support for Python 3.9 (breaking change)
  - Others
    - Deprecate Walden.
    - Dependencies: Change rdata for pyreadr.
    - Support: indicator dimensions.
    - Support: MDIMs.
    - Switched from Poetry to UV package manager.
    - New decorator @keep_metadata to propagate metadata in pandas functions.
  - Fixes: Table.apply, groupby.apply, metadata propagation, type hinting, etc.
- v0.3.11
  - Add support for Python 3.12 in pyproject.toml
- v0.3.10
  - Add experimental chart data API in owid.catalog.charts
- v0.3.9
  - Switch from isort & black & flake8 to ruff
- v0.3.8
  - Pin dataclasses-json==0.5.8 to fix error with python3.9
- v0.3.7
  - Fix bugs.
  - Improve metadata propagation.
  - Improve metadata YAML file handling, to have common definitions.
  - Remove DatasetMeta.origins.
- v0.3.6
  - Fixed tons of bugs
  - processing.py module with pandas-like functions that propagate metadata
  - Support for Dynamic YAML files
  - Support for R2 alongside S3
- v0.3.5
  - Remove catalog.frames; use owid-repack package instead
  - Relax dependency constraints
  - Add optional channel argument to DatasetMeta
  - Stop supporting metadata in Parquet format, load JSON sidecar instead
  - Fix errors when creating new Table columns
- v0.3.4
  - Bump pyarrow dependency to enable Python 3.11 support
- v0.3.3
  - Add more arguments to Table.__init__ that are often used in ETL
  - Add Dataset.update_metadata function for updating metadata from YAML file
  - Python 3.11 support via update of pyarrow dependency
- v0.3.2
  - Fix a bug in Catalog.__getitem__()
  - Replace mypy type checker by pyright
- v0.3.1
  - Sort imports with isort
  - Change black line length to 120
  - Add grapher channel
  - Support path-based indexing into catalogs
- v0.3.0
  - Update OWID_CATALOG_VERSION to 3
  - Support multiple formats per table
  - Support reading and writing parquet files with embedded metadata
  - Optional repack argument when adding tables to dataset
  - Underscore |
  - Get version field from DatasetMeta init
  - Resolve collisions of underscore_table function
  - Convert version to str and load json dimensions
- v0.2.9
  - Allow multiple channels in catalog.find function
- v0.2.8
  - Update OWID_CATALOG_VERSION to 2
- v0.2.7
  - Split datasets into channels (garden, meadow, open_numbers, ...) and make garden default one
  - Add .find_latest method to Catalog
- v0.2.6
  - Add flag is_public for public/private datasets
  - Enforce snake_case for table, dataset and variable short names
  - Add fields published_by and published_at to Source
  - Added a list of supported and unsupported operations on columns
  - Updated pyarrow
- v0.2.5
  - Fix ability to load remote CSV tables
- v0.2.4
  - Update the default catalog URL to use a CDN
- v0.2.3
  - Fix methods for finding and loading data from a LocalCatalog
- v0.2.2
  - Repack frames to compact dtypes on Table.to_feather()
- v0.2.1
  - Fix key typo used in version check
- v0.2.0
  - Copy dataset metadata into tables, to make tables more traceable
  - Add API versioning, and a requirement to update if your version of this library is too old
- v0.1.1
  - Add support for Python 3.8
- v0.1.0
  - Initial release, including searching and fetching data from a remote catalog
File details
Details for the file owid_catalog-0.4.5.tar.gz.
File metadata
- Download URL: owid_catalog-0.4.5.tar.gz
- Upload date:
- Size: 217.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d1ca1227e81465be3ae46e5acbaad960dc9b79859c8f23f1b126d6c5015a5662 |
| MD5 | f77519596c38dea0a47a4c6e1f0a1e00 |
| BLAKE2b-256 | 1fbfac4ee07f8b5835f2189e8e731f4170835289ebc8512c1438d265a930a724 |
File details
Details for the file owid_catalog-0.4.5-py3-none-any.whl.
File metadata
- Download URL: owid_catalog-0.4.5-py3-none-any.whl
- Upload date:
- Size: 86.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a4f781f40aff532663939b10abc5bda3995817e1bf7c8f08444285791461b4fb |
| MD5 | 73eae4f4e0bd6b5eafad163cada89e4c |
| BLAKE2b-256 | dec61c59d80fab044f7fa6784dd93a7f4c49f9079fb5db98bb7a9c71d8c5660c |
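The SHA256 digests listed above can be checked against a downloaded file. The following is a minimal sketch using hashlib; the function names are hypothetical, and it assumes the archive has already been downloaded into the current directory under the file name shown in the file details.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's SHA256 digest equals the expected value."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Assumed local file name; the digest is the SHA256 value listed for the sdist.
    sdist = Path("owid_catalog-0.4.5.tar.gz")
    expected = "d1ca1227e81465be3ae46e5acbaad960dc9b79859c8f23f1b126d6c5015a5662"
    if sdist.exists():
        print("match" if verify(sdist, expected) else "MISMATCH")
    else:
        print(f"{sdist} not found; download it first")
```

A mismatch means the file differs from the one that was uploaded and should not be installed.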