
Core data types used by OWID for managing data.

owid-catalog

A Pythonic library for working with OWID data.

The owid-catalog library is the foundation of Our World in Data's data management system. It provides:

  1. Data APIs: Access OWID's published data through unified client interfaces
  2. Data Structures: Enhanced pandas DataFrames with rich metadata support

Installation

pip install owid-catalog

Quick Examples

Accessing OWID Data

from owid.catalog import fetch, search

# Search for charts (default)
charts = search("population")
tb = charts[0].fetch()

# Fetch data from OWID Chart at ourworldindata.org/grapher/life-expectancy
tb = fetch("life-expectancy")

# Search for tables
tables = search("population", kind="table", namespace="un")
tb = tables[0].fetch()

# Search indicators (using semantic search)
search("renewable energy", kind="indicator")

Working with Data Structures

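import pandas as pd

from owid.catalog import Table
from owid.catalog import processing as pr

# Tables are pandas DataFrames with metadata
# (df and tb2 hold small placeholder values, for illustration only)
df = pd.DataFrame({"country": ["A", "B"], "year": [1990, 2010], "population": [1, 2]})
tb = Table(df, metadata={"short_name": "population"})

# Metadata propagates through operations
tb_filtered = tb[tb["year"] > 2000]  # Keeps metadata
tb1 = tb[["country", "population"]]
tb2 = Table(pd.DataFrame({"country": ["A", "B"], "gdp": [1, 2]}))
tb_merged = pr.merge(tb1, tb2, on="country")  # Merges metadata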

Documentation

For detailed documentation, see:

Architecture

graph TB
etl -->|reads| snapshot[upstream datasets]
etl -->|generates| s3[data catalog]
catalog[owid-catalog] -->|queries| s3

This library is part of OWID's ETL project, which contains recipes for all datasets we publish.

Development

You need Python 3.10+, uv and make installed. Clone the repo, then you can simply run:

# run all unit tests and CI checks
make test

# watch for changes, then run all checks
make watch

Changelog

v1.0.1

  • ResponseSet ergonomics
    • Remove deprecated ResponseSet.results property (use .items instead)
    • Add .to_dict() method for serializing results to plain dicts (useful for AI/LLM context windows)
    • Add all_fields parameter to .to_frame() to temporarily override display mode without mutating instance state

v1.0.0

  • New unified Client API
    • owid.catalog.Client as single entry point with ChartsAPI, IndicatorsAPI, TablesAPI
    • Quick access via search() and fetch() convenience functions
    • Rich result types: ChartResult, IndicatorResult, TableResult with ResponseSet container
  • Charts API
    • Fetch chart data by slug, URL, or slug with query params
    • Parse chart slugs from grapher/explorer URLs via parse_chart_slug()
    • Explorer best-effort fetching with graceful error handling
    • set_ui_advanced() / set_ui_basic() for display configuration
  • Tables API
    • Search catalog by table, namespace, version, dataset, and channel
    • Fetch tables directly by catalog path
    • Embedded catalog index with local caching
  • Indicators API
    • Semantic search via search.owid.io vector embeddings
    • Sort by relevance (similarity + popularity blend) or similarity only
    • fetch() for single-column indicator or fetch_table() for the full table
  • Search & discovery
    • Fuzzy, exact, contains, and regex matching modes
    • .latest() filtering to keep only newest versions
    • Popularity scores (0.0-1.0) from analytics views, results sorted by popularity
    • refresh_index parameter to force catalog index reload
  • Data structures integration
    • All fetch() methods return owid.catalog.Table with full metadata
    • CatalogPath helper for parsing catalog paths
    • Lazy loading with load_data=False for deferred data access
  • Library reorganization
    • Restructured into owid.catalog.core (data structures) and owid.catalog.api (remote access)
    • catalog.find() deprecated in favor of Client().tables.search() (backwards compat maintained)
    • Legacy code moved to owid.catalog.api.legacy
    • New dependencies: pydantic v2.0+
  • Private data support
    • Private datasets served from separate R2 bucket
    • API can fetch private data from private bucket
  • Performance
    • Vectorized operations replacing iterrows() in TablesAPI
    • Embedded catalog index loading (removed ETLCatalog dependency)
    • Modularized search into helper methods
  • Other
    • Thumbnail display in ResponseSet for chart results
    • JSON output format support
    • Comprehensive exception handling: ChartNotFoundError, LicenseError
    • API URLs immutable with Pydantic Field(frozen=True)
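
The Tables API and search bullets above mention filtering by channel, namespace, version, dataset and table, plus `.latest()` filtering. The following is a minimal sketch of that idea in plain Python, not the library's `CatalogPath` or `.latest()` implementation. It assumes catalog paths follow a `channel/namespace/version/dataset/table` layout and that versions are ISO-style date strings that sort correctly as text. The names `ParsedPath`, `parse_path` and `keep_latest`, and the example paths, are hypothetical.

```python
from typing import NamedTuple


class ParsedPath(NamedTuple):
    channel: str
    namespace: str
    version: str
    dataset: str
    table: str


def parse_path(path: str) -> ParsedPath:
    """Split 'channel/namespace/version/dataset/table' into its five parts."""
    parts = path.strip("/").split("/")
    if len(parts) != 5:
        raise ValueError(f"expected 5 path segments, got {len(parts)}: {path!r}")
    return ParsedPath(*parts)


def keep_latest(paths: list[str]) -> list[str]:
    """Keep only the newest version of each (channel, namespace, dataset, table)."""
    newest: dict[tuple[str, str, str, str], ParsedPath] = {}
    for path in paths:
        p = parse_path(path)
        key = (p.channel, p.namespace, p.dataset, p.table)
        # Assumption: versions such as '2024-07-12' compare correctly as strings.
        if key not in newest or p.version > newest[key].version:
            newest[key] = p
    return ["/".join(p) for p in newest.values()]


# Made-up paths, for illustration only.
example_paths = [
    "garden/un/2022-07-11/un_wpp/population",
    "garden/un/2024-07-12/un_wpp/population",
    "garden/demo/2023-01-01/example/table_a",
]
latest_paths = keep_latest(example_paths)
```

Here `latest_paths` keeps the 2024 entry for the first table and the only entry for the second. In real code, prefer the library's own `.latest()` and `CatalogPath`.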

v0.4.5

  • Allow both table and dataset parameters in find() (they can now be used together)
  • Migrate from pyright to ty type checker for improved type checking

v0.4.4

  • Enhanced find() with better search capabilities:
    • Case-insensitive search by default (use case=True for case-sensitive)
    • Regex support enabled by default for table and dataset parameters
    • New fuzzy search with fuzzy=True - typo-tolerant matching sorted by relevance
    • Configurable fuzzy threshold (0-100) to control match strictness
  • New dependency: rapidfuzz for fuzzy string matching
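
To illustrate how a 0-100 fuzzy threshold controls match strictness, here is a rough sketch using the standard library's `difflib` as a stand-in. The library itself depends on rapidfuzz, whose scores will differ from these. The function `fuzzy_find` and its parameters are hypothetical and are not the signature of `find()`.

```python
from difflib import SequenceMatcher


def fuzzy_find(
    query: str,
    names: list[str],
    threshold: float = 70.0,
    case: bool = False,
) -> list[tuple[str, float]]:
    """Return (name, score) pairs scoring at least `threshold`, best match first."""
    matches = []
    for name in names:
        # Case-insensitive by default, mirroring the behaviour described above.
        a, b = (query, name) if case else (query.lower(), name.lower())
        # Scale difflib's 0.0-1.0 similarity ratio to a 0-100 score.
        score = 100.0 * SequenceMatcher(None, a, b).ratio()
        if score >= threshold:
            matches.append((name, score))
    return sorted(matches, key=lambda m: m[1], reverse=True)


names = ["population", "Population_density", "gdp_per_capita"]
strict_hits = fuzzy_find("populaton", names, threshold=70.0)
loose_hits = fuzzy_find("populaton", names, threshold=60.0)
```

With the misspelled query `"populaton"`, the stricter threshold keeps only `population`, while the looser one also admits `Population_density`.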

v0.4.3

  • Fixed minor bugs

v0.4.0

  • Highlights
    • Support for Python 3.10-3.13 (was 3.11-3.13)
    • Drop support for Python 3.9 (breaking change)
  • Others
    • Deprecate Walden.
    • Dependencies: Change rdata for pyreadr.
    • Support: indicator dimensions.
    • Support: MDIMs.
    • Switched from Poetry to UV package manager.
    • New decorator @keep_metadata to propagate metadata in pandas functions.
  • Fixes: Table.apply, groupby.apply, metadata propagation, type hinting, etc.
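
The decorator pattern behind metadata propagation can be sketched without pandas. This is an illustration of the general idea only, not the library's `@keep_metadata`: `Tagged` is a hypothetical stand-in for a table, and `keep_metadata_sketch` is a hypothetical name.

```python
import functools
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Tagged:
    """Hypothetical stand-in for a Table: some values plus a metadata dict."""

    values: list[float]
    metadata: dict[str, Any] = field(default_factory=dict)


def keep_metadata_sketch(
    func: Callable[[list[float]], list[float]],
) -> Callable[[Tagged], Tagged]:
    """Wrap a function on plain values so it accepts and returns Tagged objects."""

    @functools.wraps(func)
    def wrapper(tagged: Tagged) -> Tagged:
        result = func(tagged.values)
        # Copy the metadata so the result does not share state with the input.
        return Tagged(result, dict(tagged.metadata))

    return wrapper


@keep_metadata_sketch
def double(values: list[float]) -> list[float]:
    return [2 * v for v in values]


tb = Tagged([1.0, 2.5], {"short_name": "population"})
doubled = double(tb)
```

The wrapped function never sees the metadata. The decorator carries it from input to output, which is the property a metadata-preserving wrapper around a pandas function needs.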

v0.3.11

  • Add support for Python 3.12 in pyproject.toml

v0.3.10

  • Add experimental chart data API in owid.catalog.charts

v0.3.9

  • Switch from isort, black, and flake8 to ruff

v0.3.8

  • Pin dataclasses-json==0.5.8 to fix error with python3.9

v0.3.7

  • Fix bugs.
  • Improve metadata propagation.
  • Improve metadata YAML file handling, to have common definitions.
  • Remove DatasetMeta.origins.

v0.3.6

  • Fixed tons of bugs
  • processing.py module with pandas-like functions that propagate metadata
  • Support for Dynamic YAML files
  • Support for R2 alongside S3

v0.3.5

  • Remove catalog.frames; use owid-repack package instead
  • Relax dependency constraints
  • Add optional channel argument to DatasetMeta
  • Stop supporting metadata in Parquet format, load JSON sidecar instead
  • Fix errors when creating new Table columns

v0.3.4

  • Bump pyarrow dependency to enable Python 3.11 support

v0.3.3

  • Add more arguments to Table.__init__ that are often used in ETL
  • Add Dataset.update_metadata function for updating metadata from YAML file
  • Python 3.11 support via update of pyarrow dependency

v0.3.2

  • Fix a bug in Catalog.__getitem__()
  • Replace mypy type checker by pyright

v0.3.1

  • Sort imports with isort
  • Change black line length to 120
  • Add grapher channel
  • Support path-based indexing into catalogs

v0.3.0

  • Update OWID_CATALOG_VERSION to 3
  • Support multiple formats per table
  • Support reading and writing parquet files with embedded metadata
  • Optional repack argument when adding tables to dataset
  • Underscore |
  • Get version field from DatasetMeta init
  • Resolve collisions of underscore_table function
  • Convert version to str and load json dimensions

v0.2.9

  • Allow multiple channels in catalog.find function

v0.2.8

  • Update OWID_CATALOG_VERSION to 2

v0.2.7

  • Split datasets into channels (garden, meadow, open_numbers, ...) and make garden default one
  • Add .find_latest method to Catalog

v0.2.6

  • Add flag is_public for public/private datasets
  • Enforce snake_case for table, dataset and variable short names
  • Add fields published_by and published_at to Source
  • Add a list of supported and unsupported operations on columns
  • Update pyarrow
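
As an illustration of what enforcing snake_case on short names can look like, here is a small sketch. The exact rule the library applies is not stated here, so the pattern below (lowercase letters and digits separated by single underscores, starting with a letter) is an assumption, and `is_snake_case` and `to_snake_case` are hypothetical helper names.

```python
import re

# Assumed rule: starts with a lowercase letter, then lowercase letters or
# digits, with single underscores between groups.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def is_snake_case(name: str) -> bool:
    """Check whether a short name already satisfies the assumed rule."""
    return SNAKE_CASE.fullmatch(name) is not None


def to_snake_case(name: str) -> str:
    """Lowercase the name and collapse every run of other characters into '_'."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


short_name = to_snake_case("Life Expectancy (years)")
```

Here `short_name` becomes `life_expectancy_years`, which passes `is_snake_case`, while the original string does not.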

v0.2.5

  • Fix ability to load remote CSV tables

v0.2.4

  • Update the default catalog URL to use a CDN

v0.2.3

  • Fix methods for finding and loading data from a LocalCatalog

v0.2.2

  • Repack frames to compact dtypes on Table.to_feather()

v0.2.1

  • Fix key typo used in version check

v0.2.0

  • Copy dataset metadata into tables, to make tables more traceable
  • Add API versioning, and a requirement to update if your version of this library is too old
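
An API version check of this kind can be sketched as follows. This is a hypothetical illustration, not the library's code: the constant `SUPPORTED_CATALOG_VERSION`, the `format_version` key, and the `OutdatedClientError` exception are all assumed names.

```python
# Hypothetical: the newest catalog format this client knows how to read.
SUPPORTED_CATALOG_VERSION = 3


class OutdatedClientError(RuntimeError):
    """Raised when the remote catalog uses a newer format than the client supports."""


def check_catalog_version(remote_meta: dict) -> None:
    """Raise if the remote catalog's format is newer than this client supports."""
    # 'format_version' is an assumed key name for this sketch.
    remote_version = int(remote_meta["format_version"])
    if remote_version > SUPPORTED_CATALOG_VERSION:
        raise OutdatedClientError(
            f"catalog format {remote_version} is newer than supported "
            f"format {SUPPORTED_CATALOG_VERSION}; please upgrade the library"
        )


check_catalog_version({"format_version": 3})  # same version: passes silently
```

Failing early with an explicit upgrade message is usually friendlier than letting an old client misread a newer catalog layout.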

v0.1.1

  • Add support for Python 3.8

v0.1.0

  • Initial release, including searching and fetching data from a remote catalog
