A catalog of open data related to the US energy system.


This repository houses a data catalog distributing open energy system data liberated by Catalyst Cooperative as part of our Public Utility Data Liberation Project (PUDL). It uses the Intake library developed by Anaconda to provide a uniform interface to versioned data releases hosted on publicly accessible cloud resources.

Catalog Contents

Currently available datasets

  • Hourly Emissions from the EPA CEMS (Apache Parquet)

Future datasets

Ongoing Development

Development is currently being organized under these epics in the main PUDL repo:

See the issues in this repository for more detailed tasks.

Planned data distribution system

We’re in the process of implementing automated nightly builds of all of our data products for each development branch with new commits in the main PUDL repository. This will allow us to do exhaustive integration testing and data validation on a daily basis. If all of the tests and data validation pass, then a new version of the data products (SQLite databases and Parquet files) will be produced, and placed into cloud storage.

These outputs will be made available via a data catalog on a corresponding branch in this pudl-catalog repository. In general, only the catalogs and data resources corresponding to the HEAD of development and feature branches will be available. Releases that are tagged on the main branch will be retained long term.

The idea is that for any released version of PUDL, you should also be able to install a corresponding data catalog, and know that the software and the data are compatible. You can also install just the data catalog with minimal dependencies, and not need to worry about the PUDL software that produced it at all, if you simply want to access the DBs or Parquet files directly.

In development, this arrangement will mean that every morning you should have access to a fully processed set of data products that reflect the branch of code that you’re working on, rather than the data and code getting progressively further out of sync as you do development, until you take the time to re-run the full ETL locally yourself.

Example Usage

See the notebook included in this repository for more details.

Import Intake Catalogs

The pudl_catalog registers as an available data source within Intake when it’s installed, so you can grab it from the top-level Intake catalog. To see which data sources are available within the catalog, convert it to a list (yes, this is weird).

import intake
import pandas as pd
from pudl_catalog.helpers import year_state_filter

pudl_cat = intake.cat.pudl_cat
list(pudl_cat)
['hourly_emissions_epacems', 'hourly_emissions_epacems_partitioned']

Inspect the catalog data source

Printing the data source will show you the YAML that defines the source, but with all the Jinja template fields interpolated and filled in:

pudl_cat.hourly_emissions_epacems
hourly_emissions_epacems:
  args:
    engine: pyarrow
    storage_options:
      simplecache:
        cache_storage: /home/zane/.cache/intake
    urlpath: simplecache::gs://intake.catalyst.coop/dev/hourly_emissions_epacems.parquet
  description: Hourly pollution emissions and plant operational data reported via
    Continuous Emissions Monitoring Systems (CEMS) as required by 40 CFR Part 75.
    Includes CO2, NOx, and SO2, as well as the heat content of fuel consumed and gross
    power output. Hourly values reported by US EIA ORISPL code and emissions unit
    (smokestack) ID.
  driver: intake_parquet.source.ParquetSource
  metadata:
    catalog_dir: /home/zane/code/catalyst/pudl-catalog/src/pudl_catalog/
    license:
      name: CC-BY-4.0
      path: https://creativecommons.org/licenses/by/4.0
      title: Creative Commons Attribution 4.0
    path: https://ampd.epa.gov/ampd
    provider: US Environmental Protection Agency Air Markets Program
    title: Continuous Emissions Monitoring System (CEMS) Hourly Data
    type: application/parquet

Data source specific metadata

The source.discover() method will show you some internal details of the data source, including what columns are available and their data types:

pudl_cat.hourly_emissions_epacems.discover()
{'dtype': {'plant_id_eia': 'int32',
  'unitid': 'object',
  'operating_datetime_utc': 'datetime64[ns, UTC]',
  'year': 'int32',
  'state': 'int64',
  'facility_id': 'int32',
  'unit_id_epa': 'object',
  'operating_time_hours': 'float32',
  'gross_load_mw': 'float32',
  'heat_content_mmbtu': 'float32',
  'steam_load_1000_lbs': 'float32',
  'so2_mass_lbs': 'float32',
  'so2_mass_measurement_code': 'int64',
  'nox_rate_lbs_mmbtu': 'float32',
  'nox_rate_measurement_code': 'int64',
  'nox_mass_lbs': 'float32',
  'nox_mass_measurement_code': 'int64',
  'co2_mass_tons': 'float32',
  'co2_mass_measurement_code': 'int64'},
 'shape': (None, 19),
 'npartitions': 1,
 'metadata': {'title': 'Continuous Emissions Monitoring System (CEMS) Hourly Data',
  'type': 'application/parquet',
  'provider': 'US Environmental Protection Agency Air Markets Program',
  'path': 'https://ampd.epa.gov/ampd',
  'license': {'name': 'CC-BY-4.0',
   'title': 'Creative Commons Attribution 4.0',
   'path': 'https://creativecommons.org/licenses/by/4.0'},
  'catalog_dir': '/home/zane/code/catalyst/pudl-catalog/src/pudl_catalog/'}}
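Because discover() returns a plain dictionary, it can be inspected with ordinary Python before any data is loaded. As a small sketch, the snippet below groups columns by their reported data type. The helper name columns_by_dtype is hypothetical, and the dictionary is an abbreviated copy of the output above rather than the result of a live call:

```python
from collections import defaultdict

# Abbreviated copy of the discover() output shown above (not a live call).
info = {
    "dtype": {
        "plant_id_eia": "int32",
        "unitid": "object",
        "operating_datetime_utc": "datetime64[ns, UTC]",
        "gross_load_mw": "float32",
        "co2_mass_tons": "float32",
    },
    "shape": (None, 19),
}


def columns_by_dtype(discovered):
    """Group column names by reported dtype (hypothetical helper)."""
    groups = defaultdict(list)
    for column, dtype in discovered["dtype"].items():
        groups[dtype].append(column)
    return dict(groups)


grouped = columns_by_dtype(info)
print(grouped["float32"])
```

A grouping like this can help decide which columns to request when only a few are needed.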

Read some data from the catalog

To read data from the source you call it with some arguments. Here we’re supplying filters (in “disjunctive normal form”) that select only a subset of the available years and states. This limits the set of Parquet files that need to be scanned to find the requested data (since the files are partitioned by year and state) and also ensures that you don’t get back a 100GB dataframe that crashes your laptop. These arguments are passed through to dask.dataframe.read_parquet() since Dask dataframes are the default container for Parquet data. Given those arguments, you convert the source to a Dask dataframe and then use .compute() on that dataframe to actually read the data and return a pandas dataframe:

filters = year_state_filter(
    years=[2019, 2020],
    states=["ID", "CO", "TX"],
)
epacems_df = (
    pudl_cat.hourly_emissions_epacems(filters=filters)
    .to_dask().compute()
)
epacems_df[[
    "plant_id_eia",
    "unitid",
    "operating_datetime_utc",
    "year",
    "state",
    "facility_id",
    "unit_id_epa",
    "operating_time_hours",
    "gross_load_mw",
    "heat_content_mmbtu",
    "co2_mass_tons",
]].head()

plant_id_eia  unitid  operating_datetime_utc     year  state  facility_id  unit_id_epa  operating_time_hours  gross_load_mw  heat_content_mmbtu  co2_mass_tons
469           4       2019-01-01 07:00:00+00:00  2019  CO     79           298          1.0                   203.0          2146.2              127.2
469           4       2019-01-01 08:00:00+00:00  2019  CO     79           298          1.0                   203.0          2152.7              127.6
469           4       2019-01-01 09:00:00+00:00  2019  CO     79           298          1.0                   204.0          2142.2              127.0
469           4       2019-01-01 10:00:00+00:00  2019  CO     79           298          1.0                   204.0          2129.2              126.2
469           4       2019-01-01 11:00:00+00:00  2019  CO     79           298          1.0                   204.0          2160.6              128.1
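The “disjunctive normal form” used for the filters is the layout that pyarrow and Dask accept for Parquet reads: an outer list of alternatives combined with OR, where each alternative is an inner list of (column, operator, value) tuples combined with AND. The sketch below builds such a filter for every year and state combination and checks individual rows against it. Both function names are hypothetical, and this illustrates the format only; it is not necessarily how year_state_filter is implemented:

```python
import itertools
import operator

# Comparison operators used in this sketch (a subset of what Parquet readers support).
OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def dnf_year_state(years, states):
    """Build an OR-of-ANDs filter with one clause per (year, state) pair."""
    return [
        [("year", "=", year), ("state", "=", state)]
        for year, state in itertools.product(years, states)
    ]


def row_matches(row, filters):
    """Return True if the row satisfies at least one AND-clause."""
    return any(
        all(OPS[op](row[col], value) for col, op, value in clause)
        for clause in filters
    )


filters = dnf_year_state([2019, 2020], ["ID", "CO", "TX"])
print(len(filters))  # one clause per year/state pair
print(row_matches({"year": 2019, "state": "CO"}, filters))
print(row_matches({"year": 2018, "state": "CO"}, filters))
```

Because the files are partitioned by year and state, each AND-clause corresponds to one partition, which is why a filter of this shape lets the reader skip files entirely.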

Benefits of Intake Catalogs

The Intake docs list a bunch of potential use cases. Here are some features that we’re excited to take advantage of:

Rich Metadata

The Intake catalog provides a human and machine readable container for metadata describing the underlying data, so that you can understand what the data contains before downloading all of it. We intend to automate the production of the catalog using PUDL’s metadata models so it’s always up to date.

Local data caching

Rather than downloading the same data repeatedly, in many cases it’s possible to transparently cache the data locally for faster access later. This is especially useful when you’ve got plenty of disk space and a slower network connection, or typically only work with a small subset of a much larger dataset.
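In the catalog entry shown earlier, caching is requested through the simplecache:: prefix on urlpath and the cache_storage option. The standard-library sketch below illustrates only the general idea (download once, then reuse the local copy). It is not how fsspec implements simplecache, and the function names and the example URL are hypothetical:

```python
import hashlib
import tempfile
from pathlib import Path


def cached_fetch(url, fetch, cache_dir):
    """Return a local path for url, calling fetch(url) only on a cache miss.

    fetch is any callable returning bytes; it stands in for a real download.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Name the cached file after a hash of the URL so any URL maps to a safe filename.
    local_path = cache_dir / hashlib.sha256(url.encode()).hexdigest()
    if not local_path.exists():
        local_path.write_bytes(fetch(url))
    return local_path


calls = []


def fake_download(url):
    """Stand-in for a slow network read; records each call."""
    calls.append(url)
    return b"pretend parquet bytes"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("gs://example-bucket/data.parquet", fake_download, tmp)
    second = cached_fetch("gs://example-bucket/data.parquet", fake_download, tmp)
    same_path = first == second
    print(same_path, len(calls))
```

The second request is served from disk, so the stand-in download runs only once.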

Manage data like software

Intake data catalogs can be packaged and versioned just like Python software packages, allowing us to manage dependencies between different versions of software and the data it operates on to ensure they are compatible. It also allows you to have multiple versions of the same data installed locally, and to switch between them seamlessly when you change software environments. This is especially useful when doing a mix of development and analysis, where we need to work with the newest data (which may not yet be fully integrated) as well as previously released data and software that’s more stable.
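Since the catalog is an ordinary Python distribution, the version installed in the current environment can be checked with the standard library. This is a minimal sketch, assuming the distribution name catalystcoop.pudl_catalog used on PyPI; the helper installed_version is hypothetical:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Assumed distribution name; prints None when the catalog is not installed.
catalog_version = installed_version("catalystcoop.pudl_catalog")
print(catalog_version)
```

Running this in each environment is a quick way to confirm which data release that environment is pinned to.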

A Uniform API

All the data sources of a given type (Parquet, SQL) would have the same interface, reducing the number of things a user needs to remember to access the data.

Decoupling Data Location and Format

Having users access the data through the catalog rather than directly means that the underlying storage location and file formats can change over time as needed without requiring the user to change how they are accessing the data.

Additional Intake Resources

Licensing

Our code, data, and other work are permissively licensed for use by anybody, for any purpose, so long as you give us credit for the work we’ve done.

  • For software we use the MIT License.

  • For data, documentation, and other non-software works we use the CC-BY-4.0 license.

Contact Us

  • For general support, questions, or other conversations around the project that might be of interest to others, check out the GitHub Discussions.

  • If you’d like to get occasional updates about our projects, sign up for our email list.

  • Want to schedule a time to chat with us one-on-one? Join us for Office Hours.

  • Follow us on Twitter: @CatalystCoop

  • More info on our website: https://catalyst.coop

  • For private communication about the project or to hire us to provide customized data extraction and analysis, you can email the maintainers: pudl@catalyst.coop

About Catalyst Cooperative

Catalyst Cooperative is a small group of data wranglers and policy wonks organized as a worker-owned cooperative consultancy. Our goal is a more just, livable, and sustainable world. We integrate public data and perform custom analyses to inform public policy (Hire us!). Our focus is primarily on mitigating climate change and improving electric utility regulation in the United States.

Funding

This work is supported by a generous grant from the Alfred P. Sloan Foundation and their Energy & Environment Program.
