Extract CO2 emissions data from PDF sustainability reports using LLMs

These details have not been verified by PyPI

Project description

climatextract

📖 Documentation: gist-sustainability.github.io/climatextract

climatextract is a retrieval-augmented generation (RAG) pipeline that surfaces CO₂ emissions data from corporate sustainability reports. It embeds PDF pages, ranks relevant context, and prompts a large language model to extract Scope 1-3 emissions into structured tables for downstream analysis.

Background

This project began as the team's submission for the 2024 ClimateNLP workshop at ACL. Built by the LMU SODA Lab in collaboration with the Data Service Centre of Deutsche Bundesbank, climatextract combines research on ESG reporting and Intelligent Document Processing to automate what was previously a tedious manual annotation process.

This repository is organized as follows:

climatextract: package source code
data: source data to be analyzed
docs: package documentation (built with mkdocs)
tests: acceptance tests

Setup

Python environment

Run the code in a virtual environment with Python 3.11 or later.

First, check out the code, then create a virtual environment and install all dependencies:

cd climatextract
python -m venv co2_info_extraction
source co2_info_extraction/bin/activate
pip install -e .

See the Installation guide for additional steps and alternative deployment options.
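
Before creating the virtual environment, it can help to confirm that the interpreter meets the Python 3.11 minimum stated above. The following is a minimal sketch; the helper name `python_is_supported` is hypothetical and not part of climatextract.

```python
import sys

# Minimum version taken from the setup notes above.
MIN_PYTHON = (3, 11)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) pair is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if not python_is_supported():
    found = ".".join(str(part) for part in sys.version_info[:3])
    print(f"Python {found} found, but 3.11 or later is recommended.")
```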

Usage

Place your PDF sustainability reports in the data/pdfs/ directory, then run the extraction pipeline:

from climatextract import extract

result_path = extract("./data/pdfs/company_2023_report.pdf")
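
To process every report in a directory rather than a single file, the call above can be wrapped in a loop. This is a sketch only: `extract_all` is a hypothetical helper, and it assumes nothing about `extract` beyond the single-path call shown above. The extraction function is passed in as an argument so the loop can be tried with a stand-in before running the real pipeline.

```python
from pathlib import Path
from typing import Callable, Dict


def extract_all(pdf_dir: str, extract_fn: Callable[[str], object]) -> Dict[str, object]:
    """Run extract_fn on every PDF in pdf_dir and collect the return values."""
    results = {}
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        # Pass a string path, mirroring the single-file call shown above.
        results[pdf_path.name] = extract_fn(str(pdf_path))
    return results


# Usage (requires climatextract to be installed and configured):
#   from climatextract import extract
#   result_paths = extract_all("./data/pdfs", extract)
```

Each report is processed independently here, so one failing PDF would stop the loop. Wrapping the call in a `try`/`except` is a reasonable extension if partial results are acceptable.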

Results are saved as CSV files in output/<run-id>/. See the Quickstart for more examples.
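
For downstream analysis, the CSV files of a run can be read back with the standard library. The sketch below is an assumption-laden illustration: `load_run_tables` is a hypothetical helper, and it assumes comma-separated, UTF-8 encoded files with a header row. It makes no assumption about column names, which are documented in the project docs rather than here.

```python
import csv
from pathlib import Path


def load_run_tables(run_dir):
    """Map each CSV file name in run_dir to its rows (one dict per row)."""
    tables = {}
    for csv_path in sorted(Path(run_dir).glob("*.csv")):
        # newline="" is the csv module's recommended way to open files.
        with csv_path.open(newline="", encoding="utf-8") as fh:
            tables[csv_path.name] = list(csv.DictReader(fh))
    return tables


# Usage, with <run-id> replaced by the actual run directory name:
#   tables = load_run_tables("output/<run-id>")
#   for name, rows in tables.items():
#       print(name, len(rows), "rows")
```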

Configuration

Extraction behavior is controlled via a climatextract.toml file in your working directory. It lets you configure the LLM model, embedding model, prompt type, year range, semantic search parameters, and more. See the Configuration guide for all available options.

Running tests

python -m pytest

See tests/README.md for details on the acceptance test suite.

Documentation

The full documentation is hosted at gist-sustainability.github.io/climatextract and covers usage, configuration, architecture, and API reference:

Section	Description
Installation	Detailed setup instructions
Quickstart	First extraction walkthrough
Configuration	All TOML configuration options
Custom Providers	Plug in a non-Azure LLM or embedding backend
Architecture	Pipeline design and components
Prompts	How extraction prompts work
Evaluation	Measuring extraction quality
API Reference	Public API functions

To build and serve the docs locally:

pip install -e '.[docs]'
mkdocs serve

Project details

Release history

Version	Release date
0.3.2 (this version)	Jun 11, 2026
0.3.1	May 6, 2026
0.3.0	May 6, 2026
0.2.2	Jan 2, 2026
0.2.1	Dec 31, 2025
0.2.0	Dec 16, 2025
0.1.1	Dec 15, 2025

Download files

Source Distribution

climatextract-0.3.2.tar.gz (75.9 kB)

Uploaded Jun 11, 2026 Source

Built Distribution

climatextract-0.3.2-py3-none-any.whl (81.8 kB)

Uploaded Jun 11, 2026 Python 3

File details

Details for the file climatextract-0.3.2.tar.gz.

File metadata

Download URL: climatextract-0.3.2.tar.gz
Upload date: Jun 11, 2026
Size: 75.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for climatextract-0.3.2.tar.gz
Algorithm	Hash digest
SHA256	`9abb989c98f227fcbdd3add8276287926907fdff7fc4f5734094c78aa4aeccf7`
MD5	`27805930ee76b085e0c85b4d7f29c54d`
BLAKE2b-256	`cb37bb2c0aa2567bbafb186fdfc5e094519bdf0e6527465a303737eeaf719765`

File details

Details for the file climatextract-0.3.2-py3-none-any.whl.

File metadata

Download URL: climatextract-0.3.2-py3-none-any.whl
Upload date: Jun 11, 2026
Size: 81.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for climatextract-0.3.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`146a02616eb1e72ad45da79ae81648153e7450f595b4ca282b21e612b6904891`
MD5	`bd0d641a3cb808c83f484d14e56e6320`
BLAKE2b-256	`8629b9cba064e6ab83fb140f94b236a58cf7332be4a80a46a8a42d536b0f0de9`
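
A downloaded file can be checked against the SHA256 digests listed above using `hashlib`. This is a minimal sketch; the helper names `sha256_of` and `matches_sha256` are hypothetical, and the file is assumed to be in the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Compare the file's digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage, with the source distribution digest from the table above:
#   matches_sha256(
#       "climatextract-0.3.2.tar.gz",
#       "9abb989c98f227fcbdd3add8276287926907fdff7fc4f5734094c78aa4aeccf7",
#   )
```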
