Extract CO2 emissions data from PDF sustainability reports using LLMs

climatextract

📖 Documentation: gist-sustainability.github.io/climatextract

climatextract is a retrieval-augmented generation (RAG) pipeline that surfaces CO₂ emissions data from corporate sustainability reports. It embeds PDF pages, ranks relevant context, and prompts a large language model to extract Scope 1-3 emissions into structured tables for downstream analysis.
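
The ranking step can be pictured with a minimal sketch, assuming the pages and the query have already been embedded as plain lists of floats. The names `cosine_similarity` and `rank_pages` and the toy vectors are illustrative assumptions, not part of the climatextract API, and the package's actual retrieval logic may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_pages(query_vec, page_vecs, top_k=3):
    """Return the top_k (page_number, score) pairs, most similar first."""
    scored = [
        (page_no, cosine_similarity(query_vec, vec))
        for page_no, vec in page_vecs.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy two-dimensional "embeddings" keyed by page number (made-up values).
page_vecs = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [0.7, 0.7]}
query_vec = [1.0, 0.1]

top_pages = rank_pages(query_vec, page_vecs, top_k=2)
print([page_no for page_no, _ in top_pages])
```

In a full pipeline, the text of the highest-ranked pages would then be passed to the language model as context for the extraction prompt.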

Background

This project began as the team's submission to the 2024 ClimateNLP workshop at ACL. Built by the LMU SODA Lab in collaboration with the Data Service Centre of Deutsche Bundesbank, climatextract combines research on ESG reporting and Intelligent Document Processing to automate what was previously a tedious manual annotation process.

This repository is organized as follows:

  • climatextract: package source code
  • data: source data to be analyzed
  • docs: package documentation (built with mkdocs)
  • tests: acceptance tests

Setup

Python environment

Run the code in a virtual environment with Python 3.11 or later.

First, check out the code, then create a virtual environment and install all dependencies:

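# Enter the checked-out repository
cd climatextract
# Create and activate a virtual environment
python -m venv co2_info_extraction
source co2_info_extraction/bin/activate
# Install the package and its dependencies in editable mode
pip install -e .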

See the Installation guide for additional steps and alternative deployment options.

Usage

Place your PDF sustainability reports in the data/pdfs/ directory, then run the extraction pipeline:

from climatextract import extract

result_path = extract("./data/pdfs/company_2023_report.pdf")

Results are saved as CSV files in output/<run-id>/. See the Quickstart for more examples.
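
For downstream analysis, the CSV output can be read with the standard library alone. The sketch below is a hedged example: `load_result_rows` is a hypothetical helper, not part of climatextract, and it makes no assumption about column names. It accepts either a single CSV file or a directory such as output/<run-id>/.

```python
import csv
from pathlib import Path


def load_result_rows(path):
    """Read rows from one CSV file, or from every CSV file in a directory."""
    path = Path(path)
    csv_files = [path] if path.is_file() else sorted(path.glob("*.csv"))
    rows = []
    for csv_file in csv_files:
        with csv_file.open(newline="", encoding="utf-8") as handle:
            # Each row becomes a dict keyed by the CSV header names.
            rows.extend(dict(row) for row in csv.DictReader(handle))
    return rows
```

Calling `load_result_rows(result_path)` with the value returned by `extract` would then give a list of dictionaries, one per extracted row, assuming that value points at the CSV output.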

Configuration

Extraction behavior is controlled via a climatextract.toml file in your working directory. It lets you configure the LLM model, embedding model, prompt type, year range, semantic search parameters, and more. See the Configuration guide for all available options.

Running tests

python -m pytest

See tests/README.md for details on the acceptance test suite.

Documentation

The full documentation is hosted at gist-sustainability.github.io/climatextract and covers usage, configuration, architecture, and the API reference:

Section            Description
Installation       Detailed setup instructions
Quickstart         First extraction walkthrough
Configuration      All TOML configuration options
Custom Providers   Plug in a non-Azure LLM or embedding backend
Architecture       Pipeline design and components
Prompts            How extraction prompts work
Evaluation         Measuring extraction quality
API Reference      Public API functions

To build and serve the docs locally:

pip install -e '.[docs]'
mkdocs serve
