Extract CO2 emissions data from PDF sustainability reports using LLMs
climatextract
📖 Documentation: gist-sustainability.github.io/climatextract
climatextract is a retrieval-augmented generation (RAG) pipeline that surfaces CO₂ emissions data from corporate sustainability reports. It embeds PDF pages, ranks relevant context, and prompts a large language model to extract Scope 1-3 emissions into structured tables for downstream analysis.
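The embed-then-rank step can be pictured with a toy sketch. The snippet below is not climatextract's implementation: it swaps the real embedding model for a bag-of-words counter and ranks pages by cosine similarity against a query. The names `embed`, `cosine`, and `rank_pages` are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_pages(query: str, pages: list[str], top_k: int = 2) -> list[int]:
    """Return the indices of the top_k pages most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(page)), i) for i, page in enumerate(pages)]
    # Highest similarity first; ties are broken by page order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [i for _, i in scored[:top_k]]


# Hypothetical page texts, not taken from any real report.
pages = [
    "Our board of directors met four times this year.",
    "Scope 1 emissions were 1200 tonnes CO2e and Scope 2 emissions were 800 tonnes CO2e.",
    "Employee training hours increased.",
]
top = rank_pages("scope 1 scope 2 emissions tonnes", pages, top_k=1)
```

In a real RAG pipeline the top-ranked pages would then be passed to the language model as context; here `top` simply holds the index of the page that mentions emissions.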
Background
This project began as the team's submission for the 2024 ClimateNLP workshop at ACL. Built by the LMU SODA Lab in collaboration with the Data Service Centre of Deutsche Bundesbank, climatextract combines research around ESG reporting and Intelligent Document Processing to automate what was previously a tedious manual annotation process.
This repository is organized as follows:
- climatextract: package source code
- data: source data to be analyzed
- docs: package documentation (built with mkdocs)
- tests: acceptance tests
Setup
Python environment
We recommend running the code in a virtual environment with Python 3.11 or later.
First, check out the code, then create a virtual environment and install all dependencies:
cd climatextract
python -m venv co2_info_extraction
source co2_info_extraction/bin/activate
pip install -e .
See the Installation guide for additional steps and alternative deployment options.
Usage
Place your PDF sustainability reports in the data/pdfs/ directory, then run the extraction pipeline:
from climatextract import extract
result_path = extract("./data/pdfs/company_2023_report.pdf")
Results are saved as CSV files in output/<run-id>/. See the Quickstart for more examples.
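For downstream analysis, the CSV files of a run can be read back with the standard library. This is a minimal sketch that assumes only what is stated above (a run directory containing CSV files); it makes no assumption about column names, and `load_run_tables` is a hypothetical helper, not part of the climatextract API.

```python
import csv
from pathlib import Path


def load_run_tables(run_dir: str) -> dict[str, list[dict[str, str]]]:
    """Read every CSV file in run_dir into a list of row dictionaries.

    Returns a mapping from file name to its rows. All values are kept as
    strings because the column layout is not assumed here.
    """
    tables: dict[str, list[dict[str, str]]] = {}
    for csv_path in sorted(Path(run_dir).glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as fh:
            tables[csv_path.name] = list(csv.DictReader(fh))
    return tables
```

Calling `load_run_tables` on the directory of a finished run returns one entry per CSV file, which can then be inspected or handed to a data-frame library.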
Configuration
Extraction behavior is controlled via a climatextract.toml file in your working directory. It lets you configure the LLM model, embedding model, prompt type, year range, semantic search parameters, and more. See the Configuration guide for all available options.
Running tests
python -m pytest
See tests/README.md for details on the acceptance test suite.
Documentation
The full documentation is hosted at gist-sustainability.github.io/climatextract and covers usage, configuration, architecture, and API reference:
| Section | Description |
|---|---|
| Installation | Detailed setup instructions |
| Quickstart | First extraction walkthrough |
| Configuration | All TOML configuration options |
| Custom Providers | Plug in a non-Azure LLM or embedding backend |
| Architecture | Pipeline design and components |
| Prompts | How extraction prompts work |
| Evaluation | Measuring extraction quality |
| API Reference | Public API functions |
To build and serve the docs locally:
pip install -e '.[docs]'
mkdocs serve
File details
Details for the file climatextract-0.3.1.tar.gz.
File metadata
- Download URL: climatextract-0.3.1.tar.gz
- Upload date:
- Size: 74.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 69494e408372aeb2c14e514b32f912e3d34385d9a1e0f4ff96b74670a067d725 |
| MD5 | f45cffbfba9303cee946ad81d67b7acd |
| BLAKE2b-256 | 439f157e93c98252b9c91bfec2aca250db193ea70dada2ce23c81905a4df5f41 |
File details
Details for the file climatextract-0.3.1-py3-none-any.whl.
File metadata
- Download URL: climatextract-0.3.1-py3-none-any.whl
- Upload date:
- Size: 80.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e63c4cfd251732f7806872c0e84fcf404f0b7590e1561ca347155960b81b0aa3 |
| MD5 | 0a4c9d38b015a1abfc4f8c12a1269c30 |
| BLAKE2b-256 | 1c078dd94a6fa38a5af1cd856a71f2438e22b10aafff32f711b5c6ef96a60e01 |