Extract CO2 emissions data from PDF sustainability reports using LLMs
information-extraction-pilot
Information-extraction-pilot is a retrieval-augmented generation (RAG) pipeline that surfaces CO₂ emissions data from corporate sustainability reports. It embeds PDF pages, ranks relevant context, and prompts a large language model to extract Scope 1–3 emissions into structured tables for downstream analysis.
Background
This pilot began as the team’s submission for the 2024 ClimateNLP workshop at ACL. The repository now serves as the maintained codebase for automating emissions extraction, while retaining the project’s research lineage.
This repository is organized as follows:
- data: source data to be analyzed and the gold standard dataset
- output: pipeline results
- prompt: prompt templates and queries
- src: pipeline source code
- tests: automated checks for the pilot
Setup
Python environment
It is recommended to run the code in a virtual environment using at least Python 3.11. If you are using pip, run

    python3.11 -m venv co2_info_extraction
    source co2_info_extraction/bin/activate
    pip install -r requirements.txt

to install all dependencies.
Other dependencies
Since the Python package pdf2image is only a wrapper around Poppler, you will need to install Poppler itself separately. See https://pypi.org/project/pdf2image/ for platform-specific instructions.
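As an illustration only (package names and managers vary by platform), Poppler can usually be installed with your system package manager, for example:

```sh
# Debian/Ubuntu
sudo apt-get install poppler-utils

# macOS (Homebrew)
brew install poppler
```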
Azure Authentication
This repository uses Azure services, so you need access to them. The code relies on the presence of a .env file that stores your credentials as environment variables; configure your own authentication workflow accordingly (the Databricks-related variables are described below).
Azure Databricks
Furthermore, the repository uses MLflow for experiment tracking. To set up access to the MLflow Tracking Server on Azure Databricks, you need to create a personal access token. Follow these steps:

- Log into Azure.
- Search for `gist-mlflow-tracking-server` to find the respective Databricks instance.
- Copy the URL containing `azuredatabricks.net` and save it in the `.env` file as the `DATABRICKS_HOST` variable.
- Save the variable `MLFLOW_TRACKING_URI` with the value `databricks` to the `.env` file.
- Launch the workspace and click on your initials in the upper right corner.
- Navigate to `Settings > User > Developer > Access tokens` and click on `Manage`. Generate a new access token and save it in the `.env` file as the `DATABRICKS_TOKEN` variable.
- Be aware that it takes some time for the token to become active, so you might get 401 authentication errors when first running the code. These should resolve after a short while.
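A minimal sketch of the resulting .env file, assuming only the Databricks/MLflow variables described above (the host URL and token are placeholders; any additional Azure credentials required by your own authentication workflow would go here in the same way):

```
# .env (placeholders only; never commit real credentials)
DATABRICKS_HOST=https://adb-0000000000000000.0.azuredatabricks.net
DATABRICKS_TOKEN=<your-personal-access-token>
MLFLOW_TRACKING_URI=databricks
```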
Running main.py
The script uses three dataclasses to manage configurations: MlflowParams, ConfigParams, and ExperimentParams. These can be customized directly in main.py or through external configuration files integrated into config.py.
Key Parameters
The following parameters can be updated through the helpers.update_dataclass() function; an illustrative sketch of the two dataclasses follows the list below.
ConfigParams:

- `gold_standard`: currently supports `gist_2025` (default)
- `filename_list`: list of filenames that will be fed into the pipeline; can be adjusted manually or via the function `helpers.get_file_paths`

ExperimentParams:

- `emb_model`: name of the embedding model
- `llm_model`: name of the LLM to use
- `prompt_type`: type of prompt (`default` or `custom_gaia`)
- `search_query`: query passed to the pipeline
- `year_min` and `year_max`: filters for data based on year
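For orientation only, the two dataclasses described above might look roughly like the following; the field names come from the list above, but the types and defaults are illustrative assumptions rather than the repository's actual definitions:

```python
from dataclasses import dataclass, field


@dataclass
class ConfigParams:
    # Which gold-standard dataset to evaluate against.
    gold_standard: str = "gist_2025"
    # PDF reports fed into the pipeline.
    filename_list: list[str] = field(default_factory=list)


@dataclass
class ExperimentParams:
    emb_model: str = ""           # name of the embedding model
    llm_model: str = ""           # name of the LLM to use
    prompt_type: str = "default"  # "default" or "custom_gaia"
    search_query: str = ""        # query passed to the pipeline
    year_min: int | None = None   # lower bound for year filtering
    year_max: int | None = None   # upper bound for year filtering
```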
Running the Script
Standard Execution
To run the pipeline, execute:
python main.py
Customizing Parameters
Modify the parameters in main.py by updating the relevant dataclass instances. For example:
    helpers.update_dataclass(config_params, {
        'filename_list': ['./data/pdfs/apple_2021_en.pdf'],
    })
    helpers.update_dataclass(experiment_params, {
        'prompt_type': 'custom_gaia',
        'search_query': "What are the carbon emissions for the last 10 years?",
    })
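`helpers.update_dataclass` is provided by the repository. As a mental model only (an assumption, not the actual implementation), it can be thought of as a small helper that copies dictionary entries onto the dataclass instance:

```python
from dataclasses import fields


def update_dataclass(instance, updates: dict) -> None:
    """Illustrative sketch: set each entry in `updates` as an attribute
    on `instance`, rejecting keys that are not dataclass fields."""
    valid = {f.name for f in fields(instance)}
    for key, value in updates.items():
        if key not in valid:
            raise KeyError(f"Unknown parameter: {key}")
        setattr(instance, key, value)
```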
Logging and Debugging
- Set the desired log level in the `logging.basicConfig()` call, e.g., `logging.DEBUG` for verbose logs (see the snippet below).
- Outputs and errors will appear in the console.
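For example, a minimal configuration (the format string is just an illustration) would be:

```python
import logging

# DEBUG produces verbose logs; switch to logging.INFO for quieter output.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
```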
File details
Details for the file climxtract_test-0.1.1.tar.gz.
File metadata
- Download URL: climxtract_test-0.1.1.tar.gz
- Upload date:
- Size: 57.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6133cf2b3321a6fde51fd0ddee0bcb65e520095b3e7f774081648896a9d192c4 |
| MD5 | 252c93de1b4c878e3611df2eb5d3a28c |
| BLAKE2b-256 | b921e34f492e59ccab943952ffcbd9fda400e316185487b6d1944c7bd7f2ea27 |
File details
Details for the file climxtract_test-0.1.1-py3-none-any.whl.
File metadata
- Download URL: climxtract_test-0.1.1-py3-none-any.whl
- Upload date:
- Size: 62.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a17bcccef55431911ef0f66a4490d6002bd470c802070479ae891235011898c8 |
| MD5 | d99d4fbd1a2508880bc55452842e43e2 |
| BLAKE2b-256 | 77e0924c9adc35c0509255ca82dfd58b6cd7e85131d4090fa9c33ac4999c85d1 |