Extract CO2 emissions data from PDF sustainability reports using LLMs

information-extraction-pilot

Information-extraction-pilot is a retrieval-augmented generation (RAG) pipeline that surfaces CO₂ emissions data from corporate sustainability reports. It embeds PDF pages, ranks relevant context, and prompts a large language model to extract Scope 1–3 emissions into structured tables for downstream analysis.
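
The ranking step described above can be illustrated with a minimal sketch. This is not the pilot's implementation: the bag-of-words embed function stands in for a real embedding model, the names embed, cosine, and rank_pages are hypothetical, and the page texts are made up for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase token counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_pages(pages: list[str], query: str, top_k: int = 2) -> list[int]:
    """Return the indices of the top_k pages most similar to the query."""
    query_vec = embed(query)
    scores = [cosine(embed(page), query_vec) for page in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)[:top_k]


# Made-up page texts, for illustration only.
pages = [
    "Our board welcomed two new members this year.",
    "Scope 1 emissions were 120 tCO2e and Scope 2 emissions were 340 tCO2e.",
    "Employee volunteering hours increased.",
]
best = rank_pages(pages, "scope 1 and scope 2 carbon emissions", top_k=1)
print(best)
```

In a full pipeline, the text of the selected pages would then be placed into the prompt sent to the LLM.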

Background

This pilot began as the team’s submission for the 2024 ClimateNLP workshop at ACL. The repository now serves as the maintained codebase for automating emissions extraction, while retaining the project’s research lineage.

This repository is organized as follows:

data: source data to be analyzed and the gold standard dataset
output: pipeline results
prompt: prompt templates and queries
src: pipeline source code
tests: automated checks for the pilot

Setup

Python environment

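Running the code in a virtual environment with Python 3.11 or later is recommended. If you are using pip, run

python3.11 -m venv co2_info_extraction
source co2_info_extraction/bin/activate
pip install -r requirements.txt

to create and activate the environment and install all dependencies. On Windows, activate the environment with co2_info_extraction\Scripts\activate instead.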

Other dependencies

Because the Python package pdf2image is a wrapper around poppler, you also need to install poppler itself. See https://pypi.org/project/pdf2image/ for platform-specific instructions.
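
A quick way to check whether poppler is reachable is to look for its command-line tools on the PATH. The sketch below assumes the tools pdftoppm and pdfinfo, which pdf2image calls; the function name missing_poppler_tools is hypothetical and not part of this repository.

```python
import shutil


def missing_poppler_tools(tools=("pdftoppm", "pdfinfo")) -> list[str]:
    """Return the poppler command-line tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_poppler_tools()
if missing:
    print(f"Poppler tools not found on PATH: {', '.join(missing)}")
else:
    print("Poppler tools found on PATH.")
```

If poppler is installed in a non-standard location, the check reports it as missing even though pdf2image could still be pointed at it explicitly.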

Azure Authentication

This repository uses Azure modules, so you need access to Azure. The code relies on an .env file that stores your credentials. Configure your own authentication workflow with environment variables; see the description.
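
For readers unfamiliar with the .env format, the following sketch shows how KEY=VALUE lines can be read into a dictionary using only the standard library. It is an illustration, not the loader this repository uses (which is not stated here); parse_env and the variable names EXAMPLE_ENDPOINT and EXAMPLE_API_KEY are hypothetical.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines; skip blank lines and '#' comments; strip matching quotes."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        values[key.strip()] = value
    return values


# Placeholder content; real credentials belong in a local .env file
# that is never committed to version control.
example = """
# hypothetical variable names
EXAMPLE_ENDPOINT=https://example.invalid/
EXAMPLE_API_KEY="not-a-real-key"
"""
settings = parse_env(example)
print(sorted(settings))
```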

Azure Databricks

In addition, the repository uses MLflow to track experiments. To set up access to the MLflow Tracking Server on Azure Databricks, you need to create a personal access token. Follow these steps:

Log into Azure.
Search for gist-mlflow-tracking-server to find the respective Databricks instance.
Copy the URL that contains azuredatabricks.net and save it in the .env file as the DATABRICKS_HOST variable.
Save the variable MLFLOW_TRACKING_URI with the value databricks in the .env file.
Launch the workspace and click on your initial in the upper right corner.
Navigate to Settings > User > Developer > Access tokens and click on Manage. Generate a new access token and save it in the .env file as the DATABRICKS_TOKEN variable. The token can take some time to become active, so you might see 401 authentication errors when you first run the code. These should stop once the token is active.
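
Before the first run, it can help to confirm that the three variables from the steps above are present. The sketch below checks a mapping such as os.environ; the function name find_config_problems is hypothetical, and the check assumes the variables have already been loaded into the process environment.

```python
import os

REQUIRED_VARS = ("DATABRICKS_HOST", "MLFLOW_TRACKING_URI", "DATABRICKS_TOKEN")


def find_config_problems(env) -> list[str]:
    """Return human-readable problems with the Databricks/MLflow settings in env."""
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    host = env.get("DATABRICKS_HOST", "")
    if host and "azuredatabricks.net" not in host:
        problems.append("DATABRICKS_HOST does not contain azuredatabricks.net")
    uri = env.get("MLFLOW_TRACKING_URI", "")
    if uri and uri != "databricks":
        problems.append("MLFLOW_TRACKING_URI is not set to 'databricks'")
    return problems


for problem in find_config_problems(os.environ):
    print(problem)
```

This check only validates that the settings exist and look plausible. It cannot tell whether a freshly generated token is already active, so a 401 error shortly after creating the token is still possible.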

Run of main.py

The script uses three dataclasses to manage configurations: MlflowParams, ConfigParams, and ExperimentParams. These can be customized directly in main.py or through external configuration files integrated into config.py.

Key Parameters

The following parameters can be updated through the helpers.update_dataclass() function.

ConfigParams:

gold_standard: Currently supports gist_2025 (default)
filename_list: List of files passed to the pipeline. Adjust it manually or via the function helpers.get_file_paths.

ExperimentParams:

emb_model: Name of the embedding model.
llm_model: Name of the LLM to use.
prompt_type: Type of prompt (default or custom_gaia).
search_query: Query passed to the pipeline.
year_min and year_max: Filters for data based on year.
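
As an illustration of how the parameters listed above might be grouped, here is a sketch of two of the dataclasses. Only the field names, the gist_2025 default, and the two prompt types come from the list above; the field types, the other default values, and the validation in __post_init__ are assumptions, and the real definitions may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ConfigParams:
    gold_standard: str = "gist_2025"  # default named in the parameter list
    filename_list: list[str] = field(default_factory=list)


@dataclass
class ExperimentParams:
    emb_model: str = "example-embedding-model"  # placeholder, not a real model name
    llm_model: str = "example-llm"              # placeholder, not a real model name
    prompt_type: str = "default"                # "default" or "custom_gaia"
    search_query: str = ""
    year_min: Optional[int] = None              # assumed type
    year_max: Optional[int] = None              # assumed type

    def __post_init__(self) -> None:
        # Assumed validation, added for illustration only.
        if self.prompt_type not in ("default", "custom_gaia"):
            raise ValueError(f"Unsupported prompt_type: {self.prompt_type!r}")
        if (
            self.year_min is not None
            and self.year_max is not None
            and self.year_min > self.year_max
        ):
            raise ValueError("year_min must not be greater than year_max")


config_params = ConfigParams()
experiment_params = ExperimentParams(prompt_type="custom_gaia", year_min=2015, year_max=2024)
print(config_params, experiment_params, sep="\n")
```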

Running the Script

Standard Execution

To run the pipeline, execute:

python main.py

Customizing Parameters

Modify the parameters in main.py by updating the relevant dataclass instances. For example:

helpers.update_dataclass(config_params, {
    'filename_list': ['./data/pdfs/apple_2021_en.pdf'],
})
helpers.update_dataclass(experiment_params, {
    'prompt_type': 'custom_gaia',
    'search_query': "What are the carbon emissions for the last 10 years?",
})
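
To show what a helper of this kind might do, here is a minimal stand-in that updates a dataclass instance in place and rejects unknown field names. It is a sketch under that assumption, not the actual helpers.update_dataclass; DemoParams is a hypothetical dataclass used only for the example.

```python
from dataclasses import dataclass, fields
from typing import Any


def update_dataclass(instance: Any, updates: dict[str, Any]) -> None:
    """Set the given fields in place; reject names the dataclass does not declare."""
    known = {f.name for f in fields(instance)}
    unknown = set(updates) - known
    if unknown:
        raise AttributeError(f"Unknown parameter(s): {sorted(unknown)}")
    for name, value in updates.items():
        setattr(instance, name, value)


@dataclass
class DemoParams:
    prompt_type: str = "default"
    search_query: str = ""


params = DemoParams()
update_dataclass(params, {"prompt_type": "custom_gaia"})
print(params)
```

Rejecting unknown keys is a design choice in this sketch: it surfaces typos in parameter names immediately instead of silently adding a new attribute.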

Logging and Debugging

Set the desired log level in the logging.basicConfig() call, e.g., logging.DEBUG for verbose logs.
Outputs and errors will appear in the console.
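
A minimal logging setup along these lines could look as follows. The format string, the configure_logging wrapper, and the logger name pipeline are illustrative choices, not taken from main.py.

```python
import logging


def configure_logging(level: int = logging.INFO) -> None:
    """Configure the root logger to write to the console at the given level."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        force=True,  # replace handlers that were already installed (Python 3.8+)
    )


configure_logging(logging.DEBUG)
logging.getLogger("pipeline").debug("verbose logging enabled")
```

Without force=True, logging.basicConfig() does nothing if the root logger already has handlers, which is a common reason a changed log level appears to have no effect.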

Release history (climxtract-test)

0.1.1 (this version): Dec 2, 2025
0.1.0: Dec 1, 2025
