
OS-Climate Data Extraction Tool


💬 Important

On June 26, 2024, the Linux Foundation announced the merger of its financial services umbrella, the Fintech Open Source Foundation (FINOS, https://finos.org), with OS-Climate, an open source community dedicated to building data technologies, modelling, and analytic tools that will drive global capital flows into climate change mitigation and resilience. OS-Climate projects are in the process of transitioning to the FINOS governance framework (https://community.finos.org/docs/governance). Read more at https://finos.org/press/finos-join-forces-os-open-source-climate-sustainability-esg

OSC Transformer Pre-Steps


OS-Climate Transformer Pre-Steps Tool

This package provides a CLI tool that extracts data from PDF files into JSON documents and creates training data sets for later use with transformer models that extract relevant information. It can also be used independently.

Quick start

Install via PyPI

You can simply install the package via:

$ pip install osc-transformer-presteps

Afterwards, you can use the tooling as a CLI tool by typing:

$ osc-transformer-presteps

We are using Typer to provide a user-friendly CLI. All details and help will be shown within the CLI itself and are not described here in more detail.

Example 1: Extracting Data from PDFs

Assume the folder structure is as follows:

project/
├─input/
│ ├─file_1.pdf
│ ├─file_2.pdf
│ └─file_3.pdf
├─logs/
└─output/

Now, after installing osc-transformer-presteps, run the following command to extract data from the PDFs to JSON:

$ osc-transformer-presteps extraction run-local-extraction 'input' --output-folder='output' --logs-folder='logs' --force

Note: The --force flag overcomes encryption. Please check whether this is legal in your jurisdiction.
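If you prefer to launch the extraction from a script, the command above can be assembled as an argument list. The following is a minimal sketch, not part of the package: the helper name build_extraction_command is hypothetical, and only the subcommands and flags shown in the command above are used.

```python
import shlex
from pathlib import Path


def build_extraction_command(project_dir: Path, force: bool = False) -> list[str]:
    """Return the argument list for the extraction command shown above.

    Hypothetical helper for illustration; it only assembles arguments.
    """
    command = [
        "osc-transformer-presteps",
        "extraction",
        "run-local-extraction",
        str(project_dir / "input"),
        f"--output-folder={project_dir / 'output'}",
        f"--logs-folder={project_dir / 'logs'}",
    ]
    if force:
        # Add --force only after checking the legal note above.
        command.append("--force")
    return command


# Print the command instead of running it, so the sketch has no side effects.
print(shlex.join(build_extraction_command(Path("project"), force=True)))
```

Passing the returned list to subprocess.run(..., check=True) would execute it, provided osc-transformer-presteps is installed in the active environment.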

Example 2: Curating a New Training Data Set for Relevance Detector

To perform curation, you will need a KPI mapping file and an annotations file. Here are examples of such files:

KPI Mapping File:

kpi_mapping.csv

kpi_id  question                   sectors       add_year  kpi_category
------  -------------------------  ------------  --------  ------------
0       What is the company name?  "OG, CM, CU"  FALSE     TEXT

  • kpi_id: The unique identifier for each KPI.

  • question: The specific question being asked to extract relevant information.

  • sectors: The industry sectors to which the KPI applies.

  • add_year: Indicates whether to include the year in the extracted data (TRUE/FALSE).

  • kpi_category: The category of the KPI, typically specifying the data type (e.g., TEXT).
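To make the column descriptions concrete, here is a minimal sketch that reads the example row with Python's standard csv module. It assumes a comma-delimited file with a header row; the function name load_kpi_mapping is hypothetical, and the package itself may parse the file differently.

```python
import csv
import io

# Inline copy of the example row above; in practice, open kpi_mapping.csv instead.
EXAMPLE_CSV = '''kpi_id,question,sectors,add_year,kpi_category
0,What is the company name?,"OG, CM, CU",FALSE,TEXT
'''


def load_kpi_mapping(handle):
    """Parse KPI mapping rows into dictionaries (hypothetical helper)."""
    rows = []
    for raw in csv.DictReader(handle):
        rows.append(
            {
                "kpi_id": raw["kpi_id"],  # kept as text; no numeric type is assumed
                "question": raw["question"],
                # The quoted cell "OG, CM, CU" becomes a list of sector codes.
                "sectors": [s.strip() for s in raw["sectors"].split(",")],
                # TRUE/FALSE text becomes a Python bool.
                "add_year": raw["add_year"].strip().upper() == "TRUE",
                "kpi_category": raw["kpi_category"],
            }
        )
    return rows


mapping = load_kpi_mapping(io.StringIO(EXAMPLE_CSV))
print(mapping[0]["sectors"])  # ['OG', 'CM', 'CU']
```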

Annotation File:

annotations_file.xlsx

Column               Example value
-------------------  ----------------------------------------
company              Royal Dutch Shell plc
source_file          Test.pdf
source_page          [1]
kpi_id               1
year                 2019
answer               2019
data_type            TEXT
relevant_paragraphs  ["Sustainability Report 2019"]
annotator            1qbit_edited_kpi_extraction_Carolin.xlsx
sector               OG

  • company: The name of the company being analyzed.

  • source_file: The document from which data is extracted.

  • source_page: The page number(s) containing the relevant information.

  • kpi_id: The ID of the KPI associated with the data.

  • year: The year to which the data refers.

  • answer: The specific data or text extracted as an answer.

  • data_type: The type of the extracted data (e.g., TEXT or TABLE).

  • relevant_paragraphs: The paragraph(s) or context where the data was found.

  • annotator: The person or tool that performed the annotation.

  • sector: The industry sector the company belongs to.
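The source_page and relevant_paragraphs values in the example are written as bracketed lists. As a minimal sketch, assuming these cells are stored as JSON-style text with straight quotes (an assumption, not something the package documents here), they can be turned into Python lists as follows. The helper name parse_list_cells is hypothetical.

```python
import json

# The example annotation row from above, with every cell held as text.
example_row = {
    "company": "Royal Dutch Shell plc",
    "source_file": "Test.pdf",
    "source_page": "[1]",
    "kpi_id": "1",
    "year": "2019",
    "answer": "2019",
    "data_type": "TEXT",
    "relevant_paragraphs": '["Sustainability Report 2019"]',
    "annotator": "1qbit_edited_kpi_extraction_Carolin.xlsx",
    "sector": "OG",
}


def parse_list_cells(row):
    """Return a copy of the row with list-valued cells decoded (hypothetical helper)."""
    parsed = dict(row)
    for column in ("source_page", "relevant_paragraphs"):
        # ASSUMPTION: the cell text is valid JSON, e.g. [1] or ["..."].
        parsed[column] = json.loads(row[column])
    return parsed


parsed = parse_list_cells(example_row)
print(parsed["source_page"])          # [1]
print(parsed["relevant_paragraphs"])  # ['Sustainability Report 2019']
```

Cells that contain typographic (curly) quotes would not parse this way and would need to be normalized first.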

You can find demo files in the demo/curation/input folder.

Assume the folder structure is as follows:

project/
├─input/
│ ├─data_from_extraction/
│ │ ├─file_1.json
│ │ ├─file_2.json
│ │ └─file_3.json
│ ├─kpi_mapping/
│ │ └─kpi_mapping.csv
│ └─annotations_file/
│   └─annotations_file.xlsx
├─logs/
└─output/

Now, you can run the following command to curate a new training data set:

$ osc-transformer-presteps relevance-curation run-local-curation 'input/data_from_extraction/file_1.json' 'input/annotations_file/annotations_file.xlsx' 'input/kpi_mapping/kpi_mapping.csv'

Note: The previous command may need adjusting on other operating systems such as Windows, where the path separator differs.
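The separator difference mentioned in the note can be illustrated with Python's standard pathlib module. This sketch only renders the same logical path in both styles; it does not call the CLI.

```python
from pathlib import PurePosixPath, PureWindowsPath

parts = ("input", "data_from_extraction", "file_1.json")

# The same logical path rendered with each platform's separator.
posix_path = str(PurePosixPath(*parts))
windows_path = str(PureWindowsPath(*parts))

print(posix_path)    # input/data_from_extraction/file_1.json
print(windows_path)  # input\data_from_extraction\file_1.json
```

In a script, pathlib.Path(*parts) picks the flavour of the current operating system automatically, which avoids hard-coding either separator.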

Example 3: Curating a New Training Data Set for KPI Detector

To perform curation, you will need the extracted JSON files, a KPI mapping file, and an annotations file (the same as described above).

Assume the folder structure is as follows:

project/
├─input/
│ ├─data_from_extraction/
│ │ ├─file_1.json
│ │ ├─file_2.json
│ │ └─file_3.json
│ ├─kpi_mapping/
│ │ └─kpi_mapping.csv
│ ├─annotations_file/
│ │ └─annotations_file.xlsx
│ └─relevance_detection_file/
│   └─relevance_detection.csv
├─logs/
└─output/

Now, you can run the following command to curate a new training data set:

$ osc-transformer-presteps kpi-curation run-local-kpi-curation 'input/annotations_file/' 'input/data_from_extraction/' 'output/' 'input/kpi_mapping/kpi_mapping.csv' 'input/relevance_detection_file/relevance_detection.csv' --val-ratio 0.2 --agg-annotation "" --find-new-answerable --create-unanswerable

Note: The previous command may need adjusting on other operating systems such as Windows, where the path separator differs.
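Because this command depends on several input locations, it can help to confirm the layout before running it. The sketch below is not part of the package: the helper name missing_inputs is hypothetical, and the relative paths are taken from the folder structure listed for this example.

```python
import tempfile
from pathlib import Path

# Relative paths taken from the folder structure listed for this example.
EXPECTED_INPUTS = (
    "input/data_from_extraction",
    "input/kpi_mapping/kpi_mapping.csv",
    "input/annotations_file/annotations_file.xlsx",
    "input/relevance_detection_file/relevance_detection.csv",
    "output",
)


def missing_inputs(project_dir: Path) -> list[str]:
    """Return the expected paths that do not exist under project_dir."""
    return [rel for rel in EXPECTED_INPUTS if not (project_dir / rel).exists()]


# Demo on a throwaway directory that contains only two of the expected entries.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "input" / "data_from_extraction").mkdir(parents=True)
    (project / "output").mkdir()
    demo_missing = missing_inputs(project)

print(demo_missing)
```

An empty list means every expected file and folder is in place.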

Important Note on Annotations

When performing curation, every JSON file used in the process must be listed in the annotation file (in the demo, this is demo/curation/input/test_annotation.xlsx). JSON files that are missing from the annotation file will result in corrupted output.
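A quick pre-check along these lines can catch unlisted files before curation starts. The sketch below is hypothetical and rests on a loudly stated assumption: that a JSON file such as file_1.json corresponds to a source_file entry with the same stem, such as file_1.pdf. If your extraction output uses a different naming scheme, adapt the comparison accordingly. The list of source_file values is passed in directly, so no spreadsheet library is needed.

```python
import tempfile
from pathlib import Path


def unlisted_json_files(json_dir: Path, source_files: list[str]) -> list[str]:
    """Return JSON file names that have no matching source_file entry."""
    # ASSUMPTION: file_1.json matches a source_file entry such as file_1.pdf.
    listed_stems = {Path(name).stem for name in source_files}
    return sorted(
        path.name for path in json_dir.glob("*.json") if path.stem not in listed_stems
    )


# Demo: two extracted files, only one of which is listed in the annotations.
with tempfile.TemporaryDirectory() as tmp:
    json_dir = Path(tmp)
    (json_dir / "file_1.json").write_text("{}")
    (json_dir / "file_2.json").write_text("{}")
    demo_result = unlisted_json_files(json_dir, ["file_1.pdf"])

print(demo_result)  # ['file_2.json']
```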

Developer space

Use Code Directly Without CLI via GitHub Repository

First, clone the repository to your local environment:

$ git clone https://github.com/os-climate/osc-transformer-presteps

We are using pdm to manage the packages and tox for a stable test framework. First, install pdm (possibly in a virtual environment) via:

$ pip install pdm

Afterwards, sync your system via:

$ pdm sync

You will find multiple demos on how to proceed in the demo folder.

pdm

To add new dependencies, use pdm. For example, you can add numpy via:

$ pdm add numpy

For more detailed descriptions, check the PDM project homepage.

tox

For running linting tools, we use tox. You can run this outside of your virtual environment:

$ pip install tox
$ tox -e lint
$ tox -e test

This automatically applies checks to your code and runs the provided pytest tests. See the tox documentation for more details.

