OS-Climate Data Extraction Tool

This project provides a CLI tool and Python scripts to train a Hugging Face Transformer model or a local Transformer model and to perform inference with it. The primary goal of the inference is to determine the relevance of a given context to a given question.

Quick Start

To install the OSC Transformer Based Extractor CLI, use pip:

$ pip install osc-transformer-based-extractor

Afterwards you can use the tooling as a CLI tool by simply typing:

$ osc-transformer-based-extractor

The CLI is built with typer. All commands and options are documented in the help of the CLI tool itself, so they are not described in more detail here.

Example: assume the following folder structure:

project/
│
├── kpi_mapping.csv
├── training_data.csv
├── data/
│   └── (json files for inference command)
├── model/
│   └── (model-related files go here)
├── saved__model/
│   └── (output files of trained models)
└── output/
    └── (output files from inference command)
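The directories in this layout can be created up front with a few lines of Python. This is only a minimal sketch: the helper name `create_project_layout` is hypothetical and not part of the package, and the example runs in a temporary directory so that no real project is touched.

```python
import tempfile
from pathlib import Path


def create_project_layout(root: Path) -> list[Path]:
    """Create the sub-directories of the example layout and return them.

    Hypothetical helper for illustration; it is not part of the package.
    """
    subdirs = ["data", "model", "saved__model", "output"]
    created = []
    for name in subdirs:
        path = root / name
        # exist_ok=True makes the call safe to repeat on an existing project.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Demonstrate in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    dirs = create_project_layout(Path(tmp) / "project")
    layout_names = sorted(p.name for p in dirs)
    all_exist = all(p.is_dir() for p in dirs)
    print(layout_names)
```

The two CSV files (kpi_mapping.csv and training_data.csv) still have to be placed in the project folder by hand.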

After installing osc-transformer-based-extractor, you can fine-tune the model on the training data with the following command:

$ osc-transformer-based-extractor relevance-detector fine-tune \
  --data_path "project/training_data.csv" \
  --model_name "bert-base-uncased" \
  --num_labels 2 \
  --max_length 128 \
  --epochs 3 \
  --batch_size 16 \
  --output_dir "project/saved__model/" \
  --save_steps 500
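If the fine-tuning step is driven from a Python script, the same command can be assembled as an argument list. The sketch below uses only the flags and values shown in the command above; the function name `build_fine_tune_command` is hypothetical, and the code only builds and prints the command, it does not run it.

```python
import shlex


def build_fine_tune_command(data_path, output_dir, model_name="bert-base-uncased",
                            num_labels=2, max_length=128, epochs=3,
                            batch_size=16, save_steps=500):
    """Return the fine-tune CLI call as a list of arguments.

    Hypothetical helper; defaults mirror the example command in this README.
    """
    return [
        "osc-transformer-based-extractor", "relevance-detector", "fine-tune",
        "--data_path", str(data_path),
        "--model_name", str(model_name),
        "--num_labels", str(num_labels),
        "--max_length", str(max_length),
        "--epochs", str(epochs),
        "--batch_size", str(batch_size),
        "--output_dir", str(output_dir),
        "--save_steps", str(save_steps),
    ]


command = build_fine_tune_command("project/training_data.csv", "project/saved__model/")
# shlex.join quotes the arguments so the line can be pasted into a shell.
print(shlex.join(command))
```

Once the package is installed, the list could be passed to `subprocess.run(command, check=True)` to start the training.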

To perform inference, run the following command:

$ osc-transformer-based-extractor relevance-detector perform-inference \
  --folder_path "project/data/" \
  --kpi_mapping_path "project/kpi_mapping.csv" \
  --output_path "project/output/" \
  --model_path "project/model/" \
  --tokenizer_path "project/model/" \
  --threshold 0.5
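This README does not describe how the threshold value is applied internally. A common pattern for a two-class relevance classifier is to turn the model's two output scores into probabilities and to compare the probability of the "relevant" class with the threshold. The sketch below illustrates that pattern only; the function names, the choice of index 1 as the relevant class, and the `>=` comparison are all assumptions, not the tool's confirmed behavior.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_relevant(logits, threshold=0.5):
    """Return (decision, probability) for a two-class output.

    Assumption: index 1 holds the score of the "relevant" class.
    """
    probability = softmax(logits)[1]
    return probability >= threshold, probability


# Hypothetical scores for two question/context pairs.
relevant_flag, relevant_prob = is_relevant([0.2, 1.4])
irrelevant_flag, irrelevant_prob = is_relevant([1.0, -1.0])
print(relevant_flag, round(relevant_prob, 3))
print(irrelevant_flag, round(irrelevant_prob, 3))
```

Under this reading, raising the threshold keeps fewer but more confident matches, while lowering it keeps more candidate contexts.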

Training Data

Training File

To train the model, you need a CSV file with the following columns:
  • Question
  • Context
  • Label
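Before starting a training run, it can help to check that a file really has these three columns. The following is a minimal sketch using only the standard library; the helper `validate_training_rows` is hypothetical, the sample contexts are placeholders, and the restriction of Label to 0 or 1 is an assumption based on the `--num_labels 2` setting in the example command.

```python
import csv
import io

REQUIRED_COLUMNS = {"Question", "Context", "Label"}


def validate_training_rows(csv_text):
    """Parse CSV text and return (question, context, label) tuples.

    Raises ValueError if a required column is missing or a label is not 0/1.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for row_index, row in enumerate(reader, start=1):
        label = int(row["Label"])  # int() raises ValueError for non-numeric labels
        if label not in (0, 1):  # assumption: binary labels
            raise ValueError(f"data row {row_index}: unexpected label {label}")
        rows.append((row["Question"], row["Context"], label))
    return rows


# Placeholder data for illustration only.
sample = (
    "Question,Context,Label\n"
    '"What is the company name?","Example context that mentions the company name.",1\n'
    '"What is the company name?","Unrelated example context.",0\n'
)
rows = validate_training_rows(sample)
print(len(rows), "valid rows")
```

Extra columns are ignored by this check, so a file with additional metadata columns passes as long as the three required ones are present.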

Alternatively, the output of the https://github.com/os-climate/osc-transformer-presteps module can be used as training data. Its output looks like the following sample data:

training_data.csv

| Question | Context | Label | Company | Source File | Source Page | KPI ID | Year | Answer | Data Type | Annotator | Index |
|---|---|---|---|---|---|---|---|---|---|---|---|
| What is the company name? | The Company is exposed to a risk of by losses counterparties their contractual financial obligations when due, and in particular depends on the reliability of banks the Company deposits its available cash. | 0 | NOVATEK | 04_NOVATEK_AR_2016_ENG_11.pdf | ['0'] | 0 | 2016 | PAO NOVATEK | TEXT | train_anno_large.xlsx | 1022 |

KPI Mapping File

The inference command requires a kpi_mapping.csv file, which looks like this:

kpi_mapping.csv

| kpi_id | question | sectors | add_year | kpi_category |
|---|---|---|---|---|
| 1 | In which year was the annual report or the sustainability report published? | OG, CM, CU | FALSE | TEXT |
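To inspect a mapping file outside the CLI, it can be read with the standard `csv` module. The sketch below is an illustration, not the package's own loader: the `KpiMapping` class and `load_kpi_mapping` function are hypothetical names, and splitting `sectors` on commas and reading `add_year` as a TRUE/FALSE flag are assumptions drawn from the sample row above.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class KpiMapping:
    """One row of kpi_mapping.csv (hypothetical container for illustration)."""
    kpi_id: int
    question: str
    sectors: list[str]
    add_year: bool
    kpi_category: str


def load_kpi_mapping(csv_text):
    """Parse CSV text with the five columns shown above."""
    mappings = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        mappings.append(KpiMapping(
            kpi_id=int(row["kpi_id"]),
            question=row["question"],
            # Assumption: sectors is a comma-separated list of sector codes.
            sectors=[s.strip() for s in row["sectors"].split(",") if s.strip()],
            # Assumption: add_year is stored as the text TRUE or FALSE.
            add_year=row["add_year"].strip().upper() == "TRUE",
            kpi_category=row["kpi_category"],
        ))
    return mappings


sample = (
    "kpi_id,question,sectors,add_year,kpi_category\n"
    '1,"In which year was the annual report or the sustainability report '
    'published?","OG, CM, CU",FALSE,TEXT\n'
)
mappings = load_kpi_mapping(sample)
print(mappings[0])
```

To read a real file, pass the file's contents, for example `load_kpi_mapping(Path("project/kpi_mapping.csv").read_text())`.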

Developer Notes

Use the code directly, without the CLI, via the GitHub repository

First clone the repository to your local environment:

$ git clone https://github.com/os-climate/osc-transformer-based-extractor/

We are using pdm to manage the packages and tox for a stable test framework. Hence, first install pdm (possibly in a virtual environment) via:

$ pip install pdm

Afterwards, sync your system via:

$ pdm sync

The repository contains several demos that show how to proceed. See the folder [here](demo).

pdm

Use pdm to add new dependencies. New packages are added with pdm add, for example numpy:

$ pdm add numpy

For a very detailed description check the homepage of the pdm project:

https://pdm-project.org/en/latest/

tox

We use tox to run the linting tools and the tests. Run it outside of your virtual environment:

$ pip install tox
$ tox -e lint
$ tox -e test

This automatically applies some checks to your code and runs the provided pytest tests. For more details on tox, see the homepage of the tox project:

https://tox.wiki/en/4.16.0/

Contributing

Contributions are welcome! Please fork the repository and submit a pull request for any enhancements or bug fixes.

All contributions (including pull requests) must agree to the Developer Certificate of Origin (DCO) version 1.1. This is exactly the same one created and used by the Linux kernel developers and posted on http://developercertificate.org/. This is a developer’s certification that he or she has the right to submit the patch for inclusion into the project. Simply submitting a contribution implies this agreement. However, please include a “Signed-off-by” tag in every patch (this tag is a conventional way to confirm that you agree to the DCO).

On June 26, 2024, the Linux Foundation announced the merger of its financial services umbrella, the Fintech Open Source Foundation ([FINOS](https://finos.org)), with OS-Climate, an open source community dedicated to building data technologies, modeling, and analytic tools that will drive global capital flows into climate change mitigation and resilience. OS-Climate projects are in the process of transitioning to the [FINOS governance framework](https://community.finos.org/docs/governance). Read more on [finos.org/press/finos-join-forces-os-open-source-climate-sustainability-esg](https://finos.org/press/finos-join-forces-os-open-source-climate-sustainability-esg).
