
OS-Climate Data Extraction Tool

This project provides a CLI tool and Python scripts to train a Hugging Face Transformer model or a local Transformer model and to perform inference with it. The primary goal of inference is to determine how relevant a given context is to a given question.

Quick Start

To install the OSC Transformer Based Extractor CLI, use pip:

$ pip install osc-transformer-based-extractor

Afterwards you can use the tooling as a CLI tool by simply typing:

$ osc-transformer-based-extractor

The CLI is built with typer. All commands and options are documented in the help of the CLI tool itself and are not described in more detail here.

Example: Assume the following folder structure:

project/
│
├── kpi_mapping.csv
├── training_data.csv
├── data/
│   └── (JSON files for the inference command)
├── model/
│   └── (model-related files go here)
├── saved__model/
│   └── (output files of trained models)
└── output/
    └── (output files from the inference command)
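
The directories in this layout can also be created from Python. The following is a minimal sketch, not part of the package: the helper name `create_project_layout` is hypothetical, and the two CSV files still have to be supplied separately.

```python
from pathlib import Path


def create_project_layout(root: Path) -> list[Path]:
    """Create the example sub-directories under `root` and return them."""
    # Directory names are taken from the folder structure shown above.
    subdirs = ["data", "model", "saved__model", "output"]
    created = []
    for name in subdirs:
        path = root / name
        # parents=True also creates `root`; exist_ok=True makes reruns harmless.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `create_project_layout(Path("project"))` would create the four folders relative to the current working directory.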

With this layout in place (and osc-transformer-based-extractor installed), run the following command to fine-tune the model on the training data:

$ osc-transformer-based-extractor relevance-detector fine-tune \
  --data_path "project/training_data.csv" \
  --model_name "bert-base-uncased" \
  --num_labels 2 \
  --max_length 128 \
  --epochs 3 \
  --batch_size 16 \
  --output_dir "project/saved__model/" \
  --save_steps 500

To perform inference, run the following command:

$ osc-transformer-based-extractor relevance-detector perform-inference \
  --folder_path "project/data/" \
  --kpi_mapping_path "project/kpi_mapping.csv" \
  --output_path "project/output/" \
  --model_path "project/model/" \
  --tokenizer_path "project/model/" \
  --threshold 0.5
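
If inference is triggered from a Python script, the same invocation can be assembled as an argument list. The sketch below only reuses the sub-commands and options shown above; the helper name `build_inference_command` is hypothetical and not part of the package.

```python
import shlex


def build_inference_command(
    folder_path: str,
    kpi_mapping_path: str,
    output_path: str,
    model_path: str,
    tokenizer_path: str,
    threshold: float = 0.5,
) -> list[str]:
    """Return the perform-inference invocation as an argument list."""
    return [
        "osc-transformer-based-extractor",
        "relevance-detector",
        "perform-inference",
        "--folder_path", folder_path,
        "--kpi_mapping_path", kpi_mapping_path,
        "--output_path", output_path,
        "--model_path", model_path,
        "--tokenizer_path", tokenizer_path,
        "--threshold", str(threshold),
    ]


command = build_inference_command(
    folder_path="project/data/",
    kpi_mapping_path="project/kpi_mapping.csv",
    output_path="project/output/",
    model_path="project/model/",
    tokenizer_path="project/model/",
)
# Show the command as a single shell-safe string for inspection.
print(shlex.join(command))
```

The list can be handed to `subprocess.run(command, check=True)` once the package is installed; passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.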

Training Data

Training File

To train the model, you need a CSV file with the following columns:
  • Question
  • Context
  • Label
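
Before starting a fine-tuning run, it can be worth checking that a training file actually contains these columns. The following is a minimal sketch using only the standard library; the helper name `missing_columns` is hypothetical, the sample row is illustrative, and a comma delimiter is assumed.

```python
import csv
import io

# Column names as listed above.
REQUIRED_COLUMNS = {"Question", "Context", "Label"}


def missing_columns(csv_file) -> set[str]:
    """Return the required columns that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    # fieldnames is None for an empty file, so fall back to an empty list.
    header = set(reader.fieldnames or [])
    return REQUIRED_COLUMNS - header


# Illustrative in-memory file; a real file would be passed via open(path, newline="").
sample = io.StringIO(
    "Question,Context,Label\n"
    "What is the company name?,Some paragraph from a report.,0\n"
)
print(missing_columns(sample))  # an empty set means the header is complete
```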

Alternatively, the output of the https://github.com/os-climate/osc-transformer-presteps module can be used as training data. That output looks like the following sample data:

training_data.csv

| Question | Context | Label | Company | Source File | Source Page | KPI ID | Year | Answer | Data Type | Annotator | Index |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| What is the company name? | The Company is exposed to a risk of by losses counterparties their contractual financial obligations when due, and in particular depends on the reliability of banks the Company deposits its available cash. | 0 | NOVATEK | 04_NOVATEK_AR_2016_ENG_11.pdf | ['0'] | 0 | 2016 | PAO NOVATEK | TEXT | train_anno_large.xlsx | 1022 |

KPI Mapping File

The inference command requires a kpi_mapping.csv file, which looks like this:

kpi_mapping.csv

| kpi_id | question | sectors | add_year | kpi_category |
| --- | --- | --- | --- | --- |
| 1 | In which year was the annual report or the sustainability report published? | OG, CM, CU | FALSE | TEXT |
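
To inspect a mapping file programmatically, the rows can be read into typed records. This is a minimal sketch and not the package's own loader: the `KpiMapping` class and `load_kpi_mapping` function are hypothetical, a comma-delimited file is assumed, and the interpretation of `sectors` as a comma-separated list and `add_year` as a boolean is inferred from the sample row above.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class KpiMapping:
    kpi_id: int
    question: str
    sectors: list[str]
    add_year: bool
    kpi_category: str


def load_kpi_mapping(csv_file) -> list[KpiMapping]:
    """Parse KPI mapping rows from an open CSV file object."""
    rows = []
    for row in csv.DictReader(csv_file):
        rows.append(
            KpiMapping(
                kpi_id=int(row["kpi_id"]),
                question=row["question"],
                # Assumption: sectors are stored as "OG, CM, CU" in one cell.
                sectors=[s.strip() for s in row["sectors"].split(",")],
                # Assumption: add_year is spelled TRUE or FALSE.
                add_year=row["add_year"].strip().upper() == "TRUE",
                kpi_category=row["kpi_category"],
            )
        )
    return rows


# In-memory copy of the sample row; the sectors cell is quoted because it contains commas.
sample = io.StringIO(
    "kpi_id,question,sectors,add_year,kpi_category\n"
    "1,In which year was the annual report or the sustainability report published?,"
    '"OG, CM, CU",FALSE,TEXT\n'
)
for mapping in load_kpi_mapping(sample):
    print(mapping)
```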

Developer Notes

Use the code directly, without the CLI, via the GitHub repository

First clone the repository to your local environment:

$ git clone https://github.com/os-climate/osc-transformer-based-extractor/

We use pdm to manage packages and tox as a stable test framework. First install pdm (possibly in a virtual environment) via:

$ pip install pdm

Afterwards, sync your system via:

$ pdm sync

The repository contains several demos that show how to proceed. See the folder [here](demo).

pdm

Use pdm to add new dependencies. New packages are added via pdm add, for example numpy:

$ pdm add numpy

For a very detailed description check the homepage of the pdm project:

https://pdm-project.org/en/latest/

tox

We use tox to run the linting tools and tests. Run it outside of your virtual environment:

$ pip install tox
$ tox -e lint
$ tox -e test

These commands automatically apply some checks to your code and run the provided pytest tests. For more details on tox, see the homepage of the tox project:

https://tox.wiki/en/4.16.0/

Contributing

Contributions are welcome! Please fork the repository and submit a pull request for any enhancements or bug fixes.

All contributions (including pull requests) must agree to the Developer Certificate of Origin (DCO) version 1.1. This is exactly the same one created and used by the Linux kernel developers and posted on http://developercertificate.org/. This is a developer's certification that he or she has the right to submit the patch for inclusion into the project. Simply submitting a contribution implies this agreement. However, please include a "Signed-off-by" tag in every patch (this tag is a conventional way to confirm that you agree to the DCO).

On June 26, 2024, the Linux Foundation announced the merger of its financial services umbrella, the Fintech Open Source Foundation ([FINOS](https://finos.org)), with OS-Climate, an open source community dedicated to building data technologies, modeling, and analytic tools that will drive global capital flows into climate change mitigation and resilience. OS-Climate projects are in the process of transitioning to the [FINOS governance framework](https://community.finos.org/docs/governance). Read more at [finos.org/press/finos-join-forces-os-open-source-climate-sustainability-esg](https://finos.org/press/finos-join-forces-os-open-source-climate-sustainability-esg).
