OS-Climate Data Extraction Tool


This project provides a CLI tool and Python scripts to train Transformer models (via Hugging Face) for two primary tasks:

  1. Relevance Detection: Determines whether a question-context pair is relevant.
  2. KPI Detection: Fine-tunes models to extract key performance indicators (KPIs) from datasets such as annual reports, and runs inference with the fine-tuned model.

Quick Start

To install the tool, use pip:

$ pip install osc-transformer-based-extractor

After installation, you can access the CLI tool with:

$ osc-transformer-based-extractor

Run without arguments, this command lists the available commands and prints help text generated by Typer, our CLI library.

Commands and Workflow

1. Relevance Detection

Fine-tuning the Model:

Assume your project structure looks like this:

project/
│
├── kpi_mapping.csv
├── training_data.csv
├── data/
│   └── (JSON files for inference)
├── model/
│   └── (Model-related files)
├── saved__model/
│   └── (Output from training)
├── output/
│   └── (Results from inference)
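
If the folders do not exist yet, a small helper can create them. The following is a minimal sketch, not part of the tool: `scaffold_project` and `PROJECT_DIRS` are hypothetical names, and only the directory names are taken from the layout above.

```python
from pathlib import Path

# Directory names come from the layout above; the helper itself is hypothetical.
PROJECT_DIRS = ("data", "model", "saved__model", "output")


def scaffold_project(root):
    """Create the project folders if they are missing and return their paths."""
    root = Path(root)
    created = []
    for name in PROJECT_DIRS:
        folder = root / name
        # parents=True also creates the root; exist_ok=True makes reruns safe.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling `scaffold_project("project")` leaves existing files untouched, so it is safe to run again after training data has been added.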

Use the following command to fine-tune the model:

$ osc-transformer-based-extractor relevance-detector fine-tune \
  --data_path "project/training_data.csv" \
  --model_name "bert-base-uncased" \
  --num_labels 2 \
  --max_length 128 \
  --epochs 3 \
  --batch_size 16 \
  --output_dir "project/saved__model/" \
  --save_steps 500
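
To launch the same fine-tuning run from a Python script, the command can be assembled as an argument list and passed to `subprocess`. This is a sketch under assumptions: the function names are hypothetical, and the flags and default values are copied from the command above rather than from the tool's source.

```python
import subprocess


def build_fine_tune_command(data_path, output_dir, model_name="bert-base-uncased",
                            num_labels=2, max_length=128, epochs=3,
                            batch_size=16, save_steps=500):
    """Return the relevance-detector fine-tune call as an argument list."""
    return [
        "osc-transformer-based-extractor", "relevance-detector", "fine-tune",
        "--data_path", str(data_path),
        "--model_name", model_name,
        "--num_labels", str(num_labels),
        "--max_length", str(max_length),
        "--epochs", str(epochs),
        "--batch_size", str(batch_size),
        "--output_dir", str(output_dir),
        "--save_steps", str(save_steps),
    ]


def run_fine_tune(**kwargs):
    """Run the CLI; raises CalledProcessError if it exits with a non-zero code."""
    subprocess.run(build_fine_tune_command(**kwargs), check=True)
```

For example, `run_fine_tune(data_path="project/training_data.csv", output_dir="project/saved__model/")` would start the run shown above. Passing a list instead of a single string avoids shell quoting problems with paths that contain spaces.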

Running Inference:

$ osc-transformer-based-extractor relevance-detector perform-inference \
  --folder_path "project/data/" \
  --kpi_mapping_path "project/kpi_mapping.csv" \
  --output_path "project/output/" \
  --model_path "project/model/" \
  --tokenizer_path "project/model/" \
  --threshold 0.5
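
The documentation above does not spell out how `--threshold` is applied. One common reading for a two-label classifier, shown here purely as an assumption, is that a question-context pair counts as relevant when the softmax probability of label 1 reaches the threshold. The function names below are hypothetical.

```python
import math


def relevance_probability(logits):
    """Softmax probability of class index 1 from a pair of logits."""
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    return exps[1] / sum(exps)


def is_relevant(logits, threshold=0.5):
    """Assumed decision rule: relevant if P(label 1) >= threshold."""
    return relevance_probability(logits) >= threshold
```

Under this reading, raising the threshold yields fewer but more confident matches, and lowering it yields more candidates for the KPI detection step.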

2. KPI Detection

The KPI detection functionality includes fine-tuning and inference.

Fine-tuning the KPI Model:

Assume your project structure looks like this:

project/
│
├── kpi_mapping.csv
├── training_data.csv
│
├── model/
│   └── (model-related files, e.g., tokenizer, config, checkpoints)
│
├── saved__model/
│   └── (Folder to store output from fine-tuning)
│
├── output/
│   └── (output files, e.g., inference_results.xlsx)

Use the following command to fine-tune the KPI model:

$ osc-transformer-based-extractor kpi-detection fine-tune \
    --data_path "project/training_data.csv" \
    --model_name "bert-base-uncased" \
    --max_length 128 \
    --epochs 3 \
    --batch_size 16 \
    --learning_rate 5e-5 \
    --output_dir "project/saved__model/" \
    --save_steps 500

Performing Inference:

$ osc-transformer-based-extractor kpi-detection inference \
    --data_file_path "project/data/input_dataset.csv" \
    --output_path "project/output/inference_results.xlsx" \
    --model_path "project/model/"

Training Data Requirements

  1. Relevance Detection Training File:

The training file should have the following columns:

- Question
- Context
- Label

Example:

Training Data Example

| Question                  | Context                           | Label |
|---------------------------|-----------------------------------|-------|
| What is the company name? | The Company is exposed to a risk… | 0     |
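
Before starting a training run, it can help to confirm that the file has the three required columns. The sketch below assumes a comma-separated file with a header row; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names, not part of the tool.

```python
import csv

# Column names as listed above for the relevance detection training file.
REQUIRED_COLUMNS = ("Question", "Context", "Label")


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []  # fieldnames is None for an empty file
    return [name for name in required if name not in header]
```

A typical call would be `with open("project/training_data.csv", newline="") as f: print(missing_columns(f))`, where an empty list means the header is complete.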

  2. KPI Detection Training File:

For KPI detection, the dataset should have these additional columns:

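KPI Detection Training Example

| Question                  | Context | Label | Company | Source File                   | KPI ID | Year | Answer      | Data Type |
|---------------------------|---------|-------|---------|-------------------------------|--------|------|-------------|-----------|
| What is the company name? |         | 0     | NOVATEK | 04_NOVATEK_AR_2016_ENG_11.pdf | 0      | 2016 | PAO NOVATEK | TEXT      |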

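  3. KPI Mapping File:

KPI Mapping File Example

| kpi_id | question                             | sectors    | add_year | kpi_category |
|--------|--------------------------------------|------------|----------|--------------|
| 1      | In which year was the annual report… | OG, CM, CU | FALSE    | TEXT         |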
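
The mapping file can be read into plain Python structures for inspection. This is a sketch under assumptions: it expects a comma-separated file whose header matches the column names above, treats `sectors` as a comma-separated list, and treats `add_year` as the text `TRUE` or `FALSE`. `load_kpi_mapping` is a hypothetical helper, and the tool itself may parse the file differently.

```python
import csv


def load_kpi_mapping(csv_file):
    """Parse KPI mapping rows into dictionaries with typed fields."""
    rows = []
    for row in csv.DictReader(csv_file):
        rows.append({
            "kpi_id": row["kpi_id"],  # kept as text; the ID format is not specified
            "question": row["question"],
            # "OG, CM, CU" becomes ["OG", "CM", "CU"]
            "sectors": [s.strip() for s in row["sectors"].split(",") if s.strip()],
            # Assumed encoding: the literal text TRUE or FALSE
            "add_year": row["add_year"].strip().upper() == "TRUE",
            "kpi_category": row["kpi_category"],
        })
    return rows
```

For example, `with open("project/kpi_mapping.csv", newline="") as f: mapping = load_kpi_mapping(f)` returns one dictionary per KPI.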

Developer Notes

Local Development

Clone the repository:

$ git clone https://github.com/os-climate/osc-transformer-based-extractor/

We use pdm for package management and tox for testing.

  1. Install pdm:

    $ pip install pdm
  2. Sync dependencies:

    $ pdm sync
  3. Add new packages (e.g., numpy):

    $ pdm add numpy
  4. Run tox for linting and testing:

    $ pip install tox
    $ tox -e lint
    $ tox -e test

Contributing

We welcome contributions! Please fork the repository and submit a pull request. Ensure you sign off each commit with the Developer Certificate of Origin (DCO). Read more: http://developercertificate.org/.

Governance Transition

On June 26, 2024, the Linux Foundation announced the merger of FINOS with OS-Climate. Projects are now transitioning to the [FINOS governance framework](https://community.finos.org/docs/governance).
