OS-Climate Data Extraction Tool

An OS-Climate Project Join OS-Climate on Slack Source code on GitHub PyPI package Build Status Built using PDM Project generated with PyScaffold

This project provides a CLI tool and Python scripts to train Transformer models (via Hugging Face) for two primary tasks:

  1. Relevance Detection: Determines whether a question-context pair is relevant.
  2. KPI Detection: Fine-tunes models to extract key performance indicators (KPIs) from datasets such as annual reports, and runs inference with them.

Quick Start

To install the tool, use pip:

$ pip install osc-transformer-based-extractor

After installation, you can access the CLI tool with:

$ osc-transformer-based-extractor

Run without arguments, this command lists the available commands and their help text. The output is generated by Typer, the CLI library the tool is built on.

Commands and Workflow

1. Relevance Detection

Fine-tuning the Model:

Assume your project structure looks like this:

project/
│
├── kpi_mapping.csv
├── training_data.csv
├── data/
│   └── (JSON files for inference)
├── model/
│   └── (Model-related files)
├── saved__model/
│   └── (Output from training)
├── output/
│   └── (Results from inference)
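If the folders do not exist yet, the layout above can be created with a few lines of Python. This is a minimal sketch: the helper name `create_project_layout` is hypothetical and not part of the package, and only the directory names are taken from the tree above.

```python
from pathlib import Path

# Directory names copied from the project tree; the two CSV files
# (kpi_mapping.csv, training_data.csv) must be supplied separately.
SUBDIRS = ("data", "model", "saved__model", "output")


def create_project_layout(root):
    """Create the project folder and its subdirectories if they are missing."""
    root = Path(root)
    for name in SUBDIRS:
        # exist_ok=True makes the call safe to repeat on an existing layout.
        (root / name).mkdir(parents=True, exist_ok=True)
    return root
```

Calling `create_project_layout("project")` from the parent directory produces the folders shown in the tree.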

Use the following command to fine-tune the model:

$ osc-transformer-based-extractor relevance-detector fine-tune \
  --data_path "project/training_data.csv" \
  --model_name "bert-base-uncased" \
  --num_labels 2 \
  --max_length 128 \
  --epochs 3 \
  --batch_size 16 \
  --output_dir "project/saved__model/" \
  --save_steps 500
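When the same fine-tuning run has to be launched from a script, it can help to assemble the argument list in Python. The sketch below only rebuilds the command shown above; the function name `build_fine_tune_command` is hypothetical, and the defaults are simply the values from the example, not defaults documented by the tool.

```python
import shlex


def build_fine_tune_command(data_path, output_dir, model_name="bert-base-uncased",
                            num_labels=2, max_length=128, epochs=3,
                            batch_size=16, save_steps=500):
    """Return the relevance-detector fine-tune command as an argument list."""
    return [
        "osc-transformer-based-extractor", "relevance-detector", "fine-tune",
        "--data_path", str(data_path),
        "--model_name", model_name,
        "--num_labels", str(num_labels),
        "--max_length", str(max_length),
        "--epochs", str(epochs),
        "--batch_size", str(batch_size),
        "--output_dir", str(output_dir),
        "--save_steps", str(save_steps),
    ]


command = build_fine_tune_command("project/training_data.csv", "project/saved__model/")
# Print a shell-quoted version for inspection before running anything.
print(shlex.join(command))
```

The list can then be passed to `subprocess.run(command, check=True)` once the package is installed.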

Running Inference:

$ osc-transformer-based-extractor relevance-detector perform-inference \
  --folder_path "project/data/" \
  --kpi_mapping_path "project/kpi_mapping.csv" \
  --output_path "project/output/" \
  --model_path "project/model/" \
  --tokenizer_path "project/model/" \
  --threshold 0.5
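How the tool applies `--threshold` internally is not described here. A common convention for a two-label classifier, assumed in the sketch below, is to turn the two output logits into probabilities with a softmax and mark the pair as relevant when the probability of label 1 reaches the threshold. The function names are illustrative only.

```python
import math


def relevance_probability(logits):
    """Softmax probability of label 1 for a pair of logits [label 0, label 1]."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    return exps[1] / sum(exps)


def is_relevant(logits, threshold=0.5):
    """Assumed decision rule: relevant when P(label 1) >= threshold."""
    return relevance_probability(logits) >= threshold
```

Under this assumption, raising the threshold keeps fewer, more confident question-context pairs, and lowering it keeps more.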

2. KPI Detection

The KPI detection functionality includes fine-tuning and inference.

Fine-tuning the KPI Model:

Assume your project structure looks like this:

project/
│
├── kpi_mapping.csv
├── training_data.csv
│
├── model/
│   └── (model-related files, e.g., tokenizer, config, checkpoints)
│
├── saved__model/
│   └── (Folder to store output from fine-tuning)
│
├── output/
│   └── (output files, e.g., inference_results.xlsx)

Use the following command to fine-tune the model:

$ osc-transformer-based-extractor kpi-detection fine-tune \
    --data_path "project/training_data.csv" \
    --model_name "bert-base-uncased" \
    --max_length 128 \
    --epochs 3 \
    --batch_size 16 \
    --learning_rate 5e-5 \
    --output_dir "project/saved__model/" \
    --save_steps 500

Performing Inference:

$ osc-transformer-based-extractor kpi-detection inference \
    --data_file_path "project/data/input_dataset.csv" \
    --output_path "project/output/inference_results.xlsx" \
    --model_path "project/model/"

Training Data Requirements

  1. Relevance Detection Training File:

The training file should have the following columns:

  - Question
  - Context
  - Label

Example:

Training Data Example

Question                   Context                            Label
-------------------------  ---------------------------------  -----
What is the company name?  The Company is exposed to a risk…  0
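Before starting a fine-tuning run, it can be worth checking that the training file really has the three columns listed above. The sketch below uses only the standard library. It assumes a comma-separated file and, because the fine-tuning example passes `--num_labels 2`, that labels are 0 or 1; neither assumption is confirmed by the tool itself, and `check_training_rows` is a hypothetical helper.

```python
import csv
import io

REQUIRED_COLUMNS = ("Question", "Context", "Label")


def check_training_rows(csv_text):
    """Parse CSV text and raise ValueError on missing columns or bad labels."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = list(reader)
    for number, row in enumerate(rows, start=1):
        # Assumption: binary labels, matching --num_labels 2 in the example.
        if row["Label"] not in {"0", "1"}:
            raise ValueError(f"row {number}: Label must be 0 or 1, got {row['Label']!r}")
    return rows


sample = (
    "Question,Context,Label\n"
    "What is the company name?,The Company is exposed to a risk…,0\n"
)
rows = check_training_rows(sample)
```

For a file on disk, the text can be read with `Path("project/training_data.csv").read_text(encoding="utf-8")` and passed to the same function.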

  2. KPI Detection Training File:

For KPI detection, the dataset should have these additional columns:

KPI Detection Training Example

Column       Example value
-----------  -----------------------------
Question     What is the company name?
Context      (empty)
Label        0
Company      NOVATEK
Source File  04_NOVATEK_AR_2016_ENG_11.pdf
KPI ID       0
Year         2016
Answer       PAO NOVATEK
Data Type    TEXT

  3. KPI Mapping File:

KPI Mapping File Example

kpi_id  question                              sectors     add_year  kpi_category
------  ------------------------------------  ----------  --------  ------------
1       In which year was the annual report…  OG, CM, CU  FALSE     TEXT
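To illustrate how the mapping columns fit together, the sketch below reads a mapping file into a dictionary keyed by `kpi_id`. It is not the package's own loader. It assumes a comma-separated file in which `sectors` is a quoted, comma-separated list and `add_year` is written as TRUE or FALSE, as in the example row; `parse_kpi_mapping` is a hypothetical helper.

```python
import csv
import io


def parse_kpi_mapping(csv_text):
    """Return {kpi_id: {...}} with sectors split into a list and add_year as a bool."""
    mapping = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        mapping[int(row["kpi_id"])] = {
            "question": row["question"],
            # "OG, CM, CU" becomes ["OG", "CM", "CU"]
            "sectors": [code.strip() for code in row["sectors"].split(",")],
            # Assumption: the flag is spelled TRUE/FALSE, case-insensitively.
            "add_year": row["add_year"].strip().upper() == "TRUE",
            "kpi_category": row["kpi_category"],
        }
    return mapping


sample = (
    "kpi_id,question,sectors,add_year,kpi_category\n"
    '1,In which year was the annual report…,"OG, CM, CU",FALSE,TEXT\n'
)
kpis = parse_kpi_mapping(sample)
```

Keying by `kpi_id` makes it straightforward to look up the question text for a KPI ID found in the training data.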

Developer Notes

Local Development

Clone the repository:

$ git clone https://github.com/os-climate/osc-transformer-based-extractor/

We use pdm for package management and tox for testing.

  1. Install pdm:

    $ pip install pdm

  2. Sync dependencies:

    $ pdm sync

  3. Add new packages (e.g., numpy):

    $ pdm add numpy

  4. Run tox for linting and testing:

    $ pip install tox
    $ tox -e lint
    $ tox -e test

Contributing

We welcome contributions! Please fork the repository and submit a pull request. Ensure you sign off each commit with the Developer Certificate of Origin (DCO). Read more: http://developercertificate.org/.

Governance Transition

On June 26, 2024, the Linux Foundation announced the merger of FINOS with OS-Climate. Projects are now transitioning to the [FINOS governance framework](https://community.finos.org/docs/governance).
