OS-Climate Data Extraction Tool

An OS-Climate Project Join OS-Climate on Slack Source code on GitHub PyPI package Build Status Built using PDM Project generated with PyScaffold

This project provides a CLI tool and Python scripts to train Transformer models (via Hugging Face) for two primary tasks:

  1. Relevance Detection: Determines whether a question-context pair is relevant.
  2. KPI Detection: Fine-tunes models to extract key performance indicators (KPIs) from datasets such as annual reports, and runs inference with the fine-tuned models.

Quick Start

To install the tool, use pip:

$ pip install osc-transformer-based-extractor

After installation, you can access the CLI tool with:

$ osc-transformer-based-extractor

This command will show the available commands and help via Typer, our CLI library.

Commands and Workflow

1. Relevance Detection

Fine-tuning the Model:

Assume your project structure looks like this:

project/
│
├── kpi_mapping.csv
├── training_data.csv
├── data/
│   └── (JSON files for inference)
├── model/
│   └── (Model-related files)
├── saved__model/
│   └── (Output from training)
├── output/
│   └── (Results from inference)

Use the following command to fine-tune the model:

$ osc-transformer-based-extractor relevance-detector fine-tune \
  --data_path "project/training_data.csv" \
  --model_name "bert-base-uncased" \
  --num_labels 2 \
  --max_length 128 \
  --epochs 3 \
  --batch_size 16 \
  --output_dir "project/saved__model/" \
  --save_steps 500
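
When the fine-tuning run is launched from a Python script rather than a shell, the same command can be assembled as an argument list. The sketch below is only an illustration: the helper name build_fine_tune_command is hypothetical and not part of the package, and the flags and values are copied from the command above. It builds and prints the command without running it.

```python
import shlex


def build_fine_tune_command(options):
    """Return the relevance-detector fine-tune command as an argument list.

    Hypothetical helper for illustration; it is not part of the package.
    """
    command = ["osc-transformer-based-extractor", "relevance-detector", "fine-tune"]
    for flag, value in options.items():
        # Each option becomes "--flag value", mirroring the shell command.
        command += [f"--{flag}", str(value)]
    return command


options = {
    "data_path": "project/training_data.csv",
    "model_name": "bert-base-uncased",
    "num_labels": 2,
    "max_length": 128,
    "epochs": 3,
    "batch_size": 16,
    "output_dir": "project/saved__model/",
    "save_steps": 500,
}
command = build_fine_tune_command(options)
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```

Passing a list to subprocess.run avoids shell quoting problems when paths contain spaces.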

Running Inference:

$ osc-transformer-based-extractor relevance-detector perform-inference \
  --folder_path "project/data/" \
  --kpi_mapping_path "project/kpi_mapping.csv" \
  --output_path "project/output/" \
  --model_path "project/model/" \
  --tokenizer_path "project/model/" \
  --threshold 0.5
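
The --threshold option suggests that a relevance score is compared against a cutoff. How the tool computes that score is not documented here, so the sketch below rests on assumptions made purely for illustration: the model emits two logits (not relevant, relevant), softmax turns them into probabilities, and a pair counts as relevant when the probability of the relevant class reaches the threshold. The function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_relevant(logits, threshold=0.5):
    """Assumed rule: relevant when P(class 1) >= threshold."""
    probability_relevant = softmax(logits)[1]
    return probability_relevant >= threshold


print(is_relevant([0.2, 1.5]))                 # class 1 dominates
print(is_relevant([2.0, -1.0]))                # class 0 dominates
print(is_relevant([0.2, 1.5], threshold=0.9))  # stricter cutoff
```

Under these assumptions, raising the threshold trades recall for precision: fewer pairs are marked relevant, but those that are carry a higher score.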

2. KPI Detection

The KPI detection functionality includes fine-tuning and inference.

Fine-tuning the KPI Model:

Assume your project structure looks like this:

project/
│
├── kpi_mapping.csv
├── training_data.csv
│
├── model/
│   └── (model-related files, e.g., tokenizer, config, checkpoints)
│
├── saved__model/
│   └── (Folder to store output from fine-tuning)
│
├── output/
│   └── (output files, e.g., inference_results.xlsx)

Use the following command to fine-tune the model:

$ osc-transformer-based-extractor kpi-detection fine-tune \
    --data_path "project/training_data.csv" \
    --model_name "bert-base-uncased" \
    --max_length 128 \
    --epochs 3 \
    --batch_size 16 \
    --learning_rate 5e-5 \
    --output_dir "project/saved__model/" \
    --save_steps 500

Performing Inference:

$ osc-transformer-based-extractor kpi-detection inference \
    --data_file_path "project/data/input_dataset.csv" \
    --output_path "project/output/inference_results.xlsx" \
    --model_path "project/model/"

Training Data Requirements

  1. Relevance Detection Training File:

The training file should have the following columns:

  - Question
  - Context
  - Label

Example:

Training Data Example

Question                   Context                            Label
-------------------------  ---------------------------------  -----
What is the company name?  The Company is exposed to a risk…  0
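
Before starting a fine-tuning run, it can help to confirm that the training file has the expected shape. The sketch below is a minimal check, assuming the file is a CSV whose header row uses exactly the column names listed above and that labels are 0 or 1 (consistent with --num_labels 2). The helper name check_relevance_rows is hypothetical, and the sample row is the example from this section.

```python
import csv
import io

REQUIRED_COLUMNS = ("Question", "Context", "Label")


def check_relevance_rows(handle):
    """Return a list of problems found in a relevance training CSV."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    problems = []
    # Data rows start on line 2, after the header row.
    for line_number, row in enumerate(reader, start=2):
        if row["Label"] not in ("0", "1"):  # assumption: binary labels
            problems.append(f"line {line_number}: unexpected label {row['Label']!r}")
    return problems


sample = io.StringIO(
    "Question,Context,Label\n"
    "What is the company name?,The Company is exposed to a risk…,0\n"
)
print(check_relevance_rows(sample))
```

For a real file, replace the in-memory sample with open("project/training_data.csv", newline="", encoding="utf-8").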

  2. KPI Detection Training File:

For KPI detection, the dataset should have these additional columns:

KPI Detection Training Example

Question                   Context  Label  Company  Source File                    KPI ID  Year  Answer       Data Type
-------------------------  -------  -----  -------  -----------------------------  ------  ----  -----------  ---------
What is the company name?           0      NOVATEK  04_NOVATEK_AR_2016_ENG_11.pdf  0       2016  PAO NOVATEK  TEXT

  3. KPI Mapping File:

KPI Mapping File Example

kpi_id  question                              sectors     add_year  kpi_category
------  ------------------------------------  ----------  --------  ------------
1       In which year was the annual report…  OG, CM, CU  FALSE     TEXT
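
The mapping file can be read into typed records for inspection. The sketch below assumes a CSV with the header names shown above, a sectors cell that holds a comma-separated list, and an add_year cell that holds TRUE or FALSE as text. None of this is confirmed beyond the example row, and load_kpi_mapping is a hypothetical helper name.

```python
import csv
import io


def load_kpi_mapping(handle):
    """Parse KPI mapping rows into dictionaries with typed fields."""
    mapping = []
    for row in csv.DictReader(handle):
        mapping.append({
            "kpi_id": row["kpi_id"],  # kept as text; the ID format is not specified
            "question": row["question"],
            # Assumption: sectors are separated by commas inside one cell.
            "sectors": [s.strip() for s in row["sectors"].split(",") if s.strip()],
            # Assumption: the flag is spelled TRUE or FALSE.
            "add_year": row["add_year"].strip().upper() == "TRUE",
            "kpi_category": row["kpi_category"],
        })
    return mapping


sample = io.StringIO(
    "kpi_id,question,sectors,add_year,kpi_category\n"
    '1,In which year was the annual report…,"OG, CM, CU",FALSE,TEXT\n'
)
for entry in load_kpi_mapping(sample):
    print(entry["kpi_id"], entry["sectors"], entry["add_year"])
```

Note that the sectors cell must be quoted in the CSV because it contains commas itself.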

Developer Notes

Local Development

Clone the repository:

$ git clone https://github.com/os-climate/osc-transformer-based-extractor/

We use pdm for package management and tox for testing.

  1. Install pdm:

    $ pip install pdm

  2. Sync dependencies:

    $ pdm sync

  3. Add new packages (e.g., numpy):

    $ pdm add numpy

  4. Run tox for linting and testing:

    $ pip install tox
    $ tox -e lint
    $ tox -e test

Contributing

We welcome contributions! Please fork the repository and submit a pull request. Sign off each commit to certify it under the Developer Certificate of Origin (DCO), for example with git commit -s, which adds a Signed-off-by line to the commit message. Read more: http://developercertificate.org/.

Governance Transition

On June 26, 2024, the Linux Foundation announced the merger of FINOS with OS-Climate. Projects are now transitioning to the FINOS governance framework (https://community.finos.org/docs/governance).
