OS-Climate Data Extraction Tool
This project provides a CLI tool and Python scripts to train Transformer models (via Hugging Face) for two primary tasks:
1. Relevance Detection: Determines whether a question-context pair is relevant.
2. KPI Detection: Fine-tunes models to extract key performance indicators (KPIs) from datasets such as annual reports, and runs inference with the fine-tuned models.
Quick Start
To install the tool, use pip:
$ pip install osc-transformer-based-extractor
After installation, you can access the CLI tool with:
$ osc-transformer-based-extractor
Run without arguments, this command lists the available commands and their help text, which are generated by Typer, our CLI library.
Commands and Workflow
1. Relevance Detection
Fine-tuning the Model:
Assume your project structure looks like this:
project/
│
├── kpi_mapping.csv
├── training_data.csv
├── data/
│ └── (JSON files for inference)
├── model/
│ └── (Model-related files)
├── saved__model/
│ └── (Output from training)
├── output/
│ └── (Results from inference)
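The folders in this layout can be created up front with a short script. This is a minimal sketch rather than part of the tool: the function name `create_project_layout` is hypothetical, and it only creates the four directories; `kpi_mapping.csv` and `training_data.csv` still have to be supplied by you.

```python
from pathlib import Path

# Folder names taken from the layout shown above.
FOLDERS = ("data", "model", "saved__model", "output")


def create_project_layout(root):
    """Create the project folders under `root` and return them keyed by name."""
    root = Path(root)
    folders = {name: root / name for name in FOLDERS}
    for folder in folders.values():
        # parents=True also creates `root`; exist_ok=True makes reruns harmless.
        folder.mkdir(parents=True, exist_ok=True)
    return folders
```

Calling `create_project_layout("project")` from the directory that should contain the project produces the structure above, minus the two CSV files.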
Use the following command to fine-tune the model:
$ osc-transformer-based-extractor relevance-detector fine-tune \
--data_path "project/training_data.csv" \
--model_name "bert-base-uncased" \
--num_labels 2 \
--max_length 128 \
--epochs 3 \
--batch_size 16 \
--output_dir "project/saved__model/" \
--save_steps 500
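If you prefer to launch fine-tuning from a Python script, the same command can be assembled as an argument list. The sketch below only builds and prints the command; the helper name `build_fine_tune_command` is hypothetical, and the flags and default values are copied from the example above rather than from the tool's documented defaults.

```python
import shlex


def build_fine_tune_command(data_path, output_dir, model_name="bert-base-uncased",
                            num_labels=2, max_length=128, epochs=3,
                            batch_size=16, save_steps=500):
    """Return the relevance-detector fine-tune command as a list of arguments."""
    return [
        "osc-transformer-based-extractor", "relevance-detector", "fine-tune",
        "--data_path", str(data_path),
        "--model_name", model_name,
        "--num_labels", str(num_labels),
        "--max_length", str(max_length),
        "--epochs", str(epochs),
        "--batch_size", str(batch_size),
        "--output_dir", str(output_dir),
        "--save_steps", str(save_steps),
    ]


command = build_fine_tune_command("project/training_data.csv", "project/saved__model/")
print(shlex.join(command))  # shell-quoted form, handy for logging
```

Passing `command` to `subprocess.run(command, check=True)` would then execute it, provided the package is installed in the active environment.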
Running Inference:
$ osc-transformer-based-extractor relevance-detector perform-inference \
--folder_path "project/data/" \
--kpi_mapping_path "project/kpi_mapping.csv" \
--output_path "project/output/" \
--model_path "project/model/" \
--tokenizer_path "project/model/" \
--threshold 0.5
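The `--threshold` option suggests that a question-context pair is reported as relevant only when the model's score clears a cut-off. The sketch below illustrates that idea under explicit assumptions: that the model emits two logits, that a softmax turns them into probabilities, and that class 1 means "relevant". The function names are hypothetical, and whether the tool compares with `>=` or `>` is not stated here.

```python
import math


def relevance_probability(logits):
    """Softmax probability of class 1 (assumed to mean "relevant")."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - peak) for x in logits]
    return exps[1] / sum(exps)


def is_relevant(logits, threshold=0.5):
    """Assumed decision rule: relevant when the probability reaches the threshold."""
    return relevance_probability(logits) >= threshold
```

Under these assumptions, raising the threshold trades recall for precision: fewer pairs are kept, but the ones that remain are those the model is more confident about.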
2. KPI Detection
The KPI detection functionality includes fine-tuning and inference.
Fine-tuning the KPI Model:
Assume your project structure looks like this:
project/
│
├── kpi_mapping.csv
├── training_data.csv
│
├── model/
│ └── (model-related files, e.g., tokenizer, config, checkpoints)
│
├── saved__model/
│ └── (Folder to store output from fine-tuning)
│
├── output/
│ └── (output files, e.g., inference_results.xlsx)
Use the following command to fine-tune the model:
$ osc-transformer-based-extractor kpi-detection fine-tune \
--data_path "project/training_data.csv" \
--model_name "bert-base-uncased" \
--max_length 128 \
--epochs 3 \
--batch_size 16 \
--learning_rate 5e-5 \
--output_dir "project/saved__model/" \
--save_steps 500
Performing Inference:
$ osc-transformer-based-extractor kpi-detection inference \
--data_file_path "project/data/input_dataset.csv" \
--output_path "project/output/inference_results.xlsx" \
--model_path "project/model/"
Training Data Requirements
Relevance Detection Training File:
The training file should have the following columns:
- Question
- Context
- Label
Example:
Question | Context | Label |
---|---|---|
What is the company name? | The Company is exposed to a risk… | 0 |
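Before starting a fine-tuning run, it can help to check that the training file has the expected header. The following is a minimal sketch, not part of the tool. It assumes a comma-separated file and, because the fine-tuning example uses `--num_labels 2`, that labels are `0` or `1`; both are assumptions, as are the helper names.

```python
import csv
import io

REQUIRED_COLUMNS = ["Question", "Context", "Label"]

# Example row from the table above, written as comma-separated text (assumed delimiter).
SAMPLE = (
    "Question,Context,Label\n"
    "What is the company name?,The Company is exposed to a risk…,0\n"
)


def missing_columns(csv_text, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the header."""
    header = csv.DictReader(io.StringIO(csv_text)).fieldnames or []
    return [name for name in required if name not in header]


def invalid_label_rows(csv_text, allowed=("0", "1")):
    """Return 1-based data row numbers whose Label is not an allowed value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        number
        for number, row in enumerate(reader, start=1)
        if (row.get("Label") or "").strip() not in allowed
    ]
```

For a file on disk, read it with `Path("project/training_data.csv").read_text(encoding="utf-8")` and pass the result to both helpers; two empty lists mean the file passed these checks.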
KPI Detection Training File:
For KPI detection, the dataset needs the three columns above plus these additional columns:
Question | Context | Label | Company | Source File | KPI ID | Year | Answer | Data Type |
---|---|---|---|---|---|---|---|---|
What is the company name? | … | 0 | NOVATEK | 04_NOVATEK_AR_2016_ENG_11.pdf | 0 | 2016 | PAO NOVATEK | TEXT |
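One way to work with rows of this shape in your own scripts is to load them into typed records. This is an illustrative sketch only: the `KpiTrainingRow` class and `parse_rows` function are hypothetical, the file is assumed to be comma-separated, and treating Label, KPI ID, and Year as integers is inferred from the example row, not from a documented schema.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class KpiTrainingRow:
    question: str
    context: str
    label: int
    company: str
    source_file: str
    kpi_id: int
    year: int
    answer: str
    data_type: str


def parse_rows(csv_text):
    """Convert CSV text with the KPI training columns into typed records."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        rows.append(KpiTrainingRow(
            question=raw["Question"],
            context=raw["Context"],
            label=int(raw["Label"]),       # assumed integer
            company=raw["Company"],
            source_file=raw["Source File"],
            kpi_id=int(raw["KPI ID"]),     # assumed integer
            year=int(raw["Year"]),         # assumed integer
            answer=raw["Answer"],
            data_type=raw["Data Type"],
        ))
    return rows
```

A missing column raises `KeyError` and a non-numeric value raises `ValueError`, so malformed files fail early instead of partway through training.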
KPI Mapping File:
kpi_id | question | sectors | add_year | kpi_category |
---|---|---|---|---|
1 | In which year was the annual report… | OG, CM, CU | FALSE | TEXT |
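The mapping file packs several values into single cells, such as the comma-separated sector codes and the `TRUE`/`FALSE` flag. The sketch below shows one way to read such a file into a dictionary keyed by `kpi_id`. It is an assumption-laden illustration: the function name is hypothetical, the file is assumed to be comma-separated with the sectors cell quoted, and the tool's own parsing may differ.

```python
import csv
import io


def parse_kpi_mapping(csv_text):
    """Return {kpi_id: {...}} with sectors split into a list and add_year as a bool."""
    mapping = {}
    for raw in csv.DictReader(io.StringIO(csv_text)):
        mapping[int(raw["kpi_id"])] = {
            "question": raw["question"],
            # "OG, CM, CU" -> ["OG", "CM", "CU"]
            "sectors": [code.strip() for code in raw["sectors"].split(",") if code.strip()],
            # Assumes the flag is spelled TRUE or FALSE, as in the example row.
            "add_year": raw["add_year"].strip().upper() == "TRUE",
            "kpi_category": raw["kpi_category"].strip(),
        }
    return mapping
```

Because the sectors cell itself contains commas, it has to be quoted in a comma-separated file, for example `"OG, CM, CU"`.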
Developer Notes
Local Development
Clone the repository:
$ git clone https://github.com/os-climate/osc-transformer-based-extractor/
We use pdm for package management and tox for testing.
Install pdm:
$ pip install pdm
Sync dependencies:
$ pdm sync
Add new packages (e.g., numpy):
$ pdm add numpy
Run tox for linting and testing:
$ pip install tox
$ tox -e lint
$ tox -e test
Contributing
We welcome contributions! Please fork the repository and submit a pull request. Ensure you sign off each commit with the Developer Certificate of Origin (DCO). Read more: http://developercertificate.org/.
Governance Transition
On June 26, 2024, the Linux Foundation announced the merger of FINOS with OS-Climate. Projects are now transitioning to the [FINOS governance framework](https://community.finos.org/docs/governance).