OS-Climate Data Extraction Tool
This project provides a CLI tool and Python scripts to train a HuggingFace Transformer model or a local Transformer model and to perform inference with it. The primary goal of the inference is to determine how relevant a given context is to a given question.
Quick Start
To install the OSC Transformer Based Extractor CLI, use pip:
$ pip install osc-transformer-based-extractor
Afterwards you can use the tooling as a CLI tool by simply typing:
$ osc-transformer-based-extractor
The CLI is built with typer. All options and help texts are available in the CLI tool itself, so they are not described in more detail here.
Example: Assume the folder structure looks like this:
project/
│
├── kpi_mapping.csv
├── training_data.csv
├── data/
│ └── (json files for inference command)
├── model/
│ └── (model-related files go here)
├── saved__model/
│ └── (output files of trained models)
├── output/
│ └── (output files from inference command)
With this layout in place, and after installing osc-transformer-based-extractor, run the following command to fine-tune the model on the data:
$ osc-transformer-based-extractor relevance-detector fine-tune \
--data_path "project/training_data.csv" \
--model_name "bert-base-uncased" \
--num_labels 2 \
--max_length 128 \
--epochs 3 \
--batch_size 16 \
--output_dir "project/saved__model/" \
--save_steps 500
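If you would rather launch the fine-tuning run from a Python script, the command above can be assembled as an argument list. This is only a minimal sketch and not part of the package: the helper name `build_fine_tune_command` is hypothetical, and the flags and values are copied from the example command above.

```python
import shlex


def build_fine_tune_command(options: dict) -> list:
    """Assemble the CLI call shown above as an argument list (hypothetical helper)."""
    command = ["osc-transformer-based-extractor", "relevance-detector", "fine-tune"]
    for flag, value in options.items():
        # Each option becomes "--flag value", exactly as in the shell example.
        command += [f"--{flag}", str(value)]
    return command


# Values copied from the example command above.
options = {
    "data_path": "project/training_data.csv",
    "model_name": "bert-base-uncased",
    "num_labels": 2,
    "max_length": 128,
    "epochs": 3,
    "batch_size": 16,
    "output_dir": "project/saved__model/",
    "save_steps": 500,
}

command = build_fine_tune_command(options)
print(shlex.join(command))

# To actually start the run (requires the package to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting issues with paths that contain spaces.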
To perform inference, run the following command:
$ osc-transformer-based-extractor relevance-detector perform-inference \
--folder_path "project/data/" \
--kpi_mapping_path "project/kpi_mapping.csv" \
--output_path "project/output/" \
--model_path "project/model/" \
--tokenizer_path "project/model/" \
--threshold 0.5
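This README does not spell out how `--threshold` is applied internally. One common reading for a two-label classifier is that the model's raw scores are turned into probabilities and a pair counts as relevant when the probability of the "relevant" label reaches the threshold. The sketch below illustrates that reading only; the function names, the choice of label 1 as "relevant", and the example scores are all assumptions, not the tool's actual implementation.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_relevant(logits, threshold=0.5):
    """Assumed rule: label 1 means 'relevant'; compare its probability to the threshold."""
    probability_relevant = softmax(logits)[1]
    return probability_relevant >= threshold


# Made-up scores for two question/context pairs, for illustration only.
print(is_relevant([0.2, 1.5]))   # True: label 1 has the clearly larger score
print(is_relevant([2.0, -1.0]))  # False: label 0 has the clearly larger score
```

Under this reading, raising the threshold yields fewer but more confident matches, and lowering it yields more matches at the cost of more false positives.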
Training Data
Training File
- To train the model, you need a CSV file with the following columns:
  - Question
  - Context
  - Label
Alternatively, the output of the https://github.com/os-climate/osc-transformer-presteps module can be used. That output looks like the following sample data:
Question | Context | Label | Company | Source File | Source Page | KPI ID | Year | Answer | Data Type | Annotator | Index |
---|---|---|---|---|---|---|---|---|---|---|---|
What is the company name? | The Company is exposed to a risk of by losses counterparties their contractual financial obligations when due, and in particular depends on the reliability of banks the Company deposits its available cash. | 0 | NOVATEK | 04_NOVATEK_AR_2016_ENG_11.pdf | ['0'] | 0 | 2016 | PAO NOVATEK | TEXT | train_anno_large.xlsx | 1022 |
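Before starting a training run, it can help to confirm that the file really contains the three required columns named above (extra columns, as in the presteps output, do no harm). The following sketch uses only the Python standard library. It assumes a comma-delimited file, and the helper name `missing_columns` is hypothetical rather than part of the package.

```python
import csv
import io

REQUIRED_COLUMNS = {"Question", "Context", "Label"}


def missing_columns(csv_file) -> set:
    """Return the required columns that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = set(reader.fieldnames or [])  # fieldnames is None for an empty file
    return REQUIRED_COLUMNS - header


# Tiny in-memory stand-in for project/training_data.csv (made-up row).
sample = io.StringIO(
    "Question,Context,Label,Company\n"
    "What is the company name?,Some paragraph from a report.,0,NOVATEK\n"
)
print(missing_columns(sample))  # an empty set means all required columns are present
```

For a real file, open it with `open("project/training_data.csv", newline="", encoding="utf-8")` and pass the file object to the same function.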
KPI Mapping File
The inference command needs a KPI mapping CSV file (kpi_mapping.csv in the example above), which looks like:
kpi_id | question | sectors | add_year | kpi_category |
---|---|---|---|---|
1 | In which year was the annual report or the sustainability report published? | OG, CM, CU | FALSE | TEXT |
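To inspect or sanity-check a KPI mapping file before running inference, its rows can be read into typed records. The sketch below is an illustration under stated assumptions, not the tool's own loader: it assumes a comma-delimited file in which the sectors field is quoted, an integer `kpi_id`, and `TRUE`/`FALSE` strings for `add_year`. The names `KpiMapping` and `read_kpi_mapping` are hypothetical.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class KpiMapping:
    kpi_id: int
    question: str
    sectors: list
    add_year: bool
    kpi_category: str


def read_kpi_mapping(csv_file):
    """Parse the rows of a KPI mapping file into typed records."""
    records = []
    for row in csv.DictReader(csv_file):
        records.append(
            KpiMapping(
                kpi_id=int(row["kpi_id"]),
                question=row["question"].strip(),
                # "OG, CM, CU" becomes ["OG", "CM", "CU"]
                sectors=[s.strip() for s in row["sectors"].split(",") if s.strip()],
                add_year=row["add_year"].strip().upper() == "TRUE",
                kpi_category=row["kpi_category"].strip(),
            )
        )
    return records


# In-memory copy of the example row shown above.
sample = io.StringIO(
    "kpi_id,question,sectors,add_year,kpi_category\n"
    "1,In which year was the annual report or the sustainability report published?,"
    '"OG, CM, CU",FALSE,TEXT\n'
)
for record in read_kpi_mapping(sample):
    print(record.kpi_id, record.sectors, record.add_year)
```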
Developer Notes
Use the code directly, without the CLI, via the GitHub repository
First clone the repository to your local environment:
$ git clone https://github.com/os-climate/osc-transformer-based-extractor/
We are using pdm to manage the packages and tox for a stable test framework. Hence, first install pdm (possibly in a virtual environment) via:
$ pip install pdm
Afterwards, sync your system via:
$ pdm sync
The repository contains several demos that show how to proceed. See the [demo](demo) folder.
pdm
Use pdm to add new dependencies. New packages are added via pdm add, for example numpy:
$ pdm add numpy
For a very detailed description, check the homepage of the pdm project.
tox
For linting and testing we use tox, which you run outside of your virtual environment:
$ pip install tox
$ tox -e lint
$ tox -e test
This automatically applies some checks to your code and runs the provided pytest tests. For more details, see the homepage of the tox project.
Contributing
Contributions are welcome! Please fork the repository and submit a pull request for any enhancements or bug fixes.
All contributions (including pull requests) must agree to the Developer Certificate of Origin (DCO) version 1.1. This is exactly the same one created and used by the Linux kernel developers and posted on http://developercertificate.org/. This is a developer’s certification that he or she has the right to submit the patch for inclusion into the project. Simply submitting a contribution implies this agreement. However, please include a “Signed-off-by” tag in every patch (this tag is a conventional way to confirm that you agree to the DCO).
On June 26, 2024, the Linux Foundation announced the merger of its financial services umbrella, the Fintech Open Source Foundation ([FINOS](https://finos.org)), with OS-Climate, an open source community dedicated to building data technologies, modeling, and analytic tools that will drive global capital flows into climate change mitigation and resilience. OS-Climate projects are in the process of transitioning to the [FINOS governance framework](https://community.finos.org/docs/governance). Read more on [finos.org/press/finos-join-forces-os-open-source-climate-sustainability-esg](https://finos.org/press/finos-join-forces-os-open-source-climate-sustainability-esg).