arekit-ss

A low-resource context relation sampler that extracts contexts with relations for fact-checking and fine-tuning your LLMs, powered by AREkit

arekit-ss 0.24.0

arekit-ss [AREkit double "s"] is an object-pair context sampler for data sources, powered by AREkit.

NOTE: For custom text sampling, please refer to the ARElight project.

Installation

Install arekit-ss together with its dependencies:

pip install git+https://github.com/nicolay-r/arekit-ss.git@0.24.0

Download the AREkit-related data that the sources require:

python -m arekit.download_data

Usage

Example of composing prompts:

python -m arekit_ss.sample --writer csv --source rusentrel --sampler prompt \
  --prompt "For text: '{text}', the attitude between '{s_val}' and '{t_val}' is: '{label_val}'" \
  --dest_lang en --docs_limit 1
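
If you launch the sampler from a Python script rather than a shell, the same invocation can be composed as an argument list. The following is a minimal sketch: `build_sample_command` is a hypothetical helper, not part of arekit-ss, and it uses only the flags that appear in the command above.

```python
import shlex
import sys

PROMPT = ("For text: '{text}', the attitude between "
          "'{s_val}' and '{t_val}' is: '{label_val}'")


def build_sample_command(source, sampler, writer,
                         prompt=None, dest_lang=None, docs_limit=None):
    """Hypothetical helper: compose argv for `python -m arekit_ss.sample`."""
    argv = [sys.executable, "-m", "arekit_ss.sample",
            "--writer", writer, "--source", source, "--sampler", sampler]
    # Optional flags are appended only when a value is provided.
    if prompt is not None:
        argv += ["--prompt", prompt]
    if dest_lang is not None:
        argv += ["--dest_lang", dest_lang]
    if docs_limit is not None:
        argv += ["--docs_limit", str(docs_limit)]
    return argv


argv = build_sample_command("rusentrel", "prompt", "csv",
                            prompt=PROMPT, dest_lang="en", docs_limit=1)
# Print a shell-quoted version of the command for inspection.
print(shlex.join(argv))
# To actually run it (requires arekit-ss and its data to be installed):
# import subprocess; subprocess.run(argv, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with the single quotes and braces inside the prompt text.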

Mind the case (issue #18): switching to another language may affect the amount of extracted data, because the terms_per_context parameter crops the context to a fixed, predefined number of words.
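
To see why a fixed word limit interacts with translation, consider the toy illustration below. It is only a sketch under stated assumptions: `terms_between` and `keep_pair` are hypothetical functions, whitespace splitting stands in for the real tokenization, and the two sentences are made up. It does not reproduce how arekit-ss crops contexts internally.

```python
def terms_between(terms, source, target):
    """Count the terms strictly between the first source and target mentions."""
    s, t = terms.index(source), terms.index(target)
    return abs(s - t) - 1


def keep_pair(text, source, target, terms_per_context):
    """Hypothetical filter: keep the pair only if its gap fits the limit."""
    return terms_between(text.split(), source, target) <= terms_per_context


# Two made-up renderings of the same statement with different word counts,
# standing in for the original text and its translation.
short_text = "SRC criticized TGT"
long_text = "SRC has spoken in a critical manner about TGT"

limit = 3
print(keep_pair(short_text, "SRC", "TGT", limit))  # True: 1 term in between
print(keep_pair(long_text, "SRC", "TGT", limit))   # False: 7 terms in between
```

The same object pair passes the limit in one rendering and fails it in the other, which is the kind of effect the note above warns about.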

Parameters

source -- source name from the list of the supported sources.
- terms_per_context -- number of words (terms) between the SOURCE and TARGET objects.
- object-source-types -- filter by specific source object types
- object-target-types -- filter by specific target object types
- relation_types -- list of relation types, separated by the | character; all types by default
- splits -- manual selection of the data splits to use for sampling; split names are separated by the ':' sign, for example: 'train:test'
sampler -- List of the supported samplers:
- nn -- for CNN/LSTM architectures, including frames annotation from RuSentiFrames.
  - no-vectorize -- flag applicable only to nn; indicates that embeddings for features do not need to be generated
- bert -- BERT-based, single-input sequence.
- prompt -- prompt-based sampler for LLM systems [prompt engineering guide]
  - prompt -- text of the prompt, which may include the following placeholders:
    - {text} -- the original text of the sample
    - {s_val} and {t_val} -- values of the source and target of the pair, respectively
    - {label_val} -- value of the label
writer -- the output format of samples:
- csv -- for the AREnets framework.
- jsonl -- for the OpenNRE framework.
- sqlite -- SQLite-3.0 database.
mask_entities -- entity masking mode.
Text translation parameters:
- src_lang -- original language of the text.
- dest_lang -- target language of the text.
output_dir -- target directory for storing samples
Limiting the number of documents taken from the source:
- docs_limit -- number of documents from the whole source to consider for sampling.
- doc_ids -- list of the document IDs.
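
The placeholders of the prompt sampler can be previewed before running a full sampling job. The sketch below assumes the placeholders behave like Python `str.format` fields; `render_prompt` is a hypothetical helper, and the sentence, entity names, and label value are made up for illustration.

```python
PROMPT = ("For text: '{text}', the attitude between "
          "'{s_val}' and '{t_val}' is: '{label_val}'")


def render_prompt(template, text, s_val, t_val, label_val):
    """Hypothetical helper: fill the four documented placeholders."""
    return template.format(text=text, s_val=s_val,
                           t_val=t_val, label_val=label_val)


# Made-up sample values; real values come from the selected source.
rendered = render_prompt(PROMPT,
                         text="Alice praised Bob.",
                         s_val="Alice",
                         t_val="Bob",
                         label_val="positive")
print(rendered)
```

Rendering one sample by hand is a quick way to check quoting and wording of a template before passing it to the prompt option.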

Powered by

AREkit framework

Release history

- 0.24.0 (this version) -- Nov 7, 2023
- 0.23.1 -- Nov 8, 2023