Low-resource sampler of contexts with relations, for fact-checking and fine-tuning your LLMs, powered by AREkit
arekit-ss 0.24.0
arekit-ss
[AREkit double "s"] -- an object-pair context sampler for data sources, powered by AREkit.
NOTE: For custom text sampling, please follow the ARElight project.
Installation
Install dependencies:
pip install git+https://github.com/nicolay-r/arekit-ss.git@0.24.0
Download the AREkit-related data that the sources require:
python -m arekit.download_data
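Before downloading data, it can be useful to confirm that the install step succeeded. The following is a minimal sketch that relies only on the Python standard library; the distribution name arekit-ss is taken from the pip command above, while the helper name installed_version is made up for this illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("arekit-ss")
    print(found if found is not None else "arekit-ss is not installed")
```

Returning None instead of raising keeps the check usable in setup scripts that decide whether to run the pip command at all.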
Usage
Example of composing prompts:
python -m arekit_ss.sample --writer csv --source rusentrel --sampler prompt \
--prompt "For text: '{text}', the attitude between '{s_val}' and '{t_val}' is: '{label_val}'" \
--dest_lang en --docs_limit 1
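How arekit-ss substitutes the placeholders internally is not described here. The sketch below assumes plain Python str.format semantics and uses invented sample values, purely to show what one rendered row could look like; render_prompt is a hypothetical helper, not part of arekit-ss.

```python
# The template is copied from the command above.
PROMPT = ("For text: '{text}', the attitude between '{s_val}' and '{t_val}' "
          "is: '{label_val}'")


def render_prompt(template, text, s_val, t_val, label_val):
    """Fill the four placeholders of a prompt template."""
    return template.format(text=text, s_val=s_val, t_val=t_val,
                           label_val=label_val)


# Hypothetical sample values, invented for this illustration.
row = render_prompt(PROMPT, text="A criticized B", s_val="A", t_val="B",
                    label_val="negative")
print(row)
```

Trying a template this way before a full run is a cheap check that the quoting and placeholder names are what you intended.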
Mind the case (issue #18): switching to another language may affect the amount of extracted data, because the terms_per_context parameter crops each context to a fixed, predefined number of words.
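The effect can be illustrated with a simplified model: assume a pair is kept only when the number of words between its source and target objects does not exceed terms_per_context. This is a sketch under that assumption, with whitespace tokenization and invented sentences; it is not the AREkit implementation, and terms_between and pair_fits are hypothetical helpers.

```python
def terms_between(tokens, source, target):
    """Count tokens strictly between the first occurrences of source and target."""
    i, j = tokens.index(source), tokens.index(target)
    return abs(i - j) - 1


def pair_fits(text, source, target, terms_per_context):
    """True if the source/target pair fits into the allowed window of terms."""
    tokens = text.split()
    return terms_between(tokens, source, target) <= terms_per_context


# The same statement in two languages (illustrative sentences only).
RU = "Россия ввела санкции против США"
EN = "Russia has imposed sanctions against the USA"

print(pair_fits(RU, "Россия", "США", terms_per_context=4))  # 3 words in between
print(pair_fits(EN, "Russia", "USA", terms_per_context=4))  # 5 words in between
```

The translated sentence carries extra words (auxiliary verb, article), so the same pair can fall outside the window after translation even though the limit itself did not change.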
Parameters
- source -- source name from the list of the supported sources.
  - terms_per_context -- number of words (terms) in between the SOURCE and TARGET objects.
  - object-source-types -- filter for specific source object types.
  - object-target-types -- filter for specific target object types.
  - relation_types -- list of relation types, with items separated by the | character; all types by default.
  - splits -- manual selection of the data splits to use for sampling; split types are separated by the ':' sign, for example: 'train:test'.
- sampler -- one of the supported samplers:
  - nn -- CNN/LSTM architecture related, including frames annotation from RuSentiFrames.
    - no-vectorize -- flag applicable only to nn; denotes that there is no need to generate embeddings for features.
  - bert -- BERT-based, single-input sequence.
  - prompt -- prompt-based sampler for LLM systems [prompt engineering guide].
    - prompt -- text of the prompt, which may include the following parameters:
      - {text} -- the original text of the sample;
      - {s_val} and {t_val} -- values of the source and target of the pair, respectively;
      - {label_val} -- value of the label.
- writer -- the output format of samples, such as csv.
- mask_entities -- entity masking mode.
- Text translation parameters:
  - src_lang -- original language of the text.
  - dest_lang -- target language of the text.
- output_dir -- target directory for storing samples.
- Limiting the amount of documents from source:
  - docs_limit -- number of documents to be considered for sampling from the whole source.
  - doc_ids -- list of the document IDs.
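When the sampler is launched from a script, the list-valued parameters have to be joined with their separators. The sketch below composes an argument list for python -m arekit_ss.sample. It assumes that every parameter is passed as --name value, as the flags in the usage example are; the exact spelling of --relation_types and --splits on the command line is an assumption, and build_command is a hypothetical helper.

```python
import shlex


def build_command(source, sampler, writer, relation_types=None, splits=None,
                  **extra):
    """Compose an argv list for `python -m arekit_ss.sample` (flag names assumed)."""
    cmd = ["python", "-m", "arekit_ss.sample",
           "--source", source, "--sampler", sampler, "--writer", writer]
    if relation_types:
        # Relation types are separated with the `|` character.
        cmd += ["--relation_types", "|".join(relation_types)]
    if splits:
        # Split types are separated by the `:` sign, e.g. train:test.
        cmd += ["--splits", ":".join(splits)]
    for name, value in extra.items():
        # Any other parameter is forwarded as `--name value`.
        cmd += [f"--{name}", str(value)]
    return cmd


cmd = build_command("rusentrel", "bert", "csv",
                    splits=["train", "test"], docs_limit=1)
print(shlex.join(cmd))
```

Passing the resulting list to subprocess.run without a shell avoids having to quote the | separator, which a shell would otherwise treat as a pipe.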
File details
Details for the file arekit_ss-0.24.0.tar.gz.
File metadata
- Download URL: arekit_ss-0.24.0.tar.gz
- Upload date:
- Size: 17.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | cf37f76d1fd2936cb7a83278b0d944760317888c63d48d80d6f7aa914045f614
MD5 | c571b355f82113af19805e081bc8c433
BLAKE2b-256 | 18e464461a438268d63a10cf64218475cf985c6aebf0b6d9d86243ee2bca5b9c
File details
Details for the file arekit_ss-0.24.0-py3-none-any.whl.
File metadata
- Download URL: arekit_ss-0.24.0-py3-none-any.whl
- Upload date:
- Size: 23.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 8a6f85917dc474dc660115bc59d095a7c52bc4d4182f6906a8a053e6054f7896
MD5 | 1c50964b1b60971fe4503aaeeb6c622a
BLAKE2b-256 | e10d8da3e2918b8ad133acba45f6f9559e733338217a060ac1c75696bebfc4f1