
Low-resource context relation sampler: samples contexts with relations for fact-checking and fine-tuning your LLMs, powered by AREkit


arekit-ss 0.24.0

arekit-ss [AREkit double "s"] is an object-pair context sampler for data sources, powered by AREkit.

NOTE: For custom text sampling, please follow the ARElight project.

Installation

Install the package together with its dependencies:

pip install git+https://github.com/nicolay-r/arekit-ss.git@0.24.0

Download the AREkit-related data required by the sources:

python -m arekit.download_data

Usage

Example of composing prompts:

python -m arekit_ss.sample --writer csv --source rusentrel --sampler prompt \
  --prompt "For text: '{text}', the attitude between '{s_val}' and '{t_val}' is: '{label_val}'" \
  --dest_lang en --docs_limit 1

Mind the case (issue #18): switching to another language may affect the amount of extracted data, because the terms_per_context parameter crops each context to a fixed, predefined number of words.
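The effect described above can be illustrated with a minimal sketch. It is not taken from arekit-ss: the helper `pair_fits` is hypothetical, and it assumes that a pair is kept only when the span from the source to the target term (inclusive) is at most terms_per_context terms long. The two term lists stand in for a shorter and a longer rendering of the same sentence, as might happen after translation.

```python
def pair_fits(terms, source, target, terms_per_context):
    """Hypothetical check: do source and target fit into one context window?

    Assumption (not the arekit-ss implementation): the pair is kept only if
    the inclusive span between both terms is <= terms_per_context.
    """
    s_ind = terms.index(source)
    t_ind = terms.index(target)
    return abs(t_ind - s_ind) + 1 <= terms_per_context


# The same statement in a compact and in a wordier rendering.
short_terms = ["Moscow", "criticized", "Washington"]
long_terms = ["Moscow", "has", "sharply", "criticized",
              "the", "position", "of", "Washington"]

print(pair_fits(short_terms, "Moscow", "Washington", terms_per_context=5))
print(pair_fits(long_terms, "Moscow", "Washington", terms_per_context=5))
```

With a window of 5 terms, the pair survives in the compact rendering and is dropped in the wordier one, which is why the number of extracted samples can change with the language.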

Parameters

  • source -- source name from the list of the supported sources.
    • terms_per_context -- number of words (terms) between the SOURCE and TARGET objects.
    • object-source-types -- filter by specific source object types.
    • object-target-types -- filter by specific target object types.
    • relation_types -- list of relation types, separated by the | character; all by default.
    • splits -- manual selection of the data splits to be sampled; split names are separated by the ':' sign, for example: 'train:test'.
  • sampler -- List of the supported samplers:
    • nn -- CNN/LSTM architecture related, including frames annotation from RuSentiFrames.
      • no-vectorize -- flag applicable only to nn; indicates that embeddings for features do not need to be generated.
    • bert -- BERT-based, single-input sequence.
    • prompt -- prompt-based sampler for LLM systems [prompt engineering guide]
      • prompt -- text of the prompt, which may include the following placeholders:
        • {text} -- the original text of the sample
        • {s_val} and {t_val} -- the values of the source and target of the pair, respectively
        • {label_val} -- the value of the label
  • writer -- the output format of samples:
    • csv -- for the AREnets framework.
    • jsonl -- for the OpenNRE framework.
    • sqlite -- SQLite 3 database.
  • mask_entities -- mask entity mode.
  • Text translation parameters:
    • src_lang -- original language of the text.
    • dest_lang -- target language of the text.
  • output_dir -- target directory for storing samples.
  • Limiting the number of documents from the source:
    • docs_limit -- number of documents from the whole source to be considered for sampling.
    • doc_ids -- list of the document IDs.
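
To make the placeholder and separator conventions from the list above concrete, here is a small sketch in plain Python. It is not the arekit-ss implementation: `render_prompt` and `parse_list` are hypothetical helpers, the sample values are invented for illustration, and the relation type names POS and NEG are assumed rather than taken from any source. The template itself is the one from the usage example.

```python
PROMPT = ("For text: '{text}', the attitude between "
          "'{s_val}' and '{t_val}' is: '{label_val}'")


def render_prompt(template, text, s_val, t_val, label_val):
    """Fill the four documented placeholders via str.format (hypothetical helper)."""
    return template.format(text=text, s_val=s_val,
                           t_val=t_val, label_val=label_val)


def parse_list(value, sep):
    """Split a separator-delimited parameter value, skipping empty items."""
    return [item for item in value.split(sep) if item]


# Invented sample values, for illustration only.
rendered = render_prompt(PROMPT,
                         text="Moscow criticized Washington",
                         s_val="Moscow",
                         t_val="Washington",
                         label_val="negative")
print(rendered)

# splits are separated by ':'; relation_types by '|'.
print(parse_list("train:test", ":"))
print(parse_list("POS|NEG", "|"))  # POS / NEG are assumed type names
```

When passing relation_types on a shell command line, the value most likely needs quoting, since an unquoted | is interpreted by the shell as a pipe.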

