
Low-resource sampler of contexts with relations, for fact-checking and fine-tuning your LLM models, powered by AREkit

Project description

arekit-ss 0.24.0

arekit-ss [AREkit double "s"] is an object-pair context sampler for data sources, powered by AREkit.

NOTE: For custom text sampling, please refer to the ARElight project.

Installation

Install dependencies:

pip install git+https://github.com/nicolay-r/arekit-ss.git@0.24.0

Download the AREkit-related data required by the supported sources:

python -m arekit.download_data

Usage

Example of composing prompts:

python -m arekit_ss.sample --writer csv --source rusentrel --sampler prompt \
  --prompt "For text: '{text}', the attitude between '{s_val}' and '{t_val}' is: '{label_val}'" \
  --dest_lang en --docs_limit 1

Mind the case (issue #18): switching to another language may affect the amount of extracted data because of the terms_per_context parameter, which crops contexts to a fixed, predefined number of words.
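
The placeholders in the --prompt template are filled per sampled context pair. Below is a minimal Python sketch of that substitution, for illustration only: the sample values are made up and this is not the arekit_ss implementation.

# Illustrative only: how the --prompt template placeholders get filled for one sample.
template = "For text: '{text}', the attitude between '{s_val}' and '{t_val}' is: '{label_val}'"

# Hypothetical sampled context; real values come from the selected source.
sample = {
    "text": "X criticized Y over the recent decision.",
    "s_val": "X",
    "t_val": "Y",
    "label_val": "neg",
}

print(template.format(**sample))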

Parameters

  • source -- source name from the list of the supported sources.
    • terms_per_context -- number of words (terms) between the SOURCE and TARGET objects.
    • object-source-types -- filter by specific source object types.
    • object-target-types -- filter by specific target object types.
    • relation_types -- list of relation types, with items separated by the '|' character; all types by default.
    • splits -- manual selection of the data splits used for sampling; split names are separated by ':', e.g. 'train:test'.
  • sampler -- list of the supported samplers:
    • nn -- for CNN/LSTM architectures; includes frame annotations from RuSentiFrames.
      • no-vectorize -- flag applicable only to nn; skips generation of feature embeddings.
    • bert -- BERT-based, single-input sequence.
    • prompt -- prompt-based sampler for LLM systems [prompt engineering guide].
      • prompt -- text of the prompt, which may include the following placeholders:
        • {text} -- the original text of the sample;
        • {s_val} and {t_val} -- values of the source and target of the pair, respectively;
        • {label_val} -- value of the label.
  • writer -- the output format of samples (see the combined example after this list):
    • csv -- for the AREnets framework;
    • jsonl -- for the OpenNRE framework;
    • sqlite -- SQLite-3.0 database.
  • mask_entities -- entity masking mode.
  • Text translation parameters:
    • src_lang -- original language of the text.
    • dest_lang -- target language of the text.
  • output_dir -- target directory for storing samples.
  • Limiting the number of documents taken from the source:
    • docs_limit -- number of documents considered for sampling from the whole source.
    • doc_ids -- list of the document IDs.
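
A combined invocation exercising several of the parameters above. This is a sketch: the relation type names are illustrative, and the exact flag spellings for parameters not shown in the earlier example should be verified via python -m arekit_ss.sample --help.

python -m arekit_ss.sample --writer jsonl --source rusentrel --sampler bert \
  --relation_types "pos|neg" --splits "train:test" \
  --terms_per_context 50 --docs_limit 5 --output_dir out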

[Screenshot: example of output prompts]
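
After sampling, the files appear in output_dir in the chosen writer format. A minimal Python sketch for inspecting a jsonl output; the file name is hypothetical and the field names depend on the chosen sampler:

import json

# Hypothetical output path; the actual file name depends on the run configuration.
with open("out/sample-train-0.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        print(sorted(row.keys()))  # inspect which fields the sampler produced
        break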

Powered by AREkit

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

arekit_ss-0.24.0.tar.gz (17.9 kB)

Uploaded Source

Built Distribution

arekit_ss-0.24.0-py3-none-any.whl (23.9 kB)

Uploaded Python 3

File details

Details for the file arekit_ss-0.24.0.tar.gz.

File metadata

  • Download URL: arekit_ss-0.24.0.tar.gz
  • Upload date:
  • Size: 17.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.5

File hashes

Hashes for arekit_ss-0.24.0.tar.gz
  • SHA256: cf37f76d1fd2936cb7a83278b0d944760317888c63d48d80d6f7aa914045f614
  • MD5: c571b355f82113af19805e081bc8c433
  • BLAKE2b-256: 18e464461a438268d63a10cf64218475cf985c6aebf0b6d9d86243ee2bca5b9c

See more details on using hashes here.
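
To check a downloaded archive against the SHA256 digest above, here is a short Python sketch (the local path is an assumption; any equivalent tool such as sha256sum works as well):

import hashlib

# SHA256 digest published above for arekit_ss-0.24.0.tar.gz.
EXPECTED = "cf37f76d1fd2936cb7a83278b0d944760317888c63d48d80d6f7aa914045f614"

# Path to the locally downloaded archive (adjust to wherever it was saved).
with open("arekit_ss-0.24.0.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print("OK" if digest == EXPECTED else "MISMATCH", digest)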

File details

Details for the file arekit_ss-0.24.0-py3-none-any.whl.

File metadata

  • Download URL: arekit_ss-0.24.0-py3-none-any.whl
  • Upload date:
  • Size: 23.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.5

File hashes

Hashes for arekit_ss-0.24.0-py3-none-any.whl
  • SHA256: 8a6f85917dc474dc660115bc59d095a7c52bc4d4182f6906a8a053e6054f7896
  • MD5: 1c50964b1b60971fe4503aaeeb6c622a
  • BLAKE2b-256: e10d8da3e2918b8ad133acba45f6f9559e733338217a060ac1c75696bebfc4f1

See more details on using hashes here.
