
Large language model to corpus


Introduction

The goal of this tool is to apply Large Language Model (LLM) operations to a monolingual corpus to generate a parallel corpus.

Use cases:

  • Asking a model to translate, summarize, or paraphrase the original sentences so that you can benchmark its performance.
  • Generating corpora from a monolingual corpus, for example a translated corpus.
  • Testing a prompt over a list of sentences for evaluation while you develop prompts for your application.

You provide an input file and a prompt, and the tool generates a target corpus.
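To make the input file, prompt, and target corpus flow concrete, here is a minimal Python sketch of the idea. It is not the tool's actual implementation: the function names `build_parallel_corpus` and `fake_model` are hypothetical, and `fake_model` is a stand-in that only upper-cases the sentence instead of calling a real LLM. The sketch assumes one sentence per line and one output line per input line, so that source and target stay aligned.

```python
from typing import Callable


def build_parallel_corpus(
    source_path: str,
    target_path: str,
    prompt: str,
    apply_model: Callable[[str], str],
) -> int:
    """Apply the prompt to every source line and write one output line per input line."""
    written = 0
    with open(source_path, encoding="utf-8") as source, \
            open(target_path, "w", encoding="utf-8") as target:
        for line in source:
            sentence = line.rstrip("\n")
            # Assumption: the prompt and the sentence are joined with a newline.
            output = apply_model(f"{prompt}\n{sentence}")
            # Collapse whitespace so a multi-line answer cannot break line alignment.
            target.write(" ".join(output.split()) + "\n")
            written += 1
    return written


def fake_model(text: str) -> str:
    """Hypothetical stand-in for an LLM call: returns the sentence in upper case."""
    return text.split("\n", 1)[1].upper()
```

Replacing `fake_model` with a function that calls a real model would give the same file-in, file-out behaviour, with the target file line-aligned to the source.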

Quick start

For example, to use OpenAI ChatGPT to translate a file:

llm-to-corpus samples/eng.txt samples/fra.txt "translate to French"

To see models and options available:

llm-to-corpus --help

Usage

Evaluation with ChatGPT

Translate the Flores200 corpus to evaluate the quality of the Catalan translation:

llm-to-corpus samples/flores200.eng chatgpt.txt "Translate to Catalan the following text:"
pip install sacrebleu
sacrebleu samples/flores200.cat -i chatgpt.txt -m bleu chrf --format text

Evaluation with Bloom

Translate the Flores200 corpus to evaluate the quality of the Catalan translation:

llm-to-corpus samples/flores200.eng bloom.txt "Translate to Catalan the following text:" --model mt0-xxl-mt
pip install sacrebleu
sacrebleu samples/flores200.cat -i bloom.txt -m bleu chrf --format text
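Scores such as BLEU and chrF compare the hypothesis with the reference line by line, so a generated file like chatgpt.txt or bloom.txt should have exactly as many lines as samples/flores200.cat. The sketch below is an optional sanity check to run before scoring. It is an assumption about a useful workflow, not part of llm-to-corpus or sacrebleu, and the function name `check_alignment` is hypothetical.

```python
def read_lines(path: str) -> list[str]:
    """Read a text file as a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def check_alignment(reference_path: str, hypothesis_path: str) -> list[int]:
    """Return 1-based numbers of empty hypothesis lines; raise if line counts differ."""
    refs = read_lines(reference_path)
    hyps = read_lines(hypothesis_path)
    if len(refs) != len(hyps):
        raise ValueError(
            f"{hypothesis_path} has {len(hyps)} lines, "
            f"but {reference_path} has {len(refs)}"
        )
    # Empty outputs usually mean the model returned nothing for that sentence.
    return [number for number, line in enumerate(hyps, start=1) if not line.strip()]
```

For example, `check_alignment("samples/flores200.cat", "chatgpt.txt")` returns the line numbers worth inspecting by hand, or raises if the two files cannot be scored against each other.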
