Large language model to corpus

Introduction

The goal of this tool is to apply large language model (LLM) operations to a monolingual corpus in order to generate a parallel corpus.

Use cases:

  • Asking a model to translate, summarize, or paraphrase the original sentences so that its performance can be benchmarked
  • Generating a corpus from a monolingual corpus, for example a translated corpus
  • Testing a prompt over a list of sentences for evaluation while developing prompts for your application

You provide an input file and a prompt, and the tool generates a target corpus.
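
The input-file-plus-prompt idea can be pictured with a minimal Python sketch. This is not the tool's actual implementation: the names generate_corpus and fake_model are hypothetical, the way the prompt is joined to each sentence is an assumption, and the model call is replaced by a stand-in so the example runs without any API access.

```python
from typing import Callable


def generate_corpus(input_path: str, output_path: str, prompt: str,
                    model_fn: Callable[[str], str]) -> int:
    """Apply the prompt to every input line and write one output line per input line.

    Returns the number of lines processed.
    """
    count = 0
    with open(input_path, encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as tgt:
        for line in src:
            sentence = line.rstrip("\n")
            # Assumption: the prompt is simply placed on the line before the sentence.
            reply = model_fn(f"{prompt}\n{sentence}")
            # Collapse any whitespace runs (including newlines) so that the
            # output stays aligned with the input, line by line.
            tgt.write(" ".join(reply.split()) + "\n")
            count += 1
    return count


def fake_model(request: str) -> str:
    # Hypothetical stand-in for a real LLM call: upper-cases the sentence part.
    return request.split("\n", 1)[1].upper()
```

Passing a real model client in place of fake_model would turn this into a working generator; keeping the model behind a plain callable makes it easy to try the same prompt against different models.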

Quick start

For example, to use OpenAI ChatGPT to translate a file:

llm-to-corpus samples/eng.txt samples/fra.txt "translate to French"

To see models and options available:

llm-to-corpus --help

Usage

Evaluation with ChatGPT

Translate the Flores200 corpus to evaluate the quality of the Catalan translation:

llm-to-corpus samples/flores200.eng chatgpt.txt "Translate to Catalan the following text:"
pip install sacrebleu
sacrebleu samples/flores200.cat -i chatgpt.txt -m bleu chrf --format text
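
Since sacrebleu compares the hypothesis and the reference segment by segment, both files need the same number of lines. A small sanity check before scoring could look like the sketch below. The helper names count_lines, check_aligned and blank_lines are hypothetical and are not part of llm-to-corpus or sacrebleu.

```python
from typing import List


def count_lines(path: str) -> int:
    """Count the lines in a text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_aligned(reference_path: str, hypothesis_path: str) -> bool:
    """Return True when both files contain the same number of lines."""
    return count_lines(reference_path) == count_lines(hypothesis_path)


def blank_lines(path: str) -> List[int]:
    """Return the 1-based numbers of lines that are empty or whitespace only."""
    with open(path, encoding="utf-8") as handle:
        return [number for number, line in enumerate(handle, start=1)
                if not line.strip()]
```

For example, check_aligned("samples/flores200.cat", "chatgpt.txt") would confirm that the generated file lines up with the reference, and blank_lines("chatgpt.txt") would point to sentences for which the model returned nothing.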

Evaluation with Bloom

Translate the Flores200 corpus to evaluate the quality of the Catalan translation:

llm-to-corpus samples/flores200.eng bloom.txt "Translate to Catalan the following text:" --model mt0-xxl-mt
pip install sacrebleu
sacrebleu samples/flores200.cat -i bloom.txt -m bleu chrf --format text

Download files

Source Distribution

llm-to-corpus-0.0.3.tar.gz (5.5 kB)

Built Distribution

llm_to_corpus-0.0.3-py3-none-any.whl (10.1 kB, Python 3)
