
Large language model to corpus

Introduction

The goal of this tool is to apply Large Language Model (LLM) operations to a monolingual corpus in order to generate a parallel corpus.

Use cases:

  • Asking a model to translate, summarize, or paraphrase the original sentences so that you can benchmark its performance
  • Generating corpora from a monolingual corpus, for example a translated corpus
  • Testing a prompt over a list of sentences for evaluation while you develop prompts for your application

You provide an input file and a prompt, and the tool generates a target corpus.
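The idea of turning an input file and a prompt into a target corpus can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: it assumes one sentence per line and that the prompt is simply prepended to each sentence. The names `generate_corpus` and `fake_model` are hypothetical, and `fake_model` stands in for a real LLM call.

```python
from pathlib import Path
from typing import Callable


def generate_corpus(src: Path, dst: Path, prompt: str,
                    model: Callable[[str], str]) -> int:
    """Apply `prompt` to every line of `src` and write the replies to `dst`.

    Assumption: one sentence per line, prompt prepended to each sentence.
    Returns the number of lines processed.
    """
    count = 0
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            sentence = line.rstrip("\n")
            reply = model(f"{prompt} {sentence}")
            # Collapse whitespace (including newlines) so each reply stays on
            # one line and the output remains aligned with the input.
            fout.write(" ".join(reply.split()) + "\n")
            count += 1
    return count


def fake_model(request: str) -> str:
    # Hypothetical stand-in for a real LLM call: it only upper-cases the request.
    return request.upper()
```

Keeping exactly one output line per input line is what makes the result usable as a parallel corpus: line N of the target file corresponds to line N of the source file.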

Quick start

For example, to use OpenAI ChatGPT to translate a file:

llm-to-corpus samples/eng.txt samples/fra.txt "translate to French"

To see models and options available:

llm-to-corpus --help

Usage

Evaluation with ChatGPT

Translate the Flores200 corpus to evaluate the quality of the Catalan translation:

llm-to-corpus samples/flores200.eng chatgpt.txt "Translate to Catalan the following text:"
pip install sacrebleu
sacrebleu samples/flores200.cat -i chatgpt.txt -m bleu chrf --format text

Evaluation with Bloom

Translate the Flores200 corpus to evaluate the quality of the Catalan translation:

llm-to-corpus samples/flores200.eng bloom.txt "Translate to Catalan the following text:" --model mt0-xxl-mt
pip install sacrebleu
sacrebleu samples/flores200.cat -i bloom.txt -m bleu chrf --format text
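
Both evaluations compare a generated file against the reference samples/flores200.cat line by line, so the scores are only meaningful when the two files have the same number of lines. A small check along these lines can be run before scoring. It is a sketch that assumes one sentence per line; `count_lines` and `check_aligned` are hypothetical helpers, not part of llm-to-corpus or sacrebleu.

```python
from pathlib import Path


def count_lines(path: Path) -> int:
    """Count the lines in a UTF-8 text file."""
    with path.open(encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_aligned(hypothesis: Path, reference: Path) -> None:
    """Raise ValueError if the two files do not have the same number of lines."""
    hyp, ref = count_lines(hypothesis), count_lines(reference)
    if hyp != ref:
        raise ValueError(
            f"{hypothesis} has {hyp} lines but {reference} has {ref}"
        )
```

For the ChatGPT run above, this would be called as check_aligned(Path("chatgpt.txt"), Path("samples/flores200.cat")).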

