
Large language model to corpus


Introduction

The goal of this tool is to apply Large Language Model (LLM) operations to a monolingual corpus in order to generate a parallel corpus.

Use cases:

  • Asking a model to translate, summarize, or paraphrase the original sentences in order to benchmark its performance
  • Generating a corpus from a monolingual corpus, for example a translated corpus
  • Testing a prompt over a list of sentences for evaluation while developing prompts for your application

You provide an input file and a prompt, and the tool generates a target corpus.
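
The idea can be sketched in a few lines of Python. This is only an illustration and not the tool's actual implementation: `build_parallel_corpus`, the `model` callable, and `fake_model` are hypothetical names, and the sketch assumes one sentence per line and a request formed by the prompt followed by the sentence.

```python
from typing import Callable, Iterable, List


def build_parallel_corpus(
    sentences: Iterable[str],
    prompt: str,
    model: Callable[[str], str],
) -> List[str]:
    """Apply `prompt` to every source sentence and collect one output per input."""
    targets = []
    for sentence in sentences:
        sentence = sentence.strip()
        # Assumed request format: the prompt followed by the sentence.
        reply = model(f"{prompt} {sentence}")
        # Collapse whitespace so each reply fits on one line and
        # source and target stay aligned line by line.
        targets.append(" ".join(reply.split()))
    return targets


def fake_model(request: str) -> str:
    # Stand-in for a real LLM call so that the sketch runs offline.
    return request.upper()


targets = build_parallel_corpus(
    ["Hello world.", "Good morning."],
    "translate to French:",
    fake_model,
)
```

Keeping exactly one output line per input line is what makes the result usable as a parallel corpus: line N of the target file corresponds to line N of the source file.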

Quick start

For example, to use OpenAI ChatGPT to translate a file:

llm-to-corpus samples/eng.txt samples/fra.txt "translate to French"

To see models and options available:

llm-to-corpus --help

Usage

Evaluation with ChatGPT

Translate the Flores200 corpus to evaluate the quality of the Catalan translation:

llm-to-corpus samples/flores200.eng chatgpt.txt "Translate to Catalan the following text:"
pip install sacrebleu
sacrebleu samples/flores200.cat -i chatgpt.txt -m bleu chrf --format text
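
sacrebleu scores the hypothesis file against the reference line by line, so both files need the same number of lines. A small check like the following can catch a misaligned output before scoring. It is a sketch under that assumption; `check_aligned` is a hypothetical helper and is not part of llm-to-corpus or sacrebleu.

```python
from pathlib import Path


def check_aligned(reference_path: str, hypothesis_path: str) -> bool:
    """Return True when both files contain the same number of lines."""
    ref_lines = Path(reference_path).read_text(encoding="utf-8").splitlines()
    hyp_lines = Path(hypothesis_path).read_text(encoding="utf-8").splitlines()
    return len(ref_lines) == len(hyp_lines)
```

For the example above, this would be called as `check_aligned("samples/flores200.cat", "chatgpt.txt")`.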

Evaluation with Bloom

Translate the Flores200 corpus to evaluate the quality of the Catalan translation:

llm-to-corpus samples/flores200.eng bloom.txt "Translate to Catalan the following text:" --model mt0-xxl-mt
pip install sacrebleu
sacrebleu samples/flores200.cat -i bloom.txt -m bleu chrf --format text
