Language Model based sentences scoring library
lm-scorer
Synopsis
This package provides a simple programming interface to score sentences using different ML language models.
A simple CLI is also available for quick prototyping.
You can run it locally or directly on Colab using this notebook.
Do you believe that this is useful?
Has it saved you time?
Or maybe you simply like it?
If so, support this work with a Star ⭐️.
Install
pip install lm-scorer
Usage
import torch
from lm_scorer.models.auto import AutoLMScorer as LMScorer
# Available models
list(LMScorer.supported_model_names())
# => ["gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl", "distilgpt2"]
# Load model to cpu or cuda
device = "cuda:0" if torch.cuda.is_available() else "cpu"
scorer = LMScorer.from_pretrained("gpt2", device=device)
# Return token probabilities (provide log=True to return log probabilities)
scorer.tokens_score("I like this package.")
# => (scores, ids, tokens)
# scores = [0.018321, 0.0066431, 0.080633, 0.00060745, 0.27772, 0.0036381]
# ids = [40, 588, 428, 5301, 13, 50256]
# tokens = ["I", "Ġlike", "Ġthis", "Ġpackage", ".", "<|endoftext|>"]
# Compute sentence score as the product of tokens' probabilities
scorer.sentence_score("I like this package.", reduce="prod")
# => 6.0231e-12
# Compute sentence score as the mean of tokens' probabilities
scorer.sentence_score("I like this package.", reduce="mean")
# => 0.064593
# Compute sentence score as the geometric mean of tokens' probabilities
scorer.sentence_score("I like this package.", reduce="gmean")
# => 0.013489
# Compute sentence score as the harmonic mean of tokens' probabilities
scorer.sentence_score("I like this package.", reduce="hmean")
# => 0.0028008
# NB: Computations are done in log space so they should be numerically stable.
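The four reduce strategies above can be reproduced by hand from the token probabilities. The following is a minimal sketch, not the library's actual implementation: `reduce_log_probs` and `_logsumexp` are hypothetical helper names, and the input is the `scores` list from the `tokens_score` example. It assumes only the standard library and works on log probabilities throughout, which is one way to keep the product from underflowing on long sentences.

```python
import math
from typing import Sequence


def _logsumexp(values: Sequence[float]) -> float:
    """Return log(sum(exp(v))) without overflowing, by factoring out the max."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def reduce_log_probs(log_probs: Sequence[float], reduce: str = "prod") -> float:
    """Combine token log probabilities into a sentence log score.

    Hypothetical helper for illustration; it mirrors the strategy names
    (prod, mean, gmean, hmean) but is not taken from lm-scorer itself.
    """
    n = len(log_probs)
    if n == 0:
        raise ValueError("at least one token log probability is required")
    if reduce == "prod":
        # log(p1 * ... * pn) = sum of logs
        return sum(log_probs)
    if reduce == "mean":
        # log((p1 + ... + pn) / n)
        return _logsumexp(log_probs) - math.log(n)
    if reduce == "gmean":
        # log((p1 * ... * pn) ** (1 / n))
        return sum(log_probs) / n
    if reduce == "hmean":
        # log(n / (1/p1 + ... + 1/pn))
        return math.log(n) - _logsumexp([-lp for lp in log_probs])
    raise ValueError(f"unknown reduce strategy: {reduce!r}")


# Token probabilities copied from the tokens_score example above.
token_probs = [0.018321, 0.0066431, 0.080633, 0.00060745, 0.27772, 0.0036381]
token_log_probs = [math.log(p) for p in token_probs]

for name in ("prod", "mean", "gmean", "hmean"):
    # Exponentiate only at the end to get back a probability-like score.
    print(name, math.exp(reduce_log_probs(token_log_probs, name)))
```

The printed values should agree with the `sentence_score` outputs shown above up to the rounding of the listed token probabilities.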
CLI
The pip package includes a CLI that you can use to score sentences.
usage: lm-scorer [-h] [--model-name MODEL_NAME] [--tokens] [--log-prob]
                 [--reduce REDUCE] [--cuda CUDA] [--debug]
                 sentences-file-path

Get sentences probability using a language model.

positional arguments:
  sentences-file-path   A file containing sentences to score, one per line. If
                        - is given as filename it reads from stdin instead.

optional arguments:
  -h, --help            show this help message and exit
  --model-name MODEL_NAME, -m MODEL_NAME
                        The pretrained language model to use. Can be one of:
                        gpt2, gpt2-medium, gpt2-large, gpt2-xl, distilgpt2.
  --tokens, -t          If provided it provides the probability of each token
                        of each sentence.
  --log-prob, -lp       If provided log probabilities are returned instead.
  --reduce REDUCE, -r REDUCE
                        Reduce strategy applied on token probabilities to get
                        the sentence score. Available strategies are: prod,
                        mean, gmean, hmean.
  --cuda CUDA           If provided it runs the model on the given cuda
                        device.
  --debug               If provided it provides additional logging in case of
                        errors.
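If you want to drive the CLI from a Python script, the sketch below shows one possible approach. It assumes the `lm-scorer` executable is on your `PATH`; `build_command` and `score_sentences` are hypothetical helper names, not part of the package. Only flags listed in the help text above are used, and sentences are piped through stdin by passing `-` as the file name. The format of the CLI output is not specified here, so the raw stdout is returned as-is.

```python
import subprocess
from typing import List, Sequence


def build_command(
    sentences_path: str = "-",
    model_name: str = "gpt2",
    reduce: str = "prod",
    tokens: bool = False,
    log_prob: bool = False,
) -> List[str]:
    """Assemble an lm-scorer command line (hypothetical helper)."""
    cmd = ["lm-scorer", "--model-name", model_name, "--reduce", reduce]
    if tokens:
        cmd.append("--tokens")
    if log_prob:
        cmd.append("--log-prob")
    # The positional argument goes last; "-" means "read from stdin".
    cmd.append(sentences_path)
    return cmd


def score_sentences(sentences: Sequence[str], **options) -> str:
    """Pipe one sentence per line into the CLI and return its raw stdout.

    Assumes lm-scorer is installed; raises CalledProcessError on failure.
    """
    cmd = build_command("-", **options)
    result = subprocess.run(
        cmd,
        input="\n".join(sentences) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, `score_sentences(["I like this package."], reduce="gmean", log_prob=True)` would run `lm-scorer --model-name gpt2 --reduce gmean --log-prob -` with the sentence on stdin.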
Development
You can install this library locally for development using the commands below. If you don't have it already, you need to install poetry first.
# Clone the repo
git clone https://github.com/simonepri/lm-scorer
# CD into the created folder
cd lm-scorer
# Create a virtualenv and install the required dependencies using poetry
poetry install
You can then run commands inside the virtualenv by using poetry run COMMAND.
Alternatively, you can open a shell inside the virtualenv using poetry shell.
If you wish to contribute to this project, run the following commands locally before opening a PR and check that no error is reported (warnings are fine).
# Run the code formatter
poetry run task format
# Run the linter
poetry run task lint
# Run the static type checker
poetry run task types
# Run the tests
poetry run task test
Authors
- Simone Primarosa - simonepri
See also the list of contributors who participated in this project.
License
This project is licensed under the MIT License - see the license file for details.
Hashes for lm_scorer-0.4.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | b6af7673c3b409383a3db38f31d3a6359c0523b4cfbd7f65b30994628e0a1eaf
MD5 | 216e8758a05ea9dfed2fce64d0bc0f18
BLAKE2b-256 | 87b2ed86aabab59f95ddfbcb1cdbae4dbe5973655c821471ffc8fc47f9fdf6cd