Information Parity

A Python package for measuring Information Parity of language models across different languages.

Overview

Information Parity (IP) is a metric that can predict an LLM's capabilities across multiple languages in a task-agnostic manner. It measures how efficiently language models represent/predict text in different languages relative to a reference language (typically English). It uses cross-entropy loss as a proxy for representation efficiency and calculates a parity score between languages.

From Information Parity: Measuring and Predicting the Multilingual Capabilities of Language Models:

We propose a metric called Information Parity (IP) that can predict an LLM's capabilities across multiple languages in a task-agnostic manner. IP is well-motivated from an information theoretic perspective: it is associated with the LLM's efficiency of compressing the text in a given language compared to a reference language.

Key Features

  • Task-Agnostic Evaluation: Predicts language model capabilities across languages without requiring task-specific benchmarks
  • Information Theoretical Foundation: Based on the model's efficiency in predicting text across languages
  • Strong Correlation: Correlates more strongly with existing task-specific benchmark scores than other metrics such as Tokenization Parity (TP) and Tokenizer Fertility (TF)
  • Model Ranking: Useful for ranking multilingual LLM capabilities regardless of the downstream task

How Information Parity Works

Roughly speaking, for a text in language L, IP is the ratio of the negative log-likelihood of the English variant of the text to the negative log-likelihood of the language-L variant. The library calculates how efficiently a language model can predict tokens in different languages by:

  1. Computing the log loss (cross-entropy) for text in a reference language (usually English)
  2. Computing the log loss for equivalent text in another language
  3. Calculating the ratio between these values

A parity score of 1.0 indicates equal representation efficiency, while values below 1.0 suggest the model represents the non-reference language less efficiently.
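The three steps above reduce to a single ratio. As a minimal sketch (with hypothetical NLL values, not output from any real model), the core computation looks like this:

```python
def information_parity(reference_nll: float, other_nll: float) -> float:
    """IP = NLL(reference-language text) / NLL(target-language text).

    Both values are total negative log-likelihoods (cross-entropy summed
    over tokens) for parallel texts with the same meaning.
    """
    return reference_nll / other_nll

# Hypothetical totals: the model needs 120 nats for the English text
# but 150 nats for the translated text, so IP is below 1.0 — the model
# predicts the target language less efficiently than English.
score = information_parity(120.0, 150.0)
print(score)  # 0.8
```

In the library itself these NLLs come from a causal language model's cross-entropy loss over each text; this sketch only illustrates the final ratio.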

Installation

Requires Python 3.10+:

pip install information-parity

For running the FLORES-200 evaluation script, you'll need to install the optional dependencies:

pip install information-parity[evaluation]

Usage

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from information_parity import InformationParity

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Initialize InformationParity
ip = InformationParity(
    model=model,
    tokenizer=tokenizer,
    is_sentence_piece_tokenizer=True
)

# Calculate parity between English and another language
english_text = "This is an example text."
other_text = "Это пример текста."

parity_score = ip.compute_pair_information_parity(english_text, other_text)
print(f"Information parity score: {parity_score}")

Setting the is_sentence_piece_tokenizer Flag

The is_sentence_piece_tokenizer parameter is crucial for correct tokenization handling:

  • Set to True for models using SentencePiece tokenizers (like Llama, Gemma, and most modern multilingual models)
  • Set to False for models using BPE or other tokenization methods

This flag affects how beginning-of-sequence tokens are handled:

  • When True: The library assumes the tokenizer handles BOS tokens internally
  • When False: The library explicitly adds BOS tokens to the text

Using the wrong setting can lead to inaccurate parity score calculations. For most recent multilingual LLMs (Llama2, Gemma, Mistral), set this to True.
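The effect of the flag can be sketched as follows. This is an illustrative helper showing the two BOS-handling modes described above, not the library's actual internals; the `BOS` marker and function name are hypothetical:

```python
BOS = "<s>"  # hypothetical BOS marker; real tokenizers define their own

def prepare_text(text: str, is_sentence_piece_tokenizer: bool) -> str:
    """Sketch of the BOS handling described above.

    SentencePiece-style tokenizers prepend BOS internally, so the text
    is passed through unchanged; for other tokenizers a BOS token is
    prepended explicitly before encoding.
    """
    if is_sentence_piece_tokenizer:
        return text
    return BOS + text

print(prepare_text("Ejemplo", True))   # Ejemplo
print(prepare_text("Ejemplo", False))  # <s>Ejemplo
```

Getting this wrong means the BOS token is either counted twice or missing entirely, which skews the per-text log loss and hence the parity ratio.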

Evaluating Multiple Text Pairs

english_texts = ["First example", "Second example"]
spanish_texts = ["Primer ejemplo", "Segundo ejemplo"]

avg_parity, std_parity = ip.compute_information_parity(english_texts, spanish_texts)
print(f"Average parity: {avg_parity}, Standard deviation: {std_parity}")
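The aggregation over pairs can be sketched as below, assuming per-pair IP scores are averaged and the sample standard deviation is reported (the library may use a different convention, e.g. population standard deviation):

```python
import statistics

def aggregate_parity(pair_scores: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation over per-pair IP scores —
    a sketch of the kind of summary compute_information_parity returns."""
    return statistics.mean(pair_scores), statistics.stdev(pair_scores)

# Hypothetical per-pair scores
mean, std = aggregate_parity([0.5, 1.0, 1.5])
print(mean, std)  # 1.0 0.5
```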

Current Limitations

The current implementation processes text pairs sequentially and doesn't utilize GPU batching. This means that for large datasets, evaluation may take considerable time, especially with larger models.

We welcome contributions to improve performance through:

  • Implementing efficient GPU batching
  • Parallelizing computations where possible
  • Optimizing tokenization and inference processes

FLORES-200 Evaluation

This repository includes a script eval_flores_200.py to evaluate information parity across multiple languages using the FLORES-200 dataset:

python eval_flores_200.py

This will:

  1. Load the FLORES-200 dataset
  2. Evaluate information parity across all supported languages (using English as reference)
  3. Display and save detailed results

Supported Models

The library has been evaluated on several variants of open-source LLMs:

  • Llama2
  • Gemma
  • Mistral

Requirements

  • Python ≥ 3.10
  • transformers ≥ 4.51.2
  • torch ≥ 2.0.0
  • numpy ≥ 1.24.0
  • tqdm ≥ 4.64.1
  • datasets ≥ 2.14.0 (optional, for FLORES evaluation)

Contributing

Contributions are welcome! Areas that would particularly benefit from improvements include:

  • GPU batching implementation for faster processing
  • Support for additional model architectures
  • Improved caching mechanisms
  • Optimization of text processing pipelines

To contribute, please open an issue or submit a pull request on the repository.

Citation

If you use this package in your research, please cite the original paper:

@inproceedings{tsvetkov-kipnis-2024-information,
    title = "Information Parity: Measuring and Predicting the Multilingual Capabilities of Language Models",
    author = "Tsvetkov, Alexander  and
      Kipnis, Alon",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.468/",
    doi = "10.18653/v1/2024.findings-emnlp.468",
    pages = "7971--7989",
    abstract = "Large Language Models (LLMs) are increasingly deployed in user-facing applications worldwide, necessitating handling multiple languages across various tasks. We propose a metric called Information Parity (IP) that can predict an LLM's capabilities across multiple languages in a task-agnostic manner. IP is well-motivated from an information theoretic perspective: it is associated with the LLM's efficiency of compressing the text in a given language compared to a reference language. We evaluate IP and other popular metrics such as Tokenization Parity (TP) and Tokenizer Fertility (TF) on several variants of open-sourced LLMs (Llama2, Gemma, Mistral). Among all metrics known to us, IP is better correlated with existing task-specific benchmark scores from the literature and thus better predicts such scores in a certain language. These findings show that IP may be useful for ranking multilingual LLMs' capabilities regardless of the downstream task."
}

License

MIT License
