How Well Can Language Models Answer Questions in Czech?
Czech-SimpleQA
Problems and answers from OpenAI's SimpleQA eval translated into Czech. This work is based on the data from the paper:
Measuring short-form factuality in large language models. Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, William Fedus. arXiv preprint arXiv:2411.04368, 2024. https://arxiv.org/abs/2411.04368
| model | SimpleQA[^1] | Czech-SimpleQA |
|---|---|---|
| gpt-4o-mini-2024-07-18 | 9.5 | 8.1 |
| gpt-4o-2024-11-20 | 38.8 | 31.4 |
| claude-3-5-sonnet-20240620 | 35.0 | 25.8 |
| claude-3-5-sonnet-20241022 | N/A | 31.1 |
| claude-3-5-haiku-20241022 | N/A | 9.3 |
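To make the gap between the two columns easier to read, the differences can be computed directly from the rows that report both scores. This is a minimal sketch: the numbers are copied from the table above, the two models with N/A in the SimpleQA column are left out, and the helper name `score_gap` is made up for illustration.

```python
# Scores copied from the table above: (SimpleQA, Czech-SimpleQA).
# Models without a reported SimpleQA score are intentionally omitted.
SCORES = {
    "gpt-4o-mini-2024-07-18": (9.5, 8.1),
    "gpt-4o-2024-11-20": (38.8, 31.4),
    "claude-3-5-sonnet-20240620": (35.0, 25.8),
}


def score_gap(english: float, czech: float) -> tuple[float, float]:
    """Return (absolute drop in points, relative drop in percent)."""
    absolute = round(english - czech, 1)
    relative = round(100 * (english - czech) / english, 1)
    return absolute, relative


for model, (english, czech) in SCORES.items():
    absolute, relative = score_gap(english, czech)
    print(f"{model}: {absolute} points lower in Czech ({relative}% relative drop)")
```

Every model with both scores reported does worse on the Czech version, and the relative drop is larger for the two stronger models than for gpt-4o-mini.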
There is a post on my blog with more detailed results!

[^1]: As reported in the SimpleQA README.md and in the paper.
What the Data Looks Like:
| problem | target | czech_problem | czech_target |
|---|---|---|---|
| What was the population count in the 2011 census of the Republic of Nauru? | 10,084 | Jaký byl počet obyvatel při sčítání lidu v roce 2011 v Republice Nauru? | 10 084 |
I Just Want the Eval Data
The data file lives at src/czech_simpleqa/czech_simpleqa.csv.gz in the repository; its full raw URL is the one used in the pandas snippet below.
Getting it with pandas looks like this:
import pandas as pd

eval_data = pd.read_csv(
    "https://raw.githubusercontent.com/jancervenka/"
    "czech-simpleqa/refs/heads/main/src/czech_simpleqa/czech_simpleqa.csv.gz"
)
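Once loaded, the frame can be reduced to just the Czech question/answer pairs. The following is a minimal sketch, assuming the CSV exposes the column names shown in the sample table above (`czech_problem`, `czech_target`). The helper `czech_qa_pairs` is hypothetical and not part of the package, and the in-memory `sample` frame stands in for the downloaded data so the snippet runs offline.

```python
import pandas as pd


def czech_qa_pairs(eval_data: pd.DataFrame) -> list[tuple[str, str]]:
    """Return (czech_problem, czech_target) pairs as plain strings."""
    # Assumption: these column names match the sample table in this README.
    required = {"czech_problem", "czech_target"}
    missing = required - set(eval_data.columns)
    if missing:
        raise KeyError(f"missing expected columns: {sorted(missing)}")
    pairs = eval_data[["czech_problem", "czech_target"]].astype(str)
    return list(pairs.itertuples(index=False, name=None))


# Offline stand-in for the real data, built from the sample row above.
sample = pd.DataFrame(
    {
        "problem": ["What was the population count in the 2011 census of the Republic of Nauru?"],
        "target": ["10,084"],
        "czech_problem": ["Jaký byl počet obyvatel při sčítání lidu v roce 2011 v Republice Nauru?"],
        "czech_target": ["10 084"],
    }
)

for question, answer in czech_qa_pairs(sample):
    print(question, "->", answer)
```

Passing the `eval_data` frame from the snippet above instead of `sample` should work the same way, provided the column names match.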
I Want to Use the Python Package
The package contains everything required to run the eval end-to-end and collect the results.
You can install it with pip or any other Python package manager:
pip install czech-simpleqa
Once installed, run the eval as a module from the command line:
python -m czech_simpleqa.eval \
    --answering_model claude-3-5-haiku-20241022 \
    --grading_model gpt-4o \
    --output_file_path output/claude-3-5-haiku-20241022.csv \
    --max_concurrent_tasks 30
CLI Arguments
- --answering_model: Model that will generate predicted answers to the problems in the eval.
- --grading_model: Model that will grade the predicted answers from the answering model.
- --output_file_path: Where to store the .csv file with the eval results.
- --max_concurrent_tasks: Maximum number of concurrent model calls (default 20).
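To evaluate several answering models in a row, the invocation can be assembled programmatically. This is a sketch only: it uses just the four flags listed above, `build_eval_command` is a hypothetical helper that the package does not provide, and naming each output file after its answering model is a convention borrowed from the example command.

```python
import sys
from pathlib import Path


def build_eval_command(
    answering_model: str,
    grading_model: str,
    output_dir: str,
    max_concurrent_tasks: int = 20,  # mirrors the documented CLI default
) -> list[str]:
    """Build the argv list for one `python -m czech_simpleqa.eval` run."""
    output_file = Path(output_dir) / f"{answering_model}.csv"
    return [
        sys.executable, "-m", "czech_simpleqa.eval",
        "--answering_model", answering_model,
        "--grading_model", grading_model,
        "--output_file_path", output_file.as_posix(),
        "--max_concurrent_tasks", str(max_concurrent_tasks),
    ]


commands = [
    build_eval_command(model, "gpt-4o", "output")
    for model in ("claude-3-5-haiku-20241022", "gpt-4o-mini-2024-07-18")
]

for command in commands:
    print(" ".join(command))
    # To actually launch a run (needs the package installed and API keys set):
    # import subprocess; subprocess.run(command, check=True)
```

Building the command as a list rather than a shell string avoids quoting issues when it is passed to `subprocess.run`.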
Output File Schema
| problem | target | predicted_answer | grade |
|---|---|---|---|
| Jaké je rozlišení Cat B15 Q v pixelech? | 480 x 800 | Cat B15 Q má rozlišení 480 x 800 pixelů. | A |
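The output file can be summarized into per-grade percentages with a few lines of pandas. This is a sketch under one loud assumption: the letters follow the original SimpleQA grader convention (A = correct, B = incorrect, C = not attempted). Only the grade A appears in the sample above, so the meaning of the other letters should be checked against the package before relying on it. `summarize_grades` is a hypothetical helper.

```python
import pandas as pd

# Assumption: same letter convention as the original SimpleQA grader.
GRADE_LABELS = {"A": "correct", "B": "incorrect", "C": "not_attempted"}


def summarize_grades(results: pd.DataFrame) -> dict[str, float]:
    """Return the percentage of rows per grade label, rounded to one decimal."""
    total = len(results)
    if total == 0:
        return {label: 0.0 for label in GRADE_LABELS.values()}
    # Unknown letters map to NaN and are not counted under any label,
    # but they still count toward the total.
    counts = results["grade"].map(GRADE_LABELS).value_counts()
    return {
        label: round(100 * int(counts.get(label, 0)) / total, 1)
        for label in GRADE_LABELS.values()
    }


# Stand-in for pd.read_csv("output/claude-3-5-haiku-20241022.csv").
results = pd.DataFrame({"grade": ["A", "A", "B", "C"]})
print(summarize_grades(results))
```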
Supported Models
Models from OpenAI and Anthropic are currently supported. Set the OPENAI_API_KEY and/or ANTHROPIC_API_KEY environment variable, depending on which providers the answering and grading models come from.
File details
Details for the file czech_simpleqa-0.1.0.tar.gz.
File metadata
- Download URL: czech_simpleqa-0.1.0.tar.gz
- Upload date:
- Size: 908.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 022ebe2f3e91f4cb3be31ea5540730d51923df243486bd8eb96cc8fa316df751 |
| MD5 | ecb44ba070a6dff4ff42be42bd81b4fb |
| BLAKE2b-256 | 6185b15fcb1d022c1de8f532561033414f35d815f2a799eba29e4db23b98d92a |
File details
Details for the file czech_simpleqa-0.1.0-py3-none-any.whl.
File metadata
- Download URL: czech_simpleqa-0.1.0-py3-none-any.whl
- Upload date:
- Size: 865.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dbc13c52def5586a2d70eec825a3e41d7e61815a8ded80c5ccd84c4876a0fd75 |
| MD5 | 89aefb773c7efbba952360cf66ffe549 |
| BLAKE2b-256 | 3aa58be57422ac818663bb91a3cc1bb9c551c3a815d823a22f8a01fac181f535 |