A package for generating multilingual symbolic GSM math problems
multilingual-gsm-symbolic
A Python package for generating synthetic multilingual math problems from symbolic templates. A single problem template can yield more than a thousand concrete examples, which lets you test whether an LLM actually understands a problem or merely pattern-matched the original.
⏳ Installation
pip install multilingual-gsm-symbolic
👩‍💻 Get started
from multilingual_gsm_symbolic import load_data, load_replacements, available_languages
# see possible languages
languages = available_languages()
lang = "eng"
print(languages[lang])
# {"number of samples": 100}
# Load English templates (default)
templates = load_data(lang)
# Load language-specific replacement values (used in some templates)
replacements = load_replacements(lang)
# Generate concrete questions from a template
template = templates[0]
questions = template.generate_questions(n=10, language="eng", replacements=replacements)
for q in questions:
    print(q.question)
    print(q.answer)
    print()
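One way to use the generated questions for evaluation is to compare the final number in a model's reply with the final number in `q.answer`. The sketch below is illustrative only and does not use the package: `GeneratedItem`, `final_number`, `accuracy`, and the `ask_model` callable are hypothetical names, and it assumes the answer text ends with the final numeric result.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class GeneratedItem:
    """Hypothetical stand-in exposing the same .question/.answer fields."""
    question: str
    answer: str


_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def final_number(text: str) -> Optional[float]:
    """Return the last number found in the text, or None if there is none."""
    matches = _NUMBER.findall(text.replace(",", ""))
    return float(matches[-1]) if matches else None


def accuracy(items: Iterable[GeneratedItem], ask_model: Callable[[str], str]) -> float:
    """Fraction of items where the model's final number equals the expected one."""
    items = list(items)
    if not items:
        return 0.0
    correct = 0
    for item in items:
        expected = final_number(item.answer)
        predicted = final_number(ask_model(item.question))
        if expected is not None and predicted == expected:
            correct += 1
    return correct / len(items)


items = [
    GeneratedItem(
        question="A fog bank rolls in over a city at 3 miles/hour. The city is 42 miles wide. "
                 "How many hours will it take for the fog bank to cover the city?",
        answer="At 3 miles/hour, it will take 42/3=14 hours for the fog to cover the city.",
    )
]

# A fake "model" that always answers 14; replace with a real model call.
print(accuracy(items, lambda question: "The answer is 14"))
```

Comparing only the last number is a deliberately crude check; answers with units, fractions, or trailing numbers would need a stricter parser.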
📋 Template format
Templates are JSON files with four fields:
| Field | Description |
|---|---|
| `question` | Concrete question (the original example) |
| `answer` | Concrete answer with calculation steps |
| `question_annotated` | Template with variable placeholders and `#init` / `#conditions` / `#answer` sections |
| `answer_annotated` | Answer template with inline expressions |
Annotated question syntax
{variable, default_value} — placeholder in the question text
#init:
- $var = range(low, high) — variable sampled from a range
- $var = sample([a, b, c]) — variable sampled from a list
#conditions:
- is_int(x / y) — constraint that must hold for a combination to be valid
#answer: x * y + z — Python expression evaluated to produce the numeric answer
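To make the placeholder syntax concrete, here is a minimal sketch of how `{variable, default_value}` placeholders could be read and filled with the standard library. It is not the package's implementation: the regular expression and the helper names `default_assignments` and `fill_placeholders` are assumptions made for illustration, and the sketch only handles the question text, not the `#init` / `#conditions` / `#answer` sections.

```python
import re

# Matches {name,default} or {name, default}. This pattern is an assumption
# based on the syntax described above, not taken from the package.
PLACEHOLDER = re.compile(r"\{(\w+),\s*([^{}]*)\}")


def default_assignments(question_text: str) -> dict[str, str]:
    """Collect the default value written next to each placeholder name."""
    return {m.group(1): m.group(2).strip() for m in PLACEHOLDER.finditer(question_text)}


def fill_placeholders(question_text: str, assignments: dict) -> str:
    """Replace each placeholder with its assigned value, falling back to the default."""
    def substitute(match: re.Match) -> str:
        name, default = match.group(1), match.group(2).strip()
        return str(assignments.get(name, default))

    return PLACEHOLDER.sub(substitute, question_text)


text = (
    "A fog bank rolls in over a city at {speed,3} miles/hour. "
    "The city is {width,42} miles wide."
)

print(default_assignments(text))
print(fill_placeholders(text, {"speed": 5, "width": 60}))
```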
Example: fog bank problem
{
"question": "A fog bank rolls in over a city at 3 miles/hour. The city is 42 miles wide. How many hours will it take for the fog bank to cover the city?",
"question_annotated": "A fog bank rolls in over a city at {speed,3} miles/hour. The city is {width,42} miles wide. How many hours will it take for the fog bank to cover the city?\n#init:\n- $speed = range(1, 20)\n- $width = range(2, 100)\n#conditions:\n- is_int(width / speed)\n#answer: width // speed",
"answer": "At 3 miles/hour, it will take 42/3=14 hours for the fog to cover the city.",
"answer_annotated": "At {speed} miles/hour, it will take {width}/{speed}={width//speed} hours for the fog to cover the city."
}
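The `#init` and `#conditions` sections of the fog bank template can be read as "enumerate the ranges, keep the combinations that satisfy every condition". The sketch below reproduces that idea with the standard library only. It is an illustration rather than the package's sampler: it assumes `range(low, high)` excludes `high`, as in the helper table, and `is_int` is re-implemented locally.

```python
import random
from itertools import product

speeds = range(1, 20)   # $speed = range(1, 20)
widths = range(2, 100)  # $width = range(2, 100)


def is_int(value: float) -> bool:
    """Local stand-in for the template helper: True if value is a whole number."""
    return float(value).is_integer()


# Keep only the (speed, width) pairs that satisfy the #conditions line.
valid = [(s, w) for s, w in product(speeds, widths) if is_int(w / s)]

# Draw one valid combination and evaluate the #answer expression for it.
rng = random.Random(0)  # seeded only to make the sketch repeatable
speed, width = rng.choice(valid)
answer = width // speed

print(len(valid), "valid combinations")
print(f"speed={speed}, width={width}, answer={answer}")
```

Under these assumptions the two ranges and one condition already give 343 distinct instances of this single problem, and the original values (speed 3, width 42) are one of them.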
Example: shopping problem
{
"question": "A store sells apples for $2 each and oranges for $3 each. If you buy 4 apples and 5 oranges, how much do you spend?",
"question_annotated": "A store sells apples for ${apple_price,2} each and oranges for ${orange_price,3} each. If you buy {n_apples,4} apples and {n_oranges,5} oranges, how much do you spend?\n#init:\n- $apple_price = range(1, 10)\n- $orange_price = range(1, 10)\n- $n_apples = range(1, 20)\n- $n_oranges = range(1, 20)\n#conditions:\n- True\n#answer: apple_price * n_apples + orange_price * n_oranges",
"answer": "You spend 4*2 + 5*3 = 8 + 15 = $23.",
"answer_annotated": "You spend {n_apples}*{apple_price} + {n_oranges}*{orange_price} = {n_apples*apple_price} + {n_oranges*orange_price} = ${apple_price*n_apples + orange_price*n_oranges}."
}
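The inline expressions in `answer_annotated` can be thought of as small Python expressions evaluated against the sampled variables. The following sketch shows that idea on the shopping template; `render_answer` is a hypothetical helper written for this example, not the package's `format_answer`, and it uses `eval`, which is only reasonable for templates you wrote or trust.

```python
import re

# Any {...} group is treated as a Python expression over the assigned variables.
INLINE_EXPR = re.compile(r"\{([^{}]+)\}")


def render_answer(template: str, assignments: dict) -> str:
    """Evaluate every {expression} in the template and splice in the result."""
    def evaluate(match: re.Match) -> str:
        # Builtins are emptied so expressions can only see the assigned variables.
        value = eval(match.group(1), {"__builtins__": {}}, dict(assignments))
        return str(value)

    return INLINE_EXPR.sub(evaluate, template)


template = (
    "You spend {n_apples}*{apple_price} + {n_oranges}*{orange_price} = "
    "{n_apples*apple_price} + {n_oranges*orange_price} = "
    "${apple_price*n_apples + orange_price*n_oranges}."
)

# The default values from the concrete example.
defaults = {"apple_price": 2, "orange_price": 3, "n_apples": 4, "n_oranges": 5}

rendered = render_answer(template, defaults)
print(rendered)  # You spend 4*2 + 5*3 = 8 + 15 = $23.
```

With the default values the rendered text matches the concrete `answer` field, which is a convenient sanity check when writing a new template.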
Available helper functions
| Function | Description |
|---|---|
| `range(start, end[, step])` | All integers in `[start, end)` |
| `range_list(start, end[, step])` | Same as `range` (explicit alias) |
| `range_str(start, end, step, numbers)` | Pairs `(number, index)` from a list |
| `arange(start, end[, step])` | Sample from evenly spaced floats |
| `sample(items[, n])` | One value (or `n` values) from a list |
| `sample_sequential(items, n)` | `n` consecutive items starting at a random index |
| `is_int(x)` | True if `x` is a whole number |
| `divides(a, b)` | True if `a % b == 0` (returns False if `b == 0`) |
| `Fraction(x)` | Format `x` as a fraction string, e.g. `"3/4"` |
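For readers who want to reason about conditions without installing the package, three of the helpers can be approximated from their descriptions in the table. These are re-implementations written for illustration, not the package's code. The optional `rng` argument is a hypothetical addition for reproducibility, and `sample_sequential` is assumed not to wrap around the end of the list.

```python
import random
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def is_int(x: float) -> bool:
    """True if x is a whole number, e.g. 14.0 but not 3.5."""
    return float(x).is_integer()


def divides(a: int, b: int) -> bool:
    """True if a % b == 0; False when b is zero instead of raising."""
    return b != 0 and a % b == 0


def sample_sequential(items: Sequence[T], n: int,
                      rng: Optional[random.Random] = None) -> list[T]:
    """Return n consecutive items starting at a random index (no wrap-around)."""
    if not 0 < n <= len(items):
        raise ValueError("n must be between 1 and len(items)")
    rng = rng or random.Random()
    start = rng.randrange(len(items) - n + 1)
    return list(items[start:start + n])


days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
print(is_int(42 / 3), divides(42, 3), divides(42, 0))
print(sample_sequential(days, 3))
```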
🗃️ Data
The English templates are derived from Apple's GSM-Symbolic paper. The Danish templates are manual translations and localizations of the English set, validated both computationally and manually. The original concrete problems are from GSM8K.
| Language | Code | Templates |
|---|---|---|
| English | `eng` | 100 |
| Danish | `dan` | 100 |
📖 API reference
function load_data
load_data(language="eng", directory=None) → list[AnnotatedQuestion]
Load symbolic templates.
| Argument | Type | Description |
|---|---|---|
| `language` | `str` | Language code, e.g. `"eng"` (default) or `"dan"` |
| `directory` | `Path \| None` | Override the bundled data; load templates from this path instead |
| RETURNS | `list[AnnotatedQuestion]` | The loaded templates |
function load_replacements
load_replacements(language="eng") → dict
Load language-specific named values (e.g. lists of names, places) used inside templates.
| Argument | Type | Description |
|---|---|---|
| `language` | `str` | Language code, e.g. `"eng"` (default) |
| RETURNS | `dict` | Mapping of replacement name → value list |
function load_gsm
load_gsm(language="eng", directory=None) → list[GSMProblem]
Load the bundled concrete problems for a given language.
| Argument | Type | Description |
|---|---|---|
| `language` | `str` | Language code, e.g. `"eng"` (default) |
| `directory` | `Path \| None` | Override the bundled data directory |
| RETURNS | `list[GSMProblem]` | The loaded concrete problems |
class AnnotatedQuestion
Core class representing a symbolic template. Constructed from a JSON template file via AnnotatedQuestion.from_json(path).
method AnnotatedQuestion.generate_questions
Generate concrete Question instances from the template.
| Argument | Type | Description |
|---|---|---|
| `n` | `int` | Number of questions to generate |
| `language` | `str` | Language code for rendered text |
| `replacements` | `dict` | Replacement values from `load_replacements` |
| RETURNS | `list[Question]` | The generated questions |
method AnnotatedQuestion.get_default_assignments
Extract the example variable values from the template.
| Argument | Type | Description |
|---|---|---|
| `replacements` | `dict` | Replacement values from `load_replacements` |
| RETURNS | `dict` | Mapping of variable name → default value |
method AnnotatedQuestion.format_question
Render the question text for a given variable assignment.
| Argument | Type | Description |
|---|---|---|
| `assignments` | `dict` | Variable name → value mapping |
| `language` | `str` | Language code for rendered text |
| RETURNS | `str` | The rendered question string |
method AnnotatedQuestion.format_answer
Render the answer text for a given variable assignment.
| Argument | Type | Description |
|---|---|---|
| `assignments` | `dict` | Variable name → value mapping |
| `language` | `str` | Language code for rendered text |
| RETURNS | `str` | The rendered answer string |
class Question
Dataclass holding a single generated problem.
| Attribute | Type | Description |
|---|---|---|
| `question` | `str` | The rendered question text |
| `answer` | `str` | The rendered answer text |
| `id_orig` | `int` | Index of the original template |
| `id_shuffled` | `int` | Index within the shuffled sample |
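Because every generated problem carries `id_orig`, results can be grouped per template, for example to see whether a model's accuracy varies across variants of the same problem. The sketch below mirrors the attribute table with a stand-in dataclass. `QuestionSketch` and `group_by_template` are hypothetical names, the example data is made up for illustration, and the real `Question` class may define more than is shown here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QuestionSketch:
    """Stand-in with the four attributes listed in the table above."""
    question: str
    answer: str
    id_orig: int
    id_shuffled: int


def group_by_template(questions: list[QuestionSketch]) -> dict[int, list[QuestionSketch]]:
    """Bucket generated problems by the template they came from."""
    groups: dict[int, list[QuestionSketch]] = defaultdict(list)
    for q in questions:
        groups[q.id_orig].append(q)
    return dict(groups)


# Made-up example data: two variants of template 0 and one of template 1.
generated = [
    QuestionSketch("variant A of template 0", "14", id_orig=0, id_shuffled=0),
    QuestionSketch("variant B of template 0", "12", id_orig=0, id_shuffled=1),
    QuestionSketch("variant A of template 1", "23", id_orig=1, id_shuffled=0),
]

for template_id, variants in group_by_template(generated).items():
    print(template_id, len(variants))
```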
class GSMProblem
Pydantic model for a concrete problem loaded from disk.
| Attribute | Type | Description |
|---|---|---|
| `question` | `str` | The question text |
| `answer` | `str` | The answer text |
| `id_orig` | `int` | Original problem index |
| `filepath` | `Path` | Path to the source file on disk |
Acknowledgement
The symbolic template engine and the Danish subset were originally developed as part of the m-gsm-symbolic project at the Centre for Humanities Computing.
The initial template format was derived from Apple's GSM-Symbolic paper, and the original concrete problems are from GSM8K.