a tool to measure the grammatical well-formedness of multilingual texts
Morphological Evaluation of NLG
L'AMBRE is a tool to measure the grammatical well-formedness of texts generated by NLG systems. It analyzes the dependency parses of the text using morpho-syntactic rules and returns a well-formedness score. The tool uses the Surface Syntactic Universal Dependencies (SUD) project both for extracting rules and for parsing, and is therefore applicable across languages. See our EMNLP 2021 paper for more details.
Getting Started
Install from PyPI
python -m pip install lambre
Try L'AMBRE
For a given input text file, `lambre` computes a morpho-syntactic well-formedness score in the range [0, 1]. The following method first downloads the parsers and rule sets for the specified language before computing the document-level score. See the output folder (`out`) for error visualizations.
>>> import lambre
>>> with open("data/txt/ru.txt", "r") as rf:
...     data = rf.readlines()
>>> lambre.score("ru", data)
0.9962
L'AMBRE can also be used from the command line. See `lambre --help` for more options.
lambre ru data/txt/ru.txt
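To score several files in one pass, the Python call shown earlier can be wrapped in a small helper. The following is a minimal sketch, not part of the `lambre` package: `read_sentences` and `score_files` are hypothetical names, and dropping blank lines before scoring is an assumption rather than something `lambre` requires. The scorer is passed in as an argument so that the helper itself does not depend on any particular `lambre` version.

```python
from typing import Callable, Dict, Iterable, List


def read_sentences(path: str) -> List[str]:
    """Return the non-empty lines of a UTF-8 text file, stripped of whitespace.

    Assumption: one sentence per line; blank lines are skipped.
    """
    with open(path, "r", encoding="utf-8") as rf:
        return [line.strip() for line in rf if line.strip()]


def score_files(
    lang: str,
    paths: Iterable[str],
    score_fn: Callable[[str, List[str]], float],
) -> Dict[str, float]:
    """Score each file separately and return a {path: score} mapping.

    `score_fn` is expected to be called like `lambre.score(lang, data)`.
    """
    return {str(path): score_fn(lang, read_sentences(str(path))) for path in paths}
```

With `lambre` installed, this could be called as `score_files("ru", ["data/txt/ru.txt"], lambre.score)`.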
Morpho-syntactic Rules
`lambre` currently supports two rule sets, `chaudhary-etal-2021` (see Chaudhary et al., 2020, 2021) and `pratapa-etal-2021` (see Pratapa et al., 2021). The former is the default, but the rule set can be specified using the `--rule-set` option.
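When driving the command-line tool from a script, the rule set choice can be validated before the command is run. The sketch below only assembles the argument list. `build_lambre_command` is a hypothetical helper, and placing `--rule-set` after the positional arguments is an assumption about the CLI; check `lambre --help` for the exact usage.

```python
from typing import List, Optional

# Rule set names as listed in this README.
RULE_SETS = ("chaudhary-etal-2021", "pratapa-etal-2021")


def build_lambre_command(
    lang: str, input_path: str, rule_set: Optional[str] = None
) -> List[str]:
    """Build an argument list for the `lambre` command-line tool.

    If `rule_set` is None, no flag is added and the tool's default applies.
    """
    cmd = ["lambre", lang, input_path]
    if rule_set is not None:
        if rule_set not in RULE_SETS:
            raise ValueError(f"unknown rule set: {rule_set!r}")
        # Assumption: the option may follow the positional arguments.
        cmd += ["--rule-set", rule_set]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the standard library.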
Visualization Examples
Along with the overall L'AMBRE score, we write the erroneous sentences to the output folder `out/errors`. We provide two visualizations: i) plain text (`errors.txt`), and ii) HTML (`errors/*.html`). For plain text visualization, we use the ipymarkup tool. We use brat and Universal Dependencies for HTML visualizations.
Below is a sample run on 1000 example Hindi sentences from the Samanantar corpus.
>>> import lambre
>>> with open("examples/hi_sents_1k.txt", "r") as rf:
...     data = rf.readlines()
>>> lambre.score("hi", data)
0.8821
A few erroneous sentences from this corpus (as detected by L'AMBRE):
Input sentence: संख्या की स्टाफ प्रशिक्षित में हिंदी/वाले जानने हिंदी
(Stenography Hindi in trained persons of No.)
तीनवर्षतक अग्रनीत किए| जाने के बाद व्यपगत हुए| आरक्षणों की| संख्या
(of after forward No. reservations lapsed carrying for 3 years)
Below, we show the visualizations of word order related errors for the above two sentences. We also generate separate files for agreement and case marking (see examples/ for full HTML outputs).
Parser
We provide SUD parsers trained using the Stanza toolkit. See Section 4 of our paper for more details.
Supported Languages
We currently support the following languages. `lambre` automatically downloads the necessary language-specific resources (when available).
Language | Code | Language | Code | Language | Code | Language | Code |
---|---|---|---|---|---|---|---|
Catalan | ca | Spanish | es | Italian | it | Russian | ru |
Czech | cs | Estonian | et | Latvian | lv | Slovenian | sl |
Danish | da | Persian | fa | Dutch | nl | Swedish | sv |
German | de | French | fr | Polish | pl | Ukrainian | uk |
Greek | el | Hindi | hi | Portuguese | pt | Urdu | ur |
English | en | Indonesian | id | Romanian | ro | | |
To manually download rules or parsers for a given language:
>>> import lambre
>>> lambre.download("ru") # Russian
Reference
If you find this toolkit helpful in your research, consider citing our paper,
@inproceedings{pratapa-etal-2021-evaluating,
title = "Evaluating the Morphosyntactic Well-formedness of Generated Texts",
author = "Pratapa, Adithya and
Anastasopoulos, Antonios and
Rijhwani, Shruti and
Chaudhary, Aditi and
Mortensen, David R. and
Neubig, Graham and
Tsvetkov, Yulia",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2021",
address = "Online and Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-main.570",
pages = "7131--7150",
}
We also encourage you to cite the original works for the `chaudhary-etal-2021` rule set, Chaudhary et al., 2020 and Chaudhary et al., 2021.
License
L'AMBRE is available under the MIT License. The code for training parsers is adapted from stanza, which is available under the Apache License, Version 2.0.
Issues
For any issues, questions, or requests, please use the GitHub Issue Tracker.