No project description provided
Project description
FSTmorph
This repository houses a set of language-neutral tools for converting CSVs to a finite-state transducer (FST), as well as for testing that FST via YAML files.
Contents
User Instructions
Examples are given for Ojibwe, with the relevant language data accessible in other repos. However, this FST-generating code is intended to be applicable for other Algonquian languages and beyond -- if you have the necessary spreadsheets for your target language, it should be compatible with this code!
Getting set up to build the FST
Will need updating after repo restructuring!
You will need to install the foma compiler and flookup program (which
is part of the foma toolkit). On Mac or Linux, the easiest way to install is via homebrew. Just use the command brew install foma. Alternatively, there are other installation instructions here (including for Windows users).
Note for Windows users: In addition to the page given above, we found these instructions useful for installing. Also, ensure that the directory you add to your PATH immediately contains
foma.exeandflookup.exe. For example, if the path tofoma.exeisC:\Program Files (x86)\Foma\win32\foma.exe, then addC:\Program Files (x86)\Foma\win32(notC:\Program Files (x86)\Foma\) to your PATH.
This project uses poetry to manage python requirements. These can be installed as follows:
- Navigate to
ParserTools/csv2fst. - Create a virtual environment (using the python command that works on your system e.g.,
pythoninstead ofpython3if needed):
python3 -m venv myvenv
- Activate the virtual environment.
On Mac or Linux, run:
source myvenv/bin/activate
On Windows, run:
source myvenv/Scripts/activate
- Update
pipandsetuptools:
pip3 install -U pip setuptools
- Install
poetry:
pip3 install poetry
- Use poetry to install the project into the virtual environment:
poetry install
Now, the dependencies are set up!
Eventually, when you're done running the virtual environment, you can close it via:
deactivate myvenv
Note: The
Makefilein this directory begins with some variables you can change if you run into errors. For example, you can change the command for running python (on Windows, we had to change this frompython3topython).
If you're working towards building the example FST for Ojibwe, once you're done with installing these prerequisites, carry on with the instructions in OjibweMorph. But make sure you keep the virtual environment you set up here up and running!
- Just change directories so that you end up back in your local copy of OjibweMorph, ready to run commands there, with this virtual environment still active. For example, if you installed OjibweMorph and ParserTools in the same directory, just use the command
cd ../..OjibweMorph/(because you're currently inParserTools/csv2fst/). - If you get a
ModuleNotFoundErrorwhen running commands in OjibweMorph, it may be that you are no longer in the virtual environment (and so the modules installed in the virtual environment are not accessible). - You can always check that the name of the virtual environment still appears in your command line prompt to know if it is active.
Building the FST
The FST is built using a Makefile. Before building, there are three variables within the Makefile which must be set to point to the right directory locations:
MORPHOLOGYSRCDIRpoints to a directory that contains most of the morphological information needed to build the FST. The example directory (for Border Lakes Ojibwe) is OjibweMorph.LEMMAS_DIRpoints to a directory that contains CSVs listing all the lemmas that will be used to build the FST. An example directory (for Border Lakes Ojibwe) is OjibweLexicon/OPD.- This variable can also be set to a list of directories (each containing CSVs to be used), separated by a comma.
OUTPUT_DIRpoints to the directory where all files generated will go. Note that in order for the testing code to work (i.e., when runningmake check), this must be a relative path (for some reason).
You should go into the Makefile and edit the values of these variables so that the correct directory is specified. Once complete, you can run make all (or just make) to build the FST (e.g., ojibwe.fomabin). This will create a directory generated which contains the FST, lexc files and XFST rules. Once complete, you can use the FST -- check here for an example.
Alternatively, rather than editing the Makefile contents, you can just specify the directory paths when you call make all. For example:
make all MORPHOLOGYSRCDIR=~/Documents/OjibweMorph LEMMAS_DIR=~/Documents/OjibweLexicon/OPD OUTPUT_DIR=../../OjibweMorph/FST
These variables should also be set when running make clean.
Additionally, there are two other variables that must be specified if you're going to run the tests for the FST:
SPREADSHEETS_FOR_YAML_DIRpoints to a directory which contains CSVs for running the YAML tests. An example directory (for Border Lakes Ojibwe) is OjibweLexicon/OPD/for_yaml.PARADIGM_MAPS_DIRpoints to a directory which contains paradigm maps created for sorting words into different categories, but used here because they contain a complete list of 'Class' values that become the test sections (e.g., VAI_V, VAI_VV, etc.). An example directory (for Border Lakes Ojibwe) is OjibweLexicon/resources.
So when running make check to run the FST tests, you either need to edit all five variables right in the Makefile, or supply their values when calling make check:
make check MORPHOLOGYSRCDIR=~/Documents/OjibweMorph LEMMAS_DIR=~/Documents/OjibweLexicon/OPD SPREADSHEETS_FOR_YAML_DIR=~/Documents/OjibweLexicon/OPD/for_yaml PARADIGM_MAPS_DIR=~/Documents/OjibweLexicon/resources OUTPUT_DIR=../../OjibweMorph/FST
Also written into the Makefile are the expected names of many of these files (e.g., the paradigm map file for the verb tests of the FST being called VERBS_paradigm_map.csv), so if any of these names differ, the Makefile will have to be updated accordingly.
Running the YAML Tests
The code for running tests based on the generated YAML files comes from giella-core. A version of their morph-test.py script is included in this repo (as run_yaml_tests.py), modified to customize the .log output format.
Run the tests with make check. This will generate three log files:
paradigm-test.log: A smaller set of tests covering the noun and verb spreadsheets inOjibweMorph.opd-test.log: A larger set of tests covering an external lexical resource, the OPD.
The csv2lexc.py script
Usage: csv2lexc.py [OPTIONS]
Options:
--config-file TEXT JSON config file [required]
--lexc-path TEXT Directory where output lexc files are stored [required]
--read-lexical-database BOOLEAN
Whether to include lexemes from an external
lexicon database
--help Show this message and exit.
JSON configuration files
You need to specify a JSON configuration file which controls the
generation of lexc files. See
verbs.json
in the OjibweMorph
repository for an example.
You need to specify the following parameters:
| Parameter | Description | Example |
|---|---|---|
comments |
Comments. | "This is a comment" |
source_path |
Directory where source CSV files reside. | "~/Documents/OjibweMorph/VerbSpreadsheets/" |
regular_csv_files |
List of CSV files which contain regular paradigms (please omit .csv suffix) |
["VAI_IND","VTA_CNJ",...] |
irregular_csv_files |
List of CSV files which contain irregular paradigms (please omit .csv suffix) |
["VAI_IRR"] |
lexical_database |
External CSV lexical database file. | "VERBS.csv" |
regular_lexc_file |
Filename for generated lexc file containing regular paradigms. | "ojibwe_verbs_regular.lexc" |
irregular_lexc_file |
Filename for generated lexc file containing irregular paradigms. | "ojibwe_verbs_irregular.lexc" |
morph_features |
This field specifies the order in which morphological features are realized in FST output fields. The elements in the list have to match columns of the spurce CSV files. | ["Paradigm", "Order", "Negation", "Mode", "Subject", "Object"] |
missing_tag_marker |
Tag which indicates missing values of features in the source CSV files. E.g. intransitive verbs won't have an object, and this tag is used to mark that fact. | "NA" |
missing_form_marker |
Tag which indicates paradigm gaps | "MISSING" |
multichar_symbols |
List of multi-character symbols which are used in the source CSV files | ["i1", "w1"] |
pre_element_tag |
A tag which is used to indicate the position of pre-elements like preverbs and prenouns in the lexc files | "[PREVERB]" |
pv_source_path |
This should point to your PVSpreadsheets directory |
"~/Documents/OjibweMorph/PVSpreadsheets/" |
template_path |
Path to jinja2 templates. Note that (for some reason) this file path cannot use a tilde symbol. | "/Users/YourName/Documents/OjibweMorph/templates" |
The external lexical database
Additional lexemes are supplied in CSV files. For example, here are the first few rows of the lexical database file of verbs for Border Lakes Ojibwe, sourced from the Ojibwe People's Dictionary (OPD):
Lemma,Stem,Paradigm,Class,Translation,Source
aazhoogaadebi,aazhoogaadebi,VAI,VAI_V,NONE,https://ojibwe.lib.umn.edu/main-entry/aazhoogaadebi-vai
aayaazhoogaadebi,aayaazhoogaadebi,VAI,VAI_V,NONE,https://ojibwe.lib.umn.edu/main-entry/aazhoogaadebi-vai
aazhooshkaa,aazhooshkaa,VAI,VAI_VV,NONE,https://ojibwe.lib.umn.edu/main-entry/aazhooshkaa-vai
The lemma and stem must agree with the forms in the source spreadsheets where applicable (stored in source_path as specified above).
In the JSON configuration file, the path to the lexical database is supplied under the key lexical_database.
Citation
To cite this work or the contents of the repository in an academic work, please use the following:
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file fstmorph-0.1.0.tar.gz.
File metadata
- Download URL: fstmorph-0.1.0.tar.gz
- Upload date:
- Size: 38.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.2 CPython/3.10.7 Darwin/21.6.0
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f1436408648c634cf0d7ce1e745837505235a8b37a48f2e664a05d65b101e961
|
|
| MD5 |
80c92251bde62388b0980c5c1c53ffb0
|
|
| BLAKE2b-256 |
4a04300d071f2b83244633abda4aca7afab3110bbe9b6f1538c720fc8b3dcfd8
|
File details
Details for the file fstmorph-0.1.0-py3-none-any.whl.
File metadata
- Download URL: fstmorph-0.1.0-py3-none-any.whl
- Upload date:
- Size: 40.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.2 CPython/3.10.7 Darwin/21.6.0
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
26468f800092a831fdb573207caa48ae3f546da89455c91050a5ce0ec50586da
|
|
| MD5 |
f280f360fe7444f8883454bb3676c705
|
|
| BLAKE2b-256 |
fc92e549c53d6520756234982f422ea2120b3bbc6c20c95c4c0d69d006d54fcd
|