ndltenses

A package for training an NDL model to learn tenses using pyndl; it can also be used to annotate text for tense.

installation

This package is installable from PyPI, which means it requires pip. You can upgrade to the latest pip with "pip install --upgrade pip". Then, from a terminal, install the package with:

    $ pip install ndl-tense

parameter file

This file is used to tailor the execution of the processing to your needs. In the future I hope to create a GUI to avoid hand-editing it. The GitHub repository contains an already-created parameter file with example values set, and a default parameter file that you should use.

There are some necessary files (NF) that you need to create yourself and store in a directory referenced by the NF variable.
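As a rough illustration, a parameter file might look like the sketch below. The folder variables match those referenced in the steps that follow, but the paths themselves are placeholders to replace with your own.

    # parameters.py -- illustrative sketch only; every path is a placeholder
    NF = "./necessary_files"             # directory of files you create yourself
    WD_EXTRACT = "./extract"             # Step I: sentence extraction
    WD_ANNOTATE = "./annotate"           # Step II: tense annotation
    WD_PREPDAT = "./prepare_data"        # Step III: event-file preparation
    WD_EXTRACT_INF = "./infinitives"     # Step IV: infinitive cues
    WD_EXTRACT_NGRAM = "./ngrams"        # Step V: n-gram cues
    WD_CUES = "./cues"                   # Step VI: cue files
    WD_SIM = "./simulations"             # Step VII: NDL training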

pipeline


The pipeline.py file (found in the GitHub repository) acts as a step-by-step guide for running the code, from preparing the data of the English tense project (including annotation) to training an NDL model on this data.

Based on code by Adnane Ez-zizi dated 06/08/2020; corrected, updated, and adapted into a Python package by Tekevwe Kwakpovwe, 08/2021. Changes are listed in the changes file.
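For orientation, a driver in the spirit of pipeline.py might look like the sketch below. The module path and run-function names are assumptions (the repository layout described here confirms only that create_sentence_file.py lives in data_preparation); consult pipeline.py itself for the exact calls.

    # Sketch of a driver in the spirit of pipeline.py; module paths and
    # run-function names are assumptions -- see pipeline.py for the real calls.
    from ndl_tense.data_preparation import create_sentence_file, annotate_tenses

    create_sentence_file.run()   # Step I: sentences file from the tagged corpus
    annotate_tenses.run()        # Step II: verb tags -> tense annotations
    # ...and so on through Steps III-VII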

Step I: Create sentences file from corpus file with verb tags

  • Folder variable: WD_EXTRACT
  • Main file: create_sentence_file.py - this is in data_preparation; see pipeline.py for an example of how it is called

Step II: Convert verb tags into tenses (tense annotation)

  • Folder variable: WD_ANNOTATE
  • Main file: annotate_tenses.py (quite a long process; use the corresponding run function)
  • A function equivalent to "Check_lengthy_sentences.R", used to find the 99% quantile of sentence lengths and so define a sentence-length cut-off, will be added.
  • Sentences with word counts < 3 or > 60 (the 1% and 99% quantiles) are removed; a sketch of this filter follows below.
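The length filter can be reproduced in a few lines of pandas. This is a minimal sketch assuming one sentence per row in a "Sentence" column of a hypothetical input file, not the package's own code.

    # Minimal sketch of the sentence-length filter (not the package's own code).
    import pandas as pd

    df = pd.read_csv("sentences.csv")                 # hypothetical input file
    lengths = df["Sentence"].str.split().str.len()    # word count per sentence
    lo, hi = lengths.quantile([0.01, 0.99])           # in this project's data: 3 and 60
    df = df[lengths.between(lo, hi)]                  # drop the extreme tails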

Step III: Prepare event files

  • Folder variable: WD_PREPDAT

  • File 1 (main): prepare_ndl_events.py (use the corresponding run function)

    1. Added the infinitives using the nltk toolkit
    2. Removed sentences with no verb/tense
    3. From the full set, created one row per tense/verb in each sentence. IMPORTANT NOTE: Some modals appear at the end of a sentence (e.g. certainly they should). In such cases, the VerbForm, MainVerb and infinitive are all empty, but that doesn't matter since these will not be considered as main events
    4. Converted words in American English to British English using http://www.tysto.com/uk-us-spelling-list.html
    5. Corrected the infinitives using Laurence's list and some manual corrections + http://www.tysto.com/uk-us-spelling-list.html (for converting AE to BE)
    6. From the new full set, removed modals and imperatives
    7. Added a column "VerbOrder" to check the order of the verb (e.g. 1st, 3rd).
    8. Created contexts by removing the verb form and replacing it with the infinitive (words separated with underscores)
    9. Generated word cues from the context columns (cues with or without infinitives) - Added 2 columns
    10. Generated n-gram cues (from 1 to 4) from the context columns (cues with or without infinitives) - Added 2 columns (see the sketch after this list)
    11. Shuffled the sentences while keeping the order of events within each sentence
  • Folder variable 2:

  • File 2: Prepare_train_valid_test.py (use the corresponding run function) -> Split the data into training (90%), validation (5%) and test (5%) sets. The split was done at the level of sentences, NOT events (see the split sketch after this list).

  • File 3: prepare_ndl_events.py (use the corresponding run function) -> This prepares .gz event files ready to be used with NDL

  • Note: I prepared two dataset types: 'multiverbs' contains complex sentences with more than one verb, while 'oneverb' contains only sentences that have a single verb. The 'oneverb' version is currently not in use.
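As a concrete illustration of items 8-10 above (context words separated by underscores, n-gram cues from levels 1 to 4), here is a minimal sketch of how such cues could be generated. The function name and the intra-gram separator are assumptions, not the package's actual choices.

    # Sketch of n-gram cue generation (items 8-10 above); illustrative only.
    def ngram_cues(context, max_n=4):
        """Return 1- to max_n-gram cues from an underscore-separated context."""
        words = context.split("_")
        cues = []
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                cues.append("#".join(words[i:i + n]))  # intra-gram separator is a guess
        return "_".join(cues)        # pyndl event files separate cues with underscores

    # e.g. ngram_cues("he_will_go") -> "he_will_go_he#will_will#go_he#will#go"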
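The sentence-level split performed by Prepare_train_valid_test.py matters because all events from one sentence share context; splitting by sentence keeps them in the same set. A minimal sketch, assuming the events sit in a pandas DataFrame with a hypothetical SentenceID column:

    # Sketch of a sentence-level 90/5/5 split (not the package's own code).
    import numpy as np

    ids = events["SentenceID"].unique()     # hypothetical column name
    rng = np.random.default_rng(0)
    rng.shuffle(ids)                        # shuffle sentences, not events
    n_train = int(0.90 * len(ids))
    n_valid = int(0.05 * len(ids))
    train_ids = set(ids[:n_train])
    valid_ids = set(ids[n_train:n_train + n_valid])
    train = events[events["SentenceID"].isin(train_ids)]
    valid = events[events["SentenceID"].isin(valid_ids)]
    test = events[~events["SentenceID"].isin(train_ids | valid_ids)]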

Step IV: Prepare the infinitive cues to use

  • Folder variable: WD_EXTRACT_INF
  • File 1 optional (main): extract_infinitive.py (use the corresponding run function)
  • File 2: Extract_infinitive_freqs_oneverb.py (for the dataset containing only sentences with a single verb)

-> Created a list of all possible infinitives with their co-occurrence frequencies with each tense -> Extracted the list of all infinitives that have a freq > 10, to be used as lexical cues
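One way to compute the co-occurrence table and apply the freq > 10 cut-off is a pandas crosstab. This is a sketch under the assumption that the events sit in a DataFrame with Infinitive and Tense columns (column names are guesses):

    # Sketch of infinitive/tense co-occurrence counting (illustrative only).
    import pandas as pd

    freqs = pd.crosstab(events["Infinitive"], events["Tense"])      # co-occurrence counts
    infinitive_cues = freqs[freqs.sum(axis=1) > 10].index.tolist()  # overall freq > 10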

Step V: Prepare the n-gram cues to use

  • Folder variable: WD_EXTRACT_NGRAM
  • File 2: prepare_ngrams.py (there is also code for the 'oneverb' data set, but this is not currently in use) -> Extracted a list of n-grams to include in the training (the 10k most frequent from each level n, all with freq > 10)
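The selection rule (the 10k most frequent n-grams from each level n, all with freq > 10) could be implemented along these lines; level_counts is a hypothetical mapping from level to n-gram frequencies, not part of the package:

    # Sketch of the n-gram selection rule (top 10k per level, freq > 10).
    from collections import Counter

    selected = []
    for n in range(1, 5):                      # levels 1 to 4
        counts = Counter(level_counts[n])      # hypothetical: {ngram: freq} at level n
        selected += [g for g, f in counts.most_common(10_000) if f > 10]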

Step VI: Prepare the cue files

  • Folder variable: WD_CUES
  • File 1: prepare_cues.py (there is also code for the 'oneverb' data set, but this is not currently in use) -> Created a separate list of all n-grams for each level n (1 to 4) with their frequency in the dataset

Step VII: Train the NDL model

  • Folder variable: WD_SIM
  • File 1: ndl_model -> Trains an NDL model
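Since training is done with pyndl (see the package description above), fitting a model on one of the .gz event files from Step III comes down to a call like the one below; the file name and learning parameters are placeholders.

    # Sketch of NDL training with pyndl on a prepared .gz event file.
    from pyndl import ndl

    weights = ndl.ndl(events="train_events.tab.gz",   # placeholder path to a Step III event file
                      alpha=0.1, betas=(0.1, 0.1),    # placeholder learning-rate parameters
                      method="openmp")                # parallel Rescorla-Wagner updates
    # 'weights' is an xarray.DataArray of cue-to-outcome association strengths.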
