Artificial Error Generation (AEG) for Natural Language Processing

Generata - Generate Data

A Python package for Artificial Error Generation!

work-in-progress

About

Approach:

Data Description:

Abstracts and titles are extracted from PubMed. Sentences are created with NLTK's sentence tokenizer (PunktSentenceTokenizer), and some post-processing is applied on top of the sentence tokenization.
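
As an illustration, the tokenization step might look like the following minimal sketch (this is not the package's exact code, and the filtering rule at the end is only a placeholder for the post-processing):

# a minimal sketch of the sentence-tokenization step
from nltk.tokenize.punkt import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()
abstract = ("We studied the effect of treatment X on outcome Y. "
            "The results suggest a significant association.")
sentences = tokenizer.tokenize(abstract)

# placeholder post-processing: drop very short fragments
sentences = [s.strip() for s in sentences if len(s.split()) > 2]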

  • No. of samples: 100,000
  • No. of sentences from abstracts: ~89,000
  • No. of titles: ~11,000

Model Analysis Dataset: The Corpus of Linguistic Acceptability (CoLA) in its full form consists of 10,657 sentences from 23 linguistics publications, expertly annotated for acceptability (grammaticality) by their original authors. The public version contains 9,594 sentences belonging to the training and development sets, and excludes 1,063 sentences belonging to a held-out test set. Contact alexwarstadt [at] gmail [dot] com with any questions or issues. Read the paper or check out the source code for baselines.

Paper

Read the paper at https://arxiv.org/abs/1805.12471

CoLA website: https://nyu-mll.github.io/CoLA/
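
If you want to load the public CoLA data yourself, a minimal sketch follows (the file name comes from the public CoLA release, and the column names here are descriptive assumptions based on its format: source, acceptability label, original annotation, sentence):

import pandas as pd

# CoLA ships as tab-separated files without a header row
cola = pd.read_csv("in_domain_train.tsv", sep="\t", header=None,
                   names=["source", "label", "original_label", "sentence"])
print(cola["label"].value_counts())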

Dependencies:

  • Python (3.8+)

Install nlpaeg on your system using:

pip install nlpaeg

Usage:

Importing the library:

import nlpaeg
from nlpaeg import error_generator as eg

Instantiate the class:

g = eg.ErrorGenerator()

Set the configuration parameters. All parameters are set directly as attributes on the ErrorGenerator instance:

# imports used below
import os
import pandas as pd

# instantiate the class
g = eg.ErrorGenerator()

# data directory
data_dir = os.path.join(os.getcwd(), "data")

# file containing the error-free sentences
train_data = "nlpaeg_pubmed_data_min.csv"
train_data_file_path = os.path.join(data_dir, train_data)

# alternatively, read CoLA-style tab-separated data:
# df = pd.read_csv(train_data_file_path, sep="\t", header=None)
# df.columns = ["source", "valid", "note", "sentence"]

# set the source data
g.source_data = pd.read_csv(train_data_file_path)

# name of the column holding the sentences
# default: "sentences"
g.sentence_column = "sentences"

# define the n-gram order
# 4 => quadgrams, trigrams, bigrams and unigrams
# 3 => trigrams, bigrams and unigrams
# 2 => bigrams and unigrams
# default is 3; max is 5
g.ngram_order = 4

# predefined column names for each n-gram order (up to 5-grams)
g.ngram_cols = {
    1: "unigrams", 2: "bigrams", 3: "trigrams",
    4: "quadgrams", 5: "pentgrams"
}

# total number of samples
g.total_samples = len(g.source_data)

# how many of the most frequent n-grams to keep per order,
# expressed as a fraction of the total number of samples:
# e.g. with 1,000 sentences, keep the 300 most frequent unigrams,
# the 200 most frequent bigrams, and so on
g.n_ngrams = {
    1: int(g.total_samples * 0.3),
    2: int(g.total_samples * 0.2),
    3: int(g.total_samples * 0.15),
    4: int(g.total_samples * 0.1),
    5: int(g.total_samples * 0.05),
}

# proportion of n-gram matches to modify, per order;
# higher-order matches are rarer, so a larger share of them is kept
# when sampling the n-gram changes
g.ngram_weights = {
    0: 1,    # 100% of sentences with no n-gram match
    1: 0.4,  # 40% of unigram changes
    2: 0.6,  # 60% of bigram changes
    3: 0.8,  # 80% of trigram changes
    4: 0.95, # 95% of quadgram changes
    5: 1     # 100% of pentgram changes
}

# probability distribution of artificial errors
# keys -> type of error
# values -> share of the distribution
g.error_distribution = {
    "dictionary_replacement_verb_form_change": 0.1,
    "dictionary_replacement_word_order_change": 0.1,
    "verb_form_change_order_change": 0.1,
    "insert_determiner": 0.1,
    "punctuation_braces": 0.05,
    "punctuations": 0.05,
    "duplication": 0.1,
    "split_words": 0.1,
    "remove_words": 0.05,
}
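
Note that the weights above sum to 0.75 rather than 1. If the implementation expects a proper probability distribution (an assumption worth verifying against the package source), you can normalize the values yourself:

# optional sanity check: rescale the error weights so they sum to 1
total = sum(g.error_distribution.values())
g.error_distribution = {k: v / total for k, v in g.error_distribution.items()}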

Create your dataframe:


# call the method to create error data
aeg_df = g.get_aeg_data()

aeg_df.to_csv('sampled_replacements_1.csv', index=False)
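
To sanity-check the result before (or after) saving, standard pandas inspection works; the exact columns of aeg_df depend on the package version:

# quick inspection of the generated data
print(aeg_df.shape)
print(aeg_df.head())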
