Probabilistic Noising of Natural Language

Artext: Artificial Text Generation

Artext is a project for injecting noise into text without changing its core meaning for a human reader. Such data can be useful for many NLP tasks, particularly for making models robust to noisy or erroneous input.

Note: Noising will generally increase the vocabulary size of a data set, because it introduces word inflections and orthographic variations that may not have existed before. It should therefore be used with caution, especially with closed-vocabulary neural network models such as those used in machine translation. In such scenarios, consider using a subword-based vocabulary (BPE, for instance).
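The vocabulary growth described in the note can be illustrated with a small count of token types before and after adding noised variants. This is only a sketch: the noised sentences below are written by hand for illustration and are not actual artext output, and the `vocabulary` helper is a hypothetical name that uses plain whitespace tokenization.

```python
from collections import Counter


def vocabulary(sentences):
    """Return token counts for whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return counts


clean = ["we must put on our hat ."]
# Hand-written noised variants for illustration (not actual artext output).
noised = [
    "we must putting on our hats .",
    "we msut put on our hat ;",
]

clean_vocab = vocabulary(clean)
noised_vocab = vocabulary(clean + noised)
# Token types that exist only because of the noised variants.
new_types = sorted(set(noised_vocab) - set(clean_vocab))

print(len(clean_vocab), len(noised_vocab))  # 7 11
print(new_types)  # [';', 'hats', 'msut', 'putting']
```

Every new type is an inflection, misspelling, or punctuation swap of a token that was already present, which is the kind of growth a subword vocabulary can absorb.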

This is a work in progress, and the results of our experiments will be published soon. Meanwhile, if you use artext in your research, please cite this repository.

Setup

artext is developed and tested with Python 3.6 and can be installed in two ways:

  1. Using pip:
pip install artext
  2. From source code:
git clone https://github.com/fgaim/artext
cd artext
pip install -r requirements.txt
python setup.py install

Get required resources:

python -m spacy download 'en_core_web_sm'
python -m nltk.downloader 'punkt'
python -m nltk.downloader 'wordnet'

Usage

Use from command-line

Generate sentence (sent) or document (doc) level noise samples for a text file as follows:

python -m artext -src source.txt -out output.txt -l sent -er 0.5 -n 10

Or, from the source tree, run inject.py directly:

python inject.py -src source.txt -out output.txt -l sent -er 0.5 -n 10

Use -h to see all options.
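To generate samples at several error rates, the command above can be assembled programmatically. The sketch below only builds and prints the argument lists using the flags shown above. The `artext_command` helper and the output file names are hypothetical, and it assumes `doc` is accepted by `-l` as the description suggests.

```python
import shlex
import sys


def artext_command(src, out, level="sent", error_rate=0.5, samples=10):
    """Build the argv list for `python -m artext` from the flags shown above."""
    if level not in ("sent", "doc"):
        raise ValueError("level must be 'sent' or 'doc'")
    return [
        sys.executable, "-m", "artext",
        "-src", src,
        "-out", out,
        "-l", level,
        "-er", str(error_rate),
        "-n", str(samples),
    ]


# One command per error rate; the output file names are made up for this sketch.
commands = [
    artext_command("source.txt", f"output_er{rate}.txt", error_rate=rate)
    for rate in (0.1, 0.25, 0.5)
]

for cmd in commands:
    print(" ".join(shlex.quote(arg) for arg in cmd))
    # To actually run it (requires artext and its resources to be installed):
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems when file names contain spaces.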

Use as a library

from artext import Artext

artxt = Artext()
artxt.samples = 5
artxt.error_rate = 0.25
sent = 'This is a sample sentence to be noised.'
noises = artxt.noise_sentence(sent)
print(noises)
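The same attributes can be varied in a loop to compare error rates. The sketch below relies only on the `samples` and `error_rate` attributes and the `noise_sentence` method used above. `sweep_error_rates` and the `UppercaseNoiser` stand-in are hypothetical names introduced so the example runs without artext installed. The return type of `noise_sentence` is not documented here, so the helper stores whatever it returns.

```python
def sweep_error_rates(noiser, sentence, rates, samples=5):
    """Noise one sentence at several error rates.

    `noiser` is any object exposing the `samples` and `error_rate`
    attributes and the `noise_sentence` method used in the snippet above.
    """
    results = {}
    for rate in rates:
        noiser.samples = samples
        noiser.error_rate = rate
        results[rate] = noiser.noise_sentence(sentence)
    return results


class UppercaseNoiser:
    """Stand-in for Artext so this sketch runs without the package."""
    samples = 1
    error_rate = 0.0

    def noise_sentence(self, sentence):
        # Not real noising: just returns `samples` upper-cased copies.
        return [sentence.upper()] * self.samples


results = sweep_error_rates(
    UppercaseNoiser(), "a sample sentence .", [0.1, 0.5], samples=2
)
for rate, outputs in results.items():
    print(rate, outputs)

# With artext installed, the real class can be passed instead:
#   from artext import Artext
#   results = sweep_error_rates(Artext(), sent, [0.1, 0.25, 0.5])
```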

Examples

python example.py -er 0.5 -n 10

Sentence Level Examples

Input (clean sentence from Lang-8):

So , I think if we have to go somewhere on foot , we must put on our hat .

Human (error example from Lang-8):

So , I think if we have to go somewhere on foot , we must put on our hat .

Output (artext):

  • So , I think if we have to go <ins>going</ins> somewhere on foot <ins>feet</ins> , we must put on our hat . <ins>?</ins>
  • So , I think <ins>thinking</ins> if we have to go somewhere on foot , we must put on <ins>!</ins> our hat <ins>hats</ins> .
  • So , I think if we have <ins>we</ins> to go somewhere on foot <ins>feet</ins> , we must put on our hat . <ins>;</ins>
  • So , I think if we have to go somewhere on foot , we must put <ins>must</ins> on our hat <ins>hats</ins> .
  • So , I think if we have to go somewhere on foot <ins>feet</ins> , we must put on <ins>put</ins> our hat .
  • So , <ins>;</ins> I think if we have <ins>take</ins> to go somewhere on foot , we must put on our hat <ins>hats</ins> .
  • So , I think if we have to go somewhere <ins>someplace</ins> on foot , we must put <ins>putting</ins> on our hat <ins>hats</ins> .
  • So , I think if we have to go somewhere on foot , we must put on our hat . <ins>chapeau ;</ins>
  • So , I think if we have <ins>we</ins> to go somewhere <ins>go</ins> on foot , we must put on our hat .
  • So , I think <ins>retrieve</ins> if we have <ins>having</ins> to go <ins>going</ins> somewhere on foot , <ins>substructure</ins> we must put <ins>putting</ins> on our hat .
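To inspect which tokens differ between a clean sentence and a noised sample, a token-level diff can be computed with Python's standard `difflib`. This is a generic sketch, not part of artext. The `token_edits` helper is a hypothetical name, and the noised sentence is written by hand for illustration.

```python
from difflib import SequenceMatcher


def token_edits(clean, noised):
    """List (operation, clean_tokens, noised_tokens) for each differing span."""
    a, b = clean.split(), noised.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


clean = "we must put on our hat ."
noised = "we must putting on our hats ."  # hand-written illustration

for edit in token_edits(clean, noised):
    print(edit)
# ('replace', ['put'], ['putting'])
# ('replace', ['hat'], ['hats'])
```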

Document Level Examples

Input (clean text from Lang-8):

This morning I found out that one of my favourite bands released a new album .
I already forgot about Rise Against and it is a great surprise for me, because I haven't listened to them for 2 years .
I hope this band did n't become worse, like many others big ones did , and I 'll enjoy listening to it .
Well , I just have to get it and check it out .

Human (error example from Lang-8):

This morning I found out that one of my favourite bands <ins>band</ins> released a <ins>his</ins> new album . I already forgot about Rise Against and <ins>an</ins> it is a great surprise for me , because I have <ins>did</ins> n't listened <ins>return</ins> to them for 2 years . I hope this band did n't become worse , <ins>yet</ins> like many others big ones did , and I 'll enjoy listening to it . Well , I just have <ins>there remains</ins> to get it and check it out .

Output (artext):

  • This morning I found out that one of my favourite <ins>favored</ins> bands released a new album . I already forgot about Rise Against <ins>grow Agianst</ins> and it is <ins>are</ins> a great surprise for me , because I have n't listened <ins>listen</ins> to them for 2 years . I hope <ins>hoping</ins> this band did <ins>bands serve</ins> n't become worse , like many others big ones did , and I 'll enjoy listening to <ins>listening</ins> it . Well , I just have <ins>deliver</ins> to get it and check it out .
  • This morning I found out that one of my favourite bands released <ins>band</ins> a <ins>released</ins> new album . I already forgot <ins>forget</ins> about Rise Against <ins>Aigniast</ins> and it is a great surprise for me , because I <ins>beceause</ins> have n't listened to them for 2 years <ins>geezerhood</ins> . I hope <ins>hoping</ins> this band did <ins>bands</ins> n't become worse , <ins>did becoming wore</ins> like many others <ins>other</ins> big ones did , <ins>didding ;</ins> and I 'll enjoy listening to it . Well <ins>eWll</ins> , I just have to get it and check it out .
  • This morning I found out that one <ins>that</ins> of my favourite bands released a new album <ins>albums</ins> . I already forgot <ins>forgotting</ins> about Rise Against <ins>Aainst</ins> and it is <ins>be</ins> a great surprise <ins>surprisal</ins> for me , because I have <ins>having</ins> n't listened <ins>listneed</ins> to them <ins>tem</ins> for 2 years . I hope this band did <ins>do</ins> n't become worse , like many others big ones did <ins>didding</ins> , and I 'll enjoy listening to it . Well , I just have to get it and check <ins>checking</ins> it out .
  • This morning I found out that one of my favourite bands released a new album . I already forgot about <ins>abuot</ins> Rise Against <ins>Agaiinst</ins> and it is a great surprise <ins>srrpuise</ins> for me , because I have n't listened <ins>listening</ins> to them for 2 years <ins>year</ins> . I hope this band did n't become worse , like many others big <ins>other</ins> ones did , and I 'll enjoy listening <ins>enjoying litening</ins> to it . Well , I just <ins>scarce</ins> have to get <ins>getting</ins> it and check it <ins>checking</ins> out <ins>it</ins> .
  • This morning <ins>mornings</ins> I found <ins>ground</ins> out that <ins>hTat</ins> one of my favourite bands <ins>favorite band</ins> released a new album . I already forgot <ins>forget</ins> about Rise Against <ins>arise Agsinat</ins> and it is a great surprise <ins>surprisal</ins> for me , because I have <ins>because</ins> n't listened <ins>have listen</ins> to them for 2 years <ins>year</ins> . I hope this band did n't become worse <ins>tough</ins> , like many others <ins>other</ins> big ones did , and I 'll enjoy listening <ins>enjoy</ins> to it . <ins>?</ins> Well , I just <ins>hardly</ins> have to get it and check it out .
  • This morning I found <ins>fnuod</ins> out that <ins>htat</ins> one of my favourite bands released <ins>releasing</ins> a newalbum . I already forgot about <ins>abut</ins> Rise Against <ins>Aigainst</ins> and it is a great surprise <ins>surprises</ins> for me , because <ins>becuasae</ins> I have n't listened to them for 2 years <ins>year</ins> . I hope this band did n't become <ins>becoming</ins> worse , like many others <ins>other</ins> big ones <ins>one</ins> did , and I 'll enjoy listening <ins>enjoying</ins> to it . Well , I just have to <ins>having</ins> get <ins>to</ins> it and check it out . <ins>!</ins>
  • This morning I found out that one of my favourite <ins>my</ins> bands released <ins>release</ins> a new album . I already forgot <ins>alraedyy forgotting</ins> about Rise Against <ins>Aagaianst</ins> and it is <ins>are</ins> a great surprise <ins>surprises</ins> for me , <ins>.</ins> because I have n't listened <ins>listen</ins> to them for 2 years . I hope this band did <ins>band</ins> n't become worse , like many others big ones did , and I 'll enjoy listening to it . Well , I just have to get it and check it out .
  • This morning I found <ins>incur</ins> out that one of my favourite <ins>favored</ins> bands released <ins>releaseed</ins> a new album <ins>albums</ins> . I already forgot about Rise Against <ins>igAanst</ins> and it is a great <ins>grat</ins> surprisefor me , because I have <ins>having</ins> n't listened <ins>listen</ins> to them for 2 years . I hope this band did n't becomeworse , like many others big ones did <ins>one do</ins> , and I 'll enjoy <ins>enjoying</ins> listening to it . Well , <ins>:</ins> I just have <ins>having</ins> to get <ins>getting</ins> it and check it out .
  • This morning I found <ins>founding</ins> out that <ins>hTat</ins> one of my favourite bands released <ins>releasing</ins> a new <ins>newfangled</ins> album . I already forgot <ins>block</ins> about Rise Against <ins>Aganst</ins> and it is a great surprise for me , because <ins>becuasee</ins> I have n't listened to them for 2 years . I hope this <ins>tthis</ins> band did n't become <ins>becoming</ins> worse , <ins>:</ins> like many others big ones did , and I 'll enjoy listening to it . Well , I just have <ins>having</ins> to get it and check it out . <ins>.</ins>
  • This morning I found <ins>I</ins> out that one of my favourite bands released <ins>band releasing</ins> a new album . I already forgot about Rise <ins>Rising</ins> Against and it is a great <ins>is Greeat</ins> surprise for me , because I have n't listened to them for 2 years . I hope <ins>desire</ins> this band did n't become worse , like many others big ones did <ins>didding</ins> , and I 'll enjoy <ins>enjoying</ins> listening to it . Well , <ins>?</ins> I just have to get it and check it out .
