
A module for normalising text.

Introduction

This module takes a text as input and returns it in a normalised form, i.e. it expands every word token that is not of a standard type. Non-standard words (NSWs) are detected, classified and expanded. Examples of NSWs that are normalised include:

  • Numbers - percentages, dates, currency amounts, ranges, telephone numbers.
  • Abbreviations and acronyms.
  • Web addresses and hashtags.

Table of Contents

  1. Installation

  2. Usage

    2.1. Customise to your variety

    2.2. Input your own abbreviation dictionary

    2.3. Execute normalise from the command line

  3. Example

  4. Authors

  5. License

  6. Acknowledgements

1. Installation

normalise requires Python 3.

To install the module (on Windows, Mac OS X, Linux, etc.), first make sure you have the latest versions of pip and setuptools, then install normalise:

$ pip install --upgrade pip setuptools

$ pip install normalise

If pip installation fails, you can try easy_install normalise.

2. Usage

Your input text can be a list of words, or a string.

To normalise your text, use the normalise function. This will return the text with NSWs replaced by their expansions:

from normalise import normalise

text = ["On", "the", "28", "Apr.", "2010", ",", "Dr.", "Banks", "bought", "a", "chair", "for", "£35", "."]

normalise(text, verbose=True)

Out:
['On',
 'the',
 'twenty-eighth of',
 'April',
 'twenty ten',
 ',',
 'Doctor',
 'Banks',
 'bought',
 'a',
 'chair',
 'for',
 'thirty five pounds',
 '.']

verbose=True displays the stages of the normalisation process, so you can monitor its progress. To turn this off, use verbose=False.
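
The result is a list of tokens, so you may want to turn it back into a sentence. The sketch below shows one way to do that with plain Python. The helper join_tokens is hypothetical and is not claimed to be part of the normalise API; it joins tokens with spaces and attaches punctuation to the preceding token.

```python
# Hypothetical helper for illustration; not part of the normalise API.
PUNCTUATION = {",", ".", ";", ":", "!", "?"}


def join_tokens(tokens):
    """Join tokens with spaces, attaching punctuation to the previous token."""
    pieces = []
    for token in tokens:
        if token in PUNCTUATION and pieces:
            pieces[-1] += token
        else:
            pieces.append(token)
    return " ".join(pieces)


# Output list copied from the example above.
normalised = ['On', 'the', 'twenty-eighth of', 'April', 'twenty ten', ',',
              'Doctor', 'Banks', 'bought', 'a', 'chair', 'for',
              'thirty five pounds', '.']

sentence = join_tokens(normalised)
print(sentence)
```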

If your input is a string, you can use our basic tokenizer. For best results, pass your own custom tokenizer.

from normalise import tokenize_basic

normalise(text, tokenizer=tokenize_basic, verbose=True)
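
As a rough idea of what a custom tokenizer could look like, here is a minimal sketch. It assumes that the tokenizer argument accepts any callable that takes a string and returns a list of tokens; the source does not spell out this contract, so check it before relying on it. The name simple_tokenize is hypothetical.

```python
def simple_tokenize(text):
    """Split on whitespace, then separate trailing punctuation.

    Full stops inside the text are left attached (so "Apr." and "Dr."
    survive as abbreviations); only the final full stop is split off.
    """
    tokens = []
    for chunk in text.split():
        trailing = []
        # Peel commas, semicolons, colons, "!" and "?" off the end.
        while len(chunk) > 1 and chunk[-1] in ",;:!?":
            trailing.append(chunk[-1])
            chunk = chunk[:-1]
        tokens.append(chunk)
        tokens.extend(reversed(trailing))
    # Treat a full stop at the very end of the text as a sentence end.
    if tokens and len(tokens[-1]) > 1 and tokens[-1].endswith("."):
        tokens[-1:] = [tokens[-1][:-1], "."]
    return tokens


tokens = simple_tokenize("On the 28 Apr. 2010, Dr. Banks bought a chair for £35.")
print(tokens)
```

This sketch is deliberately naive: a text that ends in an abbreviation would lose its final full stop to a separate token, which is one reason a domain-specific tokenizer tends to give better results.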

In order to see a list of all NSWs in your text, along with their index, tags, and expansion, use the list_NSWs function:

from normalise import list_NSWs

list_NSWs(text)

Out:
({3: ('Apr.', 'ALPHA', 'EXPN', 'April'),
  6: ('Dr.', 'ALPHA', 'EXPN', 'Doctor')},
 {2: ('28', 'NUMB', 'NORD', 'twenty-eighth of'),
  4: ('2010', 'NUMB', 'NYER', 'twenty ten'),
  12: ('£35', 'NUMB', 'MONEY', 'thirty five pounds')})
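
The output above appears to be a tuple of dictionaries, each mapping a token index to the original token, two tags and the expansion. Assuming that shape, the following sketch merges the groups and prints the NSWs in text order. The data is copied from the output shown above, and the column names (tag, subtag) are our own labels.

```python
# Result copied from the list_NSWs output above; the tuple-of-dicts shape
# is an assumption based on that output.
nsws = ({3: ('Apr.', 'ALPHA', 'EXPN', 'April'),
         6: ('Dr.', 'ALPHA', 'EXPN', 'Doctor')},
        {2: ('28', 'NUMB', 'NORD', 'twenty-eighth of'),
         4: ('2010', 'NUMB', 'NYER', 'twenty ten'),
         12: ('£35', 'NUMB', 'MONEY', 'thirty five pounds')})

# Merge the groups into one dictionary keyed by token index.
merged = {}
for group in nsws:
    merged.update(group)

# Order the entries by their position in the text.
rows = [(index, *merged[index]) for index in sorted(merged)]

for index, token, tag, subtag, expansion in rows:
    print(f"{index:>2}  {token:<5} {tag:<5} {subtag:<5} -> {expansion}")
```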

2.1. Customise to your variety

In order to customise normalisation to your variety of English, use variety="BrE" for British English, or variety="AmE" for American English:

text = ["On", "10/04", ",", "he", "went", "to", "the", "seaside", "."]

normalise(text, variety="BrE")
Out: ['On', 'the tenth of April', ',', 'he', 'went', 'to', 'the', 'seaside', '.']

normalise(text, variety="AmE")
Out: ['On', 'the fourth of October', ',', 'he', 'went', 'to', 'the', 'seaside', '.']

If no variety is specified, British English is used.
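
The difference between the two varieties comes down to the order of day and month in a numeric date. The sketch below illustrates that idea only; expand_numeric_date is a hypothetical function and is not how normalise is implemented.

```python
UNITS = ["first", "second", "third", "fourth", "fifth", "sixth", "seventh",
         "eighth", "ninth"]
TEENS = ["tenth", "eleventh", "twelfth", "thirteenth", "fourteenth",
         "fifteenth", "sixteenth", "seventeenth", "eighteenth", "nineteenth"]
# Ordinals for days 1 to 31.
ORDINALS = (UNITS + TEENS + ["twentieth"]
            + ["twenty-" + unit for unit in UNITS]
            + ["thirtieth", "thirty-first"])
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def expand_numeric_date(token, variety="BrE"):
    """Expand a token such as "10/04" according to the chosen variety."""
    first, second = (int(part) for part in token.split("/"))
    if variety == "BrE":      # day/month
        day, month = first, second
    elif variety == "AmE":    # month/day
        month, day = first, second
    else:
        raise ValueError(f"unknown variety: {variety!r}")
    if not (1 <= day <= 31 and 1 <= month <= 12):
        raise ValueError(f"not a valid date: {token!r}")
    return f"the {ORDINALS[day - 1]} of {MONTHS[month - 1]}"


british = expand_numeric_date("10/04", variety="BrE")
american = expand_numeric_date("10/04", variety="AmE")
print(british)
print(american)
```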

2.2. Input your own abbreviation dictionary

Although our system aims to be domain-general, you can supply your own dictionary of abbreviations to tailor it to a specific domain. This is done with the keyword argument user_abbrevs={}:

my_abbreviations = {"bdrm": "bedroom",
                    "KT": "kitchen",
                    "wndw": "window",
                    "ONO": "or near offer"}

text = ["4bdrm", "house", "for", "sale", ",", "£459k", "ONO"]

normalise(text, user_abbrevs=my_abbreviations)

Out:
['four bedroom',
 'house',
 'for',
 'sale',
 ',',
 'four hundred and fifty nine thousand pounds',
 'or near offer']
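
Conceptually, a user dictionary can be thought of as an extra lookup table consulted alongside the built-in one. The sketch below shows that idea with plain dictionaries. Everything in it is illustrative: DEFAULT_ABBREVS and expand_with_user_abbrevs are hypothetical, and the assumption that user entries take precedence over built-in ones is ours, not something the source states.

```python
# Illustrative stand-in for a built-in abbreviation table (hypothetical).
DEFAULT_ABBREVS = {"Dr.": "Doctor", "KT": "knight"}


def expand_with_user_abbrevs(tokens, user_abbrevs=None):
    """Replace known abbreviations, letting user entries override defaults."""
    lookup = {**DEFAULT_ABBREVS, **(user_abbrevs or {})}
    return [lookup.get(token, token) for token in tokens]


my_abbreviations = {"bdrm": "bedroom",
                    "KT": "kitchen",
                    "wndw": "window",
                    "ONO": "or near offer"}

tokens = ["Dr.", "Banks", "fitted", "a", "new", "KT", "wndw"]
expanded = expand_with_user_abbrevs(tokens, user_abbrevs=my_abbreviations)
print(expanded)
```

Note that this simple lookup only matches whole tokens; it would not split a token such as "4bdrm" into a number and an abbreviation the way the example above does.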

2.3. Execute normalise from the command line

From the command line, you can normalise text from a given .txt file. Use the command normalise /path/to/your-file.txt. This will print the normalised output, as well as save it to a separate file “your-file_normalised.txt” in the same directory as the original text.

To specify the variety as American English, use --AmE (default is British English). For a verbose output, use --V:

$ normalise /path/to/your-file.txt --AmE --V
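
If you want to drive the command from a script, the sketch below builds the argument list and works out where the output file should appear, following the naming rule described above. The helper names are hypothetical; only the normalise command and the --AmE and --V flags come from this document.

```python
from pathlib import PurePosixPath


def build_command(path, american=False, verbose=False):
    """Build the argument list for the normalise command-line tool."""
    command = ["normalise", str(path)]
    if american:
        command.append("--AmE")
    if verbose:
        command.append("--V")
    return command


def normalised_output_path(path):
    """Return the expected output path: same directory, "_normalised" suffix."""
    source = PurePosixPath(path)
    return source.with_name(f"{source.stem}_normalised.txt")


command = build_command("/path/to/your-file.txt", american=True, verbose=True)
output_path = normalised_output_path("/path/to/your-file.txt")
print(command)
print(output_path)
```

With normalise installed, the list can be passed to subprocess.run(command, check=True).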

3. Example

A further example demonstrating the expansion of more types of NSW (including abbreviations, spelling mistakes, percentage ranges, currency):

# "childhod" is a deliberate misspelling, to show spelling correction.
text = ["On", "the", "13", "Feb.", "2007", ",", "Theresa", "May", "MP",
        "announced", "on", "ITV", "News", "that", "the", "rate", "of",
        "childhod", "obesity", "had", "risen", "from", "7.3-9.6%", "in",
        "just", "3", "years", ",", "costing", "the", "Gov.", "£20m", "."]

normalise(text, verbose=True)

Out:
['On',
 'the',
 'thirteenth of',
 'February',
 'two thousand and seven',
 ',',
 'Theresa',
 'May',
 'M P',
 'announced',
 'on',
 'I T V',
 'News',
 'that',
 'the',
 'rate',
 'of',
 'childhood',
 'obesity',
 'had',
 'risen',
 'from',
 'seven point three to nine point six percent',
 'in',
 'just',
 'three',
 'years',
 ',',
 'costing',
 'the',
 'government',
 'twenty million pounds',
 '.']
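
To make one of these expansions concrete, the sketch below spells out a percentage range such as "7.3-9.6%". It is an illustration under simplifying assumptions (whole parts from 0 to 99, a single hyphen as the range separator) and is not the module's actual implementation; all function names are hypothetical.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def integer_to_words(n):
    """Spell out an integer from 0 to 99."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only handles 0 to 99")
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def decimal_to_words(number):
    """Spell out a decimal string, reading fractional digits one by one."""
    whole, _, fraction = number.partition(".")
    words = integer_to_words(int(whole))
    if fraction:
        words += " point " + " ".join(ONES[int(digit)] for digit in fraction)
    return words


def expand_percent_range(token):
    """Expand a token such as "7.3-9.6%" into words."""
    low, high = token.rstrip("%").split("-")
    return f"{decimal_to_words(low)} to {decimal_to_words(high)} percent"


expansion = expand_percent_range("7.3-9.6%")
print(expansion)
```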

4. Authors

Our system is described in detail in Emma Flint, Elliot Ford, Olivia Thomas, Andrew Caines & Paula Buttery (2016) - A Text Normalisation System for Non-Standard Words.

5. License

This project is licensed under the terms of the GNU General Public License version 3.0 or later.

Please see LICENSE.txt for more information.

6. Acknowledgements

This project builds on the work described in Sproat et al. (2001).

We would like to thank Andrew Caines and Paula Buttery for supervising us during this project.

The font used for the logo was Anita Semi-Square by Gustavo Paz.

Release History

  • 0.1.6 (this version)
  • 0.1.5
  • 0.1.4
  • 0.1.2
  • 0.1.1

Download Files

normalise-0.1.6.tar.gz (15.2 MB), source distribution, uploaded Oct 11, 2016
