text-normaliser

A python package that runs a series of operations over text to decorate a corpus.

Raison D'être :thought_balloon:

This package normalises text before it is sent to a vector model. A series of checking, preprocessing and decoration operations strips insignificant data from the text and embeds new information from rule-based models.

Architecture :triangular_ruler:

text-normaliser works in two stages: it first checks whether to include a line at all, then normalises each line that passes. The following checks determine whether a line is included:

  • Letter Filtering, which strips everything but the English Latin characters from the text and checks that the resulting length is appropriate.
  • Language Filtering, which determines whether the text is English based on a character n-gram model.

If both of those checks pass, then it performs the following normalisation operations:

  • Entity Tagging, which tags every entity in the text with its type (e.g. MONEY, DATE or PERSON).
  • Part-of-speech Tagging, which tags each word with its kind (e.g. ADJECTIVE or NOUN). This does not tag anything already grouped by the Entity Tagging.
  • Lowercasing, which lowercases all the text to remove flair.
  • HTML Decoding, which decodes HTML-encoded characters to Unicode.
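The two inclusion checks and the two simplest normalisation steps can be sketched with the standard library alone. This is an illustrative sketch, not the package's own code: the function names, the MIN_LETTERS threshold and the toy trigram-overlap score are all assumptions, and the real language filter is likely a trained character n-gram model rather than a single reference sentence.

```python
import html
import re
from collections import Counter

MIN_LETTERS = 20  # hypothetical threshold; the README does not state the real one


def passes_letter_filter(line, min_letters=MIN_LETTERS):
    """Keep only A-Z/a-z and check that enough letters remain."""
    letters = re.sub(r"[^A-Za-z]", "", line)
    return len(letters) >= min_letters


def char_ngrams(text, n=3):
    """Count the overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_overlap(line, profile, n=3):
    """Fraction of the line's n-grams that also occur in a reference profile.

    A toy stand-in for a character n-gram language model (assumption).
    """
    grams = char_ngrams(line, n)
    total = sum(grams.values())
    if total == 0:
        return 0.0
    return sum(count for gram, count in grams.items() if gram in profile) / total


def decode_and_lowercase(line):
    """Decode HTML entities first, then lowercase the result."""
    return html.unescape(line).lower()
```

Decoding before lowercasing is a deliberate choice in this sketch: entity names are case-sensitive (&Eacute; and &eacute; are different characters), so lowercasing first could change what gets decoded. A caller would build `profile` as a set of trigrams from known English text and keep a line only when `passes_letter_filter` succeeds and `ngram_overlap` exceeds some chosen cut-off.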

Dependencies :globe_with_meridians:

Installation :inbox_tray:

Installation requires the package itself and the spaCy English model; see below for the local and PyPI options.

Locally

  1. Run python setup.py install
  2. Run python -m spacy download en
  3. Run python -m unittest discover -v for the unit tests.

Via PyPI

  1. Run pip install textnormaliser
  2. Run python -m spacy download en

Usage example :eyes:

In order to normalise a corpus of text, execute the following:

textnormaliser [corpus-file] [output-file]

The result is a new output file containing the normalised text. Let's say you had the following input file:

European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices

Yeah, as a Libertarian who leans left socially but right in other areas (used to be fiscally, but the right isn't about smaller government anymore!) I can personally attest to the fact that calling us names just makes us more determined to dig in and shore up.

You don't win people over by calling them stupid and telling them they are wrong.  You win them over by convincing them that there is something new and better.

This would result in the following:

norp_european nns_authorities vbd_fined org_google a nn_record money__5_1_billion in_on date_wednesday in_for vbg_abusing prp$_its nn_power in_in dt_the jj_mobile nn_phone nn_market cc_and vbd_ordered dt_the nn_company to_to vb_alter prp$_its nns_practices
uh_yeah , in_as a norp_libertarian wp_who vbz_leans vbn_left rb_socially cc_but rb_right in_in jj_other nns_areas ( vbn_used to_to vb_be rb_fiscally , cc_but dt_the nn_right vbz_is rb_n't rb_about jjr_smaller nn_government rb_anymore ! ) i md_can rb_personally vb_attest to_to dt_the nn_fact in_that vbg_calling prp_us rb_names rb_just vbz_makes prp_us rbr_more jj_determined to_to vb_dig in_in cc_and vb_shore rp_up .
prp_you vbp_do rb_n't vb_win nns_people in_over in_by vbg_calling prp_them jj_stupid cc_and vbg_telling prp_them prp_they vbp_are jj_wrong . prp_you vbp_win prp_them rp_over in_by vbg_convincing prp_them in_that ex_there vbz_is nn_something jj_new cc_and jjr_better .
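The decorated format shown above can be approximated from already-tagged input. The sketch below is inferred only from the sample output (for example, "$5.1 billion" tagged MONEY becoming money__5_1_billion, and "its" tagged PRP$ becoming prp$_its); the function names and the tuple layout are hypothetical, and the package itself obtains tags and entities from spaCy rather than from hand-written tuples.

```python
import re


def decorate_entity(text, label):
    """Prefix an entity with its type, replacing non-alphanumerics with '_'."""
    safe = re.sub(r"[^0-9A-Za-z]", "_", text)
    return f"{label}_{safe}".lower()


def decorate_token(text, tag):
    """Prefix a plain token with its part-of-speech tag."""
    return f"{tag}_{text}".lower()


def decorate_line(items):
    """Join (text, tag, entity_label) triples into one decorated line.

    Entities take priority over part-of-speech tags; items with neither
    (such as punctuation in the sample output) are only lowercased.
    """
    parts = []
    for text, tag, entity in items:
        if entity is not None:
            parts.append(decorate_entity(text, entity))
        elif tag is not None:
            parts.append(decorate_token(text, tag))
        else:
            parts.append(text.lower())
    return " ".join(parts)
```

Collapsing an entity into a single underscore-joined token keeps multi-word entities together when the line is later split on whitespace for a vector model.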

License :memo:

The project is available under the MIT license.

Acknowledgements

  • Icon in README banner is text by Chameleon Design from the Noun Project.
