Project description

text-normaliser

A python package that runs a series of operations over text to decorate a corpus.

Raison D'être :thought_balloon:

This allows text to be normalised before being sent to a vector model. A series of text checking, preprocessing and decoration operations is performed on the text to strip insignificant data and to embed new information from rule-based models.

Architecture :triangular_ruler:

text-normaliser has two major modes of operation: it first checks whether a line should be included at all, and then normalises the line once it is accepted. To determine whether a line should be included, the following checks are made:

  • Letter Filtering, which strips everything but Latin characters from the text and checks that the resulting length is appropriate.
  • Language Filtering, which determines whether the text is English based on a character n-gram model.

If both of those checks pass, the following normalisation operations are performed (a rough sketch of the whole pipeline is shown after this list):

  • Entity Tagging, which tags all the entities within the text with their type (e.g. MONEY, DATE or PERSON).
  • Part-of-speech Tagging, which tags each word with the kind of word it is (e.g. ADJECTIVE or NOUN). This will not tag anything already grouped by Entity Tagging.
  • Lowercasing, which lowercases all the text to remove case distinctions.
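
The sketch below shows how such a pipeline could be wired together with spaCy (which the installation steps below rely on). It is a minimal, hypothetical illustration rather than the package's actual code: the letter-count threshold, the helper names and the omission of the character n-gram language filter are all assumptions made for the example.

import re
import spacy

# Hypothetical sketch only -- not the package's real implementation.
# Requires the model from: python -m spacy download en
nlp = spacy.load("en")

def passes_letter_filter(line, min_letters=20):
    # Letter Filtering: strip everything but Latin characters and check
    # that enough of the line survives. The threshold of 20 letters is
    # an assumption for illustration.
    letters = re.sub(r"[^A-Za-z]", "", line)
    return len(letters) >= min_letters

def normalise_line(line):
    # Entity Tagging, Part-of-speech Tagging and Lowercasing.
    # (Language Filtering via a character n-gram model is omitted here.)
    doc = nlp(line)
    tokens = []
    for token in doc:
        if token.ent_type_:
            # Entities are prefixed with their entity type, e.g. org_google.
            tokens.append(token.ent_type_.lower() + "_" + token.text.lower())
        elif token.is_alpha:
            # Other words get their part-of-speech tag, e.g. nns_authorities.
            tokens.append(token.tag_.lower() + "_" + token.text.lower())
        else:
            tokens.append(token.text.lower())
    return " ".join(tokens)

line = "European authorities fined Google a record $5.1 billion on Wednesday"
if passes_letter_filter(line):
    print(normalise_line(line))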

Dependencies :globe_with_meridians:

Installation :inbox_tray:

The chief requirements of any installation are based on scrapy and splash; see below for local and cloud-based options.

Locally

  1. Run python setup.py install
  2. Run python -m spacy download en

Via PyPI

  1. Run pip install textnormaliser
  2. Run python -m spacy download en

Usage example :eyes:

In order to normalise a corpus of text, execute the following:

textnormaliser [corpus-file] [output-file]
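
For example, with hypothetical file names:

textnormaliser comments.txt comments_normalised.txt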

The result will be a new output file containing the normalised text. Let's say you had the following input file:

European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices

Yeah, as a Libertarian who leans left socially but right in other areas (used to be fiscally, but the right isn't about smaller government anymore!) I can personally attest to the fact that calling us names just makes us more determined to dig in and shore up.

You don't win people over by calling them stupid and telling them they are wrong.  You win them over by convincing them that there is something new and better.

This would result in the following:

norp_european nns_authorities vbd_fined org_google a nn_record nnp_money_ 5 cd_1_billion in_on date_wednesday in_for vbg_abusing prp$_its nn_power in_in dt_the jj_mobile nn_phone nn_market cc_and vbd_ordered dt_the nn_company to_to vb_alter prp$_its nns_practices
uh_yeah , in_as a norp_libertarian wp_who vbz_leans vbn_left rb_socially cc_but rb_right in_in jj_other nns_areas ( vbn_used to_to vb_be rb_fiscally , cc_but dt_the nn_right vbz_is rb_n't rb_about jjr_smaller nn_government rb_anymore ! ) i md_can rb_personally vb_attest to_to dt_the nn_fact in_that vbg_calling prp_us rb_names rb_just vbz_makes prp_us rbr_more jj_determined to_to vb_dig in_in cc_and vb_shore rp_up .
prp_you vbp_do rb_n't vb_win nns_people in_over in_by vbg_calling prp_them jj_stupid cc_and vbg_telling prp_them prp_they vbp_are jj_wrong . prp_you vbp_win prp_them rp_over in_by vbg_convincing prp_them in_that ex_there vbz_is nn_something jj_new cc_and jjr_better .
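
Because every tagged token is just a lowercase tag joined to the word with an underscore, downstream code can recover the tags with a simple split. The snippet below is only an illustrative assumption about how the output might be consumed: splitting on the first underscore is a heuristic, and punctuation tokens carry no tag.

def split_token(token):
    # Split "nns_authorities" into ("nns", "authorities");
    # untagged tokens (punctuation, digits) come back with no tag.
    tag, sep, word = token.partition("_")
    return (tag, word) if sep else (None, token)

line = "norp_european nns_authorities vbd_fined org_google"
print([split_token(t) for t in line.split()])
# [('norp', 'european'), ('nns', 'authorities'), ('vbd', 'fined'), ('org', 'google')]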

License :memo:

The project is available under the MIT license.

Acknowledgements

  • The icon in the README banner is "text" by Chameleon Design from the Noun Project.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

textnormaliser-1.0.0.tar.gz (4.6 kB)

Uploaded Source

Built Distributions

textnormaliser-1.0.0-py2.7.egg (6.3 kB)

Uploaded Source

textnormaliser-1.0.0-py2-none-any.whl (5.4 kB)

Uploaded Python 2

File details

Details for the file textnormaliser-1.0.0.tar.gz.

File metadata

  • Download URL: textnormaliser-1.0.0.tar.gz
  • Upload date:
  • Size: 4.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.1 setuptools/40.6.2 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/2.7.10

File hashes

Hashes for textnormaliser-1.0.0.tar.gz
Algorithm Hash digest
SHA256 042691f0390b0d8e3ee444ae07a3417708d98545779f405d6a0c775189c3ef47
MD5 23109ad09ae75c56ce773dc9dfdccd89
BLAKE2b-256 69ddae524fc03891e3d2e71f5c73eb98e28de1134c37d84be2be0c02a1b6abec

See more details on using hashes here.

File details

Details for the file textnormaliser-1.0.0-py2.7.egg.

File metadata

  • Download URL: textnormaliser-1.0.0-py2.7.egg
  • Upload date:
  • Size: 6.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.1 setuptools/40.6.2 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/2.7.15

File hashes

Hashes for textnormaliser-1.0.0-py2.7.egg
Algorithm Hash digest
SHA256 9806d8115b5c8311d8bd85c2b4702181a746c5cdb7a9dabbc80c19aa8afb2494
MD5 bcd532a878bdf052179cfbc99b147f89
BLAKE2b-256 ce7e9a80b73adcd68266c4fdc38c546c647e31c681e05457865e7f6de2c38408

See more details on using hashes here.

File details

Details for the file textnormaliser-1.0.0-py2-none-any.whl.

File metadata

  • Download URL: textnormaliser-1.0.0-py2-none-any.whl
  • Upload date:
  • Size: 5.4 kB
  • Tags: Python 2
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.1 setuptools/40.6.2 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/2.7.10

File hashes

Hashes for textnormaliser-1.0.0-py2-none-any.whl
Algorithm Hash digest
SHA256 5e67b55c3dda3900997fa17370373842874fcc91501a70e1e2598160ebb29b1e
MD5 c1e5c429aa38561dca4d5ec4dab3364c
BLAKE2b-256 61a04c31c8e6417f1795055b2ff0c06f4ee10b5ffdebb514f6f972c60a4b510b

See more details on using hashes here.
