pycontractions

Intelligently expand and create contractions in text leveraging grammar checking and Word Mover's Distance.

These details have not been verified by PyPI

Project links

Homepage

Project description

A Python library for expanding and creating common English contractions in text. This is very useful for dimensionality reduction, because it normalizes the text before word or character vectors are generated. Contraction is performed with simple replacement rules for commonly used English contractions.

Expansion, on the other hand, is not as simple, because it requires contextual knowledge to choose the correct replacement words. Consider the following rules:

I'd -> I would
I'd -> I had

How can we automatically decide which rule to use for each match in the following text?

I’d like to know how I’d done that!

This library takes a three-pass approach. First, the simple contractions with only a single rule are replaced. On the second pass, if any contractions with multiple rules are present, every combination of rules is applied to produce all possible texts. On the third pass, each text is passed through a grammar checker and the Word Mover’s Distance (WMD) is calculated between it and the original text. The hypotheses are then sorted by fewest grammatical errors and shortest distance from the original text, and the top hypothesis is returned as the expanded form.

The grammatical error count eliminates the worst choices, but in many cases several hypotheses contain no grammatical errors or the same number of them. In these cases the WMD works as the tie-breaker. WMD is the minimum weighted cumulative cost required to move all words from the original text to each hypothesis. It leverages the underlying semantic vector model, whether Word2Vec, GloVe, FastText, or any other model you choose. As the only difference between hypotheses is the replacement of a contraction with its expansion, the “closest” hypothesis to the original text will be the one with the minimum Euclidean distance between the contraction and expansion word pair in the embedding space.
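The hypothesis-generation pass described above can be sketched in a few lines. This is only an illustration, not the library's implementation: the RULES table and the generate_hypotheses helper are hypothetical names, and the matching is a plain regular expression with none of the library's handling of case or missing apostrophes.

```python
import itertools
import re

# Hypothetical rule table for illustration; the library ships its own rules.
RULES = {
    "I'd": ["I would", "I had"],
}


def generate_hypotheses(text, rules):
    """Return every text obtained by expanding each match independently."""
    if not rules:
        return [text]
    # Longest contractions first so that longer keys win over shorter prefixes.
    keys = sorted(rules, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    matches = list(pattern.finditer(text))
    if not matches:
        return [text]

    # One list of candidate expansions per match, in order of appearance.
    options = [rules[m.group(0)] for m in matches]

    hypotheses = []
    for choice in itertools.product(*options):
        parts, last = [], 0
        for match, replacement in zip(matches, choice):
            parts.append(text[last:match.start()])
            parts.append(replacement)
            last = match.end()
        parts.append(text[last:])
        hypotheses.append("".join(parts))
    return hypotheses


for hypothesis in generate_hypotheses("I'd like to know how I'd done that!", RULES):
    print(hypothesis)
```

With two matches and two rules each, this produces four candidate texts. Each of them would then be scored by the grammar checker and by WMD.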

Using the Google News pre-trained model yields good results, but some cases can still cause problems. Consider the following rules:

ain't -> am not
ain't -> are not
ain't -> is not
ain't -> has not
ain't -> have not

And the following sentence:

We ain’t all the same

The output hypotheses using the Google model will look like this (Hypothesis, WMD, # Grammar Errors):

[('We have not all the same', 0.6680542214210519, 0),
 ('We are not all the same', 0.7372250927409768, 0),
 ('We has not all the same', 0.7223834627019157, 1),
 ('We am not all the same', 0.8113022453418426, 1),
 ('We is not all the same', 0.6954222661000212, 2)]

Notice that the grammar checker eliminates the worst offenders, but two hypotheses remain with no grammar errors. Among other reasons, have is more commonly embedded between “we” and “not” than are in the Google News dataset, so hypothesis 1 has a lower travel cost than hypothesis 2. Trying the 300-dimensional FastText model with 1 million word vectors, we see:

[('We have not all the same', 0.45723494251012825, 0),
 ('We are not all the same', 0.46916066501924986, 0),
 ('We has not all the same', 0.49631577238129004, 1),
 ('We am not all the same', 0.5491228638094231, 1),
 ('We is not all the same', 0.4898885599267869, 2)]

While the first result is still incorrect, the distances have changed order: the second hypothesis (are not) is now closer to the original text than the third (has not), the reverse of the Google News result. This model is much closer to providing the correct expansion. As with anything that relies on models, your mileage will vary based on the embedding model you use and how well it matches your data. In general, however, the approach works well enough for many pre-processing tasks.
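The ordering of the two result lists above can be reproduced with an ordinary sort, using the tuples exactly as printed. The rank and by_distance helpers are illustrative names, not part of the library.

```python
# (hypothesis, WMD, grammar errors), copied from the Google News output above.
google = [
    ("We have not all the same", 0.6680542214210519, 0),
    ("We are not all the same", 0.7372250927409768, 0),
    ("We has not all the same", 0.7223834627019157, 1),
    ("We am not all the same", 0.8113022453418426, 1),
    ("We is not all the same", 0.6954222661000212, 2),
]

# The same hypotheses scored with the FastText model.
fasttext = [
    ("We have not all the same", 0.45723494251012825, 0),
    ("We are not all the same", 0.46916066501924986, 0),
    ("We has not all the same", 0.49631577238129004, 1),
    ("We am not all the same", 0.5491228638094231, 1),
    ("We is not all the same", 0.4898885599267869, 2),
]


def rank(hypotheses):
    """Sort by fewest grammar errors first, then by shortest WMD."""
    return sorted(hypotheses, key=lambda h: (h[2], h[1]))


def by_distance(hypotheses):
    """Return the hypothesis texts ordered by WMD alone."""
    return [text for text, _wmd, _errors in sorted(hypotheses, key=lambda h: h[1])]


print(rank(google)[0][0])     # top hypothesis under the Google News model
print(by_distance(google))    # "are not" is fourth by distance alone
print(by_distance(fasttext))  # "are not" is second by distance alone
```

Sorting on the pair (errors, distance) means the distance only matters among hypotheses with the same error count, which is the tie-breaker behaviour described earlier.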

For performance, an optimized version works under the assumption that every instance of a particular contraction should be expanded the same way. This is generally the case in short texts like Tweets or IRC chats. For longer texts such as comments or webpages, the slower but more accurate approach will yield better results.
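To see why that assumption helps, compare how many candidate texts each strategy has to score. The sketch below assumes every combination is enumerated exhaustively; hypothesis_counts is a hypothetical helper and the library's internals may differ.

```python
from math import prod  # Python 3.8+


def hypothesis_counts(matches, rules):
    """Return (per_instance, per_contraction) candidate counts.

    matches: the ambiguous contractions found in a text, in order, with repeats.
    rules:   mapping from contraction to its list of possible expansions.
    """
    # Every occurrence chooses its expansion independently.
    per_instance = prod(len(rules[m]) for m in matches)
    # Every distinct contraction chooses once, and all its occurrences follow.
    per_contraction = prod(len(rules[m]) for m in set(matches))
    return per_instance, per_contraction


rules = {
    "I'd": ["I would", "I had"],
    "ain't": ["am not", "are not", "is not", "has not", "have not"],
}

print(hypothesis_counts(["I'd", "I'd"], rules))
print(hypothesis_counts(["ain't", "ain't", "ain't"], rules))
```

Under this assumption the per-instance count grows exponentially with the number of occurrences, while the per-contraction count depends only on how many distinct ambiguous contractions appear.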

Example usage

>>> from pycontractions import Contractions

# Load your favorite semantic vector model in gensim keyedvectors format from disk
>>> cont = Contractions('GoogleNews-vectors-negative300.bin')

# or specify any model from the gensim.downloader api
>>> cont = Contractions(api_key="glove-twitter-100")

# or train or load your own keyedvectors model and pass it in
>>> cont = Contractions(kv_model=mykvmodel)

# optional, prevents loading on first expand_texts call
>>> cont.load_models()

The faster, less precise version is the default:

>>> list(cont.expand_texts(["I'd like to know how I'd done that!",
                            "We're going to the zoo and I don't think I'll be home for dinner.",
                            "Theyre going to the zoo and she'll be home for dinner."]))
 [u'I had like to know how I had done that!',
  u'we are going to the zoo and I do not think I will be home for dinner.',
  u'they are going to the zoo and she will be home for dinner.']

Notice that the error in the first text is corrected below when using precise=True:

>>> list(cont.expand_texts(["I'd like to know how I'd done that!",
                            "We're going to the zoo and I don't think I'll be home for dinner.",
                            "Theyre going to the zoo and she'll be home for dinner."], precise=True))
 [u'I would like to know how I had done that!',
  u'we are going to the zoo and I do not think I will be home for dinner.',
  u'they are going to the zoo and she will be home for dinner.']

To insert contractions, use the contract_texts method:

>>> list(cont.contract_texts(["I would like to know how I had done that!",
                              "We are not driving to the zoo, it will take too long.",
                              "I have already tried that and i could not figure it out"]))
 [u"I'd like to know how I'd done that!",
  u"We aren't driving to the zoo, it'll take too long.",
  u"I've already tried that and i couldn't figure it out"]

Performance of the default and precise versions on an Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz:

>>> cont = Contractions(api_key="glove-twitter-25")
>>> cont.load_models()

>>> text = "Theyre going to the zoo and she'll be home for dinner."
>>> %timeit list(cont.expand_texts([text]))
10 loops, best of 3: 21.4 ms per loop
>>> %timeit list(cont.expand_texts([text], precise=True))
10 loops, best of 3: 25.1 ms per loop

# text now holds a 349 word movie review (contents not shown)
>>> len(text.split())
349
>>> %timeit list(cont.expand_texts([text]))
1 loop, best of 3: 1.17 s per loop
>>> %timeit list(cont.expand_texts([text], precise=True))
1 loop, best of 3: 2.88 s per loop

# Contraction is fast, same 349 word movie review
>>> %timeit list(cont.contract_texts([text]))
100 loops, best of 3: 4.77 ms per loop
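The %timeit lines above are IPython magics. Outside IPython, a comparable measurement can be taken with the standard timeit module. This is a minimal sketch; best_per_loop is a hypothetical helper, and the commented usage line assumes the cont and text objects from the session above.

```python
import timeit


def best_per_loop(func, number=10, repeat=3):
    """Return the best average time per call, in seconds, over `repeat` runs."""
    # timeit.repeat returns one total time per run of `number` calls.
    totals = timeit.repeat(func, number=number, repeat=repeat)
    return min(totals) / number


# Stand-in workload so the sketch runs without a loaded model.
seconds = best_per_loop(lambda: sum(range(1000)))
print(f"{seconds * 1e3:.4f} ms per loop")

# With a loaded Contractions instance (assumed from the session above):
# best_per_loop(lambda: list(cont.expand_texts([text], precise=True)))
```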

Installation

To install via pip:

$ pip install pycontractions

Prerequisites

Grammar checking is provided by language-check, which depends on the Java LanguageTool package, so this package depends on it too (and on Java 6.0+). The language-check installer should take care of downloading LanguageTool for you, but this may take several minutes depending on your internet connection.
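Before installing, it can help to confirm that a Java runtime is reachable. The following is a small sketch using only the standard library; find_java and java_version_banner are hypothetical helper names, and the check only looks for a java executable on PATH without validating the version requirement.

```python
import shutil
import subprocess


def find_java():
    """Return the path of the `java` executable on PATH, or None if absent."""
    return shutil.which("java")


def java_version_banner(java_path):
    """Return the text printed by `java -version`."""
    result = subprocess.run(
        [java_path, "-version"], capture_output=True, text=True
    )
    # The JVM writes its version banner to stderr rather than stdout.
    return (result.stderr or result.stdout).strip()


java = find_java()
if java is None:
    print("No java executable found on PATH; install a Java runtime first.")
else:
    print(java_version_banner(java))
```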

Release history

2.0.1 (this version): Sep 6, 2019
2.0.0: Sep 11, 2018
1.0.2: Sep 7, 2018
1.0.1: Aug 22, 2017
1.0.0: Jul 20, 2017

Download files

Source Distribution

pycontractions-2.0.1.tar.gz (11.6 kB)

Uploaded Sep 6, 2019 Source

Built Distribution

pycontractions-2.0.1-py3-none-any.whl (9.6 kB)

Uploaded Sep 6, 2019 Python 3

File details

Details for the file pycontractions-2.0.1.tar.gz.

File metadata

Download URL: pycontractions-2.0.1.tar.gz
Upload date: Sep 6, 2019
Size: 11.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.35.0 CPython/3.6.7

File hashes

Hashes for pycontractions-2.0.1.tar.gz
Algorithm	Hash digest
SHA256	`43de2a756e3a910ef55797a7a01927f1c0c02249fa2cc087a73895e9614f2554`
MD5	`a45b428a34bdf65ceeba4c182d251af2`
BLAKE2b-256	`cb5dbecb28bdc8820e74ce35a4581eb8a668208b96152ffce63f25927e154fc1`

File details

Details for the file pycontractions-2.0.1-py3-none-any.whl.

File metadata

Download URL: pycontractions-2.0.1-py3-none-any.whl
Upload date: Sep 6, 2019
Size: 9.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.35.0 CPython/3.6.7

File hashes

Hashes for pycontractions-2.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9b22fa71dd91d536c29923c331e866aafb907eef5386e4e838819c58e2fe1ebb`
MD5	`cd46f0485dabb93c7c80d4c97bdf9a40`
BLAKE2b-256	`a6f5d3ec9491c530cbc03af32ca2c6b69b0e89660daeb2856b485d90f9d82e5e`
