pynlp

Python wrapper for Stanford CoreNLP

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

# pynlp
[![Build Status](https://travis-ci.org/sina-al/pynlp.svg?branch=master)](https://travis-ci.org/sina-al/pynlp)
[![PyPI version](https://badge.fury.io/py/pynlp.svg)](https://badge.fury.io/py/pynlp)

A *pythonic* wrapper for Stanford CoreNLP.

<p align="center">
<img src="https://media.giphy.com/media/l2QDNvOnIK6H2CRgY/giphy.gif" >
</p>

## Description
This library provides a Python interface to [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) built over [`corenlp_protobuf`](https://github.com/stanfordnlp/python-corenlp-protobuf).

## Installation
1. Download Stanford CoreNLP from the official [download page](https://stanfordnlp.github.io/CoreNLP/download.html).
2. Unzip the file and set your `CORE_NLP` environment variable to point to the directory.
3. Install `pynlp` from pip
```
pip3 install pynlp
```

## Quick Start

### Launch the server
Lauch the `StanfordCoreNLPServer` using the instruction given [here](https://stanfordnlp.github.io/CoreNLP/corenlp-server.html). *Alternatively*, simply run the module.
```
python3 -m pynlp
```
*By default, this lauches the server on localhost using port 9000 and 4gb ram for the JVM. Use the `--help` option for instruction on custom configurations.*

### Example

Let's start off with an excerpt from a CNN article.
```python
text = ('GOP Sen. Rand Paul was assaulted in his home in Bowling Green, Kentucky, on Friday, '
'according to Kentucky State Police. State troopers responded to a call to the senator\'s '
'residence at 3:21 p.m. Friday. Police arrested a man named Rene Albert Boucher, who they '
'allege "intentionally assaulted" Paul, causing him "minor injury". Boucher, 59, of Bowling '
'Green was charged with one count of fourth-degree assault. As of Saturday afternoon, he '
'was being held in the Warren County Regional Jail on a $5,000 bond.')
```
### Instantiate annotator
Here we demonstrate the following annotators:
* **Annotoators:** *tokenize, ssplit, pos, lemma, ner, entitymentions, coref, sentiment, quote, openie*
* **Options:** *openie.resolve_coref*
```python
from pynlp import StanfordCoreNLP

annotators = 'tokenize, ssplit, pos, lemma, ner, entitymentions, coref, sentiment, quote, openie'
options = {'openie.resolve_coref': True}

nlp = StanfordCoreNLP(annotators=annotators, options=options)

```
### Annotate text
The `nlp` instance is callable. Use it to annotate the text and return a `Document` object.
```python
document = nlp(text)

print(document) # prints 'text'
```
#### Sentence splitting
Let's test the *ssplit* annotator. A `Document` object iterates over its `Sentence` objects.
```python
for index, sentence in enumerate(document):
print(index, sentence, sep=' )')
```
Output:
```
0) GOP Sen. Rand Paul was assaulted in his home in Bowling Green, Kentucky, on Friday, according to Kentucky State Police.
1) State troopers responded to a call to the senator's residence at 3:21 p.m. Friday.
2) Police arrested a man named Rene Albert Boucher, who they allege "intentionally assaulted" Paul, causing him "minor injury".
3) Boucher, 59, of Bowling Green was charged with one count of fourth-degree assault.
4) As of Saturday afternoon, he was being held in the Warren County Regional Jail on a $5,000 bond.
```
#### Named entity recognition
How about finding all the people mentioned in the document?
```python
[str(entity) for entity in document.entities if entity.type == 'PERSON']
```
Output:
```
Out[2]: ['Rand Paul', 'Rene Albert Boucher', 'Paul', 'Boucher']
```
We may use named entities on a sentence level too.
```python
first_sentence = document[0]
for entity in first_sentence.entities:
print(entity, '({})'.format(entity.type))
```
Output:
```
GOP (ORGANIZATION)
Rand Paul (PERSON)
Bowling Green (LOCATION)
Kentucky (LOCATION)
Friday (DATE)
Kentucky State Police (ORGANIZATION)
```
#### Part-of-speech tagging
Let's find all the 'VB' tags in the first sentence. A `Sentence` object iterates over `Token` objects.
```python
for token in first_sentence:
if 'VB' in token.pos:
print(token, token.pos)
```
Output:
```
was VBD
assaulted VBN
according VBG
```
#### Lemmatization
Using the same words, lets see the lemmas.
```python
for token in first_sentence:
if 'VB' in token.pos:
print(token, '->', token.lemma)
```
Output:
```
was -> be
assaulted -> assault
according -> accord
```
#### Coreference resultion
Let's use pynlp to find the first `CorefChain` in the text.
```python
chain = document.coref_chains[0]
print(chain)
```
Output:
```
((GOP Sen. Rand Paul))-[id=4] was assaulted in (his)-[id=5] home in Bowling Green, Kentucky, on Friday, according to Kentucky State Police.
State troopers responded to a call to (the senator's)-[id=10] residence at 3:21 p.m. Friday.
Police arrested a man named Rene Albert Boucher, who they allege "(intentionally assaulted" Paul)-[id=16], causing him "minor injury.
```
In the string representation, coreferences are marked with parenthesis and the referent with double parenthesis.
Each is also labelled with a `coref_id`. Let's have a closer look at the referent.
```python
ref = chain.referent
print('Coreference: {}\n'.format(ref))

for attr in 'type', 'number', 'animacy', 'gender':
print(attr, getattr(ref, attr), sep=': ')

# Note that we can also index coreferences by id
assert chain[4].is_referent
```
Output:
```
Coreference: Police

type: PROPER
number: SINGULAR
animacy: ANIMATE
gender: UNKNOWN
```

#### Quotes
Extracting quotes from the text is simple.
```python
print(document.quotes)
```
Output:
```
[<Quote: "intentionally assaulted">, <Quote: "minor injury">]
```

#### TODO (annotation wrappers):
- [x] ssplit
- [ ] ner
- [x] pos
- [x] lemma
- [x] coref
- [x] quote
- [ ] quote.attribution
- [ ] parse
- [ ] depparse
- [x] entitymentions
- [ ] openie
- [ ] sentiment
- [ ] relation
- [ ] kbp
- [ ] entitylink
- [ ] 'options' examples i.e openie.resolve_coref

### Saving annotations
#### Write
A pynlp document can be saved as a byte string.
```python
with open('annotation.dat', 'wb') as file:
file.write(document.to_bytes())
```
#### Read
To load a pynlp document, instantiate a `Document` with the `from_bytes` class method.
```python
from pynlp import Document

with open('annotation.dat', 'rb') as file:
document = Document.from_bytes(file.read())
```

Project details

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

0.4.2

Jun 29, 2018

This version

0.4.1

Jun 29, 2018

0.4.0

Apr 16, 2018

0.3.9

Apr 14, 2018

0.3.5

Nov 26, 2017

0.3.4.2

Nov 12, 2017

0.3.4.1

Nov 8, 2017

0.3.4

Nov 8, 2017

0.3.3

Nov 5, 2017

0.3.2

Nov 5, 2017

0.3.1

Nov 5, 2017

0.3.0

Nov 5, 2017

0.2

Oct 30, 2017

0.2.0

Oct 30, 2017

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pynlp-0.4.1.tar.gz (24.3 kB view hashes)

Uploaded Jun 29, 2018 Source

Built Distribution

pynlp-0.4.1-py3-none-any.whl (23.1 kB view hashes)

Uploaded Jun 29, 2018 Python 3

Hashes for pynlp-0.4.1.tar.gz

Hashes for pynlp-0.4.1.tar.gz
Algorithm	Hash digest
SHA256	`84fb1fc9d1b416f24d99f37b6b2929bd06cfe10f8ca0fb78823734068b429c4d`
MD5	`ddae522d398c830034d51e8f35f24096`
BLAKE2b-256	`271d817ad98b1da655f0f6b4b1e1dda30524bbd798eaed0e6514686bfb316992`

Hashes for pynlp-0.4.1-py3-none-any.whl

Hashes for pynlp-0.4.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7472f33f547dae77242406691c32be720669d92aa535188a5ca85c4a4d35b892`
MD5	`7c6751b02e154d437983f88d86f9f1be`
BLAKE2b-256	`0032038febfbd9c7be7e904f5e6a88c78aa6cdddd4d9c4773ac524f84db45b04`