Connector and tools for working with the ACL sonnet database.

# database.acriticismlab.org Connector

This repo holds Python functions that make the process of connecting to the ACL sonnet database easier. The tools in this
repo can also be installed via pip:

```
pip install AclDatabaseTools
```

## Instructions

**Requirements:**

If you've never used Python or written a single line of code before, you will need to install the following software
on your PC before you begin. All of it is quite painless to install. If you're dabbling in this world, I recommend
you install Atom / Notepad++ from the IDE / text editor list below; if you're planning to go a bit further, I suggest PyCharm CE.

1. Git: https://git-scm.com/downloads
2. Python 3.6: https://www.python.org/downloads/
* Ensure you install pip; Python's default installation settings will do this automatically.
3. An IDE or text editor (pick one):
* PyCharm CE: https://www.jetbrains.com/pycharm/download/#section=windows <- free, all the bells and whistles.
(Mac / Windows / *nix)
* Notepad++: https://notepad-plus-plus.org/download/v7.5.6.html <- free, very simple. (Windows)
* Atom: https://atom.io/ <- free, very simple. (Mac / Windows / *nix)
4. A whole lot of patience. If you're new to the world of computer code, this is a good introduction.

Once you have successfully set up all this software, you can move on to the next step.

To install the AclDatabaseTools in PyCharm, open a new project (use default values to create a new project) and press
control-alt-s to open PyCharm's project settings. In the window that pops up, expand "Project {{the name of your
project}}" then click on "Project Interpreter." On the right hand side of the page, click the green "+" (plus) symbol.
Type "AclDatabaseTools" in the search field, click on "AclDatabaseTools" in the results, then click the "Install
Package" button on the bottom of the page.

If you prefer to clone the repository and install from the command line instead, two commands are involved. The first
command ("cd", followed by the repository's folder name) "moves" your current active directory to the "root" (i.e. the top) of the
repository's directory tree. [Here](https://www.youtube.com/watch?v=hUW5MEKDtMM) is a video on how all this works in layman's terms.

Depending on how fast your computer is, the second command (a "pip install" pointed at the repository) can take a couple
of minutes to finish up. This installs all the "dependencies" the project requires. Think of dependencies as tools: they
are bits of code other people have written to make your life as a programmer easier. You will need them to run the
Python scripts in this repository.

**Create a file:**

**NOTE:** Items in double "{{" brackets indicate content that will be specific to your machine. Don't copy the content
of the brackets; follow the instructions to find the content you will need for your individual computer. This tutorial
uses PyCharm; if you want to use another IDE you will have to figure out how to do all this via Google-fu.

Now, close the settings window. Let's create a Python file. Along the left-hand side of the PyCharm window, you should
see a file tree. Right click on the "top" folder (it should have the same name as your project) and select New -> Python
File. Give the file a name (e.g. "Test") and copy the following code into the file:

```python
"""
Basic demonstration functions for the BeyondAporia db. You must run
'nltk.download()' from a Python interactive console before this will
work.

author: Josh Harkema
"""

import nltk

import AclDatabaseTools.word_tools as wt
from AclDatabaseTools.database_connector import AclDatabaseConnector


def main():
"""
Testing functions for demonstration and teaching.
:return: noting, prints to console.
"""
# Get all the sonnets from the database.
sonnets = AclDatabaseConnector().get_all()

# Tokenize the text of the sonnets into words, remove all
# punctuation and change everything to lower case.
sonnets = wt.tokenize_words(sonnets, True, True)

# Print the total number of words in the database.
print(len(sonnets))

# Remove all the stop words.
sonnets = wt.remove_stop_words(sonnets)

# Plot the frequency of words from the clean word_tokens
frequency = nltk.FreqDist(sonnets)

# Print the plot.
frequency.plot(20, cumulative=False)


main()

```

**Run the code:**

Now we're ready to make some sausage! Right click on the file you just created and select "Run {{your file name}}."

You did it! You ran a Python script. A window should now pop up showing a plot of the most frequent words in the
database. Now we'll get into the details of what just happened.

Go to your text editor from the requirements list above, browse to the repository's directory in Windows Explorer /
Finder, and open SonnetAnalysis.py.

This file contains a number of elements:

```python
"""
Basic demonstration functions for the BeyondAporia db. You must run
'nltk.download()' from a Python interactive console before this will
work.
"""
```

This first part is a "comment" (technically, a docstring). Comments don't "do" anything but serve as a way to
communicate details about the script.

```python
import nltk

import AclDatabaseTools.word_tools as wt
from AclDatabaseTools.database_connector import AclDatabaseConnector
```

This part is where we start by 'importing,' or making available, some of the dependencies we installed above.

```python
def main():
```

This line is a 'structural' thing. It places all the code inside what is called a function. Think of a function as a
self-contained series of instructions that you can run by typing its name.

Tabs (indents) are important in Python: the indented lines under the 'def main():' line are considered "part"
of the function.

If you look at the last line (notice that it is not indented):
```python
main()
```
This is an example of how that works. Using a function's name to execute (run) its instructions is called "calling" the
function: you literally "call" it by its name.
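As a tiny illustration (not part of the repository's code), here is a complete define-and-call in one place:

```python
def greet():
    # These indented lines are "part" of the function.
    message = "Hello from greet()"
    print(message)


greet()  # calling the function by its name runs its instructions
```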

Let's get to the juicy parts; there is a lot happening here. Skip past the """ comments.

```python
sonnets = AclDatabaseConnector().get_sonnets_by_ids("10")
sonnets = wt.tokenize_lines(sonnets, True, True)
```

The first line of this code does a tonne. In this single line a connection to the BeyondAporia database is created and
sonnet ID "10" is downloaded to your computer in a format Python understands.
[This](https://beyondaporia.com/sonnets/by_id/10) is what the computer sees. If you look closely, you'll be able to
pick out Shakespeare's *Sonnet II*. This single line of code gets all the data you'll need to perform an analysis of
this single sonnet (we will cover how to get larger volumes of sonnets later).

The second line of code turns this sonnet into a series of tokens. In this case it divides the sonnet's text into its
lines and stores them in a way that can be analyzed. The [NLTK](https://www.nltk.org/) requires text to be converted
into 'tokens' before you can analyze them. It's not super important to understand exactly what this means at this point.
But keep in mind a 'token' can be a single letter, a word, a group of words, a sentence, a paragraph, a page, a chapter,
a book, a large number of books, etc. The size of a 'token' is entirely arbitrary. In this case we're using lines of
poetry because this is the easiest way to analyze rhyme.
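To make the idea of token size concrete, here is a sketch using plain Python string methods (no NLTK involved); the two lines of text are the opening of Shakespeare's *Sonnet II*:

```python
# The same text can be tokenized at different granularities.
text = ("When forty winters shall besiege thy brow,\n"
        "And dig deep trenches in thy beauty's field,")

line_tokens = text.splitlines()  # one token per line of poetry
word_tokens = text.split()       # one token per word

print(len(line_tokens))  # 2
print(len(word_tokens))  # 15
```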

The stuff in parentheses "()" is called parameters. In the first line, the parameter "10" indicates the sonnet ID you want.
In the second line, the parameter "sonnets" tells "tokenize_lines" you want to tokenize the sonnets you retrieved in the
first line. The parameter "True" (note the capital "T") tells "tokenize_lines" you want to remove all the punctuation;
changing this parameter to "False" (note the capital "F") tells "tokenize_lines" you want to keep the punctuation. The second
"True" indicates you want to transform all characters (i.e. letters) to lower case. Parameters are *always* comma
separated. Forgetting the comma is something you're going to do a lot; your IDE / text editor will
underline the mistake (just like MS Word) when you do.
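As a sketch of how these parameters behave, here is a toy stand-in; the function below is illustrative only and is not the library's actual implementation of "tokenize_lines":

```python
import string


def tokenize_lines(text, remove_punctuation, to_lower):
    """Toy stand-in to show how the parameters work; NOT the
    real wt.tokenize_lines implementation."""
    if remove_punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    if to_lower:
        text = text.lower()
    return text.splitlines()


# Parameters are always comma separated; True/False flip behaviour on and off.
print(tokenize_lines("Shall I compare thee?\nThou art more lovely.", True, True))
# ['shall i compare thee', 'thou art more lovely']
```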

So let's break this all down:

```python
sonnets =
```
By placing a word (almost any word will do) on the left-hand side of an '=' sign, I'm telling the computer to 'do'
something and store the result in the 'variable' (word) on the left-hand side of the '=' sign.
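A minimal illustration of assignment (not from the repository):

```python
total = 2 + 3  # "do" the addition on the right, store the result in 'total'
print(total)   # 5
```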

```python
AclDatabaseConnector().get_sonnets_by_ids("10")
```
This tells Python to use the AclDatabaseConnector() class (from the library we imported above) to get the sonnets whose
IDs match the number "10". You can add more than one ID to this string (e.g. "10,11,12,13" will get you those four
sonnets). There are easier ways to get large volumes of sonnets:

```python
AclDatabaseConnector().get_all()
```
This will give you the entire database. These 'functions' (everything after the '.' behind AclDatabaseConnector()) are
covered in more detail below.

The next lines:

```python
# Plot the frequency of words from the clean word_tokens
frequency = nltk.FreqDist(sonnets)

# Print the plot.
frequency.plot(20, cumulative=False)
```
These lines tell Python to count word frequencies with nltk's FreqDist and "print" the result as a chart in your PyCharm IDE when the "main()" function is run.
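If you want a feel for what a frequency distribution is without the NLTK installed, Python's built-in collections.Counter behaves much the same way for counting; the word list here is made up for illustration:

```python
from collections import Counter

# Counter tallies how often each token appears, much like nltk.FreqDist.
tokens = ["thy", "beauty", "thy", "brow", "thy", "field"]
frequency = Counter(tokens)

print(frequency.most_common(1))  # [('thy', 3)]
```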

Wow, that's a lot to take in. And this is where this tutorial ends. There are thousands of great tutorials online about
how to use the NLTK. I suggest [this](https://dzone.com/articles/nlp-tutorial-using-python-nltk-simple-examples)
tutorial as a good next step.

Once you have a better handle on the NLTK you can use the helper scripts (WordTools.py and connector.py) to connect
to the database and do your own analysis.

### Gotchas

* Python doesn't care whether you use ' (single-quotes) or " (double-quotes); they both do the same thing.
* If something isn't working make sure your function calls have the () after them, this is the most common beginner
mistake.
* If something still isn't working check your tabs. Tabs "nest" in python:

```python
def main():
    # do main() stuff.
    for x in y:
        # do x stuff
        for z in a:
            # do z stuff.
        # GOTCHA! also does x stuff.
    # GOTCHA! does main() stuff.
```
note: this code doesn't run; it's called pseudo-code and serves as an illustration.
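Here is a runnable version of the same nesting idea, with real loops so you can see which level each line "belongs" to:

```python
def main():
    print("main() stuff")
    for x in range(2):
        print("x stuff", x)
        for z in range(2):
            print("z stuff", z)
        print("also x stuff", x)   # GOTCHA! back at the x level
    print("main() stuff again")    # GOTCHA! back at the main() level


main()
```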

### The helper methods

**How to use the helper scripts:**

1. Install the package via pip (as shown above), or copy the 'connector.py' and 'WordTools.py' files into the same
directory as the Python script you're writing for your analysis.
2. Add the following lines to the top of the script you're writing:

```python
import nltk

import AclDatabaseTools.word_tools as wt
from AclDatabaseTools.database_connector import AclDatabaseConnector
```

**connector.py**

These methods are used to get data from the BeyondAporia database.

All sonnets:
```python
sonnets = AclDatabaseConnector().get_all()
```

Sonnets by author's last name (replace Shakespeare with the author you want):
```python
sonnets = AclDatabaseConnector().get_author_last_name("Shakespeare")
```

Sonnets by author's first name:
```python
sonnets = AclDatabaseConnector().get_author_first_name("Bob")
```

Sonnets by id (id's can be looked up [here](https://beyondaporia.com/lookup/csv)):
```python
sonnets = AclDatabaseConnector().get_sonnets_by_ids("3,5,7,98")
```
note: you can select one sonnet or 1,000. The number of ids is unlimited.

Sonnet by submitter:
```python
sonnets = AclDatabaseConnector().get_sonnets_by_user("jharkema")
```
note: if you need to get 'your' sonnets from the db, this is how to do it.

Sonnet Search:
```python
sonnets = AclDatabaseConnector().search(first_name="william", last_name="shakespeare")
```

Available fields (you **do not need to enter them all; you can use one field or all five**, it's up to you):

* first_name = the author's first name.
* last_name = the author's last name.
* title = the title of the sonnet.
* period = the period the sonnet was published in.
* text = search the body of the sonnet.
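The fields above are passed as 'keyword arguments.' As a sketch of how optional keyword arguments behave in Python generally (the function below is a toy, not the real search implementation):

```python
def search(first_name=None, last_name=None, title=None, period=None, text=None):
    """Toy sketch: build a query from only the fields you actually supplied."""
    fields = {"first_name": first_name, "last_name": last_name,
              "title": title, "period": period, "text": text}
    return {k: v for k, v in fields.items() if v is not None}


print(search(last_name="shakespeare"))  # {'last_name': 'shakespeare'}
```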

**WordTools.py**

These methods make tokenizing and cleaning data easier. Here are some examples. All of these methods can be
performed on any response from the AclDatabaseConnector class we used above: put one of the lines from the
"connector.py" write-up directly above one of the lines below.

Remove punctuation:
```python
sonnets = wt.remove_punctuation(sonnets)
```

Remove 'stop words' (like 'a', 'the', 'as', etc.):
```python
sonnets = wt.remove_stop_words(sonnets)
```
note: you can see the full list of stop words by running print(stopwords.words('english')) in a Python script after
importing stopwords from nltk.corpus.
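As a plain-Python sketch of what stop-word removal does (this tiny stop list is made up; the NLTK's real English list is much longer):

```python
# Keep only the tokens that are not in the stop list.
stop_words = {"a", "the", "as", "in", "of", "and"}

tokens = ["dig", "deep", "trenches", "in", "thy", "beauty's", "field"]
filtered = [t for t in tokens if t not in stop_words]

print(filtered)  # ['dig', 'deep', 'trenches', 'thy', "beauty's", 'field']
```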

Tokenize into words:
```python
word_list = wt.tokenize_words(sonnets)  # the result, a list of words, is stored in word_list
```

Tokenize into lines (poetry):
```python
line_list = wt.tokenize_lines(sonnets)  # the result, a list of lines, is stored in line_list
```


