Project description

llp

Literary Language Processing (LLP): corpora, models, and tools for the digital humanities.

Make a corpus

If you have a folder of plain text files and an accompanying metadata file, you can build a corpus object directly:

from llp.corpus.default import PlainTextCorpus

corpus = PlainTextCorpus(
	path_txt='texts',              # path to a folder of txt files
	path_metadata='metadata.xls',  # path to a metadata CSV, TSV, XLS, XLSX file
	col_fn='filename'              # column in metadata pointing to txt file (relative to `path_txt`)
)
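
The column named in `col_fn` should hold each text's filename, relative to `path_txt`; any other columns simply travel along as metadata. As an illustrative sketch (the `title` and `year` columns here are hypothetical, not required by llp), a CSV version of the metadata file might look like:

filename,title,year
emma.txt,Emma,1815
middlemarch.txt,Middlemarch,1871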

Load a pre-existing corpus

Start working with corpora in a few lines:

# import the llp module
import llp

# load the ECCO-TCP corpus [distributed freely online]
corpus = llp.load('ECCO_TCP')

# don't have it yet?
corpus.download()

Do things with corpora

# get the metadata as a dataframe
df_meta = corpus.metadata

# loop over the texts...
for text in corpus.texts():
    # get a string of that text
    text_str = text.txt

    # get the metadata as a dictionary
    text_meta = text.meta
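
These two accessors combine naturally. As a minimal sketch (assuming pandas is installed; the whitespace split is a crude stand-in for llp's own tokenizer), you can tabulate a word count per text alongside its metadata:

import pandas as pd

rows = []
for text in corpus.texts():
    row = dict(text.meta)                      # copy the metadata dictionary
    row['num_words'] = len(text.txt.split())   # crude whitespace word count
    rows.append(row)

df_lengths = pd.DataFrame(rows)
print(df_lengths.head())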

Do other things with texts

With any text object, you can:

# Get a text
texts = corpus.texts()
text = texts[0]

# Get the plain text as a string
txt = text.txt

# Get the metadata as a dictionary
metadata = text.meta

# Get the word tokens as a list
tokens = text.tokens

# Get the word counts as a dictionary
counts = text.freqs()

# Get the n-gram counts as a dictionary
bigrams = text.freqs_ngram(n=2)

# Get a list of passages mentioning a phrase (Key Word In Context)
passages = text.get_passages(phrases=['labour'])

# Get a spacy (http://spacy.io) representation
text_spacy = text.spacy()
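
A short usage sketch tying a few of these together, assuming the `counts` and `passages` objects from above (and that each passage prints sensibly as a string):

# print the ten most frequent words in this text
for word, count in sorted(counts.items(), key=lambda x: -x[1])[:10]:
    print(word, count)

# print each passage mentioning 'labour'
for passage in passages:
    print(passage)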

Do other things with corpora

Now that you have a corpus object, you can:

# Get the texts as a list
texts = corpus.texts()

# Get the metadata as a list of dictionaries
metadata = corpus.meta

# Save a list of the most frequent words
corpus.gen_mfw()

# Save a term-document matrix for the top 10000 most frequent words
corpus.gen_freq_table(n=10000)

# Save a list of possible duplicate texts in corpus, by title similarity
corpus.rank_duplicates_bytitle()

# Save a list of possible duplicate texts in corpus, by the content of the text (MinHash)
corpus.rank_duplicates()
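
Because `corpus.metadata` is a plain pandas dataframe (see above), ordinary pandas filtering applies. A sketch, assuming the metadata happens to include a `year` column (hypothetical; column names vary by corpus):

# keep only the eighteenth-century texts (assumes a 'year' column exists)
df_meta = corpus.metadata
df_c18 = df_meta[df_meta['year'].astype(int) < 1800]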

Do things with models

# Get a word2vec model object (backed by gensim)
w2v_model = corpus.word2vec()

# Build the model
w2v_model.model()

# Save model
w2v_model.save()

# Get the original gensim object
gensim_model = w2v_model.gensim
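
Once you have the gensim object, you can query it with gensim's own API. A minimal sketch, assuming `gensim_model` is a standard gensim Word2Vec instance:

# find the nearest neighbours of a word in the trained embedding space
for word, score in gensim_model.wv.most_similar('labour', topn=10):
    print(word, round(score, 3))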

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llp-0.1.5.tar.gz (109.2 kB)

File details

Details for the file llp-0.1.5.tar.gz.

File metadata

  • Download URL: llp-0.1.5.tar.gz
  • Upload date:
  • Size: 109.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.6.8

File hashes

Hashes for llp-0.1.5.tar.gz:

  • SHA256: 6b365f6658807ef8c648d62bef14c239a972f05920d341be1b1cbf9b255a0238
  • MD5: c2813f5e32f57661f2eb1e69ae3aa22e
  • BLAKE2b-256: 8d0ac02bd353ed8fb891690e5899b60da8bb4375b9bfbd2796e99b53c3b22be7

