Literary Language Processing (LLP): corpora, models, and tools for the digital humanities
Project description
llp
Literary Language Processing (LLP): corpora, models, and tools for the digital humanities.
Quickstart
Install
Just run pip:
pip install llp
Or if you're newer to Python programming, and prefer to install LLP as part of a text mining "starter pack" of tools and software, check out the LTM Starter Pack.
Configure
To configure, type:
llp configure
By default,
Load
Download a corpus:
llp download ecco_tcp
Then use it:
import llp
corpus = llp.load('ECCO_TCP') # an llp.Corpus object
corpus.metadata # a pandas dataframe
for text in corpus.texts():                 # looping over llp.Text objects
    print(text.id, text.author, text.year)  # print some attributes
    # ... (see below for more)
Corpus magic
There are a few ways to create a corpus using LLP.
1. Downloading pre-existing corpora
To see which corpora are downloadable, run:
llp status
If you see an up arrow next to a type of data, you can download it.
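For example, the ECCO-TCP corpus used throughout this page can be fetched either from the command line (as in the Quickstart) or from within Python; both routes are shown in the sketch below, using only calls documented elsewhere on this page:

# from the shell:
#   llp download ecco_tcp

# or from within Python:
import llp
corpus = llp.load('ECCO_TCP')   # get the llp.Corpus object
corpus.download()               # download its data if you don't already have it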
2. Creating a corpus from your own texts
If you have a folder of plain text files and an accompanying metadata file, you can build a corpus from them:
from llp.corpus import Corpus
my_corpus = Corpus(
    path_txt='my_texts',              # path to a folder of txt files
    path_metadata='my_metadata.xls',  # path to a metadata CSV, TSV, XLS, XLSX file
    col_fn='my_filename_column'       # column in metadata pointing to txt file (relative to `path_txt`)
)
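Once built, the corpus behaves like the downloadable ones. A minimal sketch of a quick sanity check, using only calls shown elsewhere on this page:

# peek at the metadata (a pandas dataframe)
print(my_corpus.metadata.head())

# loop over the first few llp.Text objects
for text in my_corpus.texts()[:5]:
    print(text.id)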
Load a pre-existing corpus
Start working with corpora in a few lines:
# import the llp module
import llp
# load the ECCO-TCP corpus [distributed freely online]
corpus = llp.load('ECCO_TCP')
# don't have it yet?
corpus.download()
Do things with corpora
# get the metadata as a dataframe
df_meta = corpus.metadata
# loop over the texts...
for text in corpus.texts():
    # get a string of that text
    text_str = text.txt
    # get the metadata as a dictionary
    text_meta = text.meta
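Because corpus.metadata is a pandas dataframe, ordinary pandas operations apply to it. A small sketch, assuming the metadata includes a year column (as the text.year attribute above suggests; adjust the column name to your corpus):

# count how many texts the corpus contains per year
texts_per_year = df_meta.groupby('year').size()
print(texts_per_year.sort_index().head(20))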
Do other things with texts
With any text object, you can do the following:
# Get a text
texts = corpus.texts()
text = texts[0]
# Get the plain text as a string
txt = text.txt
# Get the metadata as a dictionary
metadata = text.meta
# Get the word tokens as a list
tokens = text.tokens
# Get the word counts as a dictionary
counts = text.freqs()
# Get the n-gram counts as a dictionary
bigrams = text.freqs_ngram(n=2)
# Get a list of passages mentioning a phrase (Key Word In Context)
passages = text.get_passages(phrases=['labour'])
# Get a spacy (http://spacy.io) representation
text_spacy = text.spacy()
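As a small worked example of what these attributes make easy, the sketch below computes a type-token ratio and the ten most frequent words of the text, using only the tokens and counts variables defined above:

# lexical richness: number of distinct word types divided by number of tokens
ttr = len(set(tokens)) / len(tokens)

# ten most frequent words in this text
top10 = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]

print(round(ttr, 3), top10)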
Do other things with corpora
Now that you have a corpus object, you can also do the following:
# Get the texts as a list
texts = corpus.texts()
# Get the metadata as a list of dictionaries
metadata = corpus.meta
# Save a list of the most frequent words
corpus.gen_mfw()
# Save a term-document matrix for the top 10000 most frequent words
corpus.gen_freq_table(n=10000)
# Save a list of possible duplicate texts in corpus, by title similarity
corpus.rank_duplicates_bytitle()
# Save a list of possible duplicate texts in corpus, by the content of the text (MinHash)
corpus.rank_duplicates()
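If you want corpus-wide word counts in memory rather than the files saved above, one way (a sketch using only the per-text freqs() call already shown) is to sum each text's counts with a Counter:

from collections import Counter

# aggregate word counts across every text in the corpus
corpus_counts = Counter()
for text in corpus.texts():
    corpus_counts.update(text.freqs())

print(corpus_counts.most_common(20))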
Do things with models
# Generate a word2vec model with gensim
w2v_model = corpus.word2vec()
w2v_model.model()  # build/train the model
# Save model
w2v_model.save()
# Get the original gensim object
gensim_model = w2v_model.gensim
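Once you have the underlying gensim object, the usual gensim word2vec API applies to it. A sketch, assuming gensim_model is a trained gensim Word2Vec instance and that the query words appear in its vocabulary:

# nearest neighbours of a word in the embedding space
print(gensim_model.wv.most_similar('labour', topn=10))

# cosine similarity between two words
print(gensim_model.wv.similarity('labour', 'work'))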
Project details
Release history
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file llp-0.2.0.tar.gz.
File metadata
- Download URL: llp-0.2.0.tar.gz
- Upload date:
- Size: 5.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.6.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | cdec383bdb3831b64c203321def9521589855fe36e4d585f1da3163e58a5457e
MD5 | c34a2271b172c3ddf678f776971691a0
BLAKE2b-256 | 3d50f767cd6189ae647dd73fce24b010d8e818c9e1b2098ed5842475a6eaae82