# llp

Literary Language Processing (LLP): corpora, models, and tools for the digital humanities.
## Make a corpus

If you have a folder of plain text files and an accompanying metadata file, you can build a corpus directly:

```python
from llp.corpus.default import PlainTextCorpus

corpus = PlainTextCorpus(
    path_txt='texts',              # path to a folder of txt files
    path_metadata='metadata.xls',  # path to a metadata CSV, TSV, XLS, or XLSX file
    col_fn='filename'              # metadata column naming each txt file (relative to `path_txt`)
)
```
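As a concrete illustration of the on-disk layout `PlainTextCorpus` expects, the stdlib-only sketch below creates a folder of txt files and a matching metadata table. The `author` and `year` columns are hypothetical examples; only the column named by `col_fn` is required.

```python
import csv
import os

# A folder of plain text files...
os.makedirs('texts', exist_ok=True)
with open(os.path.join('texts', 'text1.txt'), 'w') as f:
    f.write('An inquiry into the nature and causes of the wealth of nations.')

# ...plus a metadata table whose 'filename' column names each file.
with open('metadata.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['filename', 'author', 'year'])  # 'author'/'year' are example columns
    writer.writerow(['text1.txt', 'Adam Smith', '1776'])
```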
## Load a pre-existing corpus

Start working with corpora in a few lines:

```python
# import the llp module
import llp

# load the ECCO-TCP corpus (distributed freely online)
corpus = llp.load('ECCO_TCP')

# don't have it yet?
corpus.download()
```
## Do things with corpora

```python
# get the metadata as a dataframe
df_meta = corpus.metadata

# loop over the texts...
for text in corpus.texts():
    # get a string of that text
    text_str = text.txt

    # get the metadata as a dictionary
    text_meta = text.meta
```
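The loop above visits one text at a time; a common pattern is to aggregate per-text counts into corpus-level totals. A stdlib-only sketch, using raw strings in place of the `text.txt` values:

```python
from collections import Counter

# stand-ins for the strings that text.txt would return
texts = [
    "the labour of the poor",
    "the wealth of nations",
]

# aggregate word counts across the whole corpus
totals = Counter()
for txt in texts:
    totals.update(txt.lower().split())

print(totals["the"])  # 3
```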
## Do other things with texts

With any text object, you can:

```python
# Get a text
texts = corpus.texts()
text = texts[0]

# Get the plain text as a string
txt = text.txt

# Get the metadata as a dictionary
metadata = text.meta

# Get the word tokens as a list
tokens = text.tokens

# Get the word counts as a dictionary
counts = text.freqs()

# Get the n-gram counts as a dictionary
bigrams = text.freqs_ngram(n=2)

# Get a list of passages mentioning a phrase (Key Word In Context)
passages = text.get_passages(phrases=['labour'])

# Get a spacy (http://spacy.io) representation
text_spacy = text.spacy()
```
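`get_passages` returns Key Word In Context snippets. To make the idea concrete, here is a minimal KWIC function in plain Python — a sketch of the technique, not llp's implementation:

```python
def kwic(txt, phrase, window=20):
    """Return snippets with `window` characters of context on each side of `phrase`."""
    passages = []
    start = 0
    while True:
        i = txt.find(phrase, start)
        if i == -1:
            break
        passages.append(txt[max(0, i - window):i + len(phrase) + window])
        start = i + len(phrase)
    return passages

sample = "The annual labour of every nation is the fund which supplies it."
print(kwic(sample, "labour"))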
## Do other things with corpora

Now that you have a corpus object, you can:

```python
# Get the texts as a list
texts = corpus.texts()

# Get the metadata as a list of dictionaries
metadata = corpus.meta

# Save a list of the most frequent words
corpus.gen_mfw()

# Save a term-document matrix for the top 10000 most frequent words
corpus.gen_freq_table(n=10000)

# Save a list of possible duplicate texts in the corpus, by title similarity
corpus.rank_duplicates_bytitle()

# Save a list of possible duplicate texts in the corpus, by the content of the text (MinHash)
corpus.rank_duplicates()
```
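`gen_mfw` and `gen_freq_table` rest on the same idea: rank words by corpus-wide frequency, then tabulate per-document counts over that vocabulary. A stdlib-only sketch of the idea (the document strings and `n` are illustrative):

```python
from collections import Counter

docs = {
    "doc1": "the labour of the poor",
    "doc2": "the wealth of nations",
}
counts = {name: Counter(txt.split()) for name, txt in docs.items()}

# most frequent words (MFW) across the whole corpus
totals = Counter()
for c in counts.values():
    totals.update(c)
n = 3
mfw = [word for word, _ in totals.most_common(n)]

# term-document table: one row per document, one column per MFW
table = {name: [c[word] for word in mfw] for name, c in counts.items()}
```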
## Do things with models

```python
# Generate a word2vec model with gensim
w2v_model = corpus.word2vec()
w2v_model.model()

# Save the model
w2v_model.save()

# Get the original gensim object
gensim_model = w2v_model.gensim
```
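Queries on the gensim object, such as `most_similar`, rank neighbouring words by cosine similarity between their vectors. For reference, cosine similarity in plain Python (the vectors are illustrative, not real embeddings):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 2.0], [2.0, 4.0]))  # parallel vectors -> approximately 1.0
```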
## Project details

Source distribution: llp-0.1.9.tar.gz (5.4 MB)