Literary Language Processing (LLP): corpora, models, and tools for the digital humanities
Quickstart
- Install:
pip install llp # install with pip in terminal
- Download an existing corpus...
llp status # show which corpora/data are available
llp download ECCO_TCP # download a corpus
...or import your own:
llp import \
    -path_txt mycorpus/txts \
    -path_metadata mycorpus/meta.xls \
    -col_fn filename

# -path_txt:      a folder of txt files (use -path_xml for xml)
# -path_metadata: a metadata csv/tsv/xls file describing those txt files
# -col_fn:        the metadata column holding each file's .txt filename
...or start a new one:
llp create # then follow the interactive prompt
- Then you can load the corpus in Python:
import llp # import llp as a python module
corpus = llp.load('ECCO_TCP') # load the corpus by name or ID
...and play with convenient Corpus objects...
df = corpus.metadata # get corpus metadata as a pandas dataframe
smpl = df.query('1740 < year < 1780') # do a quick query on the metadata
texts = corpus.texts() # get a convenient Text object for each text
texts_smpl = corpus.texts(smpl.id) # get Text objects for a specific list of IDs
...and Text objects:
for text in texts_smpl:                # loop over Text objects
    text_meta = text.meta              # get text metadata as a dictionary
    author = text.author               # get common metadata as attributes
    txt = text.txt                     # get plain text as a string
    xml = text.xml                     # get xml as a string
    tokens = text.tokens               # get list of words (incl. punctuation)
    words = text.words                 # get list of words (excl. punctuation)
    counts = text.word_counts          # get word counts as a dictionary (from JSON if saved)
    ocracc = text.ocr_accuracy         # get an estimate of OCR accuracy
    spacy_obj = text.spacy             # get a spacy object for the text
    nltk_obj = text.nltk               # get an nltk text object
    blob_obj = text.blob               # get a textblob object
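Those last three attributes hook into standard NLP libraries. As a hedged sketch of what that enables — assuming text.spacy returns a spacy Doc, text.blob a textblob TextBlob, and text.nltk an nltk Text, as the names suggest; the query word is illustrative, not part of llp:

for text in texts_smpl:
    people = [ent.text for ent in text.spacy.ents
              if ent.label_ == 'PERSON']      # named persons via spacy NER
    polarity = text.blob.sentiment.polarity   # sentiment polarity in [-1, 1] via textblob
    text.nltk.concordance('liberty')          # print a concordance via nltk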
Corpus magic
Each corpus object can generate data about itself:
corpus.save_metadata() # save metadata from xml files (if possible)
corpus.save_plain_text() # save plain text from xml (if possible)
corpus.save_mfw() # save list of all words in corpus and their total count
corpus.save_freqs() # save counts as JSON files
corpus.save_dtm() # save a document-term matrix with top N words
You can also run these commands in the terminal:
llp install my_corpus                # equivalent to the Python calls above
llp install my_corpus -parallel 4    # with parallel processing via MPI/Slingshot
llp install my_corpus dtm            # run only a specific step
Generating this data in advance makes it quick to access things like:
mfw = corpus.mfw(n=10000) # get the 10K most frequent words
dtm = corpus.freqs(words=mfw) # get a document-term matrix as a pandas dataframe
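Because the result is an ordinary pandas dataframe, the usual pandas idioms apply. A minimal sketch, assuming rows are texts, columns are words, and the cells hold raw counts (the query word is illustrative, not part of llp):

rel = dtm.div(dtm.sum(axis=1), axis=0)                      # normalize each row to relative frequencies
top = rel['liberty'].sort_values(ascending=False).head(10)  # the ten texts that use 'liberty' most often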
You can also build word2vec models:
w2v_model = corpus.word2vec() # get an llp word2vec model object
w2v_model.model() # run the modeling process
w2v_model.save() # save the model somewhere
gensim_model = w2v_model.gensim # get the original gensim object
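Assuming w2v_model.gensim hands back a standard gensim Word2Vec object, the full gensim API is then available. A sketch (the query words are illustrative):

similar = gensim_model.wv.most_similar('virtue', topn=10)   # nearest neighbors in vector space
analogy = gensim_model.wv.most_similar(                     # classic analogy query
    positive=['king', 'woman'], negative=['man'])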