Literary Language Processing (LLP): corpora, models, and tools for the digital humanities

llp

Quickstart

  1. Install:
pip install llp                       # install with pip in terminal
  2. Download an existing corpus...
llp status                            # show which corpora/data are available
llp download ECCO_TCP                 # download a corpus

...or import your own:

# -path_txt       a folder of txt files (use -path_xml for xml)
# -path_metadata  a metadata csv/tsv/xls about those txt files
# -col_fn         the metadata column holding each text's .txt filename
llp import \
  -path_txt mycorpus/txts \
  -path_metadata mycorpus/meta.xls \
  -col_fn filename

...or start a new one:

llp create                            # then follow the interactive prompt
  3. Then you can load the corpus in Python:
import llp                            # import llp as a python module
corpus = llp.load('ECCO_TCP')         # load the corpus by name or ID

...and play with convenient Corpus objects...

df = corpus.metadata                  # get corpus metadata as a pandas dataframe
smpl = df.query('1740 < year < 1780') # do a quick query on the metadata

texts = corpus.texts()                # get a convenient Text object for each text
texts_smpl = corpus.texts(smpl.id)    # get Text objects for a specific list of IDs
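
The metadata query above is ordinary pandas, so it can be tried without any corpus installed. A minimal sketch, assuming a metadata table with `id` and `year` columns; the toy rows below are invented for illustration and do not come from any llp corpus:

```python
import pandas as pd

# Toy metadata table: ids, authors and years are made up for illustration
df = pd.DataFrame({
    "id": ["t1", "t2", "t3", "t4"],
    "author": ["A", "B", "C", "D"],
    "year": [1735, 1741, 1779, 1780],
})

# Chained comparison inside query(): both bounds are exclusive
smpl = df.query("1740 < year < 1780")

# The id column of the filtered rows is the kind of list that
# corpus.texts(smpl.id) receives in the quickstart
sample_ids = smpl["id"].tolist()
print(sample_ids)
```

Because both bounds are exclusive, texts dated exactly 1740 or 1780 are left out; use `<=` in the query string if they should be included.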

...and Text objects:

for text in texts_smpl:               # loop over Text objects
    text_meta = text.meta             # get text metadata as dictionary
    author = text.author              # get common metadata as attributes    

    txt = text.txt                    # get plain text as string
    xml = text.xml                    # get xml as string

    tokens = text.tokens              # get list of words (incl punct)
    words  = text.words               # get list of words (excl punct)
    counts = text.word_counts         # get word counts as dictionary (from JSON if saved)
    ocracc = text.ocr_accuracy        # get estimate of ocr accuracy
    
    spacy_obj = text.spacy            # get a spacy text object
    nltk_obj = text.nltk              # get an nltk text object
    blob_obj = text.blob              # get a textblob object
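
To make the difference between `tokens`, `words` and `word_counts` concrete, here is a rough standard-library sketch. It is not llp's own tokenizer: the regular expression and the helper names `tokenize` and `only_words` are assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(txt):
    """Hypothetical helper: split lowercased text into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", txt.lower())

def only_words(tokens):
    """Hypothetical helper: drop tokens with no letter or digit (pure punctuation)."""
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]

txt = "The cat sat. The dog, too, sat!"

tokens = tokenize(txt)      # words and punctuation
words = only_words(tokens)  # words only
counts = Counter(words)     # word -> number of occurrences

print(tokens)
print(words)
print(counts.most_common(2))
```

A `Counter` is a dictionary subclass, so it can be written to JSON with `json.dump(counts, f)`, in the spirit of the "from JSON if saved" note above.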

Corpus magic

Each corpus object can generate data about itself:

corpus.save_metadata()                # save metadata from xml files (if possible)
corpus.save_plain_text()              # save plain text from xml (if possible)
corpus.save_mfw()                     # save list of all words in corpus and their total count
corpus.save_freqs()                   # save counts as JSON files
corpus.save_dtm()                     # save a document-term matrix with top N words

You can also run these commands in the terminal:

llp install my_corpus                 # equivalent to the Python calls above
llp install my_corpus -parallel 4     # same, but with parallel processing via MPI/Slingshot
llp install my_corpus dtm             # run a specific step

Once this data has been generated, it becomes easy to access things like:

mfw = corpus.mfw(n=10000)             # get the 10K most frequent words
dtm = corpus.freqs(words=mfw)         # get a document-term matrix as a pandas dataframe
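
Conceptually, a most-frequent-word list and a document-term matrix can both be derived from per-text word counts. The sketch below shows the idea with plain dictionaries. It is not llp's implementation; the `freqs` data and the helper names `most_frequent_words` and `document_term_matrix` are hypothetical.

```python
from collections import Counter

# Hypothetical per-text word counts, keyed by text id
freqs = {
    "t1": Counter({"the": 5, "love": 3, "heart": 1}),
    "t2": Counter({"the": 4, "war": 2}),
    "t3": Counter({"heart": 2, "the": 1, "war": 1}),
}

def most_frequent_words(freqs, n):
    """Sum counts over all texts and return the n most frequent words."""
    total = Counter()
    for counts in freqs.values():
        total.update(counts)
    # Sort by descending count, then alphabetically, so ties are deterministic
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]

def document_term_matrix(freqs, words):
    """Map each text id to a row of counts, one column per word in `words`."""
    return {
        text_id: [counts.get(word, 0) for word in words]
        for text_id, counts in freqs.items()
    }

mfw = most_frequent_words(freqs, n=3)
dtm = document_term_matrix(freqs, mfw)

print(mfw)
for text_id, row in dtm.items():
    print(text_id, row)
```

Restricting the columns to the top N words keeps the matrix small; words outside that list are simply not represented in it.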

You can also build word2vec models:

w2v_model = corpus.word2vec()         # get an llp word2vec model object
w2v_model.model()                     # run the modeling process
w2v_model.save()                      # save the model somewhere
gensim_model = w2v_model.gensim       # get the original gensim object

Download files

Files for llp, version 0.2.2
Filename: llp-0.2.2.tar.gz (5.4 MB). File type: Source. Python version: None.
