Literary Language Toolkit (LLTK): corpora, models, and tools for the digital humanities

Literary Language Toolkit (LLTK)

Corpora, models, and tools for the study of complex language.

Quickstart

See this notebook for a more interactive quickstart (run the code here on Binder).

Install

Open a terminal, Jupyter, or Colab notebook and type:

pip install -qU lltk-dh

# or, for the very latest version:
#pip install -qU git+https://github.com/quadrismegistus/lltk

Show available corpora:

lltk show

Or, within Python, show them as Markdown:

import lltk
lltk.show()

Play with corpora

See below for available corpora.

# Load/install a corpus
import lltk
corpus = lltk.load('ECCO_TCP')           # load the corpus by name or ID

# Metadata
meta = corpus.meta                       # metadata as data frame
smpl = meta.query('1770<year<1830')      # easy query access         

# Data
mfw = corpus.mfw()                       # get the 10K most frequent words as a list
dtm = corpus.dtm()                       # get a document-term matrix as a pandas dataframe
dtm = corpus.dtm(tfidf=True)             # get DTM as tf-idf
mdw = corpus.mdw('gender')               # get most distinctive words for a metadata group

Play with texts

# accessing text objs
texts = corpus.texts()                   # get a list of corpus's text objects
texts_smpl = corpus.texts(smpl)          # text objects from df/list of ids 
texts_rad = corpus.au.Radcliffe          # hit "tab" after typing e.g. "Rad" to autocomplete 
text = corpus.t                          # get a random text object from corpus

# metadata access
text_meta = text.meta                    # get text metadata as dictionary
author = text.author                     # get common metadata as attributes    
title = text.title
year = text.year
dec = text.decade                        # a few inferred attributes too
dec_str = text.decade_str                # '1890-1900' rather than 1890

# data access
txt = text.txt                           # get plain text as string
xml = text.xml                           # get xml as string

# simple nlp
words  = text.words                      # get list of words (excl punct)
sents = text.sents                       # get list of sentences
counts = text.counts                     # get word counts as dictionary (from JSON if saved)

# other nlp
tnltk = text.nltk                        # get nltk Text object
tblob = text.blob                        # get TextBlob object
tstanza = text.stanza                    # get list of stanza objects (one per para)
tspacy = text.spacy                      # get list of spacy objects (one per para)
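
To make the `words` and `counts` attributes above more concrete, here is a minimal standard-library sketch of splitting a string into lowercase words (excluding punctuation) and counting them. The regular expression and the helper names `tokenize` and `count_words` are assumptions for illustration; LLTK's own tokenizer may behave differently.

```python
import re
from collections import Counter

# Hypothetical pattern: runs of letters, optionally with one internal apostrophe.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")

def tokenize(txt):
    """Return lowercase word tokens, dropping punctuation and digits."""
    return [w.lower() for w in WORD_RE.findall(txt)]

def count_words(txt):
    """Return a plain dict mapping each word to its count."""
    return dict(Counter(tokenize(txt)))

sample = "It was the best of times, it was the worst of times."
words = tokenize(sample)      # list of words, no punctuation
counts = count_words(sample)  # word -> count dictionary
```

A dictionary of this shape is easy to save as JSON, which is presumably why `text.counts` can be reloaded from a saved file.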

Available corpora

LLTK has built-in functionality for the following corpora. Some (🌞) are freely downloadable from the links below or the LLTK interface. Some of them (☂️) require first accessing the raw data through your institutional or other subscription. Some corpora have a mixture, with some data open through fair research use (e.g. metadata, freqs) and some closed (e.g. txt, xml, raw).

name desc license metadata freqs txt xml raw
ARTFL American and French Research on the Treasury of the French Language Academic ☂️ ☂️
BPO British Periodicals Online Commercial ☂️ ☂️
CLMET Corpus of Late Modern English Texts Academic 🌞 🌞 ☂️ ☂️
COCA Corpus of Contemporary American English Commercial ☂️ ☂️ ☂️ ☂️
COHA Corpus of Historical American English Commercial ☂️ ☂️ ☂️ ☂️
Chadwyck Chadwyck-Healey Fiction Collections Mixed 🌞 🌞 ☂️ ☂️ ☂️
ChadwyckDrama Chadwyck-Healey Drama Collections Mixed ☂️ ☂️ ☂️ ☂️ ☂️
ChadwyckPoetry Chadwyck-Healey Poetry Collections Mixed ☂️ ☂️ ☂️ ☂️ ☂️
Chicago U of Chicago Corpus of C20 Novels Academic 🌞 🌞 ☂️
DTA Deutsches Text Archiv Free 🌞 🌞 🌞 🌞 🌞
DialNarr Dialogue and Narration separated in Chadwyck-Healey Novels Academic 🌞 🌞 ☂️
ECCO Eighteenth Century Collections Online Commercial ☂️ ☂️ ☂️ ☂️ ☂️
ECCO_TCP ECCO (Text Creation Partnership) Free 🌞 🌞 🌞 🌞 🌞
EEBO_TCP Early English Books Online (curated by the Text Creation Partnership) Free 🌞 🌞 🌞 🌞
ESTC English Short Title Catalogue Academic ☂️
EnglishDialogues A Corpus of English Dialogues, 1560-1760 Academic 🌞 🌞 🌞
EvansTCP Early American Fiction Free 🌞 🌞 🌞 🌞 🌞
GaleAmericanFiction Gale American Fiction, 1774-1920 Academic 🌞 🌞 ☂️ ☂️
GildedAge U.S. Fiction of the Gilded Age Academic 🌞 🌞 🌞
HathiBio Biographies from Hathi Trust Academic 🌞 🌞
HathiEngLit Fiction, drama, verse word frequencies from Hathi Trust Academic 🌞 🌞
HathiEssays Hathi Trust volumes with "essay(s)" in title Academic 🌞 🌞
HathiLetters Hathi Trust volumes with "letter(s)" in title Academic 🌞 🌞
HathiNovels Hathi Trust volumes with "novel(s)" in title Academic 🌞 🌞
HathiProclamations Hathi Trust volumes with "proclamation(s)" in title Academic 🌞 🌞
HathiSermons Hathi Trust volumes with "sermon(s)" in title Academic 🌞 🌞
HathiStories Hathi Trust volumes with "story/stories" in title Academic 🌞 🌞
HathiTales Hathi Trust volumes with "tale(s)" in title Academic 🌞 🌞
HathiTreatises Hathi Trust volumes with "treatise(s)" in title Academic 🌞 🌞
InternetArchive 19th Century Novels, curated by the U of Illinois and hosted on the Internet Archive Free 🌞 🌞 🌞
LitLab Literary Lab Corpus of 18th and 19th Century Novels Academic 🌞 🌞 ☂️
MarkMark Mark Algee-Hewitt's and Mark McGurl's 20th Century Corpus Academic 🌞 🌞 ☂️
OldBailey Old Bailey Online Free 🌞 🌞 🌞 🌞
RavenGarside Raven & Garside's Bibliography of English Novels, 1770-1830 Academic ☂️
SOTU State of the Union Addresses Free 🌞 🌞 🌞
Sellers 19th Century Texts compiled by Jordan Sellers Free 🌞 🌞 🌞
SemanticCohort Corpus used in "Semantic Cohort Method" (2012) Free 🌞
Spectator The Spectator (1711-1714) Free 🌞 🌞 🌞
TedJDH Corpus used in "Emergence of Literary Diction" (2012) Free 🌞 🌞 🌞
TxtLab A multilingual dataset of 450 novels Free 🌞 🌞 🌞 🌞

Documentation

Incomplete for now. See this sample notebook for some examples.

New corpus

Import a corpus into LLTK:

# Use the "import" command with these options:
#   -path_txt       a folder of txt files (use -path_xml for xml)
#   -path_metadata  a metadata csv/tsv/xls about those txt files
#   -col_fn         .txt/.xml filename col in metadata (use -col_id if no ext)
lltk import \
  -path_txt mycorpus/txts \
  -path_metadata mycorpus/meta.xls \
  -col_fn filename
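
As a rough illustration of the folder layout the import command expects, the sketch below writes a toy corpus: a `txts` folder and a metadata CSV whose `filename` column names each text file. Only the `filename` column comes from the command above; the helper name `make_toy_corpus`, the other columns, and the sample records are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

def make_toy_corpus(root):
    """Write two txt files and a meta.csv whose `filename` column points at them."""
    root = Path(root)
    txt_dir = root / "txts"
    txt_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical records: `filename` matches -col_fn; other columns are illustrative.
    rows = [
        {"filename": "a.txt", "author": "Author A", "title": "First Text", "year": 1794},
        {"filename": "b.txt", "author": "Author B", "title": "Second Text", "year": 1811},
    ]
    bodies = {"a.txt": "A first toy text.", "b.txt": "A second toy text."}

    for row in rows:
        (txt_dir / row["filename"]).write_text(bodies[row["filename"]], encoding="utf-8")

    meta_path = root / "meta.csv"
    with meta_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return txt_dir, meta_path

# Build the toy corpus in a temporary directory
corpus_root = Path(tempfile.mkdtemp()) / "mycorpus"
txt_dir, meta_path = make_toy_corpus(corpus_root)
```

With a layout like this, `-path_txt` would point at the `txts` folder and `-path_metadata` at the CSV file.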

Or create a new one:

lltk create

Most frequent words

corpus.mfw_df(
    n=None,                            # Number of top words overall
    by_ntext=False,                    # Count number of documents not number of words
    by_fpm=False,                      # Count by within-text relative sums
    min_count=None,                    # Minimum count of word

    yearbin=None,                      # Average relative counts across `yearbin` periods
    col_group='period',                # Which column to periodize on
    n_by_period=None,                  # Number of top words per period
    keep_periods=True,                 # Keep periods in output dataframe
    n_agg='median',                    # How to aggregate across periods
    min_periods=None,                  # minimum number of periods a word must appear in

    excl_stopwords=False,              # Exclude stopwords (at `PATH_TO_ENGLISH_STOPWORDS`)
    excl_top=0,                        # Exclude words ranked 1:`excl_top`
    valtype='fpm',                     # valtype to compute top words by
    **attrs
)
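
The options above combine several ways of counting. As a rough standard-library sketch (not LLTK's implementation), the following contrasts raw counts, document counts (the idea behind `by_ntext`), and summed within-text relative frequencies, assuming `fpm` stands for frequency per million words. The function name `top_words` and the toy documents are hypothetical.

```python
from collections import Counter

def top_words(docs, n=3, by_ntext=False, by_fpm=False):
    """Rank words across tokenized documents.

    by_ntext: count the number of documents containing each word.
    by_fpm:   sum each word's within-document frequency per million words.
    default:  sum raw counts over all documents.
    """
    totals = Counter()
    for tokens in docs:
        counts = Counter(tokens)
        if by_ntext:
            totals.update(counts.keys())      # each word counted once per document
        elif by_fpm:
            for word, c in counts.items():
                totals[word] += c / len(tokens) * 1_000_000
        else:
            totals.update(counts)
    return [w for w, _ in totals.most_common(n)]

docs = [
    ["whale"] * 6 + ["the", "sea"],
    ["the", "sea", "ship", "the"],
    ["the", "ship"],
]
top_raw = top_words(docs, n=1)
top_ntext = top_words(docs, n=1, by_ntext=True)
top_fpm = top_words(docs, n=1, by_fpm=True)
```

In this toy data, one long document dominates the raw counts, while the document-count and relative-frequency rankings favour a word spread across all texts.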

Document term matrix

corpus.dtm(
    words=[],                          # words to use in DTM
    n=25000,                           # if not `words`, how many mfw?
    texts=None,                        # set texts to use explicitly, otherwise use all
    tf=False,                          # return term frequencies, not counts
    tfidf=False,                       # return tfidf, not counts
    meta=False,                        # include metadata (e.g. ["gender","nation"])
    **mfw_attrs,                       # all other attributes passed to self.mfw()
)
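
For readers new to document-term matrices, here is a minimal pure-Python sketch of what the `words`, `n`, `tf`, and `tfidf` options could mean. It is not LLTK's code: the function name `build_dtm` is hypothetical, it returns plain lists rather than a pandas dataframe, and the tf-idf weighting (relative frequency times `log(N / df)`) is one textbook variant that may differ from the one LLTK uses.

```python
import math
from collections import Counter

def build_dtm(docs, words=None, n=5, tf=False, tfidf=False):
    """Return (vocab, rows), where rows[i][j] is the value of vocab[j] in docs[i]."""
    counts = [Counter(tokens) for tokens in docs]
    if not words:
        # Fall back to the n most frequent words overall
        total = Counter()
        for c in counts:
            total.update(c)
        words = [w for w, _ in total.most_common(n)]

    rows = []
    for tokens, c in zip(docs, counts):
        if tf or tfidf:
            # Term frequency: count divided by document length
            row = [c[w] / len(tokens) if tokens else 0.0 for w in words]
        else:
            row = [c[w] for w in words]
        rows.append(row)

    if tfidf:
        n_docs = len(docs)
        # Document frequency: number of documents containing each word
        df = [sum(1 for c in counts if c[w] > 0) for w in words]
        idf = [math.log(n_docs / d) if d else 0.0 for d in df]
        rows = [[v * w_idf for v, w_idf in zip(row, idf)] for row in rows]
    return words, rows

docs = [["sea", "sea", "ship"], ["sea", "land"]]
vocab, dtm_counts = build_dtm(docs, words=["sea", "ship", "land"])
_, dtm_tf = build_dtm(docs, words=["sea", "ship", "land"], tf=True)
_, dtm_tfidf = build_dtm(docs, words=["sea", "ship", "land"], tfidf=True)
```

Under this weighting, a word that occurs in every document (here `sea`) gets a tf-idf of zero, which is why tf-idf tends to downweight very common words.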

Most distinctive words

corpus.mdw(                                 
    groupby,                           # metadata categorical variable to group by
    words=[],                          # explicitly set words to use
    texts=None,                        # explicitly set texts to use
    tfidf=True,                        # use tfidf as mdw calculation
    keep_null_cols=False,              # remove texts which do not have `groupby` set
    remove_zeros=True,                 # remove rows summing to zero
    agg='median',                      # aggregate by `agg`
)
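
One plausible reading of these parameters, sketched with the standard library only: drop texts whose group is unset, aggregate each word's per-text values by median within each group, and rank words by how far the group's median exceeds the median of all other texts. This scoring rule and the name `most_distinctive` are assumptions for illustration; the source does not spell out the calculation LLTK performs.

```python
from statistics import median

def most_distinctive(rows, vocab, groups, top=2):
    """For each group, rank words by (median inside group) - (median outside group).

    rows:   per-text word values (e.g. relative frequencies or tf-idf), one list per text
    vocab:  the words corresponding to each column of `rows`
    groups: one group label per text; None means the label is unset
    """
    # Drop texts without a group label
    labeled = [(r, g) for r, g in zip(rows, groups) if g is not None]
    result = {}
    for group in sorted({g for _, g in labeled}):
        inside = [r for r, g in labeled if g == group]
        outside = [r for r, g in labeled if g != group]
        if not outside:
            continue  # nothing to compare against
        scores = {
            word: median(r[j] for r in inside) - median(r[j] for r in outside)
            for j, word in enumerate(vocab)
        }
        result[group] = sorted(scores, key=scores.get, reverse=True)[:top]
    return result

vocab = ["ship", "garden", "the"]
rows = [
    [0.30, 0.00, 0.50],
    [0.20, 0.05, 0.50],
    [0.00, 0.25, 0.50],
    [0.05, 0.35, 0.50],
    [0.90, 0.90, 0.90],   # unlabeled text, ignored
]
groups = ["sea", "sea", "land", "land", None]
distinctive = most_distinctive(rows, vocab, groups, top=1)
```

Using the median rather than the mean keeps a single unusual text from dominating a group's score.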
