Literary Language Toolkit (LLTK): corpora, models, and tools for the digital humanities

Project description

Literary Language Toolkit (LLTK)

A Python package for computational literary analysis and digital humanities research. Provides 50+ literary corpora, text processing tools, and analysis methods including word frequencies, document-term matrices, cross-corpus linking, deduplication, and a centralized DuckDB metadata store for querying 2M+ texts across all corpora.

Package: lltk-dh on PyPI | License: MIT | Python: >=3.8

Install

pip install -U lltk-dh

# or latest from source:
pip install -U git+https://github.com/quadrismegistus/lltk

Quick start

import lltk

# List available corpora
lltk.show()

# Load a corpus
c = lltk.load('ecco_tcp')

# Metadata as a pandas DataFrame
c.meta
c.meta.query('1770 < year < 1830')

# Iterate texts
for t in c.texts():
    print(t.id, t.author, t.title, t.year)
    print(t.txt[:200])
    print(t.freqs())       # word frequencies (Counter)

# Corpus-level analysis
mfw = c.mfw(n=10000)              # top 10K words across corpus
dtm = c.dtm(n=10000)              # document-term matrix (DataFrame)
dtm = c.dtm(n=10000, tfidf=True)  # TF-IDF weighted

Installing corpus data

Corpora live at ~/lltk_data/corpora/<corpus_id>/. Each has: metadata.csv, txt/, and optionally xml/, freqs/. Some corpora are freely downloadable; others require institutional access.
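Given the documented layout, a small sketch can report which parts of a corpus are present locally (the helper name `corpus_parts` is our own, not part of LLTK):

```python
from pathlib import Path

def corpus_parts(corpus_id: str, root: Path = Path.home() / "lltk_data" / "corpora"):
    """Report which data parts exist locally for a corpus, per the layout above."""
    cdir = root / corpus_id
    return {
        "metadata": (cdir / "metadata.csv").is_file(),
        "txt": (cdir / "txt").is_dir(),
        "xml": (cdir / "xml").is_dir(),
        "freqs": (cdir / "freqs").is_dir(),
    }
```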

# Download a corpus (metadata + freqs ≈ 150 MB for ecco_tcp)
lltk install ecco_tcp --parts metadata,freqs

# Full texts (adds ~600 MB)
lltk install ecco_tcp --parts txt

Texts

for t in c.texts():
    t.id                    # text identifier
    t.author                # metadata attributes
    t.title
    t.year

    t.txt                   # plain text as string
    t.xml                   # XML source (if available)
    t.freqs()               # word frequencies (Counter)

# Direct access by ID
t = c.text('some_text_id')

Sections

Texts can be split into structural sections (chapters, letters, etc.) from XML, or into paragraphs and fixed-length passages:

# Chapters from XML (auto-detects <div>, <chapter>, <letter>, etc.)
for ch in t.chapters.texts():
    print(ch.get('title'), ch.txt[:100])

# Paragraphs (split on blank lines)
for p in t.paragraphs.texts():
    print(p.id, len(p.txt))

# Passages of ~500 words (respects sentence boundaries)
for p in t.passages(n=500).texts():
    print(p.id, p.get('num_words'))
    print(p.freqs())

Sections are TextSection objects inside a SectionCorpus — they support all the same methods as regular texts (txt, freqs(), meta, etc.).
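The fixed-length passage splitting above can be sketched as a greedy sentence-packing routine — a minimal stand-in, not LLTK's actual implementation:

```python
import re

def split_passages(text: str, n: int = 500):
    """Greedily pack sentences into passages of roughly n words each.
    Sentences are never split, so passages end on sentence boundaries."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    passages, current, count = [], [], 0
    for sent in sentences:
        words = len(sent.split())
        # start a new passage if adding this sentence would exceed the target
        if current and count + words > n:
            passages.append(' '.join(current))
            current, count = [], 0
        current.append(sent)
        count += words
    if current:
        passages.append(' '.join(current))
    return passages
```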

Corpus-level analysis

Most frequent words

mfw = c.mfw(n=10000)       # top 10K words across corpus (list)

Document-term matrix

dtm = c.dtm(n=10000)               # raw counts (DataFrame)
dtm = c.dtm(n=10000, tf=True)      # term frequencies
dtm = c.dtm(n=10000, tfidf=True)   # TF-IDF weighted

Returns a pandas DataFrame: rows = text IDs, columns = words.
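The TF-IDF variant can be understood as a weighting of the raw-count matrix — here is a standard formulation in pandas (LLTK's exact weighting scheme may differ):

```python
import numpy as np
import pandas as pd

def tfidf(dtm: pd.DataFrame) -> pd.DataFrame:
    """Weight a raw-count document-term matrix: tf = count / doc length,
    idf = log(N / df), where df = number of docs containing the term."""
    tf = dtm.div(dtm.sum(axis=1), axis=0)   # normalize rows to term frequencies
    df = (dtm > 0).sum(axis=0)              # document frequency per term
    idf = np.log(len(dtm) / df)
    return tf * idf
```

A term that occurs in every document gets idf = 0 and drops out, which is why TF-IDF highlights distinctive rather than merely frequent vocabulary.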

Duplicate detection

Find near-duplicate texts within a corpus using cosine similarity on TF-IDF word frequency vectors. Works even on corpora with only precomputed freqs/ (no raw text needed).

dupes = c.find_duplicates(
    n=5000,            # number of MFW features
    threshold=0.85,    # minimum cosine similarity
    k=10,              # max neighbors per text
)
# Returns DataFrame: id_1, id_2, similarity (sorted descending)
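The underlying computation — cosine similarity over row vectors, returning pairs above a threshold — can be sketched with NumPy (a simplified all-pairs version; `find_duplicates` itself limits neighbors with `k`):

```python
import numpy as np
import pandas as pd

def find_near_duplicates(dtm: pd.DataFrame, threshold: float = 0.85) -> pd.DataFrame:
    """All-pairs cosine similarity over row vectors; pairs above threshold."""
    m = dtm.to_numpy(dtype=float)
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    unit = m / np.where(norms == 0, 1, norms)   # unit-normalize rows
    sim = unit @ unit.T                         # cosine similarity matrix
    rows, ids = [], dtm.index
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            if sim[i, j] >= threshold:
                rows.append((ids[i], ids[j], sim[i, j]))
    out = pd.DataFrame(rows, columns=['id_1', 'id_2', 'similarity'])
    return out.sort_values('similarity', ascending=False, ignore_index=True)
```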

Metadata

Loading metadata

c.meta returns a pandas DataFrame loaded from the corpus's metadata.csv. Corpus subclasses can override load_metadata() to enrich columns without altering the CSV:

c = lltk.load('estc')
c.meta  # includes enriched columns: format_std, num_pages, is_fiction, etc.

Custom metadata loading

Override load_metadata() in a corpus subclass:

class MyCorpus(BaseCorpus):
    def load_metadata(self):
        meta = super().load_metadata()
        meta['genre'] = 'Fiction'
        meta['decade'] = (meta['year'] // 10) * 10
        return meta

Results are cached — subsequent calls to c.meta or c.load_metadata() return the cached DataFrame.
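The caching behavior amounts to a compute-once, reuse-thereafter pattern — sketched here with hypothetical internals (`_meta`, `_read_metadata` are illustrative names, not LLTK's):

```python
import pandas as pd

class CorpusSketch:
    """Illustrates the lazy metadata cache described above."""
    def __init__(self):
        self._meta = None

    def load_metadata(self) -> pd.DataFrame:
        if self._meta is None:               # first call does the work
            self._meta = self._read_metadata()
        return self._meta                    # later calls reuse the object

    @property
    def meta(self) -> pd.DataFrame:
        return self.load_metadata()

    def _read_metadata(self) -> pd.DataFrame:
        # stand-in for reading metadata.csv from disk
        return pd.DataFrame({'id': ['t1'], 'year': [1800]})
```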

Cross-corpus linking

Corpora can declare shared-ID relationships for linking texts across collections. This supports two patterns:

Metadata merging (many-to-one)

When many texts in corpus A map to one record in corpus B (e.g., many ECCO texts → one ESTC catalogue record), corpus A can merge B's metadata as prefixed columns:

c = lltk.load('ecco')
c.meta
# DataFrame includes: ESTCID, estc_author, estc_format_std, estc_is_fiction, ...

This is configured declaratively on the corpus class:

class ECCO(BaseCorpus):
    LINKS = {'estc': ('ESTCID', 'id_estc')}  # my_col → their_col

    def load_metadata(self):
        meta = super().load_metadata()
        return self.merge_linked_metadata(meta)
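In pandas terms, a many-to-one merge with prefixed columns looks roughly like this (the free function here is an illustration of the idea, not LLTK's `merge_linked_metadata` signature):

```python
import pandas as pd

def merge_prefixed(meta, linked_meta, my_col, their_col, prefix):
    """Many-to-one merge: pull the linked corpus's columns in, prefixed."""
    right = linked_meta.add_prefix(prefix)
    return meta.merge(
        right, how='left',
        left_on=my_col, right_on=prefix + their_col,
    )
```

A left merge preserves every row of the local corpus; texts with no catalogue match simply get NaN in the prefixed columns.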

Text traversal (one-to-many)

When one record maps to many texts in another corpus (e.g., one ESTC record → multiple ECCO editions), use t.linked():

c = lltk.load('estc')
t = c.text('T089174')

# Find all ECCO texts for this ESTC record
ecco_texts = t.linked('ecco')
for et in ecco_texts:
    print(et.id, et.title, et.year)

# Find all EEBO texts
eebo_texts = t.linked('eebo_tcp')

Currently linked corpora

Source Target Link column Direction
ECCO ESTC ESTCID → id_estc metadata merge
EEBO_TCP ESTC id_stc → id_estc metadata merge
ESTC ECCO id_estc → ESTCID text traversal
ESTC EEBO_TCP id_estc → id_stc text traversal

MetaDB (centralized DuckDB metadata store)

lltk.db is a DuckDB-backed metadata cache that indexes all corpora into a single queryable store. It enables fast single-row lookups, cross-corpus queries, title/author matching for deduplication, genre enrichment from bibliography corpora, and virtual corpus construction.

Building the database

lltk db-rebuild                          # ingest all corpora (~4 min)
lltk db-rebuild estc ecco                # re-ingest specific corpora
lltk db-info                             # genre × corpus crosstab
lltk db-match                            # cross-corpus dedup matching (~5 min)
lltk db-enrich-genres                    # propagate genre from bibliographies
lltk db-wordcounts                       # compute word counts from freqs

Querying

import lltk

# Single-row lookup
lltk.db.get('_estc/T012345')

# SQL queries on the texts table
lltk.db.query("SELECT * FROM texts WHERE year < 1700 AND genre = 'Fiction'")
lltk.db.query("SELECT corpus, COUNT(*) as n FROM texts GROUP BY corpus")

# Iterate text objects with filters + dedup
for t in lltk.db.texts(genre='Fiction', year_min=1600, year_max=1800, dedup=True):
    print(t.corpus.id, t.title, t.year)
    print(t.freqs())   # resolves through source corpus

# As DataFrame (no text objects)
df = lltk.db.texts_df(genre='Fiction', dedup=True)

# As a corpus object (supports .mfw, .dtm, .meta)
fiction = lltk.db.corpus(genre='Fiction', dedup=True)

Text objects returned by lltk.db.texts() keep their original corpus reference, so t.txt, t.freqs(), and file paths all resolve through the source corpus.

Cross-corpus matching

Matching finds duplicate/reprint texts across corpora via multiple tiers: exact title+author, containment (short title within long title), and optional fuzzy matching (Jaro-Winkler). Connected components are grouped and ranked by corpus source preference.
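The connected-components step can be sketched with a small union-find: pairwise matches from any tier collapse into match groups (a generic sketch of the technique, not LLTK's matcher code):

```python
def group_matches(pairs):
    """Union-find: collapse pairwise matches into connected components."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)           # union the two components

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(map(sorted, groups.values()))
```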

lltk.db.match()                          # exact + containment matching
lltk.db.find_matches('Incognita')        # search match groups by title
lltk.db.get_group('_estc/T012345')       # all texts in same match group

SyntheticCorpus (virtual corpora from DB queries)

Declarative corpus class that pulls texts from multiple source corpora, deduplicated:

from lltk.corpus.synthetic import SyntheticCorpus

class BigFiction(SyntheticCorpus):
    ID = 'big_fiction'
    NAME = 'BigFiction'
    SOURCES = {
        'chadwyck': {'genre': 'Fiction'},
        'ecco': {'genre': 'Fiction'},
        'hathi_englit': {'genre': 'Fiction'},
    }
    DEDUP = True
    DEDUP_BY = 'oldest'

C = BigFiction()
C.meta                    # DataFrame — all fiction, deduplicated
for t in C.texts():
    t.txt[:100]           # resolves through source corpus paths

Architecture

lltk/
├── imports.py          # Constants, config, third-party imports
├── __init__.py         # Package entry point
├── text/
│   ├── text.py         # BaseText, TextSection, Text() factory
│   ├── textlist.py     # TextList collection class
│   └── utils.py        # Tokenization, XML parsing, text utilities
├── corpus/
│   ├── corpus.py       # BaseCorpus, SectionCorpus, Corpus() factory
│   ├── synthetic.py    # SyntheticCorpus — virtual corpora from DuckDB queries
│   ├── utils.py        # load_corpus(), manifest loading, corpus discovery
│   ├── manifest.txt    # Corpus registry (configparser format)
│   └── <corpus_name>/  # Per-corpus implementations (50+)
├── model/
│   ├── preprocess.py   # Preprocessing (XML→TXT, TXT→freqs)
│   ├── matcher.py      # Text matching/dedup
│   └── ...             # word2vec, doc2vec, networks, etc.
└── tools/
    ├── baseobj.py      # BaseObject (root class)
    ├── tools.py        # Config, utilities, parallel processing
    ├── db.py           # Local DB backends
    ├── metadb.py       # DuckDB centralized metadata store (lltk.db)
    └── logs.py         # Logging

Key patterns

  • Inheritance: BaseObject → TextList → BaseCorpus → corpus subclasses
  • Factories: Text(id) and Corpus(id) return cached objects
  • Lazy loading: Metadata loaded on first access via load_metadata(). Text metadata hydrated lazily on first attribute access.
  • Path resolution: corpus.path_* attributes resolved via __getattr__ → get_path()
  • Manifest: Corpora registered in manifest.txt (configparser). Multiple manifest files merged from package dir, ~/lltk_data/, and user config.
  • Metadata enrichment: Override load_metadata() → call super() → transform DataFrame → return
  • Cross-corpus links: LINKS dict + merge_linked_metadata() for joins, t.linked() for traversal
  • Parquet caching: Metadata CSVs cached as .parquet for 5-10x faster subsequent reads
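The mtime-based cache check behind the parquet bullet looks roughly like this (LLTK caches as .parquet; this sketch uses pickle so it runs without a parquet engine installed):

```python
import pickle
from pathlib import Path
import pandas as pd

def read_metadata_cached(csv_path: Path) -> pd.DataFrame:
    """Reuse a binary cache of metadata.csv unless the CSV is newer."""
    cache = csv_path.with_suffix('.pkl')
    if cache.exists() and cache.stat().st_mtime >= csv_path.stat().st_mtime:
        with open(cache, 'rb') as f:
            return pickle.load(f)          # fast path: cached copy
    df = pd.read_csv(csv_path)             # slow path: parse CSV, refresh cache
    with open(cache, 'wb') as f:
        pickle.dump(df, f)
    return df
```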

CLI reference

lltk show                                # list available corpora
lltk install <corpus> [--parts ...]      # download corpus data
lltk compile <corpus>                    # compile corpus from raw sources
lltk preprocess <corpus> --parts txt     # XML→TXT conversion
lltk preprocess <corpus> --parts freqs   # TXT→word frequencies

lltk db-rebuild [corpus ...]             # rebuild DuckDB metadata store
lltk db-info                             # genre × corpus crosstab
lltk db-match [--fuzzy]                  # cross-corpus dedup matching
lltk db-enrich-genres                    # propagate genre from bibliographies
lltk db-wordcounts [-j N]               # compute word counts from freqs
lltk db-matches "title"                  # search matches by title
lltk db-match-stats                      # matching statistics

lltk annotate <corpus> [--port N]        # launch annotation web app

Development

Running tests

pip install pytest
python -m pytest tests/ -v
python -m pytest tests/ --cov=lltk --cov-report=term   # with coverage

199 tests using the test_fixture corpus (Blake, Austen, Shelley) checked into the repo — no external data needed.

Adding a new corpus

  1. Create lltk/corpus/my_corpus/my_corpus.py:
from lltk.imports import *

class TextMyCorpus(BaseText):
    pass

class MyCorpus(BaseCorpus):
    TEXT_CLASS = TextMyCorpus

    def load_metadata(self):
        meta = super().load_metadata()
        # add/transform columns here
        return meta
  2. Register in lltk/corpus/manifest.txt:
[MyCorpus]
id = my_corpus
name = MyCorpus
desc = Description of the corpus
path_python = my_corpus/my_corpus.py
class_name = MyCorpus
  3. Place data at ~/lltk_data/corpora/my_corpus/:
    • metadata.csv — with id column + any metadata columns
    • txt/ — text files as <text_id>.txt
    • freqs/ — (optional) precomputed word frequencies as JSON
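The optional freqs/ files can be precomputed from the txt/ directory with a short script — the regex tokenizer here is a simple stand-in for LLTK's own preprocessing:

```python
import json
import re
from collections import Counter
from pathlib import Path

def precompute_freqs(corpus_dir: Path):
    """Write freqs/<text_id>.json word-frequency files for each txt/<text_id>.txt."""
    txt_dir, freqs_dir = corpus_dir / 'txt', corpus_dir / 'freqs'
    freqs_dir.mkdir(exist_ok=True)
    for path in txt_dir.glob('*.txt'):
        tokens = re.findall(r"[a-z]+", path.read_text().lower())
        counts = Counter(tokens)
        (freqs_dir / f"{path.stem}.json").write_text(json.dumps(counts))
```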

Available corpora

LLTK has built-in support for the following corpora. Some (🌞) are freely downloadable from the links below or through the LLTK interface. Others (☂) require first obtaining the raw data through an institutional or other subscription. Some corpora are mixed, with some parts open for fair research use (e.g. metadata, freqs) and others closed (e.g. txt, xml, raw).

name desc license metadata freqs txt xml raw
ARTFL American and French Research on the Treasury of the French Language Academic ☂️ ☂️
BPO British Periodicals Online Commercial ☂️ ☂️
CLMET Corpus of Late Modern English Texts Academic 🌞 🌞 ☂️ ☂️
COCA Corpus of Contemporary American English Commercial ☂️ ☂️ ☂️ ☂️
COHA Corpus of Historical American English Commercial ☂️ ☂️ ☂️ ☂️
Chadwyck Chadwyck-Healey Fiction Collections Mixed 🌞 🌞 ☂️ ☂️ ☂️
ChadwyckDrama Chadwyck-Healey Drama Collections Mixed ☂️ ☂️ ☂️ ☂️ ☂️
ChadwyckPoetry Chadwyck-Healey Poetry Collections Mixed ☂️ ☂️ ☂️ ☂️ ☂️
Chicago U of Chicago Corpus of C20 Novels Academic 🌞 🌞 ☂️
DTA Deutsches Text Archiv Free 🌞 🌞 🌞 🌞 🌞
DialNarr Dialogue and Narration separated in Chadwyck-Healey Novels Academic 🌞 🌞 ☂️
EarlyPrint EarlyPrint Project: EEBO/ECCO/Evans TCP with linguistic tagging Free 🌞 🌞 🌞 🌞
ECCO Eighteenth Century Collections Online Commercial ☂️ ☂️ ☂️ ☂️ ☂️
ECCO_TCP ECCO (Text Creation Partnership) Free 🌞 🌞 🌞 🌞 🌞
EEBO_TCP Early English Books Online (curated by the Text Creation Partnership) Free 🌞 🌞 🌞 🌞
ESTC English Short Title Catalogue (481K bibliographic records, metadata-only) Academic ☂️
EnglishDialogues A Corpus of English Dialogues, 1560-1760 Academic 🌞 🌞 🌞
EvansTCP Early American Fiction Free 🌞 🌞 🌞 🌞 🌞
GaleAmericanFiction Gale American Fiction, 1774-1920 Academic 🌞 🌞 ☂️ ☂️
GildedAge U.S. Fiction of the Gilded Age Academic 🌞 🌞 🌞
HathiBio Biographies from Hathi Trust Academic 🌞 🌞
HathiEngLit Fiction, drama, verse word frequencies from Hathi Trust Academic 🌞 🌞
HathiEssays Hathi Trust volumes with "essay(s)" in title Academic 🌞 🌞
HathiLetters Hathi Trust volumes with "letter(s)" in title Academic 🌞 🌞
HathiNovels Hathi Trust volumes with "novel(s)" in title Academic 🌞 🌞
HathiProclamations Hathi Trust volumes with "proclamation(s)" in title Academic 🌞 🌞
HathiSermons Hathi Trust volumes with "sermon(s)" in title Academic 🌞 🌞
HathiStories Hathi Trust volumes with "story/stories" in title Academic 🌞 🌞
HathiTales Hathi Trust volumes with "tale(s)" in title Academic 🌞 🌞
HathiTreatises Hathi Trust volumes with "treatise(s)" in title Academic 🌞 🌞
InternetArchive 19th Century Novels, curated by the U of Illinois and hosted on the Internet Archive Free 🌞 🌞 🌞
LitLab Literary Lab Corpus of 18th and 19th Century Novels Academic 🌞 🌞 ☂️
MarkMark Mark Algee-Hewitt's and Mark McGurl's 20th Century Corpus Academic 🌞 🌞 ☂️
OldBailey Old Bailey Online Free 🌞 🌞 🌞 🌞
RavenGarside Raven & Garside's Bibliography of English Novels, 1770-1830 Academic ☂️
SOTU State of the Union Addresses Free 🌞 🌞 🌞
Sellers 19th Century Texts compiled by Jordan Sellers Free 🌞 🌞 🌞
SemanticCohort Corpus used in "Semantic Cohort Method" (2012) Free 🌞
Spectator The Spectator (1711-1714) Free 🌞 🌞 🌞
TedJDH Corpus used in "Emergence of Literary Diction" (2012) Free 🌞 🌞 🌞
TxtLab A multilingual dataset of 450 novels Free 🌞 🌞 🌞 🌞
