Literary Language Toolkit (LLTK): corpora, models, and tools for the digital humanities

Project description

Literary Language Toolkit (LLTK)

A Python package for computational literary analysis and digital humanities research. Provides 50+ literary corpora, text processing tools, and analysis methods including word frequencies, document-term matrices, cross-corpus linking, deduplication, and a centralized DuckDB metadata store for querying 2M+ texts across all corpora.

Package: lltk-dh on PyPI | License: MIT | Python: >=3.8

Install

pip install -U lltk-dh

# or latest from source:
pip install -U git+https://github.com/quadrismegistus/lltk

Quick start

import lltk

# List available corpora
lltk.show()

# Load a corpus
c = lltk.load('ecco_tcp')

# Metadata as a pandas DataFrame
c.meta
c.meta.query('1770 < year < 1830')

# Iterate texts
for t in c.texts():
    print(t.id, t.author, t.title, t.year)
    print(t.txt[:200])
    print(t.freqs())       # word frequencies (Counter)

# Corpus-level analysis
mfw = c.mfw(n=10000)              # top 10K words across corpus
dtm = c.dtm(n=10000)              # document-term matrix (DataFrame)
dtm = c.dtm(n=10000, tfidf=True)  # TF-IDF weighted

Installing corpus data

Corpora live at ~/lltk_data/corpora/<corpus_id>/. Each has: metadata.csv, txt/, and optionally xml/, freqs/. Some corpora are freely downloadable; others require institutional access.
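Given the documented layout, a small sketch can report which parts of a corpus are present locally (the helper name `corpus_parts` is our own, not part of LLTK):

```python
from pathlib import Path

def corpus_parts(corpus_id: str, root: Path = Path.home() / "lltk_data" / "corpora"):
    """Report which data parts exist locally for a corpus, per the layout above."""
    cdir = root / corpus_id
    return {
        "metadata": (cdir / "metadata.csv").is_file(),
        "txt": (cdir / "txt").is_dir(),
        "xml": (cdir / "xml").is_dir(),
        "freqs": (cdir / "freqs").is_dir(),
    }
```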

# Download a corpus (metadata + freqs ≈ 150 MB for ecco_tcp)
lltk install ecco_tcp --parts metadata,freqs

# Full texts (adds ~600 MB)
lltk install ecco_tcp --parts txt

Texts

for t in c.texts():
    t.id                    # text identifier
    t.author                # metadata attributes
    t.title
    t.year

    t.txt                   # plain text as string
    t.xml                   # XML source (if available)
    t.freqs()               # word frequencies (Counter)

# Direct access by ID
t = c.text('some_text_id')

Sections

Texts can be split into structural sections (chapters, letters, etc.) from XML, or into paragraphs and fixed-length passages:

# Chapters from XML (auto-detects <div>, <chapter>, <letter>, etc.)
for ch in t.chapters.texts():
    print(ch.get('title'), ch.txt[:100])

# Paragraphs (split on blank lines)
for p in t.paragraphs.texts():
    print(p.id, len(p.txt))

# Passages of ~500 words (respects sentence boundaries)
for p in t.passages(n=500).texts():
    print(p.id, p.get('num_words'))
    print(p.freqs())

Sections are TextSection objects inside a SectionCorpus — they support all the same methods as regular texts (txt, freqs(), meta, etc.).
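The fixed-length passage splitting above can be sketched as a greedy sentence-packing routine — a minimal stand-in, not LLTK's actual implementation:

```python
import re

def split_passages(text: str, n: int = 500):
    """Greedily pack sentences into passages of roughly n words each.
    Sentences are never split, so passages end on sentence boundaries."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    passages, current, count = [], [], 0
    for sent in sentences:
        words = len(sent.split())
        # start a new passage if adding this sentence would exceed the target
        if current and count + words > n:
            passages.append(' '.join(current))
            current, count = [], 0
        current.append(sent)
        count += words
    if current:
        passages.append(' '.join(current))
    return passages
```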

Corpus-level analysis

Most frequent words

mfw = c.mfw(n=10000)       # top 10K words across corpus (list)

Document-term matrix

dtm = c.dtm(n=10000)               # raw counts (DataFrame)
dtm = c.dtm(n=10000, tf=True)      # term frequencies
dtm = c.dtm(n=10000, tfidf=True)   # TF-IDF weighted

Returns a pandas DataFrame: rows = text IDs, columns = words.
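The TF-IDF variant can be understood as a weighting of the raw-count matrix — here is a standard formulation in pandas (LLTK's exact weighting scheme may differ):

```python
import numpy as np
import pandas as pd

def tfidf(dtm: pd.DataFrame) -> pd.DataFrame:
    """Weight a raw-count document-term matrix: tf = count / doc length,
    idf = log(N / df), where df = number of docs containing the term."""
    tf = dtm.div(dtm.sum(axis=1), axis=0)   # normalize rows to term frequencies
    df = (dtm > 0).sum(axis=0)              # document frequency per term
    idf = np.log(len(dtm) / df)
    return tf * idf
```

A term that occurs in every document gets idf = 0 and drops out, which is why TF-IDF highlights distinctive rather than merely frequent vocabulary.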

Duplicate detection

Find near-duplicate texts within a corpus using cosine similarity on TF-IDF word frequency vectors. Works even on corpora with only precomputed freqs/ (no raw text needed).

dupes = c.find_duplicates(
    n=5000,            # number of MFW features
    threshold=0.85,    # minimum cosine similarity
    k=10,              # max neighbors per text
)
# Returns DataFrame: id_1, id_2, similarity (sorted descending)
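The underlying computation — cosine similarity over row vectors, returning pairs above a threshold — can be sketched with NumPy (a simplified all-pairs version; `find_duplicates` itself limits neighbors with `k`):

```python
import numpy as np
import pandas as pd

def find_near_duplicates(dtm: pd.DataFrame, threshold: float = 0.85) -> pd.DataFrame:
    """All-pairs cosine similarity over row vectors; pairs above threshold."""
    m = dtm.to_numpy(dtype=float)
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    unit = m / np.where(norms == 0, 1, norms)   # unit-normalize rows
    sim = unit @ unit.T                         # cosine similarity matrix
    rows, ids = [], dtm.index
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            if sim[i, j] >= threshold:
                rows.append((ids[i], ids[j], sim[i, j]))
    out = pd.DataFrame(rows, columns=['id_1', 'id_2', 'similarity'])
    return out.sort_values('similarity', ascending=False, ignore_index=True)
```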

Metadata

Loading metadata

c.meta returns a pandas DataFrame loaded from the corpus's metadata.csv. Corpus subclasses can override load_metadata() to enrich columns without altering the CSV:

c = lltk.load('estc')
c.meta  # includes enriched columns: format_std, num_pages, is_fiction, etc.

Custom metadata loading

Override load_metadata() in a corpus subclass:

class MyCorpus(BaseCorpus):
    def load_metadata(self):
        meta = super().load_metadata()
        meta['genre'] = 'Fiction'
        meta['decade'] = (meta['year'] // 10) * 10
        return meta

Results are cached — subsequent calls to c.meta or c.load_metadata() return the cached DataFrame.
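The caching behavior amounts to a compute-once, reuse-thereafter pattern — sketched here with hypothetical internals (`_meta`, `_read_metadata` are illustrative names, not LLTK's):

```python
import pandas as pd

class CorpusSketch:
    """Illustrates the lazy metadata cache described above."""
    def __init__(self):
        self._meta = None

    def load_metadata(self) -> pd.DataFrame:
        if self._meta is None:               # first call does the work
            self._meta = self._read_metadata()
        return self._meta                    # later calls reuse the object

    @property
    def meta(self) -> pd.DataFrame:
        return self.load_metadata()

    def _read_metadata(self) -> pd.DataFrame:
        # stand-in for reading metadata.csv from disk
        return pd.DataFrame({'id': ['t1'], 'year': [1800]})
```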

Cross-corpus linking

Corpora can declare shared-ID relationships for linking texts across collections. This supports two patterns:

Metadata merging (many-to-one)

When many texts in corpus A map to one record in corpus B (e.g., many ECCO texts → one ESTC catalogue record), corpus A can merge B's metadata as prefixed columns:

c = lltk.load('ecco')
c.meta
# DataFrame includes: ESTCID, estc_author, estc_format_std, estc_is_fiction, ...

This is configured declaratively on the corpus class:

class ECCO(BaseCorpus):
    LINKS = {'estc': ('ESTCID', 'id_estc')}  # my_col → their_col

    def load_metadata(self):
        meta = super().load_metadata()
        return self.merge_linked_metadata(meta)
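In pandas terms, a many-to-one merge with prefixed columns looks roughly like this (the free function here is an illustration of the idea, not LLTK's `merge_linked_metadata` signature):

```python
import pandas as pd

def merge_prefixed(meta, linked_meta, my_col, their_col, prefix):
    """Many-to-one merge: pull the linked corpus's columns in, prefixed."""
    right = linked_meta.add_prefix(prefix)
    return meta.merge(
        right, how='left',
        left_on=my_col, right_on=prefix + their_col,
    )
```

A left merge preserves every row of the local corpus; texts with no catalogue match simply get NaN in the prefixed columns.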

Text traversal (one-to-many)

When one record maps to many texts in another corpus (e.g., one ESTC record → multiple ECCO editions), use t.linked():

c = lltk.load('estc')
t = c.text('T089174')

# Find all ECCO texts for this ESTC record
ecco_texts = t.linked('ecco')
for et in ecco_texts:
    print(et.id, et.title, et.year)

# Find all EEBO texts
eebo_texts = t.linked('eebo_tcp')

Currently linked corpora

Source Target Link column Direction
ECCO ESTC ESTCID → id_estc metadata merge
EEBO_TCP ESTC id_stc → id_estc metadata merge
ESTC ECCO id_estc → ESTCID text traversal
ESTC EEBO_TCP id_estc → id_stc text traversal

MetaDB (centralized DuckDB metadata store)

lltk.db is a DuckDB-backed metadata cache that indexes all corpora into a single queryable store. It enables fast single-row lookups, cross-corpus queries, title/author matching for deduplication, genre enrichment from bibliography corpora, and virtual corpus construction.

Building the database

lltk db-rebuild                          # ingest all corpora (~4 min)
lltk db-rebuild estc ecco                # re-ingest specific corpora
lltk db-info                             # genre × corpus crosstab
lltk db-match                            # cross-corpus dedup matching (~5 min)
lltk db-enrich-genres                    # propagate genre from bibliographies
lltk db-wordcounts                       # compute word counts from freqs

Querying

import lltk

# Single-row lookup
lltk.db.get('_estc/T012345')

# SQL queries on the texts table
lltk.db.query("SELECT * FROM texts WHERE year < 1700 AND genre = 'Fiction'")
lltk.db.query("SELECT corpus, COUNT(*) as n FROM texts GROUP BY corpus")

# Iterate text objects with filters + dedup
for t in lltk.db.texts(genre='Fiction', year_min=1600, year_max=1800, dedup=True):
    print(t.corpus.id, t.title, t.year)
    print(t.freqs())   # resolves through source corpus

# As DataFrame (no text objects)
df = lltk.db.texts_df(genre='Fiction', dedup=True)

# As a corpus object (supports .mfw, .dtm, .meta)
fiction = lltk.db.corpus(genre='Fiction', dedup=True)

Text objects returned by lltk.db.texts() keep their original corpus reference, so t.txt, t.freqs(), and file paths all resolve through the source corpus.

Cross-corpus matching

Matching finds duplicate/reprint texts across corpora via multiple tiers: exact title+author, containment (short title within long title), and optional fuzzy matching (Jaro-Winkler). Connected components are grouped and ranked by corpus source preference.
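The connected-components step can be sketched with a small union-find: pairwise matches from any tier collapse into match groups (a generic sketch of the technique, not LLTK's matcher code):

```python
def group_matches(pairs):
    """Union-find: collapse pairwise matches into connected components."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)           # union the two components

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(map(sorted, groups.values()))
```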

lltk.db.match()                          # exact + containment matching
lltk.db.find_matches('Incognita')        # search match groups by title
lltk.db.get_group('_estc/T012345')       # all texts in same match group

SyntheticCorpus (virtual corpora from DB queries)

Declarative corpus class that pulls texts from multiple source corpora, deduplicated:

from lltk.corpus.synthetic import SyntheticCorpus

class BigFiction(SyntheticCorpus):
    ID = 'big_fiction'
    NAME = 'BigFiction'
    SOURCES = {
        'chadwyck': {'genre': 'Fiction'},
        'ecco': {'genre': 'Fiction'},
        'hathi_englit': {'genre': 'Fiction'},
    }
    DEDUP = True
    DEDUP_BY = 'oldest'

C = BigFiction()
C.meta                    # DataFrame — all fiction, deduplicated
for t in C.texts():
    t.txt[:100]           # resolves through source corpus paths

Architecture

lltk/
├── imports.py          # Constants, config, third-party imports
├── __init__.py         # Package entry point
├── text/
│   ├── text.py         # BaseText, TextSection, Text() factory
│   ├── textlist.py     # TextList collection class
│   └── utils.py        # Tokenization, XML parsing, text utilities
├── corpus/
│   ├── corpus.py       # BaseCorpus, SectionCorpus, Corpus() factory
│   ├── synthetic.py    # SyntheticCorpus — virtual corpora from DuckDB queries
│   ├── utils.py        # load_corpus(), manifest loading, corpus discovery
│   ├── manifest.txt    # Corpus registry (configparser format)
│   └── <corpus_name>/  # Per-corpus implementations (50+)
├── model/
│   ├── preprocess.py   # Preprocessing (XML→TXT, TXT→freqs)
│   ├── matcher.py      # Text matching/dedup
│   └── ...             # word2vec, doc2vec, networks, etc.
└── tools/
    ├── baseobj.py      # BaseObject (root class)
    ├── tools.py        # Config, utilities, parallel processing
    ├── db.py           # Local DB backends
    ├── metadb.py       # DuckDB centralized metadata store (lltk.db)
    └── logs.py         # Logging

Key patterns

  • Inheritance: BaseObject → TextList → BaseCorpus → corpus subclasses
  • Factories: Text(id) and Corpus(id) return cached objects
  • Lazy loading: Metadata loaded on first access via load_metadata(). Text metadata hydrated lazily on first attribute access.
  • Path resolution: corpus.path_* attributes resolved via __getattr__ → get_path()
  • Manifest: Corpora registered in manifest.txt (configparser). Multiple manifest files merged from package dir, ~/lltk_data/, and user config.
  • Metadata enrichment: Override load_metadata() → call super() → transform DataFrame → return
  • Cross-corpus links: LINKS dict + merge_linked_metadata() for joins, t.linked() for traversal
  • Parquet caching: Metadata CSVs cached as .parquet for 5-10x faster subsequent reads
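The mtime-based cache check behind the parquet bullet looks roughly like this (LLTK caches as .parquet; this sketch uses pickle so it runs without a parquet engine installed):

```python
import pickle
from pathlib import Path
import pandas as pd

def read_metadata_cached(csv_path: Path) -> pd.DataFrame:
    """Reuse a binary cache of metadata.csv unless the CSV is newer."""
    cache = csv_path.with_suffix('.pkl')
    if cache.exists() and cache.stat().st_mtime >= csv_path.stat().st_mtime:
        with open(cache, 'rb') as f:
            return pickle.load(f)          # fast path: cached copy
    df = pd.read_csv(csv_path)             # slow path: parse CSV, refresh cache
    with open(cache, 'wb') as f:
        pickle.dump(df, f)
    return df
```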

CLI reference

lltk show                                # list available corpora
lltk install <corpus> [--parts ...]      # download corpus data
lltk compile <corpus>                    # compile corpus from raw sources
lltk preprocess <corpus> --parts txt     # XML→TXT conversion
lltk preprocess <corpus> --parts freqs   # TXT→word frequencies

lltk db-rebuild [corpus ...]             # rebuild DuckDB metadata store
lltk db-info                             # genre × corpus crosstab
lltk db-match [--fuzzy]                  # cross-corpus dedup matching
lltk db-enrich-genres                    # propagate genre from bibliographies
lltk db-wordcounts [-j N]               # compute word counts from freqs
lltk db-matches "title"                  # search matches by title
lltk db-match-stats                      # matching statistics

lltk annotate <corpus> [--port N]        # launch annotation web app

Development

Running tests

pip install pytest
python -m pytest tests/ -v
python -m pytest tests/ --cov=lltk --cov-report=term   # with coverage

199 tests using the test_fixture corpus (Blake, Austen, Shelley) checked into the repo — no external data needed.

Adding a new corpus

  1. Create lltk/corpus/my_corpus/my_corpus.py:
from lltk.imports import *

class TextMyCorpus(BaseText):
    pass

class MyCorpus(BaseCorpus):
    TEXT_CLASS = TextMyCorpus

    def load_metadata(self):
        meta = super().load_metadata()
        # add/transform columns here
        return meta
  2. Register in lltk/corpus/manifest.txt:
[MyCorpus]
id = my_corpus
name = MyCorpus
desc = Description of the corpus
path_python = my_corpus/my_corpus.py
class_name = MyCorpus
  3. Place data at ~/lltk_data/corpora/my_corpus/:
    • metadata.csv — with id column + any metadata columns
    • txt/ — text files as <text_id>.txt
    • freqs/ — (optional) precomputed word frequencies as JSON
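The optional freqs/ files can be precomputed from the txt/ directory with a short script — the regex tokenizer here is a simple stand-in for LLTK's own preprocessing:

```python
import json
import re
from collections import Counter
from pathlib import Path

def precompute_freqs(corpus_dir: Path):
    """Write freqs/<text_id>.json word-frequency files for each txt/<text_id>.txt."""
    txt_dir, freqs_dir = corpus_dir / 'txt', corpus_dir / 'freqs'
    freqs_dir.mkdir(exist_ok=True)
    for path in txt_dir.glob('*.txt'):
        tokens = re.findall(r"[a-z]+", path.read_text().lower())
        counts = Counter(tokens)
        (freqs_dir / f"{path.stem}.json").write_text(json.dumps(counts))
```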

Available corpora

LLTK has built-in support for the following corpora. Some (🌞) are freely downloadable from the links below or through the LLTK interface. Others (☂) require first obtaining the raw data through an institutional or other subscription. Some corpora are mixed, with some parts open for fair research use (e.g. metadata, freqs) and others closed (e.g. txt, xml, raw).

name desc license metadata freqs txt xml raw
ARTFL American and French Research on the Treasury of the French Language Academic ☂️ ☂️
BPO British Periodicals Online Commercial ☂️ ☂️
CLMET Corpus of Late Modern English Texts Academic 🌞 🌞 ☂️ ☂️
COCA Corpus of Contemporary American English Commercial ☂️ ☂️ ☂️ ☂️
COHA Corpus of Historical American English Commercial ☂️ ☂️ ☂️ ☂️
Chadwyck Chadwyck-Healey Fiction Collections Mixed 🌞 🌞 ☂️ ☂️ ☂️
ChadwyckDrama Chadwyck-Healey Drama Collections Mixed ☂️ ☂️ ☂️ ☂️ ☂️
ChadwyckPoetry Chadwyck-Healey Poetry Collections Mixed ☂️ ☂️ ☂️ ☂️ ☂️
Chicago U of Chicago Corpus of C20 Novels Academic 🌞 🌞 ☂️
DTA Deutsches Text Archiv Free 🌞 🌞 🌞 🌞 🌞
DialNarr Dialogue and Narration separated in Chadwyck-Healey Novels Academic 🌞 🌞 ☂️
EarlyPrint EarlyPrint Project: EEBO/ECCO/Evans TCP with linguistic tagging Free 🌞 🌞 🌞 🌞
ECCO Eighteenth Century Collections Online Commercial ☂️ ☂️ ☂️ ☂️ ☂️
ECCO_TCP ECCO (Text Creation Partnership) Free 🌞 🌞 🌞 🌞 🌞
EEBO_TCP Early English Books Online (curated by the Text Creation Partnership) Free 🌞 🌞 🌞 🌞
ESTC English Short Title Catalogue (481K bibliographic records, metadata-only) Academic ☂️
EnglishDialogues A Corpus of English Dialogues, 1560-1760 Academic 🌞 🌞 🌞
EvansTCP Early American Fiction Free 🌞 🌞 🌞 🌞 🌞
GaleAmericanFiction Gale American Fiction, 1774-1920 Academic 🌞 🌞 ☂️ ☂️
GildedAge U.S. Fiction of the Gilded Age Academic 🌞 🌞 🌞
HathiBio Biographies from Hathi Trust Academic 🌞 🌞
HathiEngLit Fiction, drama, verse word frequencies from Hathi Trust Academic 🌞 🌞
HathiEssays Hathi Trust volumes with "essay(s)" in title Academic 🌞 🌞
HathiLetters Hathi Trust volumes with "letter(s)" in title Academic 🌞 🌞
HathiNovels Hathi Trust volumes with "novel(s)" in title Academic 🌞 🌞
HathiProclamations Hathi Trust volumes with "proclamation(s)" in title Academic 🌞 🌞
HathiSermons Hathi Trust volumes with "sermon(s)" in title Academic 🌞 🌞
HathiStories Hathi Trust volumes with "story/stories" in title Academic 🌞 🌞
HathiTales Hathi Trust volumes with "tale(s)" in title Academic 🌞 🌞
HathiTreatises Hathi Trust volumes with "treatise(s)" in title Academic 🌞 🌞
InternetArchive 19th Century Novels, curated by the U of Illinois and hosted on the Internet Archive Free 🌞 🌞 🌞
LitLab Literary Lab Corpus of 18th and 19th Century Novels Academic 🌞 🌞 ☂️
MarkMark Mark Algee-Hewitt's and Mark McGurl's 20th Century Corpus Academic 🌞 🌞 ☂️
OldBailey Old Bailey Online Free 🌞 🌞 🌞 🌞
RavenGarside Raven & Garside's Bibliography of English Novels, 1770-1830 Academic ☂️
SOTU State of the Union Addresses Free 🌞 🌞 🌞
Sellers 19th Century Texts compiled by Jordan Sellers Free 🌞 🌞 🌞
SemanticCohort Corpus used in "Semantic Cohort Method" (2012) Free 🌞
Spectator The Spectator (1711-1714) Free 🌞 🌞 🌞
TedJDH Corpus used in "Emergence of Literary Diction" (2012) Free 🌞 🌞 🌞
TxtLab A multilingual dataset of 450 novels Free 🌞 🌞 🌞 🌞
