Literary Language Toolkit (LLTK): corpora, models, and tools for the digital humanities
A Python package for computational literary analysis and digital humanities research. Provides 50+ literary corpora, text processing tools, and analysis methods including word frequencies, document-term matrices, cross-corpus linking, deduplication, and a centralized DuckDB metadata store for querying 2M+ texts across all corpora.
Package: lltk-dh on PyPI | License: MIT | Python: >=3.8
Install
pip install -U lltk-dh
# or latest from source:
pip install -U git+https://github.com/quadrismegistus/lltk
Quick start
import lltk
# List available corpora
lltk.show()
# Load a corpus
c = lltk.load('ecco_tcp')
# Metadata as a pandas DataFrame
c.meta
c.meta.query('1770 < year < 1830')
# Iterate texts
for t in c.texts():
    print(t.id, t.author, t.title, t.year)
    print(t.txt[:200])
    print(t.freqs())  # word frequencies (Counter)
# Corpus-level analysis
mfw = c.mfw(n=10000) # top 10K words across corpus
dtm = c.dtm(n=10000) # document-term matrix (DataFrame)
dtm = c.dtm(n=10000, tfidf=True) # TF-IDF weighted
Installing corpus data
Corpora live at ~/lltk_data/corpora/<corpus_id>/. Each has: metadata.csv, txt/, and optionally xml/, freqs/. Some corpora are freely downloadable; others require institutional access.
# Download a corpus (metadata + freqs ≈ 150 MB for ecco_tcp)
lltk install ecco_tcp --parts metadata,freqs
# Full texts (adds ~600 MB)
lltk install ecco_tcp --parts txt
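Given the default data root above, the expected on-disk layout for a corpus can be sketched with the standard library (the corpus id and paths here are illustrative, not a check the package itself performs):

```python
from pathlib import Path

def corpus_layout(corpus_id, root="~/lltk_data/corpora"):
    """Return the expected data paths for a corpus, per the layout above."""
    base = Path(root).expanduser() / corpus_id
    return {
        "metadata": base / "metadata.csv",  # required: id column + metadata
        "txt": base / "txt",                # plain-text files, one per text
        "xml": base / "xml",                # optional XML sources
        "freqs": base / "freqs",            # optional precomputed frequencies
    }

layout = corpus_layout("ecco_tcp")
print(layout["metadata"].name)  # metadata.csv
```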
Texts
for t in c.texts():
    t.id       # text identifier
    t.author   # metadata attributes
    t.title
    t.year
    t.txt      # plain text as string
    t.xml      # XML source (if available)
    t.freqs()  # word frequencies (Counter)

# Direct access by ID
t = c.text('some_text_id')
Sections
Texts can be split into structural sections (chapters, letters, etc.) from XML, or into paragraphs and fixed-length passages:
# Chapters from XML (auto-detects <div>, <chapter>, <letter>, etc.)
for ch in t.chapters.texts():
    print(ch.get('title'), ch.txt[:100])

# Paragraphs (split on blank lines)
for p in t.paragraphs.texts():
    print(p.id, len(p.txt))

# Passages of ~500 words (respects sentence boundaries)
for p in t.passages(n=500).texts():
    print(p.id, p.get('num_words'))
    print(p.freqs())
Sections are TextSection objects inside a SectionCorpus — they support all the same methods as regular texts (txt, freqs(), meta, etc.).
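The passage splitter can be approximated with a stdlib-only sketch: greedily pack sentences into chunks of roughly `n` words. This is an illustration of the idea, not the package's actual implementation; the regex-based sentence split is an assumption.

```python
import re

def split_passages(text, n=500):
    """Greedily pack sentences into passages of roughly n words each."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    passages, current, count = [], [], 0
    for sent in sentences:
        words = len(sent.split())
        if current and count + words > n:
            passages.append(' '.join(current))  # flush before exceeding n
            current, count = [], 0
        current.append(sent)
        count += words
    if current:
        passages.append(' '.join(current))
    return passages

text = "Call me Ishmael. " * 6
print(len(split_passages(text, n=8)))  # 3
```

Because whole sentences are kept together, passage lengths vary around `n` rather than matching it exactly.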
Corpus-level analysis
Most frequent words
mfw = c.mfw(n=10000) # top 10K words across corpus (list)
Document-term matrix
dtm = c.dtm(n=10000) # raw counts (DataFrame)
dtm = c.dtm(n=10000, tf=True) # term frequencies
dtm = c.dtm(n=10000, tfidf=True) # TF-IDF weighted
Returns a pandas DataFrame: rows = text IDs, columns = words.
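A matrix of this shape can be built directly from per-text word frequencies; the toy sketch below shows the raw-count and term-frequency variants with pandas (illustrative data, not the package's own implementation):

```python
from collections import Counter
import pandas as pd

# Toy per-text word frequencies keyed by text ID (illustrative data)
freqs = {
    't1': Counter({'the': 4, 'sea': 2}),
    't2': Counter({'the': 3, 'sky': 1}),
}

# Rows = text IDs, columns = words, values = raw counts
dtm = pd.DataFrame.from_dict(freqs, orient='index').fillna(0).astype(int)

# Term frequencies: normalize each row by its total word count
tf = dtm.div(dtm.sum(axis=1), axis=0)
print(dtm.loc['t1', 'the'])  # 4
```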
Duplicate detection
Find near-duplicate texts within a corpus using cosine similarity on TF-IDF word frequency vectors. Works even on corpora with only precomputed freqs/ (no raw text needed).
dupes = c.find_duplicates(
    n=5000,          # number of MFW features
    threshold=0.85,  # minimum cosine similarity
    k=10,            # max neighbors per text
)
# Returns DataFrame: id_1, id_2, similarity (sorted descending)
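The core comparison is cosine similarity between sparse word-count vectors. The library weights vectors by TF-IDF first; the stdlib sketch below shows only the cosine step, on raw counts, with toy data:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

x = Counter({'whale': 3, 'sea': 2})
y = Counter({'whale': 3, 'sea': 2, 'ship': 1})
print(round(cosine(x, x), 2))  # 1.0
```

Identical vectors score 1.0, so a threshold like 0.85 catches near-duplicates whose wording diverges slightly (different editions, OCR noise).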
Metadata
Loading metadata
c.meta returns a pandas DataFrame loaded from the corpus's metadata.csv. Corpus subclasses can override load_metadata() to enrich columns without altering the CSV:
c = lltk.load('estc')
c.meta # includes enriched columns: format_std, num_pages, is_fiction, etc.
Custom metadata loading
Override load_metadata() in a corpus subclass:
class MyCorpus(BaseCorpus):
    def load_metadata(self):
        meta = super().load_metadata()
        meta['genre'] = 'Fiction'
        meta['decade'] = (meta['year'] // 10) * 10
        return meta
Results are cached — subsequent calls to c.meta or c.load_metadata() return the cached DataFrame.
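The caching behaviour can be emulated with a minimal sketch (illustrative only; the package's own cache may work differently):

```python
class CachedMeta:
    def __init__(self):
        self._meta = None
        self.load_count = 0  # tracks how often the expensive load runs

    def load_metadata(self):
        # Return the cached result on every call after the first
        if self._meta is None:
            self.load_count += 1           # expensive load happens once
            self._meta = {'year': [1800]}  # stand-in for the real DataFrame
        return self._meta

    @property
    def meta(self):
        return self.load_metadata()

c = CachedMeta()
c.meta; c.meta; c.load_metadata()
print(c.load_count)  # 1
```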
Cross-corpus linking
Corpora can declare shared-ID relationships for linking texts across collections. This supports two patterns:
Metadata merging (many-to-one)
When many texts in corpus A map to one record in corpus B (e.g., many ECCO texts → one ESTC catalogue record), corpus A can merge B's metadata as prefixed columns:
c = lltk.load('ecco')
c.meta
# DataFrame includes: ESTCID, estc_author, estc_format_std, estc_is_fiction, ...
This is configured declaratively on the corpus class:
class ECCO(BaseCorpus):
    LINKS = {'estc': ('ESTCID', 'id_estc')}  # my_col → their_col

    def load_metadata(self):
        meta = super().load_metadata()
        return self.merge_linked_metadata(meta)
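Conceptually this is an ordinary left join with prefixed columns; a toy pandas sketch (the data and any column names beyond those shown above are assumptions):

```python
import pandas as pd

# Toy data: two ECCO texts pointing at one ESTC catalogue record
ecco = pd.DataFrame({'id': ['e1', 'e2'], 'ESTCID': ['T1', 'T1']})
estc = pd.DataFrame({'id_estc': ['T1'], 'author': ['Behn, Aphra']})

# Prefix the linked corpus's columns, then left-join on the shared ID
linked = estc.rename(columns={c: f'estc_{c}'
                              for c in estc.columns if c != 'id_estc'})
merged = ecco.merge(linked, left_on='ESTCID', right_on='id_estc', how='left')
print(merged['estc_author'].tolist())  # ['Behn, Aphra', 'Behn, Aphra']
```

The many-to-one direction means the same catalogue record's fields repeat across every text that links to it.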
Text traversal (one-to-many)
When one record maps to many texts in another corpus (e.g., one ESTC record → multiple ECCO editions), use t.linked():
c = lltk.load('estc')
t = c.text('T089174')
# Find all ECCO texts for this ESTC record
ecco_texts = t.linked('ecco')
for et in ecco_texts:
print(et.id, et.title, et.year)
# Find all EEBO texts
eebo_texts = t.linked('eebo_tcp')
Currently linked corpora
| Source | Target | Link column | Direction |
|---|---|---|---|
| ECCO | ESTC | ESTCID → id_estc | metadata merge |
| EEBO_TCP | ESTC | id_stc → id_estc | metadata merge |
| ESTC | ECCO | id_estc → ESTCID | text traversal |
| ESTC | EEBO_TCP | id_estc → id_stc | text traversal |
MetaDB (centralized DuckDB metadata store)
lltk.db is a DuckDB-backed metadata cache that indexes all corpora into a single queryable store. It enables fast single-row lookups, cross-corpus queries, title/author matching for deduplication, genre enrichment from bibliography corpora, and virtual corpus construction.
Building the database
lltk db-rebuild # ingest all corpora (~4 min)
lltk db-rebuild estc ecco # re-ingest specific corpora
lltk db-info # genre × corpus crosstab
lltk db-match # cross-corpus dedup matching (~5 min)
lltk db-enrich-genres # propagate genre from bibliographies
lltk db-wordcounts # compute word counts from freqs
Querying
import lltk
# Single-row lookup
lltk.db.get('_estc/T012345')
# SQL queries on the texts table
lltk.db.query("SELECT * FROM texts WHERE year < 1700 AND genre = 'Fiction'")
lltk.db.query("SELECT corpus, COUNT(*) as n FROM texts GROUP BY corpus")
# Iterate text objects with filters + dedup
for t in lltk.db.texts(genre='Fiction', year_min=1600, year_max=1800, dedup=True):
print(t.corpus.id, t.title, t.year)
print(t.freqs()) # resolves through source corpus
# As DataFrame (no text objects)
df = lltk.db.texts_df(genre='Fiction', dedup=True)
# As a corpus object (supports .mfw, .dtm, .meta)
fiction = lltk.db.corpus(genre='Fiction', dedup=True)
Text objects returned by lltk.db.texts() keep their original corpus reference, so t.txt, t.freqs(), and file paths all resolve through the source corpus.
Cross-corpus matching
Matching finds duplicate/reprint texts across corpora via multiple tiers: exact title+author, containment (short title within long title), and optional fuzzy matching (Jaro-Winkler). Connected components are grouped and ranked by corpus source preference.
lltk.db.match() # exact + containment matching
lltk.db.find_matches('Incognita') # search match groups by title
lltk.db.get_group('_estc/T012345') # all texts in same match group
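The connected-components grouping can be sketched with a small union-find over match pairs (the IDs below are illustrative, and this is not the package's own code):

```python
from collections import defaultdict

def match_groups(pairs):
    """Group text IDs into connected components from pairwise matches."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # union the two components

    groups = defaultdict(set)
    for x in parent:
        groups[find(x)].add(x)
    return list(groups.values())

pairs = [('ecco/1', 'estc/T1'), ('estc/T1', 'eebo/A1'), ('ecco/2', 'estc/T2')]
print(sorted(len(g) for g in match_groups(pairs)))  # [2, 3]
```

Transitivity is the point: if an ECCO text matches an ESTC record and that record matches an EEBO text, all three land in one group even though the ECCO and EEBO texts were never compared directly.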
SyntheticCorpus (virtual corpora from DB queries)
Declarative corpus class that pulls texts from multiple source corpora, deduplicated:
from lltk.corpus.synthetic import SyntheticCorpus
class BigFiction(SyntheticCorpus):
    ID = 'big_fiction'
    NAME = 'BigFiction'
    SOURCES = {
        'chadwyck':     {'genre': 'Fiction'},
        'ecco':         {'genre': 'Fiction'},
        'hathi_englit': {'genre': 'Fiction'},
    }
    DEDUP = True
    DEDUP_BY = 'oldest'

C = BigFiction()
C.meta  # DataFrame: all fiction, deduplicated

for t in C.texts():
    t.txt[:100]  # resolves through source corpus paths
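DEDUP_BY = 'oldest' keeps one text per match group. A toy sketch of that selection (the IDs, years, and function name here are illustrative assumptions):

```python
def dedup_oldest(years, groups):
    """Keep the earliest text from each match group.

    years:  dict mapping text ID -> publication year
    groups: list of match groups (lists of text IDs)
    """
    keep = [min(group, key=lambda tid: years[tid]) for group in groups]
    return sorted(keep)

years = {'chadwyck/a': 1794, 'ecco/b': 1792, 'hathi/c': 1850}
groups = [['chadwyck/a', 'ecco/b'], ['hathi/c']]
print(dedup_oldest(years, groups))  # ['ecco/b', 'hathi/c']
```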
Architecture
lltk/
├── imports.py # Constants, config, third-party imports
├── __init__.py # Package entry point
├── text/
│ ├── text.py # BaseText, TextSection, Text() factory
│ ├── textlist.py # TextList collection class
│ └── utils.py # Tokenization, XML parsing, text utilities
├── corpus/
│ ├── corpus.py # BaseCorpus, SectionCorpus, Corpus() factory
│ ├── synthetic.py # SyntheticCorpus — virtual corpora from DuckDB queries
│ ├── utils.py # load_corpus(), manifest loading, corpus discovery
│ ├── manifest.txt # Corpus registry (configparser format)
│ └── <corpus_name>/ # Per-corpus implementations (50+)
├── model/
│ ├── preprocess.py # Preprocessing (XML→TXT, TXT→freqs)
│ ├── matcher.py # Text matching/dedup
│ └── ... # word2vec, doc2vec, networks, etc.
└── tools/
├── baseobj.py # BaseObject (root class)
├── tools.py # Config, utilities, parallel processing
├── db.py # Local DB backends
├── metadb.py # DuckDB centralized metadata store (lltk.db)
└── logs.py # Logging
Key patterns
- Inheritance: BaseObject → TextList → BaseCorpus → corpus subclasses
- Factories: Text(id) and Corpus(id) return cached objects
- Lazy loading: metadata loaded on first access via load_metadata(); text metadata hydrated lazily on first attribute access
- Path resolution: corpus.path_* attributes resolved via __getattr__ → get_path()
- Manifest: corpora registered in manifest.txt (configparser format); multiple manifest files merged from the package dir, ~/lltk_data/, and user config
- Metadata enrichment: override load_metadata() → call super() → transform DataFrame → return
- Cross-corpus links: LINKS dict + merge_linked_metadata() for joins, t.linked() for traversal
- Parquet caching: metadata CSVs cached as .parquet for 5-10x faster subsequent reads
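The path-resolution pattern above can be sketched with a minimal class (illustrative, not the package's actual code):

```python
import os

class PathResolver:
    DATA_ROOT = os.path.expanduser('~/lltk_data/corpora')

    def __init__(self, corpus_id):
        self.id = corpus_id

    def get_path(self, part):
        # e.g. part == 'txt' -> ~/lltk_data/corpora/<id>/txt
        return os.path.join(self.DATA_ROOT, self.id, part)

    def __getattr__(self, name):
        # Called only for missing attributes, so path_txt, path_freqs,
        # etc. resolve dynamically without being declared anywhere
        if name.startswith('path_'):
            return self.get_path(name[len('path_'):])
        raise AttributeError(name)

c = PathResolver('ecco_tcp')
print(c.path_txt.endswith(os.path.join('ecco_tcp', 'txt')))  # True
```

Because `__getattr__` only fires for attributes Python cannot find normally, regular attributes like `id` are unaffected, and any new `path_*` name works without further code.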
CLI reference
lltk show # list available corpora
lltk install <corpus> [--parts ...] # download corpus data
lltk compile <corpus> # compile corpus from raw sources
lltk preprocess <corpus> --parts txt # XML→TXT conversion
lltk preprocess <corpus> --parts freqs # TXT→word frequencies
lltk db-rebuild [corpus ...] # rebuild DuckDB metadata store
lltk db-info # genre × corpus crosstab
lltk db-match [--fuzzy] # cross-corpus dedup matching
lltk db-enrich-genres # propagate genre from bibliographies
lltk db-wordcounts [-j N] # compute word counts from freqs
lltk db-matches "title" # search matches by title
lltk db-match-stats # matching statistics
lltk annotate <corpus> [--port N] # launch annotation web app
Development
Running tests
pip install pytest
python -m pytest tests/ -v
python -m pytest tests/ --cov=lltk --cov-report=term # with coverage
199 tests using the test_fixture corpus (Blake, Austen, Shelley) checked into the repo — no external data needed.
Adding a new corpus
- Create lltk/corpus/my_corpus/my_corpus.py:

from lltk.imports import *

class TextMyCorpus(BaseText):
    pass

class MyCorpus(BaseCorpus):
    TEXT_CLASS = TextMyCorpus

    def load_metadata(self):
        meta = super().load_metadata()
        # add/transform columns here
        return meta

- Register in lltk/corpus/manifest.txt:

[MyCorpus]
id = my_corpus
name = MyCorpus
desc = Description of the corpus
path_python = my_corpus/my_corpus.py
class_name = MyCorpus

- Place data at ~/lltk_data/corpora/my_corpus/:
  - metadata.csv: an id column plus any metadata columns
  - txt/: text files as <text_id>.txt
  - freqs/: (optional) precomputed word frequencies as JSON
Available corpora
LLTK has built-in support for the following corpora. Some (🌞) are freely downloadable from the links below or through the LLTK interface. Others (☂) require first accessing the raw data through an institutional or other subscription. Some corpora are mixed, with some parts open for fair research use (e.g. metadata, freqs) and others closed (e.g. txt, xml, raw).