Literary Language Toolkit (LLTK): corpora, models, and tools for the digital humanities
A Python package for computational literary analysis and digital humanities research. It provides 70+ literary corpora (English, French, German, Spanish), a ClickHouse analytical database for querying 2.8M+ texts across all sources, cross-corpus deduplication, multilingual passage search with embeddings, automated genre classification, language detection, and metrical scansion.
Package: lltk-dh on PyPI | License: MIT | Python: >=3.8
Install
pip install -U lltk-dh
# or latest from source:
pip install -U git+https://github.com/quadrismegistus/lltk
Optional extras:
pip install "lltk-dh[embeddings]" # sentence-transformers + torch for semantic search
pip install "lltk-dh[analysis]" # scipy for statistical analysis
Quick start
import lltk
# List available corpora
lltk.show()
# Load a corpus
c = lltk.load('ecco_tcp')
# Metadata as a pandas DataFrame
c.meta
c.meta.query('1770 < year < 1830')
# Iterate texts
for t in c.texts():
    print(t.id, t.author, t.title, t.year)
    print(t.text_plain()[:200])
    print(t.freqs()) # word frequencies (Counter)
# Corpus-level analysis
mfw = c.mfw(n=10000) # top 10K words across corpus
dtm = c.dtm(n=10000) # document-term matrix (DataFrame)
dtm = c.dtm(n=10000, tfidf=True) # TF-IDF weighted
Installing corpus data
Corpora live at ~/lltk_data/corpora/<corpus_id>/. Each has: metadata.csv, txt/, and optionally xml/, freqs/. Some corpora are freely downloadable; others require institutional access.
# Download a corpus (metadata + freqs)
lltk install ecco_tcp --parts metadata,freqs
# Full texts
lltk install ecco_tcp --parts txt
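To check which parts of a corpus are already on disk, the directory layout described above can be inspected directly. This is a minimal sketch, not part of the LLTK API: the helper names `corpus_layout` and `installed_parts` are hypothetical, and the only thing taken from the README is the `~/lltk_data/corpora/<corpus_id>/` layout.

```python
from pathlib import Path


def corpus_layout(corpus_id, root=None):
    """Return the expected paths for one corpus (hypothetical helper, not LLTK API)."""
    root = Path(root) if root is not None else Path.home() / "lltk_data" / "corpora"
    base = root / corpus_id
    return {
        "metadata": base / "metadata.csv",
        "txt": base / "txt",
        "xml": base / "xml",      # optional
        "freqs": base / "freqs",  # optional
    }


def installed_parts(corpus_id, root=None):
    """List the parts of a corpus that actually exist on disk, sorted by name."""
    layout = corpus_layout(corpus_id, root)
    return sorted(name for name, path in layout.items() if path.exists())
```

For example, `installed_parts('ecco_tcp')` would list only `metadata` and `freqs` after the first install command above, assuming the installer writes to the default location.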
The centralized database
The core of LLTK is a ClickHouse analytical database (lltk.db) that indexes all corpora into a single queryable store. It enables sub-second queries across 2.8M texts, cross-corpus deduplication, genre enrichment from bibliography corpora, language detection, and virtual corpus construction.
Building the database
lltk db-rebuild # ingest all corpus CSVs -> lltk.texts
lltk db-freqs # ingest per-text word frequencies
lltk db-text-words # build flat word index for analytics
lltk db-match # cross-corpus dedup matching (~2 min)
lltk db-enrich-genres # propagate genre from bibliographies
lltk db-detect-langs # per-text language detection
lltk db-detect-translations # flag translations via match groups
lltk db-info # genre x corpus crosstab
Querying
import lltk
# Single-row lookup
lltk.db.get('_estc/T012345')
# SQL queries on the texts table
lltk.db.query("SELECT * FROM texts WHERE year < 1700 AND genre = 'Fiction'")
lltk.db.query("SELECT corpus, COUNT(*) as n FROM texts GROUP BY corpus")
# Iterate text objects with filters + dedup
for t in lltk.db.texts(genre='Fiction', year_min=1600, year_max=1800, dedup=True):
    print(t.corpus.id, t.title, t.year)
    print(t.freqs()) # resolves through source corpus
# As DataFrame
df = lltk.db.texts_df(genre='Fiction', dedup=True)
# Ngram frequencies (with dedup and genre filtering)
lltk.db.ngram(['virtue', 'honor'], genre='Fiction', dedup=True)
Text objects returned by lltk.db.texts() keep their original corpus reference, so t.text_plain(), t.freqs(), and file paths all resolve through the source corpus.
Cross-corpus matching
Matching finds duplicate and reprint texts across corpora via multiple tiers:
| Tier | Method | Description |
|---|---|---|
| 0 | id_link | Shared IDs from declared cross-corpus links |
| 1a | exact_norm | Normalized title + author |
| 1b | exact_norm_year | Normalized title + year (authorless texts) |
| 2a | containment | Short title within long title, same author |
| 2b | containment_year | Same, by year |
| 3 | fuzzy_title | Jaro-Winkler > 0.85 (opt-in with --fuzzy) |
Connected components are grouped and ranked by corpus source preference. Normalization includes MorphAdorner spelling modernization (358K entries for early modern English).
lltk.db.match() # exact + containment matching
lltk.db.find_matches('Incognita') # search match groups by title
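The tiered approach can be illustrated with a small, self-contained sketch. It covers only tier 1a (normalized title + author) and tier 2a (short title contained in a long title, same author), then groups the pairs into connected components with union-find. The `normalize` and `match_groups` functions, the record fields, and the sample records are assumptions made for illustration; LLTK's own normalization (including MorphAdorner spelling modernization) and its source ranking are not reproduced here.

```python
import re
from collections import defaultdict


def normalize(s):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    s = re.sub(r"[^a-z0-9 ]+", " ", s.lower())
    return " ".join(s.split())


def match_groups(records):
    """Group record IDs whose titles match exactly or by containment, per author."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    by_author = defaultdict(list)
    for r in records:
        by_author[normalize(r["author"])].append(r)

    for recs in by_author.values():
        for i, a in enumerate(recs):
            ta = normalize(a["title"])
            for b in recs[i + 1:]:
                tb = normalize(b["title"])
                short, long_ = sorted((ta, tb), key=len)
                # Tier 1a: identical titles; tier 2a: whole-word containment.
                if ta == tb or (short and f" {short} " in f" {long_} "):
                    union(a["id"], b["id"])

    groups = defaultdict(list)
    for r in records:
        groups[find(r["id"])].append(r["id"])
    return sorted(sorted(g) for g in groups.values())
```

Naive containment over-matches very short titles (a one-word title such as "Poems" would be found inside many longer ones), so a real implementation would need additional guards.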
Full-text and semantic search
LLTK splits texts into ~500-word passages and indexes them for search:
lltk db-passages # build passage chunks
lltk search "virtue AND honor" # full-text search (FTS5)
lltk search "NEAR(virtue honor, 5)" # proximity search
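As a rough illustration of fixed-length passage chunking, the sketch below splits on whitespace and groups every 500 tokens. The function name `split_passages` is hypothetical, and whether LLTK also respects sentence or paragraph boundaries is not stated in this README, so treat this as the simplest possible version of the idea.

```python
def split_passages(text, size=500):
    """Split text into consecutive passages of at most `size` whitespace tokens."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
```

The last passage is shorter whenever the word count is not a multiple of `size`.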
# Full-text search with filters
results = lltk.db.search('virtue', genre='Fiction', year_min=1700, year_max=1800)
# Semantic search (requires embeddings extra)
results = lltk.db.search_semantic('concept of honor in battle')
Passage embeddings use intfloat/multilingual-e5-large and support cross-lingual queries.
lltk db-embed-passages # compute embeddings (GPU recommended)
lltk db-match-embeddings # find duplicates via embedding similarity
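Semantic search over embeddings usually comes down to ranking passages by cosine similarity to the query vector. The sketch below shows that ranking step on plain Python lists; the function names and the toy vectors are illustrative assumptions, and it does not call the embedding model or reflect how LLTK stores or queries vectors in ClickHouse.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank_passages(query_vec, passage_vecs, top_k=3):
    """Return the top_k (passage_id, score) pairs, best match first."""
    scored = [(cosine(query_vec, vec), pid) for pid, vec in passage_vecs.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))  # highest score first, ties by ID
    return [(pid, score) for score, pid in scored[:top_k]]
```

When embedding text yourself with an E5-family model, note that these models are generally trained with "query: " and "passage: " prefixes on their inputs; whether LLTK adds them for you is not documented here.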
Annotations
A priority-based annotation system for storing and resolving metadata across multiple sources (human labels, bibliographies, LLM predictions):
from lltk.tools import annotations as A
# Write annotations
A.write(source='llm:gemini-2.5-pro', rows=[
    {'_id': '_estc/T068056', 'field': 'genre', 'value': 'Fiction', 'confidence': 0.95}
])
# Resolve: highest-priority source wins per (text, field)
A.resolve(ids=['_estc/T068056'], fields=['genre'])
# Find disagreements between sources
A.disagreements('genre', min_sources=2)
Source priorities: human (100) > bibliography (90) > authority corpus (70) > heuristic (50) > LLM (10).
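The resolution rule can be sketched in a few lines. This is an illustration under stated assumptions, not the `annotations` module itself: the source-type keys (`human`, `bibliography`, `authority`, `heuristic`, `llm`), the `type:name` source naming, the confidence tie-break, and the sample rows are all made up for the example; only the priority numbers come from the list above.

```python
# Priority per source type; the key names are assumptions for this sketch.
PRIORITY = {"human": 100, "bibliography": 90, "authority": 70, "heuristic": 50, "llm": 10}


def source_priority(source):
    """Map a source such as 'llm:gemini-2.5-pro' to its priority (0 if unknown)."""
    return PRIORITY.get(source.split(":", 1)[0], 0)


def resolve(rows):
    """Pick one value per (text, field): highest priority wins, then confidence."""
    best = {}
    for row in rows:
        key = (row["_id"], row["field"])
        rank = (source_priority(row["source"]), row.get("confidence", 0.0))
        if key not in best or rank > best[key][0]:
            best[key] = (rank, row["value"])
    return {key: value for key, (rank, value) in best.items()}
```

With this rule, a bibliography entry overrides an LLM prediction for the same text and field even when the LLM reports higher confidence.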
Texts
c = lltk.load('ecco_tcp')
for t in c.texts():
    t.id # text identifier
    t.author # metadata attributes
    t.title
    t.year
    t.text_plain() # plain text as string
    t.xml # XML source (if available)
    t.freqs() # word frequencies (Counter)
# Direct access by ID
t = c.text('some_text_id')
Sections
Texts can be split into structural sections (chapters, letters, etc.) from XML, or into paragraphs and fixed-length passages:
for ch in t.chapters.texts():
    print(ch.get('title'), ch.text_plain()[:100])
for p in t.paragraphs.texts():
    print(p.id, len(p.text_plain()))
Prosodic analysis
Optional integration with prosodic (>=3.1) for metrical scansion:
lltk prosodic-parse ecco_tcp # parse a corpus
lltk prosodic-aggregate ecco_tcp # build prosodic.parquet
t.prosodic(cached=True) # per-text scansion data
Corpus-level analysis
Document-term matrix
dtm = c.dtm(n=10000) # raw counts (DataFrame)
dtm = c.dtm(n=10000, tf=True) # term frequencies
dtm = c.dtm(n=10000, tfidf=True) # TF-IDF weighted
Returns a pandas DataFrame: rows = text IDs, columns = words.
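For readers unfamiliar with the three weightings, here is a small sketch of how raw counts, term frequencies, and TF-IDF relate. It uses plain dictionaries rather than pandas, the function names are hypothetical, and the TF-IDF formula is the textbook `tf * log(N / df)`; the exact weighting LLTK applies is not stated in this README and may differ.

```python
import math
from collections import Counter


def build_dtm(freqs_by_text, n=10000):
    """Raw counts for the n most frequent words across all texts."""
    total = Counter()
    for freqs in freqs_by_text.values():
        total.update(freqs)
    vocab = [word for word, _ in total.most_common(n)]
    return {tid: {w: freqs.get(w, 0) for w in vocab} for tid, freqs in freqs_by_text.items()}


def to_tf(dtm):
    """Divide each count by the row total (over the kept vocabulary only)."""
    out = {}
    for tid, row in dtm.items():
        length = sum(row.values())
        out[tid] = {w: (c / length if length else 0.0) for w, c in row.items()}
    return out


def to_tfidf(dtm):
    """Weight term frequencies by log(N / document frequency)."""
    n_docs = len(dtm)
    df = Counter(w for row in dtm.values() for w, c in row.items() if c > 0)
    return {
        tid: {w: (v * math.log(n_docs / df[w]) if df[w] else 0.0) for w, v in row.items()}
        for tid, row in to_tf(dtm).items()
    }
```

A word that appears in every text gets weight zero under this formula, which is why TF-IDF tends to surface words distinctive to particular texts.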
Virtual corpora (CuratedCorpus)
Declarative corpus classes that pull texts from multiple sources with filters and deduplication:
from lltk.corpus.arc_corpora.arc_corpora import ArcFiction
c = lltk.load('arc_fiction')
c.meta # all English fiction, deduplicated across 10+ source corpora
Built-in curated corpora include ArcFiction, ArcPoetry, ArcFictionFr, ArcFictionDe, ArcBiography, ArcEssays, ArcSermons, and ArcPeriodical.
Define your own:
from lltk.corpus.arc_corpora.arc_corpora import CuratedCorpus
class MyFiction(CuratedCorpus):
    ID = 'my_fiction'
    NAME = 'MyFiction'
    SOURCES = {
        'chadwyck': {'genre': 'Fiction'},
        'ecco_tcp': {'genre': 'Fiction'},
        'hathi_englit': {'genre': 'Fiction', 'year_max': 1900},
    }
    DEDUP = True
    DEDUP_BY = 'oldest'
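One way to read a declaration like this is as a per-source filter followed by one pick per match group. The sketch below spells that reading out on plain dictionaries. It is an interpretation, not the CuratedCorpus implementation: the function name, the record fields (`corpus`, `genre`, `year`, `match_group`), and the assumption that 'oldest' means "keep the earliest-dated member of each match group" are all hypothetical.

```python
def build_virtual_corpus(records, sources, dedup=True, dedup_by="oldest"):
    """Filter records per source corpus, then keep one record per match group."""
    selected = []
    for r in records:
        flt = sources.get(r["corpus"])
        if flt is None:
            continue  # corpus not listed in SOURCES
        if "genre" in flt and r["genre"] != flt["genre"]:
            continue
        if "year_min" in flt and r["year"] < flt["year_min"]:
            continue
        if "year_max" in flt and r["year"] > flt["year_max"]:
            continue
        selected.append(r)

    if not dedup:
        return selected
    if dedup_by != "oldest":
        raise ValueError("this sketch only implements dedup_by='oldest'")

    best = {}
    for r in selected:
        # Records without a match group are their own group.
        key = r.get("match_group") or r["id"]
        if key not in best or r["year"] < best[key]["year"]:
            best[key] = r
    return list(best.values())
```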
CLI reference
Corpus management:
lltk show                               list available corpora
lltk status                             check install status of all corpora
lltk info <corpus>                      corpus details
lltk install <corpus> [--parts ...]     download corpus data
lltk compile <corpus>                   compile corpus from raw sources
lltk preprocess <corpus> --parts ...    XML->TXT, TXT->freqs
Database (ClickHouse):
lltk db-rebuild [corpus ...]            ingest corpus CSVs -> lltk.texts
lltk db-freqs [corpus ...]              ingest per-text freqs JSONs
lltk db-text-words [corpus ...]         build flat word index
lltk db-wordindex [--vocab-size N]      build aggregation tables
lltk db-info                            genre x corpus crosstab
Matching & dedup:
lltk db-match [--fuzzy]                 cross-corpus dedup matching
lltk db-matches "title"                 search match groups
lltk db-match-stats                     matching statistics
lltk db-match-embeddings                embedding-based matching
Genre & language:
lltk db-enrich-genres                   propagate genre from bibliographies
lltk db-tag-genres                      materialize genre tags from annotations
lltk db-detect-langs                    per-text language detection
lltk db-detect-translations             flag translations via match groups
Search & embeddings:
lltk search "query" [--genre ...]       full-text passage search
lltk db-passages [corpus ...]           build passage chunks
lltk db-embed-passages [corpus ...]     compute passage embeddings
Prosodic:
lltk prosodic-parse <corpus>            metrical scansion
lltk prosodic-aggregate <corpus>        build prosodic.parquet
Web:
lltk app [--port N]                     launch explorer web app
lltk annotate <corpus> [--port N]       launch annotation interface
Architecture
lltk/
+-- cli.py # CLI entry point
+-- text/
| +-- text.py # BaseText, TextSection, Text() factory
| +-- textlist.py # TextList collection class
+-- corpus/
| +-- corpus.py # BaseCorpus, SectionCorpus, Corpus() factory
| +-- synthetic.py # SyntheticCorpus -- virtual corpora from DB queries
| +-- arc_corpora/ # CuratedCorpus subclasses (ArcFiction, etc.)
| +-- manifest.txt # Corpus registry (71 entries)
| +-- <corpus_name>/ # Per-corpus implementations
+-- tools/
| +-- metadb_ch.py # MetaDBCH -- ClickHouse-backed lltk.db singleton
| +-- annotations.py # Priority-based annotation system
| +-- genre_tags.py # Genre tag materialization
| +-- clickhouse_*.py # CH build/query modules (match, rebuild, embeddings, ...)
| +-- prosodic_tools.py # Prosodic integration
| +-- db_adapter.py # Database adapter abstraction
+-- web/
    +-- app.py # Explorer web app (FastAPI + Svelte)
    +-- annotate.py # Annotation interface
Key patterns:
- Inheritance: BaseObject -> TextList -> BaseCorpus -> corpus subclasses
- Factories: Text(id) and Corpus(id) return cached objects
- Lazy hydration: Text metadata loaded from CH on first attribute access, CSV fallback
- Path resolution: corpus.path_* attributes resolved via __getattr__ -> get_path()
- Manifest: Corpora registered in manifest.txt (configparser); merged from package dir + ~/lltk_data/ + user config
- Parquet caching: Metadata CSVs cached as .parquet for 5-10x faster subsequent reads
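Two of these patterns, cached factories and lazy hydration, can be shown together in a toy example. The names `LazyText`, `cached_text`, and the `loader` callback are hypothetical and exist only for this sketch; LLTK's actual `Text()` factory and `BaseText` class have their own signatures and loading logic.

```python
_TEXT_CACHE = {}


class LazyText:
    """Toy text object whose metadata is fetched on first attribute access."""

    def __init__(self, id, loader):
        self.id = id
        self._loader = loader  # callable: id -> dict of metadata
        self._meta = None      # not loaded yet

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        if self._meta is None:
            self._meta = self._loader(self.id)  # hydrate once, then reuse
        try:
            return self._meta[name]
        except KeyError:
            raise AttributeError(name) from None


def cached_text(id, loader):
    """Factory that returns the same object for the same ID."""
    if id not in _TEXT_CACHE:
        _TEXT_CACHE[id] = LazyText(id, loader)
    return _TEXT_CACHE[id]
```

Creating the object costs nothing; the loader runs only when an attribute such as `title` is first read, and later reads reuse the loaded dictionary.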
Development
Running tests
pip install pytest
python -m pytest tests/ -v
python -m pytest tests/ --cov=lltk --cov-report=term
374 tests using the test_fixture corpus (Blake, Austen, Shelley) checked into the repo -- no external data needed.
Adding a new corpus
- Create lltk/corpus/my_corpus/my_corpus.py:
from lltk.imports import *

class TextMyCorpus(BaseText):
    pass

class MyCorpus(BaseCorpus):
    TEXT_CLASS = TextMyCorpus

    def load_metadata(self):
        meta = super().load_metadata()
        # add/transform columns here
        return meta
- Register in lltk/corpus/manifest.txt:
[MyCorpus]
id = my_corpus
name = MyCorpus
desc = Description of the corpus
path_python = my_corpus/my_corpus.py
class_name = MyCorpus
- Place data at ~/lltk_data/corpora/my_corpus/:
  - metadata.csv with id column + any metadata columns
  - txt/ text files as <text_id>.txt
  - freqs/ (optional) precomputed word frequencies as JSON
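Because the manifest uses configparser syntax, a new entry can be sanity-checked with the standard library before LLTK ever reads it. The sketch below parses the example entry from the steps above; the `read_manifest` and `merge_manifests` helpers are hypothetical, and the "later files override earlier ones" merge order is an assumption about how the package, data directory, and user manifests combine.

```python
import configparser

MANIFEST = """
[MyCorpus]
id = my_corpus
name = MyCorpus
desc = Description of the corpus
path_python = my_corpus/my_corpus.py
class_name = MyCorpus
"""


def read_manifest(text):
    """Parse manifest text into {corpus_id: {option: value}}."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {parser[s]["id"]: dict(parser[s]) for s in parser.sections()}


def merge_manifests(*texts):
    """Merge several manifests; entries in later texts override earlier ones."""
    merged = {}
    for text in texts:
        for corpus_id, entry in read_manifest(text).items():
            merged.setdefault(corpus_id, {}).update(entry)
    return merged
```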
Available corpora
71 corpora across English, French, German, and Spanish. Some are freely downloadable, others require institutional access.
English
| Corpus | Description | Period | License |
|---|---|---|---|
| EarlyPrint | EEBO/ECCO/Evans TCP with linguistic tagging (~60K texts) | 1473-1800 | Free |
| EEBO_TCP | Early English Books Online (TCP) | 1473-1700 | Free |
| ECCO_TCP | Eighteenth Century Collections Online (TCP) | 1701-1800 | Free |
| ECCO | Eighteenth Century Collections Online (full) | 1701-1800 | Commercial |
| ESTC | English Short Title Catalogue (481K bib. records) | 1473-1800 | Academic |
| Chadwyck | Chadwyck-Healey Fiction, Drama, Poetry | 1500-1900 | Mixed |
| HathiEngLit | Hathi Trust fiction, drama, verse | 1700-1900 | Academic |
| InternetArchive | 19th Century Novels (U of Illinois) | 1800-1900 | Free |
| GaleAmericanFiction | Gale American Fiction | 1774-1920 | Academic |
| OldBailey | Old Bailey trial proceedings | 1674-1913 | Free |
| CLMET | Corpus of Late Modern English Texts | 1710-1920 | Academic |
| COCA | Corpus of Contemporary American English | 1990-2019 | Commercial |
| COHA | Corpus of Historical American English | 1820-2019 | Commercial |
| Spectator | The Spectator (1711-1714) | 1711-1714 | Free |
| SOTU | State of the Union Addresses | 1790-2017 | Free |
Plus: BPO, Chicago, DialNarr, EnglishDialogues, EvansTCP, GildedAge, LitLab, MarkMark, Sellers, SemanticCohort, TedJDH, and genre-specific Hathi subcorpora (Bio, Essays, Letters, Novels, Sermons, Stories, Tales, Treatises, Proclamations, Almanacs, Romances).
Bibliography & reference
| Corpus | Description |
|---|---|
| FictionBiblio | 6,862 entries from 6 fiction bibliographies (1475-1799) |
| RavenGarside | Bibliography of English Novels, 1770-1830 |
| END | Early Novels Database: 2,002 MARCXML records (1660-1830) |
French
| Corpus | Description | Size | License |
|---|---|---|---|
| ARTFL | Treasury of the French Language | 3.6K | Academic |
| FrenchPDBooks | French public domain books | 290K | Free |
| Gallica | Gallica literary fictions | 15.5K | Free |
| PAIGE | French fiction corpus | 3.2K | Academic |
German
| Corpus | Description | Size | License |
|---|---|---|---|
| DTA | Deutsches Text Archiv | 3.3K | Free (CC BY-SA) |
| GermanPD | German public domain texts | 275K | Free |
| GermanFiction | Curated German literary fiction (1600-1799) | 140 | Academic |
| DeCorp | German fiction corpus | ~5K | Academic |
Multilingual & other
| Corpus | Description |
|---|---|
| TxtLab | 450 novels in English, French, and German |
| SpanishPDBooks | Spanish public domain books |
| ImpactES | Spanish historical texts |
Curated virtual corpora
These combine and deduplicate texts from multiple source corpora:
| Corpus | Description |
|---|---|
| arc_fiction | English fiction across all sources, deduplicated |
| arc_poetry | English poetry across all sources |
| arc_fiction_fr | French fiction across all sources |
| arc_fiction_de | German fiction across all sources |
| arc_biography | English biography |
| arc_essays | English essays |
| arc_sermons | English sermons |
| arc_periodical | English periodicals |