LLMs for digital humanities: structured data extraction from unstructured texts
Project description
Large Literary Models
A Python toolkit for using Large Language Models (LLMs) to produce structured, annotated data from unstructured texts. Built for digital humanities research.
What it does: You give it messy text (OCR scans, bibliographies, novel excerpts, archival documents) and a description of the structured data you want back (characters, citations, sentiments, relationships). It sends the text to an LLM, parses the response into clean structured data, caches everything so you never pay for the same query twice, and hands you back validated Python objects you can export to CSV or a pandas DataFrame.
Supported LLM providers: Claude (Anthropic), GPT (OpenAI), Gemini (Google), and any OpenAI-compatible local server (vLLM, LM Studio, Ollama, llama.cpp).
Table of Contents
- Installation
- Setup: API Keys
- Quick Start
- Structured Extraction
- Defining a Task
- Working with Multiple Prompts
- Sequential Tasks
- Caching
- Using local models (vLLM, LM Studio, Ollama)
- CLI: litmod
- Running at Scale
- Relationship to LLTK
- Example: Bibliography Extraction
- Starting Your Own Project
- Available Tasks
- Project Structure
Installation
Prerequisites
You need Python 3.10 or later.
Install from PyPI
pip install largeliterarymodels
This installs largeliterarymodels and all its dependencies: the Anthropic, OpenAI, and Google AI client libraries, pydantic for structured data extraction, and HashStash for caching.
Install from source (for development)
git clone https://github.com/quadrismegistus/largeliterarymodels.git
cd largeliterarymodels
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
Setup: API Keys
To use an LLM, you need an API key from at least one provider. You only need one, but having multiple lets you compare models.
| Provider | Get a key at | Environment variable |
|---|---|---|
| Anthropic (Claude) | console.anthropic.com | ANTHROPIC_API_KEY |
| OpenAI (GPT) | platform.openai.com | OPENAI_API_KEY |
| Google (Gemini) | aistudio.google.com | GEMINI_API_KEY |
Set them in your shell:
export ANTHROPIC_API_KEY="sk-ant-your-key-here"
export OPENAI_API_KEY="sk-your-key-here"
export GEMINI_API_KEY="your-key-here"
For local models (vLLM, LM Studio, Ollama), no API key is needed.
Quick Start
Basic text generation
from largeliterarymodels import LLM
llm = LLM() # defaults to Claude Sonnet
response = llm.generate("What is the plot of Pamela by Samuel Richardson?")
print(response)
The response is automatically cached. Running the exact same call again returns instantly without using API credits.
Structured extraction
This is the core feature. Define a schema using pydantic, and the LLM fills it in:
from pydantic import BaseModel, Field
from largeliterarymodels import LLM
class Sentiment(BaseModel):
sentiment: str = Field(description="positive, negative, or neutral")
confidence: float = Field(description="confidence score from 0.0 to 1.0")
explanation: str = Field(description="one-sentence explanation")
llm = LLM()
result = llm.extract(
"It was the best of times, it was the worst of times.",
schema=Sentiment,
)
print(result.sentiment) # "neutral"
print(result.confidence) # 0.75
print(result.explanation) # "The passage juxtaposes extremes..."
The result is a validated Python object with dot-notation access, not a string or raw JSON.
Extracting a list of items
Use list[YourModel] when you expect multiple results:
class Character(BaseModel):
name: str
role: str = Field(description="role in the narrative")
gender: str
characters = llm.extract(
"Who are the main characters in Pride and Prejudice?",
schema=list[Character],
)
for c in characters:
print(f"{c.name} ({c.gender}): {c.role}")
Defining a Task
A Task bundles together everything needed for a specific extraction job: the schema, system prompt, examples, and configuration.
from pydantic import BaseModel, Field
from largeliterarymodels import Task
class Character(BaseModel):
name: str
gender: str
role: str = Field(description="role in the narrative")
prominence: int = Field(description="1-10 prominence score")
class CharacterTask(Task):
schema = list[Character]
system_prompt = "You are a literary scholar. Extract all named characters."
examples = [
(
"Mr. Darcy danced with Elizabeth at the ball.",
[
Character(name="Mr. Darcy", gender="Male", role="love interest", prominence=9),
Character(name="Elizabeth", gender="Female", role="protagonist", prominence=10),
],
),
]
retries = 2
task = CharacterTask()
characters = task.run(chapter_text)
characters_gpt = task.run(chapter_text, model="gpt-4o-mini")
Working with Multiple Prompts
task = CharacterTask()
results = task.map(
[chapter_1_text, chapter_2_text, chapter_3_text],
model="claude-sonnet-4-6",
)
map() runs requests in parallel (4 threads by default), shows a progress bar, and caches results. Re-running the same batch skips already-cached prompts.
Sequential Tasks
For long texts that need to be processed in chunks with rolling context (e.g., extracting a social network from a 300-page novel), use SequentialTask:
from largeliterarymodels.tasks import SocialNetworkTask
task = SocialNetworkTask(model="vllm/qwen3.6-27b")
# Pass a list of passage strings
result = task.run(passages, cache_key="my_text_id")
# Or pass a .txt file path (auto-chunked)
result = task.run("novel.txt", cache_key="novel")
SequentialTask.run() processes passages in chunks, feeding forward a rolling state (e.g., the character roster so far) to maintain consistency across chunks. Results are cached per-chunk, so interrupted runs resume where they left off.
Key parameters:
source: list of passage strings, or path to a .txt filecache_key: stable identifier for caching (e.g., a text ID)save: path to save the aggregated JSON result, orFalseto skip
Caching
All LLM calls are automatically cached using HashStash. The cache key includes the prompt, model, system prompt, temperature, and schema. Same inputs = instant cached result.
Cache is stored in data/stash/ inside the repository. To force a fresh generation:
response = llm.generate("What is the plot of Pamela?", force=True)
Provider-side prompt caching (Anthropic, OpenAI) is enabled automatically for repeat calls within a batch, cutting input costs ~10x on long system prompts.
Using local models (vLLM, LM Studio, Ollama)
Any OpenAI-compatible local inference server works:
from largeliterarymodels import LLM
llm = LLM(model="lmstudio/qwen3.5-35b-a3b")
llm = LLM(model="vllm/qwen3.6-27b")
llm = LLM(model="ollama/mistral")
By default, local models connect to http://localhost:11434/v1 (Ollama's default). Override with:
export LOCAL_BASE_URL="http://localhost:8000/v1" # vLLM
export LOCAL_BASE_URL="http://localhost:1234/v1" # LM Studio
CLI: litmod
The package includes a CLI tool:
litmod ls # list available tasks
litmod show GenreTask # show task schema + fixtures
litmod smoke GenreTask --model sonnet # test on built-in fixtures
litmod run GenreTask --input data/manifest.csv --model sonnet
litmod annotate GenreTask --port 8989 # human annotation web app
Cloud GPU management (Vast.ai)
For running sequential tasks at scale on rented GPUs:
litmod cloud launch # find + rent cheapest A100 80GB (~$0.85/hr)
litmod cloud setup # install vLLM + largeliterarymodels over SSH
litmod cloud upload passages_c19 # rsync passage files to instance
litmod cloud run passages_c19 # start vLLM + batch in tmux (survives disconnects)
litmod cloud status # check progress, running cost, tail log
litmod cloud download # rsync results back locally
litmod cloud stop # destroy instance (stops all billing)
litmod cloud ssh # interactive shell access
State is persisted in .vastai.json so everything is resumable across disconnects. Requires a Vast.ai account and API key (pip install vastai && vastai set api-key YOUR_KEY).
Running at Scale
For processing hundreds or thousands of texts, the workflow splits across two environments:
On GPU (Colab, Vast.ai, or HPC)
-
Export passages locally (where your database is):
python scripts/hpc/export_passages.py --subcollection Nineteenth-Century_Fiction --out data/passages_c19
-
Upload and run on the GPU:
# Vast.ai litmod cloud upload passages_c19 litmod cloud run passages_c19 # Or Colab: upload JSONL files, then: python scripts/batch_social_network.py --text-dir passages/ --output-dir results/ --model vllm-qwen36 --workers 4
-
Download results and ingest locally:
litmod cloud download lltk ingest-tasks social_network data/cloud_results/passages_c19/
The batch script handles resume-from-failure (skips texts with existing output), parallel workers, and sharding across multiple processes.
Passage export format
Exported files are JSONL with an _id metadata header:
{"_id": "_chadwyck/Nineteenth-Century_Fiction/ncf0101.01", "_n_passages": 298}
{"seq": 0, "text": "CHAPTER I. In which the reader...", "n_words": 487}
{"seq": 1, "text": "The morning was bright and clear...", "n_words": 512}
The _id in the header is authoritative for result placement -- filenames are slugified and not used for identity.
Relationship to LLTK
This package is designed to work with LLTK (Literary Language Toolkit) but does not depend on it. The division of labor:
| largeliterarymodels | lltk | |
|---|---|---|
| Role | Pure extraction library | Corpus management + orchestration |
| Knows about | Schemas, LLMs, providers, caching | Corpora, passages, metadata, ClickHouse |
| Input | str or list[str] |
Text IDs, database queries |
| Output | Pydantic models / dicts | Annotations, task paths, scalar features |
lltk imports largeliterarymodels (not the reverse). This means largeliterarymodels works anywhere -- laptops, Colab, HPC, cloud GPUs -- without needing a database connection.
When used together, lltk orchestrates the pipeline:
import lltk
# lltk resolves passages and calls largeliterarymodels tasks
lltk.annotate.run_task('genre', ids=['_estc/T068056'], model='gemini-2.5-flash')
lltk.annotate.run_task('social_network', ids=['_chadwyck/.../haywood.02'], model='vllm/qwen3.6-27b')
# Results stored in lltk's annotation system
# Full JSON blobs go to lltk.task_path()
# Scalar features go to lltk.annotations
Install together: pip install largeliterarymodels lltk-dh
Example: Bibliography Extraction
from largeliterarymodels.tasks import BibliographyTask, chunk_bibliography
task = BibliographyTask()
with open("data/bibliography.html") as f:
raw_html = f.read()
chunks = chunk_bibliography(raw_html, max_entries=20)
all_entries = task.map(chunks)
flat = [entry for chunk_entries in all_entries for entry in chunk_entries]
df = pd.DataFrame([e.model_dump() for e in flat])
df.to_csv("data/bibliography.csv", index=False)
Starting Your Own Project
Create a separate repository that depends on this package:
from pydantic import BaseModel, Field
from largeliterarymodels import Task
class MyEntry(BaseModel):
# your custom fields
...
class MyTask(Task):
schema = list[MyEntry]
system_prompt = "Your domain-specific instructions..."
examples = [...]
pip install largeliterarymodels
Available Tasks
| Task | Type | Input | Output |
|---|---|---|---|
GenreTask |
Base | Title/author metadata | Genre, subgenre, translation status, confidence |
GenreTaskLite |
Base | Title/author metadata | Constrained genre tags (form + mode) |
FryeTask |
Base | Text passages | Frye mode, mythos, referential mode |
PassageContentTask |
Sequential | Passage list | 43 binary content flags per passage |
PassageFormTask |
Sequential | Passage list | Formal/stylistic features per passage |
SocialNetworkTask |
Sequential | Passage list | Characters, relations, events, dialogue, summaries |
CharacterTask |
Base | BookNLP character roster | Merged/cleaned character list |
CharacterIntroTask |
Base | Character first-mention passages | Introduction mode, social class |
TranslationTask |
Base | Word in context | Historical translation + connotations |
BibliographyTask |
Base | OCR bibliography pages | Structured bibliography entries |
Base tasks process a single prompt and return a Pydantic model. Sequential tasks process a list of passages in chunks with rolling state and return an aggregated dict.
Project Structure
largeliterarymodels/
__init__.py # Exports: LLM, Task, model constants
llm.py # Core LLM class: generate, extract, map
task.py # Task + SequentialTask base classes
providers.py # Direct API calls to Anthropic, OpenAI, Google, local
tasks/ # Built-in task definitions (lazy-loaded)
analysis/ # Cross-task analysis: Fisher tests, ensembles, social networks
cli/ # litmod CLI: ls, show, smoke, run, annotate, cloud
integrations/ # ClickHouse adapter (being migrated to lltk)
annotate.py # FastAPI human-annotation web app
scripts/
batch_social_network.py # Batch runner for SocialNetworkTask (Colab/HPC/cloud)
analyze_social_networks.py # Network statistics across parsed texts
hpc/ # Passage export, SLURM scripts, Colab notebooks
cloud/ # Vast.ai standalone entry point
tests/ # Test suite (pytest)
License
MIT
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file largeliterarymodels-0.6.0.tar.gz.
File metadata
- Download URL: largeliterarymodels-0.6.0.tar.gz
- Upload date:
- Size: 163.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
31ec289740edc34a0e8d193d16eb72f4f38ab50068b55404e00e9c4780a3e3ef
|
|
| MD5 |
e683ad05b8f650518529889d8adf867e
|
|
| BLAKE2b-256 |
efd8b97b21f83639d97db19f38610201d9a844c8351f61109c19a72acee22d43
|
File details
Details for the file largeliterarymodels-0.6.0-py3-none-any.whl.
File metadata
- Download URL: largeliterarymodels-0.6.0-py3-none-any.whl
- Upload date:
- Size: 173.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
20e3efae7fc83df74680806adafaf584fac2ad8789767e4f28e3edb3c86b0857
|
|
| MD5 |
f05949c0b69fb04762e96ffe8e8d543b
|
|
| BLAKE2b-256 |
22cc90dc6354e7e7e3cd99ddce6530e341f4780f3935b1cd92ecac5cc0389764
|