
LLMs for digital humanities: structured data extraction from unstructured texts


Large Literary Models


A Python toolkit for using Large Language Models (LLMs) to produce structured, annotated data from unstructured texts. Built for digital humanities research.

What it does: You give it messy text (OCR scans, bibliographies, novel excerpts, archival documents) and a description of the structured data you want back (characters, citations, sentiments, relationships). It sends the text to an LLM, parses the response into clean structured data, caches everything so you never pay for the same query twice, and hands you back validated Python objects you can export to CSV or a pandas DataFrame.

Supported LLM providers: Claude (Anthropic), GPT (OpenAI), Gemini (Google). You don't need accounts with all three -- any one will work.


Installation

Prerequisites

You need Python 3.10 or later. To check your version, open a terminal and run:

python --version

If you see something like Python 3.10.6 or higher, you're good. If not, install a newer Python from python.org or via pyenv.

Install from PyPI

The simplest way to install:

pip install largeliterarymodels

This installs largeliterarymodels and all its dependencies: the Anthropic, OpenAI, and Google AI client libraries, pydantic for structured data extraction, and HashStash for caching.

Install with LLTK (for literary corpus analysis)

To use the built-in literary analysis tasks (genre classification, character networks, Frye mode analysis) with LLTK corpora:

pip install "largeliterarymodels[lltk]"

This adds lltk-dh, which provides 50+ literary corpora, cross-corpus matching, and DuckDB-backed metadata. See Using with LLTK below.

Install from source (for development)

If you want to modify the library itself:

git clone https://github.com/quadrismegistus/largeliterarymodels.git
cd largeliterarymodels
python -m venv .venv
source .venv/bin/activate   # On Windows: .venv\Scripts\activate
pip install -e ".[dev]"

Setup: API Keys

To use an LLM, you need an API key from at least one provider. You only need one, but having multiple lets you compare models.

Provider Get a key at Environment variable
Anthropic (Claude) console.anthropic.com ANTHROPIC_API_KEY
OpenAI (GPT) platform.openai.com OPENAI_API_KEY
Google (Gemini) aistudio.google.com GEMINI_API_KEY

Once you have a key, set it in your terminal before running any code:

# Pick whichever provider(s) you have:
export ANTHROPIC_API_KEY="sk-ant-your-key-here"
export OPENAI_API_KEY="sk-your-key-here"
export GEMINI_API_KEY="your-key-here"

To avoid typing these every time, add the export lines to your shell profile (~/.bashrc, ~/.zshrc, etc.) or create a .env file in your project directory.
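If you go the .env route, the file is just KEY=value lines, without the export keyword. Whether it is picked up automatically depends on your tooling -- you may need to load it yourself (for example with the python-dotenv package), so treat auto-loading as an assumption to verify:

```shell
# .env -- same variable names as above, one KEY=value pair per line
ANTHROPIC_API_KEY=sk-ant-your-key-here
OPENAI_API_KEY=sk-your-key-here
GEMINI_API_KEY=your-key-here
```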

To check which keys are set:

from largeliterarymodels import check_api_keys
check_api_keys(verbose=True)
  + ANTHROPIC_API_KEY
  X OPENAI_API_KEY
  + GEMINI_API_KEY

Quick Start

Basic text generation

from largeliterarymodels import LLM

# Create an LLM instance (defaults to Claude Sonnet)
llm = LLM()

# Or specify a model:
from largeliterarymodels import CLAUDE_OPUS, GPT_4O_MINI, GEMINI_FLASH
llm = LLM(GPT_4O_MINI)
llm = LLM(GEMINI_FLASH)

# Generate text
response = llm.generate("What is the plot of Pamela by Samuel Richardson?")
print(response)

The response is automatically cached. If you run the exact same call again, it returns instantly without using any API credits.

Changing default parameters

from largeliterarymodels import LLM, CLAUDE_SONNET

# Lower temperature = more deterministic output
llm = LLM(CLAUDE_SONNET, temperature=0.2)

# Set a system prompt that applies to all calls
llm = LLM(system_prompt="You are an expert in 18th-century English literature.")
response = llm.generate("Who is Pamela?")

# Override per-call
response = llm.generate(
    "Who is Pamela?",
    system_prompt="You are a children's librarian. Explain simply.",
    temperature=0.9,
)

Structured Extraction

This is the core feature. Instead of getting back free-form text, you define a schema -- a description of the exact fields you want -- and the LLM fills them in.

You define schemas using pydantic, which is a way of describing data structures in Python.

A simple example

from pydantic import BaseModel, Field
from largeliterarymodels import LLM

# Define what you want back
class Sentiment(BaseModel):
    sentiment: str = Field(description="positive, negative, or neutral")
    confidence: float = Field(description="confidence score from 0.0 to 1.0")
    explanation: str = Field(description="one-sentence explanation")

llm = LLM()
result = llm.extract(
    "It was the best of times, it was the worst of times.",
    schema=Sentiment,
)

print(result.sentiment)     # "neutral"
print(result.confidence)    # 0.75
print(result.explanation)   # "The passage juxtaposes extremes..."

The result is a validated Python object -- not a string, not raw JSON. You can access its fields with dot notation.
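"Validated" is doing real work here: pydantic type-checks and coerces each field when the LLM's JSON is parsed. A minimal illustration (the Sentiment model is re-declared so the snippet stands alone):

```python
from pydantic import BaseModel, ValidationError

class Sentiment(BaseModel):
    sentiment: str
    confidence: float
    explanation: str

# Numeric strings are coerced to the declared type...
ok = Sentiment(sentiment="neutral", confidence="0.75", explanation="Mixed imagery.")

# ...but values that can't be coerced raise instead of slipping through silently
try:
    Sentiment(sentiment="neutral", confidence="high", explanation="Mixed imagery.")
    rejected = False
except ValidationError:
    rejected = True
```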

Extracting a list of items

Use list[YourModel] when you expect multiple results:

class Character(BaseModel):
    name: str
    role: str = Field(description="role in the narrative")
    gender: str

llm = LLM()
characters = llm.extract(
    "Who are the main characters in Pride and Prejudice?",
    schema=list[Character],
    system_prompt="You are a literary scholar.",
)

for c in characters:
    print(f"{c.name} ({c.gender}): {c.role}")
Elizabeth Bennet (Female): Protagonist; witty and independent young woman
Mr. Darcy (Male): Male lead; proud wealthy gentleman
Jane Bennet (Female): Elizabeth's gentle elder sister
...

Adding context with system prompts

The system_prompt tells the LLM how to approach the task -- what expertise to assume, what conventions to follow:

result = llm.extract(
    scene_text,
    schema=BechdelResult,
    system_prompt="You are a film critic. Assess whether this scene passes the Bechdel test.",
)

Few-shot examples

Few-shot examples show the LLM exactly what you expect. Each example is a pair: (input_text, expected_output). The output can be a pydantic object or a plain dictionary.

examples = [
    # Example 1: show the LLM what good output looks like
    (
        "[INT. HOUSE]\nEMILY: What do you think about Michael?\nEMMA: He seems risky.",
        Sentiment(sentiment="negative", confidence=0.7, explanation="Apprehension about Michael."),
    ),
    # Example 2: a contrasting case
    (
        "The sun shone brightly on the meadow.",
        Sentiment(sentiment="positive", confidence=0.85, explanation="Bright, pleasant imagery."),
    ),
]

result = llm.extract(
    "The room was dark and cold.",
    schema=Sentiment,
    examples=examples,
)

Few-shot examples dramatically improve accuracy, especially for domain-specific tasks. Even one or two examples help.

Error handling

If the LLM returns malformed JSON, extract() will automatically retry (once by default). You can control this:

result = llm.extract(prompt, schema=MySchema, retries=3)  # up to 3 retries

Defining a Task

A Task bundles together everything needed for a specific extraction job: the schema, system prompt, examples, and configuration. This means you define your task once, then reuse it across many inputs and models.

from pydantic import BaseModel, Field
from largeliterarymodels import Task

class Character(BaseModel):
    name: str
    gender: str
    role: str = Field(description="role in the narrative")
    prominence: int = Field(description="1-10 prominence score")

class CharacterTask(Task):
    schema = list[Character]
    system_prompt = "You are a literary scholar. Extract all named characters from the text."
    examples = [
        (
            "Mr. Darcy danced with Elizabeth at the ball.",
            [
                Character(name="Mr. Darcy", gender="Male", role="love interest", prominence=9),
                Character(name="Elizabeth", gender="Female", role="protagonist", prominence=10),
            ],
        ),
    ]
    retries = 2

Using a Task

task = CharacterTask()

# Extract from one text
characters = task.run(chapter_text)
for c in characters:
    print(f"{c.name}: {c.role} ({c.prominence}/10)")

# Try a different model
characters_gpt = task.run(chapter_text, model="gpt-4o-mini")

# Override system prompt for one call
characters = task.run(chapter_text, system_prompt="Focus only on female characters.")

Task caching and results

Each task gets its own separate cache directory (at data/stash/<TaskClassName>/). This keeps results organized and means you can clear one task's cache without affecting others.

You can access all cached results as a DataFrame at any time:

task = CharacterTask()
task.map(chapter_texts)    # populate the cache

# Get all results as a DataFrame
df = task.df
print(df.head())

The DataFrame includes metadata columns (model, temperature, prompt) alongside all schema fields. For list[Model] schemas, each item in the list becomes its own row.

Working with Multiple Prompts

Batch generation

llm = LLM()
responses = llm.map(
    ["Summarize Chapter 1.", "Summarize Chapter 2.", "Summarize Chapter 3."],
    system_prompt="Summarize in one paragraph.",
)
# responses is a list of strings, in the same order as the prompts

Batch extraction

task = CharacterTask()
results = task.map(
    [chapter_1_text, chapter_2_text, chapter_3_text],
    model="claude-sonnet-4-6",
)
# results is a list of list[Character], one per chapter

Both map() methods run requests in parallel (4 threads by default), show a progress bar, and cache results. Re-running the same batch skips already-cached prompts:

# This will only compute the new chapters, not re-do 1-3
results = task.map(
    [chapter_1_text, chapter_2_text, chapter_3_text, chapter_4_text],
)

Exporting to CSV

Since extraction results are pydantic objects, converting to a pandas DataFrame is straightforward:

import pandas as pd

# If results is a list of lists (from task.map), flatten first
flat = [entry for chunk in results for entry in chunk]

df = pd.DataFrame([entry.model_dump() for entry in flat])
df.to_csv("characters.csv", index=False)
print(df.head())

Or use the task's built-in DataFrame (see Task caching and results above).

Caching

All LLM calls are automatically cached using HashStash. The cache key is the combination of:

  • prompt (the input text)
  • model (which LLM you used)
  • system_prompt
  • temperature
  • max_tokens
  • schema name (for extract() calls)

This means:

  • Same prompt + same model = cached (instant, free)
  • Same prompt + different model = separate cache entry (lets you compare models)
  • Same prompt + different system_prompt = separate cache entry
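For intuition, a deterministic key over those parameters could be built like this (a sketch only -- HashStash's real key derivation may differ):

```python
import hashlib
import json

def cache_key(prompt, model, system_prompt=None, temperature=1.0,
              max_tokens=None, schema_name=None):
    # Serialize with sorted keys so identical parameters always hash identically
    payload = json.dumps({
        "prompt": prompt, "model": model, "system_prompt": system_prompt,
        "temperature": temperature, "max_tokens": max_tokens,
        "schema_name": schema_name,
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

k1 = cache_key("Who is Pamela?", "claude-sonnet-4-6")
k2 = cache_key("Who is Pamela?", "gpt-4o-mini")
# Different model => different key => separate cache entry
```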

The cache is stored in data/stash/ inside the repository. To force a fresh generation that bypasses the cache:

response = llm.generate("What is the plot of Pamela?", force=True)

Provider-side prompt caching

On top of local HashStash caching, largeliterarymodels also turns on provider-side prompt caching where the API supports it. For long Task system prompts (which include few-shot examples), this cuts input cost ~10x on repeat calls within a single batch — Anthropic's default 5-minute cache window covers typical task.map() workloads.

  • Anthropic: enabled automatically. providers.call_anthropic marks the system field with cache_control: {type: "ephemeral"} on every call. Cache hits bill at ~10% of input rate.
  • OpenAI: automatic on the API side; no client-side change needed. Hits above 1024 tokens bill at ~50% of input rate.
  • Gemini: explicit context caching exists (client.caches.create) but the minimum of ~32K tokens makes it irrelevant for Task-sized system prompts. Not implemented.

Caching thresholds are model-specific — and higher than the docs claim. Empirically verified (April 2026):

Model Docs say Reality (caches at ≥)
Claude Sonnet 4.6 1024 ~2048 tokens
Claude Haiku 4.5 1024 ~6000 tokens
Claude Opus 4.7 1024 ~2048 (assumed by analogy)
GPT-4o family 1024 1024 (automatic)

Below the threshold, the cache_control marker is silently ignored — no error, just no cache. This matters most for cost-optimized Haiku batches: a Task whose system prompt fits in 3–4K tokens will cache cleanly on Sonnet but not on Haiku, and 27K calls at Haiku-uncached pricing is ~$70, not the $5-10 a naive Haiku estimate suggests.
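The arithmetic behind that estimate, with illustrative numbers (the prompt size and per-million-token price here are assumptions for the sketch; check current provider pricing):

```python
calls = 27_000
system_tokens = 3_500    # assumed per-call system prompt size, in tokens
price_per_mtok = 0.75    # assumed Haiku input price, $/Mtok -- check current rates

total_mtok = calls * system_tokens / 1e6      # 94.5 Mtok of repeated input
uncached = total_mtok * price_per_mtok        # prompt below Haiku's ~6K threshold
cached = uncached * 0.10                      # cache hits bill at ~10% of input rate

print(f"uncached: ${uncached:.0f}, cached: ${cached:.0f}")
# uncached: $71, cached: $7
```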

Auditing a Task's system prompt size:

import tiktoken
from largeliterarymodels.llm import _build_extract_prompt
from largeliterarymodels.tasks import TranslationTask

t = TranslationTask()
full_system, _ = _build_extract_prompt(
    prompt="", schema=t.schema,
    system_prompt=t.system_prompt, examples=t.examples,
)
n = len(tiktoken.get_encoding("cl100k_base").encode(full_system))
# tiktoken undercounts ~18% vs Anthropic's tokenizer.
# Safe caching margins:
#   Sonnet/Opus: n ≥ 2400 tiktoken (~2800 Anthropic)
#   Haiku:       n ≥ 5500 tiktoken (~6500 Anthropic)

Of the built-in tasks, as of April 2026: GenreTask, FryeTask, PassageTask, CharacterIntroTask, and TranslationTask cache on Sonnet/Opus. Only PassageTask currently crosses the Haiku threshold. If you plan a high-volume Haiku run with a different task, consider expanding its examples list with genuinely useful additions until it crosses 6K tokens.

Using local models (Ollama, vLLM, LM Studio)

Any OpenAI-compatible local inference server works by using one of the local prefixes:

from largeliterarymodels import LLM
llm = LLM(model="local/llama3.3")  # or "ollama/mistral", "vllm/qwen2.5", "lmstudio/..."
response = llm.generate("Translate 'freedom' to German.")

By default, largeliterarymodels points at Ollama's usual endpoint (http://localhost:11434/v1). Override by setting LOCAL_BASE_URL in the environment (e.g. a different port, a remote vLLM host, or LM Studio on a LAN box).

export LOCAL_BASE_URL="http://localhost:8000/v1"   # vLLM default
export LOCAL_BASE_URL="http://192.168.1.50:11434/v1"  # remote Ollama

The local provider reuses the OpenAI SDK, so any inference server that speaks the OpenAI chat-completions protocol (Ollama, vLLM, LM Studio, llama.cpp server, SGLang, Together's self-host mode) will work without further config.

Thinking / reasoning mode. Ollama's reasoning models (Qwen 3, DeepSeek R1, Gemma 4) expose a think: true|false toggle on their native /api/chat endpoint, with chain-of-thought returned in a separate message.thinking field. This library's call_local goes through the OpenAI-compatible /v1/chat/completions endpoint, which does not accept the think parameter — so by default, reasoning models run in their no-think mode when invoked via LLM(model="ollama/qwen3:14b"). In practice, empirical testing on TranslationTask and GenreTask shows no-think mode is both faster (10–20x) and produces more reliably-parsed JSON than thinking mode; thinking responses under a tight num_predict budget silently return empty content. If you specifically need chain-of-thought output from a local model, POST directly to http://localhost:11434/api/chat with {"think": true, "options": {"num_predict": 8192}} rather than using the LLM class.
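That direct POST needs only the standard library. A sketch (it assumes a local Ollama server at the default port; build_think_payload and ollama_chat_thinking are illustrative helpers, not part of this library):

```python
import json
import urllib.request

def build_think_payload(prompt, model="qwen3:14b"):
    """Native /api/chat payload; 'think' is ignored by the OpenAI-style /v1 endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "think": True,
        "options": {"num_predict": 8192},  # generous budget so thinking can finish
        "stream": False,
    }

def ollama_chat_thinking(prompt, model="qwen3:14b",
                         base_url="http://localhost:11434"):
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(build_think_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        msg = json.loads(resp.read())["message"]
    # Ollama returns chain-of-thought in message.thinking, the answer in message.content
    return msg.get("thinking"), msg.get("content")
```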

Quality caveat. Open-weight models are meaningfully below API-tier Claude and GPT on the workloads this library is built for:

  • Structured JSON compliance: small open models produce malformed JSON more often; retries compound at batch scale.
  • Specialist literary knowledge: GenreTask/FryeTask/PassageTask rely on recognition of specific early-modern works and authors. Open models under 70B rarely have the coverage.
  • Multilingual nuance: Llama 3.3 70B's German/French translation quality is noticeably below Haiku 4.5 on TranslationTask-shaped prompts.

Treat local models as complements rather than drop-in replacements:

  • Validation / redundancy passes: back-translation checks, cross-model agreement tests — the different model family is an asset.
  • Development and iteration: draft prompts, debug schemas, run regression tests without burning API credits.
  • New tasks where quality tolerance is high: initial exploration, rough-cut classification at scale.

Don't swap a validated Haiku/Sonnet pipeline to a local model as a pure cost optimization — the quality drop will usually show up in the output.

Example: Bibliography Extraction

The library ships with a ready-made task for parsing messy OCR bibliography entries into structured data. This is a real-world example of the kind of work largeliterarymodels is designed for.

from largeliterarymodels.tasks import BibliographyTask, chunk_bibliography
import pandas as pd

# Load the task
task = BibliographyTask()

# Load and chunk your HTML file
with open("data/bibliography.html") as f:
    raw_html = f.read()

# Split into chunks (by year heading, max 20 entries each)
chunks = chunk_bibliography(raw_html, max_entries=20)

# Extract from all chunks (parallel, cached)
all_entries = task.map(chunks)

# Flatten and export
flat = [entry for chunk_entries in all_entries for entry in chunk_entries]
df = pd.DataFrame([e.model_dump() for e in flat])
df.to_csv("data/bibliography.csv", index=False)

# Or use the built-in DataFrame
df = task.df

The BibliographyEntry schema extracts fields including: author, title, subtitle, year, edition, bibliographic ID, translation status, translator, printer, publisher, bookseller, and notes. See largeliterarymodels/tasks/extract_bibliography.py for the full schema and few-shot examples.

Comparing models

from largeliterarymodels import CLAUDE_SONNET, GPT_4O_MINI, GEMINI_FLASH

for model in [CLAUDE_SONNET, GPT_4O_MINI, GEMINI_FLASH]:
    entries = task.run(chunks[0], model=model)
    print(f"{model}: {len(entries)} entries extracted")

Starting Your Own Project

largeliterarymodels is a general-purpose toolkit. For your specific research project, we recommend creating a separate repository that depends on it:

my-bibliography-project/
    task.py                 # your Task subclass with custom schema/examples
    data/
        source.html         # your input data
        output.csv          # your results
    notebooks/
        extract.ipynb       # your working notebook

Your task.py defines only what's specific to your project:

from pydantic import BaseModel, Field
from largeliterarymodels import Task

class MyEntry(BaseModel):
    # your custom fields here
    ...

class MyBibliographyTask(Task):
    schema = list[MyEntry]
    system_prompt = "Your domain-specific instructions..."
    examples = [...]

Install largeliterarymodels in your project's environment:

pip install largeliterarymodels

This way your project-specific decisions (field names, few-shot examples, OCR quirks) live in their own tracked repository, separate from the general-purpose toolkit.

Using with LLTK

The library includes tasks designed for literary analysis with LLTK (Literary Language Toolkit), which provides 50+ literary corpora, cross-corpus deduplication, and DuckDB-backed metadata. Install together with pip install "largeliterarymodels[lltk]", or install LLTK separately with pip install lltk-dh.

Genre classification

Classify texts by genre from title/author metadata:

from largeliterarymodels.tasks import GenreTask, format_text_for_classification

task = GenreTask()
prompt = format_text_for_classification(title="Pamela", author_norm="richardson", year=1740)
result = task.run(prompt)
print(result.genre, result.genre_raw, result.confidence)
# Fiction Novel, Epistolary fiction 1.0

Character resolution (BookNLP cleanup)

BookNLP's NER is noisy on early modern texts. This task merges fragmented character clusters and filters noise:

import lltk
from largeliterarymodels.tasks import CharacterTask, format_character_roster

t = lltk.load('chadwyck').text('Eighteenth-Century_Fiction/fieldinh.06')  # Tom Jones
t.booknlp.parse()  # run BookNLP first

task = CharacterTask()
prompt = format_character_roster(t, max_chars=30)
results = task.run(prompt)  # returns list[CharacterResolution]
for r in results:
    if r.type == 'character':
        print(f"{r.name}: {r.ids}")
# Tom Jones: ['C822', 'C625', 'C491']
# Sophia Western: ['C821', 'C888', 'C4113']

Or use the LLTK wrapper directly:

t.booknlp.resolve_characters()   # runs CharacterTask, saves JSON
t.booknlp.plot_network()         # co-mention network visualization

Available tasks

Task Input Output
GenreTask Title/author metadata Genre, subgenre, translation status
FryeTask Text passages (opening/middle/closing) Frye mode, mythos, referential mode
PassageTask ~1K-word passages Scene type, narration mode, allegorical regime
CharacterTask BookNLP character roster Merged/cleaned character list
CharacterIntroTask Character first-mention passages Introduction mode, social class, interiority
BibliographyTask OCR bibliography pages Structured bibliography entries

Model Constants

For convenience, common model names are available as constants:

from largeliterarymodels import (
    CLAUDE_OPUS,    # claude-opus-4-6
    CLAUDE_SONNET,  # claude-sonnet-4-6
    CLAUDE_HAIKU,   # claude-haiku-4-5-20251001
    GPT_4O,         # gpt-4o
    GPT_4O_MINI,    # gpt-4o-mini
    GEMINI_PRO,     # gemini-2.5-pro
    GEMINI_FLASH,   # gemini-2.5-flash
)

llm = LLM(CLAUDE_OPUS)

Project Structure

largeliterarymodels/
    __init__.py              # Exports: LLM, Task, model constants, check_api_keys
    llm.py                   # Core LLM class: generate, extract, map, extract_map
    task.py                  # Task class: reusable extraction task definition
    providers.py             # Direct API calls to Anthropic, OpenAI, Google
    utils.py                 # Utility functions
    tasks/
        extract_bibliography.py  # Built-in bibliography extraction task
tests/                       # Test suite (run with: pytest)

License

MIT
