
LLMs for digital humanities: structured data extraction from unstructured texts


Large Literary Models


A Python toolkit for using Large Language Models (LLMs) to produce structured, annotated data from unstructured texts. Built for digital humanities research.

What it does: You give it messy text (OCR scans, bibliographies, novel excerpts, archival documents) and a description of the structured data you want back (characters, citations, sentiments, relationships). It sends the text to an LLM, parses the response into clean structured data, caches everything so you never pay for the same query twice, and hands you back validated Python objects you can export to CSV or a pandas DataFrame.

Supported LLM providers: Claude (Anthropic), GPT (OpenAI), Gemini (Google). You don't need accounts with all three -- any one of them will work.

Installation

Prerequisites

You need Python 3.10 or later. To check your version, open a terminal and run:

python --version

If you see something like Python 3.10.6 or higher, you're good. If not, install a newer Python from python.org or via pyenv.
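If you'd rather check from inside Python (say, at the top of a notebook), the standard library exposes the interpreter version as a comparable tuple:

```python
import sys

# sys.version_info is a named tuple (major, minor, micro, ...) that
# supports ordinary tuple comparison.
if sys.version_info >= (3, 10):
    print("Python version OK:", sys.version.split()[0])
else:
    print("Python too old -- 3.10+ required")
```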

Install from PyPI

The simplest way to install:

pip install largeliterarymodels

This installs largeliterarymodels and all its dependencies (the Anthropic, OpenAI, and Google AI client libraries, plus HashStash for caching).

To also install pydantic for structured data extraction (recommended):

pip install "largeliterarymodels[pydantic]"

Install from source (for development)

If you want to modify the library itself:

git clone https://github.com/quadrismegistus/largeliterarymodels.git
cd largeliterarymodels
python -m venv .venv
source .venv/bin/activate   # On Windows: .venv\Scripts\activate
pip install -e ".[dev]"

Setup: API Keys

To use an LLM, you need an API key from at least one provider. You only need one, but having multiple lets you compare models.

Provider Get a key at Environment variable
Anthropic (Claude) console.anthropic.com ANTHROPIC_API_KEY
OpenAI (GPT) platform.openai.com OPENAI_API_KEY
Google (Gemini) aistudio.google.com GEMINI_API_KEY

Once you have a key, set it in your terminal before running any code:

# Pick whichever provider(s) you have:
export ANTHROPIC_API_KEY="sk-ant-your-key-here"
export OPENAI_API_KEY="sk-your-key-here"
export GEMINI_API_KEY="your-key-here"

To avoid typing these every time, add the export lines to your shell profile (~/.bashrc, ~/.zshrc, etc.) or create a .env file in your project directory.

To check which keys are set:

from largeliterarymodels import check_api_keys
check_api_keys(verbose=True)
  + ANTHROPIC_API_KEY
  X OPENAI_API_KEY
  + GEMINI_API_KEY

Quick Start

Basic text generation

from largeliterarymodels import LLM

# Create an LLM instance (defaults to Claude Sonnet)
llm = LLM()

# Or specify a model:
from largeliterarymodels import CLAUDE_OPUS, GPT_4O_MINI, GEMINI_FLASH
llm = LLM(GPT_4O_MINI)
llm = LLM(GEMINI_FLASH)

# Generate text
response = llm.generate("What is the plot of Pamela by Samuel Richardson?")
print(response)

The response is automatically cached. If you run the exact same call again, it returns instantly without using any API credits.

Changing default parameters

from largeliterarymodels import LLM, CLAUDE_SONNET

# Lower temperature = more deterministic output
llm = LLM(CLAUDE_SONNET, temperature=0.2)

# Set a system prompt that applies to all calls
llm = LLM(system_prompt="You are an expert in 18th-century English literature.")
response = llm.generate("Who is Pamela?")

# Override per-call
response = llm.generate(
    "Who is Pamela?",
    system_prompt="You are a children's librarian. Explain simply.",
    temperature=0.9,
)

Structured Extraction

This is the core feature. Instead of getting back free-form text, you define a schema -- a description of the exact fields you want -- and the LLM fills them in.

You define schemas using pydantic, a Python library for defining and validating data structures.

A simple example

from pydantic import BaseModel, Field
from largeliterarymodels import LLM

# Define what you want back
class Sentiment(BaseModel):
    sentiment: str = Field(description="positive, negative, or neutral")
    confidence: float = Field(description="confidence score from 0.0 to 1.0")
    explanation: str = Field(description="one-sentence explanation")

llm = LLM()
result = llm.extract(
    "It was the best of times, it was the worst of times.",
    schema=Sentiment,
)

print(result.sentiment)     # "neutral"
print(result.confidence)    # 0.75
print(result.explanation)   # "The passage juxtaposes extremes..."

The result is a validated Python object -- not a string, not raw JSON. You can access its fields with dot notation.
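The validation itself comes from pydantic and works independently of any LLM call: field types are checked, and compatible values are coerced, when a model is built from raw data. For instance, using the Sentiment schema from above:

```python
from pydantic import BaseModel, Field

class Sentiment(BaseModel):
    sentiment: str = Field(description="positive, negative, or neutral")
    confidence: float = Field(description="confidence score from 0.0 to 1.0")
    explanation: str = Field(description="one-sentence explanation")

# A dict (e.g. parsed from an LLM's JSON reply) becomes a validated object.
# Note that the string "0.75" is coerced to the declared float type.
result = Sentiment.model_validate({
    "sentiment": "neutral",
    "confidence": "0.75",
    "explanation": "Juxtaposes extremes.",
})
print(result.confidence)  # 0.75
```

If a field is missing or can't be coerced, pydantic raises a ValidationError instead of silently handing you bad data.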

Extracting a list of items

Use list[YourModel] when you expect multiple results:

class Character(BaseModel):
    name: str
    role: str = Field(description="role in the narrative")
    gender: str

llm = LLM()
characters = llm.extract(
    "Who are the main characters in Pride and Prejudice?",
    schema=list[Character],
    system_prompt="You are a literary scholar.",
)

for c in characters:
    print(f"{c.name} ({c.gender}): {c.role}")
Elizabeth Bennet (Female): Protagonist; witty and independent young woman
Mr. Darcy (Male): Male lead; proud wealthy gentleman
Jane Bennet (Female): Elizabeth's gentle elder sister
...

Adding context with system prompts

The system_prompt tells the LLM how to approach the task -- what expertise to assume, what conventions to follow:

result = llm.extract(
    scene_text,
    schema=BechdelResult,
    system_prompt="You are a film critic. Assess whether this scene passes the Bechdel test.",
)

Few-shot examples

Few-shot examples show the LLM exactly what you expect. Each example is a pair: (input_text, expected_output). The output can be a pydantic object or a plain dictionary.

examples = [
    # Example 1: show the LLM what good output looks like
    (
        "[INT. HOUSE]\nEMILY: What do you think about Michael?\nEMMA: He seems risky.",
        Sentiment(sentiment="negative", confidence=0.7, explanation="Apprehension about Michael."),
    ),
    # Example 2: a contrasting case
    (
        "The sun shone brightly on the meadow.",
        Sentiment(sentiment="positive", confidence=0.85, explanation="Bright, pleasant imagery."),
    ),
]

result = llm.extract(
    "The room was dark and cold.",
    schema=Sentiment,
    examples=examples,
)

Few-shot examples dramatically improve accuracy, especially for domain-specific tasks. Even one or two examples help.

Error handling

If the LLM returns malformed JSON, extract() will automatically retry (once by default). You can control this:

result = llm.extract(prompt, schema=MySchema, retries=3)  # up to 3 retries
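Conceptually, the retry behavior is just a loop: call the model, try to parse, and ask again on failure. The following is a schematic of that idea -- not the library's actual implementation -- using a stand-in generate function:

```python
import json

def extract_with_retries(generate, prompt, retries=1):
    """Call generate(prompt) and parse JSON, retrying on malformed output."""
    last_error = None
    for attempt in range(retries + 1):
        raw = generate(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as e:
            last_error = e  # malformed -- try again (up to `retries` extra attempts)
    raise ValueError(f"Still malformed after {retries} retries: {last_error}")

# Stand-in "LLM" that fails once, then returns valid JSON:
attempts = []
def flaky_generate(prompt):
    attempts.append(prompt)
    return "not json" if len(attempts) == 1 else '{"sentiment": "negative"}'

print(extract_with_retries(flaky_generate, "classify this", retries=1))
# {'sentiment': 'negative'}
```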

Defining a Task

A Task bundles together everything needed for a specific extraction job: the schema, system prompt, examples, and configuration. This means you define your task once, then reuse it across many inputs and models.

from pydantic import BaseModel, Field
from largeliterarymodels import Task

class Character(BaseModel):
    name: str
    gender: str
    role: str = Field(description="role in the narrative")
    prominence: int = Field(description="1-10 prominence score")

class CharacterTask(Task):
    schema = list[Character]
    system_prompt = "You are a literary scholar. Extract all named characters from the text."
    examples = [
        (
            "Mr. Darcy danced with Elizabeth at the ball.",
            [
                Character(name="Mr. Darcy", gender="Male", role="love interest", prominence=9),
                Character(name="Elizabeth", gender="Female", role="protagonist", prominence=10),
            ],
        ),
    ]
    retries = 2

Using a Task

task = CharacterTask()

# Extract from one text
characters = task.run(chapter_text)
for c in characters:
    print(f"{c.name}: {c.role} ({c.prominence}/10)")

# Try a different model
characters_gpt = task.run(chapter_text, model="gpt-4o-mini")

# Override system prompt for one call
characters = task.run(chapter_text, system_prompt="Focus only on female characters.")

Task caching and results

Each task gets its own separate cache directory (at data/stash/<TaskClassName>/). This keeps results organized and means you can clear one task's cache without affecting others.

You can access all cached results as a DataFrame at any time:

task = CharacterTask()
task.map(chapter_texts)    # populate the cache

# Get all results as a DataFrame
df = task.df
print(df.head())

The DataFrame includes metadata columns (model, temperature, prompt) alongside all schema fields. For list[Model] schemas, each item in the list becomes its own row.

Working with Multiple Prompts

Batch generation

llm = LLM()
responses = llm.map(
    ["Summarize Chapter 1.", "Summarize Chapter 2.", "Summarize Chapter 3."],
    system_prompt="Summarize in one paragraph.",
)
# responses is a list of strings, in the same order as the prompts

Batch extraction

task = CharacterTask()
results = task.map(
    [chapter_1_text, chapter_2_text, chapter_3_text],
    model="claude-sonnet-4-6",
)
# results is a list of list[Character], one per chapter

Both map() methods run requests in parallel (4 threads by default), show a progress bar, and cache results. Re-running the same batch skips already-cached prompts:

# This will only compute the new chapters, not re-do 1-3
results = task.map(
    [chapter_1_text, chapter_2_text, chapter_3_text, chapter_4_text],
)

Exporting to CSV

Since extraction results are pydantic objects, converting to a pandas DataFrame is straightforward:

import pandas as pd

# If results is a list of lists (from task.map), flatten first
flat = [entry for chunk in results for entry in chunk]

df = pd.DataFrame([entry.model_dump() for entry in flat])
df.to_csv("characters.csv", index=False)
print(df.head())

Or use the task's built-in DataFrame (see Task caching and results above).

Caching

All LLM calls are automatically cached using HashStash. The cache key is the combination of:

  • prompt (the input text)
  • model (which LLM you used)
  • system_prompt
  • temperature
  • max_tokens
  • schema name (for extract() calls)

This means:

  • Same prompt + same model = cached (instant, free)
  • Same prompt + different model = separate cache entry (lets you compare models)
  • Same prompt + different system_prompt = separate cache entry
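One way to picture such a key -- a schematic, not HashStash's actual key format -- is a hash over the serialized call parameters, so that changing any one of them produces a different entry:

```python
import hashlib
import json

def cache_key(prompt, model, system_prompt=None, temperature=0.7,
              max_tokens=None, schema_name=None):
    """Derive a stable key from everything that affects the LLM's answer."""
    payload = json.dumps(
        {
            "prompt": prompt,
            "model": model,
            "system_prompt": system_prompt,
            "temperature": temperature,
            "max_tokens": max_tokens,
            "schema_name": schema_name,
        },
        sort_keys=True,  # stable ordering => stable hash
    )
    return hashlib.sha256(payload.encode()).hexdigest()

# Same inputs -> same key; change any parameter -> different key
k1 = cache_key("Who is Pamela?", "claude-sonnet")
k2 = cache_key("Who is Pamela?", "claude-sonnet")
k3 = cache_key("Who is Pamela?", "gpt-4o-mini")
print(k1 == k2, k1 == k3)  # True False
```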

Cache is stored in data/stash/ inside the repository. To force a fresh generation:

response = llm.generate("What is the plot of Pamela?", force=True)

Example: Bibliography Extraction

The library ships with a ready-made task for parsing messy OCR bibliography entries into structured data. This is a real-world example of the kind of work largeliterarymodels is designed for.

from largeliterarymodels.tasks import BibliographyTask, chunk_bibliography
import pandas as pd

# Load the task
task = BibliographyTask()

# Load and chunk your HTML file
with open("data/bibliography.html") as f:
    raw_html = f.read()

# Split into chunks (by year heading, max 20 entries each)
chunks = chunk_bibliography(raw_html, max_entries=20)

# Extract from all chunks (parallel, cached)
all_entries = task.map(chunks)

# Flatten and export
flat = [entry for chunk_entries in all_entries for entry in chunk_entries]
df = pd.DataFrame([e.model_dump() for e in flat])
df.to_csv("data/bibliography.csv", index=False)

# Or use the built-in DataFrame
df = task.df

The BibliographyEntry schema extracts fields including: author, title, subtitle, year, edition, bibliographic ID, translation status, translator, printer, publisher, bookseller, and notes. See largeliterarymodels/tasks/extract_bibliography.py for the full schema and few-shot examples.
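chunk_bibliography is specific to this HTML format, but the max_entries idea is generic: once you have a list of entries, capping chunk size is simple slicing. A sketch of that step (not the library's function):

```python
def chunk_entries(entries, max_entries=20):
    """Split a list of bibliography entries into chunks of at most max_entries."""
    return [entries[i:i + max_entries] for i in range(0, len(entries), max_entries)]

entries = [f"Entry {i}" for i in range(45)]
chunks = chunk_entries(entries, max_entries=20)
print([len(c) for c in chunks])  # [20, 20, 5]
```

Keeping chunks small matters because very long prompts degrade extraction quality and make a single malformed response more expensive to retry.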

Comparing models

from largeliterarymodels import CLAUDE_SONNET, GPT_4O_MINI, GEMINI_FLASH

for model in [CLAUDE_SONNET, GPT_4O_MINI, GEMINI_FLASH]:
    entries = task.run(chunks[0], model=model)
    print(f"{model}: {len(entries)} entries extracted")

Starting Your Own Project

largeliterarymodels is a general-purpose toolkit. For your specific research project, we recommend creating a separate repository that depends on it:

my-bibliography-project/
    task.py                 # your Task subclass with custom schema/examples
    data/
        source.html         # your input data
        output.csv          # your results
    notebooks/
        extract.ipynb       # your working notebook

Your task.py defines only what's specific to your project:

from pydantic import BaseModel, Field
from largeliterarymodels import Task

class MyEntry(BaseModel):
    # your custom fields here
    ...

class MyBibliographyTask(Task):
    schema = list[MyEntry]
    system_prompt = "Your domain-specific instructions..."
    examples = [...]

Install largeliterarymodels in your project's environment:

pip install largeliterarymodels

This way your project-specific decisions (field names, few-shot examples, OCR quirks) live in their own tracked repository, separate from the general-purpose toolkit.

Model Constants

For convenience, common model names are available as constants:

from largeliterarymodels import (
    CLAUDE_OPUS,    # claude-opus-4-6
    CLAUDE_SONNET,  # claude-sonnet-4-6
    CLAUDE_HAIKU,   # claude-haiku-4-5-20251001
    GPT_4O,         # gpt-4o
    GPT_4O_MINI,    # gpt-4o-mini
    GEMINI_PRO,     # gemini-2.5-pro
    GEMINI_FLASH,   # gemini-2.5-flash
)

llm = LLM(CLAUDE_OPUS)

Project Structure

largeliterarymodels/
    __init__.py              # Exports: LLM, Task, model constants, check_api_keys
    llm.py                   # Core LLM class: generate, extract, map, extract_map
    task.py                  # Task class: reusable extraction task definition
    providers.py             # Direct API calls to Anthropic, OpenAI, Google
    utils.py                 # Utility functions
    tasks/
        extract_bibliography.py  # Built-in bibliography extraction task
tests/                       # Test suite (run with: pytest)

License

MIT
