U.S. Presidential Annual Messages and State of the Union Addresses (1790-present), sourced from the UC Santa Barbara American Presidency Project.

Maintainers

Denubis

Project description

SOTU: U.S. State of the Union Addresses & Annual Messages (1790–present)

A Python package containing the full corpus of U.S. Presidential Annual Messages (1790–1946) and State of the Union Addresses (1947–present), with a pandas-DataFrame loader and an NLTK-style file-id reader interface. Designed to give Python users the same affordances as the R sotu package and its quanteda corpus integration.

This package is for digital humanities scholars, political scientists, and educators who want clean, offline, citable access to every SOTU from 1790 through 2026 (and beyond).

All text content in this package comes from the UC Santa Barbara American Presidency Project (https://www.presidency.ucsb.edu/), the authoritative scholarly archive curated by Gerhard Peters and John T. Woolley. This Python package only repackages their texts for programmatic access; the scholarly work of transcription, dating, classification, and verification is theirs. Please cite them in any research that uses this corpus — see Citation below.

Quickstart

Installation

uv add sotu
# or using standard pip:
pip install sotu

Or directly from source:

git clone https://github.com/DH-Oz/sotu.git
cd sotu
uv pip install .

Basic Usage

Get every address into a pandas DataFrame in one line:

import sotu

df = sotu.load()
print(df.head())
# Columns: ['year', 'president', 'party', 'sotu_type', 'text']

# Filter by speech type
spoken = df[df.sotu_type == "spoken"]    # delivered orally
written = df[df.sotu_type == "written"]  # submitted as written messages

# Filter by president. The column holds surnames only, so "Bush"
# matches both George H. W. Bush and George W. Bush.
washington = df[df.president == "Washington"]

# Programmatic coverage bounds
print(sotu.COVERAGE)  # (1790, 2026)

For full names and disambiguation columns (e.g. president_full to tell George H. W. Bush from George W. Bush), pass full=True:

df_full = sotu.load(full=True)
# Adds: 'fileid', 'president_full', 'is_sotu', 'date', 'source_url',
#       'word_count', 'sha256', ...

Canonical SOTUs vs. related UCSB documents

By default sotu.load() returns only canonical SOTUs, normally one per (year, president). The UCSB archive also tags a small number of related documents under the "State of the Union" taxonomy that aren't the SOTU itself, for example Nixon's 1973 series of policy-specific Special Messages to Congress, Roosevelt's 1945 radio summary of the written SOTU, and Eisenhower's 1956 Key West remarks. The corpus carries those rows with is_sotu=False so scholars can still access them:

archive = sotu.load(full=True, include_related=True)
related = archive[~archive.is_sotu]
print(related[["year", "president", "source_url"]])

Years with multiple canonical rows are legitimate spoken and written pairs: Nixon (1972 and 1974) and Carter (1978 through 1980) each delivered an address and submitted a longer written message to Congress on the same date.
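One way to spot such pairs is to count rows per (year, president). The sketch below works on plain dictionaries, such as those pandas produces with `DataFrame.to_dict("records")`, rather than on the package itself. The three rows are hand-written stand-ins that use the documented column names, and `find_multi_row_years` is a hypothetical helper, not part of `sotu`.

```python
from collections import Counter

# Hand-written stand-in rows using the documented column names;
# real rows would come from sotu.load().to_dict("records").
rows = [
    {"year": 1972, "president": "Nixon", "sotu_type": "spoken"},
    {"year": 1972, "president": "Nixon", "sotu_type": "written"},
    {"year": 1975, "president": "Ford", "sotu_type": "spoken"},
]


def find_multi_row_years(records):
    """Return {(year, president): row_count} for groups with more than one row."""
    counts = Counter((r["year"], r["president"]) for r in records)
    return {key: n for key, n in counts.items() if n > 1}


pairs = find_multi_row_years(rows)
print(pairs)
```

On a real DataFrame the same check could be written with a pandas `groupby` on the two columns; the dictionary version is used here only to keep the example dependency-free.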

Detailed Usage

NLTK-style corpus accessors

For NLTK-style access or granular text reading, use the file-id methods:

# Every fileid, sorted chronologically
fileids = sotu.fileids()
print(fileids[:5])
# ['1790-Washington-1', '1790-Washington-2', ...]

# Raw cleaned plain text of a single address
speech_text = sotu.raw('1790-Washington-1')
print(speech_text[:500])

# Complete metadata table without loading the bodies
meta = sotu.metadata()
print(meta.head())
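The sample output above suggests that fileids follow a `<year>-<surname>-<n>` shape. Assuming that pattern holds for every entry (the README does not state it, and the meaning of the trailing number as a per-year counter is a guess), a fileid can be split into typed parts without loading any metadata. `parse_fileid` below is a hypothetical helper, not a function provided by `sotu`.

```python
from typing import NamedTuple


class FileId(NamedTuple):
    year: int
    surname: str
    sequence: int  # assumed: per-year counter for that president


def parse_fileid(fileid: str) -> FileId:
    """Split '<year>-<surname>-<n>' into typed parts.

    Splitting once from each end keeps any hyphen inside the surname intact.
    """
    year, rest = fileid.split("-", 1)
    surname, sequence = rest.rsplit("-", 1)
    return FileId(int(year), surname, int(sequence))


print(parse_fileid("1790-Washington-2"))
```

If the real fileids deviate from this shape, `int()` raises a `ValueError`, which makes a mismatch visible early instead of silently producing wrong fields.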

Coming from R `sotu` + quanteda

The R sotu package exposes sotu_meta (a metadata data frame) and sotu_text (a parallel character vector). Together with quanteda they let you do things like:

library(sotu); library(quanteda)
corp     <- corpus(sotu_text, docvars = sotu_meta)
spoken   <- corpus_subset(corp, sotu_type == "speech")
ndoc(corp)

The Python equivalents (working with the same UCSB source texts):

R / quanteda	Python (`sotu`)
`sotu_meta`	`sotu.metadata()`
`sotu_text`	`sotu.load(full=True)["text"]`
`corpus(sotu_text, docvars=sotu_meta)`	`sotu.load(full=True)` (single joined DataFrame)
`texts(corp)`	`df["text"].tolist()`
`docvars(corp)`	`df.drop(columns=["text"])`
`docnames(corp)`	`sotu.fileids()` or `df["fileid"]`
`corpus_subset(corp, sotu_type == "speech")`	`df[df.sotu_type == "spoken"]`
`ndoc(corp)`	`len(df)`
`as.character(corp[i])`	`sotu.raw(fileid)`

Two intentional differences from the R package:

sotu_type vocabulary: this package uses "spoken" / "written" (matching the consumer contract documented for the DH-Oz masterclass). The R package uses "speech" / "written". Convert with df["sotu_type"].replace({"spoken": "speech"}) if you need R parity.
president column: this package's default president is the surname only (e.g. "Washington", "Van Buren") for stable joins; president_full carries the full name in load(full=True). R uses the full name as the primary president field.
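Both differences can be bridged in one step when R parity is needed. The sketch below relabels a single row represented as a dictionary (for instance one element of `sotu.load(full=True).to_dict("records")`). The helper name `to_r_conventions` and the sample row, including the exact `president_full` string, are illustrative assumptions.

```python
# Mapping from this package's sotu_type labels to the R package's labels.
PY_TO_R_SOTU_TYPE = {"spoken": "speech", "written": "written"}


def to_r_conventions(record: dict) -> dict:
    """Return a copy of one full-mode row relabelled the way R's sotu_meta is."""
    converted = dict(record)
    converted["sotu_type"] = PY_TO_R_SOTU_TYPE[record["sotu_type"]]
    # R keeps the full name in `president`; here it lives in `president_full`.
    converted["president"] = record["president_full"]
    return converted


# Hand-written stand-in row; real values come from sotu.load(full=True).
row = {
    "year": 1790,
    "president": "Washington",
    "president_full": "George Washington",
    "sotu_type": "spoken",
}
print(to_r_conventions(row))
```

The function copies the row rather than mutating it, so the surname-based `president` value stays available for joins on the original data.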

To hand the corpus to a Python NLP library like spaCy or Gensim:

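import spacy

import sotu

df = sotu.load(full=True)
# Requires the model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
# nlp.pipe processes the texts as a stream, which is faster than calling nlp() per text
for fileid, doc in zip(df["fileid"], nlp.pipe(df["text"])):
    ...  # your analysis here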
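Gensim works on tokenised documents, that is, a list of token lists. A minimal sketch of that preparation step follows, using a simple regular-expression tokeniser as a placeholder for whatever tokenisation your study calls for. The `tokenize` helper and the two sample strings are illustrative and not part of `sotu`.

```python
import re

# Lowercase alphabetic tokens, allowing one inner ASCII apostrophe ("union's").
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and return its alphabetic tokens."""
    return TOKEN_RE.findall(text.lower())


# Stand-in for df["text"]; real input would be the full address bodies.
texts = ["The Union's strength endures.", "Fellow-Citizens of the Senate"]
docs = [tokenize(t) for t in texts]
print(docs)
# With Gensim installed, gensim.corpora.Dictionary(docs) builds a vocabulary from docs.
```

This pattern splits hyphenated words and ignores digits and typographic apostrophes, so treat it as a starting point rather than a recommended tokeniser.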

Data Preservation & Provenance

The text corpus is compiled directly from the authoritative scholarly archive at the UC Santa Barbara American Presidency Project (https://www.presidency.ucsb.edu/), curated by Gerhard Peters and John T. Woolley.

Unlike many scraped datasets this package:

Preserves the exact raw HTML source files in the repository under raw/ucsb/ for verification and academic reproducibility.
Ships a SHA-256 hash manifest (manifest.json) covering every parsed plain-text speech and every raw HTML source so byte-level determinism can be re-checked at any time.
Retains the R sotu CRAN package's classification rules (e.g. labelling George Washington's party as Nonpartisan and Andrew Johnson's as National Union) while providing an additive disambiguation layer (president_full, president_id, date, source_url).
Flags non-SOTU UCSB documents (radio summaries, Key West remarks, Nixon 1973 topical Special Messages) with is_sotu=False so scholars can study them without polluting the canonical SOTU view.

The build orchestrator (uv run python -m tools.build) is deterministic — two consecutive offline builds against the same raw/ucsb/ snapshot produce byte-identical metadata.csv, manifest.json, and speeches/*.txt files.
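The re-check that the manifest enables can be sketched as follows. The README does not describe the schema of manifest.json, so this example assumes a flat mapping of relative path to hex SHA-256 digest; the real file may be structured differently and the loop would need adapting. `sha256_of` and `find_mismatches` are hypothetical helpers.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 of a file, read in chunks so large files stay cheap."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return relative paths that are missing or whose hash differs."""
    bad = []
    for rel_path, expected in sorted(manifest.items()):
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel_path)
    return bad


# Self-contained demo on a throwaway directory (assumed manifest layout).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_bytes(b"hello")
    manifest = {"a.txt": hashlib.sha256(b"hello").hexdigest(), "b.txt": "0" * 64}
    mismatches = find_mismatches(root, manifest)
    print(mismatches)
```

Against a checkout of the repository, the manifest would be loaded with `json.load` and `root` pointed at the directory that the manifest paths are relative to.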

Citation

Always cite the UCSB American Presidency Project when using the text content from this corpus. The scholarly transcription, classification, and verification of these documents is their work.

Recommended citation

Peters, Gerhard, and John T. Woolley. The American Presidency Project. University of California, Santa Barbara. https://www.presidency.ucsb.edu/

BibTeX:

@misc{peters-woolley-presidency,
  author       = {Peters, Gerhard and Woolley, John T.},
  title        = {The American Presidency Project},
  organization = {University of California, Santa Barbara},
  url          = {https://www.presidency.ucsb.edu/}
}

A machine-readable CITATION.cff is provided at the repository root for tools that consume it (GitHub, Zotero, etc.).

If you specifically want to cite this Python packaging of the corpus, also reference the project repository (https://github.com/DH-Oz/sotu), but the primary citation belongs to UCSB.

Acknowledgements

Gerhard Peters and John T. Woolley for building and maintaining the American Presidency Project at UC Santa Barbara since 1999. Every text in this package is theirs; this project only reorganises their work for programmatic access.
The R sotu package authors for establishing the classification conventions (sotu_type, party assignments) that this Python package adopts.

License

The Python code and build system are licensed under the MIT License.
The SOTU speech texts themselves are works of the United States Government and reside in the public domain under 17 U.S.C. § 105.
The HTML markup and editorial structure provided by the UCSB American Presidency Project belong to UCSB; this package only redistributes the plain-text transcriptions they have made publicly available.

Release history

This version

0.1.0

May 21, 2026

Download files

Source Distribution

sotu-0.1.0.tar.gz (10.9 MB)

Uploaded May 21, 2026 Source

Built Distribution

sotu-0.1.0-py3-none-any.whl (4.5 MB)

Uploaded May 21, 2026 Python 3

File details

Details for the file sotu-0.1.0.tar.gz.

File metadata

Download URL: sotu-0.1.0.tar.gz
Upload date: May 21, 2026
Size: 10.9 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for sotu-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`9397377ecf003ba637c88d3a36e862f1e88cea43177c3e478c37f5d63dbea49c`
MD5	`6dcf16f7518d3b2ce9d68396ded1b660`
BLAKE2b-256	`d16fa4db368e84c46309dd7996efdd1bf4dc7a3b7f99819f0732e4637a7f8bcf`

Provenance

The following attestation bundles were made for sotu-0.1.0.tar.gz:

Publisher: release.yml on DH-Oz/sotu

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: sotu-0.1.0.tar.gz
- Subject digest: 9397377ecf003ba637c88d3a36e862f1e88cea43177c3e478c37f5d63dbea49c
- Sigstore transparency entry: 1589627234
- Sigstore integration time: May 21, 2026
Source repository:
- Permalink: DH-Oz/sotu@c267ae38c9eca243f1b575353f1c9fff6ecb582b
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/DH-Oz
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@c267ae38c9eca243f1b575353f1c9fff6ecb582b
- Trigger Event: push

File details

Details for the file sotu-0.1.0-py3-none-any.whl.

File metadata

Download URL: sotu-0.1.0-py3-none-any.whl
Upload date: May 21, 2026
Size: 4.5 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for sotu-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1c42781ece3709bf985153c34c489f5ebd9f972d0a775b9dc622cc233866f496`
MD5	`9227e37e1ff55deeb6c7823f16aa8b09`
BLAKE2b-256	`3804a969ac18deb24ce4e1a2e22775163de58ce7e55cae4a9dd3ff4f6c42b2a9`

Provenance

The following attestation bundles were made for sotu-0.1.0-py3-none-any.whl:

Publisher: release.yml on DH-Oz/sotu

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: sotu-0.1.0-py3-none-any.whl
- Subject digest: 1c42781ece3709bf985153c34c489f5ebd9f972d0a775b9dc622cc233866f496
- Sigstore transparency entry: 1589627317
- Sigstore integration time: May 21, 2026
Source repository:
- Permalink: DH-Oz/sotu@c267ae38c9eca243f1b575353f1c9fff6ecb582b
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/DH-Oz
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@c267ae38c9eca243f1b575353f1c9fff6ecb582b
- Trigger Event: push
