Scrape OpenStax math textbooks into AI-ready JSON (Markdown + LaTeX), with optional problem/solution pairs.

openstax-scraper

Turn OpenStax textbooks into AI-ready JSON: one clean Markdown document per content page — LaTeX math preserved, practice problems inline, with rich metadata and quality signals — plus an optional mode that harvests problem ↔ solution pairs. The output is newline-delimited JSON (JSONL), ready to chunk and embed for retrieval-augmented generation (RAG), fine-tuning datasets, search indexes, or analysis.

The package installs a single command-line tool, scrape_openstax, and is also usable as a library (import openstax_scraper).

Highlights

  • MathML → LaTeX for the full element set OpenStax math books use, wrapped as $…$ / $$…$$.
  • HTML → Markdown that preserves LaTeX through Markdown escaping, references images by absolute URL (never downloads them), and normalizes whitespace.
  • Polite, cached fetching — retry/backoff (429/5xx, Retry-After), per-host delay + jitter, a descriptive User-Agent, robots.txt enforcement, and a mandatory on-disk cache with a refresh interval (TTL).
  • Idempotent output — upsert-by-id with atomic rewrite, so re-running on an unchanged book is a no-op and a changed page updates its line in place.
  • Schema-validated — every record can be checked against bundled JSON Schemas before it is written (--validate).
  • Per-page error isolation — one bad page never aborts a whole book.

Installation

Requires Python 3.10+.

pip install openstax-scraper

Or from a clone, in editable mode with the dev tools (pytest, ruff, build):

python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

Dependencies

All runtime dependencies install automatically with the package:

| Package | Why it's needed |
| --- | --- |
| requests | HTTP fetching |
| lxml | HTML/MathML parsing |
| markdownify | HTML → Markdown conversion |
| jsonschema | --validate against the output schemas |
| langdetect | language enrichment signal |
| yake | offline keyword extraction (no LLM, no network) |
| platformdirs | resolves the per-user cache directory |
| anthropic | --get-qa solutions-manual discovery (an Anthropic API key is required only for that mode) |

Usage

The tool does two main things:

  1. Scrape a whole textbook into a single JSONL file with enriched metadata.
  2. Extract question & answer pairs from a textbook into a separate JSONL file.

Any in-book page URL works as the seed — it is normalized to the book's preface, from which the full table of contents is discovered.

Scrape a textbook into JSONL

Output is <output-dir>/<book>/page_contents.jsonl (plus a manifest.json).

# Whole book
scrape_openstax \
  --book-url=https://openstax.org/books/calculus-volume-1/pages/preface \
  --output-dir ./out --validate

# Selected chapters only (comma-separated, no spaces)
scrape_openstax \
  --book-url=https://openstax.org/books/calculus-volume-1/pages/1-1-review-of-functions \
  --chapters=1,2,3 --output-dir ./out --validate

Re-running is idempotent: an unchanged book is a no-op, a changed page updates its line in place. Add --dry-run to crawl and validate without writing.

Extract question-and-answer pairs

--get-qa is a separate mode that pairs each problem with its worked solution into <output-dir>/<book>/questions_and_answers.jsonl. It covers the whole book (ignores --chapters) and needs an Anthropic API key to discover the solutions-manual pages.

The key is resolved as --anthropic-api-key=<key> if given, otherwise the ANTHROPIC_API_KEY environment variable (read from a local .env automatically, or exported). If neither is set, the run exits with a fatal config error.

scrape_openstax \
  --book-url=https://openstax.org/books/chemistry-2e/pages/preface \
  --get-qa --output-dir ./out --validate

A book with no solutions manual produces no file and exits cleanly.
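
The key resolution order described above can be sketched as a small function. This is an illustrative sketch, not the package's code: `resolve_api_key` is a hypothetical name, and the automatic `.env` loading is left out (the function only sees whatever is already in the environment mapping it is given).

```python
import os
from typing import Mapping, Optional


def resolve_api_key(cli_value: Optional[str],
                    env: Mapping[str, str] = os.environ) -> str:
    """Return the key from the flag if given, else from ANTHROPIC_API_KEY."""
    # The explicit flag wins; the environment variable is only a fallback.
    key = cli_value or env.get("ANTHROPIC_API_KEY")
    if not key:
        # Assumption: mirror the documented fatal config error (exit code 2).
        raise SystemExit(2)
    return key
```

Passing the environment as a parameter keeps the precedence rule easy to test without touching the real process environment.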

Parse a single saved page (offline)

scrape_openstax \
  --from-file page.html \
  --url https://openstax.org/books/calculus-volume-1/pages/1-1-review-of-functions \
  --output-dir ./out

Use as a library

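from pathlib import Path
from openstax_scraper.adapters.openstax import OpenStaxAdapter

url = "https://openstax.org/books/calculus-volume-1/pages/1-1-review-of-functions"
html = Path("page.html").read_text(encoding="utf-8")  # a saved copy of that page (assumed to be text)

adapter = OpenStaxAdapter()
page = adapter.parse_page(url, html)   # -> a PageRecord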

Command-line arguments

| Flag | Meaning |
| --- | --- |
| --book-url=<URL> | Any in-book page URL (/books/<book>/pages/...); crawls that book (normalized to its /pages/preface, whose TOC is discovered). Mutually exclusive with --from-file; one of the two is required. |
| --from-file=<path> | Parse a single local HTML file instead of crawling (offline). |
| --url=<URL> | The canonical URL to record when using --from-file (so id/source are right even though the bytes came from disk). |
| --output-dir=<dir> | Where to write output. Default ./out, or $OPENSTAX_OUTPUT_DIR if set. Files: <dir>/<book>/page_contents.jsonl + manifest.json (or questions_and_answers.jsonl under --get-qa). |
| --chapters=<csv> | Restrict the crawl to one or more chapters, comma-separated with no spaces (e.g. 11 or 1,2,3). Each keeps pages whose slug begins <n>- (e.g. 11-1, 11-2). Omit to crawl the whole book. |
| --get-qa | Collect problem/solution pairs into questions_and_answers.jsonl instead of crawling pages. Covers the whole book (ignores --chapters). Needs an Anthropic API key. A book with no solutions manual produces no file and exits cleanly. |
| --anthropic-api-key=<key> | Anthropic API key for --get-qa. Takes precedence over the ANTHROPIC_API_KEY environment variable (and .env); if omitted, that variable is the fallback. If neither is set, --get-qa exits with a fatal error. Ignored outside --get-qa. |
| --delay=<sec> | Politeness delay between requests to the same host (default 1.0, plus jitter). |
| --refresh-interval=<sec> | How long a cached page stays fresh before it's re-fetched. Default 432000 (5 days). Caching is always on; see Cache location. |
| --no-robots | Skip robots.txt consultation (default: obey it). |
| --keywords=<mode> | heuristic (default, offline yake) or none. |
| --include-types=<csv> | Keep only these content_types (e.g. textbook_section,chapter_intro). Default: all. |
| --validate | Check every record against the bundled JSON Schemas before writing, and fail loudly if anything is off-contract. Recommended. |
| --dry-run | Crawl + parse + validate, but write nothing. Great for CI. |
| --user-agent=<str> | Override the polite identifying User-Agent sent on fetches. |
| --log-level=<lvl> | Logging verbosity (DEBUG/INFO/WARNING/…). Default INFO. |
| -h, --help | Print usage and exit. |
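
The --chapters rule (keep pages whose slug begins `<n>-`) is easy to get subtly wrong, because chapter 1 must not also match chapter 11. A minimal sketch of that prefix rule follows; `filter_by_chapters` and the sample slugs are hypothetical and are not taken from the package.

```python
from typing import List, Optional


def filter_by_chapters(slugs: List[str], chapters_csv: Optional[str]) -> List[str]:
    """Keep slugs that start with '<n>-' for any requested chapter n."""
    if not chapters_csv:
        return list(slugs)  # no restriction: whole book
    # Including the trailing hyphen stops "1" from matching "11-1-...".
    prefixes = tuple(f"{chapter}-" for chapter in chapters_csv.split(",") if chapter)
    return [slug for slug in slugs if slug.startswith(prefixes)]


slugs = ["preface", "1-1-review-of-functions", "2-1-a-preview-of-calculus",
         "11-1-sample-section"]
print(filter_by_chapters(slugs, "1,2"))
```

Note that under this rule unnumbered pages such as the preface are dropped whenever a chapter list is given.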

Output schema

Newline-delimited JSON, one file per book. The authoritative contract is the bundled JSON Schemas in src/openstax_scraper/schemas/.

page_contents.jsonl — one object per content page:

| Field | Meaning |
| --- | --- |
| id | sha1(url); stable primary key (idempotency) |
| url, title | page URL and section title |
| body_text | cleaned Markdown with $…$ / $$…$$ LaTeX; the full page, practice problems included inline |
| source | {site, book, book_title, chapter, section, page_slug} |
| content_type | one of textbook_section, chapter_intro, chapter_summary, glossary, reference |
| char_count, word_count, math_density, n_images, image_urls | structural quality signals |
| language, reading_time_min, keywords | enrichment signals |
| content_hash, fetched_at, scraper_version | provenance / change-detection |

Why are problems kept inline rather than split into their own records? OpenStax problems have no dependable structure — groups, sub-problems, irregular numbering — so a reliable split would require an LLM. Instead they are treated as ordinary page content: converted to Markdown and left inline in body_text, in reading order.
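
Because the output is plain JSONL, it can be consumed with the standard library alone. The sketch below streams a page_contents.jsonl file and keeps only substantial textbook sections, using the `content_type` and `word_count` fields from the table above. The helper names and the 200-word threshold are illustrative choices, not part of the package.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator, List


def iter_records(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def select_sections(records: Iterable[dict], min_words: int = 200) -> List[dict]:
    """Keep textbook sections with at least `min_words` words."""
    return [
        rec for rec in records
        if rec.get("content_type") == "textbook_section"
        and rec.get("word_count", 0) >= min_words
    ]
```

A typical call would be `select_sections(iter_records(Path("out/<book>/page_contents.jsonl")))`, with `<book>` replaced by the directory the scraper created; each surviving record's `body_text` is then ready to chunk and embed.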

questions_and_answers.jsonl (from --get-qa) — one object per pair:

| Field | Meaning |
| --- | --- |
| id | sha1(question_url, fragment); stable primary key |
| question, answer | the problem and its worked solution, both Markdown + LaTeX |
| source | {site, book, chapter, section, page_slug} of the question |
| question_url, answer_url | where the problem is stated / where the solution lives |
| question_fragment, label | the problem element's id; the solution's displayed number |
| content_hash, fetched_at, scraper_version | provenance / change-detection |

How it works

Parsing: HTML + MathML → Markdown

OpenStaxAdapter.parse_page(url, html) turns a page into a PageRecord:

  • MathML → LaTeX (mathml.py) for the element set OpenStax math books use (mi mn mo mrow msup msub msubsup mfrac msqrt mroot mtable …).
  • HTML → Markdown (htmlmd.py) that preserves LaTeX through Markdown escaping (sentinel substitution) and references images by absolute URL.
  • Page classification + routing (section / intro / summary / glossary / reference / skip). The entire content body — worked examples, in-text notes, and practice problems — is kept inline in one Markdown document.
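
To make the MathML → LaTeX step concrete, here is a toy recursive converter for a handful of the elements listed above (mi, mn, mo, mrow, msup, msub, mfrac, msqrt). It is a simplified sketch built on the standard library's xml.etree, not the code in mathml.py: the real module covers more elements and edge cases, and the function names here are made up for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop an XML namespace prefix such as '{http://...}mfrac'."""
    return tag.rsplit("}", 1)[-1]


def to_latex(node: ET.Element) -> str:
    """Recursively convert a small subset of MathML to LaTeX."""
    name = local_name(node.tag)
    kids = [to_latex(child) for child in node]
    if name in ("mi", "mn", "mo"):
        return (node.text or "").strip()  # identifiers, numbers, operators
    if name == "msup" and len(kids) == 2:
        return f"{{{kids[0]}}}^{{{kids[1]}}}"
    if name == "msub" and len(kids) == 2:
        return f"{{{kids[0]}}}_{{{kids[1]}}}"
    if name == "mfrac" and len(kids) == 2:
        return f"\\frac{{{kids[0]}}}{{{kids[1]}}}"
    if name == "msqrt":
        body = "".join(kids)
        return f"\\sqrt{{{body}}}"
    return "".join(kids)  # math, mrow, and anything unrecognized


def mathml_to_inline_latex(xml: str) -> str:
    """Wrap the converted expression as inline math: $...$."""
    return f"${to_latex(ET.fromstring(xml))}$"


sample = ('<math xmlns="http://www.w3.org/1998/Math/MathML"><mfrac>'
          '<mrow><mi>x</mi><mo>+</mo><mn>1</mn></mrow>'
          '<msup><mi>y</mi><mn>2</mn></msup></mfrac></math>')
print(mathml_to_inline_latex(sample))  # prints $\frac{x+1}{{y}^{2}}$
```

Falling through to a plain concatenation of children for unknown elements keeps the converter from failing on markup it does not recognize, at the cost of losing that element's structure.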

Crawling and fetching

  • fetcher.py — site-agnostic HTTP with retry/backoff, a polite per-host delay + jitter, robots.txt enforcement, and a mandatory on-disk cache: entries are stored as <url-hash>-<epoch>.html and re-used until older than --refresh-interval, then re-fetched. Every fetch returns a FetchResult; errors are captured, never raised.
  • crawler.py — generic over the SiteAdapter: discovers the full book TOC from the seed page, builds an ordered, deduplicated frontier (optionally narrowed to --chapters), then fetches/classifies/parses/enriches each page with content-hash dedup and per-page error isolation.
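
The retry/backoff behavior in fetcher.py is described only at a high level, so the following is a generic sketch of the technique rather than the package's implementation. It shows one common way to pick the wait before a retry: honor a numeric Retry-After header when the server sends one, otherwise use capped exponential backoff with optional jitter. The function names, the base of 1 second, and the 60-second cap are all assumptions.

```python
import random
from typing import Optional


def should_retry(status: int) -> bool:
    """Retry on 429 (rate limited) and on any 5xx server error."""
    return status == 429 or 500 <= status < 600


def retry_delay(attempt: int, retry_after: Optional[str] = None,
                base: float = 1.0, cap: float = 60.0,
                jitter: float = 0.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Numeric form of Retry-After: a delay in seconds.
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # the HTTP-date form is not handled in this sketch
    backoff = min(cap, base * (2 ** attempt))
    return backoff + random.uniform(0, jitter)
```

With the defaults the waits grow as 1, 2, 4, 8 seconds and stop growing at the cap; a small jitter spreads out retries so that repeated runs do not hit the host in lockstep.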

Enrichment and idempotent output

  • enrich.py fills quality signals generically: language (langdetect), reading_time_min (word_count / 200), and offline keywords (yake).
  • writers.py writes idempotent JSONL (upsert-by-id, atomic rewrite) plus a per-book manifest.json of run metadata and counts.
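
The upsert-by-id, atomic-rewrite pattern can be sketched as follows. This is an illustration of the general technique under stated assumptions, not the code in writers.py: `upsert_jsonl` is a hypothetical name, and change detection here compares whole records, whereas the real writer may rely on `content_hash`.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict, List


def upsert_jsonl(path: Path, new_records: List[dict]) -> bool:
    """Merge records by "id" and atomically rewrite; True if the file changed."""
    existing: Dict[str, dict] = {}
    if path.exists():
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    rec = json.loads(line)
                    existing[rec["id"]] = rec
    merged = dict(existing)
    for rec in new_records:
        merged[rec["id"]] = rec  # an existing id keeps its original position
    if path.exists() and merged == existing:
        return False  # nothing changed: leave the file untouched
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then swap it into place.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        for rec in merged.values():
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")
    os.replace(tmp_name, path)  # atomic rename on the same filesystem
    return True
```

Because the temp file lives next to the target, `os.replace` is a same-filesystem rename, so a reader sees either the old file or the new one and never a half-written file.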

How Q&A pairing works

Pairing problems with solutions on OpenStax is impractical to hardcode: solutions manuals appear under inconsistent names and positions, often cover only some problems, and number them out of order. What makes it tractable is that every solution's number is a back-link to the problem it solves: an <a class="os-number" … data-page-slug="…" data-page-fragment="…">. So --get-qa:

  1. Discovers the solutions-manual pages with a cheap LLM — the model reads the TOC and returns the solution-page slugs (hallucinated slugs are dropped — only real TOC leaves survive). The prompt is bundled at prompts/discover_solutions.md.
  2. Fetches the whole book, then on each solutions page finds every os-number back-link, takes its parent as the answer, and follows the link to the problem element on its page. Both halves are converted to Markdown with the same MathML→LaTeX pass as page bodies.
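
The two steps above can be sketched with the standard library. The first helper drops slugs that are not real TOC leaves; the second collects every os-number back-link on a solutions page together with its slug, fragment, and displayed number. This is a simplified illustration: the package itself parses with lxml and also extracts the parent element as the answer, and the helper names and sample HTML here are invented for the example.

```python
from html.parser import HTMLParser
from typing import Iterable, List, Optional, Set


def keep_real_slugs(candidates: Iterable[str], toc_slugs: Set[str]) -> List[str]:
    """Discard model-suggested slugs that do not exist in the TOC."""
    return [slug for slug in candidates if slug in toc_slugs]


class BackLinkParser(HTMLParser):
    """Collect <a class="os-number" data-page-slug=... data-page-fragment=...>."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[dict] = []
        self._current: Optional[dict] = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag == "a" and "os-number" in classes:
            self._current = {"slug": attr.get("data-page-slug"),
                             "fragment": attr.get("data-page-fragment"),
                             "label": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["label"] += data  # the solution's displayed number

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["label"] = self._current["label"].strip()
            self.links.append(self._current)
            self._current = None


def find_backlinks(html: str) -> List[dict]:
    parser = BackLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical solutions-page fragment, for illustration only.
page = ('<div><a class="os-number" data-page-slug="1-1-review-of-functions" '
        'data-page-fragment="fs-id1">1</a>. <p>Answer text</p></div>')
print(find_backlinks(page))
```

Each returned slug and fragment pair identifies the page and the element id of the problem, which is what lets a solution be joined to its question without relying on numbering order.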

Operational reference

Cache location

The on-disk page cache is a private speed/politeness optimization, not output, so it lives in one constant place and is reused across runs regardless of where you write JSONL. The directory is resolved as:

  1. $OPENSTAX_CACHE_DIR if set — an explicit override.
  2. otherwise the per-user cache dir for this app (via platformdirs):
    • Linux: ~/.cache/openstax-scraper (honors $XDG_CACHE_HOME)
    • macOS: ~/Library/Caches/openstax-scraper
    • Windows: %LOCALAPPDATA%\openstax-scraper\Cache

The cache is best-effort: an unwritable path just disables it rather than failing the run.
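
The resolution order above, together with the `<url-hash>-<epoch>.html` entry naming, can be sketched like this. It is an approximation under stated assumptions: the helper names are hypothetical, the exact platformdirs arguments the package passes are not documented here (`appauthor=False` is chosen because it yields the Windows path shown above), and the freshness test simply reads the epoch back out of the file name.

```python
import os
import time
from pathlib import Path
from typing import Optional


def resolve_cache_dir() -> Path:
    """$OPENSTAX_CACHE_DIR if set, else the per-user cache directory."""
    override = os.environ.get("OPENSTAX_CACHE_DIR")
    if override:
        return Path(override)
    import platformdirs  # third-party; only needed when there is no override
    return Path(platformdirs.user_cache_dir("openstax-scraper", appauthor=False))


def is_fresh(entry: Path, refresh_interval: float = 432000,
             now: Optional[float] = None) -> bool:
    """True while a '<url-hash>-<epoch>.html' entry is within the interval."""
    epoch = int(entry.stem.rsplit("-", 1)[1])  # the part after the last hyphen
    current = time.time() if now is None else now
    return current - epoch <= refresh_interval
```

Encoding the fetch time in the file name means freshness can be decided from a directory listing alone, without opening the entry or trusting filesystem modification times.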

Exit codes

The CLI follows the rule "exit non-zero only on fatal config errors; per-page failures never fail the run."

| Code | Meaning |
| --- | --- |
| 0 | Ran to completion. Individual pages that 404, time out, or fail to parse are isolated, counted in the summary (failed=…), and do not change the exit code. |
| 2 | A fatal config/environment error stopped the run before useful output: --from-file path missing, the seed page couldn't be fetched (empty frontier), a record came out off-contract under --validate, or --get-qa had no API key. These print a one-line error instead of a traceback and write nothing. argparse also exits 2 on bad/missing flags. |

Development

pip install -e ".[dev]"
pytest          # fully offline, against committed fixtures in tests/fixtures/
ruff check .

Project layout

src/openstax_scraper/
  mathml.py            # MathML → LaTeX
  htmlmd.py            # HTML → Markdown (math-aware)
  models.py            # PageRecord, QuestionAndAnswer data classes
  config.py            # runtime configuration (delay, get_qa, cache dir, …)
  enrich.py            # generic quality signals (language, reading time, keywords)
  fetcher.py           # site-agnostic HTTP: retry, throttle, cache, robots
  crawler.py           # orchestrator: TOC frontier → fetch/parse/enrich, dedup
  qa.py                # --get-qa orchestrator: pair problems with solutions
  llm.py               # tiny Anthropic wrapper (used by --get-qa)
  prompts.py           # locate/load bundled prompt templates
  writers.py           # idempotent upsert JSONL + manifest
  cli.py               # scrape_openstax entry point
  adapters/
    base.py            # SiteAdapter protocol + PageClass
    openstax.py        # all OpenStax-specific knowledge
  schemas/             # bundled JSON Schemas (output contract)
  prompts/             # bundled LLM prompt templates
scripts/               # diagnostic probes (live-site, dev-only)
tests/                 # offline tests + committed HTML fixtures

The adapter boundary keeps OpenStax specifics out of the generic pipeline: supporting a new site means adding one SiteAdapter, not touching the crawler.

License

MIT © Yoftahe Milkessa
