Scrape OpenStax math textbooks into AI-ready JSON (Markdown + LaTeX), with optional problem/solution pairs.

openstax-scraper

Turn OpenStax textbooks into AI-ready JSON: one clean Markdown document per content page — LaTeX math preserved, practice problems inline, with rich metadata and quality signals — plus an optional mode that harvests problem ↔ solution pairs. The output is newline-delimited JSON (JSONL), ready to chunk and embed for retrieval-augmented generation (RAG), fine-tuning datasets, search indexes, or analysis.

The package installs a single command-line tool, scrape_openstax, and is also usable as a library (import openstax_scraper).

Highlights

- MathML → LaTeX for the full element set OpenStax math books use, wrapped as $…$ / $$…$$.
- HTML → Markdown that preserves LaTeX through Markdown escaping, references images by absolute URL (never downloads them), and normalizes whitespace.
- Polite, cached fetching: retry/backoff (429/5xx, Retry-After), per-host delay + jitter, a descriptive User-Agent, robots.txt enforcement, and a mandatory on-disk cache with a refresh interval (TTL).
- Idempotent output: upsert-by-id with atomic rewrite, so re-running on an unchanged book is a no-op and a changed page updates its line in place.
- Schema-validated: every record can be checked against bundled JSON Schemas before it is written (--validate).
- Per-page error isolation: one bad page never aborts a whole book.

Installation

Requires Python 3.10+.

pip install openstax-scraper

Or from a clone, in editable mode with the dev tools (pytest, ruff, build):

python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

Dependencies

All runtime dependencies install automatically with the package:

Package	Why it's needed
`requests`	HTTP fetching
`lxml`	HTML/MathML parsing
`markdownify`	HTML → Markdown conversion
`jsonschema`	`--validate` against the output schemas
`langdetect`	`language` enrichment signal
`yake`	offline `keywords` extraction (no LLM, no network)
`platformdirs`	resolves the per-user cache directory
`anthropic`	`--get-qa` solutions-manual discovery (an Anthropic API key is required only for that mode)

Usage

The tool has two main modes (an offline single-page parser is covered further down):

- Scrape a whole textbook into a single JSONL file with enriched metadata.
- Extract question & answer pairs from a textbook into a separate JSONL file.

Any in-book page URL works as the seed — it is normalized to the book's preface, from which the full table of contents is discovered.

Scrape a textbook into JSONL

Output is <output-dir>/<book>/page_contents.jsonl (plus a manifest.json).

# Whole book
scrape_openstax \
  --book-url=https://openstax.org/books/calculus-volume-1/pages/preface \
  --output-dir ./out --validate

# Selected chapters only (comma-separated, no spaces)
scrape_openstax \
  --book-url=https://openstax.org/books/calculus-volume-1/pages/1-1-review-of-functions \
  --chapters=1,2,3 --output-dir ./out --validate

Re-running is idempotent: an unchanged book is a no-op, and a changed page updates its line in place. Add --dry-run to crawl and validate without writing.

Extract question-and-answer pairs

--get-qa is a separate mode that pairs each problem with its worked solution into <output-dir>/<book>/questions_and_answers.jsonl. It covers the whole book (ignores --chapters) and needs an Anthropic API key to discover the solutions-manual pages.

The key is taken from --anthropic-api-key=<key> if given; otherwise from the ANTHROPIC_API_KEY environment variable, which may be exported or loaded automatically from a local .env file. If neither is set, the run exits with a fatal config error (exit code 2).
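The precedence described above can be sketched as a small function. This is only an illustration under stated assumptions: `resolve_api_key` and `FatalConfigError` are hypothetical names, not the package's API, and loading of the .env file is left out.

```python
import os
from typing import Mapping, Optional


class FatalConfigError(Exception):
    """Hypothetical stand-in for the tool's fatal config error."""


def resolve_api_key(cli_value: Optional[str],
                    env: Mapping[str, str] = os.environ) -> str:
    """Return the flag value if given, else ANTHROPIC_API_KEY from the environment."""
    if cli_value:
        return cli_value                       # the flag takes precedence
    env_value = env.get("ANTHROPIC_API_KEY", "")
    if env_value:
        return env_value                       # fallback: environment variable
    raise FatalConfigError("--get-qa needs an Anthropic API key")
```

Passing the environment as a mapping keeps the function easy to exercise without touching the real process environment.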

scrape_openstax \
  --book-url=https://openstax.org/books/chemistry-2e/pages/preface \
  --get-qa --output-dir ./out --validate

A book with no solutions manual produces no file and exits cleanly.

Parse a single saved page (offline)

scrape_openstax \
  --from-file page.html \
  --url https://openstax.org/books/calculus-volume-1/pages/1-1-review-of-functions \
  --output-dir ./out

Use as a library

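from pathlib import Path
from openstax_scraper.adapters.openstax import OpenStaxAdapter

url = "https://openstax.org/books/calculus-volume-1/pages/1-1-review-of-functions"
html = Path("page.html").read_text(encoding="utf-8")   # a saved copy of that page
adapter = OpenStaxAdapter()
page = adapter.parse_page(url, html)   # -> a PageRecord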

Command-line arguments

Flag	Meaning
`--book-url=<URL>`	Any in-book page URL (`/books/<book>/pages/...`). The URL is normalized to the book's `/pages/preface`, from which the TOC is discovered and the book is crawled. Mutually exclusive with `--from-file`; one of the two is required.
`--from-file=<path>`	Parse a single local HTML file instead of crawling (offline).
`--url=<URL>`	The canonical URL to record when using `--from-file` (so `id`/`source` are right even though the bytes came from disk).
`--output-dir=<dir>`	Where to write output. Default `./out`, or `$OPENSTAX_OUTPUT_DIR` if set. Files: `<dir>/<book>/page_contents.jsonl` + `manifest.json` (or `questions_and_answers.jsonl` under `--get-qa`).
`--chapters=<csv>`	Restrict the crawl to one or more chapters, comma-separated with no spaces (e.g. `11` or `1,2,3`). Each keeps pages whose slug begins `<n>-` (e.g. `11-1`, `11-2`). Omit to crawl the whole book.
`--get-qa`	Collect problem/solution pairs into `questions_and_answers.jsonl` instead of crawling pages. Covers the whole book (ignores `--chapters`). Needs an Anthropic API key. A book with no solutions manual produces no file and exits cleanly.
`--anthropic-api-key=<key>`	Anthropic API key for `--get-qa`. Takes precedence over the `ANTHROPIC_API_KEY` environment variable (and `.env`); if omitted, that variable is the fallback. If neither is set, `--get-qa` exits with a fatal error. Ignored outside `--get-qa`.
`--delay=<sec>`	Politeness delay between requests to the same host (default `1.0`, plus jitter).
`--refresh-interval=<sec>`	How long a cached page stays fresh before it's re-fetched. Default `432000` (5 days). Caching is always on; see Cache location.
`--no-robots`	Skip `robots.txt` consultation (default: obey it).
`--keywords=<mode>`	`heuristic` (default, offline `yake`) or `none`.
`--include-types=<csv>`	Keep only these `content_type`s (e.g. `textbook_section,chapter_intro`). Default: all.
`--validate`	Check every record against the bundled JSON Schemas before writing, and fail loudly if anything is off-contract. Recommended.
`--dry-run`	Crawl + parse + validate, but write nothing. Great for CI.
`--user-agent=<str>`	Override the polite identifying User-Agent sent on fetches.
`--log-level=<lvl>`	Logging verbosity (`DEBUG`/`INFO`/`WARNING`/…). Default `INFO`.
`-h`, `--help`	Print usage and exit.
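The slug-prefix rule behind `--chapters` can be illustrated with a short sketch. The helper names and the sample slugs in the usage note are assumptions made for illustration; the real crawler may apply the filter differently.

```python
from typing import Iterable, List


def parse_chapters(csv: str) -> List[str]:
    """Split a --chapters value such as "1,2,3" into chapter numbers."""
    return [part for part in csv.split(",") if part]


def filter_slugs(slugs: Iterable[str], chapters: List[str]) -> List[str]:
    """Keep slugs that begin with "<n>-" for any requested chapter n."""
    prefixes = tuple(f"{n}-" for n in chapters)
    # The trailing dash matters: "1-" does not match "11-1-...".
    return [slug for slug in slugs if slug.startswith(prefixes)]
```

With slugs such as `1-1-review-of-functions` and a hypothetical `11-1-parametric-equations`, `filter_slugs(slugs, parse_chapters("1"))` keeps only the first, because the prefix includes the dash.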

Output schema

Newline-delimited JSON, one file per book. The authoritative contract is the bundled JSON Schemas in src/openstax_scraper/schemas/.

page_contents.jsonl — one object per content page:

field	meaning
`id`	`sha1(url)` — stable primary key (idempotency)
`url`, `title`	page URL and section title
`body_text`	cleaned Markdown with `$…$` / `$$…$$` LaTeX; the full page, practice problems included inline
`source`	`{site, book, book_title, chapter, section, page_slug}`
`content_type`	`textbook_section` \| `chapter_intro` \| `chapter_summary` \| `glossary` \| `reference`
`char_count`, `word_count`, `math_density`, `n_images`, `image_urls`	structural quality signals
`language`, `reading_time_min`, `keywords`	enrichment signals
`content_hash`, `fetched_at`, `scraper_version`	provenance / change-detection

Why are problems kept inline rather than split into their own records? OpenStax problems have no dependable structure — groups, sub-problems, irregular numbering — so a reliable split would require an LLM. Instead they are treated as ordinary page content: converted to Markdown and left inline in body_text, in reading order.
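As a rough illustration of consuming page_contents.jsonl, the sketch below reads the file line by line and keeps only `textbook_section` records. The helper names are hypothetical, and `page_id` shows just one plausible reading of `sha1(url)` (hex digest of the UTF-8 URL); the package's exact encoding is an assumption here.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, Iterator


def read_jsonl(path: Path) -> Iterator[Dict]:
    """Yield one dict per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def page_id(url: str) -> str:
    """Assumed form of sha1(url): hex digest of the UTF-8 encoded URL."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def sections_only(path: Path) -> Dict[str, str]:
    """Map id -> body_text for records whose content_type is textbook_section."""
    return {
        rec["id"]: rec["body_text"]
        for rec in read_jsonl(path)
        if rec.get("content_type") == "textbook_section"
    }
```

The resulting mapping is a convenient starting point for chunking and embedding, since each value is one complete Markdown document.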

questions_and_answers.jsonl (from --get-qa) — one object per pair:

field	meaning
`id`	`sha1(question_url, fragment)` — stable primary key
`question`, `answer`	the problem and its worked solution, both Markdown + LaTeX
`source`	`{site, book, chapter, section, page_slug}` of the question
`question_url`, `answer_url`	where the problem is stated / where the solution lives
`question_fragment`, `label`	the problem element's id; the solution's displayed number
`content_hash`, `fetched_at`, `scraper_version`	provenance / change-detection

How it works

Parsing: HTML + MathML → Markdown

OpenStaxAdapter.parse_page(url, html) turns a page into a PageRecord:

- MathML → LaTeX (mathml.py) for the element set OpenStax math books use (mi mn mo mrow msup msub msubsup mfrac msqrt mroot mtable …).
- HTML → Markdown (htmlmd.py) that preserves LaTeX through Markdown escaping (sentinel substitution) and references images by absolute URL.
- Page classification + routing (section / intro / summary / glossary / reference / skip). The entire content body, including worked examples, in-text notes, and practice problems, is kept inline in one Markdown document.
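To give a feel for the MathML → LaTeX step, here is a minimal recursive sketch covering only a handful of elements (mi, mn, mo, mrow, msup, msub, mfrac, msqrt). It is not the code in mathml.py; the function names are hypothetical, and it uses the standard-library XML parser rather than lxml.

```python
import xml.etree.ElementTree as ET


def _local(tag: str) -> str:
    """Drop an XML namespace such as '{http://www.w3.org/1998/Math/MathML}'."""
    return tag.rsplit("}", 1)[-1]


def to_latex(node: ET.Element) -> str:
    """Recursively convert a small subset of MathML to LaTeX."""
    tag = _local(node.tag)
    kids = [to_latex(child) for child in node]
    if tag in ("mi", "mn", "mo"):
        return (node.text or "").strip()          # leaf tokens carry their text
    if tag == "msup":
        return f"{{{kids[0]}}}^{{{kids[1]}}}"
    if tag == "msub":
        return f"{{{kids[0]}}}_{{{kids[1]}}}"
    if tag == "mfrac":
        return f"\\frac{{{kids[0]}}}{{{kids[1]}}}"
    if tag == "msqrt":
        return f"\\sqrt{{{''.join(kids)}}}"
    # math, mrow, and anything unrecognized: concatenate the children
    return "".join(kids)


def mathml_to_inline(mathml: str) -> str:
    """Wrap the converted expression as inline math ($…$)."""
    return f"${to_latex(ET.fromstring(mathml))}$"
```

For example, `<math><msqrt><mi>x</mi><mo>+</mo><mn>1</mn></msqrt></math>` converts to `$\sqrt{x+1}$`. A production converter also has to handle tables, roots with indices, operator spacing, and LaTeX escaping, none of which this sketch attempts.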

Crawling and fetching

- fetcher.py: site-agnostic HTTP with retry/backoff, a polite per-host delay + jitter, robots.txt enforcement, and a mandatory on-disk cache. Entries are stored as <url-hash>-<epoch>.html and re-used until older than --refresh-interval, then re-fetched. Every fetch returns a FetchResult; errors are captured, never raised.
- crawler.py: generic over the SiteAdapter. It discovers the full book TOC from the seed page, builds an ordered, deduplicated frontier (optionally narrowed to --chapters), then fetches/classifies/parses/enriches each page with content-hash dedup and per-page error isolation.
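Two of the fetcher's policies, cache freshness and retry delay, can be sketched in a few lines. Everything here is illustrative: the function names, the use of SHA-1 for the URL hash, the backoff base and cap, and the 10% jitter are assumptions, not values taken from fetcher.py.

```python
import hashlib
import random
from typing import Optional


def cache_name(url: str, fetched_epoch: int) -> str:
    """Build '<url-hash>-<epoch>.html'; sha1 is an assumed hash choice."""
    return f"{hashlib.sha1(url.encode('utf-8')).hexdigest()}-{fetched_epoch}.html"


def is_fresh(filename: str, now: float, refresh_interval: float = 432000) -> bool:
    """True while the entry is no older than the refresh interval (seconds)."""
    epoch = int(filename.rsplit(".", 1)[0].rsplit("-", 1)[1])
    return now - epoch <= refresh_interval


def retry_delay(attempt: int, retry_after: Optional[str] = None,
                base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    # Only the delay-in-seconds form of Retry-After is handled here, not HTTP dates.
    if retry_after and retry_after.strip().isdigit():
        return float(retry_after)
    backoff = min(base * 2 ** attempt, cap)             # exponential backoff
    return backoff + random.uniform(0, backoff * 0.1)   # small random jitter
```

Encoding the fetch time in the file name means freshness can be decided from a directory listing alone, without opening any cached file.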

Enrichment and idempotent output

- enrich.py fills quality signals generically: language (langdetect), reading_time_min (word_count / 200), and offline keywords (yake).
- writers.py writes idempotent JSONL (upsert-by-id, atomic rewrite) plus a per-book manifest.json of run metadata and counts.
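The upsert-by-id and atomic-rewrite idea can be sketched as follows. This is a minimal illustration, not the code in writers.py: `upsert_jsonl` is a hypothetical name, and comparing whole records for equality is an assumption (the real writer may well compare `content_hash` instead).

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict, Iterable


def upsert_jsonl(path: Path, records: Iterable[Dict]) -> bool:
    """Merge records into a JSONL file by 'id'; return True if the file changed."""
    existing: Dict[str, Dict] = {}
    if path.exists():
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    rec = json.loads(line)
                    existing[rec["id"]] = rec
    merged = dict(existing)          # dicts keep insertion order, so lines stay put
    for rec in records:
        merged[rec["id"]] = rec      # replace in place, or append if new
    if merged == existing:
        return False                 # nothing changed: leave the file untouched
    # Write a sibling temp file, then swap it in with a single rename.
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        for rec in merged.values():
            handle.write(json.dumps(rec, ensure_ascii=False) + "\n")
    os.replace(tmp, path)
    return True
```

Because the temp file lives in the same directory as the target, `os.replace` swaps it in atomically, so an interrupted run leaves either the old file or the new one, never a half-written mix.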

How Q&A pairing works

Pairing problems with solutions on OpenStax cannot be reliably hardcoded: solutions manuals appear under inconsistent names and positions, often cover only some problems, and number them out of order. What makes pairing tractable is that every solution's number is a back-link to the problem it solves, an <a class="os-number" … data-page-slug="…" data-page-fragment="…">. So --get-qa:

- Discovers the solutions-manual pages with a cheap LLM: the model reads the TOC and returns the solution-page slugs. Hallucinated slugs are dropped, so only real TOC leaves survive. The prompt is bundled at prompts/discover_solutions.md.
- Fetches the whole book, then on each solutions page finds every os-number back-link, takes its parent as the answer, and follows the link to the problem element on its page. Both halves are converted to Markdown with the same MathML→LaTeX pass as page bodies.
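The back-link lookup can be illustrated with the standard-library HTML parser. This sketch only collects each link's target slug, fragment, and displayed number; it does not extract the parent answer element, and the class and function names are hypothetical rather than taken from qa.py.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class BackLinkParser(HTMLParser):
    """Collect (page_slug, fragment, label) from <a class="os-number"> links."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str, str]] = []
        self._current: Optional[Tuple[str, str]] = None  # set while inside a back-link
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag == "a" and "os-number" in classes:
            self._current = (attr.get("data-page-slug") or "",
                             attr.get("data-page-fragment") or "")
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)        # the link text is the solution's number

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            slug, fragment = self._current
            self.links.append((slug, fragment, "".join(self._text).strip()))
            self._current = None


def find_back_links(html: str) -> List[Tuple[str, str, str]]:
    """Return every os-number back-link found in a solutions page."""
    parser = BackLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Each tuple identifies the page and element id of the problem, which is what lets a solution be joined to its question without relying on numbering order.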

Operational reference

Cache location

The on-disk page cache is a private speed/politeness optimization, not output, so it lives in one fixed location and is reused across runs regardless of where you write JSONL. The directory is resolved as:

$OPENSTAX_CACHE_DIR if set — an explicit override.
otherwise the per-user cache dir for this app (via platformdirs):
- Linux: ~/.cache/openstax-scraper (honors $XDG_CACHE_HOME)
- macOS: ~/Library/Caches/openstax-scraper
- Windows: %LOCALAPPDATA%\openstax-scraper\Cache

The cache is best-effort: caching is always attempted, but an unwritable path disables it rather than failing the run.
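The resolution order above might look roughly like this. The function names are hypothetical; `user_cache_dir` is the platformdirs call, and passing `appauthor=False` is an assumption chosen because it matches the Windows path listed above.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_cache_dir(env: Mapping[str, str] = os.environ) -> Path:
    """Return $OPENSTAX_CACHE_DIR if set, else the per-user cache directory."""
    override = env.get("OPENSTAX_CACHE_DIR")
    if override:
        return Path(override)
    # Imported lazily so the override branch works even without platformdirs.
    from platformdirs import user_cache_dir
    # appauthor=False drops the extra author folder on Windows.
    return Path(user_cache_dir("openstax-scraper", appauthor=False))


def cache_usable(directory: Path) -> bool:
    """Best-effort check: False means run without a cache rather than fail."""
    try:
        directory.mkdir(parents=True, exist_ok=True)
        probe = directory / ".write-test"
        probe.write_text("", encoding="utf-8")
        probe.unlink()
        return True
    except OSError:
        return False
```

A caller would resolve the directory once at startup and simply skip caching when `cache_usable` returns False.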

Exit codes

The CLI follows the rule "exit non-zero only on fatal config errors; per-page failures never fail the run."

Code	Meaning
`0`	Ran to completion. Individual pages that 404, time out, or fail to parse are isolated, counted in the summary (`failed=…`), and do not change the exit code.
`2`	A fatal config/environment error stopped the run before useful output: `--from-file` path missing, the seed page couldn't be fetched (empty frontier), a record came out off-contract under `--validate`, or `--get-qa` had no API key. These print a one-line error instead of a traceback and write nothing. `argparse` also exits `2` on bad/missing flags.
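The exit-code rule can be sketched as a thin wrapper around the run. The names `FatalConfigError` and `run_cli`, and the exact wording of the printed summary, are hypothetical and chosen only to mirror the table above.

```python
import sys
from typing import Callable


class FatalConfigError(Exception):
    """Hypothetical marker for errors that should stop the run with exit code 2."""


def run_cli(job: Callable[[], int]) -> int:
    """Run `job`, which returns the number of failed pages, and map it to an exit code."""
    try:
        failed = job()
    except FatalConfigError as exc:
        print(f"error: {exc}", file=sys.stderr)   # one line, no traceback
        return 2
    print(f"done failed={failed}")                # page failures are reported, not fatal
    return 0
```

Only the dedicated exception type maps to `2`; a job that reports failed pages still returns `0`, which keeps per-page problems from failing a scheduled run.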

Development

pip install -e ".[dev]"
pytest          # fully offline, against committed fixtures in tests/fixtures/
ruff check .

Project layout

src/openstax_scraper/
  mathml.py            # MathML → LaTeX
  htmlmd.py            # HTML → Markdown (math-aware)
  models.py            # PageRecord, QuestionAndAnswer data classes
  config.py            # runtime configuration (delay, get_qa, cache dir, …)
  enrich.py            # generic quality signals (language, reading time, keywords)
  fetcher.py           # site-agnostic HTTP: retry, throttle, cache, robots
  crawler.py           # orchestrator: TOC frontier → fetch/parse/enrich, dedup
  qa.py                # --get-qa orchestrator: pair problems with solutions
  llm.py               # tiny Anthropic wrapper (used by --get-qa)
  prompts.py           # locate/load bundled prompt templates
  writers.py           # idempotent upsert JSONL + manifest
  cli.py               # scrape_openstax entry point
  adapters/
    base.py            # SiteAdapter protocol + PageClass
    openstax.py        # all OpenStax-specific knowledge
  schemas/             # bundled JSON Schemas (output contract)
  prompts/             # bundled LLM prompt templates
scripts/               # diagnostic probes (live-site, dev-only)
tests/                 # offline tests + committed HTML fixtures

The adapter boundary keeps OpenStax specifics out of the generic pipeline: supporting a new site means adding one SiteAdapter, not touching the crawler.

License

MIT © Yoftahe Milkessa

This version: 0.1.3 (Jun 5, 2026)

