Scrape OpenStax math textbooks into AI-ready JSON (Markdown + LaTeX), with optional problem/solution pairs.
openstax-scraper
Turn OpenStax textbooks into AI-ready JSON: one clean Markdown document per content page — LaTeX math preserved, practice problems inline, with rich metadata and quality signals — plus an optional mode that harvests problem ↔ solution pairs. The output is newline-delimited JSON (JSONL), ready to chunk and embed for retrieval-augmented generation (RAG), fine-tuning datasets, search indexes, or analysis.
The package installs a single command-line tool, scrape_openstax, and is also
usable as a library (import openstax_scraper).
Highlights
- MathML → LaTeX for the full element set OpenStax math books use, wrapped as $…$/$$…$$.
- HTML → Markdown that preserves LaTeX through Markdown escaping, references images by absolute URL (never downloads them), and normalizes whitespace.
- Polite, cached fetching: retry/backoff (429/5xx, Retry-After), per-host delay + jitter, a descriptive User-Agent, robots.txt enforcement, and a mandatory on-disk cache with a refresh interval (TTL).
- Idempotent output: upsert-by-id with atomic rewrite, so re-running on an unchanged book is a no-op and a changed page updates its line in place.
- Schema-validated: every record can be checked against bundled JSON Schemas before it is written (--validate).
- Per-page error isolation: one bad page never aborts a whole book.
Installation
Requires Python 3.10+.
pip install openstax-scraper
Or from a clone, in editable mode with the dev tools (pytest, ruff, build):
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
Dependencies
All runtime dependencies install automatically with the package:
| Package | Why it's needed |
|---|---|
| requests | HTTP fetching |
| lxml | HTML/MathML parsing |
| markdownify | HTML → Markdown conversion |
| jsonschema | --validate against the output schemas |
| langdetect | language enrichment signal |
| yake | offline keyword extraction (no LLM, no network) |
| platformdirs | resolves the per-user cache directory |
| anthropic | --get-qa solutions-manual discovery (an Anthropic API key is required only for that mode) |
Usage
There are two things the tool does:
- Scrape a whole textbook into a single JSONL file with enriched metadata.
- Extract question & answer pairs from a textbook into a separate JSONL file.
Any in-book page URL works as the seed — it is normalized to the book's preface, from which the full table of contents is discovered.
Scrape a textbook into JSONL
Output is <output-dir>/<book>/page_contents.jsonl (plus a manifest.json).
# Whole book
scrape_openstax \
--book-url=https://openstax.org/books/calculus-volume-1/pages/preface \
--output-dir ./out --validate
# Selected chapters only (comma-separated, no spaces)
scrape_openstax \
--book-url=https://openstax.org/books/calculus-volume-1/pages/1-1-review-of-functions \
--chapters=1,2,3 --output-dir ./out --validate
Re-running is idempotent: an unchanged book is a no-op, a changed page
updates its line in place. Add --dry-run to crawl and validate without writing.
Extract question-and-answer pairs
--get-qa is a separate mode that pairs each problem with its worked solution
and writes the pairs to <output-dir>/<book>/questions_and_answers.jsonl. It
covers the whole book (ignores --chapters) and needs an Anthropic API key to
discover the solutions-manual pages.
The key is resolved as --anthropic-api-key=<key> if given, otherwise the
ANTHROPIC_API_KEY environment variable (read from a local .env automatically,
or exported). If neither is set, the run exits with a fatal config error.
scrape_openstax \
--book-url=https://openstax.org/books/chemistry-2e/pages/preface \
--get-qa --output-dir ./out --validate
A book with no solutions manual produces no file and exits cleanly.
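The key lookup order described above can be sketched in a few lines. This is an illustration only, not the package's code: resolve_api_key and ConfigError are hypothetical names, and loading a local .env file is left out.

```python
import os


class ConfigError(Exception):
    """Hypothetical error type standing in for a fatal config error."""


def resolve_api_key(cli_value, environ=None):
    """Prefer the --anthropic-api-key value, then ANTHROPIC_API_KEY."""
    environ = os.environ if environ is None else environ
    key = cli_value or environ.get("ANTHROPIC_API_KEY")
    if not key:
        # Neither source supplied a key: the run cannot continue.
        raise ConfigError("--get-qa needs an Anthropic API key")
    return key
```

Passing the environment as a mapping keeps the sketch easy to exercise without touching the real process environment.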
Parse a single saved page (offline)
scrape_openstax \
--from-file page.html \
--url https://openstax.org/books/calculus-volume-1/pages/1-1-review-of-functions \
--output-dir ./out
Use as a library
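from pathlib import Path
from openstax_scraper.adapters.openstax import OpenStaxAdapter
url = "https://openstax.org/books/calculus-volume-1/pages/1-1-review-of-functions"
html = Path("page.html").read_text(encoding="utf-8")  # a page saved to disk
adapter = OpenStaxAdapter()
page = adapter.parse_page(url, html) # -> a PageRecord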
Command-line arguments
| Flag | Meaning |
|---|---|
| --book-url=<URL> | Any in-book page URL (/books/<book>/pages/...); crawls that book (normalized to its /pages/preface, whose TOC is discovered). Mutually exclusive with --from-file; one of the two is required. |
| --from-file=<path> | Parse a single local HTML file instead of crawling (offline). |
| --url=<URL> | The canonical URL to record when using --from-file (so id/source are right even though the bytes came from disk). |
| --output-dir=<dir> | Where to write output. Default ./out, or $OPENSTAX_OUTPUT_DIR if set. Files: <dir>/<book>/page_contents.jsonl + manifest.json (or questions_and_answers.jsonl under --get-qa). |
| --chapters=<csv> | Restrict the crawl to one or more chapters, comma-separated with no spaces (e.g. 11 or 1,2,3). Each keeps pages whose slug begins <n>- (e.g. 11-1, 11-2). Omit to crawl the whole book. |
| --get-qa | Collect problem/solution pairs into questions_and_answers.jsonl instead of crawling pages. Covers the whole book (ignores --chapters). Needs an Anthropic API key. A book with no solutions manual yields an empty result. |
| --anthropic-api-key=<key> | Anthropic API key for --get-qa. Takes precedence over the ANTHROPIC_API_KEY environment variable (and .env); if omitted, that variable is the fallback. If neither is set, --get-qa exits with a fatal error. Ignored outside --get-qa. |
| --delay=<sec> | Politeness delay between requests to the same host (default 1.0, plus jitter). |
| --refresh-interval=<sec> | How long a cached page stays fresh before it's re-fetched. Default 432000 (5 days). Caching is always on; see Cache location. |
| --no-robots | Skip robots.txt consultation (default: obey it). |
| --keywords=<mode> | heuristic (default, offline yake) or none. |
| --include-types=<csv> | Keep only these content_types (e.g. textbook_section,chapter_intro). Default: all. |
| --validate | Check every record against the bundled JSON Schemas before writing, and fail loudly if anything is off-contract. Recommended. |
| --dry-run | Crawl + parse + validate, but write nothing. Great for CI. |
| --user-agent=<str> | Override the polite identifying User-Agent sent on fetches. |
| --log-level=<lvl> | Logging verbosity (DEBUG/INFO/WARNING/…). Default INFO. |
| -h, --help | Print usage and exit. |
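The --chapters rule (keep pages whose slug begins <n>-) can be illustrated with a short sketch. The helper names are hypothetical and most of the slugs are invented for the example; the actual filter in the crawler may differ.

```python
def parse_chapters(csv):
    """Turn a value like "1,2,3" into [1, 2, 3]."""
    return [int(part) for part in csv.split(",")]


def in_chapters(slug, chapters):
    """True when the slug starts with "<n>-" for one of the chapters."""
    return any(slug.startswith(f"{n}-") for n in chapters)


# Illustrative slugs; only 1-1-review-of-functions appears in this README.
slugs = [
    "preface",
    "1-introduction",
    "1-1-review-of-functions",
    "2-1-a-preview-of-calculus",
    "11-1-parametric-equations",
]
selected = [s for s in slugs if in_chapters(s, parse_chapters("1,2"))]
```

Matching on "<n>-" rather than "<n>" is what keeps chapter 1 from also selecting chapter 11.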
Output schema
Newline-delimited JSON, one file per book. The authoritative contract is the
bundled JSON Schemas in
src/openstax_scraper/schemas/.
page_contents.jsonl — one object per content page:
| field | meaning |
|---|---|
| id | sha1(url), the stable primary key (idempotency) |
| url, title | page URL and section title |
| body_text | cleaned Markdown with $…$ / $$…$$ LaTeX: the full page, practice problems included inline |
| source | {site, book, book_title, chapter, section, page_slug} |
| content_type | one of textbook_section, chapter_intro, chapter_summary, glossary, reference |
| char_count, word_count, math_density, n_images, image_urls | structural quality signals |
| language, reading_time_min, keywords | enrichment signals |
| content_hash, fetched_at, scraper_version | provenance / change-detection |
Why are problems kept inline rather than split into their own records? OpenStax problems have no dependable structure (groups, sub-problems, irregular numbering), so a reliable split would require an LLM. Instead they are treated as ordinary page content: converted to Markdown and left inline in body_text, in reading order.
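As a rough illustration of consuming page_contents.jsonl downstream, the sketch below streams the records and keeps math-heavy sections. The function names and the 0.05 threshold are assumptions made for the example; the scale of math_density is not documented here, so the cutoff would need tuning.

```python
import json
from pathlib import Path


def load_records(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def math_heavy_sections(records, min_density=0.05):
    """Keep textbook sections whose math_density meets the threshold."""
    return [
        record
        for record in records
        if record["content_type"] == "textbook_section"
        and record["math_density"] >= min_density
    ]
```

A call such as math_heavy_sections(load_records("out/calculus-volume-1/page_contents.jsonl")) would then feed a chunking or embedding step.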
questions_and_answers.jsonl (from --get-qa) — one object per pair:
| field | meaning |
|---|---|
| id | sha1(question_url, fragment), the stable primary key |
| question, answer | the problem and its worked solution, both Markdown + LaTeX |
| source | {site, book, chapter, section, page_slug} of the question |
| question_url, answer_url | where the problem is stated / where the solution lives |
| question_fragment, label | the problem element's id; the solution's displayed number |
| content_hash, fetched_at, scraper_version | provenance / change-detection |
How it works
Parsing: HTML + MathML → Markdown
OpenStaxAdapter.parse_page(url, html) turns a page into a PageRecord:
- MathML → LaTeX (mathml.py) for the element set OpenStax math books use (mi mn mo mrow msup msub msubsup mfrac msqrt mroot mtable …).
- HTML → Markdown (htmlmd.py) that preserves LaTeX through Markdown escaping (sentinel substitution) and references images by absolute URL.
- Page classification + routing (section / intro / summary / glossary / reference / skip). The entire content body (worked examples, in-text notes, and practice problems) is kept inline in one Markdown document.
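To make the MathML → LaTeX step concrete, here is a deliberately small recursive converter. It is a sketch, not mathml.py: it uses the standard library instead of lxml, handles only a handful of the listed elements, and passes operator text through unchanged rather than mapping symbols to LaTeX commands.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix such as {http://...}mfrac."""
    return tag.rsplit("}", 1)[-1]


def to_latex(node):
    """Convert one MathML element (and its children) to a LaTeX string."""
    name = local(node.tag)
    kids = [to_latex(child) for child in node]
    if name in ("mi", "mn", "mo"):
        return (node.text or "").strip()
    if name == "mfrac":
        return r"\frac{%s}{%s}" % (kids[0], kids[1])
    if name == "msup":
        return "{%s}^{%s}" % (kids[0], kids[1])
    if name == "msub":
        return "{%s}_{%s}" % (kids[0], kids[1])
    if name == "msqrt":
        return r"\sqrt{%s}" % "".join(kids)
    # math, mrow, and anything unrecognized: concatenate the children.
    return "".join(kids)


def mathml_to_latex(mathml, display=False):
    """Wrap the converted body as $...$ (inline) or $$...$$ (display)."""
    body = to_latex(ET.fromstring(mathml))
    return f"$${body}$$" if display else f"${body}$"
```

A production converter would also need msubsup, mroot, mtable, and operator mapping, which is where most of the real complexity lives.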
Crawling and fetching
- fetcher.py: site-agnostic HTTP with retry/backoff, a polite per-host delay + jitter, robots.txt enforcement, and a mandatory on-disk cache: entries are stored as <url-hash>-<epoch>.html and re-used until older than --refresh-interval, then re-fetched. Every fetch returns a FetchResult; errors are captured, never raised.
- crawler.py: generic over the SiteAdapter: discovers the full book TOC from the seed page, builds an ordered, deduplicated frontier (optionally narrowed to --chapters), then fetches/classifies/parses/enriches each page with content-hash dedup and per-page error isolation.
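The cache naming scheme lends itself to a simple freshness check, sketched below. The function names are hypothetical, and using SHA-1 for the URL hash is an assumption; the description above only says <url-hash>.

```python
import hashlib
import time


def cache_name(url, fetched_epoch):
    """Build a <url-hash>-<epoch>.html file name for a fetched page."""
    # Assumption: SHA-1 hex digest; the real hash function is not stated.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"{digest}-{int(fetched_epoch)}.html"


def is_fresh(filename, refresh_interval, now=None):
    """True while the entry is no older than refresh_interval seconds."""
    now = time.time() if now is None else now
    stem = filename.rsplit(".", 1)[0]
    fetched_epoch = int(stem.rsplit("-", 1)[1])
    return (now - fetched_epoch) <= refresh_interval
```

Encoding the fetch time in the file name means freshness can be decided from a directory listing alone, without opening the file or trusting filesystem mtimes.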
Enrichment and idempotent output
- enrich.py fills quality signals generically: language (langdetect), reading_time_min (word_count / 200), and offline keywords (yake).
- writers.py writes idempotent JSONL (upsert-by-id, atomic rewrite) plus a per-book manifest.json of run metadata and counts.
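Upsert-by-id with an atomic rewrite might look roughly like the following. This is a sketch under assumptions, not writers.py: upsert_jsonl is a hypothetical name, and it compares whole records, whereas the real writer may compare only content_hash so that a fresh fetched_at does not count as a change.

```python
import json
import os
import tempfile


def upsert_jsonl(path, records):
    """Merge records into a JSONL file keyed by "id".

    Returns True if the file was rewritten, False if nothing changed.
    """
    existing = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    row = json.loads(line)
                    existing[row["id"]] = row
    merged = dict(existing)
    for record in records:
        # Re-assigning an existing key keeps its position, so a changed
        # page updates its line in place; new ids are appended.
        merged[record["id"]] = record
    if merged == existing:
        return False
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        for row in merged.values():
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
    # os.replace swaps the file in one step, so readers never see a
    # half-written file.
    os.replace(tmp_path, path)
    return True
```

Writing the temporary file in the same directory as the target matters: os.replace is only atomic when both paths are on the same filesystem.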
How Q&A pairing works
Pairing problems with solutions on OpenStax is impractical to hardcode:
solutions manuals appear under inconsistent names and positions, often cover only
some problems, and number them out of order. The trick is that every
solution's number is a back-link to the problem it solves, an
<a class="os-number" … data-page-slug="…" data-page-fragment="…">. So --get-qa:
- Discovers the solutions-manual pages with a cheap LLM: the model reads the TOC and returns the solution-page slugs (hallucinated slugs are dropped; only real TOC leaves survive). The prompt is bundled at prompts/discover_solutions.md.
- Fetches the whole book, then on each solutions page finds every os-number back-link, takes its parent as the answer, and follows the link to the problem element on its page. Both halves are converted to Markdown with the same MathML→LaTeX pass as page bodies.
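The back-link step can be sketched on a tiny, well-formed fragment. Everything beyond the os-number class and the two data- attributes is an assumption for illustration: the surrounding markup, the fragment id, and the find_backlinks name are invented, and the standard-library XML parser stands in for lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified solutions-page markup.
SAMPLE = """<section>
  <div>
    <a class="os-number" href="#" data-page-slug="1-1-review-of-functions"
       data-page-fragment="fs-id-example">3</a>
    <p>The domain is all real numbers.</p>
  </div>
</section>"""


def find_backlinks(solutions_xhtml):
    """Return one dict per os-number back-link found in the fragment."""
    root = ET.fromstring(solutions_xhtml)
    # ElementTree has no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    pairs = []
    for anchor in root.iter("a"):
        if "os-number" not in anchor.get("class", "").split():
            continue
        answer = parent_of[anchor]
        # Collect the parent's text, leaving out the number link itself.
        parts = [answer.text or ""]
        for child in answer:
            if child is not anchor:
                parts.extend(child.itertext())
            parts.append(child.tail or "")
        pairs.append({
            "label": (anchor.text or "").strip(),
            "question_slug": anchor.get("data-page-slug"),
            "question_fragment": anchor.get("data-page-fragment"),
            "answer_text": " ".join("".join(parts).split()),
        })
    return pairs
```

The slug and fragment together identify the problem element to fetch, so no problem numbering ever has to be parsed or matched.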
Operational reference
Cache location
The on-disk page cache is a private speed/politeness optimization, not output, so it lives in one constant place and is reused across runs regardless of where you write JSONL. The directory is resolved as:
- $OPENSTAX_CACHE_DIR if set (an explicit override).
- otherwise the per-user cache dir for this app (via platformdirs):
  - Linux: ~/.cache/openstax-scraper (honors $XDG_CACHE_HOME)
  - macOS: ~/Library/Caches/openstax-scraper
  - Windows: %LOCALAPPDATA%\openstax-scraper\Cache
The cache is best-effort: an unwritable path just disables it rather than failing the run.
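The resolution order above might be sketched as follows. resolve_cache_dir is a hypothetical name, and the exact platformdirs arguments the package passes are an assumption; appauthor=False is chosen here because it yields the Windows path listed above.

```python
import os
from pathlib import Path


def resolve_cache_dir(environ=None):
    """Return the cache directory: explicit override first, then per-user."""
    environ = os.environ if environ is None else environ
    override = environ.get("OPENSTAX_CACHE_DIR")
    if override:
        return Path(override)
    # Imported lazily so the override path works without the dependency.
    from platformdirs import user_cache_dir

    return Path(user_cache_dir("openstax-scraper", appauthor=False))
```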
Exit codes
The CLI follows the rule "exit non-zero only on fatal config errors; per-page failures never fail the run."
| Code | Meaning |
|---|---|
| 0 | Ran to completion. Individual pages that 404, time out, or fail to parse are isolated, counted in the summary (failed=…), and do not change the exit code. |
| 2 | A fatal config/environment error stopped the run before useful output: --from-file path missing, the seed page couldn't be fetched (empty frontier), a record came out off-contract under --validate, or --get-qa had no API key. These print a one-line error instead of a traceback and write nothing. argparse also exits 2 on bad/missing flags. |
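The exit-code rule can be pictured as two layers: a loop that isolates per-page failures and a top-level handler that maps fatal errors to 2. The sketch below is an assumption about the shape, not cli.py; FatalConfigError, crawl, and main are illustrative names.

```python
import sys


class FatalConfigError(Exception):
    """Hypothetical marker for errors that must stop the whole run."""


def crawl(pages, parse):
    """Parse each page, counting failures instead of raising them."""
    ok = failed = 0
    for page in pages:
        try:
            parse(page)
            ok += 1
        except Exception:
            # One bad page never aborts the book; it is only counted.
            failed += 1
    return ok, failed


def main(pages, parse):
    """Return 0 on completion, 2 on a fatal config/environment error."""
    try:
        if not pages:
            raise FatalConfigError("empty frontier: seed page not fetched")
        ok, failed = crawl(pages, parse)
    except FatalConfigError as exc:
        # A one-line error instead of a traceback.
        print(f"error: {exc}", file=sys.stderr)
        return 2
    print(f"done ok={ok} failed={failed}")
    return 0
```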
Development
pip install -e ".[dev]"
pytest # fully offline, against committed fixtures in tests/fixtures/
ruff check .
Project layout
src/openstax_scraper/
  mathml.py      # MathML → LaTeX
  htmlmd.py      # HTML → Markdown (math-aware)
  models.py      # PageRecord, QuestionAndAnswer data classes
  config.py      # runtime configuration (delay, get_qa, cache dir, …)
  enrich.py      # generic quality signals (language, reading time, keywords)
  fetcher.py     # site-agnostic HTTP: retry, throttle, cache, robots
  crawler.py     # orchestrator: TOC frontier → fetch/parse/enrich, dedup
  qa.py          # --get-qa orchestrator: pair problems with solutions
  llm.py         # tiny Anthropic wrapper (used by --get-qa)
  prompts.py     # locate/load bundled prompt templates
  writers.py     # idempotent upsert JSONL + manifest
  cli.py         # scrape_openstax entry point
  adapters/
    base.py      # SiteAdapter protocol + PageClass
    openstax.py  # all OpenStax-specific knowledge
  schemas/       # bundled JSON Schemas (output contract)
  prompts/       # bundled LLM prompt templates
scripts/         # diagnostic probes (live-site, dev-only)
tests/           # offline tests + committed HTML fixtures
The adapter boundary keeps OpenStax specifics out of the generic pipeline:
supporting a new site means adding one SiteAdapter, not touching the crawler.
License
MIT © Yoftahe Milkessa