pmt-scraper
Download and organise PDFs from Physics & Maths Tutor pages.
Point it at any PMT page that lists PDF links and it scrapes every PDF, sorts them into folders, and downloads them politely (rate-limited, resumable, skips existing files).
Install
pip install requests beautifulsoup4
Usage
python pmt_scrape.py <url> [options]
Options
Output
| Flag | Default | Description |
|---|---|---|
| --out <dir> | downloads | Root output folder |
| --organise heading | ✓ | Group by section heading on the page |
| --organise path | | Mirror PMT's own folder structure |
| --organise flat | | All files in one folder |
| --delay <secs> | 1.0 | Pause between downloads (be polite) |
| --dry-run | | Print what would be saved, download nothing |
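As a rough illustration of how the three --organise modes could map a PDF onto a folder, here is a minimal sketch. The function name `output_path`, the "Uncategorised" fallback folder, and the example URL are assumptions made for illustration; the real logic lives in downloader.py and may differ (for instance, it may sanitise headings and filenames first).

```python
from pathlib import Path
from urllib.parse import unquote, urlparse


def output_path(root: str, mode: str, pdf_url: str, heading: str = "") -> Path:
    """Return where a PDF would be saved under one --organise mode (hypothetical helper)."""
    # Split the URL path into decoded segments; the last one is the filename.
    parts = [unquote(p) for p in urlparse(pdf_url).path.split("/") if p]
    filename = parts[-1]
    if mode == "heading":
        # Group by the section heading the link appeared under.
        # "Uncategorised" is an assumed fallback for links with no heading.
        return Path(root) / (heading.strip() or "Uncategorised") / filename
    if mode == "path":
        # Mirror the directory portion of the URL path.
        return Path(root).joinpath(*parts[:-1], filename)
    if mode == "flat":
        # Everything in the root folder.
        return Path(root) / filename
    raise ValueError(f"unknown organise mode: {mode!r}")


url = "https://example.com/download/Maths/2019/Paper%201%20MS.pdf"
for mode in ("heading", "path", "flat"):
    print(mode, "->", output_path("downloads", mode, url, heading="June 2019"))
```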
Filtering
| Flag | Description |
|---|---|
| --keywords k1 k2 … | Filter by keywords (see syntax below) |
| --years y1 y2 … | Keep only PDFs mentioning any of these years |
| --year-range FROM TO | Keep only PDFs whose year falls within FROM–TO (inclusive) |
--years and --year-range can be used together; both constraints must pass (AND).
Keyword syntax — prefix each token to control how it matches:
| Prefix | Meaning |
|---|---|
| word or +word | Must be present (positive) |
| -word | Must be absent (negative) |
Matching is case-insensitive and searches the section heading, link text, and filename.
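The keyword rules above can be sketched in a few lines. This is only an illustration under stated assumptions: `parse_keywords` and `matches` are hypothetical names, matching is treated as a plain substring test, and every positive token is required. Whether the real filters.py also normalises spaces or punctuation (so that markscheme matches "mark scheme") is not covered here.

```python
def parse_keywords(tokens):
    """Split raw tokens into (required, excluded) lists of lowercase words."""
    required, excluded = [], []
    for token in tokens:
        if token.startswith("-"):
            excluded.append(token[1:].lower())
        elif token.startswith("+"):
            required.append(token[1:].lower())
        elif token:
            # A bare word behaves like +word.
            required.append(token.lower())
    return required, excluded


def matches(tokens, heading, link_text, filename):
    """True if all positive tokens are present and no negative token is."""
    required, excluded = parse_keywords(tokens)
    # Search the section heading, link text, and filename together.
    haystack = " ".join((heading, link_text, filename)).lower()
    return (all(word in haystack for word in required)
            and not any(word in haystack for word in excluded))


print(matches(["+paper1", "-markscheme"], "June 2019", "Paper1 QP", "paper1-qp.pdf"))
```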
Years embedded in PMT's URL paths (e.g. .../2019/...) are detected automatically.
Undated files are always kept.
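The year rules can be pictured with a short sketch. The names `extract_years` and `passes_year_filters` are hypothetical, and two details are assumptions rather than documented behaviour: only four-digit years starting with 19 or 20 are recognised, and a file that mentions several years passes a filter if any one of them matches.

```python
import re

# Assumed pattern: a standalone four-digit year beginning with 19 or 20.
YEAR_RE = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def extract_years(*texts):
    """Collect every year found in the given strings (URL, heading, link text)."""
    return {int(year) for text in texts for year in YEAR_RE.findall(text)}


def passes_year_filters(found, years=None, year_range=None):
    """Apply --years and --year-range together; both must pass (AND)."""
    if not found:
        return True  # undated files are always kept
    if years and not (found & set(years)):
        return False
    if year_range:
        low, high = year_range
        if not any(low <= year <= high for year in found):
            return False
    return True


found = extract_years("https://example.com/download/2019/paper-1.pdf", "Paper 1 QP")
print(found, passes_year_filters(found, years=[2019, 2021], year_range=(2018, 2022)))
```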
Examples
# All papers, grouped by heading
python pmt_scrape.py https://www.physicsandmathstutor.com/maths-revision/a-level-papers/
# Mark schemes only
python pmt_scrape.py <url> --keywords "mark scheme"
# Mark schemes only (positive keyword)
python pmt_scrape.py <url> --keywords +markscheme
# Mark schemes, excluding question papers
python pmt_scrape.py <url> --keywords +markscheme -questions
# Papers from 2018 to 2022
python pmt_scrape.py <url> --year-range 2018 2022
# Mark schemes for specific years (combine --years and --year-range)
python pmt_scrape.py <url> --keywords +markscheme --years 2019 2021 2023 --year-range 2019 2023
# Paper 1 only, no mark schemes, preview before downloading
python pmt_scrape.py <url> --keywords +paper1 -markscheme --dry-run
# Mirror PMT's folder structure
python pmt_scrape.py <url> --organise path
Project structure
pmt scraper/
├── pmt_scrape.py # entry point
├── pmt_scraper/
│ ├── __init__.py
│ ├── cli.py # argument parsing and main loop
│ ├── scraper.py # page fetching and PDF link extraction
│ ├── downloader.py # file download and output path logic
│ ├── filters.py # keyword and year filtering
│ └── utils.py # filename sanitisation, URL helpers
└── downloads/ # default output folder
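To give a feel for the link-extraction step attributed to scraper.py, here is a dependency-free sketch. The project itself installs requests and beautifulsoup4; this example swaps in the standard library's `html.parser` so it runs anywhere. The class name `PdfLinkParser`, the helper `extract_pdf_links`, the choice of h1 to h4 as section headings, and the sample HTML are all assumptions for illustration, not PMT's actual page layout.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collect (heading, link text, absolute URL) for each link ending in .pdf."""

    HEADINGS = {"h1", "h2", "h3", "h4"}  # assumed heading levels

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._heading = ""        # most recent heading text seen
        self._in_heading = False
        self._href = None         # set while inside a PDF link
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._in_heading = True
            self._heading = ""
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            if urlparse(href).path.lower().endswith(".pdf"):
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._in_heading:
            self._heading += data
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False
        elif tag == "a" and self._href is not None:
            link_text = "".join(self._text).strip()
            self.links.append((self._heading.strip(), link_text, self._href))
            self._href = None


def extract_pdf_links(html, base_url):
    """Parse static HTML and return the PDF links found in it."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


sample = (
    "<h2>June 2019</h2><ul>"
    '<li><a href="/download/2019/paper-1-qp.pdf">Paper 1 QP</a></li>'
    '<li><a href="/about/">About</a></li></ul>'
)
print(extract_pdf_links(sample, "https://example.com/maths/"))
```

Because a parser like this only sees the HTML the server returns, links injected later by JavaScript never reach it.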
Notes
- Downloads use a .part suffix until complete, so interrupted runs are safe to resume.
- Files already present (non-zero size) are skipped automatically.
- Pages that load links via JavaScript will not work; PMT's static pages are fine.
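
The .part and skip-existing behaviour described in these notes can be sketched as follows. This is a minimal illustration, not the code in downloader.py: the names `already_downloaded` and `save_via_part_file` are hypothetical, and the data arrives as any iterable of byte chunks so the example needs no network access. With requests, such chunks could come from `response.iter_content(chunk_size=8192)` on a response fetched with `stream=True`.

```python
import os
from pathlib import Path
from typing import Iterable


def already_downloaded(dest: Path) -> bool:
    """A file counts as present only if it exists with a non-zero size."""
    return dest.is_file() and dest.stat().st_size > 0


def save_via_part_file(dest: Path, chunks: Iterable[bytes]) -> bool:
    """Write chunks to dest through a temporary .part file.

    Returns False if the file was skipped because it already exists.
    """
    if already_downloaded(dest):
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    part = dest.with_name(dest.name + ".part")
    with open(part, "wb") as handle:
        for chunk in chunks:
            handle.write(chunk)
    # Rename only after every chunk is written, so an interrupted run
    # leaves a .part file behind rather than a truncated PDF.
    os.replace(part, dest)
    return True
```

If the chunk source fails partway through, the final filename never appears, so the next run sees the file as missing and fetches it again.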