Download and organise PDFs from Physics & Maths Tutor pages

pmt-scraper

Point it at any PMT page that lists PDF links and it scrapes every PDF, sorts them into folders, and downloads them politely (rate-limited, resumable, skips existing files).
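As a rough illustration of the scraping step, the sketch below collects every PDF link on a page together with the most recent section heading above it. It is not the project's code: the project lists requests and beautifulsoup4 as dependencies, whereas this sketch swaps in the standard-library html.parser so it runs with no extra installs. The names PdfLinkParser and extract_pdf_links are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect (heading, link text, absolute URL) for every PDF link."""

    HEADINGS = {"h1", "h2", "h3", "h4"}  # assumption: sections use these tags

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.heading = ""         # most recent section heading seen so far
        self.links = []           # list of (heading, link text, url)
        self._in_heading = False
        self._href = None         # set while inside an <a> that points at a PDF
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._in_heading = True
            self._text = []
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            # Ignore any query string when checking the extension.
            if href.lower().split("?")[0].endswith(".pdf"):
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._in_heading or self._href:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._in_heading:
            self.heading = "".join(self._text).strip()
            self._in_heading = False
        elif tag == "a" and self._href:
            self.links.append((self.heading, "".join(self._text).strip(), self._href))
            self._href = None


def extract_pdf_links(html, base_url):
    """Parse static HTML and return the PDF links it contains."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

Because the parser only sees the HTML it is given, links injected later by JavaScript would not be found by a sketch like this.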

Install

pip install requests beautifulsoup4

Usage

python pmt_scrape.py <url> [options]

Options

Output

Flag                 Default    Description
--out <dir>          downloads  Root output folder
--organise heading              Group by section heading on the page
--organise path                 Mirror PMT's own folder structure
--organise flat                 All files in one folder
--delay <secs>       1.0        Pause between downloads (be polite)
--dry-run                       Print what would be saved, download nothing
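To make the three --organise modes concrete, here is a minimal sketch of how an output path could be derived for each one. The function name output_path, the "Uncategorised" fallback folder, and the example URL are assumptions for illustration; the project's actual path logic may differ.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlparse


def output_path(root, mode, heading, url):
    """Return where a PDF would be saved under each --organise mode."""
    # Split the URL path into components and drop the leading "/".
    parts = PurePosixPath(unquote(urlparse(url).path)).parts[1:]
    filename = parts[-1]
    if mode == "heading":
        # Assumption: links with no heading go into a fallback folder.
        return Path(root) / (heading or "Uncategorised") / filename
    if mode == "path":
        # Mirror the folder structure found in the URL.
        return Path(root).joinpath(*parts)
    if mode == "flat":
        return Path(root) / filename
    raise ValueError(f"unknown organise mode: {mode}")
```

For a hypothetical URL ending in /download/Maths/2019/QP.pdf under the heading "Paper 1", the heading mode would give downloads/Paper 1/QP.pdf, the path mode downloads/download/Maths/2019/QP.pdf, and the flat mode downloads/QP.pdf.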

Filtering

Flag                  Description
--keywords k1 k2 …    Filter by keywords; see syntax below
--years y1 y2 …       Keep only PDFs mentioning any of these years
--year-range FROM TO  Keep only PDFs whose year falls within FROM–TO (inclusive)

--years and --year-range can be used together; both constraints must pass (AND).

Keyword syntax — prefix each token to control how it matches:

Prefix         Meaning
word or +word  Must be present (positive)
-word          Must be absent (negative)
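The prefix rules can be sketched as follows. This is an illustrative reading, not the project's filters.py: it assumes plain substring matching, assumes that every positive keyword must match (AND), and uses the hypothetical names parse_keywords and matches_keywords.

```python
def parse_keywords(tokens):
    """Split tokens into (required, forbidden) lists of lowercase words."""
    required, forbidden = [], []
    for token in tokens:
        if token.startswith("-"):
            forbidden.append(token[1:].lower())
        else:
            # A bare word and +word mean the same thing.
            required.append(token.lstrip("+").lower())
    return required, forbidden


def matches_keywords(tokens, heading, link_text, filename):
    """True if all required words appear and no forbidden word appears."""
    # Search heading, link text and filename together, ignoring case.
    haystack = " ".join([heading, link_text, filename]).lower()
    required, forbidden = parse_keywords(tokens)
    return (all(word in haystack for word in required)
            and not any(word in haystack for word in forbidden))
```

With substring matching, a token such as +markscheme would only match text that contains "markscheme" without a space, so how the real tool normalises spaces is worth checking with --dry-run.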

Keyword matching is case-insensitive and searches the section heading, link text, and filename. Years embedded in PMT's URL paths (e.g. .../2019/...) are detected automatically. Undated files always pass the year filters, so they are kept even when --years or --year-range is set.
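The year rules can be sketched in a few lines. This is a hedged illustration rather than the project's implementation: the regular expression (four-digit years starting with 19 or 20) and the names find_years and passes_year_filters are assumptions.

```python
import re

# Assumption: a "year" is a standalone four-digit number starting with 19 or 20.
YEAR_RE = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def find_years(*texts):
    """Return the set of years found in any of the given strings."""
    return {int(match) for text in texts for match in YEAR_RE.findall(text)}


def passes_year_filters(found, years=None, year_range=None):
    """Apply --years and --year-range; both must pass when both are given."""
    if not found:
        return True  # undated files are always kept
    if years and not (found & set(years)):
        return False
    if year_range:
        start, end = year_range
        if not any(start <= year <= end for year in found):
            return False
    return True
```

For example, a link whose URL contains /2019/ would pass --years 2019 2021 combined with --year-range 2018 2022, but would fail if --years listed only 2021.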

Examples

# All papers, grouped by heading
python pmt_scrape.py https://www.physicsandmathstutor.com/maths-revision/a-level-papers/

# Mark schemes only
python pmt_scrape.py <url> --keywords "mark scheme"

# Mark schemes only (positive keyword)
python pmt_scrape.py <url> --keywords +markscheme

# Mark schemes, excluding question papers
python pmt_scrape.py <url> --keywords +markscheme -questions

# Papers from 2018 to 2022
python pmt_scrape.py <url> --year-range 2018 2022

# Mark schemes for specific years (combine --years and --year-range)
python pmt_scrape.py <url> --keywords +markscheme --years 2019 2021 2023 --year-range 2019 2023

# Paper 1 only, no mark schemes, preview before downloading
python pmt_scrape.py <url> --keywords +paper1 -markscheme --dry-run

# Mirror PMT's folder structure
python pmt_scrape.py <url> --organise path

Project structure

pmt-scraper/
├── pmt_scrape.py          # entry point
├── pmt_scraper/
│   ├── __init__.py
│   ├── cli.py             # argument parsing and main loop
│   ├── scraper.py         # page fetching and PDF link extraction
│   ├── downloader.py      # file download and output path logic
│   ├── filters.py         # keyword and year filtering
│   └── utils.py           # filename sanitisation, URL helpers
└── downloads/             # default output folder

Notes

  • Downloads use a .part suffix until complete — interrupted runs are safe to resume.
  • Files already present (non-zero size) are skipped automatically.
  • Pages that load links via JavaScript will not work; PMT's static pages are fine.
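The first two notes describe a common pattern, sketched below under stated assumptions. The function names and the fetch callable are hypothetical; fetch stands in for whatever performs the HTTP request and yields the response body in chunks. In this sketch a leftover .part file from an interrupted run is simply overwritten on the next run, which may differ from how the project resumes.

```python
import os
import time
from pathlib import Path


def already_downloaded(dest):
    """True if dest exists with non-zero size, so it can be skipped."""
    dest = Path(dest)
    return dest.is_file() and dest.stat().st_size > 0


def save_via_part_file(dest, chunks):
    """Write chunks to '<dest>.part', then rename to dest once complete."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    part = dest.with_name(dest.name + ".part")
    with open(part, "wb") as fh:
        for chunk in chunks:
            fh.write(chunk)
    # The final name only appears after the whole file has been written.
    os.replace(part, dest)


def download_all(jobs, fetch, delay=1.0):
    """jobs: iterable of (url, dest); fetch(url) yields bytes (hypothetical)."""
    saved = []
    for url, dest in jobs:
        if already_downloaded(dest):
            continue  # skip files that are already present
        save_via_part_file(dest, fetch(url))
        saved.append(dest)
        time.sleep(delay)  # pause between downloads to stay polite
    return saved
```

With requests, fetch could be something like `lambda url: requests.get(url, stream=True, timeout=30).iter_content(8192)`, though error handling and status checks would still be needed.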

Download files

Source Distribution

pmt_scraper-0.1.0.tar.gz (7.4 kB)

Uploaded Source

Built Distribution

pmt_scraper-0.1.0-py3-none-any.whl (8.9 kB)

Uploaded Python 3

File details

Details for the file pmt_scraper-0.1.0.tar.gz.

File metadata

  • Download URL: pmt_scraper-0.1.0.tar.gz
  • Upload date:
  • Size: 7.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pmt_scraper-0.1.0.tar.gz
Algorithm    Hash digest
SHA256       f6ac24c67813c160eee3126f0bff7958c89ae65ffc7365ab4d7505ebeead6196
MD5          e1b777969be815ec3ae29a7f86300e40
BLAKE2b-256  f2fc7aac9c5c8847283abcfdd46cfdba44c9a0f7e2fd2b5261609fd069bc6d9b

Provenance

The following attestation bundles were made for pmt_scraper-0.1.0.tar.gz:

Publisher: publish.yml on yvanlok/pmt-scraper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pmt_scraper-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: pmt_scraper-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 8.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pmt_scraper-0.1.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       457133ced104c539a857d3756b041dc4fbc7a945930155b1da5c030b9e451fb3
MD5          9f9c9a6fea75048b66260b64bc44c177
BLAKE2b-256  eee12c0fba6c37eabd1d529c718cf3a1b2ebe9de9cc4317d958915ec07f5d3e7

Provenance

The following attestation bundles were made for pmt_scraper-0.1.0-py3-none-any.whl:

Publisher: publish.yml on yvanlok/pmt-scraper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
