Module for scraping UC Merced's class schedules
Project description
# UCMercedule: Scraper
A Python module that scrapes [UC Merced class schedules][1] for you!
## API
Using this module pretty much just entails 1. creating a Schedule instance and
2. reading its data attributes; see below for more details.
### `ucmscraper.Schedule`
A `Schedule` instance object is a fully parsed UC Merced schedule page from a
given term.
The `Schedule` class is a [record type/plain old data structure][4], meaning it
really only structures data into fields and provides very little functionality
on its own. The `Term`, `Course`, and `Section` classes that compose `Schedule`
follow the same vein. It is up to the client to implement their own functions
for handling these types.
`Schedule`s can created in three ways: two involve a factory class method, and
one is a plain constructor.
#### 1. `ucmscraper.Schedule.fetch_latest()`
Performs an HTTP request and, if successful, returns a Schedule object for the
latest term (Fall 2019 at the time of writing).
#### 2. `ucmscraper.Schedule.fetch(term)`
Performs an HTTP request and, if successful, returns a Schedule object for the
given `Term` object. `Term`s should be retrieved via `ucmscraper.get_terms()`.
#### 3. `ucmscraper.Schedule(schedule_html)`
Parses `schedule_html` and returns a Schedule object.
#### Attributes
Schedule has the following data attributes:
`schedule.html` - a string of the raw HTML of the original schedule page
`schedule.term` - a `Term` object containing information about the term
associated with this `Schedule` instance.
`schedule.departments` - an [OrderedDict][2] whose keys are department codes and
whose values are the associated department titles, e.g.:
```
{
'ANTH': 'Anthropology',
'BEST': 'Bio Engin Small Scale Tech',
'BIO': 'Biological Sciences',
'BIOE': 'Bioengineering',
...
}
```
Keys follow the order that they appear in schedule pages, which is alphabetical.
`schedule.courses` - a tuple of `Course` [namedtuples](3) in the order that
courses appear on the schedule page, e.g.
```
(
Course(
department_code='ANTH',
number='001',
title='Sociocultural Anthropology',
units=4
),
...
Course(
department_code='WRI',
number='131C',
title='Undergraduate Research Journal',
units=2
)
)
```
`schedule.sections` - a tuple of `Section` [namedtuples](3), each representing
one non-exam row from the schedule page, and in the order that sections appear
on the schedule page, e.g.:
```
(
Section(
CRN=30250,
department_code='ANTH',
course_number='001',
number='01',
title='Sociocultural Anthropology',
notes=('Must Also Register For A Corresponding Discussion',),
activity='LECT',
days='MW',
start_time='1:30 PM',
end_time='2:45 PM',
location='ACS 120',
instructor='DeLugan, Robin',
max_seats=210,
taken_seats=0,
free_seats=210
),
...
Section(
CRN=34978,
department_code='WRI',
course_number='131C',
number='01',
title='Undergraduate Research Journal',
notes=(),
activity='SEM',
days='W',
start_time='9:30 AM',
end_time='11:20 AM',
location='CLSSRM 272',
instructor='Staff',
max_seats=20,
taken_seats=0,
free_seats=20
)
)
```
### `ucmscraper.get_terms()`
When first called, performs an HTTP request and if successful, returns an
an [OrderedDict][2] of terms currently available for viewing via the
[official schedule search form][1]. Keys are `validterm` strings and values are
`Term` objects. Keys follow the same order as in the official schedule search
form.
Note: old terms no longer on the official schedule search form have their access
restricted, so this module cannot retrieve them. I may maintain schedule pages
from old terms, so contact me if you want access to them.
`Term` has the following data attributes:
`Term.code` - a string containing a `validterm` value from the
[official schedule search form][1]. When you choose a term via one of the
"Select a Term" radio buttons, you are selecting a `validterm` to be submitted
when you click "View Class Schedule".
`Term.name` - a string containing a term name associated with one of the
aforementioned radio buttons.
## Installation
```
pipenv install ucmscraper
```
## Example usage
```python
import json
import pathlib
import ucmscraper
# Create example folder to store output files
pathlib.Path('./example').mkdir(exist_ok=True)
def get_last_value(ordered_dict):
return next(reversed(ordered_dict.values()))
latest_term = get_last_value(ucmscraper.get_terms())
try:
with open('example/{}.html'.format(latest_term.name), 'r') as f:
schedule_html = f.read()
schedule = ucmscraper.Schedule(schedule_html, latest_term)
except FileNotFoundError:
schedule = ucmscraper.Schedule.fetch_latest()
class NamedTupleIterEncoder(json.JSONEncoder):
def default(self, o):
return [t._asdict() for t in o]
term = schedule.term.name
with open('example/{}.html'.format(term), 'w') as f:
f.write(schedule.html)
# OrderedDicts don't need sort_keys=True
with open('example/{} - Departments.json'.format(term), 'w') as f:
json.dump(schedule.departments, f, indent=4)
with open('example/{} - Courses.json'.format(term), 'w') as f:
json.dump([t._asdict() for t in schedule.courses], f, indent=4)
with open('example/{} - Sections.json'.format(term), 'w') as f:
json.dump([t._asdict() for t in schedule.sections], f, indent=4)
```
Check out the resulting schedule files in the [example folder](example/).
[1]: https://mystudentrecord.ucmerced.edu/pls/PROD/xhwschedule.p_selectsubject
[2]: https://docs.python.org/3.5/library/collections.html#collections.OrderedDict
[3]: https://docs.python.org/3.5/library/collections.html#collections.namedtuple
[4]: https://en.wikipedia.org/wiki/Record_(computer_science)
A Python module that scrapes [UC Merced class schedules][1] for you!
## API
Using this module pretty much just entails 1. creating a Schedule instance and
2. reading its data attributes; see below for more details.
### `ucmscraper.Schedule`
A `Schedule` instance object is a fully parsed UC Merced schedule page from a
given term.
The `Schedule` class is a [record type/plain old data structure][4], meaning it
really only structures data into fields and provides very little functionality
on its own. The `Term`, `Course`, and `Section` classes that compose `Schedule`
follow the same vein. It is up to the client to implement their own functions
for handling these types.
`Schedule`s can created in three ways: two involve a factory class method, and
one is a plain constructor.
#### 1. `ucmscraper.Schedule.fetch_latest()`
Performs an HTTP request and, if successful, returns a Schedule object for the
latest term (Fall 2019 at the time of writing).
#### 2. `ucmscraper.Schedule.fetch(term)`
Performs an HTTP request and, if successful, returns a Schedule object for the
given `Term` object. `Term`s should be retrieved via `ucmscraper.get_terms()`.
#### 3. `ucmscraper.Schedule(schedule_html)`
Parses `schedule_html` and returns a Schedule object.
#### Attributes
Schedule has the following data attributes:
`schedule.html` - a string of the raw HTML of the original schedule page
`schedule.term` - a `Term` object containing information about the term
associated with this `Schedule` instance.
`schedule.departments` - an [OrderedDict][2] whose keys are department codes and
whose values are the associated department titles, e.g.:
```
{
'ANTH': 'Anthropology',
'BEST': 'Bio Engin Small Scale Tech',
'BIO': 'Biological Sciences',
'BIOE': 'Bioengineering',
...
}
```
Keys follow the order that they appear in schedule pages, which is alphabetical.
`schedule.courses` - a tuple of `Course` [namedtuples](3) in the order that
courses appear on the schedule page, e.g.
```
(
Course(
department_code='ANTH',
number='001',
title='Sociocultural Anthropology',
units=4
),
...
Course(
department_code='WRI',
number='131C',
title='Undergraduate Research Journal',
units=2
)
)
```
`schedule.sections` - a tuple of `Section` [namedtuples](3), each representing
one non-exam row from the schedule page, and in the order that sections appear
on the schedule page, e.g.:
```
(
Section(
CRN=30250,
department_code='ANTH',
course_number='001',
number='01',
title='Sociocultural Anthropology',
notes=('Must Also Register For A Corresponding Discussion',),
activity='LECT',
days='MW',
start_time='1:30 PM',
end_time='2:45 PM',
location='ACS 120',
instructor='DeLugan, Robin',
max_seats=210,
taken_seats=0,
free_seats=210
),
...
Section(
CRN=34978,
department_code='WRI',
course_number='131C',
number='01',
title='Undergraduate Research Journal',
notes=(),
activity='SEM',
days='W',
start_time='9:30 AM',
end_time='11:20 AM',
location='CLSSRM 272',
instructor='Staff',
max_seats=20,
taken_seats=0,
free_seats=20
)
)
```
### `ucmscraper.get_terms()`
When first called, performs an HTTP request and if successful, returns an
an [OrderedDict][2] of terms currently available for viewing via the
[official schedule search form][1]. Keys are `validterm` strings and values are
`Term` objects. Keys follow the same order as in the official schedule search
form.
Note: old terms no longer on the official schedule search form have their access
restricted, so this module cannot retrieve them. I may maintain schedule pages
from old terms, so contact me if you want access to them.
`Term` has the following data attributes:
`Term.code` - a string containing a `validterm` value from the
[official schedule search form][1]. When you choose a term via one of the
"Select a Term" radio buttons, you are selecting a `validterm` to be submitted
when you click "View Class Schedule".
`Term.name` - a string containing a term name associated with one of the
aforementioned radio buttons.
## Installation
```
pipenv install ucmscraper
```
## Example usage
```python
import json
import pathlib
import ucmscraper
# Create example folder to store output files
pathlib.Path('./example').mkdir(exist_ok=True)
def get_last_value(ordered_dict):
return next(reversed(ordered_dict.values()))
latest_term = get_last_value(ucmscraper.get_terms())
try:
with open('example/{}.html'.format(latest_term.name), 'r') as f:
schedule_html = f.read()
schedule = ucmscraper.Schedule(schedule_html, latest_term)
except FileNotFoundError:
schedule = ucmscraper.Schedule.fetch_latest()
class NamedTupleIterEncoder(json.JSONEncoder):
def default(self, o):
return [t._asdict() for t in o]
term = schedule.term.name
with open('example/{}.html'.format(term), 'w') as f:
f.write(schedule.html)
# OrderedDicts don't need sort_keys=True
with open('example/{} - Departments.json'.format(term), 'w') as f:
json.dump(schedule.departments, f, indent=4)
with open('example/{} - Courses.json'.format(term), 'w') as f:
json.dump([t._asdict() for t in schedule.courses], f, indent=4)
with open('example/{} - Sections.json'.format(term), 'w') as f:
json.dump([t._asdict() for t in schedule.sections], f, indent=4)
```
Check out the resulting schedule files in the [example folder](example/).
[1]: https://mystudentrecord.ucmerced.edu/pls/PROD/xhwschedule.p_selectsubject
[2]: https://docs.python.org/3.5/library/collections.html#collections.OrderedDict
[3]: https://docs.python.org/3.5/library/collections.html#collections.namedtuple
[4]: https://en.wikipedia.org/wiki/Record_(computer_science)
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
ucmscraper-2.1.0.tar.gz
(6.6 kB
view hashes)
Built Distribution
Close
Hashes for ucmscraper-2.1.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 63dd8d1209aad69790095778c7808ebbb631d67fc4cebbf7a21078c663570122 |
|
MD5 | df9146d1efcd910512bd8d9085088136 |
|
BLAKE2b-256 | 487f9eea0ce46b52c1e8fad791276d2295fd4d94840cc670ecc1d5e6667eb6ee |