# UCMercedule: Scraper
A Python module that scrapes [UC Merced class schedules][1] for you!

## API
Using this module entails two steps: (1) creating a `Schedule` instance and
(2) reading its data attributes. See below for more details.

### `ucmscraper.Schedule`
A `Schedule` instance is a fully parsed UC Merced schedule page for a given
term.

The `Schedule` class is a [record type/plain old data structure][4], meaning it
only structures data into fields and provides very little functionality on its
own. The `Term`, `Course`, and `Section` classes that compose `Schedule` follow
the same pattern. It is up to the client to implement their own functions for
handling these types.
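
As an illustration of such a client-side helper, the sketch below groups sections by course. The `Section` namedtuple here is a stand-in carrying only a subset of the fields documented under Attributes, `group_sections_by_course` is a hypothetical name, and the second sample row (CRN 30251) is made up for the demonstration. None of it is part of `ucmscraper` itself.

```python
from collections import OrderedDict, namedtuple

# Stand-in with a subset of the documented Section fields (assumption);
# real objects would come from schedule.sections.
Section = namedtuple(
    'Section', 'CRN department_code course_number number activity')


def group_sections_by_course(sections):
    """Map (department_code, course_number) to a list of its sections."""
    grouped = OrderedDict()
    for section in sections:
        key = (section.department_code, section.course_number)
        grouped.setdefault(key, []).append(section)
    return grouped


sample = (
    Section(30250, 'ANTH', '001', '01', 'LECT'),
    Section(30251, 'ANTH', '001', '02', 'DISC'),  # made-up row for the demo
    Section(34978, 'WRI', '131C', '01', 'SEM'),
)
grouped = group_sections_by_course(sample)
for (department_code, course_number), rows in grouped.items():
    print(department_code, course_number, [row.CRN for row in rows])
```

Because the result is keyed by plain tuples, the same helper should work with any sequence of objects exposing `department_code` and `course_number`.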

`Schedule`s can be created in three ways: two use a factory class method, and
one is the plain constructor.

#### 1. `ucmscraper.Schedule.fetch_latest()`
Performs an HTTP request and, if successful, returns a Schedule object for the
latest term (Fall 2019 at the time of writing).

#### 2. `ucmscraper.Schedule.fetch(term)`
Performs an HTTP request and, if successful, returns a Schedule object for the
given `Term` object. `Term`s should be retrieved via `ucmscraper.get_terms()`.

#### 3. `ucmscraper.Schedule(schedule_html)`
Parses `schedule_html` and returns a Schedule object.

#### Attributes
Schedule has the following data attributes:

`schedule.html` - a string of the raw HTML of the original schedule page

`schedule.term` - a `Term` object containing information about the term
associated with this `Schedule` instance.

`schedule.departments` - an [OrderedDict][2] whose keys are department codes and
whose values are the associated department titles, e.g.:
```
{
    'ANTH': 'Anthropology',
    'BEST': 'Bio Engin Small Scale Tech',
    'BIO': 'Biological Sciences',
    'BIOE': 'Bioengineering',
    ...
}
```
Keys follow the order that they appear in schedule pages, which is alphabetical.

`schedule.courses` - a tuple of `Course` [namedtuples][3] in the order that
courses appear on the schedule page, e.g.:
```
(
    Course(
        department_code='ANTH',
        number='001',
        title='Sociocultural Anthropology',
        units=4
    ),
    ...
    Course(
        department_code='WRI',
        number='131C',
        title='Undergraduate Research Journal',
        units=2
    )
)
```
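
Since `Course` objects are plain namedtuples, summaries are left to the client. The following is a minimal sketch of one such summary: total units offered per department. The `Course` definition is a stand-in that mirrors the fields shown above, and `units_by_department` is a hypothetical helper name, not something the module provides.

```python
from collections import OrderedDict, namedtuple

# Stand-in mirroring the Course fields shown above (assumption);
# real objects would come from schedule.courses.
Course = namedtuple('Course', 'department_code number title units')


def units_by_department(courses):
    """Sum course units per department, keeping first-seen order."""
    totals = OrderedDict()
    for course in courses:
        totals[course.department_code] = (
            totals.get(course.department_code, 0) + course.units)
    return totals


sample_courses = (
    Course('ANTH', '001', 'Sociocultural Anthropology', 4),
    Course('WRI', '131C', 'Undergraduate Research Journal', 2),
)
print(units_by_department(sample_courses))
```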

`schedule.sections` - a tuple of `Section` [namedtuples][3], each representing
one non-exam row from the schedule page, in the order that sections appear on
the schedule page, e.g.:
```
(
    Section(
        CRN=30250,
        department_code='ANTH',
        course_number='001',
        number='01',
        title='Sociocultural Anthropology',
        notes=('Must Also Register For A Corresponding Discussion',),
        activity='LECT',
        days='MW',
        start_time='1:30 PM',
        end_time='2:45 PM',
        location='ACS 120',
        instructor='DeLugan, Robin',
        max_seats=210,
        taken_seats=0,
        free_seats=210
    ),
    ...
    Section(
        CRN=34978,
        department_code='WRI',
        course_number='131C',
        number='01',
        title='Undergraduate Research Journal',
        notes=(),
        activity='SEM',
        days='W',
        start_time='9:30 AM',
        end_time='11:20 AM',
        location='CLSSRM 272',
        instructor='Staff',
        max_seats=20,
        taken_seats=0,
        free_seats=20
    )
)
```
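
Times are stored as strings, so any time arithmetic is up to the client. Here is a rough sketch of two helpers, one computing a section's length in minutes and one filtering for sections with free seats. It assumes every section's times follow the `'1:30 PM'` style shown above; rows with missing or differently formatted times would need extra handling. The `Section` stand-in keeps only the fields used here, and both function names are hypothetical.

```python
from collections import namedtuple
from datetime import datetime

# Stand-in with only the fields used below (assumption);
# real objects would come from schedule.sections.
Section = namedtuple('Section', 'CRN days start_time end_time free_seats')

TIME_FORMAT = '%I:%M %p'  # assumed from the examples, e.g. '1:30 PM'


def duration_minutes(section):
    """Length of one meeting of the section, in whole minutes."""
    start = datetime.strptime(section.start_time, TIME_FORMAT)
    end = datetime.strptime(section.end_time, TIME_FORMAT)
    return int((end - start).total_seconds() // 60)


def open_sections(sections):
    """Sections that still have at least one free seat."""
    return [section for section in sections if section.free_seats > 0]


sample_sections = (
    Section(30250, 'MW', '1:30 PM', '2:45 PM', 210),
    Section(34978, 'W', '9:30 AM', '11:20 AM', 20),
)
for section in open_sections(sample_sections):
    print(section.CRN, section.days, duration_minutes(section), 'minutes')
```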

### `ucmscraper.get_terms()`
When first called, performs an HTTP request and, if successful, returns an
[OrderedDict][2] of terms currently available for viewing via the
[official schedule search form][1]. Keys are `validterm` strings and values are
`Term` objects. Keys follow the same order as in the official schedule search
form.

Note: old terms no longer on the official schedule search form have their access
restricted, so this module cannot retrieve them. I may maintain schedule pages
from old terms, so contact me if you want access to them.

`Term` has the following data attributes:

`Term.code` - a string containing a `validterm` value from the
[official schedule search form][1]. When you choose a term via one of the
"Select a Term" radio buttons, you are selecting a `validterm` to be submitted
when you click "View Class Schedule".

`Term.name` - a string containing a term name associated with one of the
aforementioned radio buttons.
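
To pick a particular term out of the mapping returned by `get_terms()`, a small lookup by name can help. The sketch below is one possible approach: `find_term` is a hypothetical helper, the `Term` namedtuple is a stand-in with the two documented attributes, and the codes and names in `terms` are placeholders rather than real `validterm` values.

```python
from collections import OrderedDict, namedtuple

# Stand-in with the two documented attributes (assumption).
Term = namedtuple('Term', 'code name')


def find_term(terms, text):
    """Return the first Term whose name contains text (case-insensitive)."""
    text = text.lower()
    for term in terms.values():
        if text in term.name.lower():
            return term
    return None


# Placeholder data; real mappings would come from ucmscraper.get_terms().
terms = OrderedDict([
    ('TERM-A', Term('TERM-A', 'Spring 2019')),
    ('TERM-B', Term('TERM-B', 'Fall 2019')),
])
print(find_term(terms, 'fall'))
```

With the real module, the result could then be passed to `ucmscraper.Schedule.fetch(term)`, after checking that it is not `None`.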

## Installation
```
pipenv install ucmscraper
```

## Example usage
```python
import json
import pathlib
import ucmscraper

# Create example folder to store output files
pathlib.Path('./example').mkdir(exist_ok=True)

def get_last_value(ordered_dict):
    return next(reversed(ordered_dict.values()))

latest_term = get_last_value(ucmscraper.get_terms())
try:
    # Reuse a previously saved schedule page if one exists
    with open('example/{}.html'.format(latest_term.name), 'r') as f:
        schedule_html = f.read()
    schedule = ucmscraper.Schedule(schedule_html, latest_term)
except FileNotFoundError:
    # Otherwise fetch the latest schedule over HTTP
    schedule = ucmscraper.Schedule.fetch_latest()

# Encoder for iterables of namedtuples (not used below)
class NamedTupleIterEncoder(json.JSONEncoder):
    def default(self, o):
        return [t._asdict() for t in o]

term = schedule.term.name
with open('example/{}.html'.format(term), 'w') as f:
    f.write(schedule.html)
# OrderedDicts don't need sort_keys=True
with open('example/{} - Departments.json'.format(term), 'w') as f:
    json.dump(schedule.departments, f, indent=4)
# Convert each namedtuple to a dict so field names appear in the JSON
with open('example/{} - Courses.json'.format(term), 'w') as f:
    json.dump([t._asdict() for t in schedule.courses], f, indent=4)
with open('example/{} - Sections.json'.format(term), 'w') as f:
    json.dump([t._asdict() for t in schedule.sections], f, indent=4)
```
Check out the resulting schedule files in the [example folder](example/).

[1]: https://mystudentrecord.ucmerced.edu/pls/PROD/xhwschedule.p_selectsubject
[2]: https://docs.python.org/3.5/library/collections.html#collections.OrderedDict
[3]: https://docs.python.org/3.5/library/collections.html#collections.namedtuple
[4]: https://en.wikipedia.org/wiki/Record_(computer_science)
