
Asynchronous Python scraper for Celcat Calendar

Project description

Celcat Calendar Scraper 📆

An asynchronous Python library for scraping Celcat calendar systems.

Installation 🚀

pip install celcat-scraper

Features 🌟

  • Event attributes filtering 🔎
  • Async/await support for better performance 🔀
  • Rate limiting with adaptive backoff ⏳
  • Optional caching support 💾
  • Optional reusable aiohttp session ♻️
  • Automatic session management 🍪
  • Batch processing of events 📦
  • Error handling and retries 🚨
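As an illustration of the adaptive backoff mentioned above (the library implements its own internally), the usual pattern is an exponential delay that is capped and jittered so concurrent clients don't retry in lockstep. A minimal sketch of the idea, not the library's actual implementation:

```python
import random

# Illustration only: celcat_scraper handles backoff internally.
# The delay doubles with each failed attempt, is capped, and gets
# random jitter so retries from multiple clients spread out.
def next_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    delay = min(cap, base * (2 ** attempt))      # exponential growth, capped
    return delay * (0.5 + random.random() / 2)   # jitter: 50-100% of delay
```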

Usage ⚙️

Basic example of retrieving calendar events:

import asyncio
from datetime import date, timedelta
from celcat_scraper import CelcatConfig, CelcatScraperAsync

async def main():
    # Configure the scraper
    config = CelcatConfig(
        url="https://university.com/calendar",
        username="your_username",
        password="your_password",
        include_holidays=True,
    )

    # Create scraper instance and get events
    async with CelcatScraperAsync(config) as scraper:
        start_date = date.today()
        end_date = start_date + timedelta(days=30)

        # Storing events locally is recommended to reduce the number of requests
        file_path = "store.json"
        events = scraper.deserialize_events(file_path)

        events = await scraper.get_calendar_events(
            start_date, end_date, previous_events=events
        )

        for event in events:
            print(f"Event {event['id']}")
            print(f"Course: {event['category']} - {event['course']}")
            print(f"Time: {event['start']} to {event['end']}")
            print(f"Location: {', '.join(event['rooms'])} at {', '.join(event['sites'])} - {event['department']}")
            print(f"Professors: {', '.join(event['professors'])}")
            print("---")

        # Save events for a future refresh
        scraper.serialize_events(events, file_path)

if __name__ == "__main__":
    asyncio.run(main())
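Passing previous_events lets the scraper reuse cached data. If you ever need to combine a cached list with freshly fetched events yourself, a simple merge keyed on the event id (one of the fields printed in the loop above) could look like the sketch below. Note that merge_events is a hypothetical helper, not part of the library:

```python
# Hypothetical helper (not part of celcat_scraper): merge a cached event
# list with freshly fetched events, letting the fresh copy win on conflicts.
# Events are plain dicts carrying at least an 'id' key, as printed above.
def merge_events(cached: list[dict], fresh: list[dict]) -> list[dict]:
    merged = {event["id"]: event for event in cached}       # index cache by id
    merged.update({event["id"]: event for event in fresh})  # fresh overrides
    return list(merged.values())
```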

Filtering 🔍

Celcat Calendar data is often messy and usually needs processing before it can be used. For example, the same course may appear under several different names across events. Filtering standardizes these attributes.
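To illustrate the problem the filters solve (the library does this internally), a manual replacements map might collapse variant course names like so:

```python
# Illustrative only: a replacements map collapses variant course names.
# The library's filters handle this for you; this just shows the idea.
def standardize(name: str, replacements: dict[str, str]) -> str:
    # Case-insensitive lookup so "ENGLISH - S2" and "English - S2" match
    for original, clean in replacements.items():
        if name.casefold() == original.casefold():
            return clean
    return name

replacements = {"English - S2": "English", "Mathematics": "Maths"}
courses = ["English - S2", "ENGLISH - S2", "Mathematics", "Physics"]
print(sorted({standardize(c, replacements) for c in courses}))
# ['English', 'Maths', 'Physics']
```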

Usage ⚙️

ℹ️ Info: Each filter argument is optional. When course_strip_redundant is enabled, using remembered_strips is recommended.

⚠️ Warning: Disabling a filter does not undo its changes; to revert, discard your previously stored events and refetch them.

import asyncio
from datetime import date, timedelta
import json
from celcat_scraper import CelcatFilterConfig, FilterType, CelcatConfig, CelcatScraperAsync

async def main():
    # Load remembered_strips from a file
    remembered_strips = []
    try:
        with open("remembered_strips.json", "r") as f:
            remembered_strips = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        remembered_strips = []

    # Create a list of manual course replacements
    course_replacements = {"English - S2": "English", "Mathematics": "Maths"}

    # Configure a filter
    filter_config = CelcatFilterConfig(
        filters={
            FilterType.COURSE_TITLE,
            FilterType.COURSE_STRIP_MODULES,
            FilterType.COURSE_STRIP_CATEGORY,
            FilterType.COURSE_STRIP_PUNCTUATION,
            FilterType.COURSE_GROUP_SIMILAR,
            FilterType.COURSE_STRIP_REDUNDANT,
            FilterType.PROFESSORS_TITLE,
            FilterType.ROOMS_TITLE,
            FilterType.ROOMS_STRIP_AFTER_NUMBER,
            FilterType.SITES_TITLE,
            FilterType.SITES_REMOVE_DUPLICATES,
        },
        course_remembered_strips=remembered_strips,
        course_replacements=course_replacements,
    )

    config = CelcatConfig(
        url="https://university.com/calendar",
        username="your_username",
        password="your_password",
        include_holidays=True,
        # Pass the filter as an argument
        filter_config=filter_config,
    )

    async with CelcatScraperAsync(config) as scraper:
        start_date = date.today()
        end_date = start_date + timedelta(days=30)

        events = scraper.deserialize_events("store.json")
        events = await scraper.get_calendar_events(
            start_date, end_date, previous_events=events
        )

        scraper.serialize_events(events, "store.json")

    # Save the updated remembered_strips back to file
    with open("remembered_strips.json", "w") as f:
        json.dump(scraper.filter_config.course_remembered_strips, f)

if __name__ == "__main__":
    asyncio.run(main())
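Since remembered_strips grows over time and is rewritten on every run, you may want the save step to be crash-safe. One option (not part of the library; save_json_atomic is a hypothetical helper) is an atomic write through a temporary file:

```python
import json
import os
import tempfile

# Optional hardening (not part of celcat_scraper): write JSON atomically
# so an interrupted write can't leave a corrupted remembered_strips file.
def save_json_atomic(path: str, data) -> None:
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
        os.replace(tmp_path, path)  # atomic rename on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)
        raise
```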

Download files

Source Distribution

celcat_scraper-1.1.0.tar.gz (14.8 kB)

Built Distribution

celcat_scraper-1.1.0-py3-none-any.whl (16.6 kB)

File details

Details for the file celcat_scraper-1.1.0.tar.gz.

File metadata

  • Download URL: celcat_scraper-1.1.0.tar.gz
  • Upload date:
  • Size: 14.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for celcat_scraper-1.1.0.tar.gz:

  • SHA256: 40574b862ba7e59ea6d64b7dcd2dc214836275eada78aa387dfa514ca27e3e2c
  • MD5: b731dca6db0145ef2e5b419c64fa22fd
  • BLAKE2b-256: 0cbda51bfe83237504ceb2456d6f9b63da7d41e68bb84bed2213602f9281265e

Provenance

The following attestation bundles were made for celcat_scraper-1.1.0.tar.gz:

Publisher: python-publish.yml on etiennec78/celcat-scraper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file celcat_scraper-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: celcat_scraper-1.1.0-py3-none-any.whl
  • Upload date:
  • Size: 16.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for celcat_scraper-1.1.0-py3-none-any.whl:

  • SHA256: 7da15a8a86883c1a9a5a80a6fb56d2ea1f8dd26385e852debc260680a3166c81
  • MD5: a6d8f5a4555c7c58f458289e60e67ead
  • BLAKE2b-256: 2e9bb47c80da81a7d3aa3d3b9fa16112669cf4e6c0780d23338347f1b644adcd

Provenance

The following attestation bundles were made for celcat_scraper-1.1.0-py3-none-any.whl:

Publisher: python-publish.yml on etiennec78/celcat-scraper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
