A polite scraper for NYC tech events from GarysGuide

garys_nyc_events

A polite, dependency-light Python library for extracting NYC tech events from GarysGuide.

This repository also includes a production-oriented data pipeline layer:

  • Scrape events
  • Persist to SQLite (runs, products, product_snapshots)
  • Run repeatedly on cron
  • Containerize scheduler + persistence with Docker named volume

Features

  • Scrapes the GarysGuide events page
  • Polite throttling and a browser-like User-Agent
  • Extracts title, date, price (including "FREE"), and URL
  • Includes a newsletter HTML fallback parser

Install

pip install garys_nyc_events

Pipeline Quick Start

One-shot run (scrape -> persist SQLite)

DB_PATH=./local_events.db poetry run garys-events-run-once

Verify DB

DB_PATH=./local_events.db ./scripts/verify_db.sh
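
Where a shell script is inconvenient, a similar check can be sketched in Python with the standard-library sqlite3 module. This is not the contents of scripts/verify_db.sh, which is not shown here. The helper name missing_tables is hypothetical; only the three table names come from the pipeline description above.

```python
import sqlite3

# Table names as listed in the pipeline description; their columns are not
# documented here, so this sketch checks only that the tables exist.
EXPECTED_TABLES = {"runs", "products", "product_snapshots"}


def missing_tables(conn: sqlite3.Connection) -> set[str]:
    """Return the expected tables that are absent from the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    present = {name for (name,) in rows}
    return EXPECTED_TABLES - present
```

Calling missing_tables(sqlite3.connect("./local_events.db")) after a one-shot run should return an empty set if the run created all three tables. Note that sqlite3.connect creates an empty file when the path does not exist, so a missing database shows up as all three tables missing rather than as an error.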

Scheduler in Docker (cron)

docker compose up --build -d
docker compose logs -f scheduler

SQLite data persists in named volume garys_events_data.

PyPI + Poetry Setup

poetry install

Publish to PyPI

poetry build
poetry publish

Releases (GitHub → PyPI)

This repo includes a GitHub Actions workflow that publishes to PyPI when you push a version tag.

  1. Add a GitHub repo secret named PYPI_API_TOKEN with your PyPI API token.
  2. Ensure tool.poetry.version in pyproject.toml is set.
  3. Create and push a matching tag:
git tag v0.2.0
git push origin v0.2.0

The workflow verifies the tag matches v{version}, runs tests, builds, checks the dist, then publishes.

Environment Variables (Config Contract)

Variable              | Default               | Purpose
CRON_SCHEDULE         | 0 */6 * * *           | Cron schedule for recurring runs
TZ                    | UTC                   | Timezone for cron runtime
SCRAPER_STRATEGY      | web                   | Scraper mode (currently web)
SCRAPER_SEARCH_TERM   | (empty)               | Optional keyword filter on event title
SCRAPER_LIMIT         | 0                     | Max events per run (0 = no limit)
DB_PATH               | /data/garys_events.db | SQLite file path
RETRY_ATTEMPTS        | 3                     | Retries for transient failures
RETRY_BACKOFF_SECONDS | 5                     | Linear backoff base seconds
API_TOKEN             | (empty)               | Reserved for future API strategy
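
To make the contract concrete, here is a minimal sketch of how these variables could be read with their documented defaults. It is not the package's own loader: PipelineConfig, load_config, and backoff_delay are hypothetical names, and reading "linear backoff" as base seconds multiplied by the attempt number is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class PipelineConfig:
    cron_schedule: str
    tz: str
    scraper_strategy: str
    scraper_search_term: str
    scraper_limit: int
    db_path: str
    retry_attempts: int
    retry_backoff_seconds: float
    api_token: str


def load_config(env: Mapping[str, str] = os.environ) -> PipelineConfig:
    """Build a config from `env`, falling back to the documented defaults."""
    return PipelineConfig(
        cron_schedule=env.get("CRON_SCHEDULE", "0 */6 * * *"),
        tz=env.get("TZ", "UTC"),
        scraper_strategy=env.get("SCRAPER_STRATEGY", "web"),
        scraper_search_term=env.get("SCRAPER_SEARCH_TERM", ""),
        scraper_limit=int(env.get("SCRAPER_LIMIT", "0")),
        db_path=env.get("DB_PATH", "/data/garys_events.db"),
        retry_attempts=int(env.get("RETRY_ATTEMPTS", "3")),
        retry_backoff_seconds=float(env.get("RETRY_BACKOFF_SECONDS", "5")),
        api_token=env.get("API_TOKEN", ""),
    )


def backoff_delay(config: PipelineConfig, attempt: int) -> float:
    """Seconds to wait before retry number `attempt` (1-based).

    Assumption: "linear" means the delay grows as base * attempt.
    """
    return config.retry_backoff_seconds * attempt
```

Passing the environment as a mapping keeps the loader easy to test: load_config({}) yields the defaults from the table without touching the real process environment.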

Run statuses written to runs.status:

  • success: no error
  • partial: some data + error
  • failure: no data + error
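
The three statuses follow from two facts about a run: whether an error occurred and whether any events were collected. A minimal sketch of that decision is below; the function name classify_run and its parameters are hypothetical, not the pipeline's actual code.

```python
from typing import Optional


def classify_run(event_count: int, error: Optional[BaseException]) -> str:
    """Map a run's outcome to the value stored in runs.status."""
    if error is None:
        return "success"  # no error, regardless of how many events were found
    # An error occurred: the status depends on whether any data was collected.
    return "partial" if event_count > 0 else "failure"
```

Under this reading, a run that finds zero events without raising is still recorded as success, since the rule above only requires the absence of an error.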

Publish to TestPyPI

poetry config repositories.testpypi https://test.pypi.org/legacy/
poetry publish -r testpypi

Configure PyPI Token (Recommended)

poetry config pypi-token.pypi YOUR_TOKEN

Usage

from garys_nyc_events import (
    GarysGuideScraper,
    get_events,
    get_events_ai_json,
    get_events_safe,
    parse_newsletter_html,
)

# Live scrape (polite delay included)
events = get_events()

# Safe mode: returns [] instead of raising on network errors
events = get_events_safe()

# JSON output of AI-related events (filtered by title)
ai_events_json = get_events_ai_json()
print(ai_events_json)

# Parse raw HTML from a newsletter export
raw_html = "<html>...your email html...</html>"
newsletter_events = parse_newsletter_html(raw_html)

# Class-based usage (custom delay)
scraper = GarysGuideScraper(delay_seconds=2.0)
events = scraper.get_events()

How the Scraper Works

  • Selects anchors where href contains /events/
  • Walks up to the nearest tr, li, div, or article to capture context
  • If the container is a table row, it uses the first cell for date and the last cell for price
  • Extracts prices using $ amounts or FREE
  • Normalizes relative URLs to full URLs
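
The link test, price extraction, and URL normalization steps can be sketched with the standard library alone. This is an illustration of the steps listed above, not the library's implementation: the helper names, the exact regular expression, and the BASE_URL value are all assumptions.

```python
import re
from typing import Optional
from urllib.parse import urljoin

# Assumed base URL for resolving relative links; adjust if the site differs.
BASE_URL = "https://www.garysguide.com/"

# A dollar amount such as $25, $1,250, or $19.99, or the word FREE.
PRICE_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?|\bFREE\b", re.IGNORECASE)


def is_event_link(href: str) -> bool:
    """True when the href points into the /events/ section."""
    return "/events/" in href


def extract_price(text: str) -> Optional[str]:
    """Return the first price found in `text`, or None if there is none."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    token = match.group(0)
    return "FREE" if token.upper() == "FREE" else token


def normalize_url(href: str, base: str = BASE_URL) -> str:
    """Resolve a relative href against `base`; absolute URLs pass through."""
    return urljoin(base, href)
```

urljoin leaves absolute URLs unchanged, so the same call is safe for both relative and fully qualified links. The container walk (tr, li, div, article) depends on the HTML parser in use and is left out of this sketch.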

Notes

  • The public API returns a list of dictionaries with keys: title, date, price, url, source.
  • The scraper is polite by default; adjust delay_seconds if needed.
  • The live end-to-end (E2E) test is disabled by default. Set RUN_E2E=1 to enable it.
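
Because events are plain dictionaries, they can be post-processed with ordinary list operations. The sketch below applies a title keyword and a cap in the spirit of SCRAPER_SEARCH_TERM and SCRAPER_LIMIT. The function filter_events is hypothetical, and case-insensitive substring matching is an assumption about how the title filter behaves.

```python
from typing import Dict, List

Event = Dict[str, str]  # keys: title, date, price, url, source


def filter_events(
    events: List[Event], search_term: str = "", limit: int = 0
) -> List[Event]:
    """Keep events whose title contains `search_term`, then cap the result.

    An empty `search_term` keeps everything; `limit` of 0 means no limit.
    """
    term = search_term.strip().lower()
    matched = [e for e in events if term in e.get("title", "").lower()]
    return matched[:limit] if limit > 0 else matched
```

For instance, filter_events(get_events_safe(), search_term="ai", limit=5) would return at most five events whose titles mention "ai", or an empty list if the scrape failed.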

Development

poetry install
poetry run pytest

Verify Build Artifacts

./scripts/verify_build.sh

Operations Docs

Contributing

See CONTRIBUTING.md.

Changelog

See CHANGELOG.md.

License

MIT
