
Pythonic data engine for Texas education datasets


TEA Data Engine (teadata)

teadata is a snapshot-first Python engine for Texas education data. It provides:

  • District and Campus domain models
  • a fluent query DSL using >>
  • geospatial lookups (nearest charter, campuses in district boundaries, private-school overlap)
  • config-driven enrichment from TAPR, accountability, transfers, PEIMS financials, and closure datasets
  • sidecar sqlite stores for fast boundary/map/entity lookup

Installation

PyPI

pip install teadata

Development (recommended)

git clone https://github.com/adpena/teadata.git
cd teadata
uv sync --all-extras

Quick Start

from teadata import DataEngine

# Preferred runtime path: load the latest discovered snapshot.
engine = DataEngine.from_snapshot(search=True)

# District lookup by district number, campus number, or name.
aldine = engine.get_district("101902")
print(aldine.name)

# Campuses physically inside district boundaries.
for campus in aldine.campuses[:5]:
    print(campus.name, campus.campus_number)

Public API Surface

Primary imports:

from teadata import DataEngine, District, Campus

Core behaviors:

  • DataEngine.from_snapshot(...) supports .pkl and .pkl.gz snapshots and multiple payload shapes.
  • Snapshot discovery checks explicit paths, env vars, package .cache, and parent .cache directories.
  • District and Campus support dynamic metadata attributes through meta.
  • Campus.to_dict() always includes percent_enrollment_change (numeric when available, otherwise "N/A").
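The `percent_enrollment_change` invariant can be illustrated with a minimal stand-in (a sketch of the documented guarantee, not the actual `Campus` implementation):

```python
# Sketch of the serialization invariant: the key is always present in the
# output dict, numeric when a value exists, otherwise the literal string "N/A".
def serialize_campus(name, pct_change=None):
    """Hypothetical helper mirroring Campus.to_dict()'s documented guarantee."""
    return {
        "name": name,
        "percent_enrollment_change": pct_change if pct_change is not None else "N/A",
    }

print(serialize_campus("ALDINE H S", 3.2))   # numeric value preserved
print(serialize_campus("NEW CAMPUS"))        # falls back to "N/A"
```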

Snapshot and Asset Behavior

teadata is intentionally cache-first.

Artifacts typically used at runtime:

  • repo_*.pkl / repo_*.pkl.gz (engine snapshot)
  • boundaries_*.sqlite (boundary WKB sidecar)
  • map_payloads_*.sqlite (map payload sidecar)
  • entities_*.sqlite (entity lookup sidecar)

If snapshot/store files are Git LFS pointers or missing locally, runtime asset resolvers can fetch real files when URL env vars are provided.
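Detecting an LFS pointer is straightforward because pointer files are small text files that begin with the LFS spec line, while real snapshots are binary pickles. A sketch of that check (illustrative, not teadata's actual resolver):

```python
# Git LFS pointer files start with this well-known spec line; a real
# snapshot (.pkl/.pkl.gz) is binary and will never match it.
LFS_PREFIX = b"version https://git-lfs.github.com/spec/v1"

def looks_like_lfs_pointer(head: bytes) -> bool:
    """Return True if the first bytes of a file match the LFS pointer header."""
    return head.startswith(LFS_PREFIX)

def is_lfs_pointer(path: str) -> bool:
    """Read just enough of the file to classify it; missing files are not pointers."""
    try:
        with open(path, "rb") as f:
            return looks_like_lfs_pointer(f.read(len(LFS_PREFIX)))
    except OSError:
        return False
```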

Environment Variables

  • TEADATA_SNAPSHOT: explicit snapshot path.
  • TEADATA_SNAPSHOT_URL: URL used when snapshot candidate is missing or a Git LFS pointer.
  • TEADATA_BOUNDARY_STORE: explicit boundary sqlite path.
  • TEADATA_BOUNDARY_STORE_URL: URL fallback for boundary store.
  • TEADATA_MAP_STORE: explicit map sqlite path.
  • TEADATA_MAP_STORE_URL: URL fallback for map store.
  • TEADATA_ENTITY_STORE: explicit entity sqlite path.
  • TEADATA_ENTITY_STORE_URL: URL fallback for entity store.
  • TEADATA_ASSET_CACHE_DIR: override cache directory used for downloaded assets.
  • TEADATA_DISABLE_INDEXES: disable default spatial acceleration indexes.
  • TEADATA_LOG_MEMORY: enable memory snapshot logging.

Query DSL

DataEngine and Query objects chain operations with the >> operator; each step is a verb tuple such as ("filter", ...) or ("take", n).

# Resolve district then expand to district-operated campuses.
q = engine >> ("district", "ALDINE ISD") >> ("campuses_in",)

# Filter, sort, and take.
top = (
    q
    >> ("filter", lambda c: (c.enrollment or 0) > 1000)
    >> ("sort", lambda c: c.enrollment or 0, True)
    >> ("take", 10)
)

rows = top.to_df(columns=["name", "campus_number", "enrollment"])
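The chaining above relies on operator overloading: each `>>` step receives a verb tuple and returns a new query. A minimal sketch of that dispatch pattern (an illustration of the technique, not teadata's actual `Query` class):

```python
# Toy query object: __rshift__ dispatches on the verb in a ("verb", *args)
# tuple and returns a fresh Query, so steps compose left to right.
class Query:
    def __init__(self, items):
        self.items = list(items)

    def __rshift__(self, op):
        verb, *args = op
        if verb == "filter":
            return Query(x for x in self.items if args[0](x))
        if verb == "sort":
            reverse = args[1] if len(args) > 1 else False
            return Query(sorted(self.items, key=args[0], reverse=reverse))
        if verb == "take":
            return Query(self.items[: args[0]])
        raise ValueError(f"unknown verb: {verb}")

q = Query([3, 10, 7]) >> ("filter", lambda n: n > 4) >> ("sort", lambda n: n, True) >> ("take", 1)
print(q.items)  # [10]
```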

Supported lookup semantics include:

  • case-insensitive district and campus name matching
  • wildcard patterns (*, ?, SQL-like %/_)
  • normalized district number handling (for example "123" and "'000123")
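Normalized district-number handling can be sketched as follows, assuming the canonical form is the six-digit TEA district number and that leading apostrophes (a common Excel-export artifact) and zero-padding variants should all resolve to the same key (illustrative, not teadata's actual code):

```python
# Canonicalize district-number variants: strip a leading apostrophe,
# drop zero-padding, then re-pad to the six-digit TEA form.
def normalize_district_number(raw: str) -> str:
    digits = raw.strip().lstrip("'").lstrip("0") or "0"
    return digits.zfill(6)

print(normalize_district_number("123"))      # 000123
print(normalize_district_number("'000123"))  # 000123
print(normalize_district_number("101902"))   # 101902
```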

Spatial and transfer helpers include:

  • nearest-campus/nearest-charter queries
  • nearest_charter_same_type(...)
  • transfer graph methods such as transfers_out(...) / transfers_in(...)

Enrichment Pipeline

teadata/enrichment provides registered enrichers for district and campus datasets.

Included enrichers cover:

  • district accountability and district TAPR profile data
  • campus accountability, TAPR profile/historical enrollment, PEIMS financials
  • planned closure overlays
  • charter network augmentation

Pipeline behavior is fault-tolerant by design: dataset-level failures are generally logged and do not hard-stop the full build.
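That contract can be sketched as a loop that isolates each enricher, logs failures, and keeps going (enricher names here are placeholders, not the real registry):

```python
import logging

# Fault-tolerant enrichment sketch: a failing dataset is logged and skipped
# rather than aborting the whole build.
def run_enrichers(engine, enrichers):
    applied, failed = [], []
    for name, fn in enrichers:
        try:
            fn(engine)
            applied.append(name)
        except Exception:
            logging.exception("enricher %s failed; continuing build", name)
            failed.append(name)
    return applied, failed

def ok(engine):
    pass

def boom(engine):
    raise RuntimeError("bad source file")

applied, failed = run_enrichers(object(), [("tapr", ok), ("peims", boom), ("closures", ok)])
print(applied, failed)  # ['tapr', 'closures'] ['peims']
```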

Data Build Pipeline

teadata/load_data.py builds a full DataEngine and updates cached artifacts.

uv run python -m teadata.load_data

At a high level, it:

  1. resolves year-aware source paths from teadata/teadata_sources.yaml
  2. warm-loads compatible snapshot cache when signatures match
  3. otherwise builds districts/campuses from spatial files
  4. applies enrichment datasets
  5. writes snapshot + sqlite sidecars back to .cache/
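Step 2's warm-load decision can be sketched as a signature comparison: hash the resolved source inputs into a digest and rebuild only when it differs from the digest stored with the cached snapshot (the field layout here is an assumption, not `load_data`'s actual format):

```python
import hashlib
import json

# Deterministic signature over resolved sources (e.g. path -> mtime):
# identical inputs yield an identical digest, so a match means warm-load.
def build_signature(sources: dict) -> str:
    payload = json.dumps(sources, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

cached = build_signature({"districts.shp": 1700000000, "tapr.csv": 1700000500})
current = build_signature({"districts.shp": 1700000000, "tapr.csv": 1700000500})
print("warm-load" if cached == current else "rebuild")  # warm-load
```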

Config and CLI (teadata-config)

teadata/teadata_config.py provides YAML/TOML config loading, year resolution, schema checks, and dataset joins.

CLI entrypoint:

uv run teadata-config --help

Subcommands:

  • init <out.yaml>
  • resolve <cfg> <section> <dataset> <year>
  • report <cfg> [--json] [--min N] [--max N]
  • join <cfg> <year> [--datasets a,b,c] [--parquet out.parquet] [--duckdb out.duckdb --table t]

Testing

uv run pytest

Current tests cover:

  • snapshot gzip and fallback loading
  • query DSL semantics and chaining
  • nearest charter behavior and transfer grouping
  • store discovery and asset-cache behavior
  • entity serialization invariants (percent_enrollment_change)

PyPI Size Limits and Current Packaging Status

PyPI's currently documented default limits are:

  • per-file upload limit: 100 MB
  • total project limit: 10 GB

Reference: https://docs.pypi.org/project-management/storage-limits/

Current teadata release artifacts for 0.0.118 are above the per-file limit:

  • wheel: dist/teadata-0.0.118-py3-none-any.whl about 448 MB
  • sdist: dist/teadata-0.0.118.tar.gz about 446 MB

These exceed the default 100 MB file cap because large .cache snapshot/store artifacts are packaged into both distributions.

Release Policy

  • Versioning uses thousandths-place tags (v0.0.101, v0.0.102, ...).
  • Keep only the 3 most recent release tags/assets.

License

Business Source License 1.1. See LICENSE.

Download files


Source Distribution

teadata-0.0.118.tar.gz (75.7 MB)

Built Distribution


teadata-0.0.118-py3-none-any.whl (77.6 MB)

File details

Details for the file teadata-0.0.118.tar.gz.

File metadata

  • Download URL: teadata-0.0.118.tar.gz
  • Size: 75.7 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for teadata-0.0.118.tar.gz:

  • SHA256: 24e765ab055ef1f13eb8acdea1a66fc4fcf219a64fb5d6ac2dc64a1acde66acf
  • MD5: 869f1a31fdac35a62bb5e7d7c860cfb9
  • BLAKE2b-256: 90f8b5daa3a9f36991070b60913edc3e92622f0559b30461a8cac1bde604443d


File details

Details for the file teadata-0.0.118-py3-none-any.whl.

File metadata

  • Download URL: teadata-0.0.118-py3-none-any.whl
  • Size: 77.6 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for teadata-0.0.118-py3-none-any.whl:

  • SHA256: 3f730148bf6e2c026163055b10f100e24eeb5c40516bdeed30e0d90581eeb336
  • MD5: d762c5c9153762a22bca0518de77a973
  • BLAKE2b-256: b4b56274de11e1264bb46e84b3cc12cf0044e962ac341c341246981139fcfb08

