
ATProto AppView for ac.foundation.dataset


atdata-app

An ATProto AppView for the ac.foundation.dataset lexicon namespace. It indexes dataset metadata published across the AT Protocol network and serves it through XRPC endpoints, so clients can discover, search, and resolve datasets, schemas, labels, and lenses.

Overview

In the AT Protocol architecture, an AppView is a service that subscribes to the network firehose, indexes records it cares about, and exposes query endpoints for clients. atdata-app does this for scientific and ML dataset metadata:

  • Schemas define the structure of datasets (JSON Schema, Arrow schema, etc.)
  • Dataset entries describe a dataset: its name, storage location, schema, tags, license, and size
  • Labels are human-readable version tags pointing to a specific dataset entry (like git tags)
  • Lenses are bidirectional schema transforms with getter/putter code for migrating data between schema versions
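
To make the lens idea concrete, here is a minimal sketch of a getter/putter pair in Python. It is illustrative only: the `Lens` class, the field names (`title`, `name`, `legacy_id`), and the rename migration are assumptions made for this example, not the record format that atdata-app indexes.

```python
from dataclasses import dataclass
from typing import Any, Callable

Record = dict[str, Any]


@dataclass(frozen=True)
class Lens:
    """A bidirectional transform between two schema versions (illustrative)."""

    # get: project a source-schema record into the target schema.
    get: Callable[[Record], Record]
    # put: write an edited target-schema record back into the source record.
    put: Callable[[Record, Record], Record]


def _get_v2(src: Record) -> Record:
    # Hypothetical v1 -> v2 migration: rename "title" to "name".
    return {"name": src["title"], "tags": list(src.get("tags", []))}


def _put_v2(view: Record, src: Record) -> Record:
    # Copy the edited fields back, leaving v1-only fields untouched.
    return {**src, "title": view["name"], "tags": list(view["tags"])}


rename_title = Lens(get=_get_v2, put=_put_v2)

v1 = {"title": "mnist", "tags": ["vision"], "legacy_id": 7}
v2 = rename_title.get(v1)
edited = {"name": "mnist-v2", "tags": []}

# Round-trip properties a well-behaved lens is expected to satisfy.
assert rename_title.put(v2, v1) == v1                         # get then put is a no-op
assert rename_title.get(rename_title.put(edited, v1)) == edited  # put then get returns the edit
```

The putter takes the original record as a second argument so that fields the newer schema does not know about (here `legacy_id`) survive the round trip.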
ATProto Network
    │
    ├── Jetstream (WebSocket firehose) ──► Real-time ingestion
    │                                         │
    └── BGS Relay (HTTP backfill) ──────► Historical backfill
                                              │
                                              ▼
                                         PostgreSQL
                                              │
                                              ▼
                                     XRPC Query Endpoints ──► Clients

Requirements

  • Python 3.12+
  • PostgreSQL 14+
  • uv package manager

Quickstart

# Install dependencies
uv sync --dev

# Set up PostgreSQL (schema auto-applies on startup)
createdb atdata_app

# Start the server
uv run uvicorn atdata_app.main:app --reload

The server starts with dev-mode defaults: base URL http://localhost:8000 and DID did:web:localhost%3A8000. On startup it does two things: it connects to Jetstream and begins indexing ac.foundation.dataset.* records in real time, and it runs a one-shot backfill of historical records from the BGS relay.

Configuration

All settings are environment variables prefixed with ATDATA_, managed by pydantic-settings.

| Variable | Default | Description |
| --- | --- | --- |
| ATDATA_HOSTNAME | localhost | Public hostname, used to derive did:web identity |
| ATDATA_PORT | 8000 | Server port (included in DID in dev mode) |
| ATDATA_DEV_MODE | true | Dev mode uses http:// and includes port in DID; production uses https:// |
| ATDATA_DATABASE_URL | postgresql://localhost:5432/atdata_app | PostgreSQL connection string |
| ATDATA_JETSTREAM_URL | wss://jetstream2.us-east.bsky.network/subscribe | Jetstream WebSocket endpoint |
| ATDATA_JETSTREAM_COLLECTIONS | ac.foundation.dataset.* | Collections to subscribe to |
| ATDATA_RELAY_HOST | https://bsky.network | BGS relay for backfill DID discovery |
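
For readers unfamiliar with prefixed environment settings, the following standard-library sketch mimics the behaviour: each field maps to an ATDATA_-prefixed variable and falls back to the documented default. It is a stand-in for illustration, not the project's pydantic-settings class; `Settings` and `load_settings` are hypothetical names, and the accepted boolean spellings are an assumption.

```python
import os
from dataclasses import asdict, dataclass
from typing import Mapping

PREFIX = "ATDATA_"


@dataclass(frozen=True)
class Settings:
    # Defaults copied from the configuration table above.
    hostname: str = "localhost"
    port: int = 8000
    dev_mode: bool = True
    database_url: str = "postgresql://localhost:5432/atdata_app"
    jetstream_url: str = "wss://jetstream2.us-east.bsky.network/subscribe"
    jetstream_collections: str = "ac.foundation.dataset.*"
    relay_host: str = "https://bsky.network"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Overlay ATDATA_* variables from env on top of the defaults."""
    overrides: dict[str, object] = {}
    for name, default in asdict(Settings()).items():
        raw = env.get(PREFIX + name.upper())
        if raw is None:
            continue
        # bool must be checked before int because bool is a subclass of int.
        if isinstance(default, bool):
            overrides[name] = raw.strip().lower() in {"1", "true", "yes", "on"}
        elif isinstance(default, int):
            overrides[name] = int(raw)
        else:
            overrides[name] = raw
    return Settings(**overrides)


prod = load_settings({"ATDATA_HOSTNAME": "datasets.example.com", "ATDATA_DEV_MODE": "false"})
print(prod.hostname, prod.dev_mode, prod.port)
```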

Identity

The service derives its did:web identity from the hostname and, in dev mode only, the port (the colon before the port is percent-encoded as %3A):

  • Dev mode: did:web:localhost%3A8000 with endpoint http://localhost:8000
  • Production: did:web:datasets.example.com with endpoint https://datasets.example.com

The DID document is served at GET /.well-known/did.json and advertises the service as an AtprotoAppView.
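
The derivation rules above can be sketched in a few lines. This is an approximation under stated assumptions rather than the project's code: `derive_identity` and `did_document` are hypothetical names, and the service fragment `#atdata_appview` is a placeholder because the actual service id is not given here.

```python
from urllib.parse import quote


def derive_identity(hostname: str, port: int, dev_mode: bool) -> tuple[str, str]:
    """Return (did, endpoint) following the dev and production rules above."""
    if dev_mode:
        host = f"{hostname}:{port}"
        # did:web requires the colon before the port to be percent-encoded.
        return f"did:web:{quote(host, safe='')}", f"http://{host}"
    return f"did:web:{hostname}", f"https://{hostname}"


def did_document(did: str, endpoint: str) -> dict:
    """Build a minimal DID document advertising one AppView service."""
    return {
        "@context": ["https://www.w3.org/ns/did/v1"],
        "id": did,
        "service": [
            {
                "id": "#atdata_appview",  # assumed fragment, not taken from the project
                "type": "AtprotoAppView",
                "serviceEndpoint": endpoint,
            }
        ],
    }


dev = derive_identity("localhost", 8000, dev_mode=True)
prod = derive_identity("datasets.example.com", 8000, dev_mode=False)
print(dev)   # ('did:web:localhost%3A8000', 'http://localhost:8000')
print(prod)  # ('did:web:datasets.example.com', 'https://datasets.example.com')
```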

API Reference

See docs/api-reference.md for the full XRPC endpoint reference (queries, procedures, and other routes).

Data Model

See docs/data-model.md for the database schema (schemas, entries, labels, lenses).

Docker Deployment

The app ships with a multi-stage Dockerfile using uv for fast dependency installation.

Build and run locally

docker build -t atdata-app .

docker run -p 8000:8000 \
  -e ATDATA_DATABASE_URL=postgresql://user:pass@host:5432/atdata_app \
  -e ATDATA_HOSTNAME=localhost \
  -e ATDATA_DEV_MODE=true \
  atdata-app

Deploy on Railway

The repo includes a railway.toml that configures the Dockerfile builder, health checks at /health, and a restart-on-failure policy.

  1. Connect the repo to a Railway project
  2. Add a PostgreSQL service and link it
  3. Set the required environment variables:
| Variable | Value |
| --- | --- |
| ATDATA_DATABASE_URL | Provided by Railway's PostgreSQL plugin (${{Postgres.DATABASE_URL}}) |
| ATDATA_HOSTNAME | Your Railway public domain (e.g. atdata-app-production.up.railway.app) |
| ATDATA_DEV_MODE | false |
| ATDATA_PORT | Omit: Railway sets PORT automatically and the container respects it |

Optional variables for ingestion tuning:

| Variable | Default | Description |
| --- | --- | --- |
| ATDATA_JETSTREAM_URL | wss://jetstream2.us-east.bsky.network/subscribe | Jetstream endpoint |
| ATDATA_RELAY_HOST | https://bsky.network | BGS relay for backfill |

Railway auto-deploys on every push: it builds the Docker image and starts the container.

Development

# Run tests (no database required)
uv run pytest

# Run a single test
uv run pytest tests/test_models.py::test_parse_at_uri -v

# Run with coverage
uv run pytest --cov=atdata_app

# Lint
uv run ruff check src/ tests/

Tests mock all external dependencies (database, HTTP, identity resolution) using unittest.mock.AsyncMock. HTTP endpoint tests use httpx ASGITransport for in-process testing without a running server.

License

MIT

