Skip to main content

Sends JSONL data into an Ed-Fi API

Project description

lightbeam

lightbeam transmits payloads from JSONL files into an Ed-Fi API.

Table of Contents

Requirements

Python 3, pip, and connectivity to an Ed-Fi API.

Installation

pip install git+https://github.com/edanalytics/lightbeam.git#egg=lightbeam

Setup

Running the tool requires

  1. a folder of JSONL files, one for each Ed-Fi Resource and Descriptor to populate
  2. a YAML configuration file An example YAML configuration is below, followed by documentation of each option.
state_dir: ~/.lighbeam/
data_dir: ./
namespace: ed-fi
edfi_api:
  base_url: https://api.schooldistrict.org/v5.3/api
  version: 3
  mode: year_specific
  year: 2021
  client_id: yourID
  client_secret: yourSecret
connection:
  pool_size: 8
  timeout: 60
  num_retries: 10
  backoff_factor: 1.5
  retry_statuses: [429, 500, 502, 503, 504]
  verify_ssl: True
force_delete: True
log_level: INFO
show_stacktrace: True
  • (optional) state_dir is where state is stored. The default is ~/.lightbeam/ on *nix systems, C:/Users/USER/.lightbeam/ on Windows systems.
  • (optional) Specify the data_dir which contains JSONL files to send to Ed-Fi. The default is ./. The tool will look for files like {Resource}.jsonl or {Descriptor}.jsonl in this location, as well as directory-based files like {Resource}/*.jsonl or {Descriptor}/*.jsonl. Files with .ndjson or simply .json extensions will also be processed. (More info at the ndjson standard page.)
  • (optional) Specify the namespace to use when accessing the Ed-Fi API. The default is ed-fi but others include tpdm or custom values.
  • Specify the details of the edfi_api to which to connect including
    • (optional) The base_url The default is https://localhost/api (the address of an Ed-Fi API running locally in Docker).
    • The version as one of 3 or 2 (2 is currently unsupported).
    • (optional) The mode as one of shared_instance, sandbox, district_specific, year_specific, or instance_year_specific.
    • (required if mode is year_specific or instance_year_specific) The year used to build the resource URL. The default is the current year.
    • (required if mode is instance_year_specific) The instance_code used to build the resource URL. The default is none.
    • (required) Specify the client_id to use when connecting to the Ed-Fi API.
    • (required) Specify the client_secret to use when connecting to the Ed-Fi API.
  • Specify the connection parameters to use when making requests to the API including
    • (optional) The pool_size. The default is 8. The optimal setting depends on the Ed-Fi API's capabilities.
    • (optional) The timeout (in seconds) to wait for each connection attempt. The default is 60 seconds.
    • (optional) The num_retries to do in case of request failures. The default is 10.
    • (optional) The backoff_factor to use for the exponential backoff. The default is 1.5.
    • (optional) The retry_statuses, that is, the HTTPS response codes to consider as failures to retry. The default is [429, 500, 501, 503, 504].
    • (optional) Whether to verify_ssl. The default is True. Set to False when working with localhost APIs or to live dangerously.
  • (optional) Skip the interactive confirmation prompt (for programmatic use) when using the delete command. The default is False (prompt).
  • (optional) Specify a log_level for output. Possible values are
    • ERROR: only output errors like missing required sources, invalid references, invalid YAML configuration, etc.
    • WARNING: output errors and warnings like when the run log is getting long
    • INFO: all errors and warnings plus basic information about what earthmover is doing: start and stop, how many rows were removed by a distinct_rows or filter_rows operation, etc. (This is the default log_level.)
    • DEBUG: all output above, plus verbose details about each transformation step, timing, memory usage, and more. (This log_level is recommended for debugging transformations.)
  • (optional) Specify whether to show a stacktrace for runtime errors. The default is False.

Usage

lightbeam recognizes several commands:

validate

lightbeam validate -c path/to/config.yaml

You may validate your JSONL before transmitting it. This checks that the payloads

  • are valid JSON
  • conform to the structure described in the Swagger documents for resources and descriptors fetched from your API
  • contain valid descriptor values (fetched from your API)
  • contain unique values for any natural key

This command will not find invalid reference errors, but is helpful for finding payloads that are invalid JSON, are missing required fields, or have other structural issues.

send

lightbeam send -c path/to/config.yaml

Sends your JSONL payloads to your Ed-Fi API.

validate+send

lightbeam validate+send -c path/to/config.yaml

This is a shorthand for sequentially running validate and then send. It can be useful to catching errors in automated pipelines earlier in the validate step before you actually send problematic data to your Ed-Fi API.

delete

lightbeam delete -c path/to/config.yaml

Delete payloads by

  1. determing the natural key (set of required fields) for each endpoint
  2. iterating through your payloads and looking up each one via a GET request to the API filtering for the natural key values
  3. if exactly one result is returned, DELETEing it by id

Payload hashes are also deleted from saved state. Endpoints are processed in reverse-dependency order to prevent delete failures due to data dependencies.

Note that student resource payloads cannot be deleted since other resources may reference them. (This is an Ed-Fi API restriction.)

Running the delete command will prompt you to type "yes" to confirm. This confirmation prompt can be disabled (for programmatic use) by specifying force_delete: True in your YAML.

Other options

See a help message with

lightbeam -h
lightbeam --help

See the tool version with

lightbeam -v
lightbeam --version

Features

This tool includes several special features:

Selectors

Send only a subset of resources or descriptors in your data_dir using a selector:

lightbeam send -c path/to/config.yaml -s schools,students,studentSchoolAssociations

Environment variable references

In your YAML configuration, you may reference environment variables with ${ENV_VAR}. This can be useful for passing sensitive data like credentials to lightbeam, such as

...
edfi_api:
  client_id: ${EDFI_API_CLIENT_ID}
  client_secret: ${EDFI_API_CLIENT_SECRET}
...

Command-line parameters

Similarly, you can specify parameters via the command line with

lightbeam send -c path/to/config.yaml -p '{"CLIENT_ID":"populated", "CLIENT_SECRET":"populatedSecret"}'
lightbeam send -c path/to/config.yaml --params '{"CLIENT_ID":"populated", "CLIENT_SECRET":"populatedSecret"}'

Command-line parameters override any environment variables of the same name.

State

This tool maintains state about payloads previously dispatched to the Ed-Fi API to avoid repeatedly resending the same payloads. This is done by maintaining a pickled Python dictionary of payload hashes for each Ed-Fi resource and descriptor, together with a timestamp and HTTP status code of the last response. The files are located in the config file's state_dir and have names like {resource}.dat or {descriptor}.dat.

By default, only new, never-before-seen payloads are sent.

You may choose to resend payloads last sent before timestamp using the -t or --older-than command-line flag:

lightbeam send -c path/to/lightbeam.yaml -t 2020-12-25T00:00:00
lightbeam send -c path/to/lightbeam.yaml --older-than 2020-12-25T00:00:00

Or you may choose to resend payloads last sent after timestamp using the -n or --newer-than command-line flag:

lightbeam send -c path/to/lightbeam.yaml -n 2020-12-25T00:00:00
lightbeam send -c path/to/lightbeam.yaml --newer-than 2020-12-25T00:00:00

Or you may choose to resend payloads that returned a certain HTTP status code(s) on the last send using the -r or --retry-status-codes command-line flag:

lightbeam send -c path/to/lightbeam.yaml -r 200,201
lightbeam send -c path/to/lightbeam.yaml --retry-status-codes 200,201

These three options may be composed; lightbeam will resend payloads that match any conditions (logical OR).

Finally, you can ignore prior state and resend all payloads using the -f or --force flag:

lightbeam send -c path/to/lightbeam.yaml -f
lightbeam send -c path/to/lightbeam.yaml --force

Cache

To reduce runtime, lightbeam caches the resource and descriptor Swagger docs it fetches from your Ed-Fi API as well as the descriptor values for up to a month. This way, the data does not have to be re-loaded from your API on every run. The cached files are stored in the cache directory within your state_dir. You may run lightbeam with the -w or --wipe flag to clear the cache and force re-fetching the API metadata:

lightbeam send -c path/to/config.yaml -w
lightbeam send -c path/to/config.yaml --wipe

Design

Some details of the design of this tool are discussed below.

Resource-dependency ordering

JSONL files are sent to the Ed-Fi API in resource-dependency order, which avoids "missing reference" API errors when populating multiple endpoints.

Performance & Limitations

Tool performance depends on primarily on the performance of the Ed-Fi API, which in turn depends on the compute resources which back it. Typically the bottleneck is write performance to the database backend (SQL server or Postgres). If you use lightbeam to ingest a large amount of data into an Ed-Fi API (not the recommended use-case, by the way), consider temporarily scaling up your database backend.

For reference, we have achieved throughput rates in excess of 100 requests/second against an Ed-Fi ODS & API running in Docker on a laptop.

Change log

[2022-09-22] Version 0.0.1 released

Contributing

Bugfixes and new features (such as additional transformation operations) are gratefully accepted via pull requests here on GitHub.

Contributions

License

See License.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lightbeam-0.0.1.tar.gz (33.5 kB view hashes)

Uploaded Source

Built Distribution

lightbeam-0.0.1-py3-none-any.whl (27.8 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page