Sends JSONL data into an Ed-Fi API
lightbeam transmits payloads from JSONL files into an Ed-Fi API.
Table of Contents
- Requirements
- Installation
- Setup
- Usage
- Features
- Design
- Performance
- Limitations
- Changelog
- Contributing
- License
Requirements
Python 3, pip, and connectivity to an Ed-Fi API.
Installation
pip install lightbeam
Setup
Running the tool requires
- a folder of JSONL files, one for each Ed-Fi Resource and Descriptor to populate
- a YAML configuration file

An example YAML configuration is below, followed by documentation of each option.
state_dir: ~/.lightbeam/
data_dir: ./
namespace: ed-fi
edfi_api:
  base_url: https://api.schooldistrict.org/v5.3/api
  oauth_url: https://api.schooldistrict.org/v5.3/api/oauth/token
  dependencies_url: https://api.schooldistrict.org/v5.3/api/metadata/data/v3/2024/dependencies
  descriptors_swagger_url: https://api.schooldistrict.org/v5.3/api/metadata/data/v3/2024/descriptors/swagger.json
  resources_swagger_url: https://api.schooldistrict.org/v5.3/api/metadata/data/v3/2024/resources/swagger.json
  version: 3
  mode: year_specific
  year: 2021
  client_id: yourID
  client_secret: yourSecret
connection:
  pool_size: 8
  timeout: 60
  num_retries: 10
  backoff_factor: 1.5
  retry_statuses: [429, 500, 502, 503, 504]
  verify_ssl: True
count:
  separator: ,
fetch:
  page_size: 100
force_delete: True
log_level: INFO
show_stacktrace: True
- (optional) state_dir is where state is stored. The default is ~/.lightbeam/ on *nix systems, C:/Users/USER/.lightbeam/ on Windows systems.
- (optional) Specify the data_dir which contains JSONL files to send to Ed-Fi. The default is ./ (the current directory). The tool will look for files like {Resource}.jsonl or {Descriptor}.jsonl in this location, as well as directory-based files like {Resource}/*.jsonl or {Descriptor}/*.jsonl. Files with .ndjson or simply .json extensions will also be processed. (More info at the ndjson standard page.)
- (optional) Specify the namespace to use when accessing the Ed-Fi API. The default is ed-fi but others include tpdm or custom values. To send data to multiple namespaces, you must use a YAML configuration file and lightbeam send for each.
- Specify the details of the edfi_api to which to connect, including
  - (optional) The base_url which serves a JSON object specifying the paths to data endpoints, Swagger, and dependencies. The default is https://localhost/api (the address of an Ed-Fi API running locally in Docker), but the location varies depending on how Ed-Fi is deployed.
  - If the metadata for a particular API is not located in the "default" location (at the root of the base_url), then ALL the following URLs should be explicitly specified. These can normally be left blank, unless you are encountering errors indicating that the metadata files cannot be found (such as "Could not parse response from [base_url]").
    - (optional) oauth_url (usually [base_url]/oauth/token)
    - (optional) dependencies_url (usually [base_url]/metadata/data/v3/dependencies)
    - (optional) descriptors_swagger_url (usually [base_url]/metadata/data/v3/descriptors/swagger.json)
    - (optional) resources_swagger_url (usually [base_url]/metadata/data/v3/resources/swagger.json)
  - (optional) The version as one of 3 or 2 (2 is currently unsupported).
  - (optional) The mode as one of shared_instance, sandbox, district_specific, year_specific, or instance_year_specific.
  - (required if mode is year_specific or instance_year_specific) The year used to build the resource URL. The default is the current year.
  - (required if mode is instance_year_specific) The instance_code used to build the resource URL. The default is none.
  - (required) Specify the client_id to use when connecting to the Ed-Fi API.
  - (required) Specify the client_secret to use when connecting to the Ed-Fi API.
- Specify the connection parameters to use when making requests to the API, including
  - (optional) The pool_size. The default is 8. The optimal setting depends on the Ed-Fi API's capabilities.
  - (optional) The timeout (in seconds) to wait for each connection attempt. The default is 60 seconds.
  - (optional) The num_retries to do in case of request failures. The default is 10.
  - (optional) The backoff_factor to use for the exponential backoff. The default is 1.5.
  - (optional) The retry_statuses, that is, the HTTP response codes to consider as failures to retry. The default is [429, 500, 502, 503, 504].
  - (optional) Whether to verify_ssl. The default is True. Set to False when working with localhost APIs or to live dangerously.
- (optional) For lightbeam count, change the separator between Records and Endpoint (count.separator). The default is a "tab" character.
- (optional) For lightbeam fetch, specify the number of records (fetch.page_size) to GET at a time. The default is 100, but if you're trying to extract lots of data from an API, increase this to the largest allowed (which depends on the API, but is often 500 or even 5000).
- (optional) Set force_delete to skip the interactive confirmation prompt (for programmatic use) when using the delete command. The default is False (prompt).
- (optional) Specify a log_level for output. Possible values are
  - ERROR: only output errors like missing required sources, invalid references, invalid YAML configuration, etc.
  - WARNING: output errors and warnings like when the run log is getting long
  - INFO: all errors and warnings plus basic information about what lightbeam is doing: start and stop, how many rows were removed by a distinct_rows or filter_rows operation, etc. (This is the default log_level.)
  - DEBUG: all output above, plus verbose details about each transformation step, timing, memory usage, and more. (This log_level is recommended for debugging transformations.)
- (optional) Set show_stacktrace to specify whether to show a stacktrace for runtime errors. The default is False.
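To make the interplay of num_retries, backoff_factor, and retry_statuses concrete, here is a minimal Python sketch of a retry loop. It is not lightbeam's implementation: send_with_retries and do_request are hypothetical names, and the wait schedule (backoff_factor raised to the attempt number) is an assumption about what "exponential backoff" means here. The default statuses are copied from the example configuration above.

```python
import time
from typing import Callable, Iterable


def send_with_retries(
    do_request: Callable[[], int],
    num_retries: int = 10,
    backoff_factor: float = 1.5,
    retry_statuses: Iterable[int] = (429, 500, 502, 503, 504),
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call do_request until it returns a status outside retry_statuses.

    do_request is a hypothetical callable that performs one HTTP request
    and returns its status code.
    """
    retryable = set(retry_statuses)
    status = do_request()
    for attempt in range(num_retries):
        if status not in retryable:
            break
        # Assumed schedule: wait 1, 1.5, 2.25, ... seconds between attempts.
        sleep(backoff_factor ** attempt)
        status = do_request()
    return status


if __name__ == "__main__":
    # Simulated API: two transient failures, then success.
    responses = iter([503, 429, 201])
    print(send_with_retries(lambda: next(responses), sleep=lambda s: None))
```

The sleep parameter is injectable only so the loop can be exercised without real delays.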
Usage
lightbeam recognizes several commands:
count
lightbeam count -c path/to/config.yaml
Prints to the console (or to your --results-file, if specified) a record count for each endpoint in your Ed-Fi API.
- By default, resources and descriptors (all endpoints) are counted. You can change this by using selectors, such as -e *Descriptors.
- Endpoint counts printed to the console (if you don't specify a --results-file) include only endpoints with more than zero records. Endpoint counts saved in a --results-file include all available endpoints, even those with zero records.
- Whether printed to the console or a --results-file, output will include columns Records and Endpoint separated by a separator specified as count.separator in your YAML configuration (default is a "tab" character).
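The Records/Endpoint output lends itself to simple post-processing. The following is a minimal sketch, assuming the output is plain delimited text with one row per endpoint and possibly a header row; read_counts is a hypothetical helper and not part of lightbeam.

```python
import csv
import io


def read_counts(text: str, separator: str = "\t") -> dict[str, int]:
    """Parse 'Records<separator>Endpoint' rows into {endpoint: records}."""
    counts: dict[str, int] = {}
    for row in csv.reader(io.StringIO(text), delimiter=separator):
        if len(row) != 2:
            continue  # ignore blank or malformed lines
        records, endpoint = row[0].strip(), row[1].strip()
        if not records.isdigit():
            continue  # skip a header row, if the output has one
        counts[endpoint] = int(records)
    return counts


if __name__ == "__main__":
    sample = "Records\tEndpoint\n12\tstudents\n3\tschools\n"
    print(read_counts(sample))
```

Pass separator="," if count.separator is set to a comma, as in the example configuration.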
fetch
lightbeam fetch -c path/to/config.yaml
Fetches the payloads of selected endpoints from your Ed-Fi API and saves them, each on their own line, to JSONL files in your data_dir.
Optionally specify --query '{"studentUniqueId": 12345}' or -q '{"key": "value"}' to add query parameters to every GET request. This can be useful if you want to fetch data for just a specific record (and related data). For example:
lightbeam fetch -s student* -e *Descriptors -q '{"studentUniqueId":12345}' -d id,_etag,_lastModifiedDate
Optionally specify --keep-keys id or -k id to keep only specific keys from every payload. This can be useful to reduce the amount of data stored if you only need certain fields. It is used internally by truncate to fetch only the ids of payloads, which are then deleted by id.
Optionally specify --drop-keys id,_etag,_lastModifiedDate or -d id to remove specific keys from every payload. This can be useful if you want to fetch data from one Ed-Fi API and then turn around and send it to another.
Like selectors, keep-keys and drop-keys are comma-separated lists of values, each of which may begin or end with an asterisk (*) for wildcard matching. Example: -d _* would remove properties beginning with an underscore (_) character from any fetched payloads.
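The key filtering described above can be sketched in a few lines of Python. This is an illustration only, not lightbeam's code: filter_payload and matches_any are hypothetical names, only top-level keys are considered, and fnmatch is used as a stand-in for the wildcard matching (it accepts more pattern syntax than a leading or trailing asterisk).

```python
from fnmatch import fnmatchcase


def matches_any(key: str, patterns: str) -> bool:
    """True if key matches any pattern in a comma-separated list."""
    return any(
        fnmatchcase(key, pattern.strip())
        for pattern in patterns.split(",")
        if pattern.strip()
    )


def filter_payload(payload: dict, keep_keys: str = "*", drop_keys: str = "") -> dict:
    """Keep top-level keys matching keep_keys, then remove those matching drop_keys."""
    return {
        key: value
        for key, value in payload.items()
        if matches_any(key, keep_keys) and not matches_any(key, drop_keys)
    }


if __name__ == "__main__":
    payload = {
        "id": "abc",
        "_etag": "1",
        "_lastModifiedDate": "2024-01-01",
        "studentUniqueId": "12345",
    }
    # Roughly what -d id,_* is described as doing.
    print(filter_payload(payload, drop_keys="id,_*"))
```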
validate
lightbeam validate -c path/to/config.yaml
You may validate your JSONL before transmitting it. This checks that the payloads
- are valid JSON
- conform to the structure described in the Swagger documents for resources and descriptors fetched from your API
- contain valid descriptor values (fetched from your API and/or from descriptor values in your JSONL files)
- contain unique values for any natural key
This command will not find invalid reference errors, but is helpful for finding payloads that are invalid JSON, are missing required fields, or have other structural issues.
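Two of the checks listed above (valid JSON, unique natural key values) are easy to illustrate locally. The sketch below is not how lightbeam validates; check_lines is a hypothetical helper, and the natural key is passed in by hand rather than derived from the API's Swagger documents.

```python
import json


def check_lines(lines, natural_key):
    """Return (line_number, problem) tuples for the lines of one JSONL file."""
    problems = []
    seen = {}  # natural key values -> first line number where they appeared
    for number, line in enumerate(lines, start=1):
        try:
            payload = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(payload, dict):
            problems.append((number, "payload is not a JSON object"))
            continue
        missing = [field for field in natural_key if field not in payload]
        if missing:
            problems.append((number, "missing required fields: " + ", ".join(missing)))
            continue
        key = tuple(json.dumps(payload[field], sort_keys=True) for field in natural_key)
        if key in seen:
            problems.append((number, f"duplicate natural key, first seen on line {seen[key]}"))
        else:
            seen[key] = number
    return problems


if __name__ == "__main__":
    sample = ['{"schoolId": 1}', '{"schoolId": 1}', '{"schoolId": 2']
    for number, problem in check_lines(sample, ["schoolId"]):
        print(number, problem)
```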
send
lightbeam send -c path/to/config.yaml
Sends your JSONL payloads to your Ed-Fi API.
validate+send
lightbeam validate+send -c path/to/config.yaml
This is shorthand for sequentially running validate and then send. It can be useful for catching errors in automated pipelines earlier, in the validate step, before you actually send problematic data to your Ed-Fi API.
delete
lightbeam delete -c path/to/config.yaml
Deletes payloads by
- determining the natural key (set of required fields) for each endpoint
- iterating through your JSONL payloads and looking up each one via a GET request to the API, filtering for the natural key values
- if exactly one result is returned, DELETEing it by id

Payload hashes are also deleted from saved state. Endpoints are processed in reverse-dependency order to prevent delete failures due to data dependencies.
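The lookup-then-delete steps above can be sketched as follows. This is a simplified illustration, not lightbeam's code: delete_payload is a hypothetical name, get and delete are stand-ins for the HTTP GET and DELETE calls, and the natural key is assumed to consist of top-level fields only (real Ed-Fi natural keys often live inside nested references).

```python
from typing import Callable


def delete_payload(
    payload: dict,
    natural_key: list,
    get: Callable[[dict], list],
    delete: Callable[[str], None],
) -> bool:
    """Delete the API record matching payload's natural key, if unambiguous.

    get(params) is assumed to return the list of matching records;
    delete(record_id) is assumed to issue the DELETE by id.
    """
    params = {field: payload[field] for field in natural_key}
    matches = get(params)
    if len(matches) != 1:
        return False  # zero or several matches: do not guess
    delete(matches[0]["id"])
    return True


if __name__ == "__main__":
    # In-memory stand-in for an API endpoint.
    store = [{"id": "a1", "schoolId": 1}, {"id": "b2", "schoolId": 2}]

    def fake_get(params):
        return [r for r in store if all(r.get(k) == v for k, v in params.items())]

    deleted = []
    print(delete_payload({"schoolId": 1}, ["schoolId"], fake_get, deleted.append), deleted)
```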
Note that the default profile for most Ed-Fi API credentials prevents deletion of certain core resources (student, school, etc.), even if your credentials were used to create the records. If you get API errors trying to delete records, you may need "no further auth" API credentials.
Running the delete command will prompt you to type "yes" to confirm. This confirmation prompt can be disabled (for programmatic use) by specifying force_delete: True in your YAML.
truncate
lightbeam truncate -c path/to/config.yaml
Truncates (empties) your Ed-Fi API for selected endpoints, in reverse-dependency order. USE WITH CAUTION! truncate works by fetching the id of every record for a given endpoint and then deleting all records by id.
Truncating a resource will also clear out the saved state for it.
Note that the default profile for most Ed-Fi API credentials prevents deletion of certain core resources (student, school, etc.), even if your credentials were used to create the records. If you get API errors trying to delete records, you may need "no further auth" API credentials.
Running the truncate command will prompt you to type "yes" to confirm. This confirmation prompt can be disabled (for programmatic use) by specifying force_delete: True in your YAML.
truncate is a convenience command which should be used sparingly, as it can generate large numbers of deletes records and cause performance issues when pulling from deletes endpoints. If you want to wipe an entire Ed-Fi ODS, a better approach may be to drop and recreate the database (and re-send Descriptors and other default resources as needed).
Other options
See a help message with
lightbeam -h
lightbeam --help
See the tool version with
lightbeam -v
lightbeam --version
Features
This tool includes several special features:
Selectors
Send only a subset of resources or descriptors in your data_dir using -s or --selector:
lightbeam send -c path/to/config.yaml -s schools,students,studentSchoolAssociations
or, similarly, exclude some resources or descriptors using -e or --exclude:
lightbeam send -c path/to/config.yaml -e *Descriptors
Selection and exclusion may each be a single string or a comma-separated list of strings or wildcards (beginning or ending with *). For example:
lightbeam send -c path/to/config.yaml -s student*,parent* -e *Associations,*Descriptors
would process resources like studentSchoolAttendanceEvents and parents, but not studentSchoolAssociations, studentParentAssociations, or any Descriptors.
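The example above can be reproduced with a small Python sketch. It only illustrates the selection semantics as described; select_endpoints is a hypothetical function, and case-sensitive fnmatch matching is an assumption rather than a statement about how lightbeam compares names.

```python
from fnmatch import fnmatchcase


def select_endpoints(endpoints, selector="*", exclude=""):
    """Return endpoints matching any selector pattern and no exclude pattern."""

    def matches(name, patterns):
        return any(fnmatchcase(name, p) for p in patterns.split(",") if p)

    return [e for e in endpoints if matches(e, selector) and not matches(e, exclude)]


if __name__ == "__main__":
    endpoints = [
        "studentSchoolAttendanceEvents",
        "parents",
        "studentSchoolAssociations",
        "studentParentAssociations",
        "gradeLevelDescriptors",
        "schools",
    ]
    print(select_endpoints(endpoints, "student*,parent*", "*Associations,*Descriptors"))
```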
Environment variable references
In your YAML configuration, you may reference environment variables with ${ENV_VAR}. This can be useful for passing sensitive data like credentials to lightbeam, such as
...
edfi_api:
  client_id: ${EDFI_API_CLIENT_ID}
  client_secret: ${EDFI_API_CLIENT_SECRET}
...
Command-line parameters
Similarly, you can specify parameters via the command line with
lightbeam send -c path/to/config.yaml -p '{"CLIENT_ID":"populated", "CLIENT_SECRET":"populatedSecret"}'
lightbeam send -c path/to/config.yaml --params '{"CLIENT_ID":"populated", "CLIENT_SECRET":"populatedSecret"}'
Command-line parameters override any environment variables of the same name.
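The precedence rule (command-line parameters over environment variables) can be illustrated with Python's string.Template, which happens to use the same ${NAME} syntax. This is a sketch of the behavior described, not lightbeam's implementation; expand_config is a hypothetical name, and Template also accepts forms such as $NAME that lightbeam may not.

```python
import os
from string import Template


def expand_config(text, params=None, environ=None):
    """Substitute ${NAME} references in raw YAML text.

    Values come from the environment first, then are overridden by params,
    mirroring 'command-line parameters override environment variables'.
    """
    values = dict(os.environ if environ is None else environ)
    values.update(params or {})
    # safe_substitute leaves unknown references untouched instead of raising.
    return Template(text).safe_substitute(values)


if __name__ == "__main__":
    raw = "client_id: ${CLIENT_ID}\nclient_secret: ${CLIENT_SECRET}\n"
    env = {"CLIENT_ID": "fromEnv", "CLIENT_SECRET": "envSecret"}
    print(expand_config(raw, params={"CLIENT_ID": "populated"}, environ=env))
```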
State
This tool maintains state about payloads previously dispatched to the Ed-Fi API to avoid repeatedly resending the same payloads. This is done by maintaining a pickled Python dictionary of payload hashes for each Ed-Fi resource and descriptor, together with a timestamp and HTTP status code of the last response. The files are located in the config file's state_dir and have names like {resource}.dat or {descriptor}.dat.
By default, only new, never-before-seen payloads are sent or deleted.
You may choose to resend payloads last sent before a given timestamp using the -t or --older-than command-line flag:
lightbeam send -c path/to/lightbeam.yaml -t 2020-12-25T00:00:00
lightbeam send -c path/to/lightbeam.yaml --older-than 2020-12-25T00:00:00
Or you may choose to resend payloads last sent after a given timestamp using the -n or --newer-than command-line flag:
lightbeam send -c path/to/lightbeam.yaml -n 2020-12-25T00:00:00
lightbeam send -c path/to/lightbeam.yaml --newer-than 2020-12-25T00:00:00
Or you may choose to resend payloads that returned certain HTTP status code(s) on the last send using the -r or --retry-status-codes command-line flag:
lightbeam send -c path/to/lightbeam.yaml -r 200,201
lightbeam send -c path/to/lightbeam.yaml --retry-status-codes 200,201
These three options may be composed; lightbeam will resend payloads that match any of the conditions (logical OR).
Finally, you can ignore prior state and resend all payloads using the -f or --force flag:
lightbeam send -c path/to/lightbeam.yaml -f
lightbeam send -c path/to/lightbeam.yaml --force
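The resend rules in this section can be summarized as a single decision function. The sketch below is an assumption-laden illustration, not lightbeam's code: payload_hash and should_send are hypothetical names, SHA-256 over canonical JSON is just one plausible hashing choice, and state is modeled as a plain dictionary mapping each hash to a (last_sent, status_code) pair.

```python
import hashlib
import json
from datetime import datetime


def payload_hash(payload: dict) -> str:
    """Hash a payload independent of key order (hash choice is an assumption)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def should_send(payload, state, older_than=None, newer_than=None,
                retry_status_codes=(), force=False):
    """Decide whether to (re)send a payload given saved state."""
    entry = state.get(payload_hash(payload))
    if force or entry is None:
        return True  # forced, or never seen before
    last_sent, status = entry
    # The three filters combine with logical OR.
    return (
        (older_than is not None and last_sent < older_than)
        or (newer_than is not None and last_sent > newer_than)
        or status in retry_status_codes
    )


if __name__ == "__main__":
    school = {"schoolId": 1}
    state = {payload_hash(school): (datetime(2020, 12, 1), 400)}
    print(should_send(school, state))
    print(should_send(school, state, older_than=datetime(2020, 12, 25)))
```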
Cache
To reduce runtime, lightbeam caches the resource and descriptor Swagger docs it fetches from your Ed-Fi API, as well as the descriptor values, for up to a month. This way, the data does not have to be re-loaded from your API on every run. The cached files are stored in the cache directory within your state_dir. You may run lightbeam with the -w or --wipe flag to clear this cached data and force re-fetching the API metadata:
lightbeam send -c path/to/config.yaml -w
lightbeam send -c path/to/config.yaml --wipe
Structured output of run results
To produce a JSON file with metadata about the run, invoke lightbeam with
lightbeam send -c path/to/config.yaml --results-file ./results.json
A sample results file could be:
{
  "started_at": "2023-06-08T17:18:25.053207",
  "working_dir": "/home/someuser/code/sandbox/testing_lightbeam",
  "config_file": "lightbeam.yml",
  "data_dir": "./",
  "api_url": "https://some-ed-fi-api.edu/api",
  "namespace": "ed-fi",
  "resources": {
    "studentSchoolAssociations": {
      "failed_statuses": {
        "400": {
          "400: { \"message\": \"The request is invalid.\", \"modelState\": { \"request.schoolReference.schoolId\": [ \"JSON integer 1234567899999 is too large or small for an Int32. Path 'schoolReference.schoolId', line 1, position 328.\" ] } }": {
            "files": {
              "./studentSchoolAssociations.jsonl": {
                "line_numbers": "6,4,5,7,8",
                "count": 5
              }
            }
          },
          "400: { \"message\": \"Validation of 'StudentSchoolAssociation' failed.\\n\\tStudent reference could not be resolved.\\n\" }": {
            "files": {
              "./studentSchoolAssociations.jsonl": {
                "line_numbers": "1,3,2",
                "count": 3
              }
            }
          },
          "count": 8
        },
        "409": {
          "409: { \"message\": \"The value supplied for the related 'studentschoolassociation' resource does not exist.\" }": {
            "files": {
              "./studentSchoolAssociations.jsonl": {
                "line_numbers": "9,10,12,14,16,13,11,15,17,18,19,21,22,20",
                "count": 14
              }
            }
          },
          "count": 14
        }
      },
      "records_processed": 22,
      "records_skipped": 0,
      "records_failed": 22
    }
  },
  "completed_at": "2023-06-08T17:18:26.724699",
  "runtime_sec": 1.671492,
  "total_records_processed": 22,
  "total_records_skipped": 0,
  "total_records_failed": 22
}
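A results file with this shape is straightforward to summarize in a pipeline. The sketch below assumes the structure shown in the sample (resources, then failed_statuses, then one entry per error message plus a count key); failure_summary is a hypothetical helper, not something lightbeam provides.

```python
def failure_summary(results: dict) -> list:
    """Return (endpoint, status, count, message) rows, largest count first."""
    rows = []
    for endpoint, info in results.get("resources", {}).items():
        for status, messages in info.get("failed_statuses", {}).items():
            for message, detail in messages.items():
                if message == "count":
                    continue  # per-status total, not an error message
                count = sum(f["count"] for f in detail["files"].values())
                rows.append((endpoint, status, count, message))
    return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    # Abbreviated stand-in for a parsed results file (json.load on the real one).
    results = {
        "resources": {
            "studentSchoolAssociations": {
                "failed_statuses": {
                    "400": {
                        "400: invalid request": {
                            "files": {"./studentSchoolAssociations.jsonl": {"line_numbers": "6,4", "count": 2}}
                        },
                        "count": 2,
                    }
                }
            }
        }
    }
    for row in failure_summary(results):
        print(row)
```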
Design
Some details of the design of this tool are discussed below.
Resource-dependency ordering
JSONL files are sent to the Ed-Fi API in resource-dependency order, which avoids "missing reference" API errors when populating multiple endpoints.
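Dependency ordering amounts to a topological sort. The sketch below uses Python's standard graphlib with a hand-written dependency map; in practice the ordering presumably comes from the API's dependencies metadata, and send_order and delete_order are hypothetical names used only for illustration.

```python
from graphlib import TopologicalSorter


def send_order(dependencies: dict) -> list:
    """Order endpoints so that each comes after everything it depends on.

    dependencies maps an endpoint to the set of endpoints it references.
    """
    return list(TopologicalSorter(dependencies).static_order())


def delete_order(dependencies: dict) -> list:
    """Reverse-dependency order, as used when deleting."""
    return list(reversed(send_order(dependencies)))


if __name__ == "__main__":
    deps = {
        "studentSchoolAssociations": {"students", "schools"},
        "students": set(),
        "schools": set(),
    }
    print(send_order(deps))
    print(delete_order(deps))
```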
Asynchronous requests
lightbeam achieves high throughput by making asynchronous requests to the Ed-Fi API, up to connection.pool_size (in your YAML configuration) at a time.
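Bounding the number of in-flight requests is commonly done with a semaphore. The following is a minimal asyncio sketch of that idea, not lightbeam's implementation: send_all is a hypothetical name, and send_one stands in for whatever coroutine performs a single POST.

```python
import asyncio


async def send_all(payloads, send_one, pool_size=8):
    """Run send_one for every payload with at most pool_size in flight."""
    semaphore = asyncio.Semaphore(pool_size)

    async def bounded(payload):
        async with semaphore:
            return await send_one(payload)

    # gather preserves input order in its results.
    return await asyncio.gather(*(bounded(p) for p in payloads))


if __name__ == "__main__":
    async def fake_send(payload):
        await asyncio.sleep(0.01)  # simulated network latency
        return 201

    statuses = asyncio.run(send_all([{"n": i} for i in range(20)], fake_send, pool_size=4))
    print(statuses.count(201))
```

A larger pool_size raises concurrency only as far as the API and its database can keep up, which is why the optimal value depends on the deployment.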
Performance & Limitations
Tool performance depends primarily on the performance of the Ed-Fi API, which in turn depends on the compute resources that back it. Typically the bottleneck is write performance to the database backend (SQL Server or Postgres). If you use lightbeam to ingest a large amount of data into an Ed-Fi API (not a recommended use-case), consider temporarily scaling up your database backend.
For reference, we have achieved throughput rates in excess of 100 requests/second against an Ed-Fi ODS & API running in Docker on a laptop.
Changelog
See CHANGELOG.
Contributing
Bugfixes and new features (such as additional transformation operations) are gratefully accepted via pull requests here on GitHub.
Contributions
- Cover image created with DALL·E mini
License
See License.