Spark applications for transforming raw incoming data into a set of schemas for analysis.

Project description

Pulse Telemetry

Welcome to the Pulse Telemetry repository. This project contains Spark applications designed to transform raw telemetry data into a series of structured schemas for comprehensive analysis. The flexible architecture allows for easy extension—developers can deploy additional Spark applications to create new schemas or enhance existing ones to meet evolving analytical needs.

Schema Overview

Telemetry

Enhanced schema for individual records from the battery:

Identifiers: device ID, test ID, step number, etc.
Instantaneous Quantities: timestamp, voltage, power, etc., measured at each individual record.
Differential Quantities: duration, differential capacity, etc., changes between consecutive records.
Accumulated Quantities: step duration, step capacity charged, etc., total values accumulated over the step.
User-Defined Data: auxiliary measurements (e.g., temperature) and metadata for additional context.

Step Statistics

Aggregation of telemetry data at the charge/discharge step level:

Time: start time, end time, and duration of the step.
Instantaneous Aggregations: minimum, maximum, start, end, etc., during the step.
Step Accumulations: total step capacity and energy.
Data Diagnostics: maximum voltage delta, maximum duration, etc., for diagnosing data resolution.
User-Defined Data: aggregated auxiliary measurements and metadata.

Cycle Statistics

Aggregation of telemetry data at the cycle level:

Time: start time, end time, and duration of the step.
Instantaneous Aggregations: minimum, maximum, start, end, etc., during the cycle.
Cycle Accumulations: total cycle capacity and energy.
Data Diagnostics: maximum voltage delta, maximum duration, etc., for diagnosing data resolution.
User-Defined Data: aggregated auxiliary measurements and metadata.

Supported Test Vendors

Coming soon...

Developer Notes

See the developer guide for notes on configuring your development environment.

Test Coverage

Data Generators: We've implemented mock data sources that can asynchronously produce data, configurable to emit data at a desired frequency, such as a series of test or asset channels.
Unit Tests: Validate the correctness of individual transformations, ensuring each transformation produces correct values and handles edge cases, such as null values or empty input data.
Integration Tests: Verify the data source and sink connectors, ensuring that the application can successfully connect to, read from, and write to external systems. Rely on Hive and Minio running in a local kind cluster.
End-to-end Tests: Confirm the behavior of applications within Kubernetes, ensuring that Spark applications are correctly deployed as custom resources. These tests check the end-to-end functionality, from deployment to data processing, ensuring that the applications behave as expected when deployed in the cluster.

Partitions and Sort Order

Partitioning: The default partitioning scheme in the application is tuned assuming that individual channels are generating data at ~1 Hz acquisition frequency. For datasets with higher or lower acquisition frequency, you may want to tune the time based partition in the telemetry table.
Sort Order: Adding sort-ordering on cycle, step, and record number can speed up end-user queries. We have opted for global sort order within partitions, which has a tradeoff of inducing some write overhead. Between partitions on test-id and sort ordering, telemetry reads are highly optimized.

UTC as the Standard Timezone

Spark Timezones: Timestamps in Spark are timezone naive, meaning they do not store timezone info. To consistently manage the timezone that timestamps are created in within your Spark application, ensure that you set the timezone in the Spark session to UTC builder.config("spark.sql.session.timeZone", "UTC").
Python-Spark Conversions: When Python datetime objects are read into Spark, they will be adjusted to the session timezone. When the timestamp is read back into python, the time will be adjusted to the local timezone, but no timezone info will be attached. This is ambiguous, so it is recommended to convert these to UTC ts.astimezone(datetime.UTC).
Additional Timestamp Sources: Data entering the system from outside sources will be assumed to be in UTC time. Within this project, we have made every effort to work with UTC as the standard to facilitate logical reasoning about timed quantities.

Reservation of Window Functions

Window Function Usage: Window functions, such as lead() and lag(), significantly increase the size of state within a Spark application, which can lead to increased memory usage and complexity, especially in streaming contexts. To mitigate these issues, we have made a conscious decision to construct our schemas in a way that they are used in preprocessing steps only.
Trade-off and Data Quality: We acknowledge that this design introduces a trade-off: errors due to late-arriving data during preprocessing will propagate down the pipeline. This may affect the accuracy of derived metrics if late data modifies the computed fields (e.g. timeseries.duration__s). However, the benefits of reduced complexity, improved performance, and lower memory usage outweigh the potential risks.

Project details

Release history Release notifications | RSS feed

This version

0.1.1

Oct 27, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pulse_telemetry-0.1.1.tar.gz (22.7 kB view details)

Uploaded Oct 27, 2024 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pulse_telemetry-0.1.1-py3-none-any.whl (24.9 kB view details)

Uploaded Oct 27, 2024 Python 3

File details

Details for the file pulse_telemetry-0.1.1.tar.gz.

File metadata

Download URL: pulse_telemetry-0.1.1.tar.gz
Upload date: Oct 27, 2024
Size: 22.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for pulse_telemetry-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`a8ced3117ed8d48b1f8bea785742ac4320d82a9971b9627ed4a4ad43bd1c3b41`
MD5	`f04c0ac36c2666efd4f45770c6ebafdc`
BLAKE2b-256	`c5ee4d276a0603d62dd30923be8e4bca9be350860ddd2fafeebc6bc7ef5b56e1`

See more details on using hashes here.

File details

Details for the file pulse_telemetry-0.1.1-py3-none-any.whl.

File metadata

Download URL: pulse_telemetry-0.1.1-py3-none-any.whl
Upload date: Oct 27, 2024
Size: 24.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for pulse_telemetry-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f6940ba618eff7f9cadd1b3301084f59aa7294166f0bb5ffefbc36406a25259f`
MD5	`2595f0ec64a39086edbac4e1853d7c62`
BLAKE2b-256	`a78bc469aee733aa2f270b50d1cb4516fb21d76483fe181862254bce9618f063`

See more details on using hashes here.

pulse-telemetry 0.1.1

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

Pulse Telemetry

Schema Overview

Telemetry

Step Statistics

Cycle Statistics

Supported Test Vendors

Developer Notes

Test Coverage

Partitions and Sort Order

UTC as the Standard Timezone

Reservation of Window Functions

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes