Skip to main content

Spark applications for transforming raw incoming data into a set of schemas for analysis.

Project description

Pulse Telemetry

Welcome to the Pulse Telemetry repository. This project contains Spark applications designed to transform raw telemetry data into a series of structured schemas for comprehensive analysis. The flexible architecture allows for easy extension—developers can deploy additional Spark applications to create new schemas or enhance existing ones to meet evolving analytical needs.

Schema Overview

Telemetry

Enhanced schema for individual records from the battery:

  • Identifiers: device ID, test ID, step number, etc.
  • Instantaneous Quantities: timestamp, voltage, power, etc., measured at each individual record.
  • Differential Quantities: duration, differential capacity, etc., changes between consecutive records.
  • Accumulated Quantities: step duration, step capacity charged, etc., total values accumulated over the step.
  • User-Defined Data: auxiliary measurements (e.g., temperature) and metadata for additional context.

Step Statistics

Aggregation of telemetry data at the charge/discharge step level:

  • Time: start time, end time, and duration of the step.
  • Instantaneous Aggregations: minimum, maximum, start, end, etc., during the step.
  • Step Accumulations: total step capacity and energy.
  • Data Diagnostics: maximum voltage delta, maximum duration, etc., for diagnosing data resolution.
  • User-Defined Data: aggregated auxiliary measurements and metadata.

Cycle Statistics

Aggregation of telemetry data at the cycle level:

  • Time: start time, end time, and duration of the step.
  • Instantaneous Aggregations: minimum, maximum, start, end, etc., during the cycle.
  • Cycle Accumulations: total cycle capacity and energy.
  • Data Diagnostics: maximum voltage delta, maximum duration, etc., for diagnosing data resolution.
  • User-Defined Data: aggregated auxiliary measurements and metadata.

Supported Test Vendors

Coming soon...

Developer Notes

See the developer guide for notes on configuring your development environment.

Test Coverage

  • Data Generators: We've implemented mock data sources that can asynchronously produce data, configurable to emit data at a desired frequency, such as a series of test or asset channels.

  • Unit Tests: Validate the correctness of individual transformations, ensuring each transformation produces correct values and handles edge cases, such as null values or empty input data.

  • Integration Tests: Verify the data source and sink connectors, ensuring that the application can successfully connect to, read from, and write to external systems. Rely on Hive and Minio running in a local kind cluster.

  • End-to-end Tests: Confirm the behavior of applications within Kubernetes, ensuring that Spark applications are correctly deployed as custom resources. These tests check the end-to-end functionality, from deployment to data processing, ensuring that the applications behave as expected when deployed in the cluster.

Partitions and Sort Order

  • Partitioning: The default partitioning scheme in the application is tuned assuming that individual channels are generating data at ~1 Hz acquisition frequency. For datasets with higher or lower acquisition frequency, you may want to tune the time based partition in the telemetry table.

  • Sort Order: Adding sort-ordering on cycle, step, and record number can speed up end-user queries. We have opted for global sort order within partitions, which has a tradeoff of inducing some write overhead. Between partitions on test-id and sort ordering, telemetry reads are highly optimized.

UTC as the Standard Timezone

  • Spark Timezones: Timestamps in Spark are timezone naive, meaning they do not store timezone info. To consistently manage the timezone that timestamps are created in within your Spark application, ensure that you set the timezone in the Spark session to UTC builder.config("spark.sql.session.timeZone", "UTC").

  • Python-Spark Conversions: When Python datetime objects are read into Spark, they will be adjusted to the session timezone. When the timestamp is read back into python, the time will be adjusted to the local timezone, but no timezone info will be attached. This is ambiguous, so it is recommended to convert these to UTC ts.astimezone(datetime.UTC).

  • Additional Timestamp Sources: Data entering the system from outside sources will be assumed to be in UTC time. Within this project, we have made every effort to work with UTC as the standard to facilitate logical reasoning about timed quantities.

Reservation of Window Functions

  • Window Function Usage: Window functions, such as lead() and lag(), significantly increase the size of state within a Spark application, which can lead to increased memory usage and complexity, especially in streaming contexts. To mitigate these issues, we have made a conscious decision to construct our schemas in a way that they are used in preprocessing steps only.

  • Trade-off and Data Quality: We acknowledge that this design introduces a trade-off: errors due to late-arriving data during preprocessing will propagate down the pipeline. This may affect the accuracy of derived metrics if late data modifies the computed fields (e.g. timeseries.duration__s). However, the benefits of reduced complexity, improved performance, and lower memory usage outweigh the potential risks.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pulse_telemetry-0.1.1.tar.gz (22.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pulse_telemetry-0.1.1-py3-none-any.whl (24.9 kB view details)

Uploaded Python 3

File details

Details for the file pulse_telemetry-0.1.1.tar.gz.

File metadata

  • Download URL: pulse_telemetry-0.1.1.tar.gz
  • Upload date:
  • Size: 22.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for pulse_telemetry-0.1.1.tar.gz
Algorithm Hash digest
SHA256 a8ced3117ed8d48b1f8bea785742ac4320d82a9971b9627ed4a4ad43bd1c3b41
MD5 f04c0ac36c2666efd4f45770c6ebafdc
BLAKE2b-256 c5ee4d276a0603d62dd30923be8e4bca9be350860ddd2fafeebc6bc7ef5b56e1

See more details on using hashes here.

File details

Details for the file pulse_telemetry-0.1.1-py3-none-any.whl.

File metadata

File hashes

Hashes for pulse_telemetry-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 f6940ba618eff7f9cadd1b3301084f59aa7294166f0bb5ffefbc36406a25259f
MD5 2595f0ec64a39086edbac4e1853d7c62
BLAKE2b-256 a78bc469aee733aa2f270b50d1cb4516fb21d76483fe181862254bce9618f063

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page