Reading and writing CBOR files with PySpark

pyspark-cbor

This project is still in development and is by no means ready for production use. See the TODOs section for more information.

This library implements a custom Spark data source, cbor, built with the new Python Data Source API for the upcoming Apache Spark 4.0 release. For an in-depth understanding of the API, please refer to the API source code.

Supported features

  • Supports all CBOR data types, including nested structures. Caveat: CBOR is more flexible than a Spark schema, so some data may be lost. See the Permissive mode section for more information.
  • Reads CBOR file(s) (in parallel) with a specified schema
  • Reads CBOR file(s) with base64-encoded values
  • Reads CBOR file(s) from a local file path or from Azure, AWS, or GCP storage

Installation

pip install pyspark-cbor==0.1.0

Example

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark_cbor import CBORDataSource

# Initialize Spark session
spark = (
    SparkSession.builder
    .appName("CBOR Data Source Example")
    .getOrCreate()
)

# Register the CBOR data source
spark.dataSource.register(CBORDataSource)

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True)
])

# Read CBOR file
df = spark.read.format("cbor").schema(schema).load("path/to/your/file.cbor")

# Show the DataFrame
df.show()

Settings

  • path: The path to the CBOR file or directory containing CBOR files.
  • schema: The schema of the CBOR file. Currently, the schema cannot be specified as a DDL string; pass a StructType instead.
  • mode: The mode of reading the CBOR file. The default mode is PERMISSIVE.

Options

  • base64_decoded (default: False): Whether to decode base64-encoded values before parsing them as CBOR. Decoding is always applied if the file name ends with .b64.
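
The decoding rule for base64_decoded can be sketched in plain Python. This is only an illustration of the rule as described, not the library's actual code: the helper name maybe_b64_decode and its parameters are hypothetical, and only the standard-library base64 module is used.

```python
import base64


def maybe_b64_decode(raw: bytes, path: str, base64_decoded: bool = False) -> bytes:
    """Return the bytes that would be handed to a CBOR parser.

    Hypothetical helper: decodes when the option is set or when the
    file name ends with .b64, mirroring the documented rule.
    """
    if base64_decoded or path.endswith(".b64"):
        return base64.b64decode(raw)
    return raw


# 0xA0 is the CBOR encoding of an empty map.
payload = b"\xa0"
encoded = base64.b64encode(payload)

# Option set explicitly: the content is decoded.
print(maybe_b64_decode(encoded, "data.cbor", base64_decoded=True) == payload)
# Option left at its default, but the .b64 suffix forces decoding.
print(maybe_b64_decode(encoded, "data.cbor.b64") == payload)
# Neither condition holds: the bytes pass through unchanged.
print(maybe_b64_decode(payload, "data.cbor") == payload)
```

In the reader itself the option would be passed through Spark's usual .option("base64_decoded", ...) mechanism; how the value is parsed there is not covered by this sketch.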

Limitations

  • Not yet used anywhere except on a local machine.
  • Nested structures are parsed recursively, so the maximum nesting depth is limited by Python's maximum recursion depth.
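
To make the recursion limit concrete, here is a small standalone sketch. It does not use the library; nesting_depth is a hypothetical stand-in for any recursive walk over decoded data, and the nested value is built from plain Python lists rather than real CBOR.

```python
import sys


def nesting_depth(value) -> int:
    """Recursively measure how deeply lists and dicts are nested."""
    if isinstance(value, dict):
        children = list(value.values())
    elif isinstance(value, list):
        children = value
    else:
        return 0
    depth = 0
    for child in children:
        depth = max(depth, nesting_depth(child))
    return depth + 1


def build_nested_list(depth: int):
    """Build [[[... "leaf" ...]]] iteratively, so construction never recurses."""
    value = "leaf"
    for _ in range(depth):
        value = [value]
    return value


def try_depth(value):
    """Return the nesting depth, or None if Python's recursion limit is hit."""
    try:
        return nesting_depth(value)
    except RecursionError:
        return None


shallow = build_nested_list(10)
deep = build_nested_list(sys.getrecursionlimit() + 100)

print(try_depth(shallow))        # 10
print(try_depth(deep) is None)   # True: the recursive walk exceeded the limit
```

Python's sys.setrecursionlimit can raise the limit, at the risk of exhausting the interpreter's stack; whether that is appropriate for a given Spark deployment is a separate question.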

Permissive mode

Spark converts CBOR data into an Arrow DataFrame based on the provided schema. Not all CBOR data can be represented in the schema.
This library tries to preserve as much data as possible, even if the schema doesn't match. However, the following happens in permissive mode:

  • If a field is undefined, it will be set to null.
  • If a field is defined but the value is not present, it will be set to null.

Integers will be set to null if they exceed the maximum value of the corresponding Spark type:

  • IntegerType: 2147483647
  • LongType: 9223372036854775807

In DecimalType, the precision is limited to 38 digits. Infinity and NaN are not supported and will be converted to null.

I am not sure whether CBOR tags and other special types are implemented correctly; they might or might not work as expected. Let me know if you find any issues.
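
The integer and decimal rules of permissive mode can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the function names are hypothetical, and the lower bounds for integers are assumed to be the usual two's-complement minimums, since only the maximums are documented here.

```python
from decimal import Decimal

INT_MAX = 2147483647            # IntegerType maximum
LONG_MAX = 9223372036854775807  # LongType maximum


def coerce_integer(value: int, max_value: int):
    """Return value if it fits the Spark type, otherwise None (null)."""
    # Assumption: the minimum mirrors the maximum (two's complement).
    min_value = -max_value - 1
    if min_value <= value <= max_value:
        return value
    return None


def coerce_decimal(value: Decimal):
    """Return None (null) for NaN and infinity, otherwise the value itself."""
    if not value.is_finite():
        return None
    return value


print(coerce_integer(2147483647, INT_MAX))   # 2147483647
print(coerce_integer(2147483648, INT_MAX))   # None
print(coerce_integer(2**63, LONG_MAX))       # None
print(coerce_decimal(Decimal("NaN")))        # None
print(coerce_decimal(Decimal("Infinity")))   # None
print(coerce_decimal(Decimal("1.5")))        # 1.5
```

The sketch deliberately leaves out decimals with more than 38 digits of precision, because the behaviour for that case is not specified above.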

TODOs

  • Add more tests
  • Add StreamReader
  • Add Writer
  • Add StreamWriter
  • Add more options, such as RESTRICTIVE mode
  • Add more examples
  • Add more documentation
  • Add more error handling
  • Support string ddl schema specification
  • Add more logging
  • Add more performance optimizations: e.g., can file splitting be done?

Contributing

Feel free to contribute to this project. As you can see, there is still a lot of work to be done.

Download files

Source Distribution

pyspark_cbor-0.1.1.tar.gz (74.2 kB view details)

Uploaded Source

Built Distribution

pyspark_cbor-0.1.1-py3-none-any.whl (6.1 kB view details)

Uploaded Python 3

File details

Details for the file pyspark_cbor-0.1.1.tar.gz.

File metadata

  • Download URL: pyspark_cbor-0.1.1.tar.gz
  • Size: 74.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for pyspark_cbor-0.1.1.tar.gz
Algorithm Hash digest
SHA256 9aefd11e4a4c7979bfa22de252d9ad39d7d2f2c3d683bb48a8afa82f0513b363
MD5 36377ca62f6e73d5573da5818a38280f
BLAKE2b-256 bf08dc4318393e77f8b4dc339bf1556459588cc983330aca6ce2cf25161f2870

Provenance

The following attestation bundles were made for pyspark_cbor-0.1.1.tar.gz:

Publisher: build-and-publish.yml on dan1elt0m/pyspark-cbor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pyspark_cbor-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: pyspark_cbor-0.1.1-py3-none-any.whl
  • Size: 6.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for pyspark_cbor-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 4ca294e88826b8e28f90a9f0de5ed8d03d5c69f2c134f44c1843a4e29287e4cc
MD5 08d9123fac0c57aa0ac7f05780facf02
BLAKE2b-256 c55437fee7369e8dbc3c78f74e594004335ca96338e50f0da1b84ce92e7dc55a

Provenance

The following attestation bundles were made for pyspark_cbor-0.1.1-py3-none-any.whl:

Publisher: build-and-publish.yml on dan1elt0m/pyspark-cbor
