Reading and writing CBOR files with PySpark
pyspark-cbor
This project is still in development and is by no means ready for production use. See the TODOs section for more information.
This library implements a custom Spark data source, cbor, built using the new Python Data Source API for the upcoming Apache Spark 4.0 release.
For an in-depth understanding of the API, please refer to the API source code.
Supported features
- Support all CBOR data types, including nested structures. Caveat: CBOR is more flexible than Spark's schema, so some data may be lost. See the Permissive mode section for more information.
- Read CBOR file(s) (in parallel) with a specified schema
- Read CBOR file(s) with base64 encoded values
- Read CBOR file(s) from a local file path or from Azure, AWS, or GCP storage
Installation
pip install pyspark-cbor==0.1.0
Example
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark_cbor import CBORDataSource
# Initialize Spark session
spark = (
    SparkSession.builder
    .appName("CBOR Data Source Example")
    .getOrCreate()
)
# Register the CBOR data source
spark.dataSource.register(CBORDataSource)
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True),
])
# Read CBOR file
df = spark.read.format("cbor").schema(schema).load("path/to/your/file.cbor")
# Show the DataFrame
df.show()
Settings
- path: The path to the CBOR file or directory containing CBOR files.
- schema: The schema of the CBOR file. Currently, the schema can't be specified as a string.
- mode: The mode of reading the CBOR file. The default mode is PERMISSIVE.
Options
base64_decoded (default: False): Whether to decode base64-encoded values before parsing with CBOR. Values are always decoded if the file name ends with .b64.
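The decoding rule described above (decode when the option is set, or whenever the file name ends with .b64) can be sketched in plain Python. This is only an illustration of the documented behaviour, not the library's code: `maybe_b64_decode` is a hypothetical helper and the file paths are made up.

```python
import base64


def maybe_b64_decode(raw: bytes, path: str, base64_decoded: bool = False) -> bytes:
    """Return the bytes that would be handed to the CBOR parser.

    Hypothetical helper, not part of pyspark-cbor. It mirrors the documented
    rule: decode if the option is set or if the file name ends with ".b64".
    """
    if base64_decoded or path.endswith(".b64"):
        return base64.b64decode(raw)
    return raw


# 0xA1 0x61 0x61 0x01 is the CBOR encoding of the map {"a": 1}.
cbor_bytes = bytes([0xA1, 0x61, 0x61, 0x01])
encoded = base64.b64encode(cbor_bytes)

# Decoded because of the file suffix, even though the option is off.
decoded_by_suffix = maybe_b64_decode(encoded, "data/sample.cbor.b64")
# Decoded because the option is switched on.
decoded_by_option = maybe_b64_decode(encoded, "data/sample.cbor", base64_decoded=True)
# Passed through unchanged: no suffix, option off.
untouched = maybe_b64_decode(cbor_bytes, "data/sample.cbor")
```

With the data source itself, the option would presumably be passed through the standard reader call, for example `spark.read.format("cbor").schema(schema).option("base64_decoded", True).load(path)`. Check the source code for the exact values the option accepts.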
Limitations
- Not yet used anywhere except on a local machine
- Nested structures are recursively parsed. This means that the maximum depth of the nested structure is limited by the maximum recursion depth of Python.
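To get a feel for the recursion limitation, the nesting depth of an already decoded record can be compared with Python's recursion limit. The sketch below is a rough check under assumptions: `nesting_depth` is a hypothetical helper, and a real recursive parser may use more than one stack frame per nesting level, so the usable depth is likely lower than `sys.getrecursionlimit()`.

```python
import sys


def nesting_depth(value) -> int:
    """Depth of nested dicts/lists, computed iteratively (no recursion).

    Scalars have depth 0, a flat dict or list has depth 1, and so on.
    Hypothetical helper for illustration; not part of pyspark-cbor.
    """
    max_depth = 0
    stack = [(value, 1)]
    while stack:
        current, depth = stack.pop()
        if isinstance(current, dict):
            children = list(current.values())
        elif isinstance(current, (list, tuple)):
            children = list(current)
        else:
            continue  # scalars do not add depth
        max_depth = max(max_depth, depth)
        stack.extend((child, depth + 1) for child in children)
    return max_depth


shallow = {"name": "Ada", "tags": ["x", "y"], "address": {"city": "Paris"}}

# Build a list nested 5001 levels deep without recursing.
deep = []
for _ in range(5000):
    deep = [deep]

shallow_depth = nesting_depth(shallow)
deep_depth = nesting_depth(deep)

# Rough indicator only: a recursive parser needs at least one frame per level.
deep_may_exceed_limit = deep_depth >= sys.getrecursionlimit()
```

The recursion limit can be raised with `sys.setrecursionlimit`, but doing so on Spark workers trades the error for a higher risk of a stack overflow, so flattening very deep data before encoding is usually the safer choice.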
Permissive mode
Spark converts CBOR data into an Arrow DataFrame based on the provided schema.
Not all CBOR data can be represented in the schema.
This library tries to preserve as much data as possible, even if the schema doesn't match.
However, the following happens in permissive mode:
- If a field is undefined, it will be set to null.
- If a field is defined but the value is not present, it will be set to null.
- Integers will be set to null if they exceed the maximum value of the corresponding Spark type:
  - IntegerType: 2147483647
  - LongType: 9223372036854775807
- In DecimalType, the precision is limited to 38 digits. Infinity and NaN are not supported and will be converted to null.
- Not sure if I implemented the CBOR tags and other special types correctly. Might or might not work as expected.
- Let me know if you find any issues.
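The permissive-mode rules listed above can be illustrated with a small plain-Python sketch. It is not the library's implementation: all names (`UNDEFINED`, `int_or_null`, `decimal_or_null`, `select_fields`) are hypothetical, and the handling of the lower integer bound is an assumption, since only the maximum values are documented.

```python
from decimal import Decimal

INT_MAX = 2**31 - 1    # 2147483647, IntegerType
LONG_MAX = 2**63 - 1   # 9223372036854775807, LongType

UNDEFINED = object()   # hypothetical stand-in for CBOR's "undefined" value


def int_or_null(value, max_value):
    """Return None for integers outside the range of the Spark type.

    Assumption: the lower bound (-max_value - 1) is treated like the upper
    bound; the documentation only mentions the maximum.
    """
    if value is None or value is UNDEFINED:
        return None
    if value > max_value or value < -max_value - 1:
        return None
    return value


def decimal_or_null(value):
    """Return None for NaN and infinity; finite values become Decimal."""
    if value is None or value is UNDEFINED:
        return None
    number = value if isinstance(value, Decimal) else Decimal(str(value))
    return number if number.is_finite() else None


def select_fields(record: dict, field_names):
    """Pick schema fields; missing or undefined values become None."""
    row = {}
    for name in field_names:
        value = record.get(name)
        row[name] = None if value is UNDEFINED else value
    return row


# "city" is absent, "nickname" is undefined, "age" overflows IntegerType.
record = {"name": "Ada", "age": 2**40, "nickname": UNDEFINED}
row = select_fields(record, ["name", "age", "city", "nickname"])
row["age"] = int_or_null(row["age"], INT_MAX)
```

Here `row` ends up with only `name` populated. Declaring `age` as LongType instead would keep the value, because 2**40 fits below the LongType maximum.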
TODOs
- Add more tests
- Add StreamReader
- Add Writer
- Add StreamWriter
- Add more options, such as RESTRICTIVE mode
- Add more examples
- Add more documentation
- Add more error handling
- Support string ddl schema specification
- Add more logging
- Add more performance optimizations: e.g., can file splitting be done?
Contributing
Feel free to contribute to this project. As you can see, there is still a lot of work to be done.