Tekumara build of Apache Spark with Hadoop 3.1
Tekumara build of Apache PySpark with Hadoop 3.x
A build of Apache PySpark that uses the hadoop-cloud Maven profile to bundle hadoop-aws 3.x, which contains the S3A connector.
Install
See Releases
Usage
To use pyspark with temporary STS credentials:
pyspark --driver-java-options "-Dspark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider"
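The same provider can also be set from Python when a session is built, rather than on the command line. The following is a minimal sketch, assuming the temporary credentials are exported in the standard AWS environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN). The helper names sts_spark_conf and build_session are hypothetical and not part of this project.

```python
import os

TEMPORARY_PROVIDER = "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider"


def sts_spark_conf(env=None):
    """Return Spark conf entries for S3A with temporary STS credentials.

    Hypothetical helper: reads the standard AWS environment variables
    (assumed to be set) and maps them onto the S3A credential options.
    """
    env = os.environ if env is None else env
    return {
        "spark.hadoop.fs.s3a.aws.credentials.provider": TEMPORARY_PROVIDER,
        "spark.hadoop.fs.s3a.access.key": env["AWS_ACCESS_KEY_ID"],
        "spark.hadoop.fs.s3a.secret.key": env["AWS_SECRET_ACCESS_KEY"],
        "spark.hadoop.fs.s3a.session.token": env["AWS_SESSION_TOKEN"],
    }


def build_session(conf):
    """Apply the entries to a new SparkSession (requires pyspark)."""
    # Imported lazily so sts_spark_conf can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling build_session(sts_spark_conf()) would then give a session whose S3A connector uses the temporary credentials.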
To modify an existing Spark session to use S3A for S3 URLs, for example spark in the pyspark shell:
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
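The one-liner above can be wrapped in a small function so it is reusable outside the shell. This is only a sketch: use_s3a_for_s3_urls is a hypothetical name, and it relies on the same private _jsc attribute as the one-liner, which may change between PySpark versions.

```python
S3A_IMPL = "org.apache.hadoop.fs.s3a.S3AFileSystem"


def use_s3a_for_s3_urls(spark):
    """Point the s3:// scheme of an existing session at S3A.

    Returns the value now configured for fs.s3.impl.
    """
    # _jsc is a private PySpark attribute exposing the Java SparkContext.
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3.impl", S3A_IMPL)
    return hadoop_conf.get("fs.s3.impl")
```

After the call, a read such as spark.read.text("s3://my-bucket/some/path") (bucket and path are placeholders) should be served by S3A.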
See test_s3a.py for an example of using the staging committers.
Rationale
The pyspark distribution on PyPI ships with Hadoop 2.7 and no cloud jars (i.e. hadoop-aws), so common practice is to use hadoop-aws 2.7.3 as follows:
pyspark --packages "org.apache.hadoop:hadoop-aws:2.7.3" --driver-java-options "-Dspark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem"
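However, later versions of hadoop-aws cannot be used this way without errors, because hadoop-aws needs to match the version of the Hadoop libraries bundled with the distribution.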
This project builds a pyspark distribution from source with Hadoop 3.x.
Later versions of hadoop-aws contain the following new features:
- The 2.8 release line contains S3A improvements to support any AWSCredentialsProvider.
- The 2.9 release line contains S3Guard, which provides consistency and metadata caching for S3A via a backing DynamoDB metadata store.
- The 3.1 release line incorporates HADOOP-13786, which contains optimised job committers including the Netflix staging committers (Directory and Partitioned) and the Magic committers. See committers and committer architecture.
- The 3.2 release line contains an enhanced S3A connector and S3Guard, including better resilience to throttled AWS S3 and DynamoDB IO.
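The release-line list above can be encoded as a small lookup for checking what a given Hadoop version offers. This is an illustrative sketch only: the names FEATURES_BY_RELEASE_LINE, release_line and features_available are hypothetical, and the code assumes that a feature stays available in every later release line, which may not hold for all future Hadoop versions.

```python
# Feature introduced by each hadoop-aws release line, as listed above.
FEATURES_BY_RELEASE_LINE = {
    (2, 8): "S3A support for any AWSCredentialsProvider",
    (2, 9): "S3Guard",
    (3, 1): "S3A job committers (staging and magic)",
    (3, 2): "enhanced S3A connector and S3Guard",
}


def release_line(version):
    """Turn a version string such as '3.1.0' into the tuple (3, 1)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def features_available(version):
    """List features introduced at or before the given version.

    Assumption: features are cumulative across later release lines.
    """
    line = release_line(version)
    return [
        feature
        for introduced, feature in sorted(FEATURES_BY_RELEASE_LINE.items())
        if introduced <= line
    ]
```

In a running session, the bundled Hadoop version can usually be read through the JVM gateway with spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion() and passed to features_available.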
To take advantage of the 3.x release line committers in Spark you also need the binding classes introduced into Spark 3.0.0 by SPARK-23977. For Spark 2.4, the Hortonworks backport from the Hortonworks repo is used instead.
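As a rough sketch of how the binding classes are wired in, the settings below follow the class names documented for Spark 3.x cloud integration. The helper staging_committer_conf is hypothetical, and the class names in the Spark 2.4 backport may differ, so treat this as a starting point rather than the configuration this project ships.

```python
STAGING_COMMITTERS = ("directory", "partitioned")
CONFLICT_MODES = ("fail", "append", "replace")


def staging_committer_conf(name, conflict_mode):
    """Return Spark conf entries selecting an S3A staging committer.

    Hypothetical helper; assumes the Spark 3.x binding classes
    (spark-hadoop-cloud) are on the classpath.
    """
    if name not in STAGING_COMMITTERS:
        raise ValueError(f"unknown staging committer: {name!r}")
    if conflict_mode not in CONFLICT_MODES:
        raise ValueError(f"unknown conflict mode: {conflict_mode!r}")
    return {
        # Which S3A committer to use for s3a:// destinations.
        "spark.hadoop.fs.s3a.committer.name": name,
        # What to do when the destination already contains data.
        "spark.hadoop.fs.s3a.committer.staging.conflict-mode": conflict_mode,
        # Binding classes introduced by SPARK-23977.
        "spark.sql.sources.commitProtocolClass":
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
        "spark.sql.parquet.output.committer.class":
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    }
```

Each entry can be passed to SparkSession.builder.config(key, value) before the session is created.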
Source Distribution
File details
Details for the file pyspark-cloud-0.0.0.tar.gz.
File metadata
- Download URL: pyspark-cloud-0.0.0.tar.gz
- Upload date:
- Size: 463.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.24.0 setuptools/50.3.2 requests-toolbelt/0.9.1 tqdm/4.51.0 CPython/3.9.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 8962d07504fdc8939a2944a7d8ca5971c82400b068326427b5bb795e60d242eb
MD5 | 3b43ff7f1fc3b06c4df2da76835e3bf4
BLAKE2b-256 | 4beb52b50d6cb59a6a71d9e091f06b8f3157a20ba4c6050fc2717743866eb9b6