PySpark Vector Files
Read vector files into a Spark DataFrame with geometry encoded as Well Known Binary (WKB).
Full documentation is available here.
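For readers unfamiliar with WKB, the following minimal sketch shows what the encoding looks like for a single 2D point, using only Python's standard library. It is independent of this package: the helper names encode_point_wkb and decode_point_wkb are hypothetical, and it handles only little-endian 2D points. Real workflows would normally rely on a geometry library rather than packing bytes by hand.

```python
import struct

# WKB layout for a 2D point: a 1-byte byte-order flag, a 4-byte geometry type,
# then two 8-byte IEEE 754 doubles (x, y).
_POINT_FORMAT = "<BIdd"  # "<" means little-endian with no padding
_LITTLE_ENDIAN = 1
_POINT_TYPE = 1


def encode_point_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (hypothetical helper)."""
    return struct.pack(_POINT_FORMAT, _LITTLE_ENDIAN, _POINT_TYPE, x, y)


def decode_point_wkb(wkb: bytes) -> tuple:
    """Decode little-endian WKB for a 2D point back into (x, y)."""
    byte_order, geometry_type, x, y = struct.unpack(_POINT_FORMAT, wkb)
    if byte_order != _LITTLE_ENDIAN or geometry_type != _POINT_TYPE:
        raise ValueError("Only little-endian 2D points are handled in this sketch")
    return (x, y)


wkb = encode_point_wkb(1.0, 2.0)
print(wkb.hex())  # 0101000000000000000000f03f0000000000000040
```

The 21 bytes break down as one byte-order byte, four geometry-type bytes, and sixteen coordinate bytes.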
Requirements
This library was developed using Databricks Runtime 10.4 LTS and uses the versions of python, pandas and pyspark that come pre-installed on that runtime. However, it also requires GDAL 3.4.3, as this is the most recent version of GDAL available from ubuntugis-unstable as of 2022-08-11.
You can install GDAL on your cluster using an init script. See here for an example.
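To see which versions are actually present in a given environment, a small check along the following lines may help. This is a sketch rather than part of the package: the function name installed_versions is hypothetical, and it assumes the GDAL Python bindings are registered under the distribution name GDAL. It reports Python package metadata only, not the system GDAL library.

```python
from importlib import metadata


def installed_versions(names=("pandas", "pyspark", "GDAL")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```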
Install pyspark-vector-files
Within a Databricks notebook
%pip install pyspark-vector-files
From the command line
python -m pip install pyspark-vector-files
Quick start
Read the first layer from a file or files with a given extension into a single Spark DataFrame:
from pyspark_vector_files import read_vector_files
sdf = read_vector_files(
path="/path/to/files/",
suffix=".ext",
)
More examples are available here.
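Before reading a large folder, it can be useful to preview which files share the extension you plan to pass as suffix. The sketch below does this with the standard library only. The helper preview_matching_files is hypothetical, and the recursive matching is an assumption: the library's own file discovery may behave differently.

```python
from pathlib import Path
import tempfile


def preview_matching_files(path, suffix):
    """List files under `path` whose names end with `suffix`, sorted by path."""
    return sorted(p for p in Path(path).rglob(f"*{suffix}") if p.is_file())


# Demonstrate on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.ext", "b.ext", "notes.txt"):
        (Path(tmp) / name).touch()
    matched_names = [p.name for p in preview_matching_files(tmp, ".ext")]

print(matched_names)  # ['a.ext', 'b.ext']
```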
Local development
To ensure compatibility with Databricks Runtime 10.4 LTS, this package was developed on a Linux machine running the Ubuntu 20.04 LTS operating system using Python 3.8.10, GDAL 3.4.3, and Spark 3.2.1.
Install Python 3.8.10 using pyenv
See the pyenv-installer's Installation / Update / Uninstallation instructions.
Install Python 3.8.10 globally:
pyenv install 3.8.10
Then install it locally in the repository you're using:
pyenv local 3.8.10
Install GDAL 3.4.3
Add the UbuntuGIS unstable Private Package Archive (PPA) and update your package list:
sudo add-apt-repository ppa:ubuntugis/ubuntugis-unstable \
&& sudo apt-get update
Install GDAL 3.4.3. I found I also had to install python3-gdal (even though I'm going to use poetry to install it in a virtual environment later) to avoid version conflicts:
sudo apt-get install -y gdal-bin=3.4.3+dfsg-1~focal0 \
libgdal-dev=3.4.3+dfsg-1~focal0 \
python3-gdal=3.4.3+dfsg-1~focal0
Verify the installation:
ogrinfo --version
# GDAL 3.4.3, released 2022/04/22
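The same check can be scripted, for example as part of a setup test. The following is a sketch under stated assumptions: parse_gdal_version and check_gdal are hypothetical helpers, and the version string is assumed to follow the format shown above.

```python
import re
import shutil
import subprocess

EXPECTED_GDAL = (3, 4, 3)


def parse_gdal_version(output):
    """Extract (major, minor, patch) from text like 'GDAL 3.4.3, released ...'."""
    match = re.search(r"GDAL (\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"Unrecognised version string: {output!r}")
    return tuple(int(part) for part in match.groups())


def check_gdal(expected=EXPECTED_GDAL):
    """Return True/False for a version match, or None if ogrinfo is not on PATH."""
    if shutil.which("ogrinfo") is None:
        return None
    result = subprocess.run(
        ["ogrinfo", "--version"], capture_output=True, text=True, check=True
    )
    return parse_gdal_version(result.stdout) == expected


print(parse_gdal_version("GDAL 3.4.3, released 2022/04/22"))  # (3, 4, 3)
```

Calling check_gdal() on a machine set up as described should return True. A return value of None means ogrinfo was not found on the PATH.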
Install poetry 1.1.13
See poetry's osx / linux / bashonwindows install instructions
Clone this repository
git clone https://github.com/Defra-Data-Science-Centre-of-Excellence/pyspark_vector_files.git
Install dependencies using poetry
poetry install
File details
Details for the file pyspark_vector_files-0.2.5.tar.gz.
File metadata
- Download URL: pyspark_vector_files-0.2.5.tar.gz
- Upload date:
- Size: 16.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.3.2 CPython/3.8.10 Linux/5.15.0-1027-aws
File hashes
Algorithm | Hash digest
---|---
SHA256 | 933e8384fa5afa519d6f314352c43e7a6bcf48a7b894249310dd42eb6439f751
MD5 | 5d365ac0dcbc7e2f214c849760a9aee1
BLAKE2b-256 | d90a3270c3e1cf4e74fdd0733fd7845bab2e57a0985d20143f91322feac1cd3d
File details
Details for the file pyspark_vector_files-0.2.5-py3-none-any.whl.
File metadata
- Download URL: pyspark_vector_files-0.2.5-py3-none-any.whl
- Upload date:
- Size: 17.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.3.2 CPython/3.8.10 Linux/5.15.0-1027-aws
File hashes
Algorithm | Hash digest
---|---
SHA256 | b4a10d4e71e7c5eb08bd30934efe3c313b03bf441175569e6b807a27f87bdd93
MD5 | 46dcd32f461d0b421f46b75d1cc347b9
BLAKE2b-256 | 142be41c582e882866c8a55662bbea1e9af25d3b752e864babae4bc59848359e