
Read vector files into a Spark DataFrame with geometry encoded as WKB.


PySpark Vector Files

Read vector files into a Spark DataFrame with geometry encoded as Well Known Binary (WKB).

Full documentation is available here.

Requirements

This library was developed using Databricks Runtime 10.4 LTS and uses the versions of Python, pandas, and PySpark that come pre-installed on that runtime. It also requires GDAL 3.4.3, which was the most recent version of GDAL available from ubuntugis-unstable as of 2022-08-11.

You can install GDAL on your cluster using an init script. See here for an example.

Install pyspark-vector-files

Within a Databricks notebook

%pip install pyspark-vector-files

From the command line

python -m pip install pyspark-vector-files
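To confirm that the install succeeded, you can query the installed distribution's version from Python. This is a minimal sketch using the standard library's importlib.metadata (available from Python 3.8); the helper name installed_version is illustrative and not part of this package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # The distribution name matches the name used in the pip command above.
    found = installed_version("pyspark-vector-files")
    print(found if found is not None else "pyspark-vector-files is not installed")
```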

Quick start

Read the first layer from one or more files with a given extension into a single Spark DataFrame:

from pyspark_vector_files import read_vector_files

sdf = read_vector_files(
    path="/path/to/files/",
    suffix=".ext",
)

More examples are available here.
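Because the geometry is encoded as WKB, each geometry value is a byte string rather than a geometry object. The sketch below shows what those bytes contain for the simplest case, a 2D point: a 1-byte byte-order flag, a 4-byte geometry type, then the coordinates as 8-byte doubles. It uses only the standard library and handles points only; the function names are illustrative, and for real data a library such as shapely (shapely.wkb.loads) is the more practical choice.

```python
import struct
from typing import Tuple

WKB_POINT = 1  # geometry type code for a Point in the WKB format


def encode_point_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (21 bytes)."""
    # Byte-order flag 1 means little-endian; then the type code and the two doubles.
    return struct.pack("<BIdd", 1, WKB_POINT, x, y)


def decode_point_wkb(wkb: bytes) -> Tuple[float, float]:
    """Decode a 2D WKB point, honouring its byte-order flag."""
    byte_order = "<" if wkb[0] == 1 else ">"
    geometry_type, x, y = struct.unpack(byte_order + "Idd", wkb[1:21])
    if geometry_type != WKB_POINT:
        raise ValueError(f"Expected a Point (type 1), got type {geometry_type}")
    return x, y


if __name__ == "__main__":
    encoded = encode_point_wkb(1.0, 2.0)
    print(encoded.hex())
    print(decode_point_wkb(encoded))
```

Other geometry types (lines, polygons, multi-part geometries) follow the same pattern with additional count fields, which is why a dedicated library is usually preferable to hand-written decoding.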

Local development

To ensure compatibility with Databricks Runtime 10.4 LTS, this package was developed on a Linux machine running the Ubuntu 20.04 LTS operating system, using Python 3.8.10, GDAL 3.4.3, and Spark 3.2.1.

Install Python 3.8.10 using pyenv

See the pyenv-installer's Installation / Update / Uninstallation instructions.

Install Python 3.8.10 globally:

pyenv install 3.8.10

Then set it as the local Python version for the repository you're working in:

pyenv local 3.8.10

Install GDAL 3.4.3

Add the UbuntuGIS unstable Personal Package Archive (PPA) and update your package list:

sudo add-apt-repository ppa:ubuntugis/ubuntugis-unstable \
    && sudo apt-get update

Install GDAL 3.4.3. I found I also had to install python3-gdal to avoid version conflicts, even though the Python bindings are installed into a virtual environment by poetry later:

sudo apt-get install -y gdal-bin=3.4.3+dfsg-1~focal0 \
    libgdal-dev=3.4.3+dfsg-1~focal0 \
    python3-gdal=3.4.3+dfsg-1~focal0

Verify the installation:

ogrinfo --version
# GDAL 3.4.3, released 2022/04/22
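The same check can be scripted, which is useful if you want a setup script to fail early on a mismatched GDAL. This is a minimal sketch that calls ogrinfo through subprocess and parses output in the format shown above; the function names and the EXPECTED_GDAL constant are illustrative, not part of this package.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple

EXPECTED_GDAL = (3, 4, 3)  # the version this package was developed against


def parse_gdal_version(output: str) -> Tuple[int, ...]:
    """Extract (major, minor, patch) from text like 'GDAL 3.4.3, released ...'."""
    match = re.search(r"GDAL (\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"Unrecognised version string: {output!r}")
    return tuple(int(part) for part in match.groups())


def installed_gdal_version() -> Optional[Tuple[int, ...]]:
    """Return the GDAL version reported by ogrinfo, or None if ogrinfo is not on PATH."""
    if shutil.which("ogrinfo") is None:
        return None
    result = subprocess.run(
        ["ogrinfo", "--version"], capture_output=True, text=True, check=True
    )
    return parse_gdal_version(result.stdout)


if __name__ == "__main__":
    found = installed_gdal_version()
    if found is None:
        print("ogrinfo not found on PATH")
    elif found != EXPECTED_GDAL:
        print(f"Found GDAL {found}, expected {EXPECTED_GDAL}")
    else:
        print("GDAL version matches")
```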

Install poetry 1.1.13

See poetry's osx / linux / bashonwindows install instructions.

Clone this repository

git clone https://github.com/Defra-Data-Science-Centre-of-Excellence/pyspark_vector_files.git

Install dependencies using poetry

poetry install

