Read vector files into a Spark DataFrame with geometry encoded as WKB.

PySpark Vector Files

Read vector files into a Spark DataFrame with geometry encoded as Well Known Binary (WKB).

Full documentation is available here.

Requirements

This library was developed on Databricks Runtime 10.4 LTS and uses the versions of Python, pandas, and PySpark that come pre-installed on that runtime. It also requires GDAL 3.4.3, which was the most recent version of GDAL available from ubuntugis-unstable as of 2022-08-11.

You can install GDAL on your cluster using an init script. See here for an example.

Install pyspark-vector-files

Within a Databricks notebook

%pip install pyspark-vector-files

From the command line

python -m pip install pyspark-vector-files
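To confirm that the install succeeded, you can ask Python for the installed distribution's version. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of this package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version("pyspark-vector-files"))
```

A result of `None` means pip did not install the package into the interpreter you are running.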

Quick start

Read the first layer from a file or files with a given extension into a single Spark DataFrame:

from pyspark_vector_files import read_vector_files

sdf = read_vector_files(
    path="/path/to/files/",
    suffix=".ext",
)

More examples are available here.
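Since the geometry is encoded as WKB, it may help to see what that encoding looks like. The sketch below is independent of this library and uses only the standard library: it encodes and decodes a 2D point following the WKB layout (a 1-byte byte-order flag, a 4-byte geometry type, then two 8-byte doubles). The function names are hypothetical; for real work you would normally hand the bytes to a geometry library rather than parse them by hand.

```python
import struct

WKB_POINT = 1  # geometry type code for Point in the WKB specification


def encode_wkb_point(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (21 bytes)."""
    # Byte-order flag 1 means little-endian; "<" disables struct padding.
    return struct.pack("<BIdd", 1, WKB_POINT, x, y)


def decode_wkb_point(wkb: bytes) -> tuple:
    """Decode a 2D WKB point into an (x, y) tuple."""
    # The first byte says how the remaining fields are laid out.
    endian = "<" if wkb[0] == 1 else ">"
    (geom_type,) = struct.unpack_from(endian + "I", wkb, 1)
    if geom_type != WKB_POINT:
        raise ValueError(f"expected a Point (type 1), got type {geom_type}")
    x, y = struct.unpack_from(endian + "dd", wkb, 5)
    return (x, y)


point = encode_wkb_point(1.5, -2.0)
print(len(point), decode_wkb_point(point))
```

Other geometry types (lines, polygons, multi-part geometries) use the same byte-order and type header followed by counts and coordinate sequences.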

Local development

To ensure compatibility with Databricks Runtime 10.4 LTS, this package was developed on a Linux machine running the Ubuntu 20.04 LTS operating system using Python 3.8.10, GDAL 3.4.3, and Spark 3.2.1.

Install Python 3.8.10 using pyenv

See the pyenv-installer's Installation / Update / Uninstallation instructions.

Install Python 3.8.10 globally:

pyenv install 3.8.10

Then set it as the local Python version for the repository you're using:

pyenv local 3.8.10

Install GDAL 3.4.3

Add the UbuntuGIS unstable Personal Package Archive (PPA) and update your package list:

sudo add-apt-repository ppa:ubuntugis/ubuntugis-unstable \
    && sudo apt-get update

Install GDAL 3.4.3. I found I also had to install python3-gdal (even though I'm going to use poetry to install it in a virtual environment later) to avoid version conflicts:

sudo apt-get install -y gdal-bin=3.4.3+dfsg-1~focal0 \
    libgdal-dev=3.4.3+dfsg-1~focal0 \
    python3-gdal=3.4.3+dfsg-1~focal0

Verify the installation:

ogrinfo --version
# GDAL 3.4.3, released 2022/04/22
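If you want to check the version from a script rather than by eye, the output line shown above can be parsed. This is a sketch under the assumption that `ogrinfo --version` prints a line of the form `GDAL <major>.<minor>.<patch>, released ...`; both function names are hypothetical.

```python
import re
import subprocess


def parse_gdal_version(version_output: str) -> tuple:
    """Extract (major, minor, patch) from text like 'GDAL 3.4.3, released 2022/04/22'."""
    match = re.search(r"GDAL (\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"could not find a GDAL version in: {version_output!r}")
    return tuple(int(part) for part in match.groups())


def installed_gdal_version() -> tuple:
    """Run ogrinfo and parse its reported version (requires gdal-bin on PATH)."""
    completed = subprocess.run(
        ["ogrinfo", "--version"], capture_output=True, text=True, check=True
    )
    return parse_gdal_version(completed.stdout)


print(parse_gdal_version("GDAL 3.4.3, released 2022/04/22"))
```

Comparing `installed_gdal_version()` against `(3, 4, 3)` in a setup script gives an early failure if a different GDAL build was picked up.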

Install poetry 1.1.13

See poetry's osx / linux / bashonwindows install instructions.

Clone this repository

git clone https://github.com/Defra-Data-Science-Centre-of-Excellence/pyspark_vector_files.git

Install dependencies using poetry

poetry install
