PySpark bindings for H3, a hierarchical hexagonal geospatial indexing system

Project description

h3-pyspark: Uber's H3 Hexagonal Hierarchical Geospatial Indexing System in PySpark

PySpark bindings for the H3 core library.

For available functions, please see the vanilla Python binding documentation at:

uber.github.io/h3-py

Installation

From PyPI:

pip install h3-pyspark

From conda:

conda config --add channels conda-forge
conda install h3-pyspark

Usage

>>> from pyspark.sql import SparkSession, functions as F
>>> import h3_pyspark
>>>
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame([{"lat": 37.769377, "lng": -122.388903, 'resolution': 9}])
>>>
>>> df = df.withColumn('h3_9', h3_pyspark.geo_to_h3('lat', 'lng', 'resolution'))
>>> df.show()

+---------+-----------+----------+---------------+
|      lat|        lng|resolution|           h3_9|
+---------+-----------+----------+---------------+
|37.769377|-122.388903|         9|89283082e73ffff|
+---------+-----------+----------+---------------+
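
The column name h3_9 records the resolution only by convention; the resolution is also stored inside the cell index itself. As a minimal sketch, assuming the standard H3 index bit layout (four resolution bits starting at bit 52) and using a hypothetical helper name `resolution_of`, it can be read back with plain integer operations. The H3 library has its own function for this; the code below only illustrates the layout.

```python
def resolution_of(cell: str) -> int:
    """Read the resolution from an H3 cell given as a hex string.

    Assumes the standard H3 index layout, in which bits 52-55 of the
    64-bit index hold the resolution (0-15). Hypothetical helper, not
    part of h3-pyspark.
    """
    index = int(cell, 16)        # hex string -> integer
    return (index >> 52) & 0xF   # keep only the four resolution bits


# Cell taken from the usage example above.
print(resolution_of("89283082e73ffff"))  # prints 9
```

This kind of check can be handy when a column such as h3_9 is expected to hold cells of a single resolution.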

Extension Functions

There are also various extension functions for common geospatial operations that are not available in the vanilla H3 library.

Assumptions

You use GeoJSON to represent geometries in your PySpark pipeline (as opposed to WKT)
Geometries are stored in a GeoJSON string within a column (such as geometry) in your PySpark dataset
Individual H3 cells are stored as a string column (such as h3_9)
Sets of H3 cells are stored in an array(string) column (such as h3_9)
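
To make these assumptions concrete, here is a minimal sketch of one row as a plain Python dict, with two hypothetical helpers (`is_geojson_geometry` and `is_h3_cell_string` are illustrative names, not part of h3-pyspark) that check the expected layout. It is plain Python rather than Spark code, and the checks are deliberately loose.

```python
import json


def is_geojson_geometry(value: str) -> bool:
    """True if the string parses as JSON and has GeoJSON geometry keys."""
    try:
        obj = json.loads(value)
    except (TypeError, ValueError):
        return False  # WKT such as 'POINT (1 2)' is not valid JSON
    # Only coordinate-based geometries are considered here.
    return isinstance(obj, dict) and "type" in obj and "coordinates" in obj


def is_h3_cell_string(value) -> bool:
    """True if the value is a string that parses as a hexadecimal number."""
    if not isinstance(value, str):
        return False
    try:
        int(value, 16)
    except ValueError:
        return False
    return True


# One row laid out as described above (values reused from this README).
row = {
    "geometry": '{ "type": "Point", "coordinates": [ -80.79527020454407, 32.132884966083935 ] }',
    "h3_9": ["8944d55100fffff"],  # a set of cells, stored as array(string)
}

assert is_geojson_geometry(row["geometry"])
assert not is_geojson_geometry("POINT (-80.795 32.132)")  # WKT is rejected
assert all(is_h3_cell_string(cell) for cell in row["h3_9"])
```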

Indexing

`index_shape(geometry: Column, resolution: Column)`

Generate an H3 spatial index for an input GeoJSON geometry column.

This function accepts GeoJSON Point, LineString, Polygon, MultiPoint, MultiLineString, and MultiPolygon input features, and returns the set of H3 cells at the specified resolution that completely covers them. The result can contain more than one cell when the geometry is large relative to the cell size at that resolution.

The schema of the output column will be T.ArrayType(T.StringType()), where each value in the array is an H3 cell.

This spatial index can then be used for bucketing, clustering, and joins in Spark via an explode() operation.

>>> from pyspark.sql import SparkSession, functions as F
>>> from h3_pyspark.indexing import index_shape
>>> spark = SparkSession.builder.getOrCreate()
>>>
>>> df = spark.createDataFrame([{
        'geometry': '{ "type": "MultiPolygon", "coordinates": [ [ [ [ -80.79442262649536, 32.13522895845023 ], [ -80.79298496246338, 32.13522895845023 ], [ -80.79298496246338, 32.13602844594619 ], [ -80.79442262649536, 32.13602844594619 ], [ -80.79442262649536, 32.13522895845023 ] ] ], [ [ [ -80.7923412322998, 32.1330848437511 ], [ -80.79073190689087, 32.1330848437511 ], [ -80.79073190689087, 32.13375715632646 ], [ -80.7923412322998, 32.13375715632646 ], [ -80.7923412322998, 32.1330848437511 ] ] ] ] }',
        'resolution': 9
    }])
>>>
>>> df = df.withColumn('h3_9', index_shape('geometry', 'resolution'))
>>> df.show()
+----------------------+----------+------------------------------------+
|              geometry|resolution|                                h3_9|
+----------------------+----------+------------------------------------+
| { "type": "MultiP... |         9| [8944d551077ffff, 8944d551073ffff] |
+----------------------+----------+------------------------------------+

Optionally, add another column h3_9_geometry for the GeoJSON representation of each cell in the h3_9 column to easily map the result alongside your original input geometry:

>>> import h3_pyspark
>>> df = df.withColumn('h3_9_geometry', h3_pyspark.h3_set_to_multi_polygon(F.col('h3_9'), F.lit(True)))
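
What explode() does to the indexed column can be pictured in plain Python. The sketch below is not Spark code: it mimics the one-row-per-cell expansion and then groups row ids by cell, which is the bucketing step that a join or aggregation relies on. The row ids and the second row are made up for illustration; the first row's cells come from the output above.

```python
from collections import defaultdict

rows = [
    {"id": "a", "h3_9": ["8944d551077ffff", "8944d551073ffff"]},  # cells from the output above
    {"id": "b", "h3_9": ["8944d551073ffff"]},                     # hypothetical second row
]

# "Explode": emit one (id, cell) record per element of the array column.
exploded = [(row["id"], cell) for row in rows for cell in row["h3_9"]]

# Bucket: group row ids by the cell they fall in.
buckets = defaultdict(list)
for row_id, cell in exploded:
    buckets[cell].append(row_id)

for cell, ids in buckets.items():
    print(cell, ids)
```

Rows that end up in the same bucket share at least one H3 cell, which is what makes the exploded column usable as a join or grouping key.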

Buffers

`k_ring_distinct(cells: Column, distance: Column)`

Takes an array of input cells, performs a k-ring operation on each cell, and returns the distinct set of output cells.

The schema of the output column will be T.ArrayType(T.StringType()), where each value in the array is an H3 cell.

Because the edge length and diameter (2 * edge length) of a cell are known for each H3 resolution, k-rings give an efficient way to generate a "buffered" index of the input geometry, which is useful for operations such as distance joins:

>>> from pyspark.sql import SparkSession, functions as F
>>> from h3_pyspark.indexing import index_shape
>>> from h3_pyspark.traversal import k_ring_distinct
>>> spark = SparkSession.builder.getOrCreate()
>>>
>>> df = spark.createDataFrame([{
        'geometry': '{ "type": "MultiPolygon", "coordinates": [ [ [ [ -80.79442262649536, 32.13522895845023 ], [ -80.79298496246338, 32.13522895845023 ], [ -80.79298496246338, 32.13602844594619 ], [ -80.79442262649536, 32.13602844594619 ], [ -80.79442262649536, 32.13522895845023 ] ] ], [ [ [ -80.7923412322998, 32.1330848437511 ], [ -80.79073190689087, 32.1330848437511 ], [ -80.79073190689087, 32.13375715632646 ], [ -80.7923412322998, 32.13375715632646 ], [ -80.7923412322998, 32.1330848437511 ] ] ] ] }',
        'resolution': 9
    }])
>>>
>>> df = df.withColumn('h3_9', index_shape('geometry', 'resolution'))
>>> df = df.withColumn('h3_9_buffer', k_ring_distinct('h3_9', F.lit(1)))
>>> df.show()
+--------------------+----------+--------------------+--------------------+
|            geometry|resolution|                h3_9|         h3_9_buffer|
+--------------------+----------+--------------------+--------------------+
|{ "type": "MultiP...|         9|[8944d551077ffff,...|[8944d551073ffff,...|
+--------------------+----------+--------------------+--------------------+
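
The "distinct" part matters because neighbouring input cells share most of their k-rings. The sketch below illustrates this on a toy hexagonal grid in axial coordinates, not on real H3 cells. The `k_ring` and `k_ring_distinct` functions here are stand-ins written for this example and only model the behaviour described above; they are not the library's implementation.

```python
from typing import Iterable, Set, Tuple

Cell = Tuple[int, int]  # axial (q, r) coordinates on a toy hex grid


def k_ring(cell: Cell, k: int) -> Set[Cell]:
    """All cells within grid distance k of `cell`, including the cell itself."""
    q, r = cell
    return {
        (q + dq, r + dr)
        for dq in range(-k, k + 1)
        for dr in range(max(-k, -dq - k), min(k, -dq + k) + 1)
    }


def k_ring_distinct(cells: Iterable[Cell], k: int) -> Set[Cell]:
    """Union of the k-rings of every input cell, with duplicates removed."""
    result: Set[Cell] = set()
    for cell in cells:
        result |= k_ring(cell, k)
    return result


single = k_ring((0, 0), 1)                       # one cell plus its 6 neighbours
buffered = k_ring_distinct([(0, 0), (1, 0)], 1)  # two adjacent input cells
print(len(single), len(buffered))                # prints: 7 10
```

Two adjacent cells have 7 cells each in their 1-rings but only 10 distinct cells between them, so deduplicating keeps the buffered index from growing with every overlap.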

Spatial Joins

Once our geometries are indexed, we can join on the H3 string column to get a set of candidate pairs:

>>> from pyspark.sql import SparkSession, functions as F
>>> from h3_pyspark.indexing import index_shape
>>> spark = SparkSession.builder.getOrCreate()
>>>
>>> left = spark.createDataFrame([{
        'id': 'left_point',
        'geometry': '{ "type": "Point", "coordinates": [ -80.79527020454407, 32.132884966083935 ] }',
    }])
>>> right = spark.createDataFrame([{
        'id': 'right_polygon',
        'geometry': '{ "type": "Polygon", "coordinates": [ [ [ -80.80022692680359, 32.12864200501338 ], [ -80.79224467277527, 32.12864200501338 ], [ -80.79224467277527, 32.13378441213715 ], [ -80.80022692680359, 32.13378441213715 ], [ -80.80022692680359, 32.12864200501338 ] ] ] }',
    }])
>>>
>>> left = left.withColumn('h3_9', index_shape('geometry', F.lit(9)))
>>> right = right.withColumn('h3_9', index_shape('geometry', F.lit(9)))
>>>
>>> left = left.withColumn('h3_9', F.explode('h3_9'))
>>> right = right.withColumn('h3_9', F.explode('h3_9'))
>>>
>>> joined = left.join(right, on='h3_9', how='inner')
>>> joined.show()
+---------------+--------------------+----------+--------------------+-------------+
|           h3_9|            geometry|        id|            geometry|           id|
+---------------+--------------------+----------+--------------------+-------------+
|8944d55100fffff|{ "type": "Point"...|left_point|{ "type": "Polygo...|right_polygon|
+---------------+--------------------+----------+--------------------+-------------+

You can combine this technique with a Buffer to do a Distance Join.
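
Two geometries can share more than one cell, so a join on the exploded column may return the same pair several times, and a shared cell only marks the pair as a candidate rather than a confirmed intersection. The plain-Python sketch below models that logic with made-up ids and illustrative cell lists; it is not Spark code and does not use h3-pyspark.

```python
from collections import defaultdict

# Hypothetical indexed inputs: id -> list of H3 cells (illustrative values).
left = {
    "left_point": ["8944d55100fffff"],
    "left_line": ["8944d55100fffff", "8944d551077ffff"],
}
right = {
    "right_polygon": ["8944d55100fffff", "8944d551077ffff"],
    "far_polygon": ["8944d551073ffff"],
}

# Build a lookup from cell to the right-hand ids that cover it.
right_by_cell = defaultdict(set)
for right_id, cells in right.items():
    for cell in cells:
        right_by_cell[cell].add(right_id)

# Probe the lookup with every left-hand cell; the set drops repeated pairs.
candidates = set()
for left_id, cells in left.items():
    for cell in cells:
        for right_id in right_by_cell.get(cell, ()):
            candidates.add((left_id, right_id))

print(sorted(candidates))
```

Here left_line shares two cells with right_polygon but appears only once in the result. In Spark, the same effect is typically obtained by dropping duplicate id pairs after the join, followed by an exact geometry test on the remaining candidates if one is needed.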

Publishing

Bump version in setup.cfg
Publish:

python3 -m build
python3 -m twine upload --repository pypi dist/*

Release history

1.2.6: Mar 10, 2022
1.2.4: Mar 4, 2022
1.2.3: Feb 24, 2022
1.2.2: Jan 5, 2022
1.2.1: Jan 5, 2022
1.2.0: Jan 4, 2022
1.1.0: Dec 6, 2021 (this version)
1.0.0: Nov 25, 2021
0.0.1: Nov 25, 2021

Download files

Source Distribution: h3-pyspark-1.1.0.tar.gz (12.4 kB), uploaded Dec 6, 2021
Built Distribution: h3_pyspark-1.1.0-py3-none-any.whl (11.6 kB, Python 3), uploaded Dec 6, 2021

Hashes for h3-pyspark-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`1efd3db4a61f2adad75b17b8199bf753d39e0797efd7ea7727d90139b72587ba`
MD5	`9c4677b9fda3122016bfb3ccb5433f86`
BLAKE2b-256	`622aadb5624252bc5604b290e845c9e297ef33d1af711d495490da077ca40d4c`

Hashes for h3_pyspark-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ad0e19c27f94baaaf913207d219b512fa6f3cf8b37e34655f3bc1e3ceb6d8b03`
MD5	`21942f960c6b9064e684a7b383a227ea`
BLAKE2b-256	`7d5af54c7dc945db52dff26087cda013f98572d420a8466a9815dcfcdb21d319`