Library for extraction of data from WMS/WFS endpoints

wmsdump

A library and command-line tool for extracting vector layer data from OGC services (WMS, WFS).

Note: This tool only supports vector layers. Raster layers are not supported.

Features

  • Supports WMS and WFS: Extracts data from both Web Map Service (WMS) and Web Feature Service (WFS) endpoints.
  • Flexible Retrieval Modes: Offers OFFSET (paged retrieval) and EXTENT (bbox splitting and drilling down by spatial extent) retrieval modes for efficient data extraction, including handling deduplication with the EXTENT mode.
  • Multiple Retrieval Formats: Supports KML and GeoRSS formats when retrieving data from WMS GetMap operations. Output is always GeoJSONL (GeoJSONSeq).
  • Geometry Precision Control: Allows truncating geometry coordinates to a specified decimal point precision.
  • State Management: Persists extraction state to allow resuming interrupted downloads.
  • GeoServer and QGIS Server Flavor Support: Handles vendor-specific differences for GetFeatureInfo-based retrieval from WMS.
  • Error Handling: Provides informative error messages and handles common service exceptions.
  • Configuration: Customizable through command-line options.
  • KML Postprocessing: Offers options to strip superfluous points in Polygon/LineString geometry collections and to keep or drop the original style-related properties.
  • Hole Punching: Includes a utility that removes overlaps between polygons by punching holes, to work around the shortcomings of GeoRSS-based retrieval.
  • Capabilities Exploration: Can explore services via a GetCapabilities request or by scraping the GeoServer web page. Partial parsing of incomplete or corrupt capabilities XML responses is supported.

Installation

  1. Using pip:

    pip install wmsdump
    
  2. Using uv (recommended):

    wmsdump uses uv for package management and dependency resolution. uv is a faster alternative to pip.

    Installing uv - https://docs.astral.sh/uv/getting-started/installation

    # Install wmsdump using uv
    uv pip install wmsdump
    

    You can also use the tools directly by running

    uvx --from wmsdump wms-extractor <args>
    

    For this invocation, uv creates a temporary virtual environment and manages the dependencies for you.

    For the optional punch-holes feature (needed to use the punch-holes utility), use:

    uv pip install wmsdump[punch-holes]
    

    or

    pip install wmsdump[punch-holes]
    

    For the optional proj feature (needed to retrieve data in projections other than EPSG:4326 or EPSG:3857), use:

    uv pip install wmsdump[proj]
    

    or

    pip install wmsdump[proj]
    

Usage

wmsdump provides a command-line tool wms-extractor with two main commands: explore and extract.

Common Options

The following options are available on both explore and extract subcommands:

  • --log-level: Log level. One of DEBUG,INFO,WARNING,ERROR,CRITICAL. Defaults to INFO.
  • --no-ssl-verify: Switch off SSL verification for all network calls.
  • --request-timeout: Timeout for HTTP requests, in seconds. Default is no timeout.
  • --header: Header to be added to all network requests, in the format "Key:Value". Can be used multiple times.
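
The "Key:Value" format of --header can be illustrated with a small parser. This is a minimal sketch and not wmsdump's own code: the function names are hypothetical, and splitting at the first colon only is an assumption, chosen so that values containing colons (such as URLs) survive intact.

```python
def parse_header(raw: str) -> tuple[str, str]:
    """Split a "Key:Value" string at the first colon only."""
    key, sep, value = raw.partition(":")
    if not sep or not key.strip():
        raise ValueError(f"expected 'Key:Value', got {raw!r}")
    return key.strip(), value.strip()


def build_headers(raw_headers: list[str]) -> dict[str, str]:
    """Collect repeated --header values into one mapping."""
    return dict(parse_header(raw) for raw in raw_headers)


# Two example values, as they might be passed with two --header options.
headers = build_headers([
    "User-Agent: my-crawler/1.0",
    "Referer:https://example.com/map",
])
print(headers)
```

On the command line, the same two headers would be given by repeating the option, once per header.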

1. Explore

The explore command helps discover available layers and service information.

wms-extractor explore --help

Options:

  • --geoserver-url: URL of the GeoServer endpoint. The WMS endpoint is assumed to be <geoserver_url>/ows.
  • --service-url: URL of the WMS/WFS endpoint from which to probe for capabilities. If not provided, it will be derived from geoserver-url.
  • --service: Service to use (WMS or WFS). Defaults to WFS.
  • --service-version: The protocol version to use. Defaults to '1.1.1' for WMS and '1.0.0' for WFS.
  • --namespace: Only look for layers in a given namespace (Geoserver specific).
  • --output-file: File to write the layer list to.
  • --scrape-webpage: Scrape the GeoServer web page instead of reading capabilities. Useful when capabilities are broken.
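
To make the URL derivation above concrete, here is a hedged sketch of how a GetCapabilities URL could be built from a GeoServer base URL. The function name is hypothetical and the exact query string wmsdump sends is an assumption; only the <geoserver_url>/ows convention and the default versions come from the options above, and service/version/request are the standard OGC GetCapabilities parameters.

```python
from urllib.parse import urlencode

# Default protocol versions quoted in the options above.
DEFAULT_VERSIONS = {"WMS": "1.1.1", "WFS": "1.0.0"}


def capabilities_url(geoserver_url, service="WFS", version=None):
    """Build a GetCapabilities URL, assuming the endpoint is <geoserver_url>/ows."""
    base = geoserver_url.rstrip("/") + "/ows"
    query = urlencode({
        "service": service,
        "version": version or DEFAULT_VERSIONS[service],
        "request": "GetCapabilities",
    })
    return f"{base}?{query}"


url = capabilities_url("http://example.com/geoserver")
print(url)
```

Opening such a URL in a browser is a quick way to check whether a capabilities document is available before falling back to --scrape-webpage.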

Examples:

# Explore WFS layers from a GeoServer endpoint
wms-extractor explore --geoserver-url http://example.com/geoserver

# Explore WMS layers from a specific URL
wms-extractor explore --service-url http://example.com/wms --service WMS

# Scrape the GeoServer web page for layers
wms-extractor explore --geoserver-url http://example.com/geoserver --scrape-webpage

# Write layer list to a file
wms-extractor explore --geoserver-url http://example.com/geoserver --output-file layers.txt

2. Extract

The extract command extracts data from a specified layer.

wms-extractor extract --help

Arguments:

  • LAYERNAME: Name of the layer to extract.
  • OUTPUT_FILE: Output file to write the GeoJSONl features to. If not provided, a filename will be derived from the LAYERNAME.
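
Because the output file holds one GeoJSON feature per line, it can be consumed with the standard library alone. The sketch below assumes plain newline-delimited features; the sample features and the helper name are made up for illustration, and an in-memory buffer stands in for the real output file.

```python
import io
import json


def iter_features(lines):
    """Yield one GeoJSON Feature per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for open("output.geojsonl"); the two features are made up.
sample = io.StringIO(
    '{"type": "Feature", "geometry": {"type": "Point", "coordinates": [77.5, 12.9]}, "properties": {"name": "a"}}\n'
    '{"type": "Feature", "geometry": {"type": "Point", "coordinates": [72.8, 19.0]}, "properties": {"name": "b"}}\n'
)

features = list(iter_features(sample))
# Wrap the features in a FeatureCollection for tools that expect plain GeoJSON.
collection = {"type": "FeatureCollection", "features": features}
print(len(collection["features"]))
```

Streaming line by line keeps memory use flat for large layers; building the full FeatureCollection is only sensible for small outputs.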

Options:

  • --output-dir: Directory to write output files in (only used when OUTPUT_FILE is not given). Defaults to the current directory.
  • --geoserver-url: URL of the GeoServer endpoint. service-url is assumed to be <geoserver_url>/[<layer_namespace>/]ows.
  • --service-url: URL of the WMS/WFS endpoint from which to retrieve data. If not provided, it will be derived from geoserver-url.
  • --service: Service to use (WMS or WFS). Defaults to WFS.
  • --service-version: The protocol version to use. Defaults to '1.1.1' for WMS and '1.0.0' for WFS.
  • --retrieval-mode: Which method to use for batch record retrieval (OFFSET, EXTENT, or EXTENT_FIXED_BUFFER). Defaults to OFFSET.
  • --operation: Which operation to use for querying the service. WMS supports GetMap or GetFeatureInfo; WFS uses GetFeature (auto-selected). Defaults to GetMap for WMS.
  • --flavor: Vendor of the WMS service (Geoserver or QGISserver), useful to specify for GetFeatureInfo based retrieval. Defaults to Geoserver.
  • --sort-key: Key to sort by for paged retrieval (needed when the server requires one).
  • --batch-size: Batch size to use for retrieval. Defaults to 1000.
  • --pause-seconds: Amount of time to pause between a batch of requests. Defaults to 2.
  • --requests-to-pause: Number of requests to make before pausing. Defaults to 10.
  • --max-attempts: Number of times to attempt a request before giving up. Defaults to 5.
  • --retry-delay: Number of seconds to wait before retrying on failure (delay is incremented for each failure). Defaults to 5.
  • --geometry-precision: Decimal point precision of geometry to be returned (-1 means no truncation). Defaults to -1.
  • --getmap-format: Format to use while pulling using WMS GetMap (KML or GEORSS). Defaults to KML.
  • --kml-strip-point: Whether to strip the points in Polygon and LineString geometry collections (KML specific). Defaults to True.
  • --kml-keep-original-props: Whether to keep the original style-related properties in KML conversion. Defaults to False.
  • --out-srs: CRS to request data in. Defaults to EPSG:4326.
  • --bounds: Bounding box to restrict the query to (format: <xmin>,<ymin>,<xmax>,<ymax>).
  • --max-box-dims: When querying using EXTENT mode, the maximum size of the bounding box to use (format: <deltax>,<deltay>).
  • --fixed-buffer: Pixel buffer size for EXTENT_FIXED_BUFFER mode. Required when using EXTENT_FIXED_BUFFER retrieval mode with GetFeatureInfo.
  • --wms-map-size: Virtual map size in pixels for WMS requests (default 256). Primarily affects GetFeatureInfo calls where it determines the query point and buffer calculations.
  • --custom-dumper: Path to a Python file containing a SpecialDumper class that subclasses OGCServiceDumper to override default behavior.
  • --skip-index: Skip n elements in index (useful to skip records causing failure, only applicable for OFFSET retrieval). Defaults to 0.
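
The interplay of --bounds and --max-box-dims in EXTENT mode can be pictured as tiling the query area into boxes no larger than the given dimensions. The following is only a sketch of that initial tiling under stated assumptions: the function names are hypothetical, and wmsdump's actual splitting strategy (including how it drills down into boxes that return too many features) is not described here and may differ.

```python
import math


def parse_floats(text, count):
    """Parse a comma-separated list of exactly `count` numbers."""
    parts = [float(p) for p in text.split(",")]
    if len(parts) != count:
        raise ValueError(f"expected {count} comma-separated numbers, got {text!r}")
    return parts


def split_bounds(bounds, max_dims):
    """Tile <xmin>,<ymin>,<xmax>,<ymax> into boxes no larger than <deltax>,<deltay>."""
    xmin, ymin, xmax, ymax = parse_floats(bounds, 4)
    max_dx, max_dy = parse_floats(max_dims, 2)
    nx = max(1, math.ceil((xmax - xmin) / max_dx))
    ny = max(1, math.ceil((ymax - ymin) / max_dy))
    step_x = (xmax - xmin) / nx
    step_y = (ymax - ymin) / ny
    boxes = []
    for i in range(nx):
        for j in range(ny):
            boxes.append((
                xmin + i * step_x,
                ymin + j * step_y,
                xmin + (i + 1) * step_x,
                ymin + (j + 1) * step_y,
            ))
    return boxes


boxes = split_bounds("-180,-90,180,90", "90,90")
print(len(boxes))   # 4 columns x 2 rows
print(boxes[0])
```

A feature that straddles a shared edge between two boxes can be returned by both queries, which is why extent-based retrieval needs deduplication.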

Examples:

# Extract data from a WFS layer
wms-extractor extract my_layer output.geojsonl --geoserver-url http://example.com/geoserver

# Extract data from a WMS layer using GetMap with GeoRSS format
wms-extractor extract my_layer output.geojsonl --service WMS --service-url http://example.com/wms --getmap-format GEORSS

# Extract data and truncate geometry to 3 decimal places
wms-extractor extract my_layer output.geojsonl --geoserver-url http://example.com/geoserver --geometry-precision 3

# Extract data with bounding box
wms-extractor extract my_layer output.geojsonl --geoserver-url http://example.com/geoserver --bounds -180,-90,180,90
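
To show what --geometry-precision 3 means for the coordinates, here is a minimal sketch of truncating a GeoJSON coordinate array. The helper name is hypothetical, and truncation toward zero (rather than rounding) is an assumption based on the wording "truncating" in this README; the tool's internal implementation may differ.

```python
import math


def truncate_coords(coords, precision):
    """Recursively truncate nested coordinate lists; -1 means no truncation."""
    if precision < 0:
        return coords
    if isinstance(coords, (int, float)):
        factor = 10 ** precision
        # math.trunc drops digits toward zero instead of rounding.
        return math.trunc(coords * factor) / factor
    return [truncate_coords(c, precision) for c in coords]


# A made-up LineString for illustration.
line = {
    "type": "LineString",
    "coordinates": [[77.594566, 12.971599], [-73.987654, 40.748817]],
}
line["coordinates"] = truncate_coords(line["coordinates"], 3)
print(line["coordinates"])
```

Three decimal places in EPSG:4326 correspond to roughly a hundred metres at the equator, so low precision values are best kept for coarse datasets.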

3. Deduplicate GeoJSONL

This command removes duplicate features from a GeoJSONL file. Features are considered duplicates if they have identical geometry and properties. Deduplication is performed by hashing features and detecting collisions.

geojsonl-dedupe --help

Arguments:

  • INPUT-FILE: The input GeoJSONl file to deduplicate (required)
  • OUTPUT-FILE: The output GeoJSONl file. If not provided, writes to deduped_<INPUT-FILE>

Options:

  • --log-level, -l: Set logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL). Defaults to INFO.
  • --use-offset/--use-ram: Use file offset for collision checks (default) or keep features in RAM. Using file offset is more memory-efficient for large files.
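
The offset-based collision check can be sketched as follows. This is an illustration under assumptions, not the tool's code: the choice of SHA-256, the canonical form (geometry and properties serialized with sorted keys), and all function names are hypothetical. Only the idea of hashing features and re-reading earlier features by file offset when hashes match comes from the description above.

```python
import hashlib
import io
import json


def canonical_bytes(line):
    """Serialize geometry and properties with sorted keys so key order is irrelevant."""
    feature = json.loads(line)
    core = {"geometry": feature.get("geometry"), "properties": feature.get("properties")}
    return json.dumps(core, sort_keys=True).encode("utf-8")


def dedupe(src, dst):
    """Copy unique features from src to dst (binary file objects; src must be seekable)."""
    seen = {}  # digest -> offsets in src of features already written
    kept = 0
    while True:
        offset = src.tell()
        line = src.readline()
        if not line:
            break
        if not line.strip():
            continue
        canon = canonical_bytes(line)
        digest = hashlib.sha256(canon).digest()
        duplicate = False
        resume = src.tell()
        for prev_offset in seen.get(digest, []):
            # Hash matched: re-read the earlier feature to rule out a collision.
            src.seek(prev_offset)
            if canonical_bytes(src.readline()) == canon:
                duplicate = True
                break
        src.seek(resume)
        if not duplicate:
            seen.setdefault(digest, []).append(offset)
            dst.write(line)
            kept += 1
    return kept


# Made-up input: the second line repeats the first with a different key order.
src = io.BytesIO(
    b'{"type": "Feature", "geometry": {"type": "Point", "coordinates": [1, 2]}, "properties": {"id": 1}}\n'
    b'{"type": "Feature", "properties": {"id": 1}, "geometry": {"type": "Point", "coordinates": [1, 2]}}\n'
    b'{"type": "Feature", "geometry": {"type": "Point", "coordinates": [3, 4]}, "properties": {"id": 2}}\n'
)
dst = io.BytesIO()
kept = dedupe(src, dst)
print(kept)
```

Only digests and offsets stay in memory here, which mirrors the trade-off described for --use-offset: less RAM at the cost of extra seeks. Keeping the canonical bytes in the dictionary instead would correspond to the --use-ram variant.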

Examples:

# Deduplicate using file offset method (memory-efficient)
geojsonl-dedupe input.geojsonl output.geojsonl

# Deduplicate keeping features in RAM (faster but uses more memory)
geojsonl-dedupe input.geojsonl output.geojsonl --use-ram

# Auto-generate output filename
geojsonl-dedupe input.geojsonl

Note: The EXTENT retrieval mode includes built-in deduplication to handle features that may appear in overlapping spatial extents during extraction. This tool is useful for post-processing or cleaning up data from other sources.

4. Punch Holes (Optional)

This command is available if wmsdump is installed with the punch-holes extra. It removes overlaps in a GeoJSONl file by punching holes where polygons overlap. This is useful for cleaning up data extracted in GeoRSS format, which cannot represent polygons with holes.

punch-holes --help

Arguments:

  • INPUT_FILE: The input GeoJSONl file to process
  • OUTPUT_FILE: The output GeoJSONl file. If none provided, writes the results to fixed_<INPUT_FILE>

Options:

  • --index-in-mem: Whether the spatial index keeps the geometry data in memory or just the offset of the features on disk.
  • --keep-map-file: Whether to keep the overlap map temporary file (debugging purposes).

Example:

punch-holes input.geojsonl output.geojsonl

State Management

wmsdump automatically creates a .state file alongside the output file. This file stores the progress of the extraction. If the extraction is interrupted, wmsdump will resume from the last known state when run again with the same parameters. To start a new extraction, delete both the output file and the .state file.
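
The resume behaviour can be illustrated with a toy extraction loop. Everything in this sketch is hypothetical: the state file name (output path plus ".state"), its JSON layout with a next_offset key, and the in-memory record list standing in for a remote layer are assumptions for illustration, not the format wmsdump actually writes.

```python
import json
import os
import tempfile

RECORDS = [f"feature-{i}" for i in range(10)]  # stand-in for a remote layer


def run(out_path, batch_size=3, max_batches=None):
    """Append records in batches, persisting progress after each batch."""
    state_path = out_path + ".state"  # assumed naming, for illustration only
    offset = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            offset = json.load(f)["next_offset"]
    batches = 0
    while offset < len(RECORDS):
        if max_batches is not None and batches >= max_batches:
            return False  # simulated interruption
        batch = RECORDS[offset:offset + batch_size]
        with open(out_path, "a") as out:
            out.writelines(r + "\n" for r in batch)
        offset += len(batch)
        with open(state_path, "w") as f:
            json.dump({"next_offset": offset}, f)
        batches += 1
    return True


with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "layer.geojsonl")
    finished_first = run(out_path, max_batches=2)  # stops after 6 records
    finished_second = run(out_path)                # resumes at offset 6
    with open(out_path) as f:
        lines = f.read().splitlines()

print(finished_first, finished_second, len(lines))
```

The sketch also shows why both files must be deleted to start over: a leftover state file would skip records, and a leftover output file would end up with duplicates.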

Environment Variables

  • WMSDUMP_SAVE_RESPONSE_TO_FILE: If set, the raw HTTP response from the OGC service will be saved to the specified file. This is useful for debugging.

Dependencies

  • bs4 (Beautiful Soup 4)
  • click
  • colorlog
  • jsonschema
  • kml2geojson
  • requests
  • xmltodict

Optional:

  • geoindex-rs (required for punch-holes)
  • numpy (required for punch-holes)
  • shapely (required for punch-holes)
  • pyproj (required for handling some CRS definitions)

Contributing

Contributions are welcome! Please submit bug reports, feature requests, and pull requests through GitHub.

License

This project is released under the Unlicense - see the LICENSE file for details.

Credits

This was heavily inspired by a similar tool for ESRI endpoints: openaddresses/pyesridump.

datta07 pointed out to me that this was possible. Some of the GeoRSS parsing code is also based on prior work by datta07, answerquest and devdattaT.
