Extract nodes from an OSM file
ExtractOSM
Overview
ExtractOSM is a utility that performs a filtered extract of OpenStreetMap (OSM) data into a CSV file. It filters OSM nodes based on supplied categories, extracts specified fields, optionally normalizes text, and can enrich the output with additional attributes. The result is a dataset for:
- Training and applying classification or regression models
- Feeding structured inputs into GIS platforms or PostGIS databases
🟢 Features
- Filters user-specified OSM categories using `osmium` for high-performance extraction.
- Extracts a fixed set of core fields (e.g., `osm_id`, `item_name`, `lat`, `lon`).
- Extracts a configurable set of feature fields from tag keys with these modes:
  - Numeric parsing
  - Binary presence indicator
- Computes additional derived fields:
  - `tag_count`: the number of significant tags after applying tag filter rules
- Optionally applies text normalization using regex-based substitutions from YAML configuration.
- Optionally standardizes units for fields such as distance, converting values given in units such as "ft" to meters.
- Supports loading external enrichment data from CSV files keyed by `osm_id`.
- Outputs a structured CSV file for use in machine learning pipelines, PostGIS, or statistical tools.
- Includes an optional tool for finding the length of the Wikipedia article for each item.
- Includes an optional tool for calculating lat/lon and area for polygons in an OSM file.
🔍 Details
Core Fields
Core fields are always included in the output:
- `osm_id`: Unique identifier of the OSM node.
- `item_name`: Value of the `name` tag, if present.
- `lat`, `lon`: Coordinates of the node, if available.
- `osm_category`: Assigned category based on OSM tags.
Configured Fields
Additional fields beyond the core fields are defined in the features section of the YAML config. Each field uses a mode to
determine how its value is extracted and placed in the output CSV.
- `value`: Interprets the tag value as a numeric `float`.
- `presence`: Encodes presence of the tag as `1.0` if present, otherwise `0.0`. This acts as a binary indicator variable (one-hot feature).
- `score`: Behaves identically to `presence` during extraction, but may be treated differently in downstream workflows.
If a tag is missing, malformed, or cannot be converted to a numeric value, the feature value defaults to 0.0.
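The three modes and the `0.0` fallback can be illustrated with a minimal sketch. The function name `extract_feature` and the choice to raise on an unknown mode are assumptions made for illustration; they are not taken from the ExtractOSM source.

```python
def extract_feature(tags: dict[str, str], key: str, mode: str) -> float:
    """Return the numeric feature value for one configured tag key.

    Hypothetical helper: mirrors the documented behavior of the
    value / presence / score modes, not ExtractOSM's actual code.
    """
    raw = tags.get(key)
    if raw is None:
        return 0.0  # missing tag defaults to 0.0 in every mode
    if mode in ("presence", "score"):
        return 1.0  # the tag exists; its value is not inspected
    if mode == "value":
        try:
            return float(raw.strip())
        except ValueError:
            return 0.0  # malformed or non-numeric value
    raise ValueError(f"unknown mode: {mode}")  # assumption: fail loudly


tags = {"population": "52000", "wikipedia": "en:Example", "ele": "high"}
population = extract_feature(tags, "population", "value")
has_wikipedia = extract_feature(tags, "wikipedia", "presence")
elevation = extract_feature(tags, "ele", "value")
```

Here `population` parses to a float, `has_wikipedia` is the presence indicator, and `elevation` falls back to the default because "high" is not numeric.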
Data Normalization
OSM data can have inconsistent formatting and units. ExtractOSM can optionally provide some cleanup with the following:
- Normalize detected imperial distances (e.g., `ft`) to meters.
- Normalize durations (e.g., `hours`) to minutes.
- Apply regex substitutions defined in `config/text_substitutions.yml` to normalize text phrases.
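As a rough sketch of the unit normalization described above, the snippet below converts a "number plus unit" string to meters or minutes. The unit tables, the regular expression, and the function name `convert_units` are assumptions for illustration; the set of units ExtractOSM actually recognizes may differ.

```python
import re

# Assumed conversion factors (illustrative, not ExtractOSM's real tables).
TO_METERS = {"ft": 0.3048, "feet": 0.3048}
TO_MINUTES = {"h": 60.0, "hour": 60.0, "hours": 60.0}

# A number followed by an alphabetic unit, e.g. "12 ft" or "2hours".
_QUANTITY = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*([a-zA-Z]+)\s*$")


def convert_units(value: str) -> str:
    """Convert a recognized quantity to meters or minutes; otherwise return it unchanged."""
    match = _QUANTITY.match(value)
    if not match:
        return value
    number = float(match.group(1))
    unit = match.group(2).lower()
    for table in (TO_METERS, TO_MINUTES):
        if unit in table:
            return str(round(number * table[unit], 4))
    return value


distance = convert_units("12 ft")
duration = convert_units("2 hours")
untouched = convert_units("unknown")
```

Values that do not look like a quantity with a known unit pass through unchanged, so they can still be handled by the regex substitutions.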
🟢 Derived Fields
These fields are automatically computed based on tag presence and metadata:
🔍 tag_count
`tag_count`: Count of significant tags attached to the node, using the filtering rules defined in `ignore_tags.yml`.
`tag_count` can serve as a proxy for feature richness. OSM nodes with more tags are
better described and potentially more important.
- A city node with `population`, `wikidata`, `wikipedia`, and `official_name` tags may be more significant than one with only `name`.
The `ignore_tags` config file adds processing rules for counting the tags, because not all tags contribute meaningful context. For example, the following kinds of tags don't indicate richness of the data:
- Administrative (e.g., `source`, `check_date`)
- Redundant (e.g., `addr:housenumber`, `addr:street`)
- Overly granular (e.g., `internet_access:ssid`)
The ignore_tags.yml file helps to:
- Exclude tags with little semantic value from the count.
- Group tag families like `addr:*` or `contact:*` into a single count to avoid overrepresentation.
This helps tag_count better reflect meaningful descriptive richness rather than just tag volume.
ignore_tags.yml Configuration
ignore_tags.yml includes a list of tags for special handling using the action as specified below:
| Rule | Description |
|---|---|
| `0` | Tag is excluded from the count. |
| `group` | Tags with a common prefix (e.g., `addr:*`) are counted once per group. |
| Unlisted | Tags not in the file are counted individually by default. |
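The three rules in the table can be sketched in a few lines. This is an illustration of the counting rules only: the function signature and the prefix-matching logic for `group` patterns are assumptions, not ExtractOSM's implementation.

```python
def tag_count(tags: dict[str, str], rules: dict[str, object]) -> int:
    """Count significant tags under assumed ignore_tags semantics."""
    counted: set[str] = set()
    for key in tags:
        if rules.get(key) == 0:
            continue  # rule 0: tag is excluded from the count
        # rule "group": a pattern such as "addr:*" covers every key with that prefix
        group = next(
            (
                pattern
                for pattern, action in rules.items()
                if action == "group"
                and pattern.endswith("*")
                and key.startswith(pattern[:-1])
            ),
            None,
        )
        # grouped keys collapse to one entry; unlisted keys count individually
        counted.add(group if group is not None else key)
    return len(counted)


rules = {"addr:*": "group", "building": 0, "source": 0}
tags = {
    "name": "Example Bakery",
    "shop": "bakery",
    "addr:street": "Main Street",
    "addr:housenumber": "1",
    "source": "survey",
}
count = tag_count(tags, rules)
```

In this example `count` is 3: `name` and `shop` count individually, the two `addr:*` tags count once, and `source` is excluded.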
Enrichment
CSV-based enrichment files can be loaded at runtime and merged into each node's record to add data that OSM doesn't contain. Each enrichment file must contain an `osm_id` column and any number of additional attributes. These attributes are appended as new columns during feature extraction. Multiple enrichment sources may be loaded and merged sequentially.
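The merge can be pictured as a lookup keyed by `osm_id`, applied once per enrichment source. The sketch below uses only the standard library; the helper names `load_enrichment` and `enrich`, and the choice to leave unmatched nodes without the extra attributes, are assumptions for illustration.

```python
import csv
import io


def load_enrichment(csv_file) -> dict[str, dict[str, str]]:
    """Read an enrichment CSV into {osm_id: {attribute: value}}."""
    table: dict[str, dict[str, str]] = {}
    for row in csv.DictReader(csv_file):
        osm_id = row.pop("osm_id")  # the required key column
        table[osm_id] = row
    return table


def enrich(record: dict, sources: list[dict[str, dict[str, str]]]) -> dict:
    """Merge matching attributes from each source, in order, into a copy of the record."""
    merged = dict(record)
    for source in sources:
        # assumption: nodes without a match are left unchanged
        merged.update(source.get(str(record["osm_id"]), {}))
    return merged


parks = load_enrichment(
    io.StringIO("osm_id,visitor_count,heritage_rank\n123456,8000,1\n987654,12000,2\n")
)
node = {"osm_id": 123456, "item_name": "Example Park"}
enriched = enrich(node, [parks])
```

Because sources are applied sequentially, a later file that repeats an attribute name would overwrite the earlier value in this sketch.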
Filtering
Tag filtering is defined in a YAML file (e.g., *_classification.yml) and passed to osmium to filter the input data. Each tag key (e.g. 'amenity') is associated with a set of accepted values (e.g. 'restaurant', 'bar'). Only nodes matching at least one of the configured filters are retained.
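The retention rule ("at least one configured filter matches") can be expressed in plain Python as below. This only illustrates the rule: ExtractOSM delegates the real filtering to osmium, and the function name `matches_filters` is hypothetical. The dictionary mirrors the `keys` section of the sample classification config.

```python
def matches_filters(tags: dict[str, str], keys: dict[str, dict[str, list[str]]]) -> bool:
    """Return True if any configured tag key has one of its accepted values."""
    return any(tags.get(key) in spec["filters"] for key, spec in keys.items())


config_keys = {
    "leisure": {"filters": ["fitness_centre", "stadium"]},
    "shop": {"filters": ["department_store", "bakery"]},
}

kept = matches_filters({"shop": "bakery", "name": "Example Bakery"}, config_keys)
dropped = matches_filters({"shop": "florist"}, config_keys)
```

A node tagged `shop=bakery` is retained, while `shop=florist` is dropped because `florist` is not among the accepted values for `shop`.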
Output
The output is a CSV file. Each row corresponds to one OSM node. Columns include static fields, derived fields, configured features, and enrichment attributes.
▶️ Usage
Run ExtractOSM from the command line to extract filtered and enriched node data from an OSM file.
extract-osm --input x --config x --substitutions x --ignore-tags x --enrichment-dir x --segment x --output x --log-level x
Arguments
- `-l`, `--log_level`: Log level:
  - `0`: quiet
  - `1`: info
  - `2`: debug
Sample Files
YAML Configuration (config/poi_classification.yml)
config_type: "Classification"
# Items to extract from OSM file
keys:
  leisure:
    filters:
      - fitness_centre
      - stadium
  shop:
    filters:
      - department_store
      - bakery
Ignore Tags (config/ignore_tags.yml)
config_type: "IgnoreTags"
addr:*: group
building: 0
source: 0
ExtractOSM can provide a count of the tags for a node. This file lists tags to ignore for that count to provide a more representative number.
- `0`: exclude this tag from the tag count
- `group`: all tags with this prefix are counted as one (e.g., `addr:*`)
- Tags not listed are counted as one by default
Enrichment File (data/external/parks_enrich.csv)
osm_id,visitor_count,heritage_rank
123456,8000,1
987654,12000,2
Enrichment files add supplemental data to the node with the matching `osm_id`. In this example, the file adds visitor
counts and a heritage rank, which can be useful in scoring importance.
Text Substitutions (config/text_substitutions.yml)
convert_units: 1
substitutions:
  "\\bunknown\\b": "0"
  "\\bfew\\b|\\bfew(?=-)": "3"
This file provides regular expressions that are applied to tag values for cleanup.
It also includes a flag, `convert_units`. If this is set to 1, distances are converted to meters
and durations are converted to minutes. For example, "12 ft" is converted to 3.6576.
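To show what the sample substitutions do to a tag value, here is a minimal sketch that applies them in order with Python's `re` module. The function name `apply_substitutions`, the in-order application, and case-sensitive matching are assumptions; the patterns are the two from the sample file, written as raw strings.

```python
import re

# The two patterns from the sample text_substitutions.yml.
substitutions = {
    r"\bunknown\b": "0",
    r"\bfew\b|\bfew(?=-)": "3",
}


def apply_substitutions(value: str, rules: dict[str, str]) -> str:
    """Apply each regex substitution in order (assumed behavior)."""
    for pattern, replacement in rules.items():
        value = re.sub(pattern, replacement, value)
    return value


cleaned_unknown = apply_substitutions("unknown", substitutions)
cleaned_few = apply_substitutions("a few", substitutions)
```

After substitution, a value such as "unknown" becomes "0", which the `value` mode can then parse as a number.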
File details
Details for the file extractosm-1.3.2.tar.gz.
File metadata
- Download URL: extractosm-1.3.2.tar.gz
- Upload date:
- Size: 48.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 579ee59254aa677fc4aec29ff37a584c16db370a2a9519fa5a9ece8d4250aea9 |
| MD5 | 53725e9115360a04954c4782209abe3c |
| BLAKE2b-256 | bc3cac51364917a08ac85536d114c7709e425501d2a314265f85305dd6db14c2 |
File details
Details for the file extractosm-1.3.2-py3-none-any.whl.
File metadata
- Download URL: extractosm-1.3.2-py3-none-any.whl
- Upload date:
- Size: 56.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bdb497bf70abb915e1ce2c22b4d977992156d93d60a5f839f19d9ffe86016f9c |
| MD5 | 427596929253c7c40e02d2b40111843d |
| BLAKE2b-256 | 29bb38bdf2f7d7d5063a687b44c20915b8410995f00c99270b5171110010794c |