A dual-head NER-based parser for location strings

decide-location-formatter

A Python package for parsing and structuring location strings into their individual address components. Built around a dual-head NER model (svercoutere/abb-dual-location-component-ner) fine-tuned on top of XLM-RoBERTa base.

How it works

Raw location strings like "Scaldisstraat 23-25, 2000 Antwerpen" or "Cafe den Draak, Lovegemlaan 7, 9000 Gent" are common in municipal decision text but are inconsistently formatted and often contain multiple distinct locations in a single string.

The pipeline has three steps:

  1. Text cleaning — normalises whitespace, unicode, and newlines.
  2. Dual-head NER inference — the model runs two independent CRF-decoded classification heads over every token simultaneously:
    • Component head — tags each token as one of 12 address component types (street, city, postcode, …).
    • Location head — groups tokens that belong to the same physical location into B-LOCATION / I-LOCATION spans, allowing multi-location strings to be split.
  3. Post-processing — component spans are nested inside their parent location spans, housenumber ranges/sequences (e.g. 23-25, 7 en 9) are expanded into individual entries, and bus numbers are split into a separate field.
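The grouping and nesting in steps 2 and 3 can be illustrated with a minimal sketch. It assumes both heads emit BIO tags per word; the tag sequences below are written by hand for illustration and are not model output, and the function names `bio_to_spans` and `nest` are hypothetical rather than part of the package.

```python
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int, str]  # (start_token, end_token_exclusive, label)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Collapse a BIO tag sequence into (start, end, label) token spans."""
    spans: List[Span] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the last span
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        # A stray "I-" tag with no open span is treated leniently as a start.
        if prefix == "B" or (prefix == "I" and start is None):
            start, label = i, name
    return spans


def nest(location_spans: List[Span], component_spans: List[Span],
         tokens: List[str]) -> List[Dict[str, str]]:
    """Attach each component span to the location span that contains it."""
    nested = []
    for l_start, l_end, _ in location_spans:
        entry = {"location": " ".join(tokens[l_start:l_end])}
        for c_start, c_end, c_label in component_spans:
            if l_start <= c_start and c_end <= l_end:
                entry[c_label.lower()] = " ".join(tokens[c_start:c_end])
        nested.append(entry)
    return nested


# Hand-written tags for illustration only.
tokens = ["Lovegemlaan", "7", ",", "9000", "Gent", "en",
          "Dorpstraat", "12", ",", "9240", "Zele"]
component_tags = ["B-STREET", "B-HOUSENUMBER", "O", "B-POSTCODE", "B-CITY", "O",
                  "B-STREET", "B-HOUSENUMBER", "O", "B-POSTCODE", "B-CITY"]
location_tags = ["B-LOCATION", "I-LOCATION", "I-LOCATION", "I-LOCATION", "I-LOCATION", "O",
                 "B-LOCATION", "I-LOCATION", "I-LOCATION", "I-LOCATION", "I-LOCATION"]

location_spans = bio_to_spans(location_tags)
nested = nest(location_spans, bio_to_spans(component_tags), tokens)
for entry in nested:
    print(entry)
```

Joining tokens with spaces is a simplification; a real implementation would more likely slice the original string by character offsets so that punctuation spacing is preserved.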

Architecture

| Component | Detail |
|---|---|
| Base encoder | xlm-roberta-base (12 layers, 768 hidden) |
| Component head | Linear(768 → 256) · GELU · Dropout(0.1) · Linear(256 → 25) + CRF |
| Location head | Linear(768 → 256) · GELU · Dropout(0.1) · Linear(256 → 3) + CRF |
| Tokenisation | Word-level regex tokeniser; sub-word alignments via fast tokenizer word_ids() |
| Max input length | 256 sub-word tokens |
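The sub-word alignment mentioned in the tokenisation row can be sketched as follows. A fast Hugging Face tokenizer's `word_ids()` returns one entry per sub-word token, with `None` for special tokens. The sketch assumes word-level predictions are read from the first sub-word of each word, which is a common convention but not confirmed for this package; `first_subword_positions` is a hypothetical name and the `word_ids` list is hand-written.

```python
from typing import List, Optional


def first_subword_positions(word_ids: List[Optional[int]]) -> List[int]:
    """Return, for each word, the position of its first sub-word token."""
    positions: List[int] = []
    previous: Optional[int] = None
    for pos, word_id in enumerate(word_ids):
        # Skip special tokens (None) and continuation pieces of the same word.
        if word_id is not None and word_id != previous:
            positions.append(pos)
        previous = word_id
    return positions


# Hand-written stand-in for a word_ids() result: a four-word input whose
# first word is split into three sub-word pieces, wrapped in special tokens.
example_word_ids = [None, 0, 0, 0, 1, 2, 3, None]
word_level = first_subword_positions(example_word_ids)
print(word_level)  # [1, 4, 5, 6]
```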

Entity types (component head)

| Label | Description |
|---|---|
| STREET | Street name (no house number) |
| ROAD | Road or route name |
| HOUSENUMBER | House/building number(s), ranges or sequences |
| POSTCODE | Postal or ZIP code |
| CITY | City or municipality name |
| PROVINCE | Province or region name |
| BUILDING | Named building, site or facility |
| INTERSECTION | Crossing or intersection of roads |
| PARCEL | Land parcel, section or lot number |
| DISTRICT | District, neighbourhood or borough |
| GRAVE_LOCATION | Plot/row/number within a cemetery |
| DOMAIN_ZONE_AREA | Domain, zone or area name |
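The head output sizes in the architecture (25 and 3) are consistent with a BIO tagging scheme: one B- and one I- label per entity type, plus a single O label. The short check below assumes that scheme. The label ordering is illustrative and may differ from the model's actual label map, and `bio_labels` is a hypothetical helper.

```python
COMPONENT_TYPES = [
    "STREET", "ROAD", "HOUSENUMBER", "POSTCODE", "CITY", "PROVINCE",
    "BUILDING", "INTERSECTION", "PARCEL", "DISTRICT",
    "GRAVE_LOCATION", "DOMAIN_ZONE_AREA",
]


def bio_labels(types):
    """Build a BIO label list: "O" plus a B-/I- pair for each entity type."""
    labels = ["O"]
    for name in types:
        labels += [f"B-{name}", f"I-{name}"]
    return labels


component_labels = bio_labels(COMPONENT_TYPES)
location_labels = bio_labels(["LOCATION"])
print(len(component_labels), len(location_labels))  # 25 3
```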

Evaluation

Evaluated on a held-out 10 % split of ~10 000 Belgian municipal decision location strings.

| Metric | Score |
|---|---|
| Combined F1 | 0.9435 |
| Component F1 | 0.9295 |
| Location F1 | 0.9576 |
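The source does not state how the combined score is computed. As a quick arithmetic check, it matches the unweighted mean of the two head scores to within rounding, so that is one plausible reading; this is an observation, not a documented definition.

```python
component_f1 = 0.9295
location_f1 = 0.9576

# Unweighted mean of the two per-head F1 scores.
mean_f1 = (component_f1 + location_f1) / 2
print(f"{mean_f1:.5f}")  # roughly 0.94355, against a reported 0.9435
```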

Installation

From source (recommended during development)

git clone https://github.com/semantic-ai/decide-location-formatter.git
cd decide-location-formatter
pip install -e .

Dependencies only

pip install "torch>=2.0" "transformers>=4.35" "pytorch-crf>=0.7.2"

The model weights (~1 GB) are downloaded automatically from the Hugging Face Hub on first use.

Usage

Quick start

from locationformatter import LocationFormatter

lf = LocationFormatter()   # loads model once; reuse for many calls

result = lf.parse("Scaldisstraat 23-25, 2000 Antwerpen")
print(result)
{
  "original": "Scaldisstraat 23-25, 2000 Antwerpen",
  "locations": [
    {
      "location": "Scaldisstraat 23-25, 2000 Antwerpen",
      "street": "Scaldisstraat",
      "housenumber": "23",
      "housenumber_type": "single",
      "postcode": "2000",
      "city": "Antwerpen"
    },
    {
      "location": "Scaldisstraat 23-25, 2000 Antwerpen",
      "street": "Scaldisstraat",
      "housenumber": "25",
      "housenumber_type": "single",
      "postcode": "2000",
      "city": "Antwerpen"
    }
  ]
}

Multi-location strings

Strings that contain several distinct locations are automatically split:

result = lf.parse("Lovegemlaan 7, 9000 Gent en Dorpstraat 12, 9240 Zele")
for loc in result["locations"]:
    print(loc)

Raw prediction (no housenumber expansion)

predict() returns spans straight from the model without expanding ranges or splitting bus numbers:

raw = lf.predict("Heikeesstraat 2-4, 9240 Zele")
# raw["locations"][0]["housenumber"] == "2-4"
# raw["locations"][0]["housenumber_type"] == "range"

One-shot helper

For a single call without keeping the model in memory:

from locationformatter import parse_location

result = parse_location("Grote Markt 1, 2000 Antwerpen")

Note: parse_location reloads the model on every call. Use LocationFormatter for repeated parsing.

Custom model or device

lf = LocationFormatter(
    repo="your-org/your-model",   # any compatible HF Hub repo
    device="cuda",                # "cpu" or "cuda"; auto-detected when omitted
)

API reference

LocationFormatter

class LocationFormatter:
    def __init__(self, repo: str = "svercoutere/abb-dual-location-component-ner",
                 device: str | None = None): ...

    def parse(self, text: str) -> dict: ...
    # Full pipeline: clean → NER → expand housenumbers.
    # Returns {"original": str, "locations": list[dict]}

    def predict(self, text: str) -> dict: ...
    # NER only, no housenumber expansion.
    # Returns {"original": str, "locations": list[dict]}

Helper functions

from locationformatter import clean_string, clean_house_number, extract_house_and_bus_number

clean_string("  Grote   Markt\n1  ")
# → "Grote Markt 1"

clean_house_number("3 t.e.m. 7")
# → ["3", "4", "5", "6", "7"]

clean_house_number("10-14")
# → ["10", "11", "12", "13", "14"]

extract_house_and_bus_number("5 bus 3")
# → {"housenumber": "5", "bus": "3"}
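To make the cleaning step concrete, here is a minimal sketch of what a function like clean_string might do, based on the description "normalises whitespace, unicode, and newlines". The package's actual implementation is not shown in this document; the name `normalise_text` and the choice of NFKC normalisation are assumptions.

```python
import re
import unicodedata


def normalise_text(text: str) -> str:
    """Normalise unicode, then collapse all whitespace runs to single spaces."""
    # NFKC folds compatibility characters (e.g. non-breaking spaces,
    # full-width digits) into their plain equivalents. Assumed, not confirmed.
    text = unicodedata.normalize("NFKC", text)
    # \s+ also matches newlines and tabs, so line breaks become spaces.
    return re.sub(r"\s+", " ", text).strip()


cleaned = normalise_text("  Grote   Markt\n1  ")
print(cleaned)  # Grote Markt 1
```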

Output schema

Each entry in the locations list is a flat dict. Only fields detected by the model are included.

| Field | Type | Description |
|---|---|---|
| location | str | The substring corresponding to this location |
| street | str | Street name |
| road | str | Road/route name |
| housenumber | str | Individual house number (after expansion) |
| housenumber_type | str | "single", "range", or "sequence" |
| bus | str | Bus/apartment number (when present) |
| postcode | str | Postal code |
| city | str | City or municipality |
| province | str | Province |
| building | str | Named building or facility |
| intersection | str | Road intersection |
| parcel | str | Land parcel identifier |
| district | str | District or neighbourhood |
| grave_location | str | Cemetery plot/row/number |
| domain_zone_area | str | Zone or area name |
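Because only detected fields are included, code that consumes a location entry should not assume any key is present. The sketch below shows one way to build a one-line address with `dict.get()`. `format_location` is a hypothetical helper, not part of the package, and the "Kerkstraat" entry is made up for illustration.

```python
def format_location(loc: dict) -> str:
    """Build a one-line address from whichever fields are present."""
    street = loc.get("street") or loc.get("road")
    number = loc.get("housenumber")
    if number and loc.get("bus"):
        number = f"{number} bus {loc['bus']}"
    line1 = " ".join(part for part in (street, number) if part)
    line2 = " ".join(part for part in (loc.get("postcode"), loc.get("city")) if part)
    return ", ".join(part for part in (line1, line2) if part)


# Entry copied from the quick-start output.
full = {
    "location": "Scaldisstraat 23-25, 2000 Antwerpen",
    "street": "Scaldisstraat",
    "housenumber": "23",
    "housenumber_type": "single",
    "postcode": "2000",
    "city": "Antwerpen",
}
print(format_location(full))             # Scaldisstraat 23, 2000 Antwerpen
print(format_location({"city": "Gent"}))  # Gent

# Made-up entry showing the optional bus field.
with_bus = {"street": "Kerkstraat", "housenumber": "5", "bus": "3"}
print(format_location(with_bus))         # Kerkstraat 5 bus 3
```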

Development

Running tests

pytest tests/

The unit tests for the helper functions (clean_string, clean_house_number, extract_house_and_bus_number) do not require the model to be loaded and run offline.
