A dual-head NER-based parser for location strings

decide-location-formatter

A Python package for parsing and structuring location strings into their individual address components. Built around a dual-head NER model (svercoutere/abb-dual-location-component-ner) fine-tuned on top of XLM-RoBERTa base.

How it works

Raw location strings like "Scaldisstraat 23-25, 2000 Antwerpen" or "Cafe den Draak, Lovegemlaan 7, 9000 Gent" are common in municipal decision text but are inconsistently formatted and often contain multiple distinct locations in a single string.

The pipeline has three steps:

  1. Text cleaning — normalises whitespace, unicode, and newlines.
  2. Dual-head NER inference — the model runs two independent CRF-decoded classification heads over every token simultaneously:
    • Component head — tags each token as one of 12 address component types (street, city, postcode, …).
    • Location head — groups tokens that belong to the same physical location into B-LOCATION / I-LOCATION spans, allowing multi-location strings to be split.
  3. Post-processing — component spans are nested inside their parent location spans, housenumber ranges/sequences (e.g. 23-25, 7 en 9) are expanded into individual entries, and bus numbers are split into a separate field.
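The nesting part of step 3 can be sketched as follows. This is a minimal illustration, not the package's implementation: the `Span` class and the `find_span` and `nest_components` helpers are hypothetical names, and the spans are built by hand rather than taken from model output.

```python
from dataclasses import dataclass


@dataclass
class Span:
    label: str  # e.g. "STREET" or "LOCATION"
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


def find_span(text, label, piece):
    """Build a span for the first occurrence of `piece` in `text` (illustration only)."""
    start = text.index(piece)
    return Span(label, start, start + len(piece))


def nest_components(text, locations, components):
    """Attach every component span to the location span that fully contains it."""
    nested = []
    for loc in locations:
        entry = {"location": text[loc.start:loc.end]}
        for comp in components:
            if loc.start <= comp.start and comp.end <= loc.end:
                entry[comp.label.lower()] = text[comp.start:comp.end]
        nested.append(entry)
    return nested


text = "Lovegemlaan 7, 9000 Gent en Dorpstraat 12, 9240 Zele"
locations = [
    find_span(text, "LOCATION", "Lovegemlaan 7, 9000 Gent"),
    find_span(text, "LOCATION", "Dorpstraat 12, 9240 Zele"),
]
components = [
    find_span(text, label, piece)
    for label, piece in [
        ("STREET", "Lovegemlaan"), ("HOUSENUMBER", "7"),
        ("POSTCODE", "9000"), ("CITY", "Gent"),
        ("STREET", "Dorpstraat"), ("HOUSENUMBER", "12"),
        ("POSTCODE", "9240"), ("CITY", "Zele"),
    ]
]

nested = nest_components(text, locations, components)
for entry in nested:
    print(entry)
```

Containment by character offsets keeps the two heads independent: the component head never needs to know how many locations the string holds.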

Architecture

| Component | Detail |
|---|---|
| Base encoder | xlm-roberta-base (12 layers, 768 hidden) |
| Component head | Linear(768 → 256) · GELU · Dropout(0.1) · Linear(256 → 25) + CRF |
| Location head | Linear(768 → 256) · GELU · Dropout(0.1) · Linear(256 → 3) + CRF |
| Tokenisation | Word-level regex tokeniser; sub-word alignments via fast tokenizer word_ids() |
| Max input length | 256 sub-word tokens |
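One common way to use word_ids() for sub-word alignment is to keep only the first sub-word of each word. The sketch below assumes that strategy, which the description above does not confirm. The `first_subword_positions` helper is a hypothetical name, and the `word_ids` list is written by hand rather than produced by a tokenizer.

```python
def first_subword_positions(word_ids):
    """Return the position of the first sub-word of every word.

    `word_ids` has one entry per sub-word token: the index of the word it
    came from, or None for special tokens.
    """
    positions = []
    previous = None
    for pos, wid in enumerate(word_ids):
        if wid is not None and wid != previous:
            positions.append(pos)
        previous = wid
    return positions


# Illustrative word_ids for two words split into three and two sub-words,
# surrounded by one special token on each side.
word_ids = [None, 0, 0, 0, 1, 1, None]
positions = first_subword_positions(word_ids)
print(positions)  # [1, 4]
```

Selecting one position per word gives the classification heads exactly one prediction per word-level token, whatever the sub-word split looks like.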

Entity types (component head)

| Label | Description |
|---|---|
| STREET | Street name (no house number) |
| ROAD | Road or route name |
| HOUSENUMBER | House/building number(s), ranges or sequences |
| POSTCODE | Postal or ZIP code |
| CITY | City or municipality name |
| PROVINCE | Province or region name |
| BUILDING | Named building, site or facility |
| INTERSECTION | Crossing or intersection of roads |
| PARCEL | Land parcel, section or lot number |
| DISTRICT | District, neighbourhood or borough |
| GRAVE_LOCATION | Plot/row/number within a cemetery |
| DOMAIN_ZONE_AREA | Domain, zone or area name |
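The head sizes in the architecture (25 and 3 outputs) are consistent with a BIO tagging scheme over these 12 types. The sketch below shows that arithmetic. It assumes the component head uses the same B-/I- prefixes as the location head, and the label order is illustrative rather than the model's actual index mapping.

```python
COMPONENT_TYPES = [
    "STREET", "ROAD", "HOUSENUMBER", "POSTCODE", "CITY", "PROVINCE",
    "BUILDING", "INTERSECTION", "PARCEL", "DISTRICT",
    "GRAVE_LOCATION", "DOMAIN_ZONE_AREA",
]


def bio_labels(types):
    """One 'O' label plus a B- and an I- label for every entity type."""
    labels = ["O"]
    for entity_type in types:
        labels += [f"B-{entity_type}", f"I-{entity_type}"]
    return labels


component_labels = bio_labels(COMPONENT_TYPES)
location_labels = bio_labels(["LOCATION"])
print(len(component_labels), len(location_labels))  # 25 3
```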

Evaluation

Evaluated on a held-out 10 % split of ~10 000 Belgian municipal decision location strings.

| Metric | Score |
|---|---|
| Combined F1 | 0.9435 |
| Component F1 | 0.9295 |
| Location F1 | 0.9576 |
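The description does not say how Combined F1 is computed. As a quick check, the reported value agrees, to within rounding, with the unweighted mean of the two head scores; treat that as an observation about the numbers, not as the documented definition.

```python
component_f1 = 0.9295
location_f1 = 0.9576

# Unweighted mean of the two per-head scores (assumed, not documented).
mean_f1 = (component_f1 + location_f1) / 2
print(f"{mean_f1:.5f}")

reported_combined_f1 = 0.9435
print(abs(mean_f1 - reported_combined_f1) < 1e-4)
```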

Installation

From source (recommended during development)

git clone https://github.com/semantic-ai/decide-location-formatter.git
cd decide-location-formatter
pip install -e .

Dependencies only

pip install "torch>=2.0" "transformers>=4.35" "pytorch-crf>=0.7.2"

The model weights (~1 GB) are downloaded automatically from the Hugging Face Hub on first use.

Usage

Quick start

from locationformatter import LocationFormatter

lf = LocationFormatter()   # loads model once; reuse for many calls

result = lf.parse("Scaldisstraat 23-25, 2000 Antwerpen")
import json
print(json.dumps(result, indent=2))   # pretty-printed as JSON
{
  "original": "Scaldisstraat 23-25, 2000 Antwerpen",
  "locations": [
    {
      "location": "Scaldisstraat 23-25, 2000 Antwerpen",
      "street": "Scaldisstraat",
      "housenumber": "23",
      "housenumber_type": "single",
      "postcode": "2000",
      "city": "Antwerpen"
    },
    {
      "location": "Scaldisstraat 23-25, 2000 Antwerpen",
      "street": "Scaldisstraat",
      "housenumber": "25",
      "housenumber_type": "single",
      "postcode": "2000",
      "city": "Antwerpen"
    }
  ]
}
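Because each expanded entry is a flat dict, turning the result back into one address per line takes only a few lines. The sketch below works on the output shown above, copied in as a literal so that it runs without the model. The `one_line` helper is a hypothetical name, not part of the package.

```python
result = {
    "original": "Scaldisstraat 23-25, 2000 Antwerpen",
    "locations": [
        {"location": "Scaldisstraat 23-25, 2000 Antwerpen", "street": "Scaldisstraat",
         "housenumber": "23", "housenumber_type": "single",
         "postcode": "2000", "city": "Antwerpen"},
        {"location": "Scaldisstraat 23-25, 2000 Antwerpen", "street": "Scaldisstraat",
         "housenumber": "25", "housenumber_type": "single",
         "postcode": "2000", "city": "Antwerpen"},
    ],
}


def one_line(loc):
    """Join the fields that are present; missing fields are skipped."""
    street_part = " ".join(p for p in (loc.get("street"), loc.get("housenumber")) if p)
    city_part = " ".join(p for p in (loc.get("postcode"), loc.get("city")) if p)
    return ", ".join(p for p in (street_part, city_part) if p)


addresses = [one_line(loc) for loc in result["locations"]]
print(addresses)
# ['Scaldisstraat 23, 2000 Antwerpen', 'Scaldisstraat 25, 2000 Antwerpen']
```

Using `dict.get` matters here, since only the fields the model detected are present in each entry.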

Multi-location strings

Strings that contain several distinct locations are automatically split:

result = lf.parse("Lovegemlaan 7, 9000 Gent en Dorpstraat 12, 9240 Zele")
for loc in result["locations"]:
    print(loc)
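The split relies on the B-LOCATION / I-LOCATION tags from the location head. A minimal sketch of decoding such tags into groups of tokens is shown below. The tags are written by hand for illustration and are not actual model output, and `split_locations` is a hypothetical helper.

```python
def split_locations(tokens, tags):
    """Group tokens into locations: B- opens a group, I- extends it, O closes it."""
    groups, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-LOCATION":
            if current:
                groups.append(current)
            current = [token]
        elif tag == "I-LOCATION" and current:
            current.append(token)
        else:  # "O", or a stray I- tag with no open group
            if current:
                groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


tokens = ["Lovegemlaan", "7", ",", "9000", "Gent", "en",
          "Dorpstraat", "12", ",", "9240", "Zele"]
tags = ["B-LOCATION", "I-LOCATION", "I-LOCATION", "I-LOCATION", "I-LOCATION", "O",
        "B-LOCATION", "I-LOCATION", "I-LOCATION", "I-LOCATION", "I-LOCATION"]

groups = split_locations(tokens, tags)
print(groups)
```

The connective "en" is tagged O in this illustration, so it falls outside both groups.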

Raw prediction (no housenumber expansion)

predict() returns spans straight from the model without expanding ranges or splitting bus numbers:

raw = lf.predict("Heikeesstraat 2-4, 9240 Zele")
# raw["locations"][0]["housenumber"] == "2-4"
# raw["locations"][0]["housenumber_type"] == "range"
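To make the three housenumber_type values concrete, here is a rough sketch of how a raw house-number string could be classified. The regular expressions and the `housenumber_type` function are assumptions made for illustration; the package's own rules may differ.

```python
import re

# A range joins two numbers with "-" or a Dutch "up to and including" marker.
RANGE_RE = re.compile(
    r"^\s*\d+\w*\s*(?:-|t\.e\.m\.|tot en met)\s*\d+\w*\s*$", re.IGNORECASE
)
# A sequence lists numbers separated by ",", "en" (Dutch "and") or "&".
SEQUENCE_RE = re.compile(r"(?:,|\ben\b|&)", re.IGNORECASE)


def housenumber_type(value):
    """Classify a raw house-number string as 'range', 'sequence' or 'single'."""
    if RANGE_RE.match(value):
        return "range"
    if SEQUENCE_RE.search(value):
        return "sequence"
    return "single"


for raw_value in ["2-4", "3 t.e.m. 7", "7 en 9", "23"]:
    print(raw_value, "->", housenumber_type(raw_value))
```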

One-shot helper

For a single call without keeping the model in memory:

from locationformatter import parse_location

result = parse_location("Grote Markt 1, 2000 Antwerpen")

Note: parse_location reloads the model on every call. Use LocationFormatter for repeated parsing.

Custom model or device

lf = LocationFormatter(
    repo="your-org/your-model",   # any compatible HF Hub repo
    device="cuda",                # "cpu" or "cuda"; auto-detected when omitted
)

API reference

LocationFormatter

class LocationFormatter:
    def __init__(self, repo: str = "svercoutere/abb-dual-location-component-ner",
                 device: str | None = None): ...

    def parse(self, text: str) -> dict: ...
    # Full pipeline: clean → NER → expand housenumbers.
    # Returns {"original": str, "locations": list[dict]}

    def predict(self, text: str) -> dict: ...
    # NER only, no housenumber expansion.
    # Returns {"original": str, "locations": list[dict]}

Helper functions

from locationformatter import clean_string, clean_house_number, extract_house_and_bus_number

clean_string("  Grote   Markt\n1  ")
# → "Grote Markt 1"

clean_house_number("3 t.e.m. 7")
# → ["3", "4", "5", "6", "7"]

clean_house_number("10-14")
# → ["10", "11", "12", "13", "14"]

extract_house_and_bus_number("5 bus 3")
# → {"housenumber": "5", "bus": "3"}
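For readers who want a feel for what two of these helpers do, the sketch below reproduces the behaviour shown above for `clean_string` and `extract_house_and_bus_number` using only the standard library. The `sketch_` functions are stand-ins written for illustration, not the package's code. In particular, NFKC normalisation and the result returned when no bus number is present are assumptions.

```python
import re
import unicodedata


def sketch_clean_string(text):
    """Normalise unicode and collapse all runs of whitespace, including newlines."""
    text = unicodedata.normalize("NFKC", text)  # assumed normalisation form
    return " ".join(text.split())


BUS_RE = re.compile(
    r"^\s*(?P<housenumber>.+?)\s+bus\s+(?P<bus>\S+)\s*$", re.IGNORECASE
)


def sketch_split_bus(value):
    """Split '<housenumber> bus <bus>' into two fields; otherwise keep the value whole."""
    match = BUS_RE.match(value)
    if match is None:
        return {"housenumber": value.strip()}  # assumed shape when no bus is found
    return {"housenumber": match["housenumber"], "bus": match["bus"]}


print(sketch_clean_string("  Grote   Markt\n1  "))  # Grote Markt 1
print(sketch_split_bus("5 bus 3"))  # {'housenumber': '5', 'bus': '3'}
```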

Output schema

Each entry in the locations list is a flat dict. Only fields detected by the model are included.

| Field | Type | Description |
|---|---|---|
| location | str | The substring corresponding to this location |
| street | str | Street name |
| road | str | Road/route name |
| housenumber | str | Individual house number (after expansion) |
| housenumber_type | str | "single", "range", or "sequence" |
| bus | str | Bus/apartment number (when present) |
| postcode | str | Postal code |
| city | str | City or municipality |
| province | str | Province |
| building | str | Named building or facility |
| intersection | str | Road intersection |
| parcel | str | Land parcel identifier |
| district | str | District or neighbourhood |
| grave_location | str | Cemetery plot/row/number |
| domain_zone_area | str | Zone or area name |
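For type-checked code, the schema can be mirrored with a TypedDict. The `LocationEntry` and `ParseResult` names below are hypothetical, since the package itself is documented as returning plain dicts. `total=False` reflects that only detected fields are included.

```python
from typing import TypedDict


class LocationEntry(TypedDict, total=False):
    """One entry of the locations list; every field is optional."""
    location: str
    street: str
    road: str
    housenumber: str
    housenumber_type: str  # "single", "range", or "sequence"
    bus: str
    postcode: str
    city: str
    province: str
    building: str
    intersection: str
    parcel: str
    district: str
    grave_location: str
    domain_zone_area: str


class ParseResult(TypedDict):
    original: str
    locations: list[LocationEntry]


entry: LocationEntry = {"street": "Grote Markt", "housenumber": "1", "city": "Antwerpen"}
print(entry.get("bus"))  # None: the field is absent, not empty
```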

Development

Running tests

pytest tests/

The unit tests for the helper functions (clean_string, clean_house_number, extract_house_and_bus_number) do not require the model to be loaded and run offline.
