A dual-head NER-based parser for location strings
decide-location-formatter
A Python package for parsing and structuring location strings into their individual address components. Built around a dual-head NER model (svercoutere/abb-dual-location-component-ner) fine-tuned on top of XLM-RoBERTa base.
How it works
Raw location strings like "Scaldisstraat 23-25, 2000 Antwerpen" or "Cafe den Draak, Lovegemlaan 7, 9000 Gent" are common in municipal decision text but are inconsistently formatted and often contain multiple distinct locations in a single string.
The pipeline has three steps:
- Text cleaning: normalises whitespace, unicode, and newlines.
- Dual-head NER inference: the model runs two independent CRF-decoded classification heads over every token simultaneously:
  - Component head: tags each token as one of 12 address component types (street, city, postcode, …).
  - Location head: groups tokens that belong to the same physical location into B-LOCATION/I-LOCATION spans, allowing multi-location strings to be split.
- Post-processing: component spans are nested inside their parent location spans, housenumber ranges/sequences (e.g. 23-25, 7 en 9) are expanded into individual entries, and bus numbers are split into a separate field.
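The nesting part of post-processing can be pictured with a minimal sketch. It assumes word-level BIO tags from both heads (the B-/I- prefixes are stated for the location head; for the component head they are an assumption), and the helper names `bio_spans` and `nest` are hypothetical, not part of the package. The tags below are hand-written for illustration, not model output.

```python
def bio_spans(tags):
    """Turn a BIO tag sequence into (label, start, end) spans, end exclusive."""
    spans, start, label = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # trailing "O" flushes the last span
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.append((label, start, i))
            start, label = None, None
        # A stray I- tag with no open span is treated as the start of a new span.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def nest(words, location_tags, component_tags):
    """Attach each component span to the location span that fully contains it."""
    component_spans = bio_spans(component_tags)
    locations = []
    for _, l_start, l_end in bio_spans(location_tags):
        entry = {"location": " ".join(words[l_start:l_end])}
        for label, c_start, c_end in component_spans:
            if l_start <= c_start and c_end <= l_end:
                entry[label.lower()] = " ".join(words[c_start:c_end])
        locations.append(entry)
    return locations


# Hand-written example tags (an illustration, not real model predictions).
words = ["Lovegemlaan", "7", ",", "9000", "Gent", "en",
         "Dorpstraat", "12", ",", "9240", "Zele"]
one_location = ["B-LOCATION"] + ["I-LOCATION"] * 4
location_tags = one_location + ["O"] + one_location
one_address = ["B-STREET", "B-HOUSENUMBER", "O", "B-POSTCODE", "B-CITY"]
component_tags = one_address + ["O"] + one_address

parsed = nest(words, location_tags, component_tags)
for entry in parsed:
    print(entry)
```

Because the sketch joins words with single spaces, the `location` value here is `"Lovegemlaan 7 , 9000 Gent"`; recovering the exact substring would require character offsets, which are left out for brevity.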
Architecture
| Component | Detail |
|---|---|
| Base encoder | xlm-roberta-base (12 layers, 768 hidden) |
| Component head | Linear(768 → 256) · GELU · Dropout(0.1) · Linear(256 → 25) + CRF |
| Location head | Linear(768 → 256) · GELU · Dropout(0.1) · Linear(256 → 3) + CRF |
| Tokenisation | Word-level regex tokeniser; sub-word alignments via fast tokenizer word_ids() |
| Max input length | 256 sub-word tokens |
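The tokenisation row can be illustrated with a short sketch. The regex below is an assumed pattern (the package's actual one is not documented here), and keeping only the first sub-word of each word is one common alignment strategy, not necessarily the one the model uses. The `word_ids` list has the shape returned by a Hugging Face fast tokenizer's `word_ids()`: `None` for special tokens, otherwise the index of the source word.

```python
import re

# Assumed pattern: runs of word characters, or single punctuation marks.
WORD_RE = re.compile(r"\w+|[^\w\s]")


def word_tokenise(text):
    """Split text into (word, start, end) triples with character offsets."""
    return [(m.group(), m.start(), m.end()) for m in WORD_RE.finditer(text)]


def first_subword_mask(word_ids):
    """True for the first sub-word of every word, False elsewhere."""
    mask, previous = [], None
    for wid in word_ids:
        mask.append(wid is not None and wid != previous)
        previous = wid
    return mask


def word_level_labels(subword_labels, word_ids):
    """Reduce per-sub-word labels to one label per word (first sub-word wins)."""
    mask = first_subword_mask(word_ids)
    return [label for label, keep in zip(subword_labels, mask) if keep]


words = [w for w, _, _ in word_tokenise("Scaldisstraat 23-25, 2000 Antwerpen")]
print(words)

# Hand-written word_ids for a 3-word input whose first word splits into 3 pieces.
example_word_ids = [None, 0, 0, 0, 1, 2, None]
example_labels = ["O", "B-STREET", "I-STREET", "I-STREET", "B-POSTCODE", "B-CITY", "O"]
print(word_level_labels(example_labels, example_word_ids))
```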
Entity types (component head)
| Label | Description |
|---|---|
| `STREET` | Street name (no house number) |
| `ROAD` | Road or route name |
| `HOUSENUMBER` | House/building number(s), ranges or sequences |
| `POSTCODE` | Postal or ZIP code |
| `CITY` | City or municipality name |
| `PROVINCE` | Province or region name |
| `BUILDING` | Named building, site or facility |
| `INTERSECTION` | Crossing or intersection of roads |
| `PARCEL` | Land parcel, section or lot number |
| `DISTRICT` | District, neighbourhood or borough |
| `GRAVE_LOCATION` | Plot/row/number within a cemetery |
| `DOMAIN_ZONE_AREA` | Domain, zone or area name |
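The head sizes in the architecture table (25 and 3 outputs) are consistent with BIO tagging: one `O` label plus a `B-`/`I-` pair per entity type. The sketch below shows that arithmetic. The BIO scheme for the component head and the label ordering are assumptions; only `B-LOCATION`/`I-LOCATION` are named explicitly above.

```python
COMPONENT_TYPES = [
    "STREET", "ROAD", "HOUSENUMBER", "POSTCODE", "CITY", "PROVINCE",
    "BUILDING", "INTERSECTION", "PARCEL", "DISTRICT",
    "GRAVE_LOCATION", "DOMAIN_ZONE_AREA",
]


def bio_labels(entity_types):
    """Build an 'O' label plus a B-/I- pair for every entity type."""
    labels = ["O"]
    for entity_type in entity_types:
        labels += [f"B-{entity_type}", f"I-{entity_type}"]
    return labels


component_labels = bio_labels(COMPONENT_TYPES)   # 1 + 2 * 12 labels
location_labels = bio_labels(["LOCATION"])       # 1 + 2 * 1 labels
print(len(component_labels), len(location_labels))  # 25 3
```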
Evaluation
Evaluated on a held-out 10 % split of ~10 000 Belgian municipal decision location strings.
| Metric | Score |
|---|---|
| Combined F1 | 0.9435 |
| Component F1 | 0.9295 |
| Location F1 | 0.9576 |
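The table does not say how Combined F1 is derived. The reported value is numerically close to the unweighted mean of the two head scores, which the snippet below checks; treat this as an observation about the numbers, not a documented definition.

```python
component_f1 = 0.9295
location_f1 = 0.9576
reported_combined_f1 = 0.9435

# Unweighted mean of the two per-head scores as reported in the table.
mean_f1 = (component_f1 + location_f1) / 2
gap = abs(mean_f1 - reported_combined_f1)
print(f"mean of heads: {mean_f1:.5f}, gap to reported value: {gap:.5f}")
```

The remaining gap is small enough to be explained by the per-head scores themselves being rounded to four decimals.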
Installation
From source (recommended during development)
```bash
git clone https://github.com/semantic-ai/decide-location-formatter.git
cd decide-location-formatter
pip install -e .
```
Dependencies only
```bash
# Quote the version specifiers so the shell does not treat ">" as a redirect.
pip install "torch>=2.0" "transformers>=4.35" "pytorch-crf>=0.7.2"
```
The model weights (~1 GB) are downloaded automatically from the Hugging Face Hub on first use.
Usage
Quick start
```python
from locationformatter import LocationFormatter

lf = LocationFormatter()  # loads model once; reuse for many calls
result = lf.parse("Scaldisstraat 23-25, 2000 Antwerpen")
print(result)
```

```json
{
  "original": "Scaldisstraat 23-25, 2000 Antwerpen",
  "locations": [
    {
      "location": "Scaldisstraat 23-25, 2000 Antwerpen",
      "street": "Scaldisstraat",
      "housenumber": "23",
      "housenumber_type": "single",
      "postcode": "2000",
      "city": "Antwerpen"
    },
    {
      "location": "Scaldisstraat 23-25, 2000 Antwerpen",
      "street": "Scaldisstraat",
      "housenumber": "25",
      "housenumber_type": "single",
      "postcode": "2000",
      "city": "Antwerpen"
    }
  ]
}
```
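Once a string is parsed, the flat entries are easy to turn back into a normalised one-line address. The sketch below does this for the two entries shown above; `format_address` is a hypothetical helper, and the `"<street> <number> bus <bus>, <postcode> <city>"` layout is an assumed convention, not something the package prescribes.

```python
def format_address(loc):
    """Build a one-line address from a parsed entry; missing fields are skipped."""
    street_bits = [loc.get("street") or loc.get("road"), loc.get("housenumber")]
    street_part = " ".join(bit for bit in street_bits if bit)
    if loc.get("bus"):
        street_part = f"{street_part} bus {loc['bus']}".strip()
    city_part = " ".join(bit for bit in (loc.get("postcode"), loc.get("city")) if bit)
    return ", ".join(part for part in (street_part, city_part) if part)


# The two entries from the quick-start output, trimmed to the fields used here.
locations = [
    {"street": "Scaldisstraat", "housenumber": "23", "postcode": "2000", "city": "Antwerpen"},
    {"street": "Scaldisstraat", "housenumber": "25", "postcode": "2000", "city": "Antwerpen"},
]
for loc in locations:
    print(format_address(loc))
```

Using `dict.get` throughout matters because only fields detected by the model are present in an entry.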
Multi-location strings
Strings that contain several distinct locations are automatically split:
```python
result = lf.parse("Lovegemlaan 7, 9000 Gent en Dorpstraat 12, 9240 Zele")
for loc in result["locations"]:
    print(loc)
```
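When many strings are parsed, it can be convenient to flatten the results into one row per location. The following sketch writes such rows to CSV with the standard library. The `to_csv` helper is hypothetical, and the sample result is a hand-written stand-in in the documented shape, not actual output of `lf.parse`.

```python
import csv
import io

# Column order is an arbitrary choice; the names follow the package's output fields.
FIELDS = [
    "original", "location", "street", "road", "housenumber", "housenumber_type",
    "bus", "postcode", "city", "province", "building", "intersection",
    "parcel", "district", "grave_location", "domain_zone_area",
]


def to_csv(results):
    """Write one CSV row per location; fields the model did not detect stay empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval="", extrasaction="ignore")
    writer.writeheader()
    for result in results:
        for loc in result["locations"]:
            writer.writerow({"original": result["original"], **loc})
    return buffer.getvalue()


# Hand-written stand-in for a parse() result (illustration only).
sample = {
    "original": "Lovegemlaan 7, 9000 Gent en Dorpstraat 12, 9240 Zele",
    "locations": [
        {"location": "Lovegemlaan 7, 9000 Gent", "street": "Lovegemlaan",
         "housenumber": "7", "postcode": "9000", "city": "Gent"},
        {"location": "Dorpstraat 12, 9240 Zele", "street": "Dorpstraat",
         "housenumber": "12", "postcode": "9240", "city": "Zele"},
    ],
}
csv_text = to_csv([sample])
print(csv_text)
```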
Raw prediction (no housenumber expansion)
predict() returns spans straight from the model without expanding ranges or splitting bus numbers:
```python
raw = lf.predict("Heikeesstraat 2-4, 9240 Zele")
# raw["locations"][0]["housenumber"] == "2-4"
# raw["locations"][0]["housenumber_type"] == "range"
```
One-shot helper
For a single call without keeping the model in memory:
```python
from locationformatter import parse_location

result = parse_location("Grote Markt 1, 2000 Antwerpen")
```
Note: parse_location reloads the model on every call. Use LocationFormatter for repeated parsing.
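One way to get one-shot convenience without the reload cost is to cache a single formatter per configuration. This is a sketch of that pattern using `functools.lru_cache`; `get_formatter` is a hypothetical helper, not part of the package, and it relies only on the `LocationFormatter(repo=..., device=...)` constructor documented in the API reference.

```python
from functools import lru_cache
from typing import Optional

DEFAULT_REPO = "svercoutere/abb-dual-location-component-ner"


@lru_cache(maxsize=None)
def get_formatter(repo: str = DEFAULT_REPO, device: Optional[str] = None):
    """Create a LocationFormatter once per (repo, device) pair and reuse it."""
    # Imported lazily so that importing this module stays cheap.
    from locationformatter import LocationFormatter

    return LocationFormatter(repo=repo, device=device)


# Nothing is loaded until the first call, e.g.:
#   result = get_formatter().parse("Grote Markt 1, 2000 Antwerpen")
print(get_formatter.cache_info())
```

Note that `lru_cache` keys on the arguments exactly as passed, so `get_formatter()` and `get_formatter(DEFAULT_REPO)` create separate instances; calling it the same way everywhere avoids that.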
Custom model or device
```python
from locationformatter import LocationFormatter

lf = LocationFormatter(
    repo="your-org/your-model",  # any compatible HF Hub repo
    device="cuda",               # "cpu" or "cuda"; auto-detected when omitted
)
```
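If the device needs to be chosen explicitly (for example to log it, or to fall back gracefully), a check along these lines is a common approach. It is a sketch of how a device could be picked, not a description of the package's own auto-detection; `pick_device` is a hypothetical helper.

```python
def pick_device(preferred=None):
    """Return the preferred device if given, else "cuda" when available, else "cpu"."""
    if preferred is not None:
        return preferred
    try:
        import torch
    except ImportError:
        # Without torch installed there is nothing to probe; default to CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"selected device: {device}")
# The result can then be passed on: LocationFormatter(device=device)
```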
API reference
LocationFormatter
```python
class LocationFormatter:
    def __init__(self, repo: str = "svercoutere/abb-dual-location-component-ner",
                 device: str | None = None): ...

    def parse(self, text: str) -> dict: ...
    # Full pipeline: clean → NER → expand housenumbers.
    # Returns {"original": str, "locations": list[dict]}

    def predict(self, text: str) -> dict: ...
    # NER only, no housenumber expansion.
    # Returns {"original": str, "locations": list[dict]}
```
Helper functions
```python
from locationformatter import clean_string, clean_house_number, extract_house_and_bus_number

clean_string(" Grote Markt\n1 ")
# → "Grote Markt 1"

clean_house_number("3 t.e.m. 7")
# → ["3", "4", "5", "6", "7"]

clean_house_number("10-14")
# → ["10", "11", "12", "13", "14"]

extract_house_and_bus_number("5 bus 3")
# → {"housenumber": "5", "bus": "3"}
```
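To make the expansion rules concrete, here is a small re-implementation sketch that reproduces the documented helper examples. It is not the package's code: the function names `expand_house_numbers` and `split_bus`, the regexes, and the `None` returned when no bus number is present are all assumptions. It also does not attempt to reproduce how `parse()` expanded `23-25` in the Quick start output.

```python
import re

# Assumed range separators: a hyphen or the Dutch "t.e.m." / "tot en met".
RANGE_RE = re.compile(r"^\s*(\d+)\s*(?:-|t\.e\.m\.|tot en met)\s*(\d+)\s*$", re.IGNORECASE)
# Assumed sequence separators: a comma or the Dutch word "en".
SEQUENCE_SPLIT_RE = re.compile(r"\s*(?:,|\ben\b)\s*", re.IGNORECASE)
BUS_RE = re.compile(r"^\s*(\S+?)\s+bus\s+(\S+)\s*$", re.IGNORECASE)


def expand_house_numbers(raw):
    """Expand "3 t.e.m. 7" or "10-14" into every number; split "7 en 9" into parts."""
    match = RANGE_RE.match(raw)
    if match:
        low, high = int(match.group(1)), int(match.group(2))
        if low <= high:
            return [str(number) for number in range(low, high + 1)]
    return [part for part in SEQUENCE_SPLIT_RE.split(raw.strip()) if part]


def split_bus(raw):
    """Separate "5 bus 3" into house number and bus number."""
    match = BUS_RE.match(raw)
    if match:
        return {"housenumber": match.group(1), "bus": match.group(2)}
    return {"housenumber": raw.strip(), "bus": None}  # assumed fallback shape


print(expand_house_numbers("3 t.e.m. 7"))
print(expand_house_numbers("7 en 9"))
print(split_bus("5 bus 3"))
```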
Output schema
Each entry in the locations list is a flat dict. Only fields detected by the model are included.
| Field | Type | Description |
|---|---|---|
| `location` | `str` | The substring corresponding to this location |
| `street` | `str` | Street name |
| `road` | `str` | Road/route name |
| `housenumber` | `str` | Individual house number (after expansion) |
| `housenumber_type` | `str` | `"single"`, `"range"`, or `"sequence"` |
| `bus` | `str` | Bus/apartment number (when present) |
| `postcode` | `str` | Postal code |
| `city` | `str` | City or municipality |
| `province` | `str` | Province |
| `building` | `str` | Named building or facility |
| `intersection` | `str` | Road intersection |
| `parcel` | `str` | Land parcel identifier |
| `district` | `str` | District or neighbourhood |
| `grave_location` | `str` | Cemetery plot/row/number |
| `domain_zone_area` | `str` | Zone or area name |
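For type-checked code, the schema above can be mirrored as `TypedDict` classes. The package itself documents plain `dict` return values, so the class names `ParsedLocation` and `ParseResult` and the `unknown_fields` helper below are hypothetical conveniences. `total=False` reflects that only detected fields are present.

```python
from typing import List, TypedDict


class ParsedLocation(TypedDict, total=False):
    """One entry of the "locations" list; every key is optional."""
    location: str
    street: str
    road: str
    housenumber: str
    housenumber_type: str  # "single", "range", or "sequence"
    bus: str
    postcode: str
    city: str
    province: str
    building: str
    intersection: str
    parcel: str
    district: str
    grave_location: str
    domain_zone_area: str


class ParseResult(TypedDict):
    original: str
    locations: List[ParsedLocation]


KNOWN_FIELDS = frozenset(ParsedLocation.__annotations__)


def unknown_fields(entry):
    """List keys that are not part of the documented schema."""
    return sorted(set(entry) - KNOWN_FIELDS)


entry: ParsedLocation = {"street": "Grote Markt", "housenumber": "1", "city": "Antwerpen"}
print(unknown_fields(entry))
```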
Development
Running tests
```bash
pytest tests/
```
The unit tests for the helper functions (clean_string, clean_house_number, extract_house_and_bus_number) do not require the model to be loaded and run offline.