UK location free-text geocoder — resolves road references, place names, junctions, and infrastructure to lat/lon

ukgeo

Quick start

Option A — pip install (recommended)

pip install ukgeo
ukgeo setup        # downloads ~51MB data file from Kaggle
ukgeo geocode "M62 Junction 26"

Option B — from source

git clone https://github.com/ThomasHSimm/ukgeo.git
cd ukgeo
pip install -e ".[dev]"
python scripts/download_os_open_names.py   # ~5 min, builds local parquets
python scripts/download_os_open_roads.py
python scripts/download_osm_named_junctions.py
python scripts/download_osm_roads.py

Python API

from ukgeo import Geocoder

geo = Geocoder()
print(geo.geocode("M62 Junction 26"))
print(geo.geocode("Spaghetti Junction Birmingham"))
print(geo.geocode("Skipton, North Yorkshire"))

CLI usage

After installation, ukgeo is available as a command:

# Single query
ukgeo geocode "M62 Junction 26"
ukgeo geocode "Skipton, North Yorkshire"

# Geocode a CSV file (auto-detects first string column)
ukgeo geocode locations.csv --output results.csv

# Specify column and domain
ukgeo geocode crashes.csv --column road_reference --domain road_safety

# Enable Level 3 OS Names API fallback (requires OS_API_KEY in .env)
ukgeo geocode locations.csv --max-level 3

# Generate an interactive HTML map from geocoded results
ukgeo geocode locations.csv --output results.csv
ukgeo plot results.csv --output results_map.html

# Check installation status
ukgeo info

Output columns added to CSV: lat, lon, confidence, level_resolved, interpreted_as, match_type, candidates_considered, notes.

ukgeo is a tiered free-text geocoder for UK locations. It converts messy location strings (addresses, road references, place names, colloquial names) to latitude/longitude coordinates using a pipeline that escalates from alias lookup and fast regex matching to OS Open Names lookup, with optional API and local LLM fallbacks.

It is designed for bulk processing: a parquet-backed setup step loads the reference data once, after which hundreds, thousands, or millions of UK location strings can be geocoded in-process.

Features

  • Tiered pipeline — fast paths handle the easy cases; slower paths only fire when needed
  • UK-specific — built on OS Open Names, OS Open Roads, OSM road references, and postcodes.io, tuned for British address conventions
  • Road-aware matching — handles M/A/B road references, motorway junctions, named junctions, and common road suffix abbreviations
  • Bulk-first design — loads OS data once, processes thousands of entries in-process
  • Packaged data fallback — uses richer local source parquets when present, or the combined Kaggle parquet for simpler setup
  • Confidence scoring — every result includes a normalised match score and confidence level
  • Tuneable weights — scoring parameters are configurable and can be calibrated against labelled test data
  • Extensible — Level 3 API and Level 4 local Ollama LLM stubs are ready for future fallback work
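To illustrate the road-aware matching listed above, here is a minimal sketch that pulls a road reference and an optional junction number out of free text and expands a few common suffix abbreviations. The regex, the `ROAD_SUFFIXES` table, and both function names are assumptions made for this example; they are not taken from ukgeo's source and cover far fewer cases than a real matcher would.

```python
import re

# Hypothetical pattern: an M/A/B road such as "M62", "A1(M)" or "B6265",
# optionally followed by "Junction 26", "Jct 26" or "J26".
ROAD_RE = re.compile(
    r"\b(?P<road>[MAB]\d{1,4}(?:\(M\))?)"
    r"(?:\s+(?:junction|jct|j)\s*(?P<junction>\d{1,2}[A-Z]?))?",
    re.IGNORECASE,
)

# Hypothetical abbreviation table; a real one would be much longer.
ROAD_SUFFIXES = {"rd": "road", "st": "street", "ave": "avenue", "ln": "lane"}


def extract_road_ref(text):
    """Return (road, junction) in upper case, or None if no road is found."""
    m = ROAD_RE.search(text)
    if m is None:
        return None
    junction = m.group("junction")
    return (m.group("road").upper(), junction.upper() if junction else None)


def expand_suffixes(text):
    """Expand suffix abbreviations, keeping any trailing '.' or ','."""
    out = []
    for i, token in enumerate(text.split()):
        core = token.rstrip(".,")
        tail = token[len(core):]
        full = ROAD_SUFFIXES.get(core.lower())
        # Skip the first token: a leading "St" usually means "Saint".
        if i > 0 and full:
            out.append(full.capitalize() + tail)
        else:
            out.append(token)
    return " ".join(out)
```

Leaving the first token untouched is a crude way to sidestep the St/Saint ambiguity; a real implementation would need more context than token position.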

Pipeline levels

Level  Method                       Handles
0      Infrastructure alias lookup  Named bridges, tunnels, junctions, bus stations
1      Regex + postcodes.io         Full UK postcodes, M/A/B road pattern extraction
2      OS/OSM token scoring         Places, roads, junctions, named roundabouts
3      OS Names API fallback        Bus stations, airports, service stations
4      Local Ollama LLM (stub)      Last resort; not yet implemented
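The escalation between levels can be sketched as a list of resolvers tried in order, stopping at the first one that returns a result. Everything below (the `Result` class, the resolver functions, the toy `ALIASES` and `GAZETTEER` tables) is a hypothetical stand-in written for this illustration and is not ukgeo's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    lat: float
    lon: float
    level_resolved: int


# Toy reference data standing in for the real alias table and gazetteer.
ALIASES = {"spaghetti junction": (52.511, -1.865)}  # approximate coordinates
GAZETTEER = {"skipton": (53.9602, -2.0177)}  # value from the usage example


def level0_alias(text: str) -> Optional[Result]:
    hit = ALIASES.get(text.strip().lower())
    return Result(hit[0], hit[1], 0) if hit else None


def level1_regex(text: str) -> Optional[Result]:
    # Placeholder: a real level 1 would match postcodes and road patterns.
    return None


def level2_lookup(text: str) -> Optional[Result]:
    # Use the part before the first comma as the place name.
    hit = GAZETTEER.get(text.split(",")[0].strip().lower())
    return Result(hit[0], hit[1], 2) if hit else None


RESOLVERS = [level0_alias, level1_regex, level2_lookup]


def geocode(text: str, max_level: int = 2) -> Optional[Result]:
    """Try each level in order; slower levels run only if faster ones miss."""
    for level, resolver in enumerate(RESOLVERS):
        if level > max_level:
            break
        result = resolver(text)
        if result is not None:
            return result
    return None
```

Capping the loop with `max_level` mirrors the `--max-level` CLI flag: levels above the cap are never called.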

Usage

Single geocode

from ukgeo import Geocoder

geo = Geocoder()
result = geo.geocode("Skipton, North Yorkshire")

print(result.lat, result.lon)        # 53.9602, -2.0177
print(result.confidence)             # High
print(result.interpreted_as)         # Skipton (Town)
print(result.level_resolved)         # 2
print(result.notes)                  # match_score=...

Bulk geocode

from ukgeo import Geocoder

geo = Geocoder()
locations = ["LS1 1BA", "M62 Junction 26", "Spaghetti Junction Birmingham"]
df = geo.geocode_batch(locations)
df.write_csv("results.csv")

Output columns: input, lat, lon, interpreted_as, match_type, level_resolved, confidence, candidates_considered, notes.
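Once results are written to CSV, a common next step is to separate confident matches from rows that need manual review. The sketch below uses only the standard library and the column names listed above. It assumes that unresolved rows have empty lat/lon fields and that confidence labels other than "High" are "Medium" and "Low"; neither assumption is confirmed by this README.

```python
import csv
import io

# Assumed label set; only "High" appears in the examples in this README.
CONFIDENCE_RANK = {"Low": 0, "Medium": 1, "High": 2}

# Small in-memory sample in place of a real results.csv.
SAMPLE = (
    "input,lat,lon,confidence\n"
    "LS1 1BA,53.7997,-1.5492,High\n"
    "Somewhere vague,,,Low\n"
)


def split_by_confidence(csv_file, minimum="High"):
    """Return (keep, review): resolved rows at or above `minimum`, and the rest."""
    keep, review = [], []
    threshold = CONFIDENCE_RANK[minimum]
    for row in csv.DictReader(csv_file):
        rank = CONFIDENCE_RANK.get(row["confidence"], -1)
        resolved = row["lat"] != "" and row["lon"] != ""
        if resolved and rank >= threshold:
            keep.append(row)
        else:
            review.append(row)
    return keep, review


keep, review = split_by_confidence(io.StringIO(SAMPLE))
print(len(keep), "kept;", len(review), "for review")
```

For a file on disk, pass `open("results.csv", newline="")` instead of the `io.StringIO` sample.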

Benchmarking

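from ukgeo import Geocoder

geo = Geocoder()

# Labelled reference points: input string plus expected coordinates
test_data = [
    {"input": "Skipton, North Yorkshire", "lat": 53.9619, "lon": -2.0175},
    {"input": "LS1 1BA", "lat": 53.7997, "lon": -1.5492},
]
geo.benchmark(test_data)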
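This README does not say which metrics benchmark() reports. A natural one for labelled coordinates is the great-circle distance between the geocoded point and the expected point, which can be computed independently as a sanity check. The sketch below uses the haversine formula; `haversine_m`, `summarise`, and the 500 m tolerance are names and values chosen for this example, not part of ukgeo.

```python
import math
import statistics

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def summarise(errors_m, tolerance_m=500):
    """Median error and share of results within a distance tolerance."""
    within = sum(1 for e in errors_m if e <= tolerance_m)
    return {
        "n": len(errors_m),
        "median_m": statistics.median(errors_m),
        "within_tolerance": within / len(errors_m),
    }


# Geocoded Skipton (from the single-geocode example) vs. the labelled point.
error = haversine_m(53.9602, -2.0177, 53.9619, -2.0175)
print(summarise([error]))
```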

Custom weights

from ukgeo import Geocoder, ScoringWeights

weights = ScoringWeights(
    county_context_match=7.0,
    junction_match=10.0,
    high_threshold=0.30,
)
geo = Geocoder(weights=weights)
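To give a feel for how weights and a threshold might interact, here is one possible scheme: sum the weights of the features that matched, divide by the total available weight, and compare the normalised score with high_threshold. This is purely an assumption for illustration. The `Weights` class is a stand-in carrying only the three fields shown above, and the README does not document how ukgeo actually combines or normalises its weights.

```python
from dataclasses import dataclass


@dataclass
class Weights:
    """Stand-in for ScoringWeights with only the three fields shown above."""
    county_context_match: float = 7.0
    junction_match: float = 10.0
    high_threshold: float = 0.30


def normalised_score(matched, w):
    """Share of the total feature weight earned by the matched features."""
    available = {
        "county_context_match": w.county_context_match,
        "junction_match": w.junction_match,
    }
    total = sum(available.values())
    earned = sum(v for name, v in available.items() if name in matched)
    return earned / total


def is_high_confidence(score, w):
    return score >= w.high_threshold


w = Weights()
s = normalised_score({"county_context_match"}, w)  # 7 / 17
print(round(s, 3), is_high_confidence(s, w))
```

Under this scheme, raising high_threshold makes "High" harder to reach, while raising a single weight increases the influence of that feature relative to the others.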

Calibrate weights

Provide a CSV with columns input, lat, lon:

python scripts/calibrate.py --test data/my_test_locations.csv --trials 300

Best-fit weights are saved to config/weights.yaml and loaded automatically on next Geocoder() init.
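The search strategy used by calibrate.py is not described here. As a rough mental model of trial-based calibration, the sketch below runs a plain random search: sample candidate weights within bounds, score each candidate with an objective, and keep the best. The `random_search` function, the bounds, and the toy objective are all hypothetical; a real objective would geocode the labelled CSV and measure accuracy.

```python
import random


def random_search(objective, bounds, trials=300, seed=0):
    """Sample `trials` candidates uniformly within bounds; return the best."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(trials):
        params = {
            name: rng.uniform(low, high)
            for name, (low, high) in bounds.items()
        }
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params):
    # Toy stand-in that peaks at (7, 10); higher is better.
    return -(
        (params["county_context_match"] - 7.0) ** 2
        + (params["junction_match"] - 10.0) ** 2
    )


bounds = {"county_context_match": (0.0, 20.0), "junction_match": (0.0, 20.0)}
best, score = random_search(toy_objective, bounds, trials=300)
print(best, score)
```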

Data setup

ukgeo prefers individual source parquets because they preserve richer metadata. If those are absent, it falls back to the combined Kaggle parquet at data/kaggle/ukgeo_data.parquet.
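The preference order described above can be sketched as a simple file check. The function name and the rule that all source parquets must be present before they are used are assumptions for this example; the README does not say how ukgeo treats a partial set of source files.

```python
from pathlib import Path

# Combined fallback path as given in this README.
KAGGLE_PARQUET = Path("data/kaggle/ukgeo_data.parquet")


def choose_data_files(source_parquets, combined=KAGGLE_PARQUET):
    """Return ("sources", paths) if every source parquet exists,
    otherwise ("combined", [path]) if the combined file exists."""
    sources = [Path(p) for p in source_parquets]
    if sources and all(p.is_file() for p in sources):
        return "sources", sources
    combined = Path(combined)
    if combined.is_file():
        return "combined", [combined]
    raise FileNotFoundError(
        "no reference data found; try running `ukgeo setup`"
    )
```

The caller supplies the list of source parquet paths, so the sketch does not guess at the file names the download scripts produce.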

To regenerate the combined Kaggle release file from local source parquets:

python scripts/build_kaggle_dataset.py

pyproj is recommended for converting British National Grid (BNG) coordinates to WGS84; it is included in the project dependencies.

Data sources

Source                         Licence                  Used for
OS Open Names                  Open Government Licence  Place names, roads, postcodes
OS Open Roads                  Open Government Licence  Motorway junction points
OpenStreetMap                  ODbL                     Named junctions, roundabouts, and B-road segments
postcodes.io                   MIT                      Postcode centroid lookup
Kaggle ukgeo combined dataset  Mixed source licences    Combined fallback parquet for simpler setup

Running tests

pytest -v

Tests require the OS Open Names parquet to be present; they skip otherwise. The suite covers postcodes, motorway junctions, A-roads, B-roads, named interchanges, road-suffix abbreviations, St/Saint ambiguity, colloquial county context, bad county context, place-name typos, and batch geocoding.

Known gaps are documented with strict xfail regression cases rather than being counted as ordinary failures.

Known limitations

See docs/STATUS.md for current test results and documented gaps.

Project structure

ukgeo/
├── scripts/
│   ├── download_os_open_names.py
│   ├── download_os_open_roads.py
│   ├── download_osm_named_junctions.py
│   ├── download_osm_roads.py
│   ├── build_kaggle_dataset.py
│   ├── build_stats19_eval.py
│   └── calibrate.py
├── ukgeo/
│   ├── pipeline.py
│   ├── level1_regex.py
│   ├── level2_ner.py
│   ├── lookup.py
│   ├── models.py
│   └── uk_admin.py
├── tests/
│   └── test_pipeline.py
└── data/

Licence

MIT License
