UK location free-text geocoder — resolves road references, place names, junctions, and infrastructure to lat/lon
ukgeo
Quick start
Option A — pip install (recommended)
pip install ukgeo
ukgeo setup # downloads ~51MB data file from Kaggle
ukgeo geocode "M62 Junction 26"
Option B — from source
git clone https://github.com/ThomasHSimm/ukgeo.git
cd ukgeo
pip install -e ".[dev]"
python scripts/download_os_open_names.py # ~5 min, builds local parquets
python scripts/download_os_open_roads.py
python scripts/download_osm_named_junctions.py
python scripts/download_osm_roads.py
Python API
from ukgeo import Geocoder
geo = Geocoder()
print(geo.geocode("M62 Junction 26"))
print(geo.geocode("Spaghetti Junction Birmingham"))
print(geo.geocode("Skipton, North Yorkshire"))
CLI usage
After installation, ukgeo is available as a command:
# Single query
ukgeo geocode "M62 Junction 26"
ukgeo geocode "Skipton, North Yorkshire"
# Geocode a CSV file (auto-detects first string column)
ukgeo geocode locations.csv --output results.csv
# Specify column and domain
ukgeo geocode crashes.csv --column road_reference --domain road_safety
# Enable Level 3 OS Names API fallback (requires OS_API_KEY in .env)
ukgeo geocode locations.csv --max-level 3
# Generate an interactive HTML map from geocoded results
ukgeo geocode locations.csv --output results.csv
ukgeo plot results.csv --output results_map.html
# Check installation status
ukgeo info
Output columns added to the CSV: lat, lon, confidence, level_resolved, interpreted_as, match_type, candidates_considered, notes.
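Once a results file exists, the added columns can be inspected with the standard library alone. The sketch below is illustrative only: the sample rows, the `location` column name, the `Low` label, and the idea that unresolved rows have empty lat/lon are all assumptions, and `summarise` is a hypothetical helper, not part of ukgeo.

```python
import csv
import io
from collections import Counter

# Made-up rows standing in for results.csv (coordinates are rough, columns abridged).
SAMPLE = """location,lat,lon,confidence,level_resolved
M62 Junction 26,53.74,-1.72,High,1
Skipton,53.96,-2.02,High,2
Some unknown place,,,Low,2
"""


def summarise(handle):
    """Count rows per confidence label and list inputs with no coordinates."""
    by_confidence = Counter()
    unresolved = []
    for row in csv.DictReader(handle):
        by_confidence[row["confidence"]] += 1
        # Assumption: a failed lookup leaves lat/lon empty in the CSV.
        if not row["lat"] or not row["lon"]:
            unresolved.append(row["location"])
    return by_confidence, unresolved


counts, missing = summarise(io.StringIO(SAMPLE))
print(dict(counts))
print(missing)
```

For a real file, pass `open("results.csv", newline="")` instead of the in-memory sample.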
A tiered UK location free-text geocoder. Converts messy location strings — addresses, road references, place names, colloquial names — to latitude/longitude coordinates using a pipeline that escalates from fast regex matching to OS Open Names lookup, with optional API and local LLM fallbacks.
Designed for bulk processing with a parquet-backed setup step: load reference data once, then geocode hundreds, thousands, or millions of UK location strings in-process.
Features
- Tiered pipeline — fast paths handle the easy cases; slower paths only fire when needed
- UK-specific — built on OS Open Names, OS Open Roads, OSM road references, and postcodes.io, tuned for British address conventions
- Road-aware matching — handles M/A/B road references, motorway junctions, named junctions, and common road suffix abbreviations
- Bulk-first design — loads OS data once, processes thousands of entries in-process
- Packaged data fallback — uses richer local source parquets when present, or the combined Kaggle parquet for simpler setup
- Confidence scoring — every result includes a normalised match score and confidence level
- Tuneable weights — scoring parameters are configurable and can be calibrated against labelled test data
- Extensible — Level 3 API and Level 4 local Ollama LLM stubs are ready for future fallback work
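To make the road-aware matching concrete, here is a minimal sketch of pulling a road reference and a junction number out of free text with regular expressions. The patterns and the `extract_road_ref` name are assumptions made for illustration, not ukgeo's actual Level 1 rules, and they ignore edge cases such as A-roads with an (M) suffix.

```python
import re

# Hypothetical patterns: M/A/B followed by up to four digits, optional space between.
ROAD_RE = re.compile(r"\b([MAB])\s?(\d{1,4})\b", re.IGNORECASE)
# "Junction 26", "Jct 26" or "J26", with an optional letter suffix such as 26A.
JUNCTION_RE = re.compile(r"\b(?:junction|jct|j)\s?(\d{1,2}[a-z]?)\b", re.IGNORECASE)


def extract_road_ref(text):
    """Return the first road reference and junction number found, or None for each."""
    road = ROAD_RE.search(text)
    junction = JUNCTION_RE.search(text)
    return {
        "road": f"{road.group(1).upper()}{road.group(2)}" if road else None,
        "junction": junction.group(1).upper() if junction else None,
    }


print(extract_road_ref("M62 Junction 26"))
print(extract_road_ref("a 65 near Skipton"))
```

A pattern this loose will also fire on strings like "Flat A 12", so a real pipeline would presumably validate the extracted reference against road data before trusting it.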
Pipeline levels
| Level | Method | Handles |
|---|---|---|
| 0 | Infrastructure alias lookup | Named bridges, tunnels, junctions, bus stations |
| 1 | Regex + postcodes.io | Full UK postcodes, M/A/B road pattern extraction |
| 2 | OS/OSM token scoring | Places, roads, junctions, named roundabouts |
| 3 | OS Names API fallback | Bus stations, airports, service stations |
| 4 | Local Ollama LLM (stub) | Last resort — not yet implemented |
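The escalation idea in the table can be sketched as a loop that tries each level in order and stops at the first hit. Everything below is a toy stand-in: the lookup tables, the resolver functions, and the `resolve` name are hypothetical, the coordinates are approximate, and only two levels are modelled.

```python
from typing import Callable, List, Optional, Tuple

Coords = Tuple[float, float]

# Toy lookup tables standing in for the real reference data.
ALIASES = {"spaghetti junction": (52.511, -1.865)}
PLACES = {"skipton": (53.9602, -2.0177)}


def level0_alias(text: str) -> Optional[Coords]:
    """Exact match against a table of well-known infrastructure names."""
    return ALIASES.get(text.lower().strip())


def level2_place(text: str) -> Optional[Coords]:
    """Look up the part before the first comma as a place name."""
    first_part = text.lower().split(",")[0].strip()
    return PLACES.get(first_part)


# Ordered cheapest first; each entry is (level number, resolver).
LEVELS: List[Tuple[int, Callable[[str], Optional[Coords]]]] = [
    (0, level0_alias),
    (2, level2_place),
]


def resolve(text: str, max_level: int = 2) -> Optional[dict]:
    """Try each level up to max_level and return the first successful result."""
    for level, resolver in LEVELS:
        if level > max_level:
            break
        coords = resolver(text)
        if coords is not None:
            return {"lat": coords[0], "lon": coords[1], "level_resolved": level}
    return None


print(resolve("Spaghetti Junction"))
print(resolve("Skipton, North Yorkshire"))
```

Capping the loop with `max_level` mirrors the `--max-level` flag shown in the CLI examples: slower or networked levels only run when explicitly allowed.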
Usage
Single geocode
from ukgeo import Geocoder
geo = Geocoder()
result = geo.geocode("Skipton, North Yorkshire")
print(result.lat, result.lon) # 53.9602, -2.0177
print(result.confidence) # High
print(result.interpreted_as) # Skipton (Town)
print(result.level_resolved) # 2
print(result.notes) # match_score=...
Bulk geocode
from ukgeo import Geocoder
geo = Geocoder()
locations = ["LS1 1BA", "M62 Junction 26", "Spaghetti Junction Birmingham"]
df = geo.geocode_batch(locations)
df.write_csv("results.csv")
Output columns: input, lat, lon, interpreted_as, match_type, level_resolved, confidence, candidates_considered, notes.
Benchmarking
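from ukgeo import Geocoder
geo = Geocoder()
# Each entry pairs an input string with its known ground-truth coordinates
test_data = [
    {"input": "Skipton, North Yorkshire", "lat": 53.9619, "lon": -2.0175},
    {"input": "LS1 1BA", "lat": 53.7997, "lon": -1.5492},
]
geo.benchmark(test_data)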
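How `benchmark` scores a result is not described here. One common way to measure geocoding error is the great-circle distance between the predicted and expected points; the sketch below uses the haversine formula for that. The `haversine_m` helper is an assumption for illustration, not ukgeo's own metric.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Skipton: the coordinates from the single-geocode example vs the benchmark entry.
predicted = (53.9602, -2.0177)
expected = (53.9619, -2.0175)
error_m = haversine_m(*predicted, *expected)
print(f"error: {error_m:.0f} m")
```

For these two points the error comes out at a couple of hundred metres, which is the sort of gap to expect when a town is represented by slightly different centroids.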
Custom weights
from ukgeo import Geocoder, ScoringWeights
weights = ScoringWeights(
    county_context_match=7.0,
    junction_match=10.0,
    high_threshold=0.30,
)
geo = Geocoder(weights=weights)
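As a rough picture of how a threshold such as `high_threshold` might turn a normalised score into a confidence label, consider the sketch below. It is one plausible reading, not ukgeo's actual logic: `DemoThresholds`, `medium_threshold`, the `Medium` and `Low` labels, and the normalisation by a maximum score are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DemoThresholds:
    high_threshold: float = 0.30    # name borrowed from the example above
    medium_threshold: float = 0.15  # hypothetical, not a documented parameter


def confidence_label(raw_score, max_score, thresholds=DemoThresholds()):
    """Normalise a raw score into the 0..1 range and map it to a label."""
    normalised = raw_score / max_score if max_score > 0 else 0.0
    if normalised >= thresholds.high_threshold:
        return "High"
    if normalised >= thresholds.medium_threshold:
        return "Medium"
    return "Low"


# With a maximum of 17 points, 7 points normalises to about 0.41.
print(confidence_label(7.0, 17.0))
print(confidence_label(2.0, 17.0))
```

Under this reading, raising `high_threshold` makes the geocoder more conservative about labelling a match High without changing which candidate wins.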
Calibrate weights
Provide a CSV with columns input, lat, lon:
python scripts/calibrate.py --test data/my_test_locations.csv --trials 300
Best-fit weights are saved to config/weights.yaml and loaded automatically on next Geocoder() init.
Data setup
ukgeo prefers individual source parquets because they preserve richer metadata. If those are absent, it falls back to the combined Kaggle parquet at data/kaggle/ukgeo_data.parquet.
To regenerate the combined Kaggle release file from local source parquets:
python scripts/build_kaggle_dataset.py
For BNG to WGS84 coordinate conversion, pyproj is recommended and included in the project dependencies.
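For reference, a British National Grid to WGS84 conversion with pyproj can look like the sketch below. The `bng_to_wgs84` wrapper is a hypothetical helper written for this example, not a ukgeo function; the pyproj calls themselves (`Transformer.from_crs` and `transform`) are the library's standard API. pyproj is imported lazily so the module still loads where it is not installed.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def _transformer():
    """Build the EPSG:27700 -> EPSG:4326 transformer once and reuse it."""
    from pyproj import Transformer  # imported lazily; requires pyproj

    # always_xy=True fixes the axis order to (easting, northing) -> (lon, lat).
    return Transformer.from_crs("EPSG:27700", "EPSG:4326", always_xy=True)


def bng_to_wgs84(easting, northing):
    """Convert BNG metres to a (lat, lon) pair in WGS84 degrees."""
    lon, lat = _transformer().transform(easting, northing)
    return lat, lon
```

Calling `bng_to_wgs84(530000, 180000)` should return a point in central London. Reusing one transformer matters in bulk runs, because constructing it is far slower than a single transform call.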
Data sources
| Source | Licence | Used for |
|---|---|---|
| OS Open Names | Open Government Licence | Place names, roads, postcodes |
| OS Open Roads | Open Government Licence | Motorway junction points |
| OpenStreetMap | ODbL | Named junctions, roundabouts, and B-road segments |
| postcodes.io | MIT | Postcode centroid lookup |
| Kaggle ukgeo combined dataset | Mixed source licences | Combined fallback parquet for simpler setup |
Running tests
pytest -v
Tests require the OS Open Names parquet to be present; they skip otherwise. The suite covers postcodes, motorway junctions, A-roads, B-roads, named interchanges, road-suffix abbreviations, St/Saint ambiguity, colloquial county context, bad county context, place-name typos, and batch geocoding.
Known gaps are documented with strict xfail regression cases rather than being counted as ordinary failures.
Known limitations
See docs/STATUS.md for current test results and documented gaps.
Project structure
ukgeo/
├── scripts/
│ ├── download_os_open_names.py
│ ├── download_os_open_roads.py
│ ├── download_osm_named_junctions.py
│ ├── download_osm_roads.py
│ ├── build_kaggle_dataset.py
│ ├── build_stats19_eval.py
│ └── calibrate.py
├── ukgeo/
│ ├── pipeline.py
│ ├── level1_regex.py
│ ├── level2_ner.py
│ ├── lookup.py
│ ├── models.py
│ └── uk_admin.py
├── tests/
│ └── test_pipeline.py
└── data/
See also
- Kaggle dataset — pre-built data download
- Open Road Risk — road safety risk modelling pipeline that ukgeo supports
- docs/alternatives.md — honest comparison with other UK geocoding tools
- docs/gaps_and_ecosystem.md — what ukgeo is missing and the broader ecosystem
- docs/STATUS.md — current test results and benchmark numbers
- TODO.md — development roadmap
Licence
MIT License