UK location free-text geocoder — resolves road references, place names, junctions, and infrastructure to lat/lon

ukgeo

Kaggle Dataset | License: MIT | Data: ODbL

Quick start

Option A — pip install (recommended)

pip install ukgeo
ukgeo setup        # downloads ~41MB data file from Kaggle
ukgeo geocode "M62 Junction 26"

Option B — from source

git clone https://github.com/ThomasHSimm/ukgeo.git
cd ukgeo
pip install -e ".[dev]"
python scripts/download_os_open_names.py   # ~5 min, builds local parquets
python scripts/download_os_open_roads.py
python scripts/download_osm_named_junctions.py
python scripts/download_osm_roads.py

Python API

from ukgeo import Geocoder

geo = Geocoder()
print(geo.geocode("M62 Junction 26"))
print(geo.geocode("Spaghetti Junction Birmingham"))
print(geo.geocode("Skipton, North Yorkshire"))

CLI usage

After installation, ukgeo is available as a command:

# Single query
ukgeo geocode "M62 Junction 26"
ukgeo geocode "Skipton, North Yorkshire"

# Geocode a CSV file (auto-detects first string column)
ukgeo geocode locations.csv --output results.csv

# Specify column and domain
ukgeo geocode crashes.csv --column road_reference --domain road_safety

# Enable Level 3 OS Names API fallback (requires OS_API_KEY in .env)
ukgeo geocode locations.csv --max-level 3

# Generate an interactive HTML map from geocoded results
ukgeo geocode locations.csv --output results.csv
ukgeo plot results.csv --output results_map.html

# Check installation status
ukgeo info

Output columns added to CSV: lat, lon, confidence, level_resolved, interpreted_as, match_type, candidates_considered, notes.
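The added columns make it easy to post-filter a results file. The following is a minimal sketch using only the standard library. The sample rows are made up for illustration, only a subset of the output columns is shown, and "High" is the only confidence label confirmed elsewhere in this README; how unresolved rows are written (empty cells here) is an assumption.

```python
import csv
import io

# Illustrative stand-in for results.csv. Values are made up, and the empty
# cells for the unresolved row are an assumption about the file layout.
SAMPLE = """input,lat,lon,confidence,level_resolved
"Skipton, North Yorkshire",53.9602,-2.0177,High,2
Somewhere vague,,,,
"""


def high_confidence_rows(handle):
    """Return the rows whose confidence column is exactly 'High'."""
    return [row for row in csv.DictReader(handle) if row["confidence"] == "High"]


rows = high_confidence_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row["input"], float(row["lat"]), float(row["lon"]))
```

To run the same filter on a real file, pass `open("results.csv", newline="", encoding="utf-8")` instead of the `StringIO` object.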

A tiered UK location free-text geocoder. Converts messy location strings — addresses, road references, place names, colloquial names — to latitude/longitude coordinates using a pipeline that escalates from fast regex matching to OS Open Names lookup, with optional API and local LLM fallbacks.

Designed for bulk processing with a parquet-backed setup step: load reference data once, then geocode hundreds, thousands, or millions of UK location strings in-process.

Features

  • Tiered pipeline — fast paths handle the easy cases; slower paths only fire when needed
  • UK-specific — built on OS Open Names, OS Open Roads, OSM road references, and postcodes.io, tuned for British address conventions
  • Road-aware matching — handles M/A/B road references, motorway junctions, named junctions, and common road suffix abbreviations
  • Bulk-first design — loads OS data once, processes thousands of entries in-process
  • Packaged data fallback — uses richer local source parquets when present, or the combined Kaggle parquet for simpler setup
  • Confidence scoring — every result includes a normalised match score and confidence level
  • Tuneable weights — scoring parameters are configurable and can be calibrated against labelled test data
  • Extensible — Level 3 API and Level 4 local Ollama LLM stubs are ready for future fallback work
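
As an illustration of the road-suffix handling mentioned in the list above, here is a rough sketch of abbreviation expansion, including the St/Saint ambiguity. The abbreviation table, the function names, and the position-based heuristic are all assumptions made for this example; ukgeo's actual rules are not shown in this README.

```python
import re

# Hypothetical abbreviation table, not ukgeo's real list.
SUFFIXES = {"rd": "road", "ave": "avenue", "ln": "lane"}


def expand_part(part: str) -> str:
    """Expand abbreviations within one comma-separated part of an address."""
    tokens = re.findall(r"[a-z0-9]+", part.lower())
    out = []
    for i, tok in enumerate(tokens):
        is_last = i == len(tokens) - 1
        if tok == "st":
            # Heuristic: a trailing "St" reads as Street ("High St"),
            # anywhere else it reads as Saint ("St Albans").
            out.append("street" if is_last and i > 0 else "saint")
        else:
            out.append(SUFFIXES.get(tok, tok))
    return " ".join(out)


def normalise(text: str) -> str:
    """Lower-case the text and expand abbreviations part by part."""
    parts = (expand_part(p) for p in text.split(","))
    return ", ".join(p for p in parts if p)


print(normalise("High St, Leeds"))
print(normalise("St John St"))
```

The heuristic relies on commas to find where a street name ends, so an input such as "High St Leeds" would be misread. That kind of case is why a scoring stage after normalisation is useful.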

Pipeline levels

| Level | Method | Handles |
|-------|--------|---------|
| 1 | Regex + postcodes.io/local postcode fallback | Full UK postcodes, M/A/B road pattern extraction |
| 2 | OS/OpenStreetMap token scoring | Places, roads, junctions, named roundabouts/interchanges |
| 3 | API fallback (stub) | Ambiguous cases needing external lookup |
| 4 | Local Ollama LLM (stub) | Last resort: typos, novel references |
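
The escalation idea can be sketched as a chain of resolvers, each returning either a result or `None` so the next level gets a turn. This is not ukgeo's implementation: the regular expressions are simplified, the function names are hypothetical, and `level2` is a placeholder that stands in for the gazetteer lookup.

```python
import re
from typing import Optional

# Simplified patterns for illustration; ukgeo's real expressions may differ.
POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b", re.I)
ROAD = re.compile(r"\b([MAB])\s?(\d{1,4})\b", re.I)
JUNCTION = re.compile(r"\b(?:junction|jct|j)\s*(\d+[A-Z]?)\b", re.I)


def level1(text: str) -> Optional[dict]:
    """Fast path: extract a postcode or a road/junction reference."""
    if m := POSTCODE.search(text):
        return {"level": 1, "kind": "postcode", "value": m.group(0).upper()}
    if m := ROAD.search(text):
        found = {"level": 1, "kind": "road", "value": (m.group(1) + m.group(2)).upper()}
        if j := JUNCTION.search(text):
            found["junction"] = j.group(1).upper()
        return found
    return None  # nothing recognised, so escalate


def level2(text: str) -> Optional[dict]:
    """Placeholder for a gazetteer lookup; always 'resolves' in this sketch."""
    return {"level": 2, "kind": "place", "value": text.strip().lower()}


def resolve(text: str, levels=(level1, level2)) -> Optional[dict]:
    """Try each level in order and return the first non-None result."""
    for step in levels:
        result = step(text)
        if result is not None:
            return result
    return None


for query in ("LS1 1BA", "M62 Junction 26", "Skipton, North Yorkshire"):
    print(query, "->", resolve(query))
```

Keeping every level behind the same "result or `None`" contract means slower fallbacks can be appended to `levels` without touching the earlier ones.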

Usage

Single geocode

from ukgeo import Geocoder

geo = Geocoder()
result = geo.geocode("Skipton, North Yorkshire")

print(result.lat, result.lon)        # 53.9602, -2.0177
print(result.confidence)             # High
print(result.interpreted_as)         # Skipton (Town)
print(result.level_resolved)         # 2
print(result.notes)                  # match_score=...

Bulk geocode

from ukgeo import Geocoder

geo = Geocoder()
locations = ["LS1 1BA", "M62 Junction 26", "Spaghetti Junction Birmingham"]
df = geo.geocode_batch(locations)
df.write_csv("results.csv")

Output columns: input, lat, lon, interpreted_as, match_type, level_resolved, confidence, candidates_considered, notes.

Benchmarking

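from ukgeo import Geocoder

geo = Geocoder()

# Each entry pairs an input string with its expected coordinates.
test_data = [
    {"input": "Skipton, North Yorkshire", "lat": 53.9619, "lon": -2.0175},
    {"input": "LS1 1BA", "lat": 53.7997, "lon": -1.5492},
]
geo.benchmark(test_data)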
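
What `geo.benchmark` reports is not described here. A common way to judge a geocoding result against an expected coordinate is the great-circle distance between the two points, sketched below with the haversine formula. The helper name is hypothetical, and the two points are the Skipton coordinates quoted in the Usage and Benchmarking examples.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two WGS84 points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Geocoded Skipton result versus the expected benchmark coordinate.
error_m = haversine_m(53.9602, -2.0177, 53.9619, -2.0175)
print(f"{error_m:.0f} m")
```

For these two points the error is a little under 200 m, which is well within the bounds of a single town.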

Custom weights

from ukgeo import Geocoder, ScoringWeights

weights = ScoringWeights(
    county_context_match=7.0,
    junction_match=10.0,
    high_threshold=0.30,
)
geo = Geocoder(weights=weights)
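
To give a feel for how weights and a threshold might interact, here is a hypothetical sketch and not ukgeo's actual scoring formula. It sums the weights of the features that matched, divides by the total available weight, and compares the normalised score with `high_threshold`. The `medium_threshold` field and the "Medium" and "Low" labels are assumptions; only "High" appears in this README.

```python
from dataclasses import dataclass


@dataclass
class SketchWeights:
    # The first three names mirror the ScoringWeights arguments shown above.
    county_context_match: float = 7.0
    junction_match: float = 10.0
    high_threshold: float = 0.30
    medium_threshold: float = 0.15  # assumed, not part of the README


def normalised_score(matched: set, w: SketchWeights) -> float:
    """Sum of matched feature weights divided by the total available weight."""
    available = {
        "county_context_match": w.county_context_match,
        "junction_match": w.junction_match,
    }
    total = sum(available.values())
    return sum(v for k, v in available.items() if k in matched) / total


def confidence(score: float, w: SketchWeights) -> str:
    """Map a normalised score to a label using the thresholds."""
    if score >= w.high_threshold:
        return "High"
    if score >= w.medium_threshold:
        return "Medium"
    return "Low"


w = SketchWeights()
score = normalised_score({"junction_match"}, w)
print(round(score, 3), confidence(score, w))
```

Under this scheme, raising `high_threshold` makes "High" harder to earn, while raising an individual weight makes that feature count for more of the total.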

Calibrate weights

Provide a CSV with columns input, lat, lon:

python scripts/calibrate.py --test data/my_test_locations.csv --trials 300

Best-fit weights are saved to config/weights.yaml and loaded automatically on next Geocoder() init.
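
A labelled test file in the expected `input, lat, lon` layout can be produced with the standard library. The sketch below reuses the two rows from the Benchmarking example. The helper name is hypothetical, and the demo writes to a temporary directory rather than `data/`.

```python
import csv
import tempfile
from pathlib import Path

ROWS = [
    {"input": "Skipton, North Yorkshire", "lat": 53.9619, "lon": -2.0175},
    {"input": "LS1 1BA", "lat": 53.7997, "lon": -1.5492},
]


def write_test_csv(path, rows) -> Path:
    """Write rows to a CSV with the header input,lat,lon and return its path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["input", "lat", "lon"])
        writer.writeheader()
        writer.writerows(rows)
    return path


# Demo: write into a throwaway directory and show the file contents.
with tempfile.TemporaryDirectory() as tmp:
    out = write_test_csv(Path(tmp) / "data" / "my_test_locations.csv", ROWS)
    written = out.read_text(encoding="utf-8")
print(written)
```

Inputs that contain commas, such as "Skipton, North Yorkshire", are quoted automatically by the `csv` module, so they stay in a single column.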

Data setup

ukgeo prefers individual source parquets because they preserve richer metadata. If those are absent, it falls back to the combined Kaggle parquet at data/kaggle/ukgeo_data.parquet.

To regenerate the combined Kaggle release file from local source parquets:

python scripts/build_kaggle_dataset.py

For BNG to WGS84 coordinate conversion, pyproj is recommended and included in the project dependencies.

Data sources

| Source | Licence | Used for |
|--------|---------|----------|
| OS Open Names | Open Government Licence | Place names, roads, postcodes |
| OS Open Roads | Open Government Licence | Motorway junction points |
| OpenStreetMap | ODbL | Named junctions, roundabouts, and B-road segments |
| postcodes.io | MIT | Postcode centroid lookup |
| Kaggle ukgeo combined dataset | Mixed source licences | Combined fallback parquet for simpler setup |

Running tests

pytest -v

Tests require the OS Open Names parquet to be present; they skip otherwise. The suite covers postcodes, motorway junctions, A-roads, B-roads, named interchanges, road-suffix abbreviations, St/Saint ambiguity, colloquial county context, bad county context, place-name typos, and batch geocoding.

Known gaps are documented with strict xfail regression cases rather than being counted as ordinary failures.

Known limitations

See docs/STATUS.md for current test results and documented gaps.

Project structure

ukgeo/
├── scripts/
│   ├── download_os_open_names.py
│   ├── download_os_open_roads.py
│   ├── download_osm_named_junctions.py
│   ├── download_osm_roads.py
│   ├── build_kaggle_dataset.py
│   ├── build_stats19_eval.py
│   └── calibrate.py
├── ukgeo/
│   ├── pipeline.py
│   ├── level1_regex.py
│   ├── level2_ner.py
│   ├── lookup.py
│   ├── models.py
│   └── uk_admin.py
├── tests/
│   └── test_pipeline.py
└── data/

Licence

MIT License
