uk-postcodes-parsing
A Python package to parse UK postcodes from text. Useful in applications such as OCR and IDP.
Install
pip install uk-postcodes-parsing
Capabilities
- Search and parse UK postcodes from text/OCR results
- Extract parts of the postcode: incode, outcode etc.
- Fix common mistakes in UK postcode OCR
Postcode | .outcode | .incode | .area | .district | .sub_district | .sector | .unit |
---|---|---|---|---|---|---|---|
AA9A 9AA | AA9A | 9AA | AA | AA9 | AA9A | AA9A 9 | AA |
A9A 9AA | A9A | 9AA | A | A9 | A9A | A9A 9 | AA |
A9 9AA | A9 | 9AA | A | A9 | None | A9 9 | AA |
A99 9AA | A99 | 9AA | A | A99 | None | A99 9 | AA |
AA9 9AA | AA9 | 9AA | AA | AA9 | None | AA9 9 | AA |
AA99 9AA | AA99 | 9AA | AA | AA99 | None | AA99 9 | AA |
- Utilities to validate postcodes
- Updated to November 2024: validate postcodes against ~1.8M UK postcodes from the ONS Postcode Directory
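The component breakdown in the table above can be illustrated with a small, self-contained sketch. This is not the package's implementation: `split_postcode` and its regular expression are hypothetical, written only to reproduce the table for input that is already well formed.

```python
import re

# Hypothetical pattern (an assumption, not taken from uk-postcodes-parsing):
# area letters, district digits (optionally followed by a sub-district letter),
# then the incode: one sector digit followed by two unit letters.
_PATTERN = re.compile(r"^([A-Z]{1,2})(\d{1,2}|\d[A-Z])\s*(\d)([A-Z]{2})$")


def split_postcode(text):
    """Return the components of a well-formed postcode as a dict, or None."""
    match = _PATTERN.match(text.strip().upper())
    if match is None:
        return None
    area, district_part, sector_digit, unit = match.groups()
    outcode = area + district_part
    incode = sector_digit + unit
    # A trailing letter in the outcode marks a sub-district (e.g. EC1R).
    if outcode[-1].isalpha():
        district, sub_district = outcode[:-1], outcode
    else:
        district, sub_district = outcode, None
    return {
        "postcode": f"{outcode} {incode}",
        "outcode": outcode,
        "incode": incode,
        "area": area,
        "district": district,
        "sub_district": sub_district,
        "sector": f"{outcode} {sector_digit}",
        "unit": unit,
    }


print(split_postcode("ec1r 1ub"))
```

The sketch only handles clean input; it makes no attempt at OCR fixes or directory lookups, which the package covers separately.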
Usage
- Parsing text to get a list of postcodes.
>>> from uk_postcodes_parsing import ukpostcode
>>> corpus = "this is a check to see if we can get post codes liek thia ec1r 1ub , and that e3 4ss. But also eh16 50y and ei412"
>>> postcodes = ukpostcode.parse_from_corpus(corpus)
INFO:uk-postcodes-parsing:Found 2 postcodes in corpus
>>> postcodes
[Postcode(is_in_ons_postcode_directory=True, fix_distance=0, original='ec1r 1ub', postcode='EC1R 1UB', incode='1UB', outcode='EC1R', area='EC', district='EC1', sub_district='EC1R', sector='EC1R 1', unit='UB'),
Postcode(is_in_ons_postcode_directory=True, fix_distance=0, original='e3 4ss', postcode='E3 4SS', incode='4SS', outcode='E3', area='E', district='E3', sub_district=None, sector='E3 4', unit='SS')]
- Optional auto-correct: Attempt to correct common mistakes in postcodes, such as reading "O" as "0" and vice versa.
>>> from uk_postcodes_parsing import ukpostcode
>>> corpus = "this is a check to see if we can get post codes liek thia ec1r 1ub , and that e3 4ss. But also eh16 50y and ei412"
>>> postcodes = ukpostcode.parse_from_corpus(corpus, attempt_fix=True)
INFO:uk-postcodes-parsing:Found 3 postcodes in corpus
INFO:uk-postcodes-parsing:Postcode Fixed: 'eh16 50y' => 'EH16 5OY'
You can also run a non-deterministic auto-correct: if there is more than one possible fix, all candidates are returned.
>>> postcodes = ukpostcode.parse_from_corpus("OOO 4SS",
...     attempt_fix=True,
...     try_all_fix_options=True
... )
>>> postcodes # "O00 4SS", "OO0 4SS", and "O0O 4SS"
[Postcode(is_in_ons_postcode_directory=False, fix_distance=-2, original='OOO 4SS', postcode='O00 4SS', incode='4SS', outcode='O00', area='O', district='O00', sub_district=None, sector='O00 4', unit='SS'),
Postcode(is_in_ons_postcode_directory=False, fix_distance=-1, original='OOO 4SS', postcode='OO0 4SS', incode='4SS', outcode='OO0', area='OO', district='OO0', sub_district=None, sector='OO0 4', unit='SS'),
Postcode(is_in_ons_postcode_directory=False, fix_distance=-1, original='OOO 4SS', postcode='O0O 4SS', incode='4SS', outcode='O0O', area='O', district='O0', sub_district='O0O', sector='O0O 4', unit='SS')]
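To make the idea of returning every possible fix more concrete, here is a minimal sketch of enumerating candidates by swapping look-alike characters. It is an illustration under stated assumptions, not the package's algorithm: `all_fix_options`, the `CONFUSABLE` pairs, and the `FORMAT` pattern are all hypothetical.

```python
import re
from itertools import product

# Assumed confusion pairs for illustration; the package may use a different set.
CONFUSABLE = {"O": "0", "0": "O", "I": "1", "1": "I"}

# Hypothetical format check: outcode, one space, then a digit and two letters.
FORMAT = re.compile(r"^[A-Z]{1,2}(\d{1,2}|\d[A-Z]) \d[A-Z]{2}$")


def all_fix_options(text):
    """Return {candidate: fix_distance} for every well-formed variant of text."""
    text = text.upper()
    # Each character contributes itself plus its look-alike, if it has one.
    choices = [(ch, CONFUSABLE[ch]) if ch in CONFUSABLE else (ch,) for ch in text]
    results = {}
    for combo in product(*choices):
        candidate = "".join(combo)
        if FORMAT.match(candidate):
            # One negative point per changed character.
            results[candidate] = -sum(a != b for a, b in zip(text, candidate))
    return results


print(all_fix_options("OOO 4SS"))
```

For "OOO 4SS" this yields the same three candidates and distances as the output above. The number of combinations doubles with every confusable character, so a sketch like this only suits short strings.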
- Parsing
>>> from uk_postcodes_parsing import ukpostcode
>>> ukpostcode.parse("EC1r 1ub")
Postcode(is_in_ons_postcode_directory=True, fix_distance=0, original='EC1r 1ub', postcode='EC1R 1UB', incode='1UB', outcode='EC1R', area='EC', district='EC1', sub_district='EC1R', sector='EC1R 1', unit='UB')
>>> ukpostcode.parse("EH16 50Y")
INFO:uk-postcodes-parsing:Postcode Fixed: 'EH16 50Y' => 'EH16 5OY'
Postcode(is_in_ons_postcode_directory=False, fix_distance=-1, original='EH16 50Y', postcode='EH16 5OY', incode='5OY', outcode='EH16', area='EH', district='EH16', sub_district=None, sector='EH16 5', unit='OY')
>>> ukpostcode.parse("EH16 50Y", attempt_fix=False) # Don't attempt fixes during parsing
ERROR:uk-postcodes-parsing:Failed to parse postcode
>>> ukpostcode.parse("0W1")
ERROR:uk-postcodes-parsing:Unable to fix postcode
ERROR:uk-postcodes-parsing:Failed to parse postcode
- Validity check
>>> from uk_postcodes_parsing import postcode_utils
>>> postcode_utils.is_valid("0W1 0AA")
False
>>> postcode_utils.is_valid("OW1 0AA")
True
- Fixing
>>> from uk_postcodes_parsing.fix import fix
>>> fix("0W1 OAA")
'OW1 0AA'
- Validate against the ONS Postcode Directory (1.7M+ UK postcodes up to Nov 2022)
>>> ukpostcode.is_in_ons_postcode_directory("EC1R 1UB")
True
>>> ukpostcode.is_in_ons_postcode_directory("ec1r 1ub") # Expects normalised format (caps + space)
False
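Because the directory check expects the normalised format (caps + space), it can help to normalise raw input first. The helper below is a hypothetical sketch, not part of the package; it assumes the incode is always the last three characters and does no validation.

```python
def normalise(text):
    """Upper-case the text and put a single space before the 3-character incode."""
    # Drop all whitespace so "ec1r1ub", "ec1r 1ub" and " EC1R  1UB " look alike.
    compact = "".join(text.split()).upper()
    if len(compact) < 5:
        # Too short to hold an outcode plus a 3-character incode; leave as-is.
        return compact
    return f"{compact[:-3]} {compact[-3:]}"


print(normalise("ec1r 1ub"))
print(normalise("e34ss"))
```

The result can then be passed to ukpostcode.is_in_ons_postcode_directory, which remains responsible for deciding whether the postcode actually exists.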
Postcode class definition
from dataclasses import dataclass, field
from typing import Union

@dataclass(order=True)
class Postcode:
    # Calculated post initialization
    is_in_ons_postcode_directory: bool = field(init=False)
    fix_distance: int = field(init=False)
    # Raw text
    original: str
    # The rest of the fields are parsed from the postcode using regex
    postcode: str
    incode: str
    outcode: str
    area: str
    district: str
    sub_district: Union[str, None]
    sector: str
    unit: str
- 2 fields are calculated after the class is initialised:
  - is_in_ons_postcode_directory: Checked against the ONS Postcode Directory.
  - fix_distance: A measure of the number of characters changed from the raw text. Each character fix adds -1 (negative one) to this field.
    - E.g. SW1A OAA => SW1A 0AA has fix_distance=-1, whereas SWIA OAA => SW1A 0AA has fix_distance=-2.
  - These fields are particularly helpful when using parse_from_corpus with attempt_fix=True, which might return false positives. They can be used as a proxy for confidence in which parsed postcodes are correct.
    - E.g. if you parse "send the parcel back to one of the following postcodes: ECIR 1UB or EH16 5AY" with attempt_fix:
>>> corpus = "send the parcel back to one of the following postcodes: ECIR 1UB or EH16 5AY"
>>> postcodes = ukpostcode.parse_from_corpus(corpus, attempt_fix=True)
INFO:uk-postcodes-parsing:Found 4 postcodes in corpus
INFO:uk-postcodes-parsing:Postcode Fixed: 'to one' => 'T0 0NE'
INFO:uk-postcodes-parsing:Postcode Fixed: 'llowing' => 'LL0W 1NG'
INFO:uk-postcodes-parsing:Postcode Fixed: 'ecir 1ub' => 'EC1R 1UB'
>>> postcodes # you get false positives
[Postcode(is_in_ons_postcode_directory=False, fix_distance=-2, original='to one', postcode='T0 0NE', incode='0NE', outcode='T0', area='T', district='T0', sub_district=None, sector='T0 0', unit='NE'),
Postcode(is_in_ons_postcode_directory=False, fix_distance=-2, original='llowing', postcode='LL0W 1NG', incode='1NG', outcode='LL0W', area='LL', district='LL0', sub_district='LL0W', sector='LL0W 1', unit='NG'),
Postcode(is_in_ons_postcode_directory=True, fix_distance=-1, original='ecir 1ub', postcode='EC1R 1UB', incode='1UB', outcode='EC1R', area='EC', district='EC1', sub_district='EC1R', sector='EC1R 1', unit='UB'),
Postcode(is_in_ons_postcode_directory=True, fix_distance=0, original='eh16 5ay', postcode='EH16 5AY', incode='5AY', outcode='EH16', area='EH', district='EH16', sub_district=None, sector='EH16 5', unit='AY')]
You can sort a list of postcodes and choose the first n as needed:
>>> sorted(postcodes, reverse=True)
[Postcode(is_in_ons_postcode_directory=True, fix_distance=0, original='eh16 5ay', postcode='EH16 5AY', incode='5AY', outcode='EH16', area='EH', district='EH16', sub_district=None, sector='EH16 5', unit='AY'),
Postcode(is_in_ons_postcode_directory=True, fix_distance=-1, original='ecir 1ub', postcode='EC1R 1UB', incode='1UB', outcode='EC1R', area='EC', district='EC1', sub_district='EC1R', sector='EC1R 1', unit='UB'),
Postcode(is_in_ons_postcode_directory=False, fix_distance=-2, original='to one', postcode='T0 0NE', incode='0NE', outcode='T0', area='T', district='T0', sub_district=None, sector='T0 0', unit='NE'),
Postcode(is_in_ons_postcode_directory=False, fix_distance=-2, original='llowing', postcode='LL0W 1NG', incode='1NG', outcode='LL0W', area='LL', district='LL0', sub_district='LL0W', sector='LL0W 1', unit='NG')]
Or:
>>> list(filter(lambda postcode: postcode.is_in_ons_postcode_directory, postcodes))
[Postcode(is_in_ons_postcode_directory=True, fix_distance=-1, original='ecir 1ub', postcode='EC1R 1UB', incode='1UB', outcode='EC1R', area='EC', district='EC1', sub_district='EC1R', sector='EC1R 1', unit='UB'),
Postcode(is_in_ons_postcode_directory=True, fix_distance=0, original='eh16 5ay', postcode='EH16 5AY', incode='5AY', outcode='EH16', area='EH', district='EH16', sub_district=None, sector='EH16 5', unit='AY')]
- original (the raw text): Keeps track of the original string without formatting changes and auto-fixes.
- 8 fields are parsed using regex: postcode, incode, outcode, area, district, sub_district, sector, and unit.
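The ranking idea described above (directory membership first, then fewest fixes) can be sketched without the package installed. `Candidate` is a reduced, hypothetical stand-in for Postcode whose first two fields, which dominate the ordering, are in the same order; `most_trusted` and its `min_fix_distance` threshold are likewise assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(order=True)
class Candidate:
    # Reduced stand-in for Postcode (an assumption for this sketch).
    # order=True compares fields in declaration order, like a tuple.
    is_in_ons_postcode_directory: bool
    fix_distance: int
    postcode: str


def most_trusted(candidates, min_fix_distance=-1):
    """Keep directory-confirmed candidates with few fixes, best first."""
    kept = [
        c for c in candidates
        if c.is_in_ons_postcode_directory and c.fix_distance >= min_fix_distance
    ]
    return sorted(kept, reverse=True)


candidates = [
    Candidate(False, -2, "T0 0NE"),
    Candidate(False, -2, "LL0W 1NG"),
    Candidate(True, -1, "EC1R 1UB"),
    Candidate(True, 0, "EH16 5AY"),
]
print([c.postcode for c in most_trusted(candidates)])
# prints ['EH16 5AY', 'EC1R 1UB']
```

Tightening `min_fix_distance` to 0 would keep only postcodes that needed no correction at all, which trades recall for fewer false positives.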
Testing
pytest tests/
Updating this library with newer version of ONS postcode directory
This library has been updated with the May 2023 ONS Postcode Directory. To update it to a newer version, see process_onspd.ipynb.
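What process_onspd.ipynb actually does is not shown here. As a rough sketch only, building a lookup set from a directory export might look like the following; `load_postcodes` is hypothetical, and the code assumes a CSV whose `pcds` column holds postcodes in the spaced format.

```python
import csv
import io


def load_postcodes(csv_file, column="pcds"):
    """Collect one column of an ONSPD-style CSV into a set of postcodes."""
    reader = csv.DictReader(csv_file)
    # Skip rows where the column is missing or empty.
    return {row[column].strip().upper() for row in reader if row.get(column)}


# Tiny in-memory stand-in for the real multi-million-row file.
sample = io.StringIO("pcds\nEC1R 1UB\nE3 4SS\n")
directory = load_postcodes(sample)
print("EC1R 1UB" in directory)
```

A set gives constant-time membership checks, which matters when every parsed candidate is looked up against well over a million entries.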
Similar work
This package started as a Python replica of the postcode.io JavaScript library: https://github.com/ideal-postcodes/postcode