Skip to main content

A Python package to parse UK postcodes from text. Useful in applications such as OCR and IDP.

Project description

uk-postcodes-parsing

Test Upload Python Package

A Python package to parse UK postcodes from text. Useful in applications such as OCR and IDP.

Install

pip install uk-postcodes-parsing

Capabilities

  • Search and parse UK postcode from text/OCR results
    • Extract parts of the postcode: incode, outcode etc.
    • Fix common mistakes in UK postcode OCR
Postcode .outcode .incode .area .district .subDistrict .sector .unit
AA9A 9AA AA9A 9AA AA AA9 AA9A AA9A 9 AA
A9A 9AA A9A 9AA A A9 A9A A9A 9 AA
A9 9AA A9 9AA A A9 None A9 9 AA
A99 9AA A99 9AA A A99 None A99 9 AA
AA9 9AA AA9 9AA AA AA9 None AA9 9 AA
AA99 9AA AA99 9AA AA AA99 None AA99 9 AA
  • Utilities to validate postcode
  • Updated to November 2024: Validate postcode against ~1.8M UK postcodes from the ONS Postcode Directory

Usage

  • Parsing text to get a list of postcodes.
>>> from uk_postcodes_parsing import ukpostcode
>>> corpus = "this is a check to see if we can get post codes liek thia ec1r 1ub , and that e3 4ss. But also eh16 50y and ei412"
>>> postcodes = ukpostcode.parse_from_corpus(corpus)
INFO:uk-postcodes-parsing:Found 2 postcodes in corpus
>>> postcodes
[Postcode(is_in_ons_postcode_directory=True, fix_distance=0, original='ec1r 1ub', postcode='EC1R 1UB', incode='1UB', outcode='EC1R', area='EC', district='EC1', sub_district='EC1R', sector='EC1R 1', unit='UB'),
 Postcode(is_in_ons_postcode_directory=True, fix_distance=0, original='e3 4ss', postcode='E3 4SS', incode='4SS', outcode='E3', area='E', district='E3', sub_district=None, sector='E3 4', unit='SS')]
  • Optional auto-correct: Attempt correcting common mistakes in postcodes such as reading "O" and "0" and vice-versa.
>>> from uk_postcodes_parsing import ukpostcode
>>> corpus = "this is a check to see if we can get post codes liek thia ec1r 1ub , and that e3 4ss. But also eh16 50y and ei412"
>>> postcodes = ukpostcode.parse_from_corpus(corpus, attempt_fix=True)
INFO:uk-postcodes-parsing:Found 3 postcodes in corpus
INFO:uk-postcodes-parsing:Postcode Fixed: 'eh16 50y' => 'EH16 5OY'

You can also do an undertermisitic postcode auto-correct where if there is more than one possible answer, all answers are returned.

>>> postcodes = ukpostcode.parse_from_corpus("OOO 4SS",
                             attempt_fix=True,
                             try_all_fix_options=True
                             )
>> postcodes # "O00 4SS", "OO0 4SS", and "O0O 4SS"
[Postcode(is_in_ons_postcode_directory=False, fix_distance=-2, original='OOO 4SS', postcode='O00 4SS', incode='4SS', outcode='O00', area='O', district='O00', sub_district=None, sector='O00 4', unit='SS'),
 Postcode(is_in_ons_postcode_directory=False, fix_distance=-1, original='OOO 4SS', postcode='OO0 4SS', incode='4SS', outcode='OO0', area='OO', district='OO0', sub_district=None, sector='OO0 4', unit='SS'),
 Postcode(is_in_ons_postcode_directory=False, fix_distance=-1, original='OOO 4SS', postcode='O0O 4SS', incode='4SS', outcode='O0O', area='O', district='O0', sub_district='O0O', sector='O0O 4', unit='SS')]
  • Parsing
>>> from uk_postcodes_parsing import ukpostcode
>>> ukpostcode.parse("EC1r 1ub")
Postcode(is_in_ons_postcode_directory=True, fix_distance=0, original='EC1r 1ub', postcode='EC1R 1UB', incode='1UB', outcode='EC1R', area='EC', district='EC1', sub_district='EC1R', sector='EC1R 1', unit='UB')
>>> ukpostcode.parse("EH16 50Y")
INFO:uk-postcodes-parsing:Postcode Fixed: 'EH16 50Y' => 'EH16 5OY'
Postcode(is_in_ons_postcode_directory=False, fix_distance=-1, original='EH16 50Y', postcode='EH16 5OY', incode='5OY', outcode='EH16', area='EH', district='EH16', sub_district=None, sector='EH16 5', unit='OY')
>>> ukpostcode.parse("EH16 50Y", attempt_fix=False) # Don't attempt fixes during parsing
ERROR:uk-postcodes-parsing:Failed to parse postcode
>>> ukpostcode.parse("0W1")
ERROR:uk-postcodes-parsing:Unable to fix postcode
ERROR:uk-postcodes-parsing:Failed to parse postcode
  • Validity check
>>> from uk_postcodes_parsing import postcode_utils
>>> postcode_utils.is_valid("0W1 0AA")
False
>>> postcode_utils.is_valid("OW1 0AA")
True
  • Fixing
>>> from uk_postcodes_parsing.fix import fix
>>> fix("0W1 OAA")
'OW1 0AA'
  • Validate against ONS Postcode directory (1.7M+ UK postcode upto Nov 2022)
>>> ukpostcode.is_in_ons_postcode_directory("EC1R 1UB")
True
>>> ukpostcode.is_in_ons_postcode_directory("ec1r 1ub") # Expects normalised format (caps + space)
False

Postcode class definition

@dataclass(order=True)
class Postcode:
    # Calculate post initialization
    is_in_ons_postcode_directory: bool = field(init=False)
    fix_distance: int = field(init=False)
    # raw text
    original: str
    # The rest of the fields are parsed from the postcode using regex
    postcode: str
    incode: str
    outcode: str
    area: str
    district: str
    sub_district: Union[str, None]
    sector: str
    unit: str
  • 2 fileds calculated after init of class

    • is_in_ons_postcode_directory: Checked against the ONS Postcode Directory
    • fix_distance: A measure of number of characters changed from raw text. Each character fix adds a -1 (negative one) to this field.
      • E.g. SW1A OAA => SW1A 0AA has fix_distance=-1. Where as, SWIA OAA => SW1A 0AA has fix_distance=-2.
    • These fields are particularly helpful when using parse_from_corpus with attempt_fix=True which might return false positives. They can be used as proxy for confidence on which parsed postcodes are correct.
      • E.g. If you parse "send the parcel back to one of the following postcodes: EC1R 1UB or EH16 5AY. with attempt_fix:

        >>> corpus = "send the parcel back to one of the following postcodes: ECIR 1UB or EH16 5AY"
        >>> postcodes = ukpostcode.parse_from_corpus(corpus, attempt_fix=True)
        INFO:uk-postcodes-parsing:Found 4 postcodes in corpus
        INFO:uk-postcodes-parsing:Postcode Fixed: 'to one' => 'T0 0NE'
        INFO:uk-postcodes-parsing:Postcode Fixed: 'llowing' => 'LL0W 1NG'
        INFO:uk-postcodes-parsing:Postcode Fixed: 'ecir 1ub' => 'EC1R 1UB'
        >>> postcodes # you get false positives
        [Postcode(is_in_ons_postcode_directory=False, fix_distance=-2, original='to one', postcode='T0 0NE', incode='0NE', outcode='T0', area='T', district='T0', sub_district=None, sector='T0 0', unit='NE'),
        Postcode(is_in_ons_postcode_directory=False, fix_distance=-2, original='llowing', postcode='LL0W 1NG', incode='1NG', outcode='LL0W', area='LL', district='LL0', sub_district='LL0W', sector='LL0W 1', unit='NG'),
        Postcode(is_in_ons_postcode_directory=True, fix_distance=-1, original='ecir 1ub', postcode='EC1R 1UB', incode='1UB', outcode='EC1R', area='EC', district='EC1', sub_district='EC1R', sector='EC1R 1', unit='UB'),
        Postcode(is_in_ons_postcode_directory=True, fix_distance=0, original='eh16 5ay', postcode='EH16 5AY', incode='5AY', outcode='EH16', area='EH', district='EH16', sub_district=None, sector='EH16 5', unit='AY')]
        

        You can sort a list of postcodes and chose the first n as needed:

        >>> sorted(postcodes, reverse=True)
        [Postcode(is_in_ons_postcode_directory=True, fix_distance=0, original='eh16 5ay', postcode='EH16 5AY', incode='5AY', outcode='EH16', area='EH', district='EH16', sub_district=None, sector='EH16 5', unit='AY'),
        Postcode(is_in_ons_postcode_directory=True, fix_distance=-1, original='ecir 1ub', postcode='EC1R 1UB', incode='1UB', outcode='EC1R', area='EC', district='EC1', sub_district='EC1R', sector='EC1R 1', unit='UB'),
        Postcode(is_in_ons_postcode_directory=False, fix_distance=-2, original='to one', postcode='T0 0NE', incode='0NE', outcode='T0', area='T', district='T0', sub_district=None, sector='T0 0', unit='NE'),
        Postcode(is_in_ons_postcode_directory=False, fix_distance=-2, original='llowing', postcode='LL0W 1NG', incode='1NG', outcode='LL0W', area='LL', district='LL0', sub_district='LL0W', sector='LL0W 1', unit='NG')]
        

        Or:

        >>> list(filter(lambda postcode: postcode.is_in_ons_postcode_directory, postcodes))
        [Postcode(is_in_ons_postcode_directory=True, fix_distance=-1, original='ecir 1ub', postcode='EC1R 1UB', incode='1UB', outcode='EC1R', area='EC', district='EC1', sub_district='EC1R', sector='EC1R 1', unit='UB'),
        Postcode(is_in_ons_postcode_directory=True, fix_distance=0, original='eh16 5ay', postcode='EH16 5AY', incode='5AY', outcode='EH16', area='EH', district='EH16', sub_district=None, sector='EH16 5', unit='AY')]
        
  • raw_text: To keep track of the original string without formatting changes and auto-fixes.

  • 8 fileds are parsed using regex

Testing

pytest tests/

Updating this library with newer version of ONS postcode directory

This library has been updated with May 2023 ONS postcode directory. To update this to a newer, version, see: process_onspd.ipynb.

Similar work

This package started as a Python replica of the postcode.io JavaScript library: https://github.com/ideal-postcodes/postcode

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

uk_postcodes_parsing-1.2.1.tar.gz (8.2 MB view details)

Uploaded Source

Built Distribution

uk_postcodes_parsing-1.2.1-py3-none-any.whl (8.3 MB view details)

Uploaded Python 3

File details

Details for the file uk_postcodes_parsing-1.2.1.tar.gz.

File metadata

  • Download URL: uk_postcodes_parsing-1.2.1.tar.gz
  • Upload date:
  • Size: 8.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.9.21

File hashes

Hashes for uk_postcodes_parsing-1.2.1.tar.gz
Algorithm Hash digest
SHA256 28011685b3dabf97de612eec71b77b46ac05449fa028a3bb8e916bbb57dc29fb
MD5 800dc5f2882253b4751d1adfc8055fdf
BLAKE2b-256 d1abb05e3bf991615ded06b8c748235880c8d1cd935ea0de7f546d82ea96d3c1

See more details on using hashes here.

File details

Details for the file uk_postcodes_parsing-1.2.1-py3-none-any.whl.

File metadata

File hashes

Hashes for uk_postcodes_parsing-1.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 1720950ed0b7a1c8b344e1808fafaabb5deae039270637fbc4eb743415e5970d
MD5 54994544ec8787a1b86ef2a3989137bc
BLAKE2b-256 1024309ae3b0875746d0c2ddc68ac01f45a2c4ea5495becf03d4ed6cec52c2bf

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page