
A package for working with the OCOD (Overseas companies that own property in England and Wales) dataset


Enhanced OCOD: Offshore Companies Ownership Data Processing Pipeline

Overview

This repository provides a pipeline and Python library for cleaning, enhancing, and analyzing HM Land Registry's Overseas Companies Ownership Data (OCOD). The enhanced OCOD dataset resolves many issues with the raw OCOD data, making it suitable for research, analysis, and reporting on UK property owned by offshore companies.

The project includes:

  • A reusable, modular Python library (src/enhance_ocod) for all data processing stages
  • Example and utility scripts (scripts/) for training NER models, running the pipeline, and more
  • Documentation and reproducible workflows to create, update, and analyze the enhanced OCOD dataset

Key Features

  • End-to-End Pipeline: From raw OCOD data to a classified, enriched, and structured dataset
  • Advanced Address Parsing: Disaggregates multi-property titles and parses free-text addresses
  • Integration with External Data: Uses ONS Postcode Directory, Land Registry Price Paid Data, and VOA business ratings for enrichment
  • Property Classification: Assigns properties to categories (Residential, Business, Airspace, Land, Carpark, Unknown)
  • NER Model Training & Weak Labelling: Fine-tuned ModernBERT model automatically downloaded from Hugging Face
  • Reproducible & Extensible: Library-based design for maintainability and reuse

Project Structure

enhance_ocod/
├── src/enhance_ocod/   # Core Python library
│   ├── address_parsing.py
│   ├── analysis.py
│   ├── inference.py
│   ├── labelling/
│   │   ├── ner_regex.py
│   │   ├── ner_spans.py
│   │   └── weak_labelling.py
│   ├── locate_and_classify.py
│   ├── preprocess.py
│   ├── price_paid_process.py
│   └── training.py
├── scripts/            # Example and utility scripts
├── data/               # Input and output data
├── notebooks/          # Analysis performed for the paper
├── tests/              # Unit tests
├── requirements.txt    # Python dependencies
├── pyproject.toml      # Project metadata
└── README.md           # Documentation

Installation

Option 1: Install from PyPI (Recommended)

pip install enhance-ocod

Option 2: Install from GitHub (Latest Development Version)

pip install git+https://github.com/JonnoB/enhance_ocod.git

Option 3: Development Installation

If you want to contribute or modify the code:

  1. Clone the repository:

    git clone https://github.com/JonnoB/enhance_ocod.git
    cd enhance_ocod
    
  2. Install in development mode:

    pip install -e .
    

    Or if you're using uv:

    uv pip install -e .
    

Notes:

  • The package name for installation is enhance-ocod (with hyphen)
  • The import name is enhance_ocod (with underscore)
  • pip treats hyphens and underscores in package names as equivalent at install time; the import name always uses the underscore
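
The hyphen/underscore point can be illustrated with the name normalisation rule that installers apply to distribution names (PEP 503). This is a minimal sketch; `normalise_dist_name` is a hypothetical helper written for this example and is not part of enhance_ocod.

```python
import re


def normalise_dist_name(name: str) -> str:
    """Normalise a distribution name as PEP 503 describes:
    runs of '-', '_' and '.' collapse to a single '-', lower-cased."""
    return re.sub(r"[-_.]+", "-", name).lower()


# Both spellings resolve to the same distribution at install time.
same = normalise_dist_name("enhance-ocod") == normalise_dist_name("enhance_ocod")
print(same)
```

The import name is a separate matter: it is whatever package directory the distribution ships, here enhance_ocod.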

Data Requirements

To recreate or update the enhanced OCOD dataset, several open datasets are required. The get_data module can download the required files, and the download_hist.py script performs the downloads automatically. If you download the files manually, place them in sub-directories of the data/ directory. The sub-directories should be named as follows:

Dataset             Folder           Type    API Available
OCOD dataset        ocod_history     csv     Yes
ONSPD               onspd            zip     Yes
Price Paid dataset  price_paid_data  folder  No
VOA ratings list    voa              csv     Yes

Note:

  • The OCOD dataset is convoluted to get hold of: you need to create an account and use a bank card to confirm your identity; the card will be charged £0.00. Whether this much security is necessary is debatable, and in fact can be debated by contacting your MP to complain.
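
When the files are placed manually, a quick check of the folder layout can catch a misnamed directory before a long run. The sketch below uses only the folder names listed above; `missing_data_dirs` is a hypothetical helper, not a function from the get_data module.

```python
from pathlib import Path

# Folder names taken from the Data Requirements list.
EXPECTED_DIRS = ("ocod_history", "onspd", "price_paid_data", "voa")


def missing_data_dirs(data_root="data"):
    """Return the expected sub-directories that are absent or empty."""
    root = Path(data_root)
    missing = []
    for name in EXPECTED_DIRS:
        folder = root / name
        # any(iterdir()) is False for an empty folder.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    gaps = missing_data_dirs("data")
    print("All data folders populated" if not gaps else f"Missing or empty: {gaps}")
```

The check only looks at folder names and whether they contain anything; it does not validate the files themselves.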

Usage

You can use the project in two main ways:

1. As a Library

Import modules from src/enhance_ocod in your own scripts:

import pandas as pd
from enhance_ocod.inference import parse_addresses_basic

# Create example DataFrame with the two addresses
example_df = pd.DataFrame({
    'address': [
        "36 - 49, chapel street, London, se45 6pq",
        "Flat 14a, 14 Barnsbury Road, London N1 1JU"
    ],
    'datapoint_id': ['addr_001', 'addr_002']  # Optional unique identifiers
})

print("Example DataFrame:")
print(example_df)

# Default behaviour is to download the fine-tuned model from the Hugging Face model library.
results = parse_addresses_basic(example_df)
print(f"Parsed {results['summary']['successful_parses']} addresses")
# ...

2. Using Provided Scripts

  • Run the full pipeline:

    python parse_ocod_history.py 
    
  • Train an NER Model:

    python scripts/run_experiments.py
    

Order to run the scripts in

To parse the OCOD history you can simply run:

python download_hist.py && python parse_ocod_history.py

This will download all necessary data and then create a folder called ocod_history_processed with one standardised OCOD parquet file per original OCOD file. Using an L4 GPU with 24 GB VRAM and 16 GB RAM, it will take 2-3 hours to process the first decade of OCOD. This will typically cost less than $2.

To reproduce the paper, run the files in the following order:

  • download_hist.py: Downloads the entire OCOD dataset history and saves it by year as zip files. Requires a 'LANDREGISTRY_API' entry in the .env file.
  • create_weak_labelling_data.py: Weakly labels the OCOD February 2022 dataset using the regex rules.
  • ready_csv_for_training.py: Creates the datasets for training and evaluating the models from the development set, weakly labelled set and test set.
  • run_experiments.py: Trains the ModernBERT models using the dev and weakly labelled sets. The script also calls the mbert_train_configurable.py script.
  • parse_ocod_history.py: Processes the entire history of the OCOD dataset. The default is to use the model from Hugging Face, so you will need to edit the file to use a model you trained yourself.
  • price_paide_msoa_averages.py: Calculates the mean price per MSOA over a rolling three-year window. This is used by price_paid_msoa_averages.ipynb.

Afterwards, you can run the .ipynb files to output the analysis.
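
The run order above can be wrapped in a small driver so that a failure stops the sequence early. This is a sketch only: it assumes the scripts live in a scripts/ directory (as the project structure suggests), and `build_commands` and `run_all` are hypothetical helpers, not part of the repository.

```python
import subprocess
import sys
from pathlib import Path

# Script order as listed above; file names are copied verbatim from that list.
PAPER_SCRIPTS = [
    "download_hist.py",
    "create_weak_labelling_data.py",
    "ready_csv_for_training.py",
    "run_experiments.py",
    "parse_ocod_history.py",
    "price_paide_msoa_averages.py",
]


def build_commands(scripts_dir="scripts"):
    """Return one [python, script_path] command per script, in run order."""
    return [[sys.executable, str(Path(scripts_dir) / name)] for name in PAPER_SCRIPTS]


def run_all(scripts_dir="scripts", dry_run=True):
    """Print each command; when dry_run is False, also execute it.

    check=True raises CalledProcessError on a non-zero exit,
    so later scripts never run on top of a failed step.
    """
    for cmd in build_commands(scripts_dir):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)  # set dry_run=False to actually execute the scripts
```

The dry run is the default because the full sequence includes model training and several hours of GPU processing.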

Pipeline Stages

The entire process contained in parse_ocod_history.py is as follows:

  1. NER labelling: label address text using a pre-trained ModernBERT model
  2. Parsing: create a dataframe using the entities
  3. Geographic location: locate properties using the ONS/OA system
  4. Classification: assign properties to property types
  5. Cleanup: expand addresses that are actually multiple addresses (e.g. "Flats 3-10")
  6. Contraction: ensure non-residential properties are only a single row
Notebooks

Several Jupyter notebooks are included for development and analysis (located in the notebooks/ directory). These are primarily for the analysis used in the paper:

  • notebooks/exploratory_analysis.ipynb
  • notebooks/price_paid_msoa.ipynb
  • notebooks/test_regex.ipynb

Pre-trained NER model

The fine-tuned ModernBERT model is available to download from Hugging Face. The model can be run directly on address strings using the Hugging Face 'pipeline' functionality; see the model card for details.
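
A Hugging Face token-classification pipeline run with aggregation_strategy="simple" returns a list of dictionaries with entity_group, score, word, start and end keys. The sketch below shows one way such output could be grouped by label. The entity list is hand-written for illustration (scores omitted), the label names are hypothetical rather than taken from the model card, and `entities_to_row` is not a function from enhance_ocod.

```python
from collections import defaultdict


def entities_to_row(address, entities):
    """Collect entity spans into {label: [text, ...]} using character offsets."""
    row = defaultdict(list)
    for ent in sorted(entities, key=lambda e: e["start"]):
        # Slice the original string so casing and spacing are preserved.
        row[ent["entity_group"]].append(address[ent["start"]:ent["end"]])
    return dict(row)


# Hand-written stand-in for pipeline output; labels are hypothetical.
address = "Flat 14a, 14 Barnsbury Road, London N1 1JU"
entities = [
    {"entity_group": "unit_id", "word": "14a", "start": 5, "end": 8},
    {"entity_group": "street_name", "word": "Barnsbury Road", "start": 13, "end": 27},
    {"entity_group": "postcode", "word": "N1 1JU", "start": 36, "end": 42},
]

if __name__ == "__main__":
    print(entities_to_row(address, entities))
```

Slicing by start and end offsets, rather than using the word field, avoids any tokenizer-induced changes to the text.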

Contributing

Contributions and suggestions are welcome! Please open issues or pull requests.

Citation

If you use this repository, please cite:

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgements

  • The enhanced OCOD dataset and pipeline were demonstrated in the paper: Inspecting the laundromat
  • Built on open data from Land Registry, ONS, and VOA
