A package for working with the OCOD (Overseas Companies that Own Property in England and Wales) dataset

Enhanced OCOD: Offshore Companies Ownership Data Processing Pipeline

Overview

This repository provides a comprehensive pipeline and Python library for cleaning, enhancing, and analyzing the UK Land Registry's Offshore Companies Ownership Data (OCOD). The enhanced OCOD dataset resolves many issues with the raw OCOD data, making it suitable for research, analysis, and reporting on UK property owned by offshore companies.

The project includes:

A reusable, modular Python library (src/enhance_ocod) for all data processing stages
Example and utility scripts (scripts/) for training NER models, running the pipeline, and more
Documentation and reproducible workflows to create, update, and analyze the enhanced OCOD dataset

Key Features

End-to-End Pipeline: From raw OCOD data to a classified, enriched, and structured dataset
Advanced Address Parsing: Disaggregates multi-property titles and parses free-text addresses
Integration with External Data: Uses ONS Postcode Directory, Land Registry Price Paid Data, and VOA business ratings for enrichment
Property Classification: Assigns properties to categories (Residential, Business, Airspace, Land, Carpark, Unknown)
NER Model Training & Weak Labelling: A fine-tuned ModernBERT model is downloaded automatically from Hugging Face
Reproducible & Extensible: Library-based design for maintainability and reuse

Project Structure

enhance_ocod/
├── src/enhance_ocod/   # Core Python library
│   ├── address_parsing.py
│   ├── analysis.py
│   ├── inference.py
│   ├── labelling/
│   │   ├── ner_regex.py
│   │   ├── ner_spans.py
│   │   └── weak_labelling.py
│   ├── locate_and_classify.py
│   ├── preprocess.py
│   ├── price_paid_process.py
│   └── training.py
├── scripts/            # Example and utility scripts
├── data/               # Input and output data
├── notebooks/          # Analysis performed for the paper
├── tests/              # Unit tests
├── requirements.txt    # Python dependencies
├── pyproject.toml      # Project metadata
└── README.md           # Documentation

Installation

Clone the repository:

git clone https://github.com/JonnoB/enhance_ocod.git
cd enhance_ocod

Install dependencies:
```
pip install -r requirements.txt
```

Data Requirements

To recreate or update the enhanced OCOD dataset, several open datasets are required. The get_data module can download the required files, and the download_hist.py script can run the download automatically. If you download the files manually, place them in sub-directories of the data/ directory, named as follows:

| Dataset | Folder | Type | API Available |
| --- | --- | --- | --- |
| OCOD dataset | ocod_history | csv | Yes |
| ONSPD | onspd | zip | Yes |
| Price Paid dataset | price_paid_data | folder | No |
| VOA ratings list | voa | csv | Yes |
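
Before running the pipeline it can help to confirm that the folders listed above exist and contain something. The sketch below is a hypothetical helper, not part of enhance_ocod; it only assumes the folder names from the table and a data/ directory relative to the working directory.

```python
from pathlib import Path

# Folder names taken from the table above. The helper itself is a
# hypothetical illustration and is not part of enhance_ocod.
REQUIRED_SUBDIRS = ("ocod_history", "onspd", "price_paid_data", "voa")


def missing_data_dirs(data_root):
    """Return the required sub-directories that are absent or empty."""
    root = Path(data_root)
    missing = []
    for name in REQUIRED_SUBDIRS:
        folder = root / name
        # A folder counts as missing if it does not exist or is empty.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_data_dirs("data")
    if absent:
        print("Missing or empty data folders:", ", ".join(absent))
    else:
        print("All expected data folders are present.")
```

The check only looks at folder presence; it does not validate file names or formats, which the library's own loaders may treat more strictly.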

Note:

Obtaining the OCOD dataset is convoluted: you need to create an account and confirm your identity with a bank card, which will be charged £0.00. Whether this much security is necessary is debatable, and can in fact be debated by contacting your MP to complain.

Usage

You can use the project in two main ways:

1. As a Library

Import modules from src/enhance_ocod in your own scripts:

import pandas as pd
from enhance_ocod.inference import parse_addresses_basic

# Create example DataFrame with the two addresses
example_df = pd.DataFrame({
    'address': [
        "36 - 49, chapel street, London, se45 6pq",
        "Flat 14a, 14 Barnsbury Road, London N1 1JU"
    ],
    'datapoint_id': ['addr_001', 'addr_002']  # Optional unique identifiers
})

print("Example DataFrame:")
print(example_df)

# Default behaviour is to download the fine-tuned model from the Hugging Face model hub.
results = parse_addresses_basic(example_df)
print(f"Parsed {results['summary']['successful_parses']} addresses")
# ...

2. Using Provided Scripts

Run the full pipeline:
```
python scripts/parse_ocod_history.py
```
Train an NER Model:
```
python scripts/run_experiments.py
```

Order to run the scripts in

download_hist.py: Downloads the entire OCOD dataset history and saves it by year as zip files. Requires a 'LANDREGISTRY_API' entry in the .env file.
create_weak_labelling_data.py: Weakly labels the OCOD February 2022 dataset using the regex rules.
ready_csv_for_training.py: Creates the training and evaluation datasets from the development set, weakly labelled set, and test set.
run_experiments.py: Trains the ModernBERT models using the dev and weakly labelled sets. This script also calls mbert_train_configurable.py.
parse_ocod_history.py: Processes the entire history of the OCOD dataset. With the pre-trained model, it can be run directly after download_hist.py.
price_paide_msoa_averages.py: Calculates the mean price per MSOA over a rolling three-year window. The output is used by `price_paid_msoa_averages.ipynb`.
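
The ordering above could be captured in a small runner. This is a sketch under assumptions: the scripts are taken to live in scripts/, the runner and its function names are hypothetical, and the MSOA averages script is left out because it feeds the notebook rather than the main pipeline.

```python
import os
import subprocess
import sys

# Script names and order come from the list above. The scripts/ location
# and this runner are assumptions for illustration only.
FULL_ORDER = [
    "download_hist.py",
    "create_weak_labelling_data.py",
    "ready_csv_for_training.py",
    "run_experiments.py",
    "parse_ocod_history.py",
]


def planned_scripts(use_pretrained=True):
    """Return script names in run order.

    With the pre-trained model, parsing can follow the download directly,
    so the weak labelling and training steps are skipped.
    """
    if use_pretrained:
        return ["download_hist.py", "parse_ocod_history.py"]
    return list(FULL_ORDER)


def run_all(use_pretrained=True, scripts_dir="scripts"):
    """Run each planned script in turn with the current interpreter."""
    for name in planned_scripts(use_pretrained):
        path = os.path.join(scripts_dir, name)
        # check=True stops the sequence as soon as one script fails.
        subprocess.run([sys.executable, path], check=True)
```

Calling `run_all()` from the repository root would run the two-step pre-trained route; `run_all(use_pretrained=False)` would also retrain the models, which is considerably slower.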

Pipeline Stages

The entire process contained in parse_ocod_history.py is as follows:

NER Labelling: label address entities using a pre-trained ModernBERT model
Parsing: create a dataframe from the entities
Geographic Location: locate properties using the ONS/OA system
Classification: classify properties into property types
Cleanup: expand addresses that are actually multiple addresses (e.g. "Flats 3-10")
Contraction: ensure each non-residential property occupies only a single row
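
To make the Cleanup stage concrete, here is a minimal sketch of expanding a range such as "Flats 3-10" into one row per unit. The regular expression and function name are illustrative assumptions; the library's own expansion works from the NER entities and is likely to handle more cases (for example odd/even numbering).

```python
import re

# Matches a unit label followed by a numeric range, e.g. "Flats 3-10, ...".
# Pattern and function are illustrative, not the rules used by enhance_ocod.
RANGE_RE = re.compile(
    r"^(?P<label>[A-Za-z]+)\s+(?P<start>\d+)\s*-\s*(?P<end>\d+)(?P<rest>.*)$"
)


def expand_unit_range(address):
    """Return one address per unit if the address starts with a unit range."""
    match = RANGE_RE.match(address.strip())
    if match is None:
        return [address]
    start, end = int(match["start"]), int(match["end"])
    if end < start:
        # Not a sensible range; leave the address untouched.
        return [address]
    label = match["label"]
    # Use the singular form for individual rows: "Flats" becomes "Flat".
    if label.lower().endswith("s"):
        label = label[:-1]
    return [f"{label} {n}{match['rest']}" for n in range(start, end + 1)]


if __name__ == "__main__":
    for row in expand_unit_range("Flats 3-5, 14 Barnsbury Road, London"):
        print(row)
```

Addresses that do not begin with a labelled range, such as "36 - 49, chapel street", are returned unchanged by this sketch.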

Notebooks

Several Jupyter notebooks are included for development and analysis (located in the notebooks/ directory). These are primarily for the analysis used in the paper:

notebooks/exploratory_analysis.ipynb
notebooks/price_paid_msoa.ipynb
notebooks/test_regex.ipynb

Pre-trained NER model

The fine-tuned ModernBERT model is available to download from Hugging Face. The model can be run directly on address strings using the Hugging Face 'pipeline' functionality; see the model card for details.
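
When the model is run through a transformers token-classification pipeline created with aggregation_strategy="simple", each prediction is a dict with entity_group, score, word, start and end keys. The sketch below groups such predictions by label; the function is hypothetical and the label names in the example (unit_id, street_name, postcode) are made up for illustration, so check the model card for the real ones.

```python
def entities_by_label(text, predictions, min_score=0.0):
    """Group predicted spans by entity label.

    `predictions` is assumed to have the shape produced by a transformers
    token-classification pipeline with aggregation_strategy="simple".
    """
    grouped = {}
    for pred in predictions:
        if pred["score"] < min_score:
            continue
        # Slice the original text so the casing and spacing are preserved.
        span = text[pred["start"]:pred["end"]]
        grouped.setdefault(pred["entity_group"], []).append(span)
    return grouped


if __name__ == "__main__":
    address = "Flat 14a, 14 Barnsbury Road, London N1 1JU"
    # Hand-written predictions with hypothetical labels and scores.
    sample = [
        {"entity_group": "unit_id", "score": 0.99, "word": "14a", "start": 5, "end": 8},
        {"entity_group": "street_name", "score": 0.97, "word": "barnsbury road", "start": 13, "end": 27},
        {"entity_group": "postcode", "score": 0.98, "word": "n1 1ju", "start": 36, "end": 42},
    ]
    print(entities_by_label(address, sample))
```

Slicing by character offsets rather than using the word field avoids tokenizer artefacts such as lower-casing or merged whitespace.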

Contributing

Contributions and suggestions are welcome! Please open issues or pull requests.

Citation

If you use this repository, please cite:

J Bourne et al (2023). "What's in the laundromat? Mapping and characterising offshore owned residential property in London" https://doi.org/10.1177/2399808323115548

License

This project is licensed under the GNU 3.0 License. See the LICENSE file for details.

Acknowledgements

The enhanced OCOD dataset and pipeline were demonstrated in the paper: Inspecting the laundromat
Built on open data from Land Registry, ONS, and VOA

Release history

0.2.1: Oct 3, 2025
0.2.0: Sep 19, 2025 (this version)
0.1.0: Sep 19, 2025
