Parse and extract structured data from UK government documents: GOV.UK, Hansard, ICO, FCA, BAILII, and ATRS. A toolkit for research and governance analysis.

gov-doc-parser

Install

pip install gov-doc-parser

No external dependencies: the package uses only the Python standard library.

Quick start

from gov_doc_parser import GovDocParser

parser = GovDocParser()

# Parse a document from a named source (html_text holds the page's HTML)
doc = parser.parse(html_text, source="ico")
print(doc.title, doc.date, doc.metadata)

# Auto-detect the source from the URL
doc = parser.parse(html_text, source="auto", url="https://ico.org.uk/...")

# Extract AI references with sentiment
result = parser.parse_full(html_text, source="govuk")
for ref in result.ai_references:
    print(f"[{ref.sentiment}] {ref.term}: {ref.context[:100]}")
# [regulatory] algorithm: ...automated decision-making must comply with UK GDPR...
# [negative] artificial intelligence: ...AI found to be unlawful under Equality Act...

# Parse ATRS record
doc, atrs = parser.parse_atrs(atrs_html)
print(atrs.system_name, atrs.governance_score)  # 0-100 transparency score
print(atrs.dpia_completed, atrs.human_review, atrs.legal_basis)

# Batch
results = parser.batch_parse([
    {"html": govuk_html, "source": "govuk"},
    {"html": ico_html, "source": "ico"},
])
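
The quick start assumes the HTML is already in memory. As a minimal sketch, assuming pages were saved to disk with one folder per source (a hypothetical layout such as pages/ico/notice.html), the helper below builds a list of {"html": ..., "source": ...} dicts in the shape passed to batch_parse above. The name build_batch and the folder convention are illustrative and not part of gov-doc-parser.

```python
from pathlib import Path


def build_batch(root):
    """Collect saved HTML files under root/<source>/*.html into batch items.

    Hypothetical helper: the folder name is used as the "source" value,
    so root/ico/notice.html becomes {"html": "...", "source": "ico"}.
    """
    items = []
    # sorted() keeps the order stable across runs and platforms
    for path in sorted(Path(root).glob("*/*.html")):
        items.append({
            "html": path.read_text(encoding="utf-8"),
            "source": path.parent.name,
        })
    return items
```

With the package installed, the result could then be passed on as parser.batch_parse(build_batch("pages")).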

Supported sources

Source  | Extracts
GOV.UK  | Title, date, department, document type, sections
Hansard | Date, house, speaker, debate text, AI mentions
ICO     | Enforcement type, penalty amount, decision text
FCA     | Document type (Dear CEO/PS/CP), FCA reference
BAILII  | Citation, court, judge, judgment text
ATRS    | System name, risk tier, DPIA status, governance score
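
To illustrate what choosing a source from a URL can look like, here is a minimal sketch that maps a hostname to a source key. It is not the library's own detection logic, which this page does not describe. Only "govuk" and "ico" appear as source values in the quick start; the other keys, the hostname table, and the name guess_source are assumptions made for the example.

```python
from urllib.parse import urlparse

# Hostname suffix -> source key. Only "govuk" and "ico" are source values
# shown in the quick start; the remaining keys are guesses for illustration.
HOST_TO_SOURCE = {
    "gov.uk": "govuk",
    "hansard.parliament.uk": "hansard",
    "ico.org.uk": "ico",
    "fca.org.uk": "fca",
    "bailii.org": "bailii",
}


def guess_source(url):
    """Return a source key for url, or None if the host is not recognised."""
    host = (urlparse(url).hostname or "").lower()
    for suffix, source in HOST_TO_SOURCE.items():
        # Match the bare domain or any subdomain of it (e.g. www.gov.uk)
        if host == suffix or host.endswith("." + suffix):
            return source
    return None
```

ATRS records are published on GOV.UK, so a hostname alone cannot tell them apart from other GOV.UK pages; a real detector would also need to look at the path or the page content.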

CLI

gov-doc-parser document.html --source ico
gov-doc-parser document.html --source govuk --ai-refs
gov-doc-parser atrs_record.html --atrs
gov-doc-parser document.html --json
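
The CLI can also be driven from a script. The sketch below wraps the commands above with the standard subprocess module. It rests on two assumptions that this page does not confirm: that --source and --json can be combined in one call, and that --json writes a single JSON document to stdout. The helper names build_command and run_cli are hypothetical.

```python
import json
import subprocess


def build_command(path, source=None):
    """Build the argv list for one gov-doc-parser call with JSON output."""
    cmd = ["gov-doc-parser", str(path)]
    if source is not None:
        # Assumption: --source may be combined with --json
        cmd += ["--source", source]
    cmd.append("--json")
    return cmd


def run_cli(path, source=None):
    """Run the CLI and decode stdout as JSON (assumed output channel)."""
    completed = subprocess.run(
        build_command(path, source),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)
```

Keeping the argument list in its own function makes the command easy to inspect or log before anything is executed.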

Linda Oraegbunam | LinkedIn | GitHub
