A library for creating structured datasets and RAG embeddings from government websites

OpenGovCorpus

A Python library for creating structured datasets and RAG embeddings from government websites.

Installation

We recommend using uv for fast, reliable package installation:

uv pip install opengovcorpus

Or with pip:

pip install opengovcorpus

Configuration

Setting Up API Keys

Create a configuration file to store your API credentials:

Location: ~/.opengovcorpus/config.json

{
  "provider": "openai",
  "api_key": "sk-your-api-key-here"
}

The library supports multiple embedding providers:

OpenAI: "provider": "openai"
Gemini: "provider": "gemini"
Hugging Face: "provider": "huggingface"

The library automatically reads this configuration file when generating embeddings.
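As a minimal sketch, the configuration file described above could be created and checked from Python using only the standard library. The helper names `write_config` and `read_config` are hypothetical and are not part of opengovcorpus; the sketch assumes the file needs only the two keys shown in the example.

```python
import json
from pathlib import Path
from typing import Optional

# Provider names as listed in this README.
SUPPORTED_PROVIDERS = {"openai", "gemini", "huggingface"}


def write_config(provider: str, api_key: str, home: Optional[Path] = None) -> Path:
    """Write the two documented keys to <home>/.opengovcorpus/config.json.

    Hypothetical helper: `home` defaults to the user's home directory.
    """
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"unsupported provider: {provider!r}")
    config_dir = (home or Path.home()) / ".opengovcorpus"
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / "config.json"
    payload = {"provider": provider, "api_key": api_key}
    config_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return config_path


def read_config(config_path: Path) -> dict:
    """Load the config and check that both documented keys are present."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    missing = {"provider", "api_key"} - config.keys()
    if missing:
        raise KeyError(f"config is missing keys: {sorted(missing)}")
    return config
```

Calling `write_config("openai", "sk-your-api-key-here")` would write the file to the default location under your home directory.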

Usage

Import the Library

import opengovcorpus as og

1. Create Dataset

Scrape a government website and create a structured knowledge graph with prompt-response pairs, automatically split into train/validation/test sets.

og.create_dataset(
    name="uk",                  # Name of the dataset folder
    url="https://data.gov.uk",  # Government website to scrape
    include_metadata=True,      # Include metadata for each prompt-response pair
    train_split=0.8,            # 80% for training
    val_split=0.1,              # 10% for validation
    test_split=0.1              # 10% for testing
)
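To illustrate what the three split ratios mean, here is a minimal sketch of an 80/10/10 split over a list of records. It is not the library's actual implementation; the function name `split_records`, the `seed` parameter, and the shuffle-then-slice approach are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def split_records(
    records: Sequence[dict],
    train_split: float = 0.8,
    val_split: float = 0.1,
    test_split: float = 0.1,
    seed: int = 0,
) -> Tuple[List[dict], List[dict], List[dict]]:
    """Shuffle records deterministically and slice them into three parts."""
    if abs(train_split + val_split + test_split - 1.0) > 1e-9:
        raise ValueError("train_split + val_split + test_split must equal 1.0")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # a fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train_split)
    n_val = int(len(shuffled) * val_split)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test
```

The ratios are expected to sum to 1.0, as in the example call above; the sketch rejects any other combination.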

Output Structure:

OpenGovCorpus-uk/
├── train.csv
├── valid.csv
└── test.csv

Each CSV contains structured prompt-response pairs suitable for fine-tuning or RAG applications.
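The column names of the generated CSV files are not documented here, so a quick way to inspect them is to read the headers directly. The sketch below uses only the standard library; `summarize_splits` is a hypothetical helper, and it makes no assumption about which columns exist.

```python
import csv
from pathlib import Path
from typing import Dict, List, Tuple


def summarize_splits(dataset_dir: Path) -> Dict[str, Tuple[List[str], int]]:
    """Return {file name: (column names, row count)} for each split file found."""
    summary = {}
    for file_name in ("train.csv", "valid.csv", "test.csv"):
        path = dataset_dir / file_name
        if not path.exists():
            continue  # skip splits that were not generated
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            rows = list(reader)
            summary[file_name] = (list(reader.fieldnames or []), len(rows))
    return summary
```

For the example above, `summarize_splits(Path("OpenGovCorpus-uk"))` would report the columns and row count of each file that exists in that folder.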

2. Generate RAG Embeddings

Convert your dataset into vector embeddings for retrieval-augmented generation (RAG) applications.

og.create_rag_embeddings(
    model="openai/text-embedding-3-large",      # Embedding model
    vector_db="chroma",                         # Vector database (Chroma by default)
    config_path="~/.opengovcorpus/config.json"  # Path to config file
)

Supported Models:

OpenAI: openai/text-embedding-3-large, openai/text-embedding-3-small
Gemini: gemini/text-embedding-004
Hugging Face: hf/sentence-transformers/all-MiniLM-L6-v2, hf/BAAI/bge-large-en-v1.5
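The model identifiers listed above follow a `prefix/model-name` pattern, where Hugging Face names themselves contain a slash. As a sketch of how such an identifier can be split, the hypothetical helper below separates on the first slash only. This is an assumption about the naming pattern, not a description of how the library parses it.

```python
from typing import Tuple

# Prefixes as they appear in the supported-model list in this README.
MODEL_PREFIXES = {"openai", "gemini", "hf"}


def parse_model_id(model: str) -> Tuple[str, str]:
    """Split 'prefix/model-name' on the first slash and validate the prefix."""
    prefix, separator, name = model.partition("/")
    if not separator or not name or prefix not in MODEL_PREFIXES:
        raise ValueError(f"expected '<prefix>/<model-name>', got {model!r}")
    return prefix, name
```

For example, `parse_model_id("hf/BAAI/bge-large-en-v1.5")` returns `("hf", "BAAI/bge-large-en-v1.5")`. Note that the model prefix for Hugging Face is `hf`, while the provider value in the config file is `huggingface`.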

How it works:

Reads train.csv, valid.csv, and test.csv from your dataset
Converts prompts (and optionally responses) into vector embeddings
Stores embeddings in the specified vector database (Chroma local storage by default)
Enables efficient semantic search and retrieval for RAG applications
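To make the retrieval step concrete, here is a toy sketch of semantic search over stored vectors using cosine similarity. It does not call opengovcorpus or Chroma; the three-dimensional vectors are hand-made stand-ins for real embeddings, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], index: Dict[str, Sequence[float]], k: int = 2) -> List[str]:
    """Return the k stored prompts whose vectors are most similar to the query."""
    ranked = sorted(
        index,
        key=lambda prompt: cosine_similarity(query, index[prompt]),
        reverse=True,
    )
    return ranked[:k]


# Hand-made toy vectors, not output from any embedding model.
toy_index = {
    "How do I renew a passport?": [0.9, 0.1, 0.0],
    "What is the council tax rate?": [0.1, 0.9, 0.1],
    "Where can I find open datasets?": [0.0, 0.2, 0.9],
}
```

A query vector close to the first entry, such as `top_k([1.0, 0.0, 0.0], toy_index, k=1)`, returns the passport prompt. A vector database performs the same kind of nearest-neighbour lookup at larger scale.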

Complete Example

import opengovcorpus as og

# Step 1: Create dataset from government website
og.create_dataset(
    name="uk-data",
    url="https://data.gov.uk",
    include_metadata=True,
    train_split=0.8,
    val_split=0.1,
    test_split=0.1
)

# Step 2: Generate embeddings for RAG
og.create_rag_embeddings(
    model="openai/text-embedding-3-large",
    vector_db="chroma",
    config_path="~/.opengovcorpus/config.json"
)

Features

Automated Web Scraping: Extract structured data from government websites
Knowledge Graph Creation: Convert scraped content into meaningful prompt-response pairs
Dataset Splitting: Automatic train/validation/test split configuration
Multi-Provider Support: Works with OpenAI, Gemini, and Hugging Face embeddings
Vector Database Integration: Built-in Chroma support for local embedding storage
RAG-Ready: Outputs are optimized for retrieval-augmented generation workflows

Use Cases

Building government data chatbots
Fine-tuning language models on official documentation
Creating semantic search systems for public information
Developing RAG applications for policy and regulation queries
Generating training datasets for civic tech applications

Requirements

Python 3.8+
API key for your chosen embedding provider (OpenAI, Gemini, or Hugging Face)

License

This project is licensed under the MIT License; see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

Support

For issues, questions, or feature requests, please open an issue on the GitHub repository.

Release history

0.1.4 (Nov 19, 2025)
0.1.3 (Nov 19, 2025)
0.1.2 (Nov 17, 2025), this version
0.1.1 (Nov 12, 2025)
0.1.0 (Nov 12, 2025)

Download files

Source Distribution: opengovcorpus-0.1.2.tar.gz (20.3 kB), uploaded Nov 17, 2025
Built Distribution: opengovcorpus-0.1.2-py3-none-any.whl (17.0 kB), uploaded Nov 17, 2025, Python 3

File details

Details for the file opengovcorpus-0.1.2.tar.gz.

File metadata

Download URL: opengovcorpus-0.1.2.tar.gz
Upload date: Nov 17, 2025
Size: 20.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for opengovcorpus-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`66b414f84738f498895eed2f26f58cc3830ec917c4ef64b5976003b28b4ff3a1`
MD5	`a737e49e3d9cbb76c5f88c1466806f76`
BLAKE2b-256	`ed17d113dadb893f198a2598400c00298afd1dcfccaebf92beb9b2bff16bf471`

File details

Details for the file opengovcorpus-0.1.2-py3-none-any.whl.

File metadata

Download URL: opengovcorpus-0.1.2-py3-none-any.whl
Upload date: Nov 17, 2025
Size: 17.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for opengovcorpus-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`70782d4dd7b2443acd65256f0104ed605c75f957dacd07e1895589db9ca783aa`
MD5	`128d0cd854dfeef2afcf0a9bf299a522`
BLAKE2b-256	`0d9a6c5b6c0d6287c939f01e46f035035c210fc50ef6ed3738714fa9229d682e`
