OpenGovCorpus

A Python library for creating structured datasets and RAG embeddings from government websites.

Installation

We recommend uv for fast, reliable package installation:

uv pip install opengovcorpus

Or with pip:

pip install opengovcorpus

Configuration

Setting Up API Keys

Create a configuration file to store your API credentials:

Location: ~/.opengovcorpus/config.json

{
  "provider": "openai",
  "api_key": "sk-your-api-key-here"
}

The library supports multiple embedding providers:

  • OpenAI: "provider": "openai"
  • Gemini: "provider": "gemini"
  • Hugging Face: "provider": "huggingface"

The library automatically reads this configuration file when generating embeddings.
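
If you would rather create the configuration file from Python than by hand, the following is a minimal sketch. It only writes the two keys shown above to the documented location. The helper name `write_config` is hypothetical and is not part of opengovcorpus.

```python
import json
import os
from pathlib import Path


def write_config(provider, api_key, config_dir=None):
    """Write config.json with the two documented keys and return its path.

    Hypothetical helper for illustration; not part of opengovcorpus.
    """
    # Default to the documented location, ~/.opengovcorpus/
    config_dir = Path(config_dir) if config_dir else Path.home() / ".opengovcorpus"
    config_dir.mkdir(parents=True, exist_ok=True)

    config_path = config_dir / "config.json"
    config_path.write_text(
        json.dumps({"provider": provider, "api_key": api_key}, indent=2),
        encoding="utf-8",
    )
    # The file holds a secret, so restrict it to the current user where POSIX
    # permissions apply (this has limited effect on Windows).
    os.chmod(config_path, 0o600)
    return config_path
```

Calling `write_config("openai", "sk-your-api-key-here")` writes `~/.opengovcorpus/config.json`. Passing a third argument writes the file to a different directory, which is handy for testing.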

Usage

Import the Library

import opengovcorpus as og

1. Create Dataset

Scrape a government website and build a structured knowledge graph of prompt-response pairs, automatically split into train, validation, and test sets.

og.create_dataset(
    name="uk",                  # Dataset name; output is written to OpenGovCorpus-uk/
    url="https://data.gov.uk",  # Government website to scrape
    include_metadata=True,      # Include metadata for each prompt-response pair
    train_split=0.8,            # 80% for training
    val_split=0.1,              # 10% for validation
    test_split=0.1              # 10% for testing
)
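
To make the split parameters concrete, here is a rough sketch of how three fractions can partition a list of records. It is an illustration only, not the library's actual implementation. The function name `split_records`, the `seed` argument, and the requirement that the fractions sum to 1.0 are all assumptions.

```python
import random


def split_records(records, train_split=0.8, val_split=0.1, test_split=0.1, seed=0):
    """Shuffle records and partition them by the given fractions.

    Illustrative only; opengovcorpus may split differently.
    """
    total = train_split + val_split + test_split
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"splits must sum to 1.0, got {total}")

    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits

    n_train = int(len(shuffled) * train_split)
    n_val = int(len(shuffled) * val_split)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder absorbs any rounding
    return train, val, test
```

With 100 records and the default fractions, this yields 80 training, 10 validation, and 10 test records.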

Output Structure:

OpenGovCorpus-uk/
├── train.csv
├── valid.csv
└── test.csv

Each CSV contains structured prompt-response pairs suitable for fine-tuning or RAG applications.
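
A quick way to check the output is to count the rows and list the columns in each file. The sketch below uses only the standard library and makes no assumption about column names, since they are not documented here. `summarize_dataset` is a hypothetical helper, not part of opengovcorpus.

```python
import csv
from pathlib import Path


def summarize_dataset(dataset_dir):
    """Return {file name: (column names, row count)} for the three split files."""
    summary = {}
    for file_name in ("train.csv", "valid.csv", "test.csv"):
        path = Path(dataset_dir) / file_name
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            row_count = sum(1 for _ in reader)  # the header row is not counted
            summary[file_name] = (reader.fieldnames, row_count)
    return summary
```

For the dataset created above, `summarize_dataset("OpenGovCorpus-uk")` reports the header and size of each split.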

2. Generate RAG Embeddings

Convert your dataset into vector embeddings for retrieval-augmented generation (RAG) applications.

og.create_rag_embeddings(
    model="openai/text-embedding-3-large",      # Embedding model
    vector_db="chroma",                         # Vector database (Chroma by default)
    config_path="~/.opengovcorpus/config.json"  # Path to config file
)

Supported Models:

  • OpenAI: openai/text-embedding-3-large, openai/text-embedding-3-small
  • Gemini: gemini/text-embedding-004
  • Hugging Face: hf/sentence-transformers/all-MiniLM-L6-v2, hf/BAAI/bge-large-en-v1.5
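
The identifiers above appear to follow a `provider/model-name` pattern, where the Hugging Face model name can itself contain a slash. Assuming that reading is correct, an identifier can be split on the first slash only. `parse_model_id` is a hypothetical helper for illustration, not a documented library function.

```python
def parse_model_id(model):
    """Split 'provider/model-name' on the first slash only.

    Assumes the prefix before the first slash names the provider.
    """
    provider, sep, model_name = model.partition("/")
    if not sep or not provider or not model_name:
        raise ValueError(f"expected 'provider/model-name', got {model!r}")
    return provider, model_name
```

For example, `parse_model_id("hf/BAAI/bge-large-en-v1.5")` returns `("hf", "BAAI/bge-large-en-v1.5")`.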

How it works:

  1. Reads train.csv, valid.csv, and test.csv from your dataset
  2. Converts prompts (and optionally responses) into vector embeddings
  3. Stores embeddings in the specified vector database (Chroma local storage by default)
  4. Enables efficient semantic search and retrieval for RAG applications
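
To show what semantic search over stored embeddings means in step 4, here is a small standard-library sketch that ranks texts by cosine similarity. The 3-dimensional vectors and example questions are made up for illustration. Real embeddings come from the chosen model and have far more dimensions, and a vector database such as Chroma performs this lookup for you.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed, k=1):
    """Return the k (text, score) pairs most similar to query_vector."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for real model embeddings (assumed values).
indexed = [
    ("How do I renew a passport?", [0.9, 0.1, 0.0]),
    ("What is the income tax rate?", [0.1, 0.9, 0.1]),
    ("Where can I find open datasets?", [0.0, 0.2, 0.9]),
]
query = [0.85, 0.2, 0.05]  # pretend embedding of a passport-related question

best_text, best_score = top_k(query, indexed, k=1)[0]
print(best_text)  # prints "How do I renew a passport?"
```

The query vector points in nearly the same direction as the first stored vector, so that text ranks highest.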

Complete Example

import opengovcorpus as og

# Step 1: Create dataset from government website
og.create_dataset(
    name="uk-data",
    url="https://data.gov.uk",
    include_metadata=True,
    train_split=0.8,
    val_split=0.1,
    test_split=0.1
)

# Step 2: Generate embeddings for RAG
og.create_rag_embeddings(
    model="openai/text-embedding-3-large",
    vector_db="chroma",
    config_path="~/.opengovcorpus/config.json"
)

Features

  • Automated Web Scraping: Extract structured data from government websites
  • Knowledge Graph Creation: Convert scraped content into meaningful prompt-response pairs
  • Dataset Splitting: Configurable train/validation/test split ratios
  • Multi-Provider Support: Works with OpenAI, Gemini, and Hugging Face embeddings
  • Vector Database Integration: Built-in Chroma support for local embedding storage
  • RAG-Ready: Outputs are optimized for retrieval-augmented generation workflows

Use Cases

  • Building government data chatbots
  • Fine-tuning language models on official documentation
  • Creating semantic search systems for public information
  • Developing RAG applications for policy and regulation queries
  • Generating training datasets for civic tech applications

Requirements

  • Python 3.8+
  • API key for your chosen embedding provider (OpenAI, Gemini, or Hugging Face)

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Support

For issues, questions, or feature requests, please open an issue on the GitHub repository.

