OpenGovCorpus

A Python library for creating structured datasets and RAG embeddings from government websites.

Installation

We recommend uv for fast, reliable package installation:

uv pip install opengovcorpus

Or with pip:

pip install opengovcorpus

Configuration

Setting Up API Keys

Create a configuration file to store your API credentials:

Location: ~/.opengovcorpus/config.json

{
  "provider": "openai",
  "api_key": "sk-your-api-key-here"
}

The library supports multiple embedding providers:

  • OpenAI: "provider": "openai"
  • Gemini: "provider": "gemini"
  • Hugging Face: "provider": "huggingface"

The library automatically reads this configuration file when generating embeddings.
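
If you prefer to create the file from Python rather than by hand, a minimal sketch using only the standard library could look like the following. The helper name `write_config` is hypothetical and not part of OpenGovCorpus; it just writes the two keys shown above to the documented location.

```python
import json
from pathlib import Path


def write_config(provider: str, api_key: str,
                 config_path: str = "~/.opengovcorpus/config.json") -> Path:
    """Write the provider and API key as JSON and return the resolved path.

    Hypothetical helper for illustration; not an OpenGovCorpus function.
    """
    path = Path(config_path).expanduser()
    # Create ~/.opengovcorpus (or any other parent folder) if it is missing.
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"provider": provider, "api_key": api_key}
    path.write_text(json.dumps(payload, indent=2) + "\n", encoding="utf-8")
    # Restrict the file to its owner; this has limited effect on Windows.
    path.chmod(0o600)
    return path
```

Calling `write_config("openai", "sk-your-api-key-here")` would produce the same file as the JSON example above.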

Usage

Import the Library

import opengovcorpus as og

1. Create Dataset

Scrape a government website, build a structured knowledge graph of prompt-response pairs from its content, and split the pairs automatically into train, validation, and test sets.

og.create_dataset(
    name="uk",                       # Name of the dataset folder
    url="https://data.gov.uk",       # Government website to scrape
    include_metadata=True,           # Include metadata for each prompt-response pair
    train_split=0.8,                 # 80% for training
    val_split=0.1,                   # 10% for validation
    test_split=0.1                   # 10% for testing
)
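
To make the split arithmetic concrete, here is a small sketch of how three fractions could map to row counts. It assumes the fractions are expected to sum to 1.0 and that any rounding remainder goes to the test set; the library's actual validation and rounding behavior is not documented here, and `split_counts` is a hypothetical helper.

```python
import math
from typing import Tuple


def split_counts(n_rows: int, train_split: float = 0.8,
                 val_split: float = 0.1, test_split: float = 0.1) -> Tuple[int, int, int]:
    """Return (train, validation, test) row counts for n_rows examples.

    Illustrative only; not the OpenGovCorpus implementation.
    """
    total = train_split + val_split + test_split
    # Compare with a tolerance because the fractions are floats.
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"splits must sum to 1.0, got {total}")
    n_train = int(n_rows * train_split)  # round down
    n_val = int(n_rows * val_split)      # round down
    n_test = n_rows - n_train - n_val    # remainder, so no row is dropped
    return n_train, n_val, n_test


print(split_counts(1000))  # (800, 100, 100)
```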

Output Structure:

OpenGovCorpus-uk/
├── train.csv
├── valid.csv
└── test.csv

Each CSV contains structured prompt-response pairs suitable for fine-tuning or RAG applications.
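
A quick way to inspect the generated files is to load them with the standard `csv` module. The sketch below assumes only the three file names shown above. The column names are not documented in this README, so the code reads whatever header row it finds; `load_splits` is a hypothetical helper.

```python
import csv
from pathlib import Path
from typing import Dict, List


def load_splits(dataset_dir: str) -> Dict[str, List[Dict[str, str]]]:
    """Load train.csv, valid.csv, and test.csv into lists of row dicts.

    Hypothetical helper; column names come from each file's header row.
    """
    splits: Dict[str, List[Dict[str, str]]] = {}
    for split in ("train", "valid", "test"):
        csv_path = Path(dataset_dir) / f"{split}.csv"
        with csv_path.open(newline="", encoding="utf-8") as handle:
            splits[split] = list(csv.DictReader(handle))
    return splits
```

For example, `load_splits("OpenGovCorpus-uk")["train"][0].keys()` would show which columns the training split actually contains.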

2. Generate RAG Embeddings

Convert your dataset into vector embeddings for retrieval-augmented generation (RAG) applications.

og.create_rag_embeddings(
    model="openai/text-embedding-3-large",           # Embedding model
    vector_db="chroma",                              # Vector database (Chroma by default)
    config_path="~/.opengovcorpus/config.json"       # Path to config file
)

Supported Models:

  • OpenAI: openai/text-embedding-3-large, openai/text-embedding-3-small
  • Gemini: gemini/text-embedding-004
  • Hugging Face: hf/sentence-transformers/all-MiniLM-L6-v2, hf/BAAI/bge-large-en-v1.5
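
The model identifiers above follow a `prefix/model-name` pattern, and Hugging Face names contain a further slash. Note that the model prefix `hf` differs from the `"huggingface"` value used in the config file. The sketch below shows one way such a string could be split on its first slash; it is an assumption about the format, not the library's own parser, and `parse_model_id` is a hypothetical name.

```python
from typing import Tuple

# Prefixes taken from the supported-model list above.
KNOWN_PREFIXES = {"openai", "gemini", "hf"}


def parse_model_id(model: str) -> Tuple[str, str]:
    """Split 'prefix/model-name' on the first slash only.

    Illustrative helper; not part of OpenGovCorpus.
    """
    prefix, sep, name = model.partition("/")
    if not sep or not name or prefix not in KNOWN_PREFIXES:
        raise ValueError(f"unrecognized model identifier: {model!r}")
    return prefix, name


print(parse_model_id("hf/BAAI/bge-large-en-v1.5"))  # ('hf', 'BAAI/bge-large-en-v1.5')
```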

How it works:

  1. Reads train.csv, valid.csv, and test.csv from your dataset
  2. Converts prompts (and optionally responses) into vector embeddings
  3. Stores embeddings in the specified vector database (Chroma local storage by default)
  4. Enables efficient semantic search and retrieval for RAG applications
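
To illustrate what the retrieval in step 4 means, here is a toy, dependency-free sketch of semantic search: stored vectors are ranked by cosine similarity to a query vector. The hand-written vectors stand in for real embeddings and the plain dictionary stands in for a vector database such as Chroma; none of this is the library's code, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], store: Dict[str, Sequence[float]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Return the k stored items most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings"; real models produce hundreds of dimensions.
store = {"passports": [1.0, 0.0], "taxes": [0.0, 1.0], "visas": [1.0, 1.0]}
print([doc_id for doc_id, _ in top_k([1.0, 0.1], store)])  # ['passports', 'visas']
```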

Complete Example

import opengovcorpus as og

# Step 1: Create dataset from government website
og.create_dataset(
    name="uk-data",
    url="https://data.gov.uk",
    include_metadata=True,
    train_split=0.8,
    val_split=0.1,
    test_split=0.1
)

# Step 2: Generate embeddings for RAG
og.create_rag_embeddings(
    model="openai/text-embedding-3-large",
    vector_db="chroma",
    config_path="~/.opengovcorpus/config.json"
)

Features

  • Automated Web Scraping: Extract structured data from government websites
  • Knowledge Graph Creation: Convert scraped content into meaningful prompt-response pairs
  • Dataset Splitting: Automatic train/validation/test split configuration
  • Multi-Provider Support: Works with OpenAI, Gemini, and Hugging Face embeddings
  • Vector Database Integration: Built-in Chroma support for local embedding storage
  • RAG-Ready: Outputs are optimized for retrieval-augmented generation workflows

Use Cases

  • Building government data chatbots
  • Fine-tuning language models on official documentation
  • Creating semantic search systems for public information
  • Developing RAG applications for policy and regulation queries
  • Generating training datasets for civic tech applications

Requirements

  • Python 3.8+
  • API key for your chosen embedding provider (OpenAI, Gemini, or Hugging Face)

License

This project is licensed under the MIT License; see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Support

For issues, questions, or feature requests, please open an issue on the GitHub repository.



