OpenGovCorpus

A Python library for creating structured datasets and RAG embeddings from government websites.

Installation

We recommend uv for fast, reliable package installation:

uv pip install opengovcorpus

Or with pip:

pip install opengovcorpus

Configuration

Setting Up API Keys

Create a configuration file to store your API credentials:

Location: ~/.opengovcorpus/config.json

{
  "provider": "openai",
  "api_key": "sk-your-api-key-here"
}

The library supports multiple embedding providers:

  • OpenAI: "provider": "openai"
  • Gemini: "provider": "gemini"
  • Hugging Face: "provider": "huggingface"

The library automatically reads this configuration file when generating embeddings.
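If you prefer to create the file from a script, the following is a minimal sketch using only the standard library. The `write_config` helper is hypothetical and not part of OpenGovCorpus; it only writes the two keys shown above and assumes the provider names listed in this section.

```python
import json
from pathlib import Path

# Provider names listed in this README.
SUPPORTED_PROVIDERS = {"openai", "gemini", "huggingface"}


def write_config(provider: str, api_key: str, path: Path) -> Path:
    """Write a config file with the two documented keys (hypothetical helper)."""
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"unsupported provider: {provider!r}")
    path = Path(path).expanduser()
    # Create ~/.opengovcorpus (or any other parent folder) if it is missing.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"provider": provider, "api_key": api_key}, indent=2))
    # Restrict access to the current user, since the file holds a secret.
    path.chmod(0o600)
    return path
```

Calling `write_config("openai", "sk-your-api-key-here", Path("~/.opengovcorpus/config.json"))` would produce the file shown above at the documented location.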

Usage

Import the Library

import opengovcorpus as og

1. Create Dataset

Scrape a government website, build a structured knowledge graph of prompt-response pairs, and split the result into train, validation, and test sets.

og.create_dataset(
    name="uk",                       # Dataset name; output folder is OpenGovCorpus-<name>
    url="https://data.gov.uk",       # Government website to scrape
    include_metadata=True,           # Include metadata for each prompt-response pair
    train_split=0.8,                 # 80% for training
    val_split=0.1,                   # 10% for validation
    test_split=0.1                   # 10% for testing
)
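To illustrate what the three split fractions mean, here is a rough sketch of a train/validation/test split. It is not the library's implementation: the `split_rows` function, the shuffling, the seed, and the rounding behavior are all assumptions made for this example.

```python
import random


def split_rows(rows, train_split=0.8, val_split=0.1, test_split=0.1, seed=0):
    """Shuffle rows and cut them into train/validation/test lists (illustrative)."""
    total = train_split + val_split + test_split
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"splits must sum to 1.0, got {total}")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train_split)
    n_val = int(len(shuffled) * val_split)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, so no row is dropped
    return train, val, test
```

With 100 rows and the fractions above, this yields 80, 10, and 10 rows respectively.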

Output Structure:

OpenGovCorpus-uk/
├── train.csv
├── valid.csv
└── test.csv

Each CSV contains structured prompt-response pairs suitable for fine-tuning or RAG applications.
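The generated files can be inspected with the standard `csv` module. The sketch below assumes only the three file names shown above; the `load_splits` helper is hypothetical, and because this README does not list the CSV column names, it reads whatever header each file has.

```python
import csv
from pathlib import Path


def load_splits(dataset_dir):
    """Read train.csv, valid.csv, and test.csv into lists of dict rows."""
    splits = {}
    for split in ("train", "valid", "test"):
        path = Path(dataset_dir) / f"{split}.csv"
        with path.open(newline="", encoding="utf-8") as handle:
            # DictReader takes column names from the first row of the file.
            splits[split] = list(csv.DictReader(handle))
    return splits
```

For example, `load_splits("OpenGovCorpus-uk")["train"][0].keys()` would show the columns present in your dataset.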

2. Generate RAG Embeddings

Convert your dataset into vector embeddings for retrieval-augmented generation (RAG) applications.

og.create_rag_embeddings(
    model="openai/text-embedding-3-large",         # Embedding model
    vector_db="chroma",                            # Vector database (Chroma by default)
    config_path="~/.opengovcorpus/config.json"     # Path to config file
)

Supported Models:

  • OpenAI: openai/text-embedding-3-large, openai/text-embedding-3-small
  • Gemini: gemini/text-embedding-004
  • Hugging Face: hf/sentence-transformers/all-MiniLM-L6-v2, hf/BAAI/bge-large-en-v1.5
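The model identifiers above follow a `<prefix>/<model-name>` pattern, and Hugging Face names contain a second slash. The sketch below shows one way such a string could be split; `parse_model_id` is a hypothetical helper, not the library's parser. Note that the `hf` prefix differs from the `huggingface` provider value used in the config file, and this README does not say how the library maps one to the other.

```python
# Prefixes taken from the model list above.
KNOWN_PREFIXES = {"openai", "gemini", "hf"}


def parse_model_id(model: str):
    """Split 'prefix/model-name' on the first slash only."""
    prefix, sep, name = model.partition("/")
    if not sep or not name or prefix not in KNOWN_PREFIXES:
        raise ValueError(f"expected '<prefix>/<model>', got {model!r}")
    # Everything after the first slash is kept intact, including further slashes.
    return prefix, name
```

Splitting on the first slash only keeps organization-qualified names such as `BAAI/bge-large-en-v1.5` in one piece.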

How it works:

  1. Reads train.csv, valid.csv, and test.csv from your dataset
  2. Converts prompts (and optionally responses) into vector embeddings
  3. Stores embeddings in the specified vector database (Chroma local storage by default)
  4. Enables efficient semantic search and retrieval for RAG applications
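The embed, store, and retrieve flow described above can be sketched without any external service. This is a toy illustration only: `toy_embed` is a word-hashing stand-in for a real embedding model, the in-memory list stands in for Chroma, and none of these names come from OpenGovCorpus.

```python
import hashlib
import math
import re


def toy_embed(text, dim=256):
    """Hash each word into one of `dim` buckets, then L2-normalize the counts."""
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


def search(query, index, top_k=1):
    """Rank stored prompts by cosine similarity to the query."""
    q = toy_embed(query)
    # Vectors are unit length, so the dot product equals cosine similarity.
    scored = [(sum(a * b for a, b in zip(q, vec)), prompt) for prompt, vec in index]
    return [prompt for _, prompt in sorted(scored, reverse=True)[:top_k]]


prompts = ["How do I renew a passport?", "Where can I find road traffic statistics?"]
index = [(p, toy_embed(p)) for p in prompts]  # stand-in for a vector database
```

A call such as `search("renew passport", index)` ranks the stored prompts by word overlap with the query. A real embedding model captures meaning rather than exact words, which is what makes retrieval semantic.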

Complete Example

import opengovcorpus as og

# Step 1: Create dataset from government website
og.create_dataset(
    name="uk-data",
    url="https://data.gov.uk",
    include_metadata=True,
    train_split=0.8,
    val_split=0.1,
    test_split=0.1
)

# Step 2: Generate embeddings for RAG
og.create_rag_embeddings(
    model="openai/text-embedding-3-large",
    vector_db="chroma",
    config_path="~/.opengovcorpus/config.json"
)

Features

  • Automated Web Scraping: Extract structured data from government websites
  • Knowledge Graph Creation: Convert scraped content into meaningful prompt-response pairs
  • Dataset Splitting: Automatic train/validation/test splits with configurable ratios
  • Multi-Provider Support: Works with OpenAI, Gemini, and Hugging Face embeddings
  • Vector Database Integration: Built-in Chroma support for local embedding storage
  • RAG-Ready: Outputs are optimized for retrieval-augmented generation workflows

Use Cases

  • Building government data chatbots
  • Fine-tuning language models on official documentation
  • Creating semantic search systems for public information
  • Developing RAG applications for policy and regulation queries
  • Generating training datasets for civic tech applications

Requirements

  • Python 3.8+
  • API key for your chosen embedding provider (OpenAI, Gemini, or Hugging Face)

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Support

For issues, questions, or feature requests, please open an issue on the GitHub repository.

