OpenGovCorpus
A Python library for creating structured datasets and RAG embeddings from government websites.
Installation
We recommend using uv for fast, reliable package installation:
uv pip install opengovcorpus
Or with pip:
pip install opengovcorpus
Configuration
Setting Up API Keys
Create a configuration file to store your API credentials:
Location: ~/.opengovcorpus/config.json
{
"provider": "openai",
"api_key": "sk-your-api-key-here"
}
The library supports multiple embedding providers:
- OpenAI: "provider": "openai"
- Gemini: "provider": "gemini"
- Hugging Face: "provider": "huggingface"
The library automatically reads this configuration file when generating embeddings.
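If you would rather create the file from a script than by hand, the following is a minimal sketch. The helper write_config is hypothetical and not part of opengovcorpus; it only writes the two keys shown above and assumes the three provider values listed are the only valid ones.

```python
import json
import os
from pathlib import Path

# Provider values copied from the list above; anything else is rejected.
SUPPORTED_PROVIDERS = {"openai", "gemini", "huggingface"}


def write_config(provider: str, api_key: str, path) -> Path:
    """Write a config.json with "provider" and "api_key" (hypothetical helper)."""
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"unsupported provider: {provider!r}")
    path = Path(path).expanduser()          # resolves a leading "~"
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"provider": provider, "api_key": api_key}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    # The file holds a secret, so limit it to the current user.
    # On Windows, chmod only toggles the read-only flag.
    os.chmod(path, 0o600)
    return path
```

For example, write_config("openai", os.environ["OPENAI_API_KEY"], "~/.opengovcorpus/config.json") would produce the file shown above without hard-coding the key in your script.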
Usage
Import the Library
import opengovcorpus as og
1. Create Dataset
Scrape a government website and create a structured knowledge graph with prompt-response pairs, automatically split into train/validation/test sets.
og.create_dataset(
name="uk", # Name of the dataset folder
url="https://data.gov.uk", # Government website to scrape
include_metadata=True, # Include metadata for each prompt-response pair
train_split=0.8, # 80% for training
val_split=0.1, # 10% for validation
test_split=0.1 # 10% for testing
)
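The three split arguments are fractions of the scraped pairs and should add up to 1.0. As a rough illustration of what an 80/10/10 split means in row counts, here is a small sketch. The function split_counts is hypothetical, and the rounding rule (floor for train and validation, remainder to test) is an assumption; the library may round differently.

```python
import math


def split_counts(n_rows, train_split=0.8, val_split=0.1, test_split=0.1):
    """Return (train, val, test) row counts for n_rows examples (illustrative)."""
    total = train_split + val_split + test_split
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"splits must sum to 1.0, got {total}")
    n_train = int(n_rows * train_split)   # floor
    n_val = int(n_rows * val_split)       # floor
    n_test = n_rows - n_train - n_val     # remainder, so no row is lost
    return n_train, n_val, n_test


print(split_counts(1000))  # (800, 100, 100)
```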
Output Structure:
OpenGovCorpus-uk/
├── train.csv
├── valid.csv
└── test.csv
Each CSV contains structured prompt-response pairs suitable for fine-tuning or RAG applications.
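The column names in these files are not listed here, so it is worth inspecting a file before building on it. The sketch below uses only the standard csv module and makes no assumption about the header; preview_split is a hypothetical helper, not a library function.

```python
import csv
from pathlib import Path


def preview_split(csv_path, n=3):
    """Return (column_names, first n rows as dicts) from one split file."""
    with Path(csv_path).open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # zip with range(n) stops after n rows without reading the whole file.
        rows = [row for _, row in zip(range(n), reader)]
        return reader.fieldnames, rows
```

Calling preview_split("OpenGovCorpus-uk/train.csv") returns the header and the first three rows, which is usually enough to confirm which columns hold the prompt, the response, and any metadata.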
2. Generate RAG Embeddings
Convert your dataset into vector embeddings for retrieval-augmented generation (RAG) applications.
og.create_rag_embeddings(
model="openai/text-embedding-3-large", # Embedding model
vector_db="chroma", # Vector database (Chroma by default)
config_path="~/.opengovcorpus/config.json" # Path to config file
)
Supported Models:
- OpenAI: openai/text-embedding-3-large, openai/text-embedding-3-small
- Gemini: gemini/text-embedding-004
- Hugging Face: hf/sentence-transformers/all-MiniLM-L6-v2, hf/BAAI/bge-large-en-v1.5
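Each identifier above has the shape provider/model-name, where the model name may itself contain a slash (as with the Hugging Face entries). A minimal sketch of splitting such a string follows. parse_model_id is a hypothetical helper and says nothing about how the library parses the value internally. Note that the Hugging Face prefix here is hf while the config value is huggingface; how the two are reconciled is not documented in this README.

```python
def parse_model_id(model: str):
    """Split "provider/model-name" at the first slash (hypothetical helper)."""
    provider, sep, name = model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider/model-name', got {model!r}")
    return provider, name


print(parse_model_id("hf/BAAI/bge-large-en-v1.5"))  # ('hf', 'BAAI/bge-large-en-v1.5')
```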
How it works:
- Reads train.csv, valid.csv, and test.csv from your dataset
- Converts prompts (and optionally responses) into vector embeddings
- Stores embeddings in the specified vector database (Chroma local storage by default)
- Enables efficient semantic search and retrieval for RAG applications
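To make the embed, store, and retrieve steps above concrete, here is a dependency-free toy sketch. It is not how opengovcorpus works internally: a bag-of-words count vector stands in for a real embedding model, and a plain Python list stands in for Chroma. All function names are hypothetical. Only the retrieval idea, ranking stored prompts by cosine similarity to the query vector, carries over.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding model: sparse word-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(prompts):
    """Stand-in for the vector database: a list of (prompt, vector) pairs."""
    return [(p, toy_embed(p)) for p in prompts]


def search(index, query, k=1):
    """Return the k stored prompts most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [prompt for prompt, _ in ranked[:k]]


index = build_index([
    "how do i renew a passport",
    "what is the council tax rate",
])
print(search(index, "renew passport"))  # ['how do i renew a passport']
```

A real embedding model also matches paraphrases that share no words with the stored prompt, which this word-count stand-in cannot do.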
Complete Example
import opengovcorpus as og
# Step 1: Create dataset from government website
og.create_dataset(
name="uk-data",
url="https://data.gov.uk",
include_metadata=True,
train_split=0.8,
val_split=0.1,
test_split=0.1
)
# Step 2: Generate embeddings for RAG
og.create_rag_embeddings(
model="openai/text-embedding-3-large",
vector_db="chroma",
config_path="~/.opengovcorpus/config.json"
)
Features
- Automated Web Scraping: Extract structured data from government websites
- Knowledge Graph Creation: Convert scraped content into meaningful prompt-response pairs
- Dataset Splitting: Automatic train/validation/test split configuration
- Multi-Provider Support: Works with OpenAI, Gemini, and Hugging Face embeddings
- Vector Database Integration: Built-in Chroma support for local embedding storage
- RAG-Ready: Outputs are optimized for retrieval-augmented generation workflows
Use Cases
- Building government data chatbots
- Fine-tuning language models on official documentation
- Creating semantic search systems for public information
- Developing RAG applications for policy and regulation queries
- Generating training datasets for civic tech applications
Requirements
- Python 3.8+
- API key for your chosen embedding provider (OpenAI, Gemini, or Hugging Face)
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Support
For issues, questions, or feature requests, please open an issue on the GitHub repository.
File details
Details for the file opengovcorpus-0.1.2.tar.gz.
File metadata
- Download URL: opengovcorpus-0.1.2.tar.gz
- Upload date:
- Size: 20.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 66b414f84738f498895eed2f26f58cc3830ec917c4ef64b5976003b28b4ff3a1 |
| MD5 | a737e49e3d9cbb76c5f88c1466806f76 |
| BLAKE2b-256 | ed17d113dadb893f198a2598400c00298afd1dcfccaebf92beb9b2bff16bf471 |
File details
Details for the file opengovcorpus-0.1.2-py3-none-any.whl.
File metadata
- Download URL: opengovcorpus-0.1.2-py3-none-any.whl
- Upload date:
- Size: 17.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 70782d4dd7b2443acd65256f0104ed605c75f957dacd07e1895589db9ca783aa |
| MD5 | 128d0cd854dfeef2afcf0a9bf299a522 |
| BLAKE2b-256 | 0d9a6c5b6c0d6287c939f01e46f035035c210fc50ef6ed3738714fa9229d682e |