# NLWeb Data Loading
Data loading tools for NLWeb - load schema.org JSON files and RSS feeds into vector databases with automatic embedding generation.
## Overview

`nlweb-dataload` provides a simple interface for loading structured data into vector databases. It:
- Loads schema.org JSON files or RSS/Atom feeds
- Automatically computes embeddings for all documents
- Uploads to vector databases in batches
- Supports deletion by site
## Installation
```bash
# Install from PyPI (when published)
pip install nlweb-dataload

# Or install from source
pip install -e packages/dataload
```
## Quick Start
```python
import asyncio

import nlweb_core
from nlweb_dataload import load_to_db, delete_site

# Initialize NLWeb with config
nlweb_core.init(config_path="config.yaml")

# Load schema.org JSON file
async def main():
    result = await load_to_db(
        file_path="recipes.json",
        site="seriouseats"
    )
    print(f"Loaded {result['total_loaded']} documents")

asyncio.run(main())
```
## Configuration

Add writer configuration to your `config.yaml`:
```yaml
# config.yaml
retrieval_endpoints:
  azure_search_prod:
    db_type: azure_ai_search
    api_endpoint: https://your-search.search.windows.net
    api_key_env: AZURE_SEARCH_KEY
    index_name: embeddings1536
    auth_method: api_key  # or azure_ad for managed identity

    # Add writer configuration
    writer:
      enabled: true
      import_path: nlweb_azure_vectordb.azure_search_writer
      class_name: AzureSearchWriter

# Set as write endpoint
write_endpoint: azure_search_prod
```
## Usage

### Load JSON File

Load a schema.org JSON file:
```python
from nlweb_dataload import load_to_db

# Single schema.org object or array of objects
result = await load_to_db(
    file_path="data/recipes.json",
    site="seriouseats"
)
```
Example JSON file:
```json
[
  {
    "@context": "http://schema.org",
    "@type": "Recipe",
    "url": "https://www.seriouseats.com/best-pasta-recipe",
    "name": "Best Pasta Ever",
    "description": "The best pasta recipe you'll ever make",
    "author": {"@type": "Person", "name": "Chef Mario"}
  }
]
```
### Load RSS Feed

Load an RSS or Atom feed (entries are automatically converted to schema.org Article format):
```python
from nlweb_dataload import load_to_db

# Load from URL
result = await load_to_db(
    file_path="https://example.com/feed.xml",
    site="example",
    file_type="rss"  # Optional, auto-detected
)

# Load from local file
result = await load_to_db(
    file_path="feeds/blog.xml",
    site="myblog",
    file_type="rss"
)
```
### Delete Site Data

Remove all documents for a site:
```python
from nlweb_dataload import delete_site

result = await delete_site(site="old-site.com")
print(f"Deleted {result['deleted_count']} documents")
```
### Batch Upload

Control batch size for large datasets:
```python
result = await load_to_db(
    file_path="large_dataset.json",
    site="example",
    batch_size=50  # Upload 50 documents at a time (default: 100)
)
```
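Under the hood, batching simply splits the document list into chunks before uploading. A minimal sketch of that chunking step (illustrative only, not the library's actual implementation):

```python
# Split a list of documents into upload batches of at most batch_size.
def batched(items, batch_size=100):
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# With 120 documents and batch_size=50, this yields batches of 50, 50, and 20.
docs = [{"url": f"https://example.com/{n}"} for n in range(120)]
batches = list(batched(docs, batch_size=50))
```

Smaller batches keep individual requests under the database's payload limits at the cost of more round trips.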
### Specify Endpoint

Use a specific endpoint instead of the default `write_endpoint`:
```python
result = await load_to_db(
    file_path="data.json",
    site="example",
    endpoint_name="azure_search_staging"  # Override default
)
```
## Data Format

### Schema.org JSON
Documents must include these fields:
- `url` (required): Unique document URL
- `name` or `headline` (required): Document name/title
- `description` (optional): Used for embedding if present
Any valid schema.org type is supported (Recipe, Article, Product, Event, etc.).
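For instance, a minimal Product document (illustrative values) needs only the same required fields:

```json
{
  "@context": "http://schema.org",
  "@type": "Product",
  "url": "https://shop.example.com/widget",
  "name": "Deluxe Widget",
  "description": "A sturdy widget for everyday use"
}
```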
### RSS/Atom Feeds

RSS/Atom feeds are automatically converted to schema.org Article format with:
- `url`: Entry link
- `name`/`headline`: Entry title
- `description`: Entry summary/content
- `datePublished`: Publication date
- `author`: Entry author
- `publisher`: Feed title/link
- `keywords`: Entry tags/categories
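The field mapping above can be sketched as follows. This is an illustrative sketch, not the library's actual converter; the entry/feed field names (`link`, `title`, `summary`, `tags`, etc.) are assumed to follow feedparser's conventions:

```python
# Map a parsed feed entry onto a schema.org Article dict, field by field,
# following the bullet list above.
def entry_to_article(entry, feed):
    return {
        "@context": "http://schema.org",
        "@type": "Article",
        "url": entry.get("link"),
        "name": entry.get("title"),
        "description": entry.get("summary"),
        "datePublished": entry.get("published"),
        "author": {"@type": "Person", "name": entry.get("author")},
        "publisher": {"@type": "Organization",
                      "name": feed.get("title"),
                      "url": feed.get("link")},
        "keywords": [t["term"] for t in entry.get("tags", [])],
    }

# Example entry as a plain dict (shapes assumed for illustration)
entry = {
    "link": "https://example.com/post-1",
    "title": "First Post",
    "summary": "Hello world",
    "published": "2025-01-01",
    "author": "Jane Doe",
    "tags": [{"term": "intro"}],
}
feed = {"title": "Example Blog", "link": "https://example.com"}
article = entry_to_article(entry, feed)
```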
## Architecture

### Write Interface Separation
NLWeb maintains clean separation between read and write operations:
- `nlweb_core.retriever`: Read-only search interface
- `nlweb_dataload.writer`: Write interface (upload/delete)
This prevents accidental writes during queries and allows different access patterns.
### Writer Interface

Each vector database provider implements `VectorDBWriterInterface`:
```python
from nlweb_dataload.writer import VectorDBWriterInterface

class MyDatabaseWriter(VectorDBWriterInterface):
    async def upload_documents(self, documents, **kwargs):
        # Upload documents to database
        pass

    async def delete_documents(self, filter_criteria, **kwargs):
        # Delete documents matching criteria
        pass

    async def delete_site(self, site, **kwargs):
        # Delete all documents for site
        pass
```
## Supported Vector Databases

### Azure AI Search

Built-in support via `nlweb-azure-vectordb`:
```bash
pip install nlweb-azure-vectordb
```
Configuration:
```yaml
retrieval_endpoints:
  azure_search:
    db_type: azure_ai_search
    writer:
      import_path: nlweb_azure_vectordb.azure_search_writer
      class_name: AzureSearchWriter
```
### Other Databases

Create a writer class for your database:

1. Implement `VectorDBWriterInterface`
2. Add it to the config with `import_path` and `class_name`
3. Install the provider package
See `nlweb_azure_vectordb.azure_search_writer` for a reference implementation.
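To make the contract concrete, here is a hypothetical in-memory writer sketch. It mirrors the three methods shown above but is self-contained (it does not subclass the real `VectorDBWriterInterface`), and the document fields it assumes (`url`, `site`) are illustrative:

```python
import asyncio

class InMemoryWriter:
    """Toy writer: stores documents in a dict keyed by URL."""

    def __init__(self):
        self.store = {}  # url -> document

    async def upload_documents(self, documents, **kwargs):
        for doc in documents:
            self.store[doc["url"]] = doc
        return {"uploaded": len(documents)}

    async def delete_documents(self, filter_criteria, **kwargs):
        # Delete every document whose fields match all filter criteria
        doomed = [url for url, doc in self.store.items()
                  if all(doc.get(k) == v for k, v in filter_criteria.items())]
        for url in doomed:
            del self.store[url]
        return {"deleted": len(doomed)}

    async def delete_site(self, site, **kwargs):
        # Site deletion is just a filtered delete on the "site" field
        return await self.delete_documents({"site": site})

async def demo():
    writer = InMemoryWriter()
    await writer.upload_documents([
        {"url": "https://a.example/1", "site": "a.example"},
        {"url": "https://b.example/1", "site": "b.example"},
    ])
    result = await writer.delete_site("a.example")
    return result, len(writer.store)

result, remaining = asyncio.run(demo())
```

A real implementation would translate these calls into the database's bulk upload and filtered delete APIs.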
## Command Line Usage
```bash
# Load JSON file
python -m nlweb_dataload.db_load \
  --file data/recipes.json \
  --site seriouseats \
  --config config.yaml

# Load RSS feed
python -m nlweb_dataload.db_load \
  --file https://example.com/feed.xml \
  --site example \
  --type rss \
  --config config.yaml

# Delete site
python -m nlweb_dataload.db_load \
  --delete-site old-site.com \
  --config config.yaml
```
## Dependencies

- `nlweb-core>=0.5.0` - Core NLWeb functionality
- `feedparser>=6.0.0` - RSS/Atom feed parsing
- `aiohttp>=3.8.0` - Async HTTP for URL loading
## Development
```bash
# Install in editable mode with dev dependencies
pip install -e "packages/dataload[dev]"

# Run tests
pytest packages/dataload/tests
```
## License

MIT License - Copyright (c) 2025 Microsoft Corporation