
NLWeb Data Loading

Data loading tools for NLWeb - load schema.org JSON files and RSS feeds into vector databases with automatic embedding generation.

Overview

nlweb-dataload provides a simple interface for loading structured data into vector databases. It:

  • Loads schema.org JSON files or RSS/Atom feeds
  • Automatically computes embeddings for all documents
  • Uploads to vector databases in batches
  • Supports deletion by site

Installation

# Install from PyPI (when published)
pip install nlweb-dataload

# Or install from source
pip install -e packages/dataload

Quick Start

import asyncio
import nlweb_core
from nlweb_dataload import load_to_db, delete_site

# Initialize NLWeb with config
nlweb_core.init(config_path="config.yaml")

# Load schema.org JSON file
async def main():
    result = await load_to_db(
        file_path="recipes.json",
        site="seriouseats"
    )
    print(f"Loaded {result['total_loaded']} documents")

asyncio.run(main())

Configuration

Add writer configuration to your config.yaml:

# config.yaml
retrieval_endpoints:
  azure_search_prod:
    db_type: azure_ai_search
    api_endpoint: https://your-search.search.windows.net
    api_key_env: AZURE_SEARCH_KEY
    index_name: embeddings1536
    auth_method: api_key  # or azure_ad for managed identity

    # Add writer configuration
    writer:
      enabled: true
      import_path: nlweb_azure_vectordb.azure_search_writer
      class_name: AzureSearchWriter

# Set as write endpoint
write_endpoint: azure_search_prod

Usage

Load JSON File

Load a schema.org JSON file:

from nlweb_dataload import load_to_db

# Single schema.org object or array of objects
result = await load_to_db(
    file_path="data/recipes.json",
    site="seriouseats"
)

Example JSON file:

[
  {
    "@context": "http://schema.org",
    "@type": "Recipe",
    "url": "https://www.seriouseats.com/best-pasta-recipe",
    "name": "Best Pasta Ever",
    "description": "The best pasta recipe you'll ever make",
    "author": {"@type": "Person", "name": "Chef Mario"}
  }
]

Load RSS Feed

Load an RSS or Atom feed (entries are automatically converted to schema.org Article objects):

from nlweb_dataload import load_to_db

# Load from URL
result = await load_to_db(
    file_path="https://example.com/feed.xml",
    site="example",
    file_type="rss"  # Optional, auto-detected
)

# Load from local file
result = await load_to_db(
    file_path="feeds/blog.xml",
    site="myblog",
    file_type="rss"
)

Delete Site Data

Remove all documents for a site:

from nlweb_dataload import delete_site

result = await delete_site(site="old-site.com")
print(f"Deleted {result['deleted_count']} documents")

Batch Upload

Control batch size for large datasets:

result = await load_to_db(
    file_path="large_dataset.json",
    site="example",
    batch_size=50  # Upload 50 documents at a time (default: 100)
)

Specify Endpoint

Use a specific endpoint instead of the default write_endpoint:

result = await load_to_db(
    file_path="data.json",
    site="example",
    endpoint_name="azure_search_staging"  # Override default
)

Data Format

Schema.org JSON

Each document uses the following fields:

  • url (required): Unique document URL
  • name or headline (required): Document name/title
  • description (optional): Used for embedding if present

Any valid schema.org type is supported (Recipe, Article, Product, Event, etc.).
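The field rules above can be checked before loading. A minimal validation sketch (a hypothetical helper, not part of the nlweb_dataload API):

```python
# Minimal validation sketch for the field rules above: url is required,
# and either name or headline must be present. Hypothetical helper.
def validate_document(doc: dict) -> list:
    """Return a list of problems; an empty list means the document is loadable."""
    problems = []
    if not doc.get("url"):
        problems.append("missing required field: url")
    if not (doc.get("name") or doc.get("headline")):
        problems.append("missing required field: name or headline")
    return problems


recipe = {
    "@context": "http://schema.org",
    "@type": "Recipe",
    "url": "https://www.seriouseats.com/best-pasta-recipe",
    "name": "Best Pasta Ever",
}
print(validate_document(recipe))        # []
print(validate_document({"url": "x"}))  # ['missing required field: name or headline']
```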

RSS/Atom Feeds

RSS/Atom feeds are automatically converted to schema.org Article format with:

  • url: Entry link
  • name/headline: Entry title
  • description: Entry summary/content
  • datePublished: Publication date
  • author: Entry author
  • publisher: Feed title/link
  • keywords: Entry tags/categories
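The mapping above can be sketched as follows. The package itself parses feeds with feedparser; this stand-alone version uses only the standard library and a made-up sample feed, so it is illustrative rather than the actual converter:

```python
# Illustrative RSS-item -> schema.org Article mapping, mirroring the field
# list above. Uses the stdlib instead of feedparser; sample feed is made up.
import xml.etree.ElementTree as ET

RSS = """<rss version="2.0"><channel>
  <title>My Blog</title><link>https://myblog.example</link>
  <item>
    <title>Hello World</title>
    <link>https://myblog.example/hello</link>
    <description>First post.</description>
    <pubDate>Mon, 06 Jan 2025 00:00:00 GMT</pubDate>
    <author>alice@example.com</author>
    <category>intro</category>
  </item>
</channel></rss>"""


def rss_to_articles(xml_text: str) -> list:
    channel = ET.fromstring(xml_text).find("channel")
    feed_title = channel.findtext("title")
    articles = []
    for item in channel.findall("item"):
        articles.append({
            "@context": "http://schema.org",
            "@type": "Article",
            "url": item.findtext("link"),            # entry link
            "name": item.findtext("title"),          # entry title
            "description": item.findtext("description"),
            "datePublished": item.findtext("pubDate"),
            "author": item.findtext("author"),
            "publisher": feed_title,                 # feed title
            "keywords": [c.text for c in item.findall("category")],
        })
    return articles


print(rss_to_articles(RSS)[0]["url"])  # https://myblog.example/hello
```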

Architecture

Write Interface Separation

NLWeb maintains clean separation between read and write operations:

  • nlweb_core.retriever: Read-only search interface
  • nlweb_dataload.writer: Write interface (upload/delete)

This prevents accidental writes during queries and allows different access patterns.

Writer Interface

Each vector database provider implements VectorDBWriterInterface:

from nlweb_dataload.writer import VectorDBWriterInterface

class MyDatabaseWriter(VectorDBWriterInterface):
    async def upload_documents(self, documents, **kwargs):
        # Upload documents to database
        pass

    async def delete_documents(self, filter_criteria, **kwargs):
        # Delete documents matching criteria
        pass

    async def delete_site(self, site, **kwargs):
        # Delete all documents for site
        pass
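The interface can be exercised with a toy in-memory writer. Since nlweb_dataload may not be installed, the sketch below defines a local stand-in ABC with the same three methods; when using the package, subclass the real VectorDBWriterInterface instead:

```python
# Toy in-memory writer illustrating the interface shape. The ABC here is a
# local stand-in for nlweb_dataload.writer.VectorDBWriterInterface.
import asyncio
from abc import ABC, abstractmethod


class VectorDBWriterInterface(ABC):  # stand-in, same method names as above
    @abstractmethod
    async def upload_documents(self, documents, **kwargs): ...

    @abstractmethod
    async def delete_documents(self, filter_criteria, **kwargs): ...

    @abstractmethod
    async def delete_site(self, site, **kwargs): ...


class InMemoryWriter(VectorDBWriterInterface):
    def __init__(self):
        self.docs = {}  # url -> document

    async def upload_documents(self, documents, **kwargs):
        for doc in documents:
            self.docs[doc["url"]] = doc
        return {"uploaded": len(documents)}

    async def delete_documents(self, filter_criteria, **kwargs):
        matches = [url for url, doc in self.docs.items()
                   if all(doc.get(k) == v for k, v in filter_criteria.items())]
        for url in matches:
            del self.docs[url]
        return {"deleted_count": len(matches)}

    async def delete_site(self, site, **kwargs):
        # Site deletion is just a filtered delete on the site field
        return await self.delete_documents({"site": site})


async def demo():
    writer = InMemoryWriter()
    await writer.upload_documents([{"url": "https://a", "site": "s1"},
                                   {"url": "https://b", "site": "s2"}])
    print(await writer.delete_site("s1"))  # {'deleted_count': 1}

asyncio.run(demo())
```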

Supported Vector Databases

Azure AI Search

Built-in support via nlweb-azure-vectordb:

pip install nlweb-azure-vectordb

Configuration:

retrieval_endpoints:
  azure_search:
    db_type: azure_ai_search
    writer:
      import_path: nlweb_azure_vectordb.azure_search_writer
      class_name: AzureSearchWriter

Other Databases

Create a writer class for your database:

  1. Implement VectorDBWriterInterface
  2. Add to config with import_path and class_name
  3. Install provider package

See nlweb_azure_vectordb.azure_search_writer for a reference implementation.
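The import_path and class_name config keys suggest the writer class is resolved dynamically at runtime. A sketch of what that resolution might look like (the mechanism is assumed; only the two config keys come from the documentation above, and a stdlib class stands in for a provider class):

```python
# Sketch of resolving a writer class from import_path / class_name config.
# Demonstrated with a stdlib class, since provider packages may not be installed.
import importlib


def load_writer_class(writer_config: dict):
    """Import the module named by import_path and fetch class_name from it."""
    module = importlib.import_module(writer_config["import_path"])
    return getattr(module, writer_config["class_name"])


cls = load_writer_class({"import_path": "collections", "class_name": "OrderedDict"})
print(cls.__name__)  # OrderedDict
```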

Command Line Usage

# Load JSON file
python -m nlweb_dataload.db_load \
  --file data/recipes.json \
  --site seriouseats \
  --config config.yaml

# Load RSS feed
python -m nlweb_dataload.db_load \
  --file https://example.com/feed.xml \
  --site example \
  --type rss \
  --config config.yaml

# Delete site
python -m nlweb_dataload.db_load \
  --delete-site old-site.com \
  --config config.yaml

Dependencies

  • nlweb-core>=0.5.0 - Core NLWeb functionality
  • feedparser>=6.0.0 - RSS/Atom feed parsing
  • aiohttp>=3.8.0 - Async HTTP for URL loading

Development

# Install in editable mode with dev dependencies
pip install -e "packages/dataload[dev]"

# Run tests
pytest packages/dataload/tests

License

MIT License - Copyright (c) 2025 Microsoft Corporation
