hf2vespa

Stream HuggingFace datasets to Vespa JSON format

Description

A command-line tool for streaming HuggingFace datasets directly to Vespa's JSON feed format without intermediate files or loading entire datasets into memory. Define field mappings via YAML configuration or CLI arguments, then pipe output directly to vespa feed - for efficient ingestion of millions of records.

Installation

With uv

Run with uvx for a fast, isolated installation:

uvx hf2vespa

or install globally:

uv tool install hf2vespa

or with pip:

pip install hf2vespa

From Source

git clone https://github.com/thomasht86/hf2vespa.git
cd hf2vespa
uv tool install .

Requirements: Python 3.10+

Quick Start

Basic Usage

Stream a HuggingFace dataset to Vespa JSON format:

hf2vespa feed glue --config ax --split test --limit 5

Output:

{"put":"id:doc:doc::0","fields":{"premise":"The cat sat on the mat.","hypothesis":"The cat did not sit on the mat.","label":-1,"idx":0}}
{"put":"id:doc:doc::1","fields":{"premise":"The cat did not sit on the mat.","hypothesis":"The cat sat on the mat.","label":-1,"idx":1}}
{"put":"id:doc:doc::2","fields":{"premise":"When you've got no snow...","hypothesis":"When you've got snow...","label":-1,"idx":2}}
--- Completion Statistics ---
Total records processed: 5
Successful: 5
Errors: 0
Throughput: 2.1 records/sec
Elapsed time: 2.38s

Preview Dataset Schema

Inspect a dataset and generate a YAML configuration template:

hf2vespa init glue --config ax --split test --output config.yaml

Generated config.yaml:

namespace: doc
doctype: doc
id_column: # null = auto-increment

mappings:
  - source: premise
    target: premise
    type:  # string
  - source: hypothesis
    target: hypothesis
    type:  # string

Use Config File

Apply the generated configuration:

hf2vespa feed glue --config ax --split test --config-file config.yaml --limit 5

The config file defines field mappings, document IDs, and type conversions (e.g., converting lists to Vespa tensor format).

YAML Configuration

Configuration files define field mappings and document settings for Vespa feed generation.

Basic Structure

The minimal YAML configuration:

# Vespa document settings
namespace: doc              # Namespace for document IDs
doctype: doc               # Document type name
id_column:                 # Column to use as document ID (optional, auto-increment if omitted)

# Field mappings (optional - all columns included by default)
mappings:
  - source: text           # Dataset column name
    target: body           # Vespa field name (optional, defaults to source)
    type: string           # Type converter (optional)

All fields are optional. If you omit mappings, all dataset columns are included as-is. The target field defaults to the source field name if not specified.

Field Types

Type converters transform dataset values into Vespa-compatible formats.

Basic Types

Type     Purpose            Example Input      Vespa Output
string   Text data          123                "123"
int      Integer values     "42"               42
float    Decimal values     "3.14"             3.14
tensor   Vector embeddings  [0.1, 0.2, 0.3]    {"values": [0.1, 0.2, 0.3]}
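As a rough sketch, the basic converters behave like the following (hypothetical function names, not hf2vespa's actual API):

```python
def to_string(value):
    # string: coerce any value to text
    return str(value)

def to_int(value):
    # int: coerce to an integer
    return int(value)

def to_float(value):
    # float: coerce to a decimal value
    return float(value)

def to_tensor(values):
    # tensor: wrap a list in Vespa's indexed-tensor JSON form
    return {"values": list(values)}
```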

Hex-Encoded Tensors (v2.0)

Memory-efficient tensor formats using hex encoding:

Type                  Cell Type            Hex Chars/Value   Use Case
tensor_int8_hex       int8 (-128 to 127)   2                 Quantized embeddings
tensor_bfloat16_hex   bfloat16             4                 ML model weights
tensor_float32_hex    float32              8                 Standard precision
tensor_float64_hex    float64              16                High precision

mappings:
  - source: quantized_embedding
    target: qvector
    type: tensor_int8_hex  # [11, 34, 3] → {"values": "0b2203"}
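The int8 encoding above amounts to packing signed bytes and hex-encoding the result. A minimal sketch (the function name is illustrative, not part of hf2vespa):

```python
import struct

def int8_to_hex(values):
    # Pack each value as a signed 8-bit integer (-128..127), then
    # hex-encode: 2 hex characters per value.
    return {"values": struct.pack(f"{len(values)}b", *values).hex()}
```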

Scalar Types (v2.0)

Type          Purpose           Example Input                  Vespa Output
position      Geo coordinates   {"lat": 37.4, "lng": -122.0}   {"lat": 37.4, "lng": -122.0}
weightedset   Term weights      {"tag1": 10, "tag2": 5}        {"tag1": 10, "tag2": 5}
map           Key-value pairs   {1: "one", 2: "two"}           {"1": "one", "2": "two"}

Sparse and Mixed Tensors (v2.0)

For advanced tensor structures like ColBERT-style multi-vector embeddings:

Type               Purpose                        Use Case
sparse_tensor      Single mapped dimension        Term weights, feature importance
mixed_tensor       Mapped + indexed dimensions    Multi-vector embeddings
mixed_tensor_hex   Mapped + hex-encoded indexed   Memory-efficient multi-vectors

mappings:
  # Sparse tensor: {"word1": 0.8, "word2": 0.5} → {"cells": [{"address": {"key": "word1"}, "value": 0.8}, ...]}
  - source: term_weights
    target: weights
    type: sparse_tensor

  # Mixed tensor: {"w1": [1.0, 2.0], "w2": [3.0, 4.0]} → {"blocks": {"w1": [1.0, 2.0], "w2": [3.0, 4.0]}}
  - source: token_embeddings
    target: colbert
    type: mixed_tensor

If type is omitted, values are passed through as-is (no conversion).
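In sketch form, the two conversions look roughly like this (illustrative functions, not the tool's internals):

```python
def to_sparse_tensor(mapping):
    # Single mapped dimension -> Vespa "cells" notation.
    return {
        "cells": [
            {"address": {"key": key}, "value": value}
            for key, value in mapping.items()
        ]
    }

def to_mixed_tensor(mapping):
    # Mapped + indexed dimensions -> Vespa "blocks" notation.
    # All dense blocks must have the same length.
    if len({len(block) for block in mapping.values()}) > 1:
        raise ValueError("all blocks must have the same length")
    return {"blocks": dict(mapping)}
```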

Complete Examples

1. Basic text dataset (rename columns)

Simple configuration that renames columns without type conversion:

namespace: docs
doctype: article

mappings:
  - source: text
    target: body
  - source: title
    target: headline

2. Dataset with embeddings (tensor conversion)

Configuration with type conversions for embedding vectors:

namespace: search
doctype: document
id_column: doc_id

mappings:
  - source: content
    target: text
    type: string
  - source: embedding
    target: vector
    type: tensor

3. Generated config example

This is what the init command produces when you inspect a dataset schema:

namespace: doc
doctype: doc
id_column: # null = auto-increment

mappings:
  - source: premise
    target: premise
    type:  # string
  - source: hypothesis
    target: hypothesis
    type:  # string
  - source: label
    target: label
    type:  # int

The commented type hints show inferred types based on dataset schema. Uncomment and modify as needed.

Tip: Use hf2vespa init <dataset> to generate a starter config with all fields detected from the dataset schema.

CLI Reference

hf2vespa feed

Stream HuggingFace dataset to Vespa JSON format.

Usage:

hf2vespa feed DATASET [OPTIONS]

Arguments:

  • DATASET - HuggingFace dataset name (required)

Options:

  • --split TEXT - Dataset split to use [default: train]
  • --config TEXT - Dataset config name (for multi-config datasets like glue)
  • --include TEXT - Columns to include (repeatable, e.g., --include title --include text)
  • --rename TEXT - Rename columns as 'old:new' (repeatable, e.g., --rename text:body)
  • --namespace TEXT - Vespa namespace for document IDs [default: doc]
  • --doctype TEXT - Vespa document type [default: doc]
  • --config-file PATH - YAML configuration file for field mappings
  • --limit INTEGER - Process only first N records (useful for testing)
  • --id-column TEXT - Dataset column to use as document ID (omit for auto-increment)
  • --on-error [fail|skip] - Error handling mode [default: fail]
  • --num-workers INTEGER - Number of parallel workers for dataset loading [default: CPU count]

Examples:

Basic streaming:

hf2vespa feed glue --config ax

Stream specific split with limit:

hf2vespa feed glue --config ax --split test --limit 10

Filter specific columns:

hf2vespa feed glue --config ax --include premise --include hypothesis

Custom namespace and doctype:

hf2vespa feed squad --namespace wiki --doctype article

Use config file for complex mappings:

hf2vespa feed squad --config-file vespa-config.yaml

Skip errors instead of failing:

hf2vespa feed my-dataset --on-error skip

hf2vespa init

Generate a YAML config by inspecting a HuggingFace dataset schema.

Usage:

hf2vespa init DATASET [OPTIONS]

Arguments:

  • DATASET - HuggingFace dataset name (required)

Options:

  • -o, --output PATH - Output file path [default: vespa-config.yaml]
  • -s, --split TEXT - Dataset split to inspect [default: train]
  • -c, --config TEXT - Dataset config name (required for multi-config datasets)

Examples:

Generate config for a multi-config dataset:

hf2vespa init glue --config ax

Specify output file:

hf2vespa init squad --output my-config.yaml

Inspect a specific split:

hf2vespa init my-dataset --split validation --output val-config.yaml

hf2vespa install-completion

Install shell tab-completion for hf2vespa.

Usage:

hf2vespa install-completion [SHELL]

Arguments:

  • SHELL - Shell type (bash, zsh, fish). Auto-detected if omitted.

Examples:

Auto-detect shell:

hf2vespa install-completion

Explicit shell:

hf2vespa install-completion bash

After installation, restart your shell or source your shell config file (e.g., source ~/.bashrc).


Backward Compatibility

For convenience, the feed subcommand can be omitted:

# These are equivalent:
hf2vespa feed glue --config ax
hf2vespa glue --config ax

However, we recommend using the explicit feed subcommand for clarity, especially in scripts.

Cookbook

Real-world examples using public HuggingFace datasets. All commands are copy-paste ready.

Example 1: Question Answering (SQuAD)

Stream Stanford Question Answering Dataset:

# Generate config
hf2vespa init squad --output squad-config.yaml

# Preview data structure
hf2vespa feed squad --limit 3

# Full streaming with custom doctype
hf2vespa feed squad --doctype qa --namespace squad > squad-feed.jsonl

Output format: Each record contains the id, title, context, question, and answers fields.


Example 2: Text Classification (GLUE)

Stream GLUE benchmark tasks for NLU:

# MRPC (paraphrase detection)
hf2vespa feed glue --config mrpc --limit 5

# SST-2 (sentiment analysis)
hf2vespa feed glue --config sst2 --namespace sentiment --limit 5

# With column filtering (ax only has test split)
hf2vespa feed glue --config ax --split test --include premise --include hypothesis

Example 3: Retrieval (MS MARCO)

Stream MS MARCO passage retrieval dataset:

# Generate config to see structure
hf2vespa init ms_marco --config v1.1 --output msmarco-config.yaml

# Stream passages
hf2vespa feed ms_marco --config v1.1 --doctype passage --limit 1000

Example 4: Wikipedia

Stream Wikipedia articles:

# Check available configs (language editions)
# Use 20220301.en for English Wikipedia snapshot

hf2vespa init wikipedia --config 20220301.simple --output wiki-config.yaml
hf2vespa feed wikipedia --config 20220301.simple --limit 100 --doctype article

Note: Full Wikipedia is large. Use --limit for testing.


Example 5: Custom Embeddings Dataset

For datasets with pre-computed embeddings:

# embedding-config.yaml
namespace: vectors
doctype: document
id_column: doc_id

mappings:
  - source: text
    target: content
    type: string
  - source: embedding
    target: vector
    type: tensor
hf2vespa feed your-embedding-dataset --config-file embedding-config.yaml

The tensor type converts Python lists to Vespa tensor format: {"values": [0.1, 0.2, ...]}


Example 6: Hex-Encoded Embeddings (v2.0)

For memory-efficient embedding storage, use hex-encoded tensors:

# hex-embedding-config.yaml
namespace: search
doctype: document
id_column: doc_id

mappings:
  - source: text
    target: content
    type: string
  # Full precision (8 hex chars per value)
  - source: embedding
    target: vector_f32
    type: tensor_float32_hex
  # Quantized (2 hex chars per value, 4x smaller)
  - source: quantized_embedding
    target: vector_int8
    type: tensor_int8_hex
hf2vespa feed your-embedding-dataset --config-file hex-embedding-config.yaml

Example 7: ColBERT Multi-Vector Embeddings (v2.0)

For ColBERT-style token-level embeddings:

# colbert-config.yaml
namespace: colbert
doctype: passage
id_column: passage_id

mappings:
  - source: text
    target: content
  # Token embeddings: {"token1": [0.1, 0.2, ...], "token2": [...]}
  - source: token_embeddings
    target: colbert_rep
    type: mixed_tensor_hex  # Uses float32 hex by default

The mixed_tensor_hex type supports cell_type options: int8, bfloat16, float32 (default), float64.


Example 8: Geo and Weighted Data (v2.0)

For location-aware search with term weights:

# geo-weighted-config.yaml
namespace: places
doctype: venue

mappings:
  - source: name
    target: title
  # Geo coordinates for geo-search
  - source: coordinates
    target: location
    type: position  # {"lat": 37.4, "lng": -122.0}
  # Category weights for boosting
  - source: categories
    target: category_weights
    type: weightedset  # {"restaurant": 10, "cafe": 5}

Piping to Vespa

Stream directly to a Vespa instance:

# Using vespa-cli
hf2vespa feed squad --limit 1000 | vespa feed -

# Or save and feed later
hf2vespa feed squad > feed.jsonl
vespa feed feed.jsonl

Type Reference (v2.0)

Complete reference for all supported type converters.

Basic Types

string

Converts any value to string.

Input: 123 → Output: "123"

int

Converts value to integer.

Input: "42" → Output: 42

float

Converts value to float.

Input: "3.14" → Output: 3.14

tensor

Converts list to Vespa indexed tensor (JSON array format).

Input: [0.1, 0.2, 0.3]
Output: {"values": [0.1, 0.2, 0.3]}

Hex-Encoded Tensors

Memory-efficient tensor encoding for embeddings. Values are packed as binary and hex-encoded.

tensor_int8_hex

8-bit signed integers (-128 to 127). 2 hex chars per value.

Input: [11, 34, 3]
Output: {"values": "0b2203"}

Use case: Quantized embeddings, reduced storage (4x smaller than float32).

tensor_bfloat16_hex

Brain floating point (truncated float32). 4 hex chars per value.

Input: [1.0, -1.0, 0.0]
Output: {"values": "3f80bf800000"}

Use case: ML model weights, good range with reduced precision.
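Because bfloat16 is a truncated float32, the encoding can be sketched by keeping only the top two bytes of each big-endian float32 (illustrative code, not the tool's implementation):

```python
import struct

def bfloat16_to_hex(values):
    # bfloat16 = float32 with the low 16 mantissa bits dropped, so
    # truncating each big-endian float32 to its top two bytes suffices.
    return b"".join(struct.pack(">f", v)[:2] for v in values).hex()
```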

tensor_float32_hex

IEEE 754 single precision. 8 hex chars per value.

Input: [3.1415927]
Output: {"values": "40490fdb"}

Use case: Standard embedding precision.

tensor_float64_hex

IEEE 754 double precision. 16 hex chars per value.

Input: [3.141592653589793]
Output: {"values": "400921fb54442d18"}

Use case: High-precision scientific data.

Scalar Types

position

Geo coordinates for location-based search.

Input: {"lat": 37.4, "lng": -122.0}
Output: {"lat": 37.4, "lng": -122.0}

Validation: Latitude must be -90 to 90, longitude -180 to 180.
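A sketch of this pass-through-with-validation behavior (hypothetical function, assuming lat/lng keys as shown):

```python
def to_position(value):
    # Pass coordinates through unchanged, rejecting out-of-range values.
    lat, lng = value["lat"], value["lng"]
    if not -90 <= lat <= 90:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180 <= lng <= 180:
        raise ValueError(f"longitude out of range: {lng}")
    return {"lat": lat, "lng": lng}
```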

weightedset

Key-weight pairs for weighted search.

Input: {"tag1": 10, "tag2": 5}
Output: {"tag1": 10, "tag2": 5}

Note: Keys are stringified, weights converted to integers.

map

Generic key-value maps.

Input: {1: "one", 2: "two"}
Output: {"1": "one", "2": "two"}

Note: Keys are stringified.
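Both conversions reduce to normalizing keys (and, for weightedset, weights); roughly (illustrative functions, not the tool's API):

```python
def to_weightedset(value):
    # Keys stringified, weights coerced to integers.
    return {str(key): int(weight) for key, weight in value.items()}

def to_map(value):
    # Keys stringified; values passed through unchanged.
    return {str(key): val for key, val in value.items()}
```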

Sparse and Mixed Tensors

sparse_tensor

Single mapped dimension using Vespa cells notation.

Input: {"word1": 0.8, "word2": 0.5}
Output: {"cells": [{"address": {"key": "word1"}, "value": 0.8},
                   {"address": {"key": "word2"}, "value": 0.5}]}

Use case: Term weights, feature importance scores.

mixed_tensor

Combined mapped + indexed dimensions using Vespa blocks notation.

Input: {"w1": [1.0, 2.0], "w2": [3.0, 4.0]}
Output: {"blocks": {"w1": [1.0, 2.0], "w2": [3.0, 4.0]}}

Use case: ColBERT-style multi-vector embeddings. Validation: All block arrays must have the same length.

mixed_tensor_hex

Mixed tensor with hex-encoded dense dimensions.

Input: {"w1": [11, 34, 3], "w2": [-124, 5, -1]}  (with cell_type=int8)
Output: {"blocks": {"w1": "0b2203", "w2": "8405ff"}}

Cell types: int8, bfloat16, float32 (default), float64.

Use case: Memory-efficient ColBERT embeddings.
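For cell_type=int8, the conversion combines the blocks layout with byte packing; a minimal sketch (illustrative, not the tool's API):

```python
import struct

def mixed_tensor_int8_hex(mapping):
    # Hex-encode each dense block as signed 8-bit values,
    # 2 hex characters per value.
    return {
        "blocks": {
            key: struct.pack(f"{len(block)}b", *block).hex()
            for key, block in mapping.items()
        }
    }
```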

Troubleshooting

Authentication Errors

Symptom: 401 Unauthorized or 403 Forbidden when accessing private/gated datasets

Cause: HuggingFace authentication token not provided or invalid

Solution:

# Option 1: Environment variable
export HF_TOKEN=your_token_here
hf2vespa feed your-private-dataset

# Option 2: HuggingFace CLI login (persistent)
pip install huggingface_hub
huggingface-cli login

Get your token at: https://huggingface.co/settings/tokens


Memory Issues with Large Datasets

Symptom: Process killed, MemoryError, or system becomes unresponsive

Cause: Very wide datasets or large individual records exhausting memory (the tool itself streams via the HF datasets library, so whole datasets are never loaded at once)

Solution:

# Use --limit to test on a sample before a full run
hf2vespa feed large-dataset --limit 10000 > sample.jsonl

# Or pipe directly to Vespa (recommended)
hf2vespa feed large-dataset | vespa feed -

Note: HuggingFace datasets library handles streaming efficiently. Memory issues are rare but can occur with very wide datasets (many columns) or large individual records.


Type Conversion Errors

Symptom: TypeError or ValueError during feed generation

Cause: Column type doesn't match expected converter (e.g., tensor on non-list field)

Solution:

  1. Check your YAML config mappings
  2. Verify the source column type with init:
    hf2vespa init your-dataset --config your-config
    
  3. Match converter type to actual data:
    • Use tensor only for list/sequence columns (embeddings)
    • Use string, int, float for scalar values

Multi-Config Dataset Errors

Symptom: "This dataset has multiple configurations" error

Cause: Dataset requires a --config argument (like glue, super_glue, etc.)

Solution:

# List available configs (check HuggingFace dataset page)
# Then specify one:
hf2vespa feed glue --config ax
hf2vespa init glue --config cola

Dataset Not Found

Symptom: DatasetNotFoundError or 404 error

Cause: Dataset name misspelled, private without auth, or doesn't exist

Solution:

  1. Verify dataset exists on HuggingFace Hub
  2. Check spelling (case-sensitive)
  3. For private datasets, ensure HF_TOKEN is set (see Authentication Errors)

Contributing

Issues and pull requests are welcome. Please open an issue to discuss major changes before submitting PRs.

License

MIT License - see repository for details.
