HuggingFace Datasets source connector for moss-connectors.

moss-connector-huggingface

HuggingFace Datasets source connector for Moss. Streams any public or private dataset from the HuggingFace Hub directly into a Moss index via the datasets library.

Install

pip install moss-connector-huggingface

This pulls datasets as a dependency. For gated or private datasets you also need a HuggingFace account and an access token, passed as token= or set in the HF_TOKEN environment variable.

Usage — Hub dataset (streaming)

import asyncio
from moss import DocumentInfo
from moss_connector_huggingface import HuggingFaceDatasetConnector, ingest

async def main():
    source = HuggingFaceDatasetConnector(
        dataset_name="ag_news",
        split="train",
        mapper=lambda row: DocumentInfo(
            # Illustrative ID only: label + 8-character text prefix is not guaranteed unique.
            id=str(row["label"]) + "-" + row["text"][:8],
            text=row["text"],
            metadata={"category": str(row["label"])},
        ),
    )

    result = await ingest(
        source,
        project_id="your_project_id",
        project_key="your_project_key",
        index_name="ag-news",
    )
    print(f"ingested {result.doc_count} rows")

asyncio.run(main())

Use auto_id=True when you don't have a stable primary key and want Moss to generate UUID document IDs.
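
If rows have no primary key but you still want IDs that stay the same across re-ingests, one option is to derive the ID from the row content. The following is a minimal sketch, not part of the connector: content_id is a hypothetical helper, and it assumes rows are plain dicts with label and text fields as in the example above. It also shows why a label plus short text prefix can collide.

```python
import hashlib


def content_id(row: dict, fields=("label", "text")) -> str:
    """Hypothetical helper: deterministic ID built from selected row fields."""
    digest = hashlib.sha256()
    for name in fields:
        value = str(row[name]).encode("utf-8")
        # Length-prefix each value so ("ab", "c") and ("a", "bc") hash differently.
        digest.update(len(value).to_bytes(8, "big"))
        digest.update(value)
    return digest.hexdigest()[:32]


# Made-up rows for illustration; they share a label and an 8-character prefix.
rows = [
    {"label": 2, "text": "Reuters - Stocks rose on Monday."},
    {"label": 2, "text": "Reuters - Oil prices fell on Monday."},
]

# Prefix-based IDs collapse to a single value for these rows...
prefix_ids = {str(r["label"]) + "-" + r["text"][:8] for r in rows}
# ...while content-hash IDs stay distinct and are repeatable across runs.
hash_ids = {content_id(r) for r in rows}
print(len(prefix_ids), len(hash_ids))
```

In a mapper this would be used as id=content_id(row). Unlike UUIDs from auto_id=True, a content hash changes whenever the hashed fields change, so pick fields that are stable for your data.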

Usage — Local files

from moss import DocumentInfo
from moss_connector_huggingface import HuggingFaceLocalDatasetConnector, ingest

source = HuggingFaceLocalDatasetConnector(
    data_files="articles.jsonl",
    format="json",          # inferred from extension if omitted
    mapper=lambda row: DocumentInfo(
        id=row["id"],
        text=row["body"],
        metadata={"title": row["title"]},
    ),
)

Accepts any format supported by datasets: json / jsonl, csv, parquet, arrow, text.
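
To make the "inferred from extension" behaviour concrete, here is a rough sketch of how such a lookup could work. It is an assumption for illustration, not the connector's actual logic: infer_format and the extension table are hypothetical, and the real connector may resolve formats differently.

```python
from pathlib import Path

# Assumed mapping from file extension to the `format` argument (illustrative only).
_EXTENSION_TO_FORMAT = {
    ".json": "json",
    ".jsonl": "json",
    ".csv": "csv",
    ".parquet": "parquet",
    ".arrow": "arrow",
    ".txt": "text",
}


def infer_format(data_file: str) -> str:
    """Hypothetical helper: guess the loader format from a file name."""
    suffix = Path(data_file).suffix.lower()
    try:
        return _EXTENSION_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(
            f"cannot infer format from {data_file!r}; pass format= explicitly"
        ) from None


print(infer_format("articles.jsonl"))
```

Passing format= explicitly remains the safer choice for files with unusual or missing extensions.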

Filtering rows

Pass a filter_fn to restrict which rows are ingested:

HuggingFaceDatasetConnector(
    dataset_name="ag_news",
    split="train",
    filter_fn=lambda row: row["label"] == 3,   # Sci/Tech only
    mapper=...,
)

The filter runs in Python on rows that have already been loaded, so it does not reduce download or streaming volume. It needs no extra configuration and works on any field.
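
The cost model can be sketched in plain Python. This is an illustration under stated assumptions, not the connector's source: iter_documents is a hypothetical stand-in, rows are plain dicts, and a dict takes the place of DocumentInfo so the snippet runs without Moss installed.

```python
from typing import Callable, Iterable, Iterator, Optional


def iter_documents(
    rows: Iterable[dict],
    mapper: Callable[[dict], dict],
    filter_fn: Optional[Callable[[dict], bool]] = None,
) -> Iterator[dict]:
    """Hypothetical stand-in: filter each row, then map the ones that pass."""
    for row in rows:
        # Every row is still read from the source; the filter only decides
        # whether the row is mapped and passed on.
        if filter_fn is not None and not filter_fn(row):
            continue
        yield mapper(row)


# Made-up rows for illustration.
rows = [
    {"label": 3, "text": "chip"},
    {"label": 1, "text": "match"},
    {"label": 3, "text": "probe"},
]
seen = []


def counting_rows():
    # Record each row as it is read to show that filtering happens afterwards.
    for row in rows:
        seen.append(row)
        yield row


docs = list(
    iter_documents(
        counting_rows(),
        mapper=lambda r: {"text": r["text"]},
        filter_fn=lambda r: r["label"] == 3,
    )
)
print(len(seen), len(docs))
```

All three rows are read even though only two are kept, which is why a slice such as split="train[:500]" is the better tool when the goal is to cut transfer volume.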

Subsets and slices

# Wikipedia English subset
HuggingFaceDatasetConnector(
    dataset_name="wikipedia",
    name="20220301.en",          # subset/config name
    split="train[:500]",         # first 500 rows
    mapper=...,
)

# Gated dataset
HuggingFaceDatasetConnector(
    dataset_name="your-org/your-gated-dataset",   # placeholder: use a gated dataset repo ID, not a model repo
    token="hf_...",              # or set HF_TOKEN env var
    split="train",
    mapper=...,
)
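
To keep the token out of source code, you can read it from the environment yourself and fail early when it is missing. A minimal sketch follows; require_hf_token is a hypothetical helper on the caller's side, not part of this package.

```python
import os


def require_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Hypothetical helper: return the token from the environment or raise."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"{env_var} is not set; export it before ingesting a gated dataset"
        )
    return token
```

The result can then be passed as token=require_hf_token() in place of a hard-coded "hf_..." string.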

Data requirements

DocumentInfo.metadata requires Dict[str, str]. HuggingFace row values can be ints, floats, lists, and so on, so coerce them to strings in your mapper:

mapper=lambda row: DocumentInfo(
    id=str(row["id"]),
    text=row["text"],
    metadata={
        "label":  str(row["label"]),          # int → str
        "score":  f"{row['score']:.4f}",      # float → str
        "tags":   ",".join(row["tags"]),      # list → str
    },
)
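
When many fields need coercing, a small helper keeps mappers short. The sketch below is one possible convention, not something the package provides: to_str_metadata is hypothetical, and the choices it makes (four decimal places for floats, comma-joined lists, dropped None values) are assumptions you may want to change.

```python
from typing import Any, Dict


def to_str_metadata(values: Dict[str, Any]) -> Dict[str, str]:
    """Hypothetical helper: coerce row values into a Dict[str, str]."""
    out: Dict[str, str] = {}
    for key, value in values.items():
        if value is None:
            continue  # skip missing values rather than storing the string "None"
        if isinstance(value, bool):
            # Checked before the generic branch so booleans render in lowercase.
            out[key] = "true" if value else "false"
        elif isinstance(value, float):
            out[key] = f"{value:.4f}"
        elif isinstance(value, (list, tuple)):
            out[key] = ",".join(str(item) for item in value)
        else:
            out[key] = str(value)
    return out


print(to_str_metadata({"label": 3, "score": 0.5, "tags": ["a", "b"], "note": None}))
```

In a mapper this would read metadata=to_str_metadata({"label": row["label"], "score": row["score"], "tags": row["tags"]}).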

Layout

src/
├── __init__.py      # re-exports HuggingFaceDatasetConnector,
│                    #           HuggingFaceLocalDatasetConnector, ingest
├── connector.py     # connector classes
└── ingest.py        # ingest() — kept in sync with other connector packages

Tests

pip install -e ".[dev]"
pytest tests/test_huggingface.py -v                           # mocked, no network
pytest tests/test_integration_huggingface_moss.py -v -s       # live HF + Moss

The unit tests mock datasets.load_dataset, so they need neither a HuggingFace token nor a network connection.

The integration test uses the public ag_news dataset (20-row slice) and requires MOSS_PROJECT_ID and MOSS_PROJECT_KEY. Set HF_TOKEN only for gated datasets.
