HuggingFace Datasets source connector for moss-connectors.
moss-connector-huggingface
HuggingFace Datasets source connector for Moss. Streams any public or private
dataset from the HuggingFace Hub directly into a Moss index via the datasets
library.
Install
pip install moss-connector-huggingface
This pulls in datasets as a dependency. For gated or private datasets you also
need a HuggingFace account and an access token (HF_TOKEN).
Usage — Hub dataset (streaming)
import asyncio

from moss import DocumentInfo
from moss_connector_huggingface import HuggingFaceDatasetConnector, ingest


async def main():
    source = HuggingFaceDatasetConnector(
        dataset_name="ag_news",
        split="train",
        mapper=lambda row: DocumentInfo(
            id=str(row["label"]) + "-" + row["text"][:8],
            text=row["text"],
            metadata={"category": str(row["label"])},
        ),
    )
    result = await ingest(
        source,
        project_id="your_project_id",
        project_key="your_project_key",
        index_name="ag-news",
    )
    print(f"ingested {result.doc_count} rows")


asyncio.run(main())
Use auto_id=True when you don't have a stable primary key and want Moss to
generate UUID document IDs.
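If you would rather keep deterministic IDs than switch to generated UUIDs, one option is to hash the row content. The sketch below is not part of the package: stable_doc_id is a hypothetical helper, and it assumes each row carries a "text" field as in ag_news. It also shows why a short prefix such as label plus the first eight characters can collide.

```python
import hashlib


def stable_doc_id(row: dict, text_field: str = "text", length: int = 16) -> str:
    """Return a deterministic ID derived from the row's text.

    Hypothetical helper, not part of moss-connector-huggingface.
    """
    digest = hashlib.sha256(row[text_field].encode("utf-8")).hexdigest()
    return digest[:length]


rows = [
    {"label": 3, "text": "Chip maker unveils new processor"},
    {"label": 3, "text": "Chip maker reports record quarter"},
]

# Same label and same first eight characters: a prefix-based ID collides.
prefix_ids = [str(r["label"]) + "-" + r["text"][:8] for r in rows]
print(prefix_ids[0] == prefix_ids[1])  # True

# Hashing the full text keeps the two rows apart.
hashed_ids = [stable_doc_id(r) for r in rows]
print(hashed_ids[0] != hashed_ids[1])  # True
```

A mapper could then use id=stable_doc_id(row). Rows with identical text map to the same ID, which may or may not be what you want.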
Usage — Local files
from moss import DocumentInfo
from moss_connector_huggingface import HuggingFaceLocalDatasetConnector, ingest

source = HuggingFaceLocalDatasetConnector(
    data_files="articles.jsonl",
    format="json",  # inferred from extension if omitted
    mapper=lambda row: DocumentInfo(
        id=row["id"],
        text=row["body"],
        metadata={"title": row["title"]},
    ),
)
Accepts any format supported by datasets: json / jsonl, csv, parquet,
arrow, text.
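To make the "inferred from extension" behaviour concrete, here is a minimal sketch of how such inference could work. It is an assumption, not the connector's actual code: infer_format and the mapping table are hypothetical, and the real connector may recognise more or fewer extensions.

```python
from pathlib import Path

# Assumed mapping from file extension to the format name passed to datasets;
# the connector's real table may differ.
_EXTENSION_TO_FORMAT = {
    ".json": "json",
    ".jsonl": "json",
    ".csv": "csv",
    ".parquet": "parquet",
    ".arrow": "arrow",
    ".txt": "text",
}


def infer_format(path: str) -> str:
    """Guess the dataset format from a file name (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    try:
        return _EXTENSION_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(
            f"cannot infer format from {path!r}; pass format= explicitly"
        ) from None


print(infer_format("articles.jsonl"))  # json
```

Passing format= explicitly remains the safer choice for files with unusual or missing extensions.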
Filtering rows
Pass a filter_fn to restrict which rows are ingested:
HuggingFaceDatasetConnector(
    dataset_name="ag_news",
    split="train",
    filter_fn=lambda row: row["label"] == 3,  # Sci/Tech only
    mapper=...,
)
The filter runs in Python on each row after it has been loaded, so it does not reduce download or streaming volume. It needs no configuration and works on any field.
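The following sketch illustrates why a Python-side filter does not reduce the volume read. It is a simplified stand-in, not the connector's implementation: iter_mapped and counted are hypothetical names, and plain dicts and strings replace real dataset rows and DocumentInfo objects.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Optional

Row = Dict[str, Any]


def iter_mapped(
    rows: Iterable[Row],
    mapper: Callable[[Row], Any],
    filter_fn: Optional[Callable[[Row], bool]] = None,
) -> Iterator[Any]:
    """Lazily yield mapper(row) for every row that passes filter_fn."""
    for row in rows:
        if filter_fn is not None and not filter_fn(row):
            continue  # the row was already read; it is only skipped here
        yield mapper(row)


def counted(rows: Iterable[Row], counter: Dict[str, int]) -> Iterator[Row]:
    """Pass rows through unchanged while counting how many were read."""
    for row in rows:
        counter["read"] += 1
        yield row


sample = [
    {"label": 3, "text": "a"},
    {"label": 1, "text": "b"},
    {"label": 3, "text": "c"},
]
counter = {"read": 0}
docs = list(
    iter_mapped(
        counted(sample, counter),
        mapper=lambda row: row["text"],
        filter_fn=lambda row: row["label"] == 3,
    )
)
print(docs, counter["read"])  # ['a', 'c'] 3
```

Two documents come out, but all three rows were read. To cut the volume itself, narrow the split or pick a smaller subset instead.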
Subsets and slices
# Wikipedia English subset
HuggingFaceDatasetConnector(
    dataset_name="wikipedia",
    name="20220301.en",   # subset/config name
    split="train[:500]",  # first 500 rows
    mapper=...,
)
# Gated dataset
HuggingFaceDatasetConnector(
    dataset_name="your-org/your-gated-dataset",  # placeholder: a gated dataset repo ID
    token="hf_...",  # or set HF_TOKEN env var
    split="train",
    mapper=...,
)
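To avoid hard-coding a token in source files, you can resolve it from the environment before constructing the connector. This is a small sketch under assumptions: resolve_token is a hypothetical helper, and the only thing taken from this README is that the token can come from either the token argument or the HF_TOKEN variable.

```python
import os
from typing import Optional


def resolve_token(explicit: Optional[str] = None) -> Optional[str]:
    """Prefer an explicitly passed token, otherwise fall back to HF_TOKEN.

    Returns None when neither is set, which is fine for public datasets.
    """
    return explicit or os.environ.get("HF_TOKEN") or None


token = resolve_token()
if token is None:
    print("no HF token configured; only public datasets will be readable")
```

The result can be passed as token=resolve_token() in the gated dataset example.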
Data requirements
DocumentInfo.metadata requires Dict[str, str]. HuggingFace row values can
be ints, floats, lists, and so on, so coerce them to strings in your mapper:
mapper=lambda row: DocumentInfo(
    id=str(row["id"]),
    text=row["text"],
    metadata={
        "label": str(row["label"]),       # int → str
        "score": f"{row['score']:.4f}",   # float → str
        "tags": ",".join(row["tags"]),    # list → str
    },
)
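When many fields need the same treatment, the coercion can be factored into a helper. The sketch below is one possible approach rather than anything the package ships: to_metadata_value and coerce_metadata are hypothetical names, and the choices to format floats with four decimals and to drop None values are assumptions you may want to change.

```python
from typing import Any, Dict, Iterable


def to_metadata_value(value: Any) -> str:
    """Convert a single row value to a string suitable for metadata."""
    if isinstance(value, str):
        return value
    if isinstance(value, bool):  # check bool before other numeric handling
        return "true" if value else "false"
    if isinstance(value, float):
        return f"{value:.4f}"
    if isinstance(value, (list, tuple)):
        return ",".join(to_metadata_value(v) for v in value)
    return str(value)


def coerce_metadata(row: Dict[str, Any], fields: Iterable[str]) -> Dict[str, str]:
    """Build a Dict[str, str] from the selected fields, skipping None values."""
    return {
        name: to_metadata_value(row[name])
        for name in fields
        if row.get(name) is not None
    }


row = {"id": 7, "label": 3, "score": 0.5, "tags": ["ai", "chips"], "source": None}
print(coerce_metadata(row, ["label", "score", "tags", "source"]))
# {'label': '3', 'score': '0.5000', 'tags': 'ai,chips'}
```

A mapper can then pass metadata=coerce_metadata(row, ["label", "score", "tags"]).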
Layout
src/
├── __init__.py     # re-exports HuggingFaceDatasetConnector,
│                   # HuggingFaceLocalDatasetConnector, ingest
├── connector.py    # connector classes
└── ingest.py       # ingest(), kept in sync with other connector packages
Tests
pip install -e ".[dev]"
pytest tests/test_huggingface.py -v # mocked, no network
pytest tests/test_integration_huggingface_moss.py -v -s # live HF + Moss
The unit tests mock datasets.load_dataset, so no HuggingFace token or network
connection is needed.
The integration test uses the public ag_news dataset (20-row slice) and
requires MOSS_PROJECT_ID and MOSS_PROJECT_KEY. Set HF_TOKEN only for
gated datasets.
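Before running the integration test, it can help to check that the required variables are present. A minimal sketch follows; missing_env is a hypothetical helper, not something the test suite is known to contain.

```python
import os
from typing import Iterable, List, Mapping, Optional

REQUIRED_ENV = ("MOSS_PROJECT_ID", "MOSS_PROJECT_KEY")


def missing_env(
    names: Iterable[str] = REQUIRED_ENV,
    environ: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names that are unset or empty in the environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env()
if missing:
    print("integration test would need:", ", ".join(missing))
```

In a pytest module, the same check could drive pytest.mark.skipif so the live test is skipped rather than failing when credentials are absent.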
File details
Details for the file moss_connector_huggingface-0.0.1.tar.gz.
File metadata
- Download URL: moss_connector_huggingface-0.0.1.tar.gz
- Upload date:
- Size: 9.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0778afe4f12a0c0e286cc3be69f987ff461aadd0016a5f9ed9d9c53d4d4895ef |
| MD5 | d943a97f9f6bd3d085bacc590d752c5f |
| BLAKE2b-256 | f31e58253e1e74762771e16c597d84d710de9425884ce21ff0b87f72902b2f7f |
File details
Details for the file moss_connector_huggingface-0.0.1-py3-none-any.whl.
File metadata
- Download URL: moss_connector_huggingface-0.0.1-py3-none-any.whl
- Upload date:
- Size: 7.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a8aea1ee3efe6004ac319de45727ec7f9da28f3dfaf54c314d9ca4946c138cd8 |
| MD5 | 05c78d09e74e9535b5e9c78838e68c17 |
| BLAKE2b-256 | b42563a9436a501cf49bd3088bcff962152ee7961f537ef8493c45f5faad90d7 |