# Wikisets

Flexible Wikipedia dataset builder with sampling and pretraining support. Built on top of wikipedia-monthly, which provides fresh, clean Wikipedia dumps updated monthly.
## Features

- 🌍 Multi-language support - access Wikipedia in any language
- 📊 Flexible sampling - use exact sizes, percentages, or prebuilt samples (1k/5k/10k)
- ⚡ Memory efficient - reservoir sampling for large datasets
- 🔄 Reproducible - deterministic sampling with seeds
- 📦 HuggingFace compatible - subclasses `datasets.Dataset`
- ✂️ Pretraining ready - built-in text chunking with tokenizer support
- 📝 Auto-generated cards - comprehensive dataset documentation
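The combination of reservoir sampling and a fixed seed is what makes sampling both memory-bounded and reproducible. A minimal sketch of the technique in plain Python (an illustration of classic Algorithm R, not the library's internal code):

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Keep a uniform random sample of k items from a stream of unknown
    length, using O(k) memory (Algorithm R)."""
    rng = random.Random(seed)  # fixed seed -> deterministic sample
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace an existing element with decreasing probability k/(i+1)
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Same seed -> same sample, without materializing the stream in memory
sample = reservoir_sample(range(100_000), k=5, seed=42)
```

Because only the k-element reservoir is ever held in memory, the same approach scales to Wikipedia-sized datasets streamed from disk.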
## Installation

```shell
pip install wikisets
```

Or with uv:

```shell
# Preferred: add to your project
uv add wikisets

# Or install directly
uv pip install wikisets
```
## Quick Start

```python
from wikisets import Wikiset, WikisetConfig

# Create a multi-language dataset
config = WikisetConfig(
    languages=[
        {"lang": "en", "size": 10000},  # 10k sample
        {"lang": "fr", "size": "50%"},  # 50% of French Wikipedia
        {"lang": "ar", "size": 0.1},    # 10% of Arabic Wikipedia
    ],
    seed=42,
)

dataset = Wikiset.create(config)

# Access it like any HuggingFace dataset
print(len(dataset))
print(dataset[0])

# View the auto-generated dataset card
print(dataset.get_card())
```
## Configuration Options

### `WikisetConfig` Parameters

- `languages` (required): List of `{lang: str, size: int|float|str}` dictionaries
  - `lang`: Language code (e.g., `"en"`, `"fr"`, `"ar"`, `"simple"`)
  - `size`: One of:
    - Integer (e.g., `1000`, `5000`, `10000`) - uses prebuilt samples when available
    - Percentage string (e.g., `"50%"`) - samples that percentage
    - Float between 0 and 1 (e.g., `0.5`) - samples that fraction
- `date` (optional, default: `"latest"`): Wikipedia dump date in `yyyymmdd` format
- `use_train_split` (optional, default: `False`): Force sampling from the full `train` split, ignoring prebuilt samples
- `shuffle` (optional, default: `False`): Proportionally interleave languages
- `seed` (optional, default: `42`): Random seed for reproducibility
- `num_proc` (optional): Number of parallel processes
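To make the three `size` forms concrete, here is how each spec could map to a row count. `resolve_size` is a hypothetical helper written for illustration, not part of the wikisets API:

```python
def resolve_size(size, total_rows):
    """Turn a size spec (int, float 0-1, or percentage string like '50%')
    into a concrete number of rows to sample."""
    if isinstance(size, str) and size.endswith("%"):
        return int(total_rows * float(size[:-1]) / 100)
    if isinstance(size, float) and 0 <= size <= 1:
        return int(total_rows * size)
    if isinstance(size, int):
        return min(size, total_rows)  # can't sample more rows than exist
    raise ValueError(f"Unsupported size spec: {size!r}")

# For a hypothetical 200,000-article Wikipedia:
resolve_size(10000, 200_000)  # -> 10000 (may hit a prebuilt 10k sample)
resolve_size("50%", 200_000)  # -> 100000
resolve_size(0.1, 200_000)    # -> 20000
```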
## Usage Examples

### Basic Usage

```python
from wikisets import Wikiset, WikisetConfig

config = WikisetConfig(
    languages=[{"lang": "en", "size": 5000}],
)
dataset = Wikiset.create(config)

# A Wikiset is just an HF Dataset
dataset.push_to_hub("my-wiki-dataset")
```
### Pretraining with Chunking

```python
# Create the base dataset
config = WikisetConfig(
    languages=[
        {"lang": "en", "size": 10000},
        {"lang": "ar", "size": 5000},
    ],
)
dataset = Wikiset.create(config)

# Convert to pretraining format with 2048-token chunks
pretrain_dataset = dataset.to_pretrain(
    split_token_len=2048,
    tokenizer="gpt2",
    nearest_delimiter="newline",
    num_proc=4,
)

# Transform it however you like (map expects a dict of updated columns)
pretrain_dataset = pretrain_dataset.map(lambda x: {"text": x["text"].upper()})

# It's still just a HuggingFace Dataset
pretrain_dataset.push_to_hub("my-wiki-pretraining-dataset")
```
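The nearest-delimiter idea can be sketched roughly as: cut the token stream into windows of at most `split_token_len` tokens, then snap each cut back to the last newline so chunks end on line boundaries. The sketch below uses whitespace splitting as a stand-in tokenizer; the actual `to_pretrain` uses the HuggingFace tokenizer you pass in, so treat this as an illustration of the chunking strategy only:

```python
def chunk_text(text, split_token_len, delimiter="\n"):
    """Split text into chunks of at most split_token_len whitespace
    'tokens', preferring to cut at the last delimiter in each window."""
    chunks = []
    start = 0
    while start < len(text):
        # Advance past up to split_token_len whitespace-delimited tokens
        pos, count = start, 0
        while pos < len(text) and count < split_token_len:
            while pos < len(text) and text[pos].isspace():
                pos += 1  # skip inter-token whitespace
            while pos < len(text) and not text[pos].isspace():
                pos += 1  # consume one token
            count += 1
        if pos >= len(text):
            chunks.append(text[start:])
            break
        # Snap the cut back to the last delimiter inside the window, if any
        cut = text.rfind(delimiter, start, pos)
        if cut > start:
            pos = cut + 1
        chunks.append(text[start:pos])
        start = pos
    return chunks

text = "para one line\nsecond line here\nthird part goes on"
chunks = chunk_text(text, split_token_len=5)
# -> ["para one line\n", "second line here\n", "third part goes on"]
```

Snapping to a delimiter keeps chunks slightly under the token budget but avoids splitting mid-paragraph, which is usually the better trade-off for pretraining data.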
## Documentation

- Quick Start Guide - get started in 5 minutes
- API Reference - complete API documentation
- Examples - common usage patterns
- Technical Specification - design and implementation details
## Builds on wikipedia-monthly

Wikisets is built on top of omarkamali/wikipedia-monthly, which provides:

- Fresh Wikipedia dumps updated monthly
- Clean, preprocessed text
- 300+ languages
- Prebuilt 1k/5k/10k samples for large languages

Wikisets adds:

- Simple configuration-based building
- Intelligent sampling strategies
- Multi-language mixing
- Pretraining utilities
- Comprehensive dataset cards
## Citation

```bibtex
@software{wikisets2025,
  author = {Omar Kamali},
  title = {Wikisets: Flexible Wikipedia Dataset Builder},
  year = {2025},
  url = {https://github.com/omarkamali/wikisets}
}
```
## License

MIT License - see LICENSE for details.

## Links

- GitHub: https://github.com/omarkamali/wikisets
- PyPI: https://pypi.org/project/wikisets/
- Documentation: https://github.com/omarkamali/wikisets/tree/main/docs
- Wikipedia Monthly: https://huggingface.co/datasets/omarkamali/wikipedia-monthly