
Wikisets

Python 3.9+ · MIT License

Flexible Wikipedia dataset builder with sampling and pretraining support. Built on top of wikipedia-monthly, which provides fresh, clean Wikipedia dumps updated monthly.

Features

  • 🌍 Multi-language support - Access Wikipedia in any language
  • 📊 Flexible sampling - Use exact sizes, percentages, or prebuilt samples (1k/5k/10k)
  • 💾 Memory efficient - Reservoir sampling for large datasets (sketched after this list)
  • 🔄 Reproducible - Deterministic sampling with seeds
  • 📦 HuggingFace compatible - Subclasses datasets.Dataset
  • ✂️ Pretraining ready - Built-in text chunking with tokenizer support
  • 📝 Auto-generated cards - Comprehensive dataset documentation
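
Reservoir sampling is what keeps memory usage flat: it holds a fixed-size buffer and replaces entries with decreasing probability as the stream goes by, so the full dataset is never materialized. Here is a minimal sketch of the technique itself (Algorithm R) — not Wikisets' internal code, and sample_reservoir is a hypothetical name:

import random

def sample_reservoir(stream, k, seed=42):
    """Uniformly sample k items from a stream of unknown length,
    holding only k items in memory at any time."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # keep item i with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# 5 articles sampled uniformly from a stream of 1,000,000, in O(k) memory
sample = sample_reservoir(range(1_000_000), k=5)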

Installation

pip install wikisets

Or with uv:

# Preferred: Add to your project
uv add wikisets

# Or just install
uv pip install wikisets

Quick Start

from wikisets import Wikiset, WikisetConfig

# Create a multi-language dataset
config = WikisetConfig(
    languages=[
        {"lang": "en", "size": 10000},      # 10k sample
        {"lang": "fr", "size": "50%"},      # 50% of French Wikipedia
        {"lang": "ar", "size": 0.1},        # 10% of Arabic Wikipedia
    ],
    seed=42
)

dataset = Wikiset.create(config)

# Access like any HuggingFace dataset
print(len(dataset))
print(dataset[0])

# View dataset card
print(dataset.get_card())

Configuration Options

WikisetConfig Parameters

  • languages (required): List of {lang: str, size: int|float|str} dictionaries
    • lang: Language code (e.g., "en", "fr", "ar", "simple")
    • size: Can be:
      • Integer (e.g., 1000, 5000, 10000) - Uses prebuilt samples when available
      • Percentage string (e.g., "50%") - Samples that percentage
      • Float 0-1 (e.g., 0.5) - Samples that fraction
  • date (optional, default: "latest"): Wikipedia dump date in yyyymmdd format
  • use_train_split (optional, default: False): Force sampling from full "train" split, ignoring prebuilt samples
  • shuffle (optional, default: False): Proportionally interleave languages
  • seed (optional, default: 42): Random seed for reproducibility
  • num_proc (optional): Number of parallel processes
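
Putting the options together, a configuration that exercises every documented parameter might look like this (the dump date is a placeholder — any published yyyymmdd dump should work):

from wikisets import Wikiset, WikisetConfig

config = WikisetConfig(
    languages=[
        {"lang": "en", "size": 10000},      # prebuilt 10k sample when available
        {"lang": "simple", "size": "25%"},  # percentage of the dump
        {"lang": "ar", "size": 0.1},        # fraction of the dump
    ],
    date="20250101",        # placeholder yyyymmdd dump date; default "latest"
    use_train_split=False,  # set True to sample the full train split instead
    shuffle=True,           # proportionally interleave the languages
    seed=42,
    num_proc=4,
)
dataset = Wikiset.create(config)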

Usage Examples

Basic Usage

from wikisets import Wikiset, WikisetConfig

config = WikisetConfig(
    languages=[{"lang": "en", "size": 5000}]
)
dataset = Wikiset.create(config)

# Wikiset is just an HF Dataset
dataset.push_to_hub("my-wiki-dataset")

Pretraining with Chunking

# Create base dataset
config = WikisetConfig(
    languages=[
        {"lang": "en", "size": 10000},
        {"lang": "ar", "size": 5000},
    ]
)
dataset = Wikiset.create(config)

# Convert to pretraining format with 2048 token chunks
pretrain_dataset = dataset.to_pretrain(
    split_token_len=2048,
    tokenizer="gpt2",
    nearest_delimiter="newline",
    num_proc=4
)

# Transform it like any dataset (the mapped function must return a dict)
pretrain_dataset = pretrain_dataset.map(lambda x: {"text": x["text"].upper()})

# It's still just a HuggingFace Dataset
pretrain_dataset.push_to_hub("my-wiki-pretraining-dataset")
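
To sanity-check the chunking, you can re-tokenize a few rows and confirm they land near the target length. This sketch assumes each output row keeps a text column, as in the base dataset:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # same tokenizer used for chunking
for row in pretrain_dataset.select(range(3)):
    n_tokens = len(tokenizer(row["text"])["input_ids"])
    print(n_tokens)  # expected to stay around split_token_len (2048)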

Documentation

Builds on wikipedia-monthly

Wikisets is built on top of omarkamali/wikipedia-monthly, which provides:

  • Fresh Wikipedia dumps updated monthly
  • Clean, preprocessed text
  • 300+ languages
  • Prebuilt 1k/5k/10k samples for large languages
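
You can also load wikipedia-monthly directly with the datasets library; Wikisets automates the sampling and mixing on top of it. The config name below follows a "<date>.<lang>" pattern, which is an assumption — check the dataset card for the exact naming:

from datasets import load_dataset

# "latest.en" is an assumed config name; consult the wikipedia-monthly card
wiki_en = load_dataset("omarkamali/wikipedia-monthly", "latest.en", split="train")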

Wikisets adds:

  • Simple configuration-based building
  • Intelligent sampling strategies
  • Multi-language mixing (illustrated below)
  • Pretraining utilities
  • Comprehensive dataset cards
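
What multi-language mixing with shuffle=True means in practice: languages are drawn in proportion to their sampled sizes rather than concatenated end to end. The effect is similar to the datasets library's interleave_datasets — an illustration of the idea, not Wikisets' internals:

from datasets import Dataset, interleave_datasets

en = Dataset.from_dict({"text": ["en article"] * 8, "lang": ["en"] * 8})
fr = Dataset.from_dict({"text": ["fr article"] * 2, "lang": ["fr"] * 2})

# Draw from each source in proportion to its size, with a fixed seed
mixed = interleave_datasets(
    [en, fr],
    probabilities=[0.8, 0.2],
    seed=42,
    stopping_strategy="all_exhausted",
)
print(mixed["lang"])  # languages appear interleaved, roughly 4:1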

Citation

@software{wikisets2025,
  author = {Omar Kamali},
  title = {Wikisets: Flexible Wikipedia Dataset Builder},
  year = {2025},
  url = {https://github.com/omarkamali/wikisets}
}

License

MIT License - see LICENSE for details.
