
Wikisets

Python 3.9+ · MIT License

Flexible Wikipedia dataset builder with sampling and pretraining support. Built on top of wikipedia-monthly, which provides fresh, clean Wikipedia dumps updated monthly.

Features

  • 🌍 Multi-language support - Access Wikipedia in any language
  • 📊 Flexible sampling - Use exact sizes, percentages, or prebuilt samples (1k/5k/10k)
  • 💾 Memory efficient - Reservoir sampling for large datasets (sketched after this list)
  • 🔄 Reproducible - Deterministic sampling with seeds
  • 📦 HuggingFace compatible - Subclasses datasets.Dataset
  • ✂️ Pretraining ready - Built-in text chunking with tokenizer support
  • 📝 Auto-generated cards - Comprehensive dataset documentation
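
Reservoir sampling is what keeps memory usage flat: it holds a fixed-size buffer and replaces entries with decreasing probability as the stream goes by, so the full dataset is never materialized. Here is a minimal sketch of the technique itself (Algorithm R) — not Wikisets' internal code, and sample_reservoir is a hypothetical name:

import random

def sample_reservoir(stream, k, seed=42):
    """Uniformly sample k items from a stream of unknown length,
    holding only k items in memory at any time."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # keep item i with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# 5 articles sampled uniformly from a stream of 1,000,000, in O(k) memory
sample = sample_reservoir(range(1_000_000), k=5)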

Installation

pip install wikisets

Or with uv:

# Preferred: Add to your project
uv add wikisets

# Or just install
uv pip install wikisets

Quick Start

from wikisets import Wikiset, WikisetConfig

# Create a multi-language dataset
config = WikisetConfig(
    languages=[
        {"lang": "en", "size": 10000},      # 10k sample
        {"lang": "fr", "size": "50%"},      # 50% of French Wikipedia
        {"lang": "ar", "size": 0.1},        # 10% of Arabic Wikipedia
    ],
    seed=42
)

dataset = Wikiset.create(config)

# Access like any HuggingFace dataset
print(len(dataset))
print(dataset[0])

# View dataset card
print(dataset.get_card())

Configuration Options

WikisetConfig Parameters

  • languages (required): List of {lang: str, size: int|float|str} dictionaries
    • lang: Language code (e.g., "en", "fr", "ar", "simple")
    • size: Can be:
      • Integer (e.g., 1000, 5000, 10000) - Uses prebuilt samples when available
      • Percentage string (e.g., "50%") - Samples that percentage
      • Float 0-1 (e.g., 0.5) - Samples that fraction
  • date (optional, default: "latest"): Wikipedia dump date in yyyymmdd format
  • use_train_split (optional, default: False): Force sampling from full "train" split, ignoring prebuilt samples
  • shuffle (optional, default: False): Proportionally interleave languages
  • seed (optional, default: 42): Random seed for reproducibility
  • num_proc (optional): Number of parallel processes
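
Putting the options together, a configuration that exercises every documented parameter might look like this (the dump date is a placeholder — any published yyyymmdd dump should work):

from wikisets import Wikiset, WikisetConfig

config = WikisetConfig(
    languages=[
        {"lang": "en", "size": 10000},      # prebuilt 10k sample when available
        {"lang": "simple", "size": "25%"},  # percentage of the dump
        {"lang": "ar", "size": 0.1},        # fraction of the dump
    ],
    date="20250101",        # placeholder yyyymmdd dump date; default "latest"
    use_train_split=False,  # set True to sample the full train split instead
    shuffle=True,           # proportionally interleave the languages
    seed=42,
    num_proc=4,
)
dataset = Wikiset.create(config)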

Usage Examples

Basic Usage

from wikisets import Wikiset, WikisetConfig

config = WikisetConfig(
    languages=[{"lang": "en", "size": 5000}]
)
dataset = Wikiset.create(config)

# Wikiset is just an HF Dataset
dataset.push_to_hub("my-wiki-dataset")

Pretraining with Chunking

# Create base dataset
config = WikisetConfig(
    languages=[
        {"lang": "en", "size": 10000},
        {"lang": "ar", "size": 5000},
    ]
)
dataset = Wikiset.create(config)

# Convert to pretraining format with 2048 token chunks
pretrain_dataset = dataset.to_pretrain(
    split_token_len=2048,
    tokenizer="gpt2",
    nearest_delimiter="newline",
    num_proc=4
)

# Transform it like any dataset (the mapped function must return a dict)
pretrain_dataset = pretrain_dataset.map(lambda x: {"text": x["text"].upper()})

# It's still just a HuggingFace Dataset
pretrain_dataset.push_to_hub("my-wiki-pretraining-dataset")
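
To sanity-check the chunking, you can re-tokenize a few rows and confirm they land near the target length. This sketch assumes each output row keeps a text column, as in the base dataset:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # same tokenizer used for chunking
for row in pretrain_dataset.select(range(3)):
    n_tokens = len(tokenizer(row["text"])["input_ids"])
    print(n_tokens)  # expected to stay around split_token_len (2048)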

Documentation

Builds on wikipedia-monthly

Wikisets is built on top of omarkamali/wikipedia-monthly, which provides:

  • Fresh Wikipedia dumps updated monthly
  • Clean, preprocessed text
  • 300+ languages
  • Prebuilt 1k/5k/10k samples for large languages
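
You can also load wikipedia-monthly directly with the datasets library; Wikisets automates the sampling and mixing on top of it. The config name below follows a "<date>.<lang>" pattern, which is an assumption — check the dataset card for the exact naming:

from datasets import load_dataset

# "latest.en" is an assumed config name; consult the wikipedia-monthly card
wiki_en = load_dataset("omarkamali/wikipedia-monthly", "latest.en", split="train")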

Wikisets adds:

  • Simple configuration-based building
  • Intelligent sampling strategies
  • Multi-language mixing (illustrated below)
  • Pretraining utilities
  • Comprehensive dataset cards
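
What multi-language mixing with shuffle=True means in practice: languages are drawn in proportion to their sampled sizes rather than concatenated end to end. The effect is similar to the datasets library's interleave_datasets — an illustration of the idea, not Wikisets' internals:

from datasets import Dataset, interleave_datasets

en = Dataset.from_dict({"text": ["en article"] * 8, "lang": ["en"] * 8})
fr = Dataset.from_dict({"text": ["fr article"] * 2, "lang": ["fr"] * 2})

# Draw from each source in proportion to its size, with a fixed seed
mixed = interleave_datasets(
    [en, fr],
    probabilities=[0.8, 0.2],
    seed=42,
    stopping_strategy="all_exhausted",
)
print(mixed["lang"])  # languages appear interleaved, roughly 4:1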

Citation

@software{wikisets2025,
  author = {Omar Kamali},
  title = {Wikisets: Flexible Wikipedia Dataset Builder},
  year = {2025},
  url = {https://github.com/omarkamali/wikisets}
}

License

MIT License - see LICENSE for details.
