Wikisets

PyPI version · Python 3.9+ · License: MIT

Flexible Wikipedia dataset builder with sampling and pretraining support. Built on top of wikipedia-monthly, which provides fresh, clean Wikipedia dumps updated monthly.

Features

  • 🌍 Multi-language support - Access Wikipedia in any language
  • 📊 Flexible sampling - Use exact sizes, percentages, or prebuilt samples (1k/5k/10k)
  • Memory efficient - Reservoir sampling for large datasets
  • 🔄 Reproducible - Deterministic sampling with seeds
  • 📦 HuggingFace compatible - Subclasses datasets.Dataset
  • ✂️ Pretraining ready - Built-in text chunking with tokenizer support
  • 📝 Auto-generated cards - Comprehensive dataset documentation

Installation

pip install wikisets

Or with uv:

# Preferred: Add to your project
uv add wikisets

# Or just install
uv pip install wikisets

Quick Start

from wikisets import Wikiset, WikisetConfig

# Create a multi-language dataset
config = WikisetConfig(
    languages=[
        {"lang": "en", "size": 10000},      # 10k sample
        {"lang": "fr", "size": "50%"},      # 50% of French Wikipedia
        {"lang": "ar", "size": 0.1},        # 10% of Arabic Wikipedia
    ],
    seed=42
)

dataset = Wikiset.create(config)

# Access like any HuggingFace dataset
print(len(dataset))
print(dataset[0])

# View dataset card
print(dataset.get_card())

Configuration Options

WikisetConfig Parameters

  • languages (required): List of {lang: str, size: int|float|str} dictionaries
    • lang: Language code (e.g., "en", "fr", "ar", "simple")
    • size: Can be:
      • Integer (e.g., 1000, 5000, 10000) - Uses prebuilt samples when available
      • Percentage string (e.g., "50%") - Samples that percentage
      • Float 0-1 (e.g., 0.5) - Samples that fraction
  • date (optional, default: "latest"): Wikipedia dump date in yyyymmdd format
  • use_train_split (optional, default: False): Force sampling from full "train" split, ignoring prebuilt samples
  • shuffle (optional, default: False): Proportionally interleave languages
  • seed (optional, default: 42): Random seed for reproducibility
  • num_proc (optional): Number of parallel processes
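
For example, a config that pins a specific dump date and forces sampling from the full train split (bypassing the prebuilt 1k/5k/10k samples) could look like the sketch below; the dump date shown is illustrative:

from wikisets import Wikiset, WikisetConfig

config = WikisetConfig(
    languages=[
        {"lang": "en", "size": 1000},       # would otherwise use the prebuilt 1k sample
        {"lang": "simple", "size": "25%"},  # 25% of Simple English Wikipedia
    ],
    date="20250101",        # illustrative yyyymmdd dump date; defaults to "latest"
    use_train_split=True,   # sample from the full train split, ignoring prebuilt samples
    seed=42,
    num_proc=4,
)
dataset = Wikiset.create(config)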

Usage Examples

Basic Usage

from wikisets import Wikiset, WikisetConfig

config = WikisetConfig(
    languages=[{"lang": "en", "size": 5000}]
)
dataset = Wikiset.create(config)

# Wikiset is just an HF Dataset
dataset.push_to_hub("my-wiki-dataset")
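
Since a Wikiset behaves like any other datasets.Dataset, the standard HuggingFace persistence methods also work. A minimal sketch (loading back returns a plain Dataset rather than a Wikiset):

from datasets import load_from_disk

# Save locally instead of (or in addition to) pushing to the Hub
dataset.save_to_disk("wiki-en-5k")

# Reload later as a standard datasets.Dataset
reloaded = load_from_disk("wiki-en-5k")
print(len(reloaded))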

Pretraining with Chunking

# Create base dataset
config = WikisetConfig(
    languages=[
        {"lang": "en", "size": 10000},
        {"lang": "ar", "size": 5000},
    ]
)
dataset = Wikiset.create(config)

# Convert to pretraining format with 2048 token chunks
pretrain_dataset = dataset.to_pretrain(
    split_token_len=2048,
    tokenizer="gpt2",
    nearest_delimiter="newline",
    num_proc=4
)

# Do whatever you want with it, e.g. uppercase every chunk
# (map expects the function to return a dict, and returns a new dataset)
pretrain_dataset = pretrain_dataset.map(lambda x: {"text": x["text"].upper()})

# It's still just a HuggingFace Dataset
pretrain_dataset.push_to_hub("my-wiki-pretraining-dataset")
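
Downstream, the chunks can be tokenized for an actual training run. A minimal sketch using the transformers library (not part of Wikisets), assuming the chunked dataset keeps the "text" column shown above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Tokenize each chunk; truncation keeps sequences within the 2048-token budget
tokenized = pretrain_dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=pretrain_dataset.column_names,
)
print(tokenized[0]["input_ids"][:10])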

Documentation

Builds on wikipedia-monthly

Wikisets is built on top of omarkamali/wikipedia-monthly, which provides:

  • Fresh Wikipedia dumps updated monthly
  • Clean, preprocessed text
  • 300+ languages
  • Prebuilt 1k/5k/10k samples for large languages

Wikisets adds:

  • Simple configuration-based building
  • Intelligent sampling strategies
  • Multi-language mixing
  • Pretraining utilities
  • Comprehensive dataset cards
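
As a quick sketch of the multi-language mixing and reproducible sampling listed above, using only the configuration options documented earlier:

from wikisets import Wikiset, WikisetConfig

config = WikisetConfig(
    languages=[
        {"lang": "en", "size": 5000},
        {"lang": "ar", "size": 5000},
    ],
    shuffle=True,  # proportionally interleave English and Arabic articles
    seed=123,
)
mixed = Wikiset.create(config)

# The same seed should reproduce the same rows in the same order
again = Wikiset.create(config)
assert mixed[0] == again[0]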

Citation

@software{wikisets2025,
  author = {Omar Kamali},
  title = {Wikisets: Flexible Wikipedia Dataset Builder},
  year = {2025},
  url = {https://github.com/omarkamali/wikisets}
}

License

MIT License - see LICENSE for details.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wikisets-0.1.3.tar.gz (19.6 kB)

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

wikisets-0.1.3-py3-none-any.whl (16.5 kB)

File details

Details for the file wikisets-0.1.3.tar.gz.

File metadata

  • Download URL: wikisets-0.1.3.tar.gz
  • Upload date:
  • Size: 19.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for wikisets-0.1.3.tar.gz

  • SHA256: aaa54821cdaddff74b96b8ab66bcaa64e03586af6a70109e9f75ccec4fc5b19a
  • MD5: d4e4cae95ee5b46aa9e4b69b803e69b1
  • BLAKE2b-256: 9c3844506dd0304ada463f48b85fdce0c3660477233080bf762197e80a85c049

Provenance

The following attestation bundles were made for wikisets-0.1.3.tar.gz:

Publisher: publish.yml on omarkamali/wikisets

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file wikisets-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: wikisets-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 16.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for wikisets-0.1.3-py3-none-any.whl

  • SHA256: 5c8d7e1e31b013fdbd06f13ac111c8f329404a3c6262592957c6f6ecb0f2b7e4
  • MD5: 7e2b45f4cf219b84e48ae1fd55882c76
  • BLAKE2b-256: 404ca0f65cca52791875db32ecef92e9da6bcb07d1c6b1a03c2661373d2d1c46

Provenance

The following attestation bundles were made for wikisets-0.1.3-py3-none-any.whl:

Publisher: publish.yml on omarkamali/wikisets

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
