Skip to main content

Wikipedia Efficient Compression & Decompression (WikECD): optimal revision partitioning, fast retrieval, and analytics.

Project description

๐Ÿง  WikECD: Wikipedia Efficient Compression & Decompression

PyPI Python License: MIT Build Status Downloads

WikECD is a Python library and command-line toolkit for efficiently compressing and retrieving Wikipedia revision histories.
It implements a knapsack-optimized partitioning algorithm to balance storage space and retrieval time, enabling fast access to any revision without fully decompressing the entire article.


๐ŸŒ Overview

Wikipedia articles have thousands of revisions โ€” each slightly different from the previous one.
Naรฏvely storing all versions wastes storage, while delta-compression makes retrieval slow.
WikECD solves this trade-off by modeling revision storage as an optimization problem, finding an optimal partition of revisions that minimizes space under a fixed time cost constraint.


โš™๏ธ Core Idea

Given a sequence of article revisions:

$R = {r_0, r_1, ..., r_n}$

Each revision has a size:

$S = {s_i \mid s_i = ||r_i||, 0 \le i \le n}$

We can store revisions as:

  • Full copies (large space, fast retrieval), or
  • Deltas (small space, slower retrieval).

๐ŸŽฏ Objective

Find an optimal partition $( P = {p_1, p_2, ..., p_k} )$ such that:

  • Space cost $( \sum f^-(p_j) )$ is minimized
  • Time cost $( \sum t(p_j) ) โ‰ค$ fixed time budget $( C )$

This reduces to a 0/1 Knapsack problem, where:

  • Items = revision pairs
  • Profit = memory saved
  • Weight = retrieval cost
  • Capacity = maximum time cost

๐Ÿงฎ Algorithm Summary

Step Description
1๏ธโƒฃ Compute revision sizes and approximate diffs: `
2๏ธโƒฃ Compute memory saved and time cost for each diff
3๏ธโƒฃ Solve 0/1 Knapsack to select optimal delta positions
4๏ธโƒฃ Build partition set ( P ) based on selected transitions
5๏ธโƒฃ Store full revisions (anchors) and deltas (patches)
6๏ธโƒฃ Retrieval reconstructs any revision by applying minimal diffs from nearest anchor

Features

Category Feature
๐Ÿ—‚๏ธ Data Sources - Wikipedia API (with continuation & polite User-Agent)
- Wikipedia XML dump parser
๐Ÿ“ฆ Compression - Knapsack-based optimal partitioning
- Linear diff approximation
- Metadata (IDs, timestamps, partitions)
๐Ÿ’พ Storage - JSON+gzip compressed format
- Supports serialization/deserialization
๐Ÿ” Retrieval - Retrieve by index range
- Retrieve by revision ID
- Retrieve by timestamp range
โš™๏ธ CLI Tool - wikecd compress-api
- wikecd compress-xml
- wikecd retrieve
- wikecd retrieve-by-id
- wikecd retrieve-by-time
๐Ÿง  Extensibility - Pluggable diffing algorithms
- SQLite backend (planned)
- FastAPI microservice (planned)

๐Ÿงฑ Architecture

WikECD/
โ”œโ”€โ”€ sources/
โ”‚ โ”œโ”€โ”€ base.py
โ”‚ โ”œโ”€โ”€ xml_parser.py
โ”‚ โ””โ”€โ”€ api_client.py
โ”‚ # Data extraction (API, XML)
โ”œโ”€โ”€ compression/
โ”‚ โ”œโ”€โ”€ diff_utils.py
โ”‚ โ”œโ”€โ”€ knapsack.py
โ”‚ โ”œโ”€โ”€ partitioner.py
โ”‚ โ””โ”€โ”€ compressor.py
โ”‚ # Knapsack-based compression
โ”œโ”€โ”€ retrieval/
โ”‚ โ”œโ”€โ”€ retrieval.py
โ”‚ โ””โ”€โ”€ query.py
โ”‚ # Efficient revision reconstruction
โ”œโ”€โ”€ storage/
โ”‚ โ”œโ”€โ”€ compressed_store.py
โ”‚ โ””โ”€โ”€ serializer.py
โ”‚ # Compressed representation & persistence
โ”œโ”€โ”€ cli.py # Command-line interface
โ”œโ”€โ”€ examples/ # Usage demos
โ””โ”€โ”€ tests/ # Unit tests\

๐Ÿš€ Installation

# Clone the repository
git clone https://github.com/<yourusername>/WikECD.git
cd WikECD

# Install in editable/development mode
pip install -e .

Requires Python โ‰ฅ 3.9
Dependencies: requests, gzip, difflib

Usage Examples

1. Compress using Wikipedia API

wikecd compress-api \
  --title "Python (programming language)" \
  --limit 40 \
  --out python.comp.gz \
  --user-agent "WikECD/0.1 (+https://github.com/you; contact: you@example.com)"

2. Retrieve by revision indices

wikecd retrieve \
  --in python.comp.gz \
  --start 10 \
  --length 5 \
  --print

Retrieves revisions [10..15] and prints the last one.

3. Retrieve by Revision ID

wikecd retrieve-by-id \
  --in python.comp.gz \
  --ids 123456789,123456790 \
  --print

Retrieves the specified revision IDs.

4. Retrieve by Timestamp Range

wikecd retrieve-by-time \
  --in python.comp.gz \
  --start-ts 2024-01-01 \
  --end-ts 2024-02-01 \
  --print

5. Retrieve using Heuristic DP

Fast greedy:

wikecd compress-api \
  --title "Python (programming language)" \
  --limit 500 --out python.comp.gz \
  --user-agent "WikECD/0.1 (+contact: you@example.com)" \
  --solver heuristic --strategy greedy

Provable $(1 โˆ’ ๐œ€)$ $(1โˆ’ฮต)$ approx with $๐œ€ = 0.05$

$ฮต=0.05$:

wikecd compress-api \
  --title "Python (programming language)" \
  --limit 1000 --out python.comp.gz \
  --user-agent "WikECD/0.1 (+contact: you@example.com)" \
  --solver heuristic --strategy fptas --eps 0.05

Often-exact but memory-bounded:

wikecd compress-api \
  --title "Python (programming language)" \
  --limit 800 --out python.comp.gz \
  --user-agent "WikECD/0.1 (+contact: you@example.com)" \
  --solver heuristic --strategy sparse --max-states 200000

Exact DP (old behavior):

wikecd compress-api \
  --title "Python (programming language)" \
  --limit 300 --out python.comp.gz \
  --user-agent "WikECD/0.1 (+contact: you@example.com)" \
  --solver exact

Retrieves all revisions made between Jan 1 and Feb 1, 2024.

Programmatic API

Compress and save:

from WikECD.sources.api_client import MediaWikiAPISource
from WikECD.compression.compressor import compress_article
from WikECD.storage.serializer import save

src = MediaWikiAPISource(user_agent="WikECD/0.1 (+https://github.com/you; contact: you@example.com)")
revs = list(src.get_revisions(title="Python (programming language)", limit=40))
article = compress_article("Python (programming language)", revs)
base_texts = {b: revs[b].text for b in article.anchors}
save("python.comp.gz", article, base_texts)

Load and retrieve:

from WikECD.storage.serializer import load
from WikECD.retrieval.query import retrieve_by_time

article, base_texts = load("python.comp.gz")
texts = retrieve_by_time(article, base_texts, start="2024-01-01", end="2024-01-31")

Metadata Stored

Each compressed article includes:

{
  "title": "Python (programming language)",
  "anchors": [0, 7, 13, 21],
  "meta": {
    "count": 40,
    "revids": [...],
    "timestamps": [...],
    "partitions": [[0,1,2,3],[4,5,6],...]
  }
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wikecd-0.1.1.tar.gz (46.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

wikecd-0.1.1-py3-none-any.whl (51.4 kB view details)

Uploaded Python 3

File details

Details for the file wikecd-0.1.1.tar.gz.

File metadata

  • Download URL: wikecd-0.1.1.tar.gz
  • Upload date:
  • Size: 46.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.2

File hashes

Hashes for wikecd-0.1.1.tar.gz
Algorithm Hash digest
SHA256 872df2a4fe89295a040ac410ac33881065405c3e8b2a00fbe12261ebf6f7ea00
MD5 5f9a03b3f1cd8c209f3aad2410b2587b
BLAKE2b-256 d8dd69c5516a977ea9659162eeb5a1e5003bf9938ac37b9ac31915f244f67bad

See more details on using hashes here.

File details

Details for the file wikecd-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: wikecd-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 51.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.2

File hashes

Hashes for wikecd-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 c5a002b0958c46c22e4c21cf51dc6cccaac17a4af95bdb000a1cecbd9a6c863d
MD5 51975ca3ec37fb79e104ed2253aae6c7
BLAKE2b-256 bffc642874b4fea32daa73af842c609d817f690b3597ba495845a54cc5baf120

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page