Wikipedia Efficient Compression & Decompression (WikECD): optimal revision partitioning, fast retrieval, and analytics.
Project description
๐ง WikECD: Wikipedia Efficient Compression & Decompression
WikECD is a Python library and command-line toolkit for efficiently compressing and retrieving Wikipedia revision histories.
It implements a knapsack-optimized partitioning algorithm to balance storage space and retrieval time, enabling fast access to any revision without fully decompressing the entire article.
๐ Overview
Wikipedia articles have thousands of revisions โ each slightly different from the previous one.
Naรฏvely storing all versions wastes storage, while delta-compression makes retrieval slow.
WikECD solves this trade-off by modeling revision storage as an optimization problem, finding an optimal partition of revisions that minimizes space under a fixed time cost constraint.
โ๏ธ Core Idea
Given a sequence of article revisions:
$R = {r_0, r_1, ..., r_n}$
Each revision has a size:
$S = {s_i \mid s_i = ||r_i||, 0 \le i \le n}$
We can store revisions as:
- Full copies (large space, fast retrieval), or
- Deltas (small space, slower retrieval).
๐ฏ Objective
Find an optimal partition $( P = {p_1, p_2, ..., p_k} )$ such that:
- Space cost $( \sum f^-(p_j) )$ is minimized
- Time cost $( \sum t(p_j) ) โค$ fixed time budget $( C )$
This reduces to a 0/1 Knapsack problem, where:
- Items = revision pairs
- Profit = memory saved
- Weight = retrieval cost
- Capacity = maximum time cost
๐งฎ Algorithm Summary
| Step | Description |
|---|---|
| 1๏ธโฃ | Compute revision sizes and approximate diffs: ` |
| 2๏ธโฃ | Compute memory saved and time cost for each diff |
| 3๏ธโฃ | Solve 0/1 Knapsack to select optimal delta positions |
| 4๏ธโฃ | Build partition set ( P ) based on selected transitions |
| 5๏ธโฃ | Store full revisions (anchors) and deltas (patches) |
| 6๏ธโฃ | Retrieval reconstructs any revision by applying minimal diffs from nearest anchor |
Features
| Category | Feature |
|---|---|
| ๐๏ธ Data Sources | - Wikipedia API (with continuation & polite User-Agent) - Wikipedia XML dump parser |
| ๐ฆ Compression | - Knapsack-based optimal partitioning - Linear diff approximation - Metadata (IDs, timestamps, partitions) |
| ๐พ Storage | - JSON+gzip compressed format - Supports serialization/deserialization |
| ๐ Retrieval | - Retrieve by index range - Retrieve by revision ID - Retrieve by timestamp range |
| โ๏ธ CLI Tool | - wikecd compress-api- wikecd compress-xml- wikecd retrieve- wikecd retrieve-by-id- wikecd retrieve-by-time |
| ๐ง Extensibility | - Pluggable diffing algorithms - SQLite backend (planned) - FastAPI microservice (planned) |
๐งฑ Architecture
WikECD/
โโโ sources/
โ โโโ base.py
โ โโโ xml_parser.py
โ โโโ api_client.py
โ # Data extraction (API, XML)
โโโ compression/
โ โโโ diff_utils.py
โ โโโ knapsack.py
โ โโโ partitioner.py
โ โโโ compressor.py
โ # Knapsack-based compression
โโโ retrieval/
โ โโโ retrieval.py
โ โโโ query.py
โ # Efficient revision reconstruction
โโโ storage/
โ โโโ compressed_store.py
โ โโโ serializer.py
โ # Compressed representation & persistence
โโโ cli.py # Command-line interface
โโโ examples/ # Usage demos
โโโ tests/ # Unit tests\
๐ Installation
# Clone the repository
git clone https://github.com/<yourusername>/WikECD.git
cd WikECD
# Install in editable/development mode
pip install -e .
Requires Python โฅ 3.9
Dependencies: requests, gzip, difflib
Usage Examples
1. Compress using Wikipedia API
wikecd compress-api \
--title "Python (programming language)" \
--limit 40 \
--out python.comp.gz \
--user-agent "WikECD/0.1 (+https://github.com/you; contact: you@example.com)"
2. Retrieve by revision indices
wikecd retrieve \
--in python.comp.gz \
--start 10 \
--length 5 \
--print
Retrieves revisions [10..15] and prints the last one.
3. Retrieve by Revision ID
wikecd retrieve-by-id \
--in python.comp.gz \
--ids 123456789,123456790 \
--print
Retrieves the specified revision IDs.
4. Retrieve by Timestamp Range
wikecd retrieve-by-time \
--in python.comp.gz \
--start-ts 2024-01-01 \
--end-ts 2024-02-01 \
--print
5. Retrieve using Heuristic DP
Fast greedy:
wikecd compress-api \
--title "Python (programming language)" \
--limit 500 --out python.comp.gz \
--user-agent "WikECD/0.1 (+contact: you@example.com)" \
--solver heuristic --strategy greedy
Provable $(1 โ ๐)$ $(1โฮต)$ approx with $๐ = 0.05$
$ฮต=0.05$:
wikecd compress-api \
--title "Python (programming language)" \
--limit 1000 --out python.comp.gz \
--user-agent "WikECD/0.1 (+contact: you@example.com)" \
--solver heuristic --strategy fptas --eps 0.05
Often-exact but memory-bounded:
wikecd compress-api \
--title "Python (programming language)" \
--limit 800 --out python.comp.gz \
--user-agent "WikECD/0.1 (+contact: you@example.com)" \
--solver heuristic --strategy sparse --max-states 200000
Exact DP (old behavior):
wikecd compress-api \
--title "Python (programming language)" \
--limit 300 --out python.comp.gz \
--user-agent "WikECD/0.1 (+contact: you@example.com)" \
--solver exact
Retrieves all revisions made between Jan 1 and Feb 1, 2024.
Programmatic API
Compress and save:
from WikECD.sources.api_client import MediaWikiAPISource
from WikECD.compression.compressor import compress_article
from WikECD.storage.serializer import save
src = MediaWikiAPISource(user_agent="WikECD/0.1 (+https://github.com/you; contact: you@example.com)")
revs = list(src.get_revisions(title="Python (programming language)", limit=40))
article = compress_article("Python (programming language)", revs)
base_texts = {b: revs[b].text for b in article.anchors}
save("python.comp.gz", article, base_texts)
Load and retrieve:
from WikECD.storage.serializer import load
from WikECD.retrieval.query import retrieve_by_time
article, base_texts = load("python.comp.gz")
texts = retrieve_by_time(article, base_texts, start="2024-01-01", end="2024-01-31")
Metadata Stored
Each compressed article includes:
{
"title": "Python (programming language)",
"anchors": [0, 7, 13, 21],
"meta": {
"count": 40,
"revids": [...],
"timestamps": [...],
"partitions": [[0,1,2,3],[4,5,6],...]
}
}
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file wikecd-0.1.1.tar.gz.
File metadata
- Download URL: wikecd-0.1.1.tar.gz
- Upload date:
- Size: 46.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
872df2a4fe89295a040ac410ac33881065405c3e8b2a00fbe12261ebf6f7ea00
|
|
| MD5 |
5f9a03b3f1cd8c209f3aad2410b2587b
|
|
| BLAKE2b-256 |
d8dd69c5516a977ea9659162eeb5a1e5003bf9938ac37b9ac31915f244f67bad
|
File details
Details for the file wikecd-0.1.1-py3-none-any.whl.
File metadata
- Download URL: wikecd-0.1.1-py3-none-any.whl
- Upload date:
- Size: 51.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c5a002b0958c46c22e4c21cf51dc6cccaac17a4af95bdb000a1cecbd9a6c863d
|
|
| MD5 |
51975ca3ec37fb79e104ed2253aae6c7
|
|
| BLAKE2b-256 |
bffc642874b4fea32daa73af842c609d817f690b3597ba495845a54cc5baf120
|