A package to process data from Wikimedia using the server-sent events (SSE) protocol.
Welcome to StatSpEdia
A tool written in Python 3.13 that uses the asynchronous aiohttp package to fetch and process data from Wikimedia over the server-sent events (SSE) protocol.
All data is stored as individual documents in a local MongoDB database.
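As a rough illustration of what consuming a server-sent events stream involves, the sketch below buffers incoming text chunks and returns a parsed JSON object once a complete event (terminated by a blank line) has arrived. The class name `SSEBuffer` and its `feed` method are hypothetical and are not part of statspedia; the package's own parsing may differ. The sketch also assumes `\n` line endings and JSON payloads in the `data:` field.

```python
import json


class SSEBuffer:
    """Accumulate raw SSE text and return complete JSON payloads (illustrative only)."""

    def __init__(self) -> None:
        self._pending = ""

    def feed(self, chunk: str) -> list[dict]:
        """Add a chunk of text and return every event that the chunk completes."""
        self._pending += chunk
        events = []
        # A blank line marks the end of one SSE event.
        while "\n\n" in self._pending:
            block, self._pending = self._pending.split("\n\n", 1)
            # Keep only the "data:" fields; other fields and comments are ignored.
            data_lines = [
                line[5:].removeprefix(" ")
                for line in block.split("\n")
                if line.startswith("data:")
            ]
            if data_lines:
                events.append(json.loads("\n".join(data_lines)))
        return events
```

A chunk that ends in the middle of an event simply stays in the buffer until a later chunk supplies the rest.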
Prerequisites
Prior to installing the Python package, please install MongoDB Community Edition on your machine by following the MongoDB installation guide.
Installation
To install a local copy, run:
pip install statspedia
Example Usage
Create an Instance of the WikiStream Class
from statspedia import WikiStream
import asyncio

async def main():
    # Create the stream client and run it until the program is interrupted.
    ws = WikiStream()
    return await ws.stream()

asyncio.run(main())
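For short experiments it can be handy to stop the stream after a fixed time instead of interrupting it by hand. The sketch below wraps any coroutine in `asyncio.wait_for`. The helper name `run_for` is hypothetical, and the sketch assumes that `ws.stream()` tolerates being cancelled this way, which is not confirmed by the package.

```python
import asyncio


async def run_for(coro, seconds: float) -> bool:
    """Await `coro` for at most `seconds`; return True if it finished on its own."""
    try:
        await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        # wait_for cancels the coroutine before raising the timeout.
        return False
    return True


# Hypothetical usage with statspedia (not executed here):
#
#     from statspedia import WikiStream
#     asyncio.run(run_for(WikiStream().stream(), seconds=60))
```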
Program Console Output
By default, logs are printed to the console and also written to a logs/ folder in the root directory.
A sample log output is as follows:
2025-06-08 14:25:13,787 - statspedia.wiki_stream - DEBUG - Buffer will be cleared when chunk completes object
2025-06-08 14:25:31,036 - statspedia.wiki_stream - DEBUG - HTTP chunk does not contain full object.
2025-06-08 14:25:31,036 - statspedia.wiki_stream - DEBUG - Buffer will be cleared when chunk completes object
2025-06-08 14:25:37,384 - statspedia.wiki_stream - DEBUG - Wiki Edit List Count is 74. Clearing and Saving to MongoDB
2025-06-08 14:25:37,387 - statspedia.wiki_stream - DEBUG - A new deep copy of Wiki Edit List was created successfully
2025-06-08 14:25:37,393 - statspedia.wiki_stream - INFO - Wiki Edit List written to latest_edits collection in MongoDB
2025-06-08 14:25:37,394 - statspedia.wiki_stream - DEBUG - Wiki Edit List succesfully cleared.
2025-06-08 14:25:37,395 - statspedia.wiki_stream - DEBUG - Program started at: 2025-06-08 01:42:55.524709+00:00
2025-06-08 14:25:37,395 - statspedia.wiki_stream - DEBUG - Current hour: 2025-06-08 21:00:00+00:00
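The log lines above follow a `timestamp - logger - level - message` layout, so they are straightforward to post-process. The sketch below splits one line into its four fields. The function name `parse_log_line` is hypothetical, and the layout is inferred from the sample output rather than from the package's logging configuration.

```python
from datetime import datetime


def parse_log_line(line: str) -> dict:
    """Split one log line into timestamp, logger name, level, and message."""
    # maxsplit=3 keeps any " - " inside the message itself intact.
    stamp, logger, level, message = line.rstrip("\n").split(" - ", 3)
    return {
        # The part after the comma is milliseconds; %f reads it as a fraction.
        "time": datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f"),
        "logger": logger,
        "level": level,
        "message": message,
    }
```

Filtering a log file for `INFO` entries then reduces to checking `parse_log_line(line)["level"]` for each line.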
Data Schema and Basic Queries
Every server-sent event from the English Wikipedia is saved as a document in MongoDB, in a database named wiki_stream under the collection latest_changes. Every hour, the program summarizes the previous hour's data in a collection named statistics in the same database. Both collections can be queried from the mongosh shell or with the Python driver pymongo.
The schema for the latest_changes documents is:
[
  {
    "_id": "ObjectId()",
    "$schema": "/mediawiki/recentchange/1.0.0",
    "meta": {
      "uri": "string",
      "request_id": "string",
      "id": "string",
      "dt": "ISODate()",
      "domain": "en.wikipedia.org",
      "stream": "mediawiki.recentchange",
      "topic": "eqiad.mediawiki.recentchange",
      "partition": "int",
      "offset": "Long()"
    },
    "id": "int",
    "type": "edit",
    "namespace": "int",
    "title": "string",
    "title_url": "string",
    "comment": "string",
    "timestamp": "int",
    "user": "string",
    "bot": "bool",
    "notify_url": "string",
    "minor": "bool",
    "length": { "old": "int", "new": "int" },
    "revision": { "old": "int", "new": "int" },
    "server_url": "https://en.wikipedia.org",
    "server_name": "en.wikipedia.org",
    "server_script_path": "/w",
    "wiki": "enwiki",
    "parsedcomment": "string",
    "bytes_change": "int"
  }
]
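The `bytes_change` field does not appear to be part of the upstream `mediawiki/recentchange` schema, so it is presumably derived by the package. One plausible derivation, shown here purely as an assumption, is the difference between `length.new` and `length.old`. The helper name `bytes_change` is hypothetical.

```python
def bytes_change(event: dict) -> int:
    """Return new length minus old length for one event (assumed derivation)."""
    length = event.get("length") or {}
    # Treat a missing or null length as 0 so the subtraction never fails.
    old = length.get("old") or 0
    new = length.get("new") or 0
    return new - old


print(bytes_change({"length": {"old": 100, "new": 140}}))  # growth of 40 bytes
print(bytes_change({"length": {"old": 500, "new": 120}}))  # shrink of 380 bytes
```

Under this assumption a positive value means the edit added bytes and a negative value means it removed them.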
The schema for statistics is:
{
  "most_data_added": {},
  "most_data_removed": {},
  "top_editors": {},
  "top_editors_bots": {},
  "all_editors": {},
  "all_editors_bots": {},
  "top_edited_articles": {},
  "all_edited_articles": {},
  "num_edited_articles": "int",
  "num_editors": "int",
  "num_editors_bots": "int",
  "num_edits": "int",
  "bytes_added": "int",
  "bytes_removed": "int",
  "total_bytes_change": "int",
  "timestamp": "ISODate()"
}
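To make the meaning of these fields more concrete, the sketch below derives most of them from a list of edit documents in plain Python. It is not the package's implementation: the function name `summarize` is hypothetical, `bytes_removed` is assumed to be kept as a negative number, and `most_data_added`, `most_data_removed`, and `timestamp` are left out because their exact shape is not described here.

```python
from collections import Counter


def summarize(edits: list[dict], top_n: int = 10) -> dict:
    """Build an hourly-style summary from edit documents (illustrative only)."""
    humans = Counter(e["user"] for e in edits if not e["bot"])
    bots = Counter(e["user"] for e in edits if e["bot"])
    articles = Counter(e["title"] for e in edits)
    # Assumption: additions are positive changes, removals are negative ones.
    added = sum(e["bytes_change"] for e in edits if e["bytes_change"] > 0)
    removed = sum(e["bytes_change"] for e in edits if e["bytes_change"] < 0)
    return {
        "top_editors": dict(humans.most_common(top_n)),
        "top_editors_bots": dict(bots.most_common(top_n)),
        "all_editors": dict(humans),
        "all_editors_bots": dict(bots),
        "top_edited_articles": dict(articles.most_common(top_n)),
        "all_edited_articles": dict(articles),
        "num_edited_articles": len(articles),
        "num_editors": len(humans),
        "num_editors_bots": len(bots),
        "num_edits": len(edits),
        "bytes_added": added,
        "bytes_removed": removed,
        "total_bytes_change": added + removed,
    }
```

The dictionary-valued fields map a user name or article title to an edit count, which matches how the query examples in this section iterate over them with `.items()`.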
Below are some simple examples of how to query the database using pymongo. For more information on queries, please see the pymongo documentation.
from pymongo import MongoClient
from pymongo.cursor import Cursor

# Connect to the local MongoDB instance that statspedia writes to.
client = MongoClient(host='mongodb://127.0.0.1', port=27017)
db = client.wiki_stream
collection1 = db.statistics      # hourly summaries
collection2 = db.latest_changes  # individual edit events


def create_cur(field: str, collection) -> Cursor:
    """Return a cursor over documents containing `field`, projecting only that field."""
    cur = collection.find({field: {'$exists': 1}}, {'_id': 0, field: 1})
    return cur


def edit_count_by_user(cur: Cursor, field: str):
    """Sum the per-key counts stored under `field`; return the top 10 and the key count."""
    user_edit_dict = {}
    for i in cur:
        for user, edit_count in i[field].items():
            user_edit_dict[user] = user_edit_dict.get(user, 0) + edit_count
    total_unique_editors = len(user_edit_dict)
    sorted_user_edit_dict = dict(sorted(user_edit_dict.items(),
                                        key=lambda item: item[1],
                                        reverse=True)[0:10])
    return sorted_user_edit_dict, total_unique_editors


def edit_count_by_document(cur: Cursor, field: str):
    """Count documents per value of `field`; return the top 10, unique values, and total."""
    document_edit_dict = {}
    count = 0
    for i in cur:
        document_title = i[field]
        document_edit_dict[document_title] = document_edit_dict.get(document_title, 0) + 1
        count += 1
    total_documents_edited = len(document_edit_dict)
    sorted_document_edit_dict = dict(sorted(document_edit_dict.items(),
                                            key=lambda item: item[1],
                                            reverse=True)[0:10])
    return sorted_document_edit_dict, total_documents_edited, count


def sum_across_all_stats(cur: Cursor, field: str):
    """Add up a numeric field across every document in the cursor."""
    total = 0
    for i in cur:
        total += i[field]
    return total
# Top human editors, aggregated over every hourly statistics document.
cur = create_cur('all_editors', collection1)
users, num_unique_users = edit_count_by_user(cur, 'all_editors')
print(f"Top Editors (Human) All Time: {users}")
print(f"Total Editors (Human) All Time: {num_unique_users}")

# Top bot editors.
cur2 = create_cur('all_editors_bots', collection1)
users, num_unique_users_bots = edit_count_by_user(cur2, 'all_editors_bots')
print(f"Top Editors (Bots) All Time: {users}")
print(f"Total Editors (Bots) All Time: {num_unique_users_bots}")

# Totals summed across all hourly statistics documents.
cur3 = create_cur('num_edits', collection1)
num_edits = sum_across_all_stats(cur3, 'num_edits')
print(f"Total Edits All Time {num_edits}")

cur4 = create_cur('bytes_added', collection1)
bytes_added = sum_across_all_stats(cur4, 'bytes_added')
print(f"Total MB Added All Time {bytes_added/1e6}")

cur5 = create_cur('bytes_removed', collection1)
bytes_removed = sum_across_all_stats(cur5, 'bytes_removed')
print(f"Total MB Removed All Time {bytes_removed/1e6}")

cur6 = create_cur('total_bytes_change', collection1)
bytes_change = sum_across_all_stats(cur6, 'total_bytes_change')
print(f"Total MB Change All Time {bytes_change/1e6}")

# Timestamp of the first statistics document returned by the cursor.
cur7 = create_cur('timestamp', collection1)
for i in cur7:
    print(f"Data recording started on: {i['timestamp']}")
    break

# Most edited titles, counted directly from the individual edit events.
cur8 = create_cur('title', collection2)
top_docs_edited, total_docs_edited, count = edit_count_by_document(cur8, 'title')
print(f"Most Edited Docs: {top_docs_edited}")
print(f"Total Edited Docs: {total_docs_edited}")
print(count)  # total number of edit events scanned

# The same ranking from the hourly statistics; the field maps title -> count,
# so edit_count_by_user can be reused.
cur9 = create_cur('all_edited_articles', collection1)
articles, total_articles = edit_count_by_user(cur9, 'all_edited_articles')
print(f"Most Edited Docs: {articles}")
print(f"Total Edited Docs: {total_articles}")
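The examples above pull every matching document into Python before counting. For larger collections, the same ranking can usually be pushed to the server with an aggregation pipeline. The sketch below only builds the pipeline, using MongoDB's `$match`, `$sortByCount`, and `$limit` stages. The helper name `top_values_pipeline` is hypothetical, and running the pipeline requires a live MongoDB instance.

```python
def top_values_pipeline(field: str, limit: int = 10) -> list[dict]:
    """Build a pipeline that ranks the values of `field` by how often they occur."""
    return [
        # Skip documents that do not have the field at all.
        {"$match": {field: {"$exists": True}}},
        # Group by the field value and sort by descending count.
        {"$sortByCount": f"${field}"},
        {"$limit": limit},
    ]


# Usage against the latest_changes collection (not executed here):
#
#     for row in db.latest_changes.aggregate(top_values_pipeline("title")):
#         print(row["_id"], row["count"])
```

Each result document carries the grouped value in `_id` and its frequency in `count`.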
Stopping the Program
The program can be stopped safely with Ctrl+C, which cancels all active async tasks.
The following is then printed to the console:
All tasks cancelled.
Elapsed Time: 0.0 days 0.0 hours 0.0 mins 18.9 secs
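For readers curious how a shutdown like this can be put together with asyncio, the sketch below cancels a set of tasks and formats an elapsed time in the same style as the output above. Both helper names (`cancel_all`, `format_elapsed`) are hypothetical and do not describe the package's actual shutdown code.

```python
import asyncio


def format_elapsed(seconds: float) -> str:
    """Break a duration in seconds into days, hours, minutes, and seconds."""
    days, rem = divmod(float(seconds), 86400)
    hours, rem = divmod(rem, 3600)
    mins, secs = divmod(rem, 60)
    return f"{days} days {hours} hours {mins} mins {secs:.1f} secs"


async def cancel_all(tasks) -> int:
    """Cancel every task, wait for them to finish, and return how many were cancelled."""
    for task in tasks:
        task.cancel()
    # return_exceptions=True collects each CancelledError instead of raising it.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return sum(isinstance(r, asyncio.CancelledError) for r in results)
```

A caller could catch `KeyboardInterrupt` around `asyncio.run(...)`, or handle cancellation inside the main coroutine, and then print `format_elapsed(...)` of the measured run time.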
License
MIT
Project Status
In development.
Authors
John Glauber
Contact
For any questions, comments, or suggestions please reach out via email to:
John Glauber
johnbglauber@gmail.com