Python package to download and combine parts of the MADOC dataset
pyMADOC
pyMADOC downloads and combines parts of the MADOC dataset from Zenodo (record: 14637314). MADOC contains social media posts from four platforms (Reddit, Voat, Bluesky, and Koo) and is intended for studying cross-platform content and community dynamics.
Features
- Easy download of platform-specific data files
- Automatic pairing of Reddit-Voat community data
- Both Python API and Command Line Interface
- Support for direct DataFrame loading
- Progress bars for downloads
- Efficient parquet file format
Installation
pip install pymadoc
Usage
As a Python Package
from pymadoc import list_available_data, download_file, download_community_pair
# List available platforms and communities
data_info = list_available_data()
print(data_info["platforms"]) # ['reddit', 'voat', 'bluesky', 'koo']
print(data_info["communities"]) # ['CringeAnarchy', 'fatpeoplehate', ...]
# Download a specific file
# For Reddit/Voat, specify both platform and community
file_path = download_file("reddit", community="funny", output_dir="data")
# For Bluesky/Koo, specify only platform
file_path = download_file("bluesky", output_dir="data")
# Load directly as DataFrame
df = download_file("reddit", community="funny", as_dataframe=True)
# Download and combine Reddit-Voat community pair
# As files
reddit_file, voat_file = download_community_pair("funny", output_dir="data")
# As combined DataFrame
combined_df = download_community_pair("funny", as_dataframe=True)
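When several communities are needed, one failed download should not abort the rest. The sketch below loops over community names and records successes and failures separately. `download_many` is a hypothetical helper, not part of pyMADOC; it takes the download function as an argument, so it makes no assumption about pyMADOC beyond the `download_community_pair(name, output_dir=...)` call shown above.

```python
from typing import Callable, Dict, Iterable, Tuple


def download_many(
    communities: Iterable[str],
    downloader: Callable[..., object],
    output_dir: str = "data",
) -> Tuple[Dict[str, object], Dict[str, str]]:
    """Call `downloader` once per community; return (results, failures).

    Hypothetical helper: `downloader` is expected to accept a community
    name plus an `output_dir` keyword, as download_community_pair does
    in the usage above.
    """
    results: Dict[str, object] = {}
    failures: Dict[str, str] = {}
    for name in communities:
        try:
            results[name] = downloader(name, output_dir=output_dir)
        except Exception as exc:
            # Record the error and continue with the next community.
            failures[name] = f"{type(exc).__name__}: {exc}"
    return results, failures


# Usage (needs network access and `pip install pymadoc`):
#   from pymadoc import download_community_pair
#   done, failed = download_many(["funny", "gaming"], download_community_pair)
#   for name, error in failed.items():
#       print(f"{name}: {error}")
```

Catching the broad `Exception` is a deliberate choice for a batch job; the exceptions pyMADOC actually raises are not documented here, so narrow the clause once you know them.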
Command Line Interface
List available platforms and communities:
pymadoc list
Download a specific file:
# Reddit/Voat (requires community)
pymadoc download reddit --community funny --output-dir data
# Bluesky/Koo
pymadoc download bluesky --output-dir data
Download Reddit-Voat community pair:
pymadoc pair funny --output-dir data
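The CLI can also be driven from a Python script, for example in a pipeline that shells out to each tool. The sketch below builds the argument list for `pymadoc download` using only the flags shown above. `build_download_command` is a hypothetical helper; rejecting a community for Bluesky/Koo is a choice made in this sketch, not documented CLI behavior.

```python
from typing import List, Optional

COMMUNITY_PLATFORMS = {"reddit", "voat"}  # these need --community
PLATFORM_WIDE = {"bluesky", "koo"}        # these take no community


def build_download_command(
    platform: str,
    community: Optional[str] = None,
    output_dir: str = "data",
) -> List[str]:
    """Return the argv list for a `pymadoc download` call (hypothetical helper)."""
    platform = platform.lower()
    if platform in COMMUNITY_PLATFORMS:
        if not community:
            raise ValueError(f"{platform} requires a community")
        return ["pymadoc", "download", platform,
                "--community", community, "--output-dir", output_dir]
    if platform in PLATFORM_WIDE:
        if community:
            # Assumption of this sketch: treat a stray community as a mistake.
            raise ValueError(f"{platform} does not take a community")
        return ["pymadoc", "download", platform, "--output-dir", output_dir]
    raise ValueError(f"unknown platform: {platform!r}")


# Usage (needs pymadoc installed and on PATH):
#   import subprocess
#   subprocess.run(build_download_command("reddit", "funny"), check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with unusual output directory names.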
Available Data
Platforms
- Reddit: Community-specific posts and comments
- Voat: Community-specific posts and comments
- Bluesky: Platform-wide posts
- Koo: Platform-wide posts
Communities (Reddit/Voat only)
- CringeAnarchy
- fatpeoplehate
- funny
- gaming
- gifs
- greatawakening
- KotakuInAction
- MensRights
- milliondollarextreme
- pics
- technology
- videos
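Several community names use mixed case (for example `KotakuInAction`). Whether pyMADOC matches names case-sensitively is not stated here, so one cautious approach is to normalize user input to the spelling listed above before calling the package. `canonical_community` below is a hypothetical helper, and its list is copied from this README rather than fetched from `list_available_data()`.

```python
# Community names exactly as listed in this README.
COMMUNITIES = (
    "CringeAnarchy", "fatpeoplehate", "funny", "gaming", "gifs",
    "greatawakening", "KotakuInAction", "MensRights",
    "milliondollarextreme", "pics", "technology", "videos",
)

# Lookup table from lowercase name to the listed spelling.
_BY_LOWER = {name.lower(): name for name in COMMUNITIES}


def canonical_community(name: str) -> str:
    """Map user input to the listed spelling, ignoring case and outer spaces."""
    key = name.strip().lower()
    if key not in _BY_LOWER:
        raise ValueError(
            f"Unknown community {name!r}; expected one of: {', '.join(COMMUNITIES)}"
        )
    return _BY_LOWER[key]


# Usage:
#   canonical_community("mensrights")  # returns "MensRights"
```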
Data Format
All files are stored in parquet format for efficient storage and fast loading. Each file contains columns covering the following kinds of information:
- Platform-specific post/comment IDs
- Content text
- Timestamps
- User information
- Engagement metrics
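An interrupted download can leave a truncated file that only fails later, at load time. A Parquet file begins and ends with the 4-byte marker `PAR1`, so a cheap sanity check is possible without any third-party library. `looks_like_parquet` is a hypothetical helper and only inspects these framing bytes; it does not validate the contents. Reading the data still requires pandas with a parquet engine such as pyarrow or fastparquet.

```python
import os

PARQUET_MAGIC = b"PAR1"
# Smallest possible file: leading magic + 4-byte footer length + trailing magic.
MIN_PARQUET_SIZE = 12


def looks_like_parquet(path: str) -> bool:
    """Return True if the file starts and ends with the Parquet magic bytes."""
    if os.path.getsize(path) < MIN_PARQUET_SIZE:
        return False
    with open(path, "rb") as fh:
        head = fh.read(len(PARQUET_MAGIC))
        fh.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = fh.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


# Usage, with a path returned by download_file:
#   if not looks_like_parquet(file_path):
#       print("File looks incomplete; try downloading it again.")
```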
Requirements
- Python 3.6 or higher
- pandas
- requests
- tqdm
Citation
If you use this package or the MADOC dataset in your research, please cite:
@dataset{madoc_dataset,
  title = {MADOC: Multi-platform Archive of Digital Online Content},
  author = {Tomašević, Aleksandar},
  year = {2024},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.14637314}
}
License
MIT License
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.