Python package to download and combine parts of the MADOC dataset

pyMADOC

pyMADOC is a Python package for downloading and combining parts of the MADOC dataset from Zenodo (record: 14637314). The MADOC dataset contains social media posts from multiple platforms (Reddit, Voat, Bluesky, and Koo), which supports the study of cross-platform content and community dynamics.

Features

  • Easy download of platform-specific data files
  • Automatic pairing of Reddit-Voat community data
  • Both Python API and Command Line Interface
  • Support for direct DataFrame loading
  • Progress bars for downloads
  • Efficient parquet file format

Installation

pip install pymadoc

Usage

As a Python Package

from pymadoc import list_available_data, download_file, download_community_pair

# List available platforms and communities
data_info = list_available_data()
print(data_info["platforms"])  # ['reddit', 'voat', 'bluesky', 'koo']
print(data_info["communities"])  # ['CringeAnarchy', 'fatpeoplehate', ...]

# Download a specific file
# For Reddit/Voat, specify both platform and community
file_path = download_file("reddit", community="funny", output_dir="data")
# For Bluesky/Koo, specify only platform
file_path = download_file("bluesky", output_dir="data")

# Load directly as DataFrame
df = download_file("reddit", community="funny", as_dataframe=True)

# Download and combine Reddit-Voat community pair
# As files
reddit_file, voat_file = download_community_pair("funny", output_dir="data")
# As combined DataFrame
combined_df = download_community_pair("funny", as_dataframe=True)
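
When several community pairs are needed, the pair download shown above can be wrapped in a loop. The following is a minimal sketch and not part of pyMADOC: `download_pairs` and its `pair_fn` parameter are hypothetical names, and catching a broad `Exception` is an assumption, since the package's error types are not documented here.

```python
def download_pairs(communities, output_dir="data", pair_fn=None):
    """Download several Reddit-Voat pairs, collecting failures instead of stopping."""
    if pair_fn is None:
        # Imported lazily so the helper can be exercised with a stand-in function.
        from pymadoc import download_community_pair as pair_fn

    downloaded, failed = {}, {}
    for name in communities:
        try:
            # Per the usage above, this returns (reddit_file, voat_file).
            downloaded[name] = pair_fn(name, output_dir=output_dir)
        except Exception as exc:  # assumption: error types are not documented
            failed[name] = str(exc)
    return downloaded, failed
```

Calling `download_pairs(["funny", "gaming"])` would return one dictionary of file pairs keyed by community and one dictionary of error messages for the communities that could not be fetched.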

Command Line Interface

List available platforms and communities:

pymadoc list

Download a specific file:

# Reddit/Voat (requires community)
pymadoc download reddit --community funny --output-dir data
# Bluesky/Koo
pymadoc download bluesky --output-dir data

Download Reddit-Voat community pair:

pymadoc pair funny --output-dir data
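
For scripted runs, the download command can be assembled and invoked from Python. This is a sketch under assumptions: `build_download_command` and `run_download` are hypothetical helpers, they use only the `download` subcommand and the `--community` and `--output-dir` flags shown above, and they presume the `pymadoc` executable is on `PATH`.

```python
import subprocess


def build_download_command(platform, community=None, output_dir="data"):
    """Return the argument list for a `pymadoc download` call."""
    cmd = ["pymadoc", "download", platform]
    if community is not None:
        # Only Reddit/Voat downloads take a community.
        cmd += ["--community", community]
    cmd += ["--output-dir", output_dir]
    return cmd


def run_download(platform, community=None, output_dir="data"):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    cmd = build_download_command(platform, community, output_dir)
    return subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems if an output directory contains spaces.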

Available Data

Platforms

  • Reddit: Community-specific posts and comments
  • Voat: Community-specific posts and comments
  • Bluesky: Platform-wide posts
  • Koo: Platform-wide posts

Communities (Reddit/Voat only)

  • CringeAnarchy
  • fatpeoplehate
  • funny
  • gaming
  • gifs
  • greatawakening
  • KotakuInAction
  • MensRights
  • milliondollarextreme
  • pics
  • technology
  • videos
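
The split between community-specific and platform-wide data can be checked before any download is attempted. The sketch below is illustrative only: `check_request` is a hypothetical helper, the constants copy the lists above, and both the case-sensitive matching of community names and the rejection of a community for Bluesky/Koo are assumptions rather than documented pyMADOC behavior.

```python
COMMUNITY_PLATFORMS = {"reddit", "voat"}
PLATFORM_WIDE = {"bluesky", "koo"}
COMMUNITIES = {
    "CringeAnarchy", "fatpeoplehate", "funny", "gaming", "gifs",
    "greatawakening", "KotakuInAction", "MensRights",
    "milliondollarextreme", "pics", "technology", "videos",
}


def check_request(platform, community=None):
    """Validate a (platform, community) combination and return it unchanged."""
    if platform in COMMUNITY_PLATFORMS:
        if community is None:
            raise ValueError(f"{platform} requires a community")
        if community not in COMMUNITIES:
            raise ValueError(f"unknown community: {community}")
    elif platform in PLATFORM_WIDE:
        if community is not None:
            raise ValueError(f"{platform} data is platform-wide; omit the community")
    else:
        raise ValueError(f"unknown platform: {platform}")
    return platform, community
```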

Data Format

All files are stored in parquet format for efficient storage and fast loading. Each file contains columns covering the following kinds of information:

  • Platform-specific post/comment IDs
  • Content text
  • Timestamps
  • User information
  • Engagement metrics
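
The list above names kinds of information rather than exact column names, so it can help to inspect a file's schema after downloading it. A minimal sketch, assuming pandas and a parquet engine such as pyarrow are installed; `summarize_columns` and `load_and_summarize` are hypothetical helpers, not part of pyMADOC.

```python
def summarize_columns(df):
    """Map each column name of a DataFrame to its dtype, both as strings."""
    return {str(name): str(dtype) for name, dtype in df.dtypes.items()}


def load_and_summarize(path):
    """Read a downloaded parquet file and report its columns and dtypes."""
    import pandas as pd  # imported lazily; also needs a parquet engine

    df = pd.read_parquet(path)
    return summarize_columns(df)
```

`load_and_summarize` would be called with the path returned by `download_file`; the resulting dictionary shows the actual column names to use in later analysis.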

Requirements

  • Python 3.6 or higher
  • pandas
  • requests
  • tqdm

Citation

If you use this package or the MADOC dataset in your research, please cite:

@dataset{madoc_dataset,
    title = {MADOC: Multi-platform Archive of Digital Online Content},
    author = {Tomašević, Aleksandar},
    year = {2024},
    publisher = {Zenodo},
    doi = {10.5281/zenodo.14637314}
}

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Download files

Source Distribution

pymadoc-0.1.1.tar.gz (8.3 kB)

Uploaded Source

Built Distribution

pymadoc-0.1.1-py3-none-any.whl (8.7 kB)

Uploaded Python 3

File details

Details for the file pymadoc-0.1.1.tar.gz.

File metadata

  • Download URL: pymadoc-0.1.1.tar.gz
  • Upload date:
  • Size: 8.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.10.12

File hashes

Hashes for pymadoc-0.1.1.tar.gz
Algorithm    Hash digest
SHA256       9bec577a185a82ecc1b7b95342f51b18488e3ca3dc6b49a817425b3968272742
MD5          e51ac06dfb5c6bdf525d93d8642cbde2
BLAKE2b-256  b8bcdb36b4cf6136a48fdd7c0ee3437b6dfd0132f1db785c11c47d947540b128
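
A downloaded archive can be checked against the SHA256 digest listed above using only the standard library. This is a sketch, not an official verification tool: `sha256_of` and `matches_expected` are hypothetical helper names, and the local file path is whatever location the archive was saved to.

```python
import hashlib

# SHA256 digest listed above for pymadoc-0.1.1.tar.gz
EXPECTED_SHA256 = "9bec577a185a82ecc1b7b95342f51b18488e3ca3dc6b49a817425b3968272742"


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file's digest equals the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

Reading in chunks keeps memory use constant regardless of file size, which matters little for an 8.3 kB archive but makes the helper reusable for larger files.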

File details

Details for the file pymadoc-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: pymadoc-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 8.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.10.12

File hashes

Hashes for pymadoc-0.1.1-py3-none-any.whl
Algorithm    Hash digest
SHA256       e1f68942cdcb58fe53bc039e0c5330faedd3d2bcb34bd6da58100fc351a35e4e
MD5          d64ca1cce6d46088e317c3d82525be23
BLAKE2b-256  7041e419bb3b672fa46b2b58d45bb49946d0079963bedc9182ad3c2ad46f1fbe
