A package for crawling Mastodon instances

Mastodoner

Mastodoner is a command line tool (and Python library) for archiving Mastodon, a decentralized micro-blogging social network. Mastodoner does not currently require any authentication with Mastodon; it gathers all data through publicly accessible endpoints.

Installation

Install Mastodoner with pip:

pip install -U mastodoner

Alternatively, clone the latest version from the repository and install it directly from the source code:

pip install -e .

Note: Mastodoner requires Python 3.8 or higher.
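
To confirm the installation from Python, a minimal sketch like the one below checks the interpreter version against the 3.8 requirement and looks up the installed package version. It uses only the standard library; the helper name `installed_version` is made up for this example. The `mastodoner version` command reports the version from the command line.

```python
import sys
from importlib import metadata  # part of the standard library since Python 3.8


def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Mastodoner requires Python 3.8 or higher.
python_ok = sys.version_info >= (3, 8)
print("Python 3.8+:", python_ok)
print("mastodoner version:", installed_version("mastodoner") or "not installed")
```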

CLI Usage

usage: mastodoner [-h] {version,instance,user,status,discover} ...

Crawl public data from Mastodon instance and save to a JSON Lines file.

positional arguments:
  {version,instance,user,status,discover}
    version             Check mastodoner version
    instance            Crawl instance endpoints
    user                Crawl user endpoints
    status              Crawl status endpoints
    discover            Discover instances

optional arguments:
  -h, --help            show this help message and exit

There are four main commands: instance, user, status, and discover. Each is described below:

  • instance
usage: mastodoner instance [-h] --instance-url INSTANCE_URL [--node-info] [--info] [--peers] [--activity] [--rules] [--blocks] [--trends]
                           [--trend-type {tags,statuses,links}] [--directory] [--order {new,active}] [--include-remote] [--timeline] [--only-local]
                           [--only-remote] [--only-media] [--limit LIMIT]
                           output_file

positional arguments:
  output_file           Output file (JSON Lines format)

optional arguments:
  -h, --help            show this help message and exit
  --instance-url INSTANCE_URL
                        Base URL of the Mastodon instance e.g. mastodon.online
  --node-info           Crawl node information of the instance
  --info                Crawl general information about the instance
  --peers               Crawl list of instance(s) that given instance is aware of
  --activity            Crawl instance activity over the last 3 months (binned weekly)
  --rules               Crawl rules that the users of given instance should follow
  --blocks              Crawl list of instance(s) blocked by given instance
  --trends              Crawl (hash)tags, statuses or links that trended within the past week on given instance. Use --trend-type to specify the trend type
                        (default: tags)
  --trend-type {tags,statuses,links}
                        Optional argument used with --trends to specify the type of trend (default: tags)
  --directory           Crawl instance directory i.e. user profiles
  --order {new,active}  Optional argument used with --directory to specify the order of response (default: active)
  --include-remote      Optional argument used with --directory to include remote users
  --timeline            Crawl public timeline of given instance
  --only-local          Optional argument used with --timeline to crawl only local statuses
  --only-remote         Optional argument used with --timeline to crawl only remote statuses
  --only-media          Optional argument used with --timeline to filter out statuses without attachments
  --limit LIMIT         Optional argument used with --trends, --directory or --timeline to limit the response
  • user
usage: mastodoner user [-h] --username USERNAME [--info] [--statuses] [--followers] [--following] [--limit LIMIT] [--only-media] [--exclude-replies]
                       [--exclude-reblogs] [--only-pinned]
                       output_file

positional arguments:
  output_file          Output file (JSON Lines format)

optional arguments:
  -h, --help           show this help message and exit
  --username USERNAME  Username of the Mastodon user in the format user@domain e.g. ignactro@mastodon.social
  --info               Crawl user profile information
  --statuses           Crawl statuses posted by the given user
  --followers          Crawl users which follow the given user
  --following          Crawl users which the given user follows
  --limit LIMIT        Optional argument used with --statuses, --followers or --following to limit the response
  --only-media         Optional argument used with --statuses to filter out statuses without attachments
  --exclude-replies    Optional argument used with --statuses to filter out statuses in reply to a different user
  --exclude-reblogs    Optional argument used with --statuses to filter out reblogs (reposts)
  --only-pinned        Optional argument used with --statuses to filter pinned statuses only
  • status
usage: mastodoner status [-h] --instance-url INSTANCE_URL --status-id STATUS_ID --info output_file

positional arguments:
  output_file           Output file (JSON Lines format)

optional arguments:
  -h, --help            show this help message and exit
  --instance-url INSTANCE_URL
                        Base URL of the Mastodon instance e.g. mastodon.online
  --status-id STATUS_ID
                        ID of the status on Mastodon instance
  --info                Crawl information about the status
  • discover
usage: mastodoner discover [-h] [--bearer-token BEARER_TOKEN] [--count COUNT] [--include-dead] [--include-down] [--include-closed]
                           [--min-users MIN_USERS] [--max-users MAX_USERS]
                           output_file

positional arguments:
  output_file           Output file (JSON Lines format)

optional arguments:
  -h, --help            show this help message and exit
  --bearer-token BEARER_TOKEN
                        Bearer token for the 'instances.social' API. Alternatively, bearer token can also be provided by setting the
                        'INSTANCES_SOCIAL_TOKEN' environment variable. You can get the token from 'https://instances.social/api/doc/'
  --count COUNT         Number of instances to discover. Value between 1 and 10000 (default: 0 i.e. all instances)
  --include-dead        Include dead (down for at least two weeks) instances
  --include-down        Include down instances
  --include-closed      Include instances with closed registrations
  --min-users MIN_USERS
                        Minimum users discovered instances must have. Value greater than or equal to 1
  --max-users MAX_USERS
                        Maximum users discovered instances must have. Value greater than or equal to 1
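
The commands above can also be assembled from a script. The sketch below is one possible way to build argument lists from the options documented in the help text; the helper `build_command` and the output file names are hypothetical and not part of Mastodoner. It only prints the resulting command lines. To execute one, pass the list to `subprocess.run(cmd, check=True)` on a machine where Mastodoner is installed.

```python
import shlex


def build_command(subcommand, output_file, options=None, flags=()):
    """Assemble a mastodoner argument list.

    options: mapping of option name -> value, e.g. {"instance-url": "mastodon.online"}
    flags:   names of boolean switches, e.g. ["timeline", "only-local"]
    """
    cmd = ["mastodoner", subcommand]
    for name, value in (options or {}).items():
        cmd.extend([f"--{name}", str(value)])
    cmd.extend(f"--{flag}" for flag in flags)
    cmd.append(output_file)  # the positional output file (JSON Lines format)
    return cmd


# Local public timeline of an instance, limited with --limit.
timeline_cmd = build_command(
    "instance", "timeline.jsonl",
    {"instance-url": "mastodon.online", "limit": 100},
    ["timeline", "only-local"],
)

# Statuses of a user, without reblogs.
statuses_cmd = build_command(
    "user", "statuses.jsonl",
    {"username": "ignactro@mastodon.social", "limit": 50},
    ["statuses", "exclude-reblogs"],
)

# Instance discovery; the bearer token is expected in the
# INSTANCES_SOCIAL_TOKEN environment variable, so it is not passed here.
discover_cmd = build_command(
    "discover", "instances.jsonl",
    {"count": 100, "min-users": 1000},
)

for cmd in (timeline_cmd, statuses_cmd, discover_cmd):
    print(shlex.join(cmd))
```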

Python Usage

You can also use Mastodoner as a Python library. For example, here's how you can crawl a user's info:

from mastodoner import Crawler
crawler = Crawler()
user_info = crawler.user_lookup('ignactro@mastodon.social')

For more examples of using Mastodoner as a Python library, check out the Colab notebook.
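
The CLI writes its results in JSON Lines format, that is, one JSON value per line. A minimal sketch for loading such a file back into Python follows; the helper `read_jsonl` and the file name in the usage note are illustrative, and no assumption is made about which fields each record contains.

```python
import json


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, `records = list(read_jsonl("timeline.jsonl"))` would load every record from a file produced by one of the CLI commands.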

Intended Use

Mastodoner is developed to support academic research. The data collected through Mastodoner should only be used for research purposes. Mastodoner gathers all data through publicly accessible endpoints and strictly adheres to the rate limits set for each endpoint.
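
To illustrate what respecting a rate limit can look like, the sketch below computes how long a client should wait based on the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers that Mastodon servers typically send. This is not Mastodoner's own implementation, only an illustration under that assumption; the function name `seconds_until_reset` is hypothetical.

```python
from datetime import datetime, timezone


def seconds_until_reset(headers, now=None):
    """Return how many seconds to wait before sending the next request.

    headers: mapping of response header names to values (exact-case keys here;
             HTTP libraries usually provide a case-insensitive mapping).
    now:     current time as an aware datetime; defaults to the current UTC time.
    """
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0  # requests are still available in this window
    reset_raw = headers.get("X-RateLimit-Reset")
    if reset_raw is None:
        return 0.0  # no reset time reported, nothing to wait for
    # The reset time is an ISO 8601 timestamp; map a trailing "Z" to an
    # explicit UTC offset so datetime.fromisoformat accepts it on Python 3.8.
    reset_at = datetime.fromisoformat(reset_raw.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return max(0.0, (reset_at - now).total_seconds())
```

A caller would pass the result to `time.sleep` before issuing the next request to the same endpoint.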

Contributing

Contributions are welcome! For small bug fixes and minor improvements, feel free to just open a PR. For larger changes, please open an issue first so that other contributors can discuss your plan, avoid duplicated work, and ensure it aligns with the goals of the project. Thanks!

Reference

If you use this tool in any of your work, please cite the paper below.

@inproceedings{10.1145/3627673.3679217,
author = {Zia, Haris Bin and Castro, Ignacio and Tyson, Gareth},
title = {Mastodoner: A Command-line Tool and Python Library for Public Data Collection from Mastodon},
year = {2024},
isbn = {9798400704369},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3627673.3679217},
doi = {10.1145/3627673.3679217},
booktitle = {Proceedings of the 33rd ACM International Conference on Information and Knowledge Management},
pages = {5314–5317},
numpages = {4},
location = {Boise, ID, USA},
series = {CIKM '24}
}
