
repo-people



repo-people is a Python package that collects and exports the full GitHub profile for every person associated with a repository — contributors, maintainers, stargazers, watchers, issue/PR authors, fork owners, commit authors and dependents.

Introduction

repo-people provides a single-call pipeline to collect every GitHub user associated with a repository across 9 role categories, fetch 30+ profile fields for each person from the GitHub API, and export the results to JSON, CSV, or Markdown. It is designed for research, open-source community analysis, and developer intelligence workflows.

Key capabilities:

  • Collects users from 9 role categories in a single call
  • Fetches 30+ profile fields per user (bio, location, company, followers, orgs, languages, …)
  • Computes derived metrics: account age, followers/following ratio, repos/year, recently-active flag, bot detection
  • Incremental fetch with save_each_iteration and resume — safe to interrupt and restart on large repos
  • Flexible filtering: roles, exclude, exclude_bots, limit, fields
  • Concurrent fetching via workers — uses ThreadPoolExecutor to fetch multiple profiles in parallel
  • Async fetching via get_users_async() — uses asyncio + aiohttp for high-concurrency scenarios
  • Opt-in social accounts via include_social_accounts — fetches linked LinkedIn, Mastodon, npm, and other accounts
  • Export to JSON, CSV and Markdown table
  • Analysis helpers: summarise() and top_users()
  • Token validated on startup — invalid or expired tokens raise ConnectionError immediately
  • Rate-limit progress printed every 50 users with remaining request count and reset time
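
For illustration, the derived metrics above could be computed from raw profile fields roughly like this. This is a minimal sketch, not the package's actual implementation; derive_metrics is a hypothetical helper, and the output keys follow the Output Fields table below:

```python
from datetime import datetime, timezone

def derive_metrics(profile, now=None):
    """Compute derived signals from raw GitHub profile fields."""
    now = now or datetime.now(timezone.utc)
    created = datetime.fromisoformat(profile["created_at"].replace("Z", "+00:00"))
    age_days = (now - created).days
    years = max(age_days / 365.25, 1e-9)
    return {
        "account_age_days": age_days,
        # Guard against division by zero for users who follow nobody
        "followers_following_ratio": profile.get("followers", 0)
        / max(profile.get("following", 0), 1),
        "repos_per_year": profile.get("public_repos", 0) / years,
        # GitHub marks bot accounts with type "Bot" and a "[bot]" login suffix
        "is_bot": profile.get("type") == "Bot" or profile["login"].endswith("[bot]"),
    }
```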

Background

Understanding who contributes to, uses, and maintains an open-source project is valuable for community health analysis, academic research, and competitive intelligence. GitHub exposes this information across many endpoints (contributors, stargazers, watchers, forks, issues, pull requests, CODEOWNERS, commit history), but collecting and joining it requires many paginated API calls.

repo-people automates that collection, deduplicates users across all roles, enriches each record with the full GitHub profile, and computes additional signals (account age, activity recency, bot detection) in a single pipeline call.
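
The dedup-and-merge step can be sketched as follows. merge_roles is an illustrative helper, not the package's internal function; it collapses per-role username lists into one record per user:

```python
from collections import defaultdict

def merge_roles(role_lists):
    """Deduplicate usernames across role categories.

    role_lists maps a role name to the usernames found for that role;
    the result maps each username to the sorted list of roles it holds.
    """
    roles_by_user = defaultdict(set)
    for role, usernames in role_lists.items():
        for username in usernames:
            roles_by_user[username].add(role)
    return {user: sorted(roles) for user, roles in roles_by_user.items()}
```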


Requirements

  • Python ^3.9
  • PyGithub ^2.0.0 — GitHub API client
  • requests ^2.31.0 — HTTP requests for REST endpoints
  • beautifulsoup4 ^4.12.0 — HTML scraping for dependents
  • aiohttp ^3.9 — async HTTP client for get_users_async()

A GitHub personal access token is strongly recommended. Unauthenticated requests are limited to 60/hour; authenticated requests allow 5,000/hour.


Installation

Install the latest version of repo-people via PyPI using pip:

pip3 install repo-people --upgrade

Installation from source:

git clone https://github.com/amckenna41/repo-people.git
cd repo-people
pip3 install .

Documentation

  • Read the Docs — full package documentation
  • FIELDS.md — full reference table of all 48 output fields with descriptions
  • CHANGELOG.md — version history and release notes

Usage

Quick Start

from repo_people import RepoPeople

rp = RepoPeople("owner", "repo", token="ghp_...")
user_data = rp.get_users(export=True)
# Returns a dict keyed by username, with 30+ profile fields per user

Authentication

import os
rp = RepoPeople("owner", "repo", token=os.environ["GITHUB_TOKEN"])

The token is validated immediately on construction — an invalid or expired token raises ConnectionError before any collection begins.
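
A fail-fast check of this kind can be sketched as follows. validate_token is a hypothetical stand-in for what the constructor does (a single call to GET /user); the HTTP getter is injectable so the sketch can run offline:

```python
def validate_token(token, get=None):
    """Validate a GitHub token up front by calling GET /user.

    Returns the authenticated login; raises ConnectionError on a 401
    so that no collection work starts with a bad token.
    """
    if get is None:  # default to requests.get; injectable for testing
        import requests
        get = requests.get
    resp = get(
        "https://api.github.com/user",
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    if resp.status_code == 401:
        raise ConnectionError("GitHub token is invalid or expired")
    resp.raise_for_status()
    return resp.json()["login"]
```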

RepoPeople() Constructor

RepoPeople(owner, repo, token=None, outdir=None, skip_codeowners=False, skip_collaborators=False)
  • owner (str, required): GitHub username or organisation that owns the repo.
  • repo (str, required): Repository name.
  • token (str | None, default None): Personal access token. Strongly recommended; validated immediately on init and raises ConnectionError for invalid tokens.
  • outdir (str | None, default "{owner}_{repo}"): Leaf directory inside outputs/. All output files are written under outputs/{outdir}/.
  • skip_codeowners (bool, default False): Skip the CODEOWNERS file when collecting maintainers.
  • skip_collaborators (bool, default False): Skip repo collaborators when collecting maintainers.

get_users() Parameters

  • export (bool, default False): Write results to a JSON file.
  • export_csv (bool, default False): Write results to a CSV file.
  • save_each_iteration (bool, default False): Save after every single user fetch.
  • limit (int | None, default None): Cap the number of profiles to fetch.
  • roles (list[str] | None, default None = all 9): Restrict which roles to collect.
  • exclude (list[str] | None, default None): Usernames to skip.
  • exclude_bots (bool, default False): Skip bot accounts automatically.
  • resume (bool, default False): Skip users already in the output file.
  • verbose (bool, default True): Print progress to stdout.
  • fields (list[str] | str | None, default None = all): Restrict which fields appear in the output. Invalid names raise ValueError before any fetch.
  • include_social_accounts (bool, default False): Fetch each user's linked social accounts (LinkedIn, Mastodon, npm, …). Costs one extra API call per user.
  • workers (int, default 1): Number of concurrent fetch threads. Increase for faster collection on large repos.

Valid roles values: contributors, maintainers, stargazers, watchers, issue_authors, pr_authors, fork_owners, commit_authors, dependents.
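
The fail-fast validation described for roles and fields can be sketched like so. check_roles is an illustrative helper, not the package's internal API; the point is that unknown names fail before any API call is made:

```python
VALID_ROLES = {
    "contributors", "maintainers", "stargazers", "watchers", "issue_authors",
    "pr_authors", "fork_owners", "commit_authors", "dependents",
}

def check_roles(roles):
    """Return the roles to collect, failing fast on unknown names."""
    if roles is None:
        return sorted(VALID_ROLES)  # None means all nine categories
    unknown = set(roles) - VALID_ROLES
    if unknown:
        raise ValueError(f"Unknown roles: {sorted(unknown)}")
    return list(roles)
```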

Examples

Filter by role

# Only gather contributors and stargazers
user_data = rp.get_users(roles=["contributors", "stargazers"])

Limit, exclude, and skip bots

user_data = rp.get_users(
    limit=100,
    exclude=["dependabot", "github-actions[bot]"],
    exclude_bots=True,
)

Export to JSON and CSV

user_data = rp.get_users(export=True, export_csv=True)

Export to Markdown table

rp.export_to_markdown(user_data, fields=["login", "name", "location", "followers"])
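
Under the hood, a Markdown export amounts to joining field values into pipe-delimited rows. A minimal sketch of that idea (to_markdown_table is a hypothetical helper, not the package's implementation):

```python
def to_markdown_table(user_data, fields):
    """Render selected fields of each user record as a Markdown table."""
    header = "| " + " | ".join(fields) + " |"
    divider = "| " + " | ".join("---" for _ in fields) + " |"
    rows = [
        "| " + " | ".join(str(record.get(f, "")) for f in fields) + " |"
        for record in user_data.values()
    ]
    return "\n".join([header, divider, *rows])
```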

Resume an interrupted run

# First run
rp.get_users(save_each_iteration=True, export=True)

# Resume after interruption
rp.get_users(save_each_iteration=True, export=True, resume=True)
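
The resume behaviour amounts to loading the previous output file and skipping usernames already present in it. A sketch of that logic (pending_users is an illustrative helper, assuming a JSON file keyed by username as in the Quick Start):

```python
import json
from pathlib import Path

def pending_users(all_usernames, output_path):
    """Return the usernames not yet present in a previous run's JSON output."""
    path = Path(output_path)
    # set(dict) yields the keys, i.e. the usernames already fetched
    done = set(json.loads(path.read_text())) if path.exists() else set()
    return [u for u in all_usernames if u not in done]
```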

Concurrent fetching

# Speed up large repos by fetching profiles in parallel
user_data = rp.get_users(workers=4)
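
The workers option maps naturally onto ThreadPoolExecutor. A minimal sketch of that pattern, with the per-user fetch passed in as a plain callable so the example runs without network access (fetch_all is an illustrative helper, not the package's code):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(usernames, fetch_profile, workers=4):
    """Fetch profiles concurrently.

    fetch_profile is any callable taking a username and returning a
    profile dict; pool.map preserves the input order, so zipping back
    against usernames is safe.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        profiles = pool.map(fetch_profile, usernames)
        return dict(zip(usernames, profiles))
```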

Async fetching

import asyncio

user_data = asyncio.run(rp.get_users_async(concurrency=10))
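
High-concurrency async fetching typically bounds in-flight requests with a semaphore. A sketch of that pattern using only asyncio; the package's method uses aiohttp for the HTTP calls, while here fetch_profile is any async callable standing in for the fetcher:

```python
import asyncio

async def fetch_all_async(usernames, fetch_profile, concurrency=10):
    """Fetch profiles concurrently, never exceeding `concurrency` in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(username):
        async with sem:  # blocks when `concurrency` fetches are already running
            return username, await fetch_profile(username)

    pairs = await asyncio.gather(*(bounded(u) for u in usernames))
    return dict(pairs)
```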

Include social accounts

user_data = rp.get_users(include_social_accounts=True)
# Each record gains a 'social_accounts' dict, e.g. {'linkedin': 'https://linkedin.com/in/...'}

Analysis helpers

stats = rp.summarise(user_data, top_n=5)
# {'total': 134, 'top_locations': [('San Francisco', 18), ...], ...}

leaders = rp.top_users(user_data, n=10, by="followers")
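
Ranking by a numeric field reduces to a sorted() call over the records. A sketch of what top_users plausibly does (illustrative, not the package's actual code):

```python
def top_users(user_data, n=10, by="followers"):
    """Return the top n (login, value) pairs ranked by a numeric field."""
    ranked = sorted(
        ((login, record.get(by, 0)) for login, record in user_data.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:n]
```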

Output Fields

Each user entry contains 30+ fields. See FIELDS.md for the full reference. A summary by category:

  • Identity: login, name, company, location, email_public, blog, twitter, bio
  • Timestamps: created_at, updated_at
  • Counters: followers, following, public_repos, public_gists
  • Flags: has_public_email, has_blog, has_twitter, is_bot, hireable
  • Computed: account_age_days, followers_following_ratio, repos_per_year, recently_active, last_public_event_at
  • Organisations: public_orgs, orgs_public_count
  • Sampled: top_languages, total_public_stars_sampled, total_public_forks_sampled, ssh_keys_count, gpg_keys_count, starred_repos_sampled
  • Social: social_accounts (opt-in via include_social_accounts)
  • Repo-specific: is_collaborator, permission_on_repo
  • Metadata: roles (populated by get_users())

Directories

repo-people/
├── repo_people/          # Package source
│   ├── __init__.py
│   ├── repo_people.py    # RepoPeople class — main pipeline
│   ├── export.py         # Role-specific username collectors (9 functions)
│   ├── users.py          # GitHubUserInfo wrapper and UserSnapshot dataclass
│   └── utils.py          # Shared helpers: paginate(), _headers(), write_csv()
├── tests/                # Unit and integration tests
│   ├── test_repo_people.py
│   ├── test_export.py
│   └── test_users.py
├── docs/                 # Sphinx documentation source
├── outputs/              # Default output directory (created at runtime)
├── FIELDS.md             # Full output field reference
├── CHANGELOG.md          # Version history
├── pyproject.toml        # Package metadata and dependencies
└── README.md


Issues

Bugs and feature requests are tracked on GitHub Issues.

When reporting a bug, please include:

  • Python version (python --version)
  • Package version (pip show repo-people)
  • A minimal code snippet that reproduces the issue
  • The full traceback if an exception is raised

License

Distributed under the MIT License. See the LICENSE file for more details.

Contact

AJ McKenna (amckenna41@qub.ac.uk)






Download files

Download the file for your platform.

Source Distribution

repo_people-0.1.0.tar.gz (26.2 kB)


Built Distribution


repo_people-0.1.0-py3-none-any.whl (27.5 kB)


File details

Details for the file repo_people-0.1.0.tar.gz.

File metadata

  • Download URL: repo_people-0.1.0.tar.gz
  • Upload date:
  • Size: 26.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for repo_people-0.1.0.tar.gz
Algorithm Hash digest
SHA256 7cfe0be4c88d2d5e8def549e2bf2d37d020591cd2a1ea2b1d55cd92e4c6c94fe
MD5 5019e7a5f05232ab299ef39168db91e4
BLAKE2b-256 eae90f2cfdfce0de656939ca40f8786e92bbc25a0222e20418c09a0e39ff858a


File details

Details for the file repo_people-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: repo_people-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 27.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for repo_people-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b610e6c14a730a2e33892c2502325d32f97652b799fa292472f91cb5157dbed2
MD5 daeadcb2f117a0babcccf0e9e220826c
BLAKE2b-256 cd407d018955be37f9ec191b6fd664343cf224ee2203f2f8e217b4b8438e8006

