Collect and export full GitHub user profile data for everyone associated with a repository.

repo-people

repo-people is a Python package that collects and exports the full GitHub profile for every person associated with a repository — contributors, maintainers, stargazers, watchers, issue/PR authors, fork owners, commit authors and dependents.

Introduction

repo-people provides a single-call pipeline to collect every GitHub user associated with a repository across 9 role categories, fetch 30+ profile fields for each person from the GitHub API, and export the results to JSON, CSV, or Markdown. It is designed for research, open-source community analysis, and developer intelligence workflows.

Key capabilities:

  • Collects users from 9 role categories in a single call
  • Fetches 30+ profile fields per user (bio, location, company, followers, orgs, languages, …)
  • Computes derived metrics: account age, followers/following ratio, repos/year, recently-active flag, bot detection
  • Incremental fetch with save_each_iteration and resume — safe to interrupt and restart on large repos
  • Flexible filtering: roles, exclude, exclude_bots, limit, fields
  • Concurrent fetching via workers — uses ThreadPoolExecutor to fetch multiple profiles in parallel
  • Async fetching via get_users_async() — uses asyncio + aiohttp for high-concurrency scenarios
  • Opt-in social accounts via include_social_accounts — fetches linked LinkedIn, Mastodon, npm, and other accounts
  • Export to JSON, CSV and Markdown table
  • Analysis helpers: summarise() and top_users()
  • Token validated on startup — invalid or expired tokens raise ConnectionError immediately
  • Rate-limit progress printed every 50 users with remaining request count and reset time

Background

Understanding who contributes to, uses, and maintains an open-source project is valuable for community health analysis, academic research, and competitive intelligence. GitHub exposes this information across many endpoints (contributors, stargazers, watchers, forks, issues, pull requests, CODEOWNERS, commit history), but collecting and joining it requires many paginated API calls.

repo-people automates that collection, deduplicates users across all roles, enriches each record with the full GitHub profile, and computes additional signals (account age, activity recency, bot detection) in a single pipeline call.
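
The deduplication step can be pictured as merging per-role username lists into one mapping from each username to the roles it holds. The sketch below is illustrative only: `merge_roles` and the sample data are hypothetical names and are not taken from the package's internals.

```python
from collections import defaultdict


def merge_roles(role_to_users):
    """Merge {role: [usernames]} into {username: sorted list of roles}."""
    merged = defaultdict(set)
    for role, users in role_to_users.items():
        for login in users:
            merged[login].add(role)
    return {login: sorted(roles) for login, roles in merged.items()}


# Hypothetical sample: "bob" appears under two roles but is stored once
sample = {
    "contributors": ["alice", "bob"],
    "stargazers": ["bob", "carol"],
}
people = merge_roles(sample)
print(people)
```

Each person then needs only one profile fetch, however many roles they appear under.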


Requirements

  • Python ^3.9
  • PyGithub ^2.0.0 — GitHub API client
  • requests ^2.31.0 — HTTP requests for REST endpoints
  • beautifulsoup4 ^4.12.0 — HTML scraping for dependents
  • aiohttp ^3.9 — async HTTP client for get_users_async()

A GitHub personal access token is strongly recommended. Unauthenticated requests are limited to 60/hour; authenticated requests allow 5,000/hour.
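
As a rough worked example of why the token matters, the sketch below counts how many hourly rate-limit windows a run would span at each limit. The request count is a made-up figure for illustration; the number of API calls the pipeline actually makes per user is not stated here.

```python
import math


def windows_needed(n_requests, limit_per_hour):
    """Number of hourly rate-limit windows needed to issue n_requests."""
    return math.ceil(n_requests / limit_per_hour)


n = 2_000  # hypothetical total number of API requests for one run
print(windows_needed(n, 60))     # unauthenticated: 34 windows
print(windows_needed(n, 5_000))  # authenticated: 1 window
```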


Installation

Install the latest version of repo-people from PyPI using pip:

pip3 install repo-people --upgrade

Installation from source:

git clone -b main https://github.com/amckenna41/repo-people.git
cd repo-people
pip3 install .

Documentation

  • Read the Docs — full package documentation
  • FIELDS.md — full reference table of all 48 output fields with descriptions
  • CHANGELOG.md — version history and release notes

Usage

Quick Start

How to get a GitHub Personal Access Token

  1. Sign in to github.com and go to Settings → Developer settings → Personal access tokens → Tokens (classic).
  2. Click Generate new token (classic).
  3. Give the token a descriptive name and set an expiration date.
  4. Select the following scopes:
    • repo — read access to repository metadata, contributors, and collaborators
    • read:user — read user profile data
    • read:org — read organisation membership (needed for public_orgs)
  5. Click Generate token and copy it immediately — it won't be shown again.
  6. Store it securely (e.g. in an environment variable or a secrets manager) and pass it via the token parameter:
import os
from repo_people import RepoPeople

rp = RepoPeople("owner", "repo", token=os.environ["GITHUB_TOKEN"])

Tip: Unauthenticated requests are limited to 60/hour. Authenticated requests allow 5,000/hour, making a token essential for any non-trivial repo.

from repo_people import RepoPeople

rp = RepoPeople("owner", "repo", token="ghp_...")
user_data = rp.get_users(export=True)
# Returns a dict keyed by username, with 30+ profile fields per user

Authentication

import os
from repo_people import RepoPeople

rp = RepoPeople("owner", "repo", token=os.environ["GITHUB_TOKEN"])

The token is validated immediately on construction — an invalid or expired token raises ConnectionError before any collection begins.
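
Because validation happens in the constructor, the error can be caught at the point where the client is built. The sketch below assumes the exception is Python's built-in ConnectionError; `make_client` and `rejecting_factory` are hypothetical helpers, and `factory` stands in for RepoPeople so that the example runs without the package or network access.

```python
import os


def make_client(factory, owner, repo, env_var="GITHUB_TOKEN"):
    """Build a client, returning None if the token is rejected.

    `factory` is a stand-in for RepoPeople (an assumption of this sketch).
    """
    token = os.environ.get(env_var)  # None means an unauthenticated client
    try:
        return factory(owner, repo, token=token)
    except ConnectionError as exc:
        print(f"Token rejected: {exc}")
        return None


def rejecting_factory(owner, repo, token=None):
    """Stand-in that always fails, to exercise the error path."""
    raise ConnectionError("invalid or expired token")


client = make_client(rejecting_factory, "owner", "repo")
```

In real use, passing RepoPeople as `factory` would give the same control flow.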

RepoPeople() Constructor

RepoPeople(owner, repo, token=None, outdir=None, skip_codeowners=False, skip_collaborators=False)
Parameter           Type         Default            Description
owner               str          (required)         GitHub username or organisation that owns the repo.
repo                str          (required)         Repository name.
token               str | None   None               Personal access token. Strongly recommended; validated immediately on init, and raises ConnectionError for invalid tokens.
outdir              str | None   "{owner}_{repo}"   Leaf directory inside outputs/. All output files are written under outputs/{outdir}/.
skip_codeowners     bool         False              Skip CODEOWNERS file when collecting maintainers.
skip_collaborators  bool         False              Skip repo collaborators when collecting maintainers.

get_users() Parameters

Parameter                Type                    Default       Description
export                   bool                    False         Write results to a JSON file.
export_csv               bool                    False         Write results to a CSV file.
save_each_iteration      bool                    False         Save after every single user fetch.
limit                    int | None              None          Cap the number of profiles to fetch.
roles                    list[str] | None        None (all 9)  Restrict which roles to collect.
exclude                  list[str] | None        None          Usernames to skip.
exclude_bots             bool                    False         Skip bot accounts automatically.
resume                   bool                    False         Skip users already in the output file.
verbose                  bool                    True          Print progress to stdout.
fields                   list[str] | str | None  None (all)    Restrict which fields appear in output. Invalid names raise ValueError before any fetch.
include_social_accounts  bool                    False         Fetch each user's linked social accounts (LinkedIn, Mastodon, npm, …). Costs one extra API call per user.
workers                  int                     1             Number of concurrent fetch threads. Increase for faster collection on large repos.

Valid roles values: contributors, maintainers, stargazers, watchers, issue_authors, pr_authors, fork_owners, commit_authors, dependents.
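
A caller-side check against this list can catch typos such as "stargazer" before a long run starts. The helper below is a hypothetical sketch; how the package itself responds to an unknown role name is not documented here.

```python
VALID_ROLES = {
    "contributors", "maintainers", "stargazers", "watchers", "issue_authors",
    "pr_authors", "fork_owners", "commit_authors", "dependents",
}


def check_roles(roles):
    """Return the roles to collect, raising ValueError for unknown names.

    None means "all roles", mirroring the default described above.
    """
    if roles is None:
        return sorted(VALID_ROLES)
    unknown = sorted(set(roles) - VALID_ROLES)
    if unknown:
        raise ValueError(f"Unknown roles: {unknown}")
    return list(roles)


print(check_roles(["contributors", "stargazers"]))
```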

Examples

Filter by role

# Only gather contributors and stargazers
user_data = rp.get_users(roles=["contributors", "stargazers"])

Limit, exclude, and skip bots

user_data = rp.get_users(
    limit=100,
    exclude=["dependabot", "github-actions[bot]"],
    exclude_bots=True,
)
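
To make the interaction of these three options concrete, here is a stand-alone sketch of the filtering they describe. It is not the package's implementation: the "[bot]" suffix test is an assumed heuristic, and applying the limit after the exclusions is also an assumption.

```python
def filter_logins(logins, exclude=(), exclude_bots=False, limit=None):
    """Drop excluded and bot-like logins, then cap the result at `limit`."""
    skip = {name.lower() for name in exclude}
    kept = []
    for login in logins:
        if login.lower() in skip:
            continue
        # Assumed heuristic: GitHub App accounts carry a "[bot]" suffix
        if exclude_bots and login.lower().endswith("[bot]"):
            continue
        kept.append(login)
    return kept if limit is None else kept[:limit]


logins = ["alice", "dependabot", "renovate[bot]", "bob", "carol"]
print(filter_logins(logins, exclude=["dependabot"], exclude_bots=True, limit=2))
# ['alice', 'bob']
```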

Export to JSON and CSV

user_data = rp.get_users(export=True, export_csv=True)

Export to Markdown table

rp.export_to_markdown(user_data, fields=["login", "name", "location", "followers"])
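
For readers unfamiliar with the target format, the sketch below renders a dict of profiles as a Markdown table restricted to chosen columns. `to_markdown` is a hypothetical stand-in written for illustration; the layout produced by export_to_markdown may differ.

```python
def to_markdown(user_data, fields):
    """Render {login: profile} as a Markdown table with the given columns."""
    lines = [
        "| " + " | ".join(fields) + " |",
        "| " + " | ".join("---" for _ in fields) + " |",
    ]
    for profile in user_data.values():
        cells = [str(profile.get(field, "")) for field in fields]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


users = {"alice": {"login": "alice", "name": "Alice", "followers": 12}}
table = to_markdown(users, ["login", "name", "followers"])
print(table)
```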

Resume an interrupted run

# First run
rp.get_users(save_each_iteration=True, export=True)

# Resume after interruption
rp.get_users(save_each_iteration=True, export=True, resume=True)
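
Conceptually, resuming means loading what was already saved and fetching only the remainder. The sketch below illustrates that idea under the assumption that the output file is a JSON object keyed by username; `pending_logins` and the file name are hypothetical.

```python
import json
import os
import tempfile


def pending_logins(all_logins, output_path):
    """Return the logins not yet present in the JSON file at output_path."""
    done = {}
    if os.path.exists(output_path):
        with open(output_path, encoding="utf-8") as fh:
            done = json.load(fh)
    return [login for login in all_logins if login not in done]


# Simulate an interrupted run that had saved one profile
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "users.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"alice": {"followers": 12}}, fh)
    todo = pending_logins(["alice", "bob"], path)

print(todo)  # ['bob']
```

Saving after every fetch keeps the work lost on interruption to at most one profile.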

Concurrent fetching

# Speed up large repos by fetching profiles in parallel
user_data = rp.get_users(workers=4)
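
The general pattern behind a `workers` setting looks like the following sketch, which uses ThreadPoolExecutor with a fake fetch function in place of a network call. `fetch_profile` and `fetch_all` are hypothetical names, not the package's own.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_profile(login):
    """Stand-in for a network call; returns a fake profile."""
    return {"login": login, "followers": len(login)}


def fetch_all(logins, workers=4):
    """Fetch profiles in parallel and key them by login."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order even though calls overlap
        return {p["login"]: p for p in pool.map(fetch_profile, logins)}


result = fetch_all(["alice", "bob", "carol"])
print(list(result))
```

More threads shorten wall-clock time but do not raise the hourly rate limit, so a very high value gains little on large repos.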

Async fetching

import asyncio

user_data = asyncio.run(rp.get_users_async(concurrency=10))
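
A `concurrency` cap of this kind is commonly implemented with a semaphore. The sketch below shows that pattern using only asyncio, with a placeholder where an aiohttp request would go; `fetch_one` and `gather_profiles` are hypothetical, and whether the package uses a semaphore internally is an assumption.

```python
import asyncio


async def fetch_one(login, sem):
    """Fetch one fake profile while holding a semaphore slot."""
    async with sem:  # at most `concurrency` fetches in flight at once
        await asyncio.sleep(0)  # placeholder for an HTTP request
        return login, {"login": login}


async def gather_profiles(logins, concurrency=10):
    """Run all fetches concurrently, bounded by `concurrency`."""
    sem = asyncio.Semaphore(concurrency)
    pairs = await asyncio.gather(*(fetch_one(name, sem) for name in logins))
    return dict(pairs)


profiles = asyncio.run(gather_profiles(["alice", "bob"], concurrency=2))
print(profiles)
```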

Include social accounts

user_data = rp.get_users(include_social_accounts=True)
# Each record gains a 'social_accounts' dict, e.g. {'linkedin': 'https://linkedin.com/in/...'}

Dot-notation field access

get_users() returns a UserDataView — a plain dict subclass that additionally supports dot notation to extract a single field across every user at once:

user_data = rp.get_users()

# Extract one field for all users
emails    = user_data.email_public
# {"alice": {"email_public": "alice@example.com"}, "bob": {"email_public": ""}, ...}

locations = user_data.location
followers = user_data.followers
roles     = user_data.roles

All standard dict operations still work unchanged. Accessing an unrecognised field name raises AttributeError listing the valid field names.
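
One way such a view could be built is with `__getattr__` on a dict subclass, as in the minimal sketch below. `FieldView` is a hypothetical class written to illustrate the behaviour described above; it is not the source of UserDataView.

```python
class FieldView(dict):
    """dict subclass where view.<field> projects that field across all users."""

    def __getattr__(self, name):
        # Python calls this only after normal attribute lookup has failed
        known = {key for profile in self.values() for key in profile}
        if name not in known:
            raise AttributeError(
                f"Unknown field {name!r}; valid fields: {sorted(known)}"
            )
        return {login: {name: profile.get(name)} for login, profile in self.items()}


view = FieldView({
    "alice": {"email_public": "alice@example.com", "followers": 12},
    "bob": {"email_public": "", "followers": 3},
})
emails = view.email_public
print(emails)
```

Because the class only adds a fallback lookup, indexing, iteration, and methods such as `.items()` behave exactly as on a plain dict.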

Analysis helpers

stats = rp.summarise(user_data, top_n=5)
# {'total': 134, 'top_locations': [('San Francisco', 18), ...], ...}

leaders = rp.top_users(user_data, n=10, by="followers")
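
The kind of aggregation these helpers perform can be sketched with the standard library alone. The functions and sample data below are hypothetical, and the return shapes of summarise() and top_users() may differ from what is shown.

```python
from collections import Counter


def top_by(user_data, n=10, by="followers"):
    """Return the n logins with the highest value of field `by`."""
    ranked = sorted(
        user_data.items(), key=lambda item: item[1].get(by) or 0, reverse=True
    )
    return [login for login, _ in ranked[:n]]


def location_counts(user_data, top_n=5):
    """Count non-empty locations and return the most common ones."""
    counts = Counter(
        p["location"] for p in user_data.values() if p.get("location")
    )
    return counts.most_common(top_n)


users = {
    "alice": {"followers": 12, "location": "Belfast"},
    "bob": {"followers": 40, "location": "Belfast"},
    "carol": {"followers": 7, "location": None},
}
print(top_by(users, n=2))      # ['bob', 'alice']
print(location_counts(users))  # [('Belfast', 2)]
```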

Output Fields

Each user entry contains 30+ fields. See FIELDS.md for the full reference. A summary by category:

Category       Fields
Identity       login, name, company, location, email_public, blog, twitter, bio
Timestamps     created_at, updated_at
Counters       followers, following, public_repos, public_gists
Flags          has_public_email, has_blog, has_twitter, is_bot, hireable
Computed       account_age_days, followers_following_ratio, repos_per_year, recently_active, last_public_event_at
Organisations  public_orgs, orgs_public_count
Sampled        top_languages, total_public_stars_sampled, total_public_forks_sampled, ssh_keys_count, gpg_keys_count, starred_repos_sampled
Social         social_accounts (opt-in via include_social_accounts)
Repo-specific  is_collaborator, permission_on_repo
Metadata       roles (populated by get_users())
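
Several of the computed fields can be derived from the timestamps and counters alone. The sketch below shows one plausible way to do so; the exact formulas used by the package are not documented here, so the 365.25-day year and the None result for a zero `following` count are assumptions.

```python
from datetime import datetime, timezone


def derived_metrics(profile, now=None):
    """Compute age and ratio metrics from a profile dict (assumed formulas)."""
    now = now or datetime.now(timezone.utc)
    # GitHub timestamps end in "Z"; rewrite it to "+00:00" first on Python < 3.11
    created = datetime.fromisoformat(profile["created_at"])
    age_days = (now - created).days
    years = max(age_days / 365.25, 1e-9)  # guard against division by zero
    following = profile["following"]
    return {
        "account_age_days": age_days,
        "followers_following_ratio": (
            profile["followers"] / following if following else None
        ),
        "repos_per_year": round(profile["public_repos"] / years, 2),
    }


profile = {
    "created_at": "2020-01-01T00:00:00+00:00",
    "followers": 30,
    "following": 10,
    "public_repos": 8,
}
metrics = derived_metrics(profile, now=datetime(2024, 1, 1, tzinfo=timezone.utc))
print(metrics)
```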

Directories

repo-people/
├── repo_people/          # Package source
│   ├── __init__.py
│   ├── repo_people.py    # RepoPeople class — main pipeline
│   ├── export.py         # Role-specific username collectors (9 functions)
│   ├── users.py          # GitHubUserInfo wrapper and UserSnapshot dataclass
│   └── utils.py          # Shared helpers: paginate(), _headers(), write_csv()
├── tests/                # Unit and integration tests
│   ├── test_repo_people.py
│   ├── test_export.py
│   └── test_users.py
├── docs/                 # Sphinx documentation source
├── outputs/              # Default output directory (created at runtime)
├── FIELDS.md             # Full output field reference
├── CHANGELOG.md          # Version history
├── pyproject.toml        # Package metadata and dependencies
└── README.md

Issues

Any issues, errors or bugs can be raised via the Issues tab in the repository.

Contact

If you have any questions or comments, please contact amckenna41@qub.ac.uk or raise an issue on the Issues tab.

License

Distributed under the MIT License. See LICENSE for more details.

