Collect and export full GitHub user profile data for everyone associated with a repository.
repo-people
repo-people is a Python package that collects and exports the full GitHub profile for every person associated with a repository — contributors, maintainers, stargazers, watchers, issue/PR authors, fork owners, commit authors and dependents.
Table of Contents
- Introduction
- Background
- Requirements
- Installation
- Documentation
- Usage
- Directories
- Issues
- Contact
- License
Introduction
repo-people provides a single-call pipeline to collect every GitHub user associated with a repository across 9 role categories, fetch 30+ profile fields for each person from the GitHub API, and export the results to JSON, CSV, or Markdown. It is designed for research, open-source community analysis, and developer intelligence workflows.
Key capabilities:
- Collects users from 9 role categories in a single call
- Fetches 30+ profile fields per user (bio, location, company, followers, orgs, languages, …)
- Computes derived metrics: account age, followers/following ratio, repos/year, recently-active flag, bot detection
- Incremental fetch with `save_each_iteration` and `resume`: safe to interrupt and restart on large repos
- Flexible filtering: `roles`, `exclude`, `exclude_bots`, `limit`, `fields`
- Concurrent fetching via `workers`: uses `ThreadPoolExecutor` to fetch multiple profiles in parallel
- Async fetching via `get_users_async()`: uses `asyncio` + `aiohttp` for high-concurrency scenarios
- Opt-in social accounts via `include_social_accounts`: fetches linked LinkedIn, Mastodon, npm, and other accounts
- Export to JSON, CSV and Markdown table
- Analysis helpers: `summarise()` and `top_users()`
- Token validated on startup: invalid or expired tokens raise `ConnectionError` immediately
- Rate-limit progress printed every 50 users with remaining request count and reset time
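To make the derived metrics above concrete, here is a minimal sketch of how values such as account age, followers/following ratio, and repos/year could be computed from raw profile fields. The helper name `derived_metrics`, its rounding, and its zero-division handling are assumptions for illustration, not the package's actual implementation.

```python
from datetime import datetime, timezone


def derived_metrics(created_at, followers, following, public_repos, now=None):
    """Compute illustrative derived metrics for one profile (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    age_days = (now - created_at).days
    # Avoid division by zero for accounts that follow nobody.
    ratio = followers / following if following else float(followers)
    # Treat brand-new accounts as at least one day old.
    years = max(age_days, 1) / 365.25
    return {
        "account_age_days": age_days,
        "followers_following_ratio": round(ratio, 2),
        "repos_per_year": round(public_repos / years, 2),
    }


metrics = derived_metrics(
    created_at=datetime(2020, 1, 1, tzinfo=timezone.utc),
    followers=120,
    following=40,
    public_repos=30,
    now=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
print(metrics)
```

Passing `now` explicitly keeps the calculation reproducible, which is convenient when comparing snapshots taken on different dates.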
Background
Understanding who contributes to, uses, and maintains an open-source project is valuable for community health analysis, academic research, and competitive intelligence. GitHub exposes this information across many endpoints (contributors, stargazers, watchers, forks, issues, pull requests, CODEOWNERS, commit history), but collecting and joining it requires many paginated API calls.
repo-people automates that collection, deduplicates users across all roles, enriches each record with the full GitHub profile, and computes additional signals (account age, activity recency, bot detection) in a single pipeline call.
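The deduplication step can be pictured as merging per-role username lists into one mapping from user to roles. The sketch below is an illustration only: `merge_roles` is a hypothetical helper, and lower-casing logins is an assumption based on GitHub logins being case-insensitive.

```python
from collections import defaultdict


def merge_roles(role_to_users):
    """Map each username to the sorted list of roles it appears under."""
    merged = defaultdict(set)
    for role, usernames in role_to_users.items():
        for name in usernames:
            # Normalise case so "Alice" and "alice" collapse into one person.
            merged[name.lower()].add(role)
    return {name: sorted(roles) for name, roles in merged.items()}


collected = {
    "contributors": ["alice", "bob"],
    "stargazers": ["Alice", "carol"],
    "issue_authors": ["bob"],
}
people = merge_roles(collected)
print(people)
```

Each person then needs only one profile fetch, regardless of how many roles they hold.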
Requirements
- Python ^3.9
- PyGithub ^2.0.0 — GitHub API client
- requests ^2.31.0 — HTTP requests for REST endpoints
- beautifulsoup4 ^4.12.0 — HTML scraping for dependents
- aiohttp ^3.9 — async HTTP client for `get_users_async()`
A GitHub personal access token is strongly recommended. Unauthenticated requests are limited to 60/hour; authenticated requests allow 5,000/hour.
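As a rough guide to what these limits mean in practice, the sketch below estimates how many hourly rate-limit windows a run would span. The function `windows_needed` and the `calls_per_user` figure are assumptions for illustration; the number of API calls the package actually makes per user is not stated here.

```python
import math


def windows_needed(num_users, calls_per_user=1, limit_per_hour=5000):
    """Number of hourly rate-limit windows the requests would span."""
    total_calls = num_users * calls_per_user
    return math.ceil(total_calls / limit_per_hour)


print(windows_needed(12000))                     # authenticated
print(windows_needed(12000, limit_per_hour=60))  # unauthenticated
print(windows_needed(12000, calls_per_user=2))   # e.g. one extra call per user
```

For a repository with 12,000 associated users, the unauthenticated limit would stretch the same work across 200 windows instead of 3.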
Installation
Install the latest version of repo-people from PyPI using pip:
pip3 install repo-people --upgrade
Installation from source:
git clone -b main https://github.com/amckenna41/repo-people.git
cd repo-people
pip3 install .
Documentation
- Read the Docs — full package documentation
- FIELDS.md — full reference table of all 48 output fields with descriptions
- CHANGELOG.md — version history and release notes
Usage
Quick Start
How to get a GitHub Personal Access Token
- Sign in to github.com and go to Settings → Developer settings → Personal access tokens → Tokens (classic).
- Click Generate new token (classic).
- Give the token a descriptive name and set an expiration date.
- Select the following scopes:
  - `repo`: read access to repository metadata, contributors, and collaborators
  - `read:user`: read user profile data
  - `read:org`: read organisation membership (needed for `public_orgs`)
- Click Generate token and copy it immediately — it won't be shown again.
- Store it securely (e.g. in an environment variable or a secrets manager) and pass it via the `token` parameter:
import os
from repo_people import RepoPeople
rp = RepoPeople("owner", "repo", token=os.environ["GITHUB_TOKEN"])
Tip: Unauthenticated requests are limited to 60/hour. Authenticated requests allow 5,000/hour, making a token essential for any non-trivial repo.
from repo_people import RepoPeople
rp = RepoPeople("owner", "repo", token="ghp_...")
user_data = rp.get_users(export=True)
# Returns a dict keyed by username, with 30+ profile fields per user
Authentication
import os
rp = RepoPeople("owner", "repo", token=os.environ["GITHUB_TOKEN"])
The token is validated immediately on construction — an invalid or expired token raises ConnectionError before any collection begins.
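Note that `os.environ["GITHUB_TOKEN"]` raises `KeyError` when the variable is unset. A small wrapper can return `None` instead, which matches the constructor's documented default for `token`. The helper `resolve_token` below is hypothetical and not part of the package.

```python
import os


def resolve_token(env=None, var="GITHUB_TOKEN"):
    """Return the token from the environment, or None if unset or blank."""
    env = os.environ if env is None else env
    token = env.get(var, "").strip()
    return token or None


token = resolve_token()
if token is None:
    print("No token found; requests would be unauthenticated (60/hour).")
# The result could then be passed on, e.g. RepoPeople("owner", "repo", token=token)
```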
RepoPeople() Constructor
RepoPeople(owner, repo, token=None, outdir=None, skip_codeowners=False, skip_collaborators=False)
| Parameter | Type | Default | Description |
|---|---|---|---|
| `owner` | `str` | (required) | GitHub username or organisation that owns the repo. |
| `repo` | `str` | (required) | Repository name. |
| `token` | `str \| None` | `None` | Personal access token. Strongly recommended. Validated immediately on init; raises `ConnectionError` for invalid tokens. |
| `outdir` | `str \| None` | `"{owner}_{repo}"` | Leaf directory inside `outputs/`. All output files are written under `outputs/{outdir}/`. |
| `skip_codeowners` | `bool` | `False` | Skip CODEOWNERS file when collecting maintainers. |
| `skip_collaborators` | `bool` | `False` | Skip repo collaborators when collecting maintainers. |
get_users() Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `export` | `bool` | `False` | Write results to a JSON file. |
| `export_csv` | `bool` | `False` | Write results to a CSV file. |
| `save_each_iteration` | `bool` | `False` | Save after every single user fetch. |
| `limit` | `int \| None` | `None` | Cap the number of profiles to fetch. |
| `roles` | `list[str] \| None` | `None` (all 9) | Restrict which roles to collect. |
| `exclude` | `list[str] \| None` | `None` | Usernames to skip. |
| `exclude_bots` | `bool` | `False` | Skip bot accounts automatically. |
| `resume` | `bool` | `False` | Skip users already in the output file. |
| `verbose` | `bool` | `True` | Print progress to stdout. |
| `fields` | `list[str] \| str \| None` | `None` (all) | Restrict which fields appear in output. Invalid names raise `ValueError` before any fetch. |
| `include_social_accounts` | `bool` | `False` | Fetch each user's linked social accounts (LinkedIn, Mastodon, npm, …). Costs one extra API call per user. |
| `workers` | `int` | `1` | Number of concurrent fetch threads. Increase for faster collection on large repos. |
Valid roles values: contributors, maintainers, stargazers, watchers, issue_authors, pr_authors, fork_owners, commit_authors, dependents.
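When building a `roles` list programmatically, it can help to check it against the nine valid names before starting a long run. The sketch below is an assumption about how such a check might look: `check_roles` is a hypothetical helper, and whether the package itself raises `ValueError` for unknown roles is not stated here.

```python
VALID_ROLES = (
    "contributors", "maintainers", "stargazers", "watchers", "issue_authors",
    "pr_authors", "fork_owners", "commit_authors", "dependents",
)


def check_roles(roles=None):
    """Return the roles to collect, rejecting any unknown name."""
    if roles is None:
        return list(VALID_ROLES)  # None means all nine roles
    unknown = sorted(set(roles) - set(VALID_ROLES))
    if unknown:
        raise ValueError(f"Unknown role(s): {', '.join(unknown)}")
    return list(roles)


print(check_roles(["contributors", "stargazers"]))
```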
Examples
Filter by role
# Only gather contributors and stargazers
user_data = rp.get_users(roles=["contributors", "stargazers"])
Limit, exclude, and skip bots
user_data = rp.get_users(
limit=100,
exclude=["dependabot", "github-actions[bot]"],
exclude_bots=True,
)
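The combined effect of `exclude`, `exclude_bots`, and `limit` can be sketched as a simple filter over the collected usernames. This is an illustration under stated assumptions: `filter_usernames` is hypothetical, the `[bot]` suffix check is only one possible heuristic (GitHub App accounts use that suffix), and the order in which the package applies these filters is not documented here.

```python
def filter_usernames(usernames, exclude=(), exclude_bots=False, limit=None):
    """Drop excluded and bot-like usernames, then cap the result."""
    skip = {name.lower() for name in exclude}
    kept = []
    for name in usernames:
        if name.lower() in skip:
            continue
        # Assumed heuristic: GitHub App accounts have logins ending in "[bot]".
        if exclude_bots and name.lower().endswith("[bot]"):
            continue
        kept.append(name)
    return kept if limit is None else kept[:limit]


names = ["alice", "dependabot", "renovate[bot]", "bob", "carol"]
print(filter_usernames(names, exclude=["dependabot"], exclude_bots=True, limit=2))
```

Applying the cap last means `limit` counts only users that survive the other filters.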
Export to JSON and CSV
user_data = rp.get_users(export=True, export_csv=True)
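For post-processing outside the package, the same username-keyed dict can be written to JSON and CSV with the standard library. The sketch below is not the package's exporter: `export_records`, the `stem` parameter, and the file names are hypothetical.

```python
import csv
import json
from pathlib import Path


def export_records(user_data, outdir, stem="users"):
    """Write a username-keyed dict of records to JSON and CSV files."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    json_path = out / f"{stem}.json"
    csv_path = out / f"{stem}.csv"
    json_path.write_text(json.dumps(user_data, indent=2), encoding="utf-8")
    # The CSV header is the union of keys across records, in first-seen order.
    columns = []
    for record in user_data.values():
        for key in record:
            if key not in columns:
                columns.append(key)
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(user_data.values())
    return json_path, csv_path
```

Records with missing keys get empty cells (`restval=""`), so sparse profiles still line up under a single header.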
Export to Markdown table
rp.export_to_markdown(user_data, fields=["login", "name", "location", "followers"])
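To show the kind of table this produces, here is a minimal sketch that renders selected fields as a Markdown table. The function `to_markdown` is hypothetical, and the package's actual output layout may differ.

```python
def to_markdown(user_data, fields):
    """Render the chosen fields of each record as a Markdown table."""
    lines = [
        "| " + " | ".join(fields) + " |",
        "|" + "---|" * len(fields),
    ]
    for record in user_data.values():
        cells = []
        for field in fields:
            value = record.get(field)
            text = "" if value is None else str(value)
            # Escape pipes so cell text cannot break the table layout.
            cells.append(text.replace("|", "\\|"))
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


sample = {"alice": {"login": "alice", "followers": 10, "location": None}}
print(to_markdown(sample, ["login", "location", "followers"]))
```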
Resume an interrupted run
# First run
rp.get_users(save_each_iteration=True, export=True)
# Resume after interruption
rp.get_users(save_each_iteration=True, export=True, resume=True)
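The save-and-resume pattern can be sketched as follows: load whatever was saved previously, skip those users, and persist after each new fetch. This is a simplified illustration, not the package's code; `fetch_with_resume` and the single-JSON-file layout are assumptions.

```python
import json
from pathlib import Path


def fetch_with_resume(usernames, fetch, path):
    """Fetch each user once, saving after every fetch and skipping saved users."""
    path = Path(path)
    results = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    for name in usernames:
        if name in results:
            continue  # already fetched in an earlier run
        results[name] = fetch(name)
        # Persist after every user so an interruption loses at most one fetch.
        path.write_text(json.dumps(results), encoding="utf-8")
    return results
```

Rewriting the whole file on every iteration is simple but grows costly for very large result sets; an append-only format would be one alternative.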
Concurrent fetching
# Speed up large repos by fetching profiles in parallel
user_data = rp.get_users(workers=4)
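The idea behind `workers` can be illustrated with `ThreadPoolExecutor` from the standard library. In the sketch below, `fetch_all` and the stand-in `fake_fetch` are hypothetical; a real fetch function would call the GitHub API.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(usernames, fetch, workers=4):
    """Fetch profiles in parallel threads and return them keyed by username."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, even if calls finish out of order.
        profiles = list(pool.map(fetch, usernames))
    return dict(zip(usernames, profiles))


def fake_fetch(name):
    # Stand-in for a network call; returns a minimal profile record.
    return {"login": name}


print(fetch_all(["alice", "bob", "carol"], fake_fetch, workers=2))
```

Threads suit this workload because each fetch spends most of its time waiting on the network. More workers will not raise the hourly rate limit, only reach it sooner.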
Async fetching
import asyncio
user_data = asyncio.run(rp.get_users_async(concurrency=10))
Include social accounts
user_data = rp.get_users(include_social_accounts=True)
# Each record gains a 'social_accounts' dict, e.g. {'linkedin': 'https://linkedin.com/in/...'}
Dot-notation field access
get_users() returns a UserDataView — a plain dict subclass that additionally supports dot notation to extract a single field across every user at once:
user_data = rp.get_users()
# Extract one field for all users
emails = user_data.email_public
# {"alice": {"email_public": "alice@example.com"}, "bob": {"email_public": ""}, ...}
locations = user_data.location
followers = user_data.followers
roles = user_data.roles
All standard dict operations still work unchanged. Accessing an unrecognised field name raises AttributeError listing the valid field names.
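The behaviour described above can be reproduced with a small `dict` subclass that overrides `__getattr__`. The class below, `FieldView`, is a hypothetical illustration of the technique rather than the package's `UserDataView`.

```python
class FieldView(dict):
    """Dict of user records that also exposes per-field views via attributes."""

    def __getattr__(self, field):
        # __getattr__ runs only when normal lookup fails, so dict methods still work.
        known = sorted({key for record in self.values() for key in record})
        if field not in known:
            raise AttributeError(
                f"Unknown field {field!r}; valid fields: {', '.join(known)}"
            )
        return {login: {field: record.get(field)} for login, record in self.items()}


view = FieldView({
    "alice": {"location": "Belfast", "followers": 10},
    "bob": {"location": "", "followers": 2},
})
print(view.location)
print(view["alice"])
```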
Analysis helpers
stats = rp.summarise(user_data, top_n=5)
# {'total': 134, 'top_locations': [('San Francisco', 18), ...], ...}
leaders = rp.top_users(user_data, n=10, by="followers")
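Helpers like these amount to counting and sorting over the returned records. A rough standard-library equivalent is sketched below; `top_locations` and `top_by` are hypothetical names, and the real helpers may return more detail.

```python
from collections import Counter


def top_locations(user_data, top_n=5):
    """Most common non-empty locations as (location, count) pairs."""
    counts = Counter(
        record["location"] for record in user_data.values() if record.get("location")
    )
    return counts.most_common(top_n)


def top_by(user_data, n=10, by="followers"):
    """Logins of the n users with the highest value for the given field."""
    ranked = sorted(user_data.values(), key=lambda r: r.get(by) or 0, reverse=True)
    return [record["login"] for record in ranked[:n]]


people = {
    "alice": {"login": "alice", "location": "Belfast", "followers": 10},
    "bob": {"login": "bob", "location": "Belfast", "followers": 50},
    "carol": {"login": "carol", "location": "", "followers": 5},
}
print(top_locations(people))
print(top_by(people, n=2))
```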
Output Fields
Each user entry contains 30+ fields. See FIELDS.md for the full reference. A summary by category:
| Category | Fields |
|---|---|
| Identity | login, name, company, location, email_public, blog, twitter, bio |
| Timestamps | created_at, updated_at |
| Counters | followers, following, public_repos, public_gists |
| Flags | has_public_email, has_blog, has_twitter, is_bot, hireable |
| Computed | account_age_days, followers_following_ratio, repos_per_year, recently_active, last_public_event_at |
| Organisations | public_orgs, orgs_public_count |
| Sampled | top_languages, total_public_stars_sampled, total_public_forks_sampled, ssh_keys_count, gpg_keys_count, starred_repos_sampled |
| Social | social_accounts (opt-in via include_social_accounts) |
| Repo-specific | is_collaborator, permission_on_repo |
| Metadata | roles (populated by get_users()) |
Directories
repo-people/
├── repo_people/ # Package source
│ ├── __init__.py
│ ├── repo_people.py # RepoPeople class — main pipeline
│ ├── export.py # Role-specific username collectors (9 functions)
│ ├── users.py # GitHubUserInfo wrapper and UserSnapshot dataclass
│ └── utils.py # Shared helpers: paginate(), _headers(), write_csv()
├── tests/ # Unit and integration tests
│ ├── test_repo_people.py
│ ├── test_export.py
│ └── test_users.py
├── docs/ # Sphinx documentation source
├── outputs/ # Default output directory (created at runtime)
├── FIELDS.md # Full output field reference
├── CHANGELOG.md # Version history
├── pyproject.toml # Package metadata and dependencies
└── README.md
Issues
Any issues, errors or bugs can be raised via the Issues tab in the repository.
Contact
If you have any questions or comments, please contact amckenna41@qub.ac.uk or raise an issue on the Issues tab.
License
Distributed under the MIT License. See LICENSE for more details.