Collect and export full GitHub user profile data for everyone associated with a repository.
repo-people
repo-people is a Python package that collects and exports the full GitHub profile for every person associated with a repository — contributors, maintainers, stargazers, watchers, issue/PR authors, fork owners, commit authors and dependents.
Table of Contents
- Introduction
- Background
- Requirements
- Installation
- Documentation
- Usage
- Directories
- Issues
- License
- Contact
Introduction
repo-people provides a single-call pipeline to collect every GitHub user associated with a repository across 9 role categories, fetch 30+ profile fields for each person from the GitHub API, and export the results to JSON, CSV, or Markdown. It is designed for research, open-source community analysis, and developer intelligence workflows.
Key capabilities:
- Collects users from 9 role categories in a single call
- Fetches 30+ profile fields per user (bio, location, company, followers, orgs, languages, …)
- Computes derived metrics: account age, followers/following ratio, repos/year, recently-active flag, bot detection
- Incremental fetch with save_each_iteration and resume, safe to interrupt and restart on large repos
- Flexible filtering: roles, exclude, exclude_bots, limit, fields
- Concurrent fetching via workers, which uses ThreadPoolExecutor to fetch multiple profiles in parallel
- Async fetching via get_users_async(), which uses asyncio + aiohttp for high-concurrency scenarios
- Opt-in social accounts via include_social_accounts, which fetches linked LinkedIn, Mastodon, npm, and other accounts
- Export to JSON, CSV and Markdown table
- Analysis helpers: summarise() and top_users()
- Token validated on startup: invalid or expired tokens raise ConnectionError immediately
- Rate-limit progress printed every 50 users with remaining request count and reset time
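The derived metrics in the list above can be approximated from raw profile fields. The following is a minimal sketch only: `derive_metrics` is a hypothetical helper and its formulas are assumptions for illustration, not the package's documented definitions. The output keys reuse the field names from the Output Fields table.

```python
from datetime import datetime, timezone


def derive_metrics(created_at, followers, following, public_repos, now=None):
    """Hypothetical helper: illustrative formulas, not the package's own."""
    now = now or datetime.now(timezone.utc)
    age_days = (now - created_at).days
    # Avoid division by zero for accounts that follow nobody.
    ratio = followers / following if following else None
    # Convert the account age to years before normalising the repo count.
    years = age_days / 365.25
    repos_per_year = public_repos / years if years > 0 else None
    return {
        "account_age_days": age_days,
        "followers_following_ratio": ratio,
        "repos_per_year": repos_per_year,
    }
```

Returning None for undefined ratios keeps the exported records free of infinities, which matters for CSV and JSON output.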
Background
Understanding who contributes to, uses, and maintains an open-source project is valuable for community health analysis, academic research, and competitive intelligence. GitHub exposes this information across many endpoints (contributors, stargazers, watchers, forks, issues, pull requests, CODEOWNERS, commit history), but collecting and joining it requires many paginated API calls.
repo-people automates that collection, deduplicates users across all roles, enriches each record with the full GitHub profile, and computes additional signals (account age, activity recency, bot detection) in a single pipeline call.
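The deduplication step can be pictured as inverting a role-to-users mapping into a user-to-roles mapping. This is a sketch under assumptions: `merge_roles` is a hypothetical name, and lowercasing logins (GitHub usernames are case-insensitive) is an illustrative choice rather than confirmed package behaviour.

```python
from collections import defaultdict


def merge_roles(role_to_users):
    """Hypothetical helper: collapse per-role login lists into one record per user."""
    merged = defaultdict(set)
    for role, logins in role_to_users.items():
        for login in logins:
            # Normalise case so "Alice" and "alice" count as the same user.
            merged[login.lower()].add(role)
    # Sort the roles so the output is deterministic.
    return {login: sorted(roles) for login, roles in merged.items()}
```

Each user then needs only one profile fetch, however many roles they hold.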
Requirements
- Python ^3.9
- PyGithub ^2.0.0 — GitHub API client
- requests ^2.31.0 — HTTP requests for REST endpoints
- beautifulsoup4 ^4.12.0 — HTML scraping for dependents
- aiohttp ^3.9 — async HTTP client for get_users_async()
A GitHub personal access token is strongly recommended. Unauthenticated requests are limited to 60/hour; authenticated requests allow 5,000/hour.
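To see what these limits mean in practice, the sketch below estimates how many hourly rate-limit windows a run needs. `windows_needed` is a hypothetical helper, and the request count is an input you must estimate yourself; the package may issue more than one request per user, so treat the result as a lower bound.

```python
import math

AUTHENTICATED_LIMIT = 5000  # requests per hour with a token
UNAUTHENTICATED_LIMIT = 60  # requests per hour without a token


def windows_needed(n_requests, authenticated=True):
    """Hypothetical helper: hourly rate-limit windows needed for n_requests."""
    limit = AUTHENTICATED_LIMIT if authenticated else UNAUTHENTICATED_LIMIT
    return math.ceil(n_requests / limit)
```

For 10,000 requests this gives 2 windows with a token and 167 without one.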
Installation
Install the latest version of repo-people from PyPI using pip:
pip3 install repo-people --upgrade
Installation from source (substitute the repo-people repository URL):
git clone <repository-url>
cd repo-people
pip3 install .
Documentation
- Read the Docs — full package documentation
- FIELDS.md — full reference table of all 48 output fields with descriptions
- CHANGELOG.md — version history and release notes
Usage
Quick Start
from repo_people import RepoPeople
rp = RepoPeople("owner", "repo", token="ghp_...")
user_data = rp.get_users(export=True)
# Returns a dict keyed by username, with 30+ profile fields per user
Authentication
import os
rp = RepoPeople("owner", "repo", token=os.environ["GITHUB_TOKEN"])
The token is validated immediately on construction — an invalid or expired token raises ConnectionError before any collection begins.
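Reading the token through a small helper gives a clearer error than a bare KeyError when the variable is missing. This is a sketch only: `load_token` is a hypothetical helper and is not part of repo-people.

```python
import os


def load_token(var="GITHUB_TOKEN"):
    """Hypothetical helper: read a token from the environment or fail clearly."""
    token = os.environ.get(var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {var} environment variable to a GitHub personal access token"
        )
    return token


# Usage with the constructor shown above:
# rp = RepoPeople("owner", "repo", token=load_token())
```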
RepoPeople() Constructor
RepoPeople(owner, repo, token=None, outdir=None, skip_codeowners=False, skip_collaborators=False)
| Parameter | Type | Default | Description |
|---|---|---|---|
| owner | str | (required) | GitHub username or organisation that owns the repo. |
| repo | str | (required) | Repository name. |
| token | str \| None | None | Personal access token. Strongly recommended. Validated immediately on init; raises ConnectionError for invalid tokens. |
| outdir | str \| None | "{owner}_{repo}" | Leaf directory inside outputs/. All output files are written under outputs/{outdir}/. |
| skip_codeowners | bool | False | Skip CODEOWNERS file when collecting maintainers. |
| skip_collaborators | bool | False | Skip repo collaborators when collecting maintainers. |
get_users() Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| export | bool | False | Write results to a JSON file. |
| export_csv | bool | False | Write results to a CSV file. |
| save_each_iteration | bool | False | Save after every single user fetch. |
| limit | int \| None | None | Cap the number of profiles to fetch. |
| roles | list[str] \| None | None (all 9) | Restrict which roles to collect. |
| exclude | list[str] \| None | None | Usernames to skip. |
| exclude_bots | bool | False | Skip bot accounts automatically. |
| resume | bool | False | Skip users already in the output file. |
| verbose | bool | True | Print progress to stdout. |
| fields | list[str] \| str \| None | None (all) | Restrict which fields appear in output. Invalid names raise ValueError before any fetch. |
| include_social_accounts | bool | False | Fetch each user's linked social accounts (LinkedIn, Mastodon, npm, …). Costs one extra API call per user. |
| workers | int | 1 | Number of concurrent fetch threads. Increase for faster collection on large repos. |
Valid roles values: contributors, maintainers, stargazers, watchers, issue_authors, pr_authors, fork_owners, commit_authors, dependents.
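A caller can check role names against this list before starting a long run. The sketch below is a hypothetical pre-check: `validate_roles` is not part of the package, and whether get_users() itself rejects unknown role names is not stated here.

```python
VALID_ROLES = {
    "contributors", "maintainers", "stargazers", "watchers", "issue_authors",
    "pr_authors", "fork_owners", "commit_authors", "dependents",
}


def validate_roles(roles=None):
    """Hypothetical pre-check: return the roles to collect or raise ValueError."""
    if roles is None:
        # None means "all roles", mirroring the documented default.
        return sorted(VALID_ROLES)
    unknown = sorted(set(roles) - VALID_ROLES)
    if unknown:
        raise ValueError(f"Unknown roles: {unknown}")
    return list(roles)
```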
Examples
Filter by role
# Only gather contributors and stargazers
user_data = rp.get_users(roles=["contributors", "stargazers"])
Limit, exclude, and skip bots
user_data = rp.get_users(
limit=100,
exclude=["dependabot", "github-actions[bot]"],
exclude_bots=True,
)
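For intuition about what a bot filter has to catch, here is one simple heuristic. It is an assumption for illustration, not the package's detection logic: `looks_like_bot` is a hypothetical function that checks the account type reported by the GitHub API ("Bot" for app accounts) and the "[bot]" login suffix.

```python
def looks_like_bot(login, account_type=None):
    """Hypothetical heuristic: flag app accounts and logins ending in "[bot]"."""
    return account_type == "Bot" or login.lower().endswith("[bot]")
```

A heuristic like this misses automation that runs under an ordinary user account, which is why an explicit exclude list remains useful alongside exclude_bots.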
Export to JSON and CSV
user_data = rp.get_users(export=True, export_csv=True)
Export to Markdown table
rp.export_to_markdown(user_data, fields=["login", "name", "location", "followers"])
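The shape of such a table can be sketched in a few lines. `to_markdown` below is a hypothetical stand-in, not the implementation of export_to_markdown(); it assumes user_data is a dict of per-user dicts and does not escape pipe characters inside values.

```python
def to_markdown(user_data, fields):
    """Hypothetical sketch: render selected fields as a Markdown pipe table."""
    header = "| " + " | ".join(fields) + " |"
    divider = "|" + "---|" * len(fields)
    rows = []
    for user in user_data.values():
        # Render missing or None values as empty cells.
        cells = ["" if user.get(f) is None else str(user[f]) for f in fields]
        rows.append("| " + " | ".join(cells) + " |")
    return "\n".join([header, divider, *rows])
```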
Resume an interrupted run
# First run
rp.get_users(save_each_iteration=True, export=True)
# Resume after interruption
rp.get_users(save_each_iteration=True, export=True, resume=True)
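Conceptually, resuming means skipping every username that already appears in the saved output. The sketch below assumes the output is a JSON object keyed by username, as get_users() returns; `pending_users` and the file path are hypothetical, and the package's own resume logic may differ.

```python
import json
from pathlib import Path


def pending_users(all_logins, output_path):
    """Hypothetical sketch: logins that still need fetching after an interruption."""
    path = Path(output_path)
    # Iterating a decoded JSON object yields its keys, i.e. the saved usernames.
    done = set(json.loads(path.read_text())) if path.exists() else set()
    return [login for login in all_logins if login not in done]
```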
Concurrent fetching
# Speed up large repos by fetching profiles in parallel
user_data = rp.get_users(workers=4)
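The thread-pool pattern behind a workers option can be sketched as follows. This is illustrative only: `fetch_profile` is a stub standing in for a real API call, and `fetch_all` is a hypothetical name rather than the package's internal function.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_profile(login):
    # Stub standing in for a network call to the GitHub API.
    return {"login": login}


def fetch_all(logins, workers=4):
    """Hypothetical sketch: fetch profiles in parallel threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, even if calls finish out of order.
        return {p["login"]: p for p in pool.map(fetch_profile, logins)}
```

Threads suit this workload because each fetch spends most of its time waiting on the network. More workers also consume the hourly rate limit faster.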
Async fetching
import asyncio
user_data = asyncio.run(rp.get_users_async(concurrency=10))
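A concurrency cap like this is commonly implemented with a semaphore. The sketch below uses only the standard library: `fetch_profile_async` is a stub in place of a real aiohttp request, and `fetch_all_async` is a hypothetical name, not the body of get_users_async().

```python
import asyncio


async def fetch_profile_async(login):
    # Stub standing in for an HTTP request; yields control like real I/O would.
    await asyncio.sleep(0)
    return {"login": login}


async def fetch_all_async(logins, concurrency=10):
    """Hypothetical sketch: run at most `concurrency` fetches at a time."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(login):
        async with sem:
            return await fetch_profile_async(login)

    # gather returns results in the order the awaitables were passed.
    profiles = await asyncio.gather(*(bounded(login) for login in logins))
    return {p["login"]: p for p in profiles}
```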
Include social accounts
user_data = rp.get_users(include_social_accounts=True)
# Each record gains a 'social_accounts' dict, e.g. {'linkedin': 'https://linkedin.com/in/...'}
Analysis helpers
stats = rp.summarise(user_data, top_n=5)
# {'total': 134, 'top_locations': [('San Francisco', 18), ...], ...}
leaders = rp.top_users(user_data, n=10, by="followers")
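Because user_data is a plain dict keyed by username, similar rankings can be computed by hand when a custom metric is needed. The helpers below are hypothetical sketches, not the implementations of summarise() or top_users().

```python
from collections import Counter


def top_by(user_data, field, n=10):
    """Hypothetical sketch: logins of the n users with the largest `field` value."""
    # Treat missing or None values as 0 so they sort last.
    ranked = sorted(user_data.values(), key=lambda u: u.get(field) or 0, reverse=True)
    return [u["login"] for u in ranked[:n]]


def top_locations(user_data, n=5):
    """Hypothetical sketch: the n most common non-empty locations with counts."""
    counts = Counter(u["location"] for u in user_data.values() if u.get("location"))
    return counts.most_common(n)
```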
Output Fields
Each user entry contains 30+ fields. See FIELDS.md for the full reference. A summary by category:
| Category | Fields |
|---|---|
| Identity | login, name, company, location, email_public, blog, twitter, bio |
| Timestamps | created_at, updated_at |
| Counters | followers, following, public_repos, public_gists |
| Flags | has_public_email, has_blog, has_twitter, is_bot, hireable |
| Computed | account_age_days, followers_following_ratio, repos_per_year, recently_active, last_public_event_at |
| Organisations | public_orgs, orgs_public_count |
| Sampled | top_languages, total_public_stars_sampled, total_public_forks_sampled, ssh_keys_count, gpg_keys_count, starred_repos_sampled |
| Social | social_accounts (opt-in via include_social_accounts) |
| Repo-specific | is_collaborator, permission_on_repo |
| Metadata | roles (populated by get_users()) |
Directories
repo-people/
├── repo_people/ # Package source
│ ├── __init__.py
│ ├── repo_people.py # RepoPeople class — main pipeline
│ ├── export.py # Role-specific username collectors (9 functions)
│ ├── users.py # GitHubUserInfo wrapper and UserSnapshot dataclass
│ └── utils.py # Shared helpers: paginate(), _headers(), write_csv()
├── tests/ # Unit and integration tests
│ ├── test_repo_people.py
│ ├── test_export.py
│ └── test_users.py
├── docs/ # Sphinx documentation source
├── outputs/ # Default output directory (created at runtime)
├── FIELDS.md # Full output field reference
├── CHANGELOG.md # Version history
├── pyproject.toml # Package metadata and dependencies
└── README.md
Issues
Bugs and feature requests are tracked on GitHub Issues.
When reporting a bug, please include:
- Python version (python --version)
- Package version (pip show repo-people)
- A minimal code snippet that reproduces the issue
- The full traceback if an exception is raised
License
Distributed under the MIT License. See MIT for more details.
Contact
AJ McKenna — amckenna41@qub.ac.uk