Discover the geography of open-source software. Explore the geographic locations of software developers associated with a GitHub repository or a Python (PyPI) package.
Reason this release was yanked:
Initial PyPI upload had a big bug; we recommend avoiding this version.
GitGeo
See, for instance, the geography of the contributors to the Python package requests.
Why use GitGeo?
Curiosity
Open source software community management
Research on open source software ecosystems
IT security compliance
Installation
git clone https://github.com/IQTLabs/GitGeo
cd GitGeo
python setup.py develop
Usage
(requires internet connection)
First, create one or more GitHub personal access tokens.
Second, run these commands in the command line to set environment variables:
export GITHUB_USERNAME='[github_username]'
export GITHUB_TOKEN='[github_token]'
Alternatively, to use multiple tokens, create a file called tokens.txt in the code’s directory and enter a GitHub personal access token on each line.
Third, run these commands in the command line:
gitgeo --package [package_name]
gitgeo --repo [github_repo_url]
For example:
>>> gitgeo --package requests
-----------------
PACKAGE: requests
-----------------
CONTRIBUTOR, LOCATION
* indicates PyPI maintainer
---------------------
kennethreitz42 | Virginia, USA
Lukasa * | London, England
sigmavirus24 | Madison, WI
nateprewitt * | None
slingamn | None
BraulioVM | Malaga & Granada, Spain
dpursehouse | Kawasaki
jgorset | Oslo, Norway
...
Or:
>>> gitgeo --repo www.github.com/psf/requests
-----------------
GITHUB REPO: psf/requests
-----------------
CONTRIBUTOR, LOCATION
---------------------
kennethreitz42 | Virginia, USA | United States
Lukasa | London, England | United Kingdom
sigmavirus24 | Madison, WI | United States
nateprewitt | None | None
...
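For scripts that need the same credentials as the setup steps above, the following is a minimal sketch of reading them in Python. The function name load_github_tokens is hypothetical, and the lookup order (tokens.txt first, then GITHUB_TOKEN) is an assumption for illustration; it is not necessarily how GitGeo itself resolves tokens.

```python
import os
from pathlib import Path


def load_github_tokens(token_file="tokens.txt", env=None):
    """Return a list of GitHub personal access tokens.

    Assumption: tokens.txt (one token per line) takes precedence over the
    GITHUB_TOKEN environment variable. GitGeo's real lookup order may differ.
    """
    env = os.environ if env is None else env
    path = Path(token_file)
    if path.is_file():
        # One token per line; blank lines are ignored.
        tokens = [line.strip() for line in path.read_text().splitlines() if line.strip()]
        if tokens:
            return tokens
    token = env.get("GITHUB_TOKEN")
    if token:
        return [token]
    raise RuntimeError("No GitHub token found: set GITHUB_TOKEN or create tokens.txt")
```

Passing env explicitly makes the function easy to exercise without touching the real environment.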
There are other command line options too:
Add --summary to get the results summarized by country, e.g.:
>>> gitgeo --package requests --summary
-----------------
PACKAGE: requests
GITHUB REPO: psf/requests
-----------------
COUNTRY | # OF CONTRIBUTORS
---------------------------
United States 37
None 23
United Kingdom 4
Canada 4
Germany 4
Switzerland 4
Spain 2
Russia 2
...
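The per-country summary shown above amounts to a simple tally. The sketch below illustrates the idea with collections.Counter; the function name summarize_by_country and the (username, location, country) tuple layout are assumptions made for this example, not GitGeo's internal data model.

```python
from collections import Counter


def summarize_by_country(contributors):
    """Count contributors per country, most common first.

    `contributors` is an iterable of (username, location, country) tuples.
    A missing country (None) is reported under the label "None".
    """
    counts = Counter(str(country) for _, _, country in contributors)
    return counts.most_common()


# Rows taken from the --repo example output in this README.
sample = [
    ("kennethreitz42", "Virginia, USA", "United States"),
    ("Lukasa", "London, England", "United Kingdom"),
    ("sigmavirus24", "Madison, WI", "United States"),
    ("nateprewitt", None, None),
]
for country, count in summarize_by_country(sample):
    print(f"{country} {count}")
```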
Add --map when using the --repo option to create an HTML map saved in the results folder. See the image above for a static example. The real map supports zooming and tooltips.
Add --output_csv to output a csv of the results to the results folder.
To create a csv of contributors from many repositories, enter repositories on separate lines in the repos.txt file. Then use the --multirepo flag.
Add --multirepo_map and then a filename to create a map of csv output. The csv output must be located in the results folder.
Add --num and specify a multiple of 100 from 100 (default) to 500 to specify the number of contributors analyzed per repo.
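The constraint on --num (a multiple of 100 between 100 and 500, defaulting to 100) can be expressed directly in an argument parser. The following is a hedged sketch using argparse; build_parser and the program name gitgeo-sketch are hypothetical, and GitGeo's own argument handling may be implemented differently.

```python
import argparse


def build_parser():
    """Build a parser that accepts only 100, 200, 300, 400, or 500 for --num."""
    parser = argparse.ArgumentParser(prog="gitgeo-sketch")
    parser.add_argument(
        "--num",
        type=int,
        default=100,
        choices=range(100, 501, 100),
        help="number of contributors analyzed per repo",
    )
    return parser


args = build_parser().parse_args(["--num", "300"])
print(args.num)
```

With this setup, a value such as 150 is rejected by argparse with a usage error instead of being passed along silently.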
Run tests:
pytest
Roadmap
Investigate capability of predicting location via a model given only timestamp from commit and commit-related data. (Kinga)
Investigate GitHub API for examining merges and who has merge rights.
Add capability of reading through commits and, specifically, determining whether GitHub commit rights can be inferred.
Investigate capability of extracting all users associated with a GitHub group
Investigate capability to determine authenticity of location information
Investigate possibility of geographic diversity score for a repo or package
Investigate possibility of linking emails in commits to email breach lists.
Investigate possibility of determining whether a project is a “hobby” project (outside of working hours) or a “work” project (within working hours).
Investigate possibility of using NLP to determine codebase specialties of each contributor. e.g. This person is the “auth” person.
Investigate over time commit analysis visualization
Add dump multirepo results (or similar aggregate scan) to s3 capability
Investigate diff to tweet capability. Reveal major contributor changes in critical projects to an open feed.
Investigate switching ownership data. Would be interesting to alert users to this.
Investigate by user capability. Determine all repos a user has contributed to. Do a quick git blame for a user.
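As one possible starting point for the geographic diversity score mentioned in the roadmap above, the sketch below computes a normalized Shannon entropy over contributor countries. This is only an illustration of one candidate metric, not a planned or existing GitGeo feature; the function name geographic_diversity and the choice to ignore contributors with an unknown country are assumptions.

```python
import math
from collections import Counter


def geographic_diversity(countries):
    """Return normalized Shannon entropy of country labels, in [0, 1].

    0.0 means all known contributors share one country (or none are known);
    1.0 means contributors are spread evenly across the observed countries.
    Assumption: entries equal to None (unknown country) are ignored.
    """
    known = [c for c in countries if c is not None]
    counts = Counter(known)
    if len(counts) <= 1:
        return 0.0
    total = len(known)
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    # Divide by the maximum entropy for this many countries to normalize.
    return entropy / math.log(len(counts))
```

Because the score is normalized by the number of countries actually observed, it measures how evenly contributors are spread and does not by itself reward a larger number of countries.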
Rainy Day Options
Access commercial APIs to enrich data on GitHub usernames or, if included in GitHub profile, email handles, etc. Perhaps People Data Labs or Explorium. (MK)
Potential Research Questions
Are there places in the world with unrecognized pockets of software developers?
Where are maintainers associated with the most critical Python packages?
Who are the maintainers that are associated with multiple critical Python packages?
What about contribution-related weighting?
Where are the maintainers associated with the top GitHub packages by stars? Top data science packages? Quantum computing packages? Blockchain packages? Etc.? (RP)
Then do sub-analysis that asks on what repos or types of repos developers of a given country are most active
What predicts the number of software developers of top Python packages by country?
Total number of coders per country?
Total number of Python coders per country?
GDP per capita per country?
Is it possible to “verify” user information?
Known bugs
Want to contribute?
Open a PR. We are glad to accept pull requests. We use black, pylint, and pydocstyle, though we are glad to help if you haven’t used those tools before.
Open an issue. Tell us your problem or a functionality you want.
Want to help build a community related to GitGeo and similar open source software ecosystem exploration tools? Please send an email to jmeyers@iqt.org.