Random sampling of GitHub repositories
Project description
RepoRoulette 🎲: Randomly Sample Repositories from GitHub
Spin the wheel and see which GitHub repositories you get!
🚀 Installation
# Using pip
pip install reporoulette
# From source
git clone https://github.com/gojiplus/reporoulette.git
cd reporoulette
pip install -e .
📖 Sampling Methods
RepoRoulette provides three distinct methods for random GitHub repository sampling:
1. 🎯 ID-Based Sampling
Uses GitHub's sequential repository ID system to generate truly random samples by probing random IDs from the valid ID range. The downside of using the method is that the hit rate can be low (as many IDs are invalid, partly because the repo. is private or abandoned, etc.) And any filtering on repo. characteristics must wait till you have the names.
The function will continue to sample till either max_attempts or till n_samples. You can pass the seed for reproducibility.
from reporoulette import IDSampler
# Initialize the sampler
sampler = IDSampler(token="your_github_token")
# Get 50 random repositories
repos = sampler.sample(n_samples=50)
# Print basic stats
print(f"Success rate: {sampler.success_rate:.2f}%")
print(f"Samples collected: {len(repos)}")
2. ⏱️ Temporal Sampling
Randomly selects time points (date/hour combinations) within a specified range and then retrieves repositories updated during those periods.
from reporoulette import TemporalSampler
from datetime import datetime, timedelta
# Define a date range (last 3 months)
end_date = datetime.now()
start_date = end_date - timedelta(days=90)
# Initialize the sampler
sampler = TemporalSampler(
token="your_github_token",
start_date=start_date,
end_date=end_date
)
# Get 100 random repositories
repos = sampler.sample(n_samples=100)
# Get repositories with specific characteristics
filtered_repos = sampler.sample(
n_samples=50,
min_stars=10,
languages=["python", "javascript"]
)
3. 🔍 BigQuery Sampling
Leverages Google BigQuery's GitHub dataset for high-volume, efficient sampling. We provide three methods --- standard sampler, sampling based on the commits table, and sampling based on the hour buckets. The virtue of the first is its simplicity.
from reporoulette import BigQuerySampler
# Initialize the sampler (requires GCP credentials)
sampler = BigQuerySampler(
credentials_path="path/to/credentials.json"
)
# Sample 1,000 repositories created in the last year
repos = sampler.sample(
n_samples=1000,
created_after="2023-01-01",
sample_by="created_at"
)
# Sample repositories with multiple criteria
specialty_repos = sampler.sample(
n_samples=500,
min_stars=100,
min_forks=50,
languages=["rust", "go"],
has_license=True
)
Advantages:
- Handles large sample sizes efficiently
- Powerful filtering and stratification options
- Not limited by GitHub API rate limits
- Access to historical data
Limitations:
- Could be expensive
- Requires Google Cloud Platform account and billing
- Dataset may have a slight delay (typically 24-48 hours)
4. GH Archive Sampler
rom reporoulette.samplers import GHArchiveSampler
sampler = GHArchiveSampler(seed=42)
# Sample repositories using the gh_sampler method directly
# (This is the method implemented by GHArchiveSampler, not the abstract sample method)
repos = sampler.gh_sampler(
n_samples=10, # Number of repositories to sample
hours_to_sample=5, # Sample from 5 random hours
repos_per_hour=3, # Collect up to 3 repos per hour
years_back=3, # Sample from last 3 years
event_types=["PushEvent", "CreateEvent", "PullRequestEvent"] # Types of events to consider
)
# Display the sampled repositories
print(f"Successfully sampled {len(repos)} repositories:\n")
for i, repo in enumerate(repos, 1):
print(f"{i}. {repo['full_name']}")
print(f" URL: {repo['html_url']}")
print(f" Language: {repo.get('language', 'Unknown')}")
print(f" Event: {repo.get('event_type')}")
print(f" Sampled from: {repo.get('sampled_from')}")
print()
📊 Example Use Cases
- Academic Research: Study coding practices across different languages and communities
- Learning Resources: Discover diverse code examples for education
- Data Science: Build datasets for machine learning models about code patterns
- Trend Analysis: Identify emerging technologies and practices
- Security Research: Find vulnerability patterns across repository types
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
📜 License
This project is licensed under the MIT License - see the LICENSE file for details.
🔗 Related Projects
- GHTorrent - GitHub data archive project
- GitHub Archive - Archive of public GitHub timeline
- PyGithub - Python library for the GitHub API
Built with ❤️ by Gojiplus
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file reporoulette-0.1.2.tar.gz.
File metadata
- Download URL: reporoulette-0.1.2.tar.gz
- Upload date:
- Size: 25.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f51dac7ca056fdedfc019549ce3d91c7763108c2a6512cf03c028b3984d51e44
|
|
| MD5 |
ec72be0d9d472350cbfa85bfdb6e3a1e
|
|
| BLAKE2b-256 |
2dda4a4a9fc0dfd039782c815a31b62083edac332c0b7b5935cafce2a6ddd224
|
File details
Details for the file reporoulette-0.1.2-py3-none-any.whl.
File metadata
- Download URL: reporoulette-0.1.2-py3-none-any.whl
- Upload date:
- Size: 29.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
86e37de47ceb931c454eb7f687addcec1dc278751fdba100b2e666f6e3a2dbf6
|
|
| MD5 |
e68ac01b5fec90eee8589c40173066d1
|
|
| BLAKE2b-256 |
33a46f0253c6d2099b730e9e22d8349ea7857ac76d5c2244b493c00cb2726b6b
|