GitHub/GitLab AI Scraper
A CLI tool for scraping AI-related high-star repositories from GitHub and GitLab.
Features
- Multi-platform support - Scrape from GitHub or GitLab (including self-hosted instances)
- Search and filter AI-related repositories by keywords and topics
- Dynamic keyword extraction - Automatically learns new keywords from scraped repos
- Markdown/HTML/Excel/RSS report generation - Multiple export formats with Chinese translation
- Incremental scraping - Fetch only updated repos with the --since flag
- Resume support - Continue interrupted scrapes with progress tracking
- Progress bar display - Visual progress during scraping
- Interactive CLI mode - Menu-driven interface for easy use
- Concurrent scraping - Parallel requests for faster results
- Multi-language search - Support for Chinese and English keywords
- Local SQLite storage with trend analysis
- Configurable filtering and scraping options
- Rate limiting with GitHub/GitLab API token support
- Export to CSV/JSON/HTML/Excel/RSS/Markdown formats
- REST API server - Access data via HTTP endpoints with optional authentication
- Scheduled scraping - Cron-based periodic scraping
- Webhook notifications - Notify external services on events
- Plugin system - Extend functionality with custom plugins
- Repository health assessment - Activity, popularity, maintenance scores
- Intelligent classification - LLM, CV, NLP, MLOps, AI Infrastructure categories
- Deduplication - Fork and mirror detection, content similarity
- Secure token storage - Encrypted storage for sensitive tokens
- Database backup - Automatic backup and restore functionality
- Error recovery - Retry logic with exponential backoff
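The "retry logic with exponential backoff" feature can be pictured with a minimal sketch. This is not the code in retry.py; the names `backoff_delays` and `retry_with_backoff`, the default delays, and the 10% jitter are all assumptions made for illustration.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0) -> List[float]:
    """Delay before each retry: base * 2**attempt, capped at `cap` seconds."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]


def retry_with_backoff(
    func: Callable[[], T],
    retries: int = 3,
    base: float = 1.0,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(); after each failure, wait with exponential backoff and retry."""
    for delay in backoff_delays(retries, base, cap):
        try:
            return func()
        except Exception:
            # Up to 10% jitter (an assumption) so parallel workers
            # do not all retry at the same instant.
            sleep(delay + random.uniform(0, delay * 0.1))
    # Final attempt: any exception now propagates to the caller.
    return func()
```

With `retries=3` the function makes at most four calls, waiting roughly 1 s, 2 s, and 4 s between them. Passing a custom `sleep` makes the behavior easy to test without real delays.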
Installation
# Install from PyPI
pip install github-ai-scraper
# Or install from source for development
pip install -e ".[dev]"
Quick Start
# Set your GitHub token (optional, increases rate limit)
export GITHUB_TOKEN=your_token_here
# Scrape AI repositories from GitHub (default)
ai-scraper scrape
# Scrape from GitLab
ai-scraper scrape --platform gitlab
# Scrape from self-hosted GitLab
ai-scraper scrape --platform gitlab --gitlab-url https://your-gitlab.com/api/v4
# Scrape with progress bar
ai-scraper scrape --progress
# Concurrent scraping (faster)
ai-scraper scrape --concurrent
# Incremental scraping (repos updated in last 7 days)
ai-scraper scrape --incremental
ai-scraper scrape --since 7d
# Resume interrupted scrape
ai-scraper scrape --resume
# Interactive mode
ai-scraper interactive
# List scraped repositories
ai-scraper list
# Show trending repositories
ai-scraper trending
# Export data
ai-scraper db export --format html --output index.html
ai-scraper db export --format xlsx --output repos.xlsx
ai-scraper db export --format rss --output feed.xml
ai-scraper db export --format markdown --output repositories.md
# Start REST API server (with authentication)
ai-scraper serve --port 8080 --auth
# Schedule periodic scraping (daily at 9am)
ai-scraper schedule --cron "0 9 * * *"
# Backup database
ai-scraper db backup
ai-scraper db restore backup_file.db.gz
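To make the `--since 7d` value concrete, here is a hedged Python sketch of how a relative duration could be turned into a cutoff timestamp. The helper names `parse_since` and `pushed_qualifier` and the accepted suffixes (`h`, `d`, `w`) are assumptions; the tool's own parser may differ. `pushed:>=YYYY-MM-DD` is a real GitHub search qualifier, but whether ai-scraper uses it is also an assumption.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed unit suffixes; the real CLI may accept a different set.
_UNITS = {"h": "hours", "d": "days", "w": "weeks"}


def parse_since(value: str, now: Optional[datetime] = None) -> datetime:
    """Convert a relative duration such as '7d' into an absolute UTC cutoff."""
    match = re.fullmatch(r"(\d+)([hdw])", value.strip().lower())
    if match is None:
        raise ValueError(f"unsupported --since value: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    now = now or datetime.now(timezone.utc)
    return now - timedelta(**{_UNITS[unit]: amount})


def pushed_qualifier(cutoff: datetime) -> str:
    """Format the cutoff as a GitHub search qualifier on the last push date."""
    return f"pushed:>={cutoff.date().isoformat()}"
```

For example, `pushed_qualifier(parse_since("7d"))` yields a qualifier that restricts a search to repositories pushed to within the last seven days.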
Configuration
Create ai-scraper.yaml to customize:
github:
  token: ${GITHUB_TOKEN}
  cache_ttl: 3600
gitlab:
  token: ${GITLAB_TOKEN}  # Optional, for GitLab scraping
  base_url: https://gitlab.com/api/v4  # Or your self-hosted GitLab URL
  cache_ttl: 3600
filter:
  min_stars: 100
  keywords:
    - ai
    - machine-learning
    - 人工智能  # Chinese keyword support
  topics:
    - ai
    - deep-learning
scrape:
  max_results: 500
  concurrency: 5
  concurrent_requests: 5
database:
  path: ./data/ai_scraper.db
  backup_dir: ./backups
  max_backups: 10
api:
  auth_enabled: true
  api_keys:
    - as_your_api_key_here
webhooks:
  enabled: false
  endpoints:
    - url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
      events: [scrape_complete, trending_found]
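The `${GITHUB_TOKEN}` and `${GITLAB_TOKEN}` placeholders suggest that environment variables are substituted when the config is loaded. A minimal sketch of such a substitution step follows; it is not the logic in config.py. The function name `expand_env` and the choice to replace unset variables with an empty string are assumptions.

```python
import os
import re
from typing import Any, Mapping, Optional

# Matches ${NAME} placeholders such as ${GITHUB_TOKEN}.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: Any, env: Optional[Mapping[str, str]] = None) -> Any:
    """Recursively replace ${NAME} placeholders in an already-parsed config."""
    env = os.environ if env is None else env
    if isinstance(value, str):
        # Assumption: an unset variable becomes an empty string.
        return _PLACEHOLDER.sub(lambda m: env.get(m.group(1), ""), value)
    if isinstance(value, dict):
        return {key: expand_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item, env) for item in value]
    # Numbers, booleans, and None pass through unchanged.
    return value


# Usage with a fragment of the config above, already parsed into a dict:
config = {"github": {"token": "${GITHUB_TOKEN}", "cache_ttl": 3600}}
resolved = expand_env(config, {"GITHUB_TOKEN": "example-token"})
```

Keeping the token in the environment rather than in ai-scraper.yaml means the file can be committed without exposing secrets.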
Commands
| Command | Description |
|---|---|
| ai-scraper scrape | Scrape AI repositories from GitHub |
| ai-scraper scrape --platform gitlab | Scrape from GitLab |
| ai-scraper scrape --platform gitlab --gitlab-url URL | Scrape from self-hosted GitLab |
| ai-scraper scrape --concurrent | Concurrent scraping for faster results |
| ai-scraper scrape --incremental | Incremental scraping (only updated repos) |
| ai-scraper scrape --since 7d | Fetch repos updated in last 7 days |
| ai-scraper scrape --resume | Resume interrupted scrape |
| ai-scraper scrape --progress | Show progress bar during scraping |
| ai-scraper interactive | Start interactive menu mode |
| ai-scraper list | List scraped repositories |
| ai-scraper trending | Show trending repositories by star growth |
| ai-scraper serve | Start REST API server |
| ai-scraper serve --auth | Start API server with authentication |
| ai-scraper schedule | Schedule periodic scraping |
| ai-scraper keywords list | List all keywords |
| ai-scraper keywords extract | Extract keywords from database |
| ai-scraper keywords clear | Clear keywords |
| ai-scraper config init | Initialize config file |
| ai-scraper config show | Show current config |
| ai-scraper db stats | Show database statistics |
| ai-scraper db export | Export data to CSV/JSON/HTML/Excel/RSS |
| ai-scraper db clean --invalid | Remove repositories with invalid data |
| ai-scraper db clean --vacuum | Optimize database size |
| ai-scraper db backup | Create database backup |
| ai-scraper db restore | Restore from backup |
| ai-scraper db backups | List available backups |
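The trending command ranks repositories "by star growth". One way to picture that ranking is the sketch below, which compares each repository's earliest and latest star counts. The `StarSnapshot` record and the `star_growth` function are hypothetical; the tool's actual schema and trend calculation are not documented here.

```python
from typing import Dict, List, NamedTuple, Tuple


class StarSnapshot(NamedTuple):
    """Hypothetical record of a repository's star count on a given day."""
    repo: str
    day: str    # ISO date such as "2024-01-01", so string comparison orders by time
    stars: int


def star_growth(snapshots: List[StarSnapshot]) -> List[Tuple[str, int]]:
    """Rank repos by stars gained between their earliest and latest snapshot."""
    first: Dict[str, StarSnapshot] = {}
    last: Dict[str, StarSnapshot] = {}
    for snap in snapshots:
        if snap.repo not in first or snap.day < first[snap.repo].day:
            first[snap.repo] = snap
        if snap.repo not in last or snap.day > last[snap.repo].day:
            last[snap.repo] = snap
    growth = [(repo, last[repo].stars - first[repo].stars) for repo in first]
    # Largest gain first.
    return sorted(growth, key=lambda item: item[1], reverse=True)
```

Ranking by absolute gain favors already popular repositories; dividing by the starting count would instead surface fast-growing small projects.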
REST API Endpoints
When running ai-scraper serve:
| Endpoint | Description |
|---|---|
| GET /api/repos | List repositories with filters |
| GET /api/repos/{id} | Get specific repository |
| GET /api/stats | Get database statistics |
| GET /api/trending | Get trending repositories |
| GET /api/search?q=... | Search repositories |
Authentication: Pass X-API-Key header when --auth is enabled.
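A small Python client shows how the X-API-Key header fits into a request. This is a sketch under assumptions: the server is taken to run locally on port 8080 (as in the Quick Start), responses are assumed to be JSON, and `build_request` and `fetch_json` are illustrative helper names, not part of the package.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, Optional

# Assumption: a local server started with `ai-scraper serve --port 8080 --auth`.
BASE_URL = "http://localhost:8080"


def build_request(
    path: str,
    params: Optional[Dict[str, str]] = None,
    api_key: Optional[str] = None,
) -> urllib.request.Request:
    """Build a GET request for an API path, adding the key header if given."""
    url = BASE_URL + path
    if params:
        url += "?" + urllib.parse.urlencode(params)
    headers = {"Accept": "application/json"}
    if api_key:
        headers["X-API-Key"] = api_key
    return urllib.request.Request(url, headers=headers)


def fetch_json(request: urllib.request.Request, timeout: float = 10.0) -> Any:
    """Send the request and decode the body, assuming a JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage (requires a running server):
#   request = build_request("/api/search", {"q": "llm"}, api_key="as_your_api_key_here")
#   results = fetch_json(request)
```

Separating request construction from sending keeps the header logic testable without a live server.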
Project Structure
github-ai-scraper/
├── src/ai_scraper/
│   ├── cli.py                 # CLI entry point
│   ├── config.py              # Configuration management
│   ├── interactive.py         # Interactive menu mode
│   ├── classifier.py          # Repository classification
│   ├── dedup.py               # Deduplication utilities
│   ├── health.py              # Health assessment
│   ├── scheduler.py           # Task scheduling
│   ├── webhooks.py            # Webhook notifications
│   ├── plugins.py             # Plugin system
│   ├── logging_config.py      # Logging configuration
│   ├── api_server.py          # REST API server
│   ├── auth.py                # API authentication
│   ├── retry.py               # Error recovery
│   ├── i18n.py                # Multi-language support
│   ├── scrape_progress.py     # Resume support
│   ├── backup.py              # Database backup
│   ├── config_watcher.py      # Config hot reload
│   ├── secure_storage.py      # Token encryption
│   ├── api/
│   │   ├── github.py          # GitHub API client
│   │   └── rate_limiter.py    # Token bucket rate limiter
│   ├── models/
│   │   └── repository.py      # Data models (Pydantic)
│   ├── filters/
│   │   └── ai_filter.py       # AI relevance filter
│   ├── output/
│   │   ├── markdown.py        # Markdown exporter
│   │   ├── html.py            # HTML exporter
│   │   ├── excel.py           # Excel exporter
│   │   └── rss.py             # RSS exporter
│   └── storage/
│       ├── database.py        # SQLite storage (sync)
│       └── async_database.py  # SQLite storage (async)
├── plugins/                   # Example plugins
├── tests/                     # Test suite
├── Dockerfile                 # Docker support
├── docker-compose.yml         # Docker compose
├── .github/workflows/         # CI/CD workflows
└── ai-scraper.yaml            # Default configuration
Development
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Build Docker image
docker build -t ai-scraper .
API Rate Limits
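- Without token: 60 requests/hour (GitHub REST API, unauthenticated)
- With token: 5,000 requests/hour (GitHub REST API, authenticated)
Set the GITHUB_TOKEN environment variable to get the higher limit. These figures are GitHub's; GitLab instances enforce their own limits.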
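The project structure lists a token bucket rate limiter. The sketch below shows the general technique sized for the limits above; it is not the code in rate_limiter.py. The class name, method names, and the choice of bucket capacity are assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: holds up to `capacity` tokens, refilled at a steady rate."""

    def __init__(
        self,
        capacity: int,
        refill_per_second: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def _refill(self) -> None:
        """Add the tokens earned since the last check, never exceeding capacity."""
        now = self._clock()
        earned = (now - self._last) * self.refill_per_second
        self._tokens = min(float(self.capacity), self._tokens + earned)
        self._last = now

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the bucket is empty."""
        self._refill()
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False

    def wait_time(self) -> float:
        """Seconds until the next token becomes available (0.0 if one is ready)."""
        self._refill()
        return max(0.0, 1.0 - self._tokens) / self.refill_per_second


# Assumed sizing for the authenticated limit: 5000 requests per 3600 seconds.
authenticated_bucket = TokenBucket(capacity=5000, refill_per_second=5000 / 3600)
```

A caller would check `try_acquire()` before each API request and sleep for `wait_time()` when it returns False. The injectable `clock` exists only to make the class testable without real waiting.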
License
MIT