GitHub/GitLab AI Scraper

A CLI tool for discovering and scraping highly starred AI-related repositories from GitHub and GitLab.

Features

  • Multi-platform support - Scrape from GitHub or GitLab (including self-hosted instances)
  • Search and filter AI-related repositories by keywords and topics
  • Dynamic keyword extraction - Automatically learns new keywords from scraped repos
  • Markdown/HTML/Excel/RSS report generation - Multiple export formats with Chinese translation
  • Incremental scraping - Fetch only updated repos with --since flag
  • Resume support - Continue interrupted scrapes with progress tracking
  • Progress bar display - Visual progress during scraping
  • Interactive CLI mode - Menu-driven interface for easy use
  • Concurrent scraping - Parallel requests for faster results
  • Multi-language search - Support for Chinese and English keywords
  • Local SQLite storage with trend analysis
  • Configurable filtering and scraping options
  • Rate limiting with GitHub/GitLab API token support
  • Export to CSV/JSON/HTML/Excel/RSS/Markdown formats
  • REST API server - Access data via HTTP endpoints with optional authentication
  • Scheduled scraping - Cron-based periodic scraping
  • Webhook notifications - Notify external services on events
  • Plugin system - Extend functionality with custom plugins
  • Repository health assessment - Activity, popularity, maintenance scores
  • Intelligent classification - LLM, CV, NLP, MLOps, AI Infrastructure categories
  • Deduplication - Fork and mirror detection, content similarity
  • Secure token storage - Encrypted storage for sensitive tokens
  • Database backup - Automatic backup and restore functionality
  • Error recovery - Retry logic with exponential backoff
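
The "Error recovery" feature above mentions retry logic with exponential backoff. As a rough illustration of that technique (not the project's actual retry.py, whose contents are not shown here), a minimal sketch might look like the following. The names backoff_delays and retry_with_backoff, and the parameters base_delay and max_delay, are hypothetical.

```python
import random
import time


def backoff_delays(retries, base_delay=1.0, max_delay=60.0):
    """Return capped exponential delays: base, 2*base, 4*base, ..."""
    return [min(base_delay * (2 ** attempt), max_delay) for attempt in range(retries)]


def retry_with_backoff(func, retries=5, base_delay=1.0, max_delay=60.0,
                       exceptions=(Exception,), sleep=time.sleep):
    """Call func(); after each failure, wait longer before trying again.

    Hypothetical helper for illustration only. Makes one initial call plus
    up to `retries` retries.
    """
    for delay in backoff_delays(retries, base_delay, max_delay):
        try:
            return func()
        except exceptions:
            # Add up to 10% random jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    # Final attempt: let any exception propagate to the caller.
    return func()
```

Capping the delay with max_delay keeps a long outage from producing unbounded waits; the jitter is a common addition but is an assumption here, not something the README states.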

Installation

# Install from PyPI
pip install github-ai-scraper

# Or install from source for development (run from a local checkout of the repository)
pip install -e ".[dev]"

Quick Start

# Set your GitHub token (optional, increases rate limit)
export GITHUB_TOKEN=your_token_here

# Scrape AI repositories from GitHub (default)
ai-scraper scrape

# Scrape from GitLab
ai-scraper scrape --platform gitlab

# Scrape from self-hosted GitLab
ai-scraper scrape --platform gitlab --gitlab-url https://your-gitlab.com/api/v4

# Scrape with progress bar
ai-scraper scrape --progress

# Concurrent scraping (faster)
ai-scraper scrape --concurrent

# Incremental scraping (only updated repos)
ai-scraper scrape --incremental

# Fetch only repos updated in the last 7 days
ai-scraper scrape --since 7d

# Resume interrupted scrape
ai-scraper scrape --resume

# Interactive mode
ai-scraper interactive

# List scraped repositories
ai-scraper list

# Show trending repositories
ai-scraper trending

# Export data
ai-scraper db export --format html --output index.html
ai-scraper db export --format xlsx --output repos.xlsx
ai-scraper db export --format rss --output feed.xml
ai-scraper db export --format markdown --output repositories.md

# Start REST API server (with authentication)
ai-scraper serve --port 8080 --auth

# Schedule periodic scraping (daily at 9am)
ai-scraper schedule --cron "0 9 * * *"

# Back up the database, then restore it from a backup file
ai-scraper db backup
ai-scraper db restore backup_file.db.gz
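
The trending command ranks repositories by star growth, and the tool stores its data in SQLite. The sketch below shows one way such a ranking could be computed. The star_snapshots table, its columns, and the function name are assumptions made for illustration; the project's real schema is not documented here.

```python
import sqlite3

# Hypothetical schema: one row per repository per scrape run.
SCHEMA = """
CREATE TABLE star_snapshots (
    repo        TEXT    NOT NULL,
    stars       INTEGER NOT NULL,
    recorded_at TEXT    NOT NULL
)
"""


def top_by_star_growth(conn, limit=10):
    """Rank repos by the gap between their highest and lowest recorded star count."""
    query = """
        SELECT repo, MAX(stars) - MIN(stars) AS growth
        FROM star_snapshots
        GROUP BY repo
        ORDER BY growth DESC, repo ASC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO star_snapshots VALUES (?, ?, ?)",
        [
            ("org/a", 100, "2024-01-01"),
            ("org/a", 150, "2024-01-08"),
            ("org/b", 200, "2024-01-01"),
            ("org/b", 500, "2024-01-08"),
        ],
    )
    for repo, growth in top_by_star_growth(conn):
        print(repo, growth)
```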

Configuration

Create ai-scraper.yaml to customize:

github:
  token: ${GITHUB_TOKEN}
  cache_ttl: 3600

gitlab:
  token: ${GITLAB_TOKEN}  # Optional, for GitLab scraping
  base_url: https://gitlab.com/api/v4  # Or your self-hosted GitLab URL
  cache_ttl: 3600

filter:
  min_stars: 100
  keywords:
    - ai
    - machine-learning
    - 人工智能  # Chinese keyword support
  topics:
    - ai
    - deep-learning

scrape:
  max_results: 500
  concurrency: 5
  concurrent_requests: 5

database:
  path: ./data/ai_scraper.db
  backup_dir: ./backups
  max_backups: 10

api:
  auth_enabled: true
  api_keys:
    - as_your_api_key_here

webhooks:
  enabled: false
  endpoints:
    - url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
      events: [scrape_complete, trending_found]
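
The configuration above uses ${GITHUB_TOKEN} and ${GITLAB_TOKEN} placeholders. How the tool resolves them is not documented here; the sketch below shows one plausible approach, substituting environment variables into an already-parsed config. The function names and the choice to expand unset variables to an empty string are assumptions.

```python
import os
import re

# Matches ${NAME} where NAME is a valid environment variable identifier.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, env=None):
    """Replace ${NAME} placeholders in a string with environment values.

    Unset variables expand to an empty string (an assumption; the real
    tool may raise an error or leave the placeholder untouched).
    """
    env = os.environ if env is None else env
    return _PLACEHOLDER.sub(lambda m: env.get(m.group(1), ""), value)


def expand_config(node, env=None):
    """Recursively expand placeholders in a parsed config (dicts, lists, strings)."""
    if isinstance(node, dict):
        return {key: expand_config(val, env) for key, val in node.items()}
    if isinstance(node, list):
        return [expand_config(item, env) for item in node]
    if isinstance(node, str):
        return expand_env(node, env)
    # Numbers, booleans, and None pass through unchanged.
    return node
```

Keeping the token in an environment variable rather than in the YAML file means the config can be committed without leaking credentials.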

Commands

Command                                               Description
ai-scraper scrape                                     Scrape AI repositories from GitHub
ai-scraper scrape --platform gitlab                   Scrape from GitLab
ai-scraper scrape --platform gitlab --gitlab-url URL  Scrape from self-hosted GitLab
ai-scraper scrape --concurrent                        Concurrent scraping for faster results
ai-scraper scrape --incremental                       Incremental scraping (only updated repos)
ai-scraper scrape --since 7d                          Fetch repos updated in last 7 days
ai-scraper scrape --resume                            Resume interrupted scrape
ai-scraper scrape --progress                          Show progress bar during scraping
ai-scraper interactive                                Start interactive menu mode
ai-scraper list                                       List scraped repositories
ai-scraper trending                                   Show trending repositories by star growth
ai-scraper serve                                      Start REST API server
ai-scraper serve --auth                               Start API server with authentication
ai-scraper schedule                                   Schedule periodic scraping
ai-scraper keywords list                              List all keywords
ai-scraper keywords extract                           Extract keywords from database
ai-scraper keywords clear                             Clear keywords
ai-scraper config init                                Initialize config file
ai-scraper config show                                Show current config
ai-scraper db stats                                   Show database statistics
ai-scraper db export                                  Export data to CSV/JSON/HTML/Excel/RSS/Markdown
ai-scraper db clean --invalid                         Remove repositories with invalid data
ai-scraper db clean --vacuum                          Optimize database size
ai-scraper db backup                                  Create database backup
ai-scraper db restore                                 Restore from backup
ai-scraper db backups                                 List available backups
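
The --since 7d option takes a relative duration. As an illustration of how such a value could be turned into a cutoff timestamp, here is a small sketch. Only the "7d" form appears in this README; the "h" and "w" units, the function name parse_since, and the error behavior are assumptions.

```python
import re
from datetime import datetime, timedelta, timezone

# Unit suffix -> timedelta keyword. Only "d" is confirmed by the README.
_UNITS = {"h": "hours", "d": "days", "w": "weeks"}


def parse_since(value, now=None):
    """Convert a relative duration such as '7d' into a UTC cutoff datetime."""
    match = re.fullmatch(r"(\d+)([hdw])", value.strip().lower())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    now = now or datetime.now(timezone.utc)
    return now - timedelta(**{_UNITS[unit]: amount})


if __name__ == "__main__":
    cutoff = parse_since("7d")
    # Repositories updated at or after this instant would be fetched.
    print(cutoff.isoformat())
```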

REST API Endpoints

When running ai-scraper serve:

Endpoint                Description
GET /api/repos          List repositories with filters
GET /api/repos/{id}     Get specific repository
GET /api/stats          Get database statistics
GET /api/trending       Get trending repositories
GET /api/search?q=...   Search repositories

Authentication: when the server is started with --auth, pass your API key in the X-API-Key request header.
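
A client call against these endpoints might look like the sketch below, which uses only the Python standard library. The host (localhost), the port (8080, taken from the Quick Start example), the placeholder key, and the assumption that responses are JSON are not confirmed by this README; the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_request(base_url, path, api_key=None, params=None):
    """Build a GET request, attaching X-API-Key only when a key is given."""
    url = base_url.rstrip("/") + path
    if params:
        url += "?" + urllib.parse.urlencode(params)
    headers = {"Accept": "application/json"}
    if api_key:
        headers["X-API-Key"] = api_key
    return urllib.request.Request(url, headers=headers)


def fetch_json(request, timeout=10):
    """Send the request and decode the body, assuming the server returns JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires a running `ai-scraper serve --port 8080 --auth` instance.
    req = build_request("http://localhost:8080", "/api/search",
                        api_key="as_your_api_key_here", params={"q": "llm"})
    print(fetch_json(req))
```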

Project Structure

github-ai-scraper/
├── src/ai_scraper/
│   ├── cli.py              # CLI entry point
│   ├── config.py           # Configuration management
│   ├── interactive.py      # Interactive menu mode
│   ├── classifier.py       # Repository classification
│   ├── dedup.py            # Deduplication utilities
│   ├── health.py           # Health assessment
│   ├── scheduler.py        # Task scheduling
│   ├── webhooks.py         # Webhook notifications
│   ├── plugins.py          # Plugin system
│   ├── logging_config.py   # Logging configuration
│   ├── api_server.py       # REST API server
│   ├── auth.py             # API authentication
│   ├── retry.py            # Error recovery
│   ├── i18n.py             # Multi-language support
│   ├── scrape_progress.py  # Resume support
│   ├── backup.py           # Database backup
│   ├── config_watcher.py   # Config hot reload
│   ├── secure_storage.py   # Token encryption
│   ├── api/
│   │   ├── github.py       # GitHub API client
│   │   └── rate_limiter.py # Token bucket rate limiter
│   ├── models/
│   │   └── repository.py   # Data models (Pydantic)
│   ├── filters/
│   │   └── ai_filter.py    # AI relevance filter
│   ├── output/
│   │   ├── markdown.py     # Markdown exporter
│   │   ├── html.py         # HTML exporter
│   │   ├── excel.py        # Excel exporter
│   │   └── rss.py          # RSS exporter
│   └── storage/
│       ├── database.py     # SQLite storage (sync)
│       └── async_database.py # SQLite storage (async)
├── plugins/                # Example plugins
├── tests/                  # Test suite
├── Dockerfile              # Docker support
├── docker-compose.yml      # Docker compose
├── .github/workflows/      # CI/CD workflows
└── ai-scraper.yaml         # Default configuration
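
The tree above lists rate_limiter.py as a token bucket rate limiter. For readers unfamiliar with the technique, a generic token bucket can be sketched as follows. This is not the project's implementation; the class and method names are hypothetical.

```python
import time


class TokenBucket:
    """Generic token bucket: tokens refill at `rate` per second up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self._clock = clock
        self._last = clock()

    def _refill(self):
        # Credit tokens for the time elapsed since the last refill, capped at capacity.
        now = self._clock()
        self.tokens = min(self.capacity, self.tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self, cost=1):
        """Spend `cost` tokens if available; return False instead of blocking."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A caller would check try_acquire() before each API request and sleep briefly when it returns False. The bucket's capacity allows short bursts while the refill rate bounds the long-run request rate.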

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Build Docker image
docker build -t ai-scraper .

API Rate Limits

  • Without a token: 60 requests/hour (GitHub REST API, unauthenticated)
  • With a token: 5,000 requests/hour (GitHub REST API, authenticated)

Set the GITHUB_TOKEN environment variable to get the higher limit.
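
To put these limits in perspective, the short calculation below converts an hourly quota into the minimum spacing between requests and the time a scrape would take. The function names are hypothetical, and the figure of 500 requests is only an example; the number of API calls a real scrape makes is not stated here.

```python
def min_interval_seconds(requests_per_hour):
    """Average spacing between requests that stays within an hourly quota."""
    return 3600 / requests_per_hour


def hours_needed(total_requests, requests_per_hour):
    """Time to issue total_requests when limited to requests_per_hour."""
    return total_requests / requests_per_hour


if __name__ == "__main__":
    for label, quota in (("without token", 60), ("with token", 5000)):
        print(f"{label}: one request every {min_interval_seconds(quota):.2f} s, "
              f"{hours_needed(500, quota):.2f} h for 500 requests")
```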

License

MIT
