A CLI tool for discovering and scraping AI-related high-star repositories from GitHub and GitLab

GitHub/GitLab AI Scraper

A CLI tool for scraping AI-related high-star repositories from GitHub and GitLab.

Features

Multi-platform support - Scrape from GitHub or GitLab (including self-hosted instances)
Search and filter AI-related repositories by keywords and topics
Dynamic keyword extraction - Automatically learns new keywords from scraped repos
Markdown/HTML/Excel/RSS report generation - Multiple export formats with Chinese translation
Incremental scraping - Fetch only updated repos with --since flag
Resume support - Continue interrupted scrapes with progress tracking
Progress bar display - Visual progress during scraping
Interactive CLI mode - Menu-driven interface for easy use
Concurrent scraping - Parallel requests for faster results
Multi-language search - Support for Chinese and English keywords
Local SQLite storage with trend analysis
Configurable filtering and scraping options
Rate limiting with GitHub/GitLab API token support
Export to CSV/JSON/HTML/Excel/RSS/Markdown formats
REST API server - Access data via HTTP endpoints with optional authentication
Scheduled scraping - Cron-based periodic scraping
Webhook notifications - Notify external services on events
Plugin system - Extend functionality with custom plugins
Repository health assessment - Activity, popularity, maintenance scores
Intelligent classification - LLM, CV, NLP, MLOps, AI Infrastructure categories
Deduplication - Fork and mirror detection, content similarity
Secure token storage - Encrypted storage for sensitive tokens
Database backup - Automatic backup and restore functionality
Error recovery - Retry logic with exponential backoff
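
The "retry logic with exponential backoff" feature can be illustrated with a minimal Python sketch. This is not the package's retry.py; the function name, parameters, and jitter amount below are assumptions chosen for illustration.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Call func(), retrying on any exception with exponentially growing delays.

    Hypothetical helper for illustration; not the tool's actual API.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay doubles on each attempt (1s, 2s, 4s, ...), capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% random jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.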

Installation

# Install from PyPI
pip install github-ai-scraper

# Or install from source for development (run from a clone of the repository)
pip install -e ".[dev]"

Quick Start

# Set your GitHub token (optional, increases rate limit)
export GITHUB_TOKEN=your_token_here

# Scrape AI repositories from GitHub (default)
ai-scraper scrape

# Scrape from GitLab
ai-scraper scrape --platform gitlab

# Scrape from self-hosted GitLab
ai-scraper scrape --platform gitlab --gitlab-url https://your-gitlab.com/api/v4

# Scrape with progress bar
ai-scraper scrape --progress

# Concurrent scraping (faster)
ai-scraper scrape --concurrent

# Incremental scraping: fetch only updated repos, or only repos updated in the last 7 days
ai-scraper scrape --incremental
ai-scraper scrape --since 7d

# Resume interrupted scrape
ai-scraper scrape --resume

# Interactive mode
ai-scraper interactive

# List scraped repositories
ai-scraper list

# Show trending repositories
ai-scraper trending

# Export data
ai-scraper db export --format html --output index.html
ai-scraper db export --format xlsx --output repos.xlsx
ai-scraper db export --format rss --output feed.xml
ai-scraper db export --format markdown --output repositories.md

# Start REST API server (with authentication)
ai-scraper serve --port 8080 --auth

# Schedule periodic scraping (daily at 9am)
ai-scraper schedule --cron "0 9 * * *"

# Backup database
ai-scraper db backup
ai-scraper db restore backup_file.db.gz
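
A value such as 7d passed to --since has to be turned into a cutoff timestamp at some point. The sketch below shows one way this might be done in Python. Only the 7d form appears in this README; the h and w units, the function name, and the use of UTC are assumptions, not documented behavior of the tool.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed unit suffixes; the README itself only shows "7d".
_UNITS = {"h": "hours", "d": "days", "w": "weeks"}


def since_to_cutoff(value, now=None):
    """Convert a duration like '7d' into the datetime that many units before now."""
    match = re.fullmatch(r"(\d+)([hdw])", value.strip())
    if not match:
        raise ValueError(f"unsupported duration: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    now = now or datetime.now(timezone.utc)
    return now - timedelta(**{_UNITS[unit]: amount})
```

Repositories whose last update is at or after the returned cutoff would then be the ones fetched.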

Configuration

Create an ai-scraper.yaml file to customize the scraper:

github:
  token: ${GITHUB_TOKEN}
  cache_ttl: 3600

gitlab:
  token: ${GITLAB_TOKEN}  # Optional, for GitLab scraping
  base_url: https://gitlab.com/api/v4  # Or your self-hosted GitLab URL
  cache_ttl: 3600

filter:
  min_stars: 100
  keywords:
    - ai
    - machine-learning
    - 人工智能  # Chinese keyword support
  topics:
    - ai
    - deep-learning

scrape:
  max_results: 500
  concurrency: 5
  concurrent_requests: 5

database:
  path: ./data/ai_scraper.db
  backup_dir: ./backups
  max_backups: 10

api:
  auth_enabled: true
  api_keys:
    - as_your_api_key_here

webhooks:
  enabled: false
  endpoints:
    - url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
      events: [scrape_complete, trending_found]
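
The ${GITHUB_TOKEN} and ${GITLAB_TOKEN} placeholders above suggest that values are filled in from environment variables. The README does not say how the tool resolves them, so the following Python sketch is only an assumption about the general technique; expand_env is a hypothetical helper, and replacing unset variables with an empty string is one possible policy.

```python
import os
import re

# Matches ${NAME} where NAME is a typical environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(text, env=None):
    """Replace ${NAME} placeholders with values from env; unset names become ''."""
    env = os.environ if env is None else env
    return _PLACEHOLDER.sub(lambda m: env.get(m.group(1), ""), text)


config_text = "github:\n  token: ${GITHUB_TOKEN}\n  cache_ttl: 3600\n"
expanded = expand_env(config_text, env={"GITHUB_TOKEN": "ghp_example"})
print(expanded)
```

Expanding the raw text before handing it to a YAML parser keeps the substitution independent of the parser in use.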

Commands

Command	Description
`ai-scraper scrape`	Scrape AI repositories from GitHub
`ai-scraper scrape --platform gitlab`	Scrape from GitLab
`ai-scraper scrape --platform gitlab --gitlab-url URL`	Scrape from self-hosted GitLab
`ai-scraper scrape --concurrent`	Concurrent scraping for faster results
`ai-scraper scrape --incremental`	Incremental scraping (only updated repos)
`ai-scraper scrape --since 7d`	Fetch repos updated in last 7 days
`ai-scraper scrape --resume`	Resume interrupted scrape
`ai-scraper scrape --progress`	Show progress bar during scraping
`ai-scraper interactive`	Start interactive menu mode
`ai-scraper list`	List scraped repositories
`ai-scraper trending`	Show trending repositories by star growth
`ai-scraper serve`	Start REST API server
`ai-scraper serve --auth`	Start API server with authentication
`ai-scraper schedule`	Schedule periodic scraping
`ai-scraper keywords list`	List all keywords
`ai-scraper keywords extract`	Extract keywords from database
`ai-scraper keywords clear`	Clear keywords
`ai-scraper config init`	Initialize config file
`ai-scraper config show`	Show current config
`ai-scraper db stats`	Show database statistics
`ai-scraper db export`	Export data to CSV/JSON/HTML/Excel/RSS/Markdown
`ai-scraper db clean --invalid`	Remove repositories with invalid data
`ai-scraper db clean --vacuum`	Optimize database size
`ai-scraper db backup`	Create database backup
`ai-scraper db restore`	Restore from backup
`ai-scraper db backups`	List available backups

REST API Endpoints

The following endpoints are available while ai-scraper serve is running:

Endpoint	Description
`GET /api/repos`	List repositories with filters
`GET /api/repos/{id}`	Get specific repository
`GET /api/stats`	Get database statistics
`GET /api/trending`	Get trending repositories
`GET /api/search?q=...`	Search repositories

Authentication: When --auth is enabled, pass your API key in the X-API-Key header of each request.
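
As a rough illustration of calling these endpoints from Python, the sketch below builds an authenticated request using only the standard library. The host localhost, the JSON response format, and the helper names are assumptions; port 8080 and the sample key are taken from the examples earlier in this README.

```python
import json
import urllib.parse
import urllib.request


def build_request(base_url, path, api_key=None, params=None):
    """Build a GET request, adding the X-API-Key header when a key is given."""
    url = base_url.rstrip("/") + path
    if params:
        url += "?" + urllib.parse.urlencode(params)
    headers = {"Accept": "application/json"}
    if api_key:
        headers["X-API-Key"] = api_key
    return urllib.request.Request(url, headers=headers)


def fetch_json(request, timeout=10):
    """Send the request and decode the body, assuming the server returns JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example (requires a running server started with: ai-scraper serve --port 8080 --auth):
#   req = build_request("http://localhost:8080", "/api/search",
#                       api_key="as_your_api_key_here", params={"q": "llm"})
#   results = fetch_json(req)
```

Separating request construction from sending makes it possible to inspect the URL and headers without a server.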

Project Structure

github-ai-scraper/
├── src/ai_scraper/
│   ├── cli.py              # CLI entry point
│   ├── config.py           # Configuration management
│   ├── interactive.py      # Interactive menu mode
│   ├── classifier.py       # Repository classification
│   ├── dedup.py            # Deduplication utilities
│   ├── health.py           # Health assessment
│   ├── scheduler.py        # Task scheduling
│   ├── webhooks.py         # Webhook notifications
│   ├── plugins.py          # Plugin system
│   ├── logging_config.py   # Logging configuration
│   ├── api_server.py       # REST API server
│   ├── auth.py             # API authentication
│   ├── retry.py            # Error recovery
│   ├── i18n.py             # Multi-language support
│   ├── scrape_progress.py  # Resume support
│   ├── backup.py           # Database backup
│   ├── config_watcher.py   # Config hot reload
│   ├── secure_storage.py   # Token encryption
│   ├── api/
│   │   ├── github.py       # GitHub API client
│   │   └── rate_limiter.py # Token bucket rate limiter
│   ├── models/
│   │   └── repository.py   # Data models (Pydantic)
│   ├── filters/
│   │   └── ai_filter.py    # AI relevance filter
│   ├── output/
│   │   ├── markdown.py     # Markdown exporter
│   │   ├── html.py         # HTML exporter
│   │   ├── excel.py        # Excel exporter
│   │   └── rss.py          # RSS exporter
│   └── storage/
│       ├── database.py     # SQLite storage (sync)
│       └── async_database.py # SQLite storage (async)
├── plugins/                # Example plugins
├── tests/                  # Test suite
├── Dockerfile              # Docker support
├── docker-compose.yml      # Docker compose
├── .github/workflows/      # CI/CD workflows
└── ai-scraper.yaml         # Default configuration
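
The tree above lists rate_limiter.py as a token bucket rate limiter. For readers unfamiliar with the technique, here is a generic Python sketch of a token bucket. It is not the package's implementation; the class name, method names, and parameters are assumptions.

```python
import time


class TokenBucket:
    """Generic token bucket: holds up to `capacity` tokens, refilled at a steady rate."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = float(capacity)
        self.refill_per_second = float(refill_per_second)
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last update, never exceeding capacity.
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now

    def try_acquire(self, cost=1.0):
        """Take `cost` tokens if available and return True; otherwise return False."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A caller would check try_acquire() before each API request and wait briefly when it returns False. The bucket allows short bursts up to its capacity while holding the long-run rate to the refill rate.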

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Build Docker image
docker build -t ai-scraper .

API Rate Limits

GitHub API, without token: 60 requests/hour
GitHub API, with token: 5000 requests/hour

Set the GITHUB_TOKEN environment variable for higher limits.
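
To put the two limits above in perspective, the short Python calculation below converts an hourly limit into the average spacing between requests needed to stay within it. The function name is illustrative only.

```python
def min_interval_seconds(requests_per_hour):
    """Average number of seconds between requests that stays within an hourly limit."""
    return 3600 / requests_per_hour


print(min_interval_seconds(60))    # 60.0 seconds between requests without a token
print(min_interval_seconds(5000))  # 0.72 seconds between requests with a token
```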

License

MIT

This version: 0.1.2 (May 19, 2026)

Download files

Source Distribution

github_ai_scraper-0.1.2.tar.gz (204.1 kB)

Uploaded May 19, 2026 Source

Built Distribution

github_ai_scraper-0.1.2-py3-none-any.whl (67.6 kB)

Uploaded May 19, 2026 Python 3

File details

Details for the file github_ai_scraper-0.1.2.tar.gz.

File metadata

Download URL: github_ai_scraper-0.1.2.tar.gz
Upload date: May 19, 2026
Size: 204.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for github_ai_scraper-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`e333c7fde9cd2af15e9cb5bcfe047efe6674b5dcbe3ee138d7b1202ab10952f6`
MD5	`42d08bab7524c9679e15c30dc0c63f9f`
BLAKE2b-256	`558117fafdafc27cae5e833cb247e867b9290dfcbab01f03edbcf75796137c41`
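
A downloaded file can be checked against the SHA256 digest listed above. The following Python sketch uses the standard hashlib module; the helper names are illustrative, and the file path passed in is whatever location the archive was saved to.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True when the file's SHA256 digest equals the published value."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, matches_published_hash("github_ai_scraper-0.1.2.tar.gz", "e333c7fde9cd2af15e9cb5bcfe047efe6674b5dcbe3ee138d7b1202ab10952f6") should return True for an intact copy of the source distribution.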

File details

Details for the file github_ai_scraper-0.1.2-py3-none-any.whl.

File metadata

Download URL: github_ai_scraper-0.1.2-py3-none-any.whl
Upload date: May 19, 2026
Size: 67.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for github_ai_scraper-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2f8aa4f009478a7bc9710c70c302abf3a4161a3188e8d9b60ff48240b6775c6e`
MD5	`ac264db060ab8582e8aff7b7a381f39c`
BLAKE2b-256	`a265cc7574a9c09a8e41cce4303ec08ea303c398bc56465f86cf22f513ff7745`
