GhostFetch

A stealthy headless browser service for AI agents. Bypasses anti-bot protections to fetch content from sites like X.com and converts it to clean Markdown.

Features

  • Zero Setup: Install with pip; browsers auto-install on first run
  • Synchronous API: Single request returns content directly (no polling needed)
  • Ghost Protocol: Advanced proxy rotation and cohesive browser fingerprinting
  • Stealth Browsing: Uses Playwright with custom flags and canvas noise injection
  • Markdown Output: Automatically converts HTML to Markdown for easy LLM consumption
  • Metadata Extraction: Automatically extracts title, author, publish date, and images
  • X.com Support: Waits for dynamic content to load on Twitter/X
  • Async Job Queue: Process multiple requests concurrently with intelligent retry
  • Persistent Sessions: Cookie/localStorage persistence per domain
  • Webhook Callbacks: Get notified via HTTP when jobs complete
  • GitHub Integration: Post results directly to GitHub issues
  • Dual Mode: CLI tool or REST API service
  • Docker Ready: Pre-configured Docker setup with docker-compose

Quick Start

For AI Agents (Simplest)

# Install from source (run inside a clone of the repository)
pip install -e .

# Fetch any URL (auto-installs browsers on first run)
ghostfetch "https://x.com/user/status/123"

# Or use the Python SDK
python -c "from ghostfetch import fetch; print(fetch('https://example.com')['markdown'])"

For API Usage

# Start the server
ghostfetch serve

# Fetch synchronously (blocks until done)
curl "http://localhost:8000/fetch/sync?url=https://example.com"

Installation

Option 1: Docker Hub (Fastest)

# Pull and run
docker run -p 8000:8000 iarsalanshah/ghostfetch

# Or with docker-compose
docker-compose up

Option 2: pip install

# From PyPI
pip install ghostfetch

# Or from source
git clone https://github.com/iArsalanshah/GhostFetch.git
cd GhostFetch
pip install -e .

# Browsers install automatically on first use, or run:
ghostfetch setup

Option 3: Manual Setup

cd GhostFetch

# Create virtual environment (optional)
python3 -m venv venv
source venv/bin/activate

# Install packages & browser
pip install -r requirements.txt
playwright install chromium

Usage

1. CLI Mode (Zero Setup)

Using the ghostfetch CLI (after pip install):

# Basic fetch
ghostfetch "https://x.com/user/status/123"

# JSON output (for parsing)
ghostfetch "https://example.com" --json

# Metadata only
ghostfetch "https://example.com" --metadata-only

# Quiet mode (no progress messages)
ghostfetch "https://example.com" --quiet

Using the legacy module directly:

python -m src.core.scraper "https://x.com/user/status/123"

Output:

--- Metadata ---
{
  "title": "...",
  "author": "...",
  "publish_date": "...",
  "images": [...]
}

--- Markdown ---
[converted markdown content]
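
If an agent consumes this text output rather than `--json`, the two sections can be split on their marker lines. The following is a minimal sketch, assuming the output contains exactly the `--- Metadata ---` and `--- Markdown ---` markers shown above and that the metadata section is valid JSON. `parse_cli_output` is a hypothetical helper, not part of GhostFetch.

```python
import json

METADATA_MARKER = "--- Metadata ---"
MARKDOWN_MARKER = "--- Markdown ---"


def parse_cli_output(text):
    """Split the CLI text output into (metadata dict, markdown string)."""
    meta_start = text.index(METADATA_MARKER) + len(METADATA_MARKER)
    md_start = text.index(MARKDOWN_MARKER, meta_start)
    # Everything between the two markers is the JSON metadata object
    metadata = json.loads(text[meta_start:md_start])
    # Everything after the second marker is the converted Markdown
    markdown = text[md_start + len(MARKDOWN_MARKER):].strip()
    return metadata, markdown
```

For machine consumption the `--json` flag is likely the simpler route; this parser only helps when the text output is all you have.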

2. API Mode (Service for Agents)

Start the server:

# Using CLI
ghostfetch serve

# Or directly
python main.py

The server will start at http://localhost:8000.

API Endpoints

Synchronous Fetch (Recommended for AI Agents)

  • POST /fetch/sync — blocks until content is ready
  • GET /fetch/sync?url=... — same, but via query parameter

Example (POST):

curl -X POST "http://localhost:8000/fetch/sync" \
     -H "Content-Type: application/json" \
     -d '{"url": "https://example.com", "timeout": 60}'

Example (GET):

curl "http://localhost:8000/fetch/sync?url=https://example.com"

Response:

{
  "metadata": {
    "title": "Example Domain",
    "author": "",
    "publish_date": "",
    "images": []
  },
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples..."
}
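
When calling the GET form from code, the target URL should be percent-encoded so that its own `?` and `&` characters are not read as parameters of `/fetch/sync`. The sketch below uses only the standard library. The helper names and the 130-second client timeout (chosen to sit slightly above the server's default sync timeout) are assumptions for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # default server address used in this README


def build_sync_url(target_url, base_url=BASE_URL):
    """Return the GET /fetch/sync URL with the target percent-encoded."""
    return f"{base_url}/fetch/sync?{urlencode({'url': target_url})}"


def fetch_sync(target_url, base_url=BASE_URL, client_timeout=130):
    """Call GET /fetch/sync and return the decoded JSON response."""
    with urlopen(build_sync_url(target_url, base_url), timeout=client_timeout) as resp:
        return json.load(resp)
```

With a server running, `fetch_sync("https://example.com")["markdown"]` would return the Markdown field shown in the response above.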

Async Fetch (For Background Processing)

  • POST /fetch (returns 202 Accepted)
  • Body:
    {
      "url": "https://example.com",
      "callback_url": "https://your-server.com/webhook",
      "github_issue": 123
    }
    

Example:

curl -X POST "http://localhost:8000/fetch" \
     -H "Content-Type: application/json" \
     -d '{"url": "https://x.com/user/status/123"}'

Response:

{
  "job_id": "a1b2c3d4-e5f6-7890",
  "url": "https://x.com/user/status/123",
  "status": "queued"
}

Check Job Status

  • GET /job/{job_id}

Example:

curl "http://localhost:8000/job/a1b2c3d4-e5f6-7890"

Response (Completed):

{
  "id": "a1b2c3d4-e5f6-7890",
  "url": "https://x.com/user/status/123",
  "status": "completed",
  "result": {
    "metadata": {
      "title": "...",
      "author": "...",
      "publish_date": "...",
      "images": [...]
    },
    "markdown": "..."
  },
  "created_at": 1706000000,
  "started_at": 1706000001,
  "completed_at": 1706000010
}
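
The three timestamps make it easy to see how long a job waited versus how long it ran. This is a small sketch, assuming the values are Unix epoch seconds as the sample suggests; `job_timings` is a hypothetical helper.

```python
def job_timings(job):
    """Return (seconds queued, seconds running) for a completed job."""
    queued = job["started_at"] - job["created_at"]
    running = job["completed_at"] - job["started_at"]
    return queued, running
```

For the sample response above this gives 1 second queued and 9 seconds running.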

Health Check

  • GET /health

Response:

{
  "status": "ok",
  "browser_connected": true,
  "active_jobs_queue": 2,
  "active_browser_contexts": 1,
  "concurrency_limit": 2
}

Integration Examples

Python Agent with Job Polling

import requests
import time

def fetch_content_async(url):
    # Submit job
    response = requests.post(
        "http://localhost:8000/fetch",
        json={"url": url}
    )
    job_id = response.json()["job_id"]
    
    # Poll until completed
    while True:
        job_response = requests.get(f"http://localhost:8000/job/{job_id}")
        job = job_response.json()
        
        if job["status"] == "completed":
            return job["result"]["markdown"]
        elif job["status"] == "failed":
            raise Exception(f"Job failed: {job['error']}")
        
        time.sleep(1)  # Poll every second

Using Webhook Callbacks

import requests

# Your webhook endpoint receives:
# POST to callback_url with:
# {
#   "job_id": "...",
#   "url": "...",
#   "status": "completed",
#   "data": {"metadata": {...}, "markdown": "..."},
#   "error": null,
#   "error_details": null
# }

requests.post(
    "http://localhost:8000/fetch",
    json={
        "url": "https://example.com",
        "callback_url": "https://your-server.com/webhooks/ghostfetch"
    }
)
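
On the receiving side, a callback handler only needs to read the JSON body and branch on `status`. The following is a minimal standard-library sketch built on the payload fields listed above. The port, the bind address, and the names `extract_markdown`, `WebhookHandler`, and `serve` are assumptions for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def extract_markdown(payload):
    """Return the markdown from a callback payload, or raise if the job failed."""
    if payload.get("status") != "completed":
        raise RuntimeError(f"job {payload.get('job_id')} failed: {payload.get('error')}")
    return payload["data"]["markdown"]


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            markdown = extract_markdown(payload)
            print(f"{payload['job_id']}: received {len(markdown)} characters")
        except (ValueError, KeyError, TypeError, AttributeError, RuntimeError) as exc:
            print(f"callback not processed: {exc}")
        # Acknowledge every callback so the sender does not see an error
        self.send_response(204)
        self.end_headers()


def serve(port=9000):
    """Listen for callbacks on the given (hypothetical) port."""
    HTTPServer(("127.0.0.1", port), WebhookHandler).serve_forever()
```

Calling `serve()` blocks and handles callbacks until interrupted; a production receiver would also want to authenticate the sender.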

GitHub Integration

When you include a github_issue parameter, GhostFetch will post results as comments:

import requests

requests.post(
    "http://localhost:8000/fetch",
    json={
        "url": "https://example.com",
        "github_issue": 42  # Post result as comment on issue #42
    }
)

Requires:

  • GitHub CLI (gh command) installed
  • GITHUB_TOKEN environment variable set
  • GITHUB_REPO configured

Integration with AI Agents

Your agent can submit a fetch job and poll for results:

import requests
import time

def fetch_blocked_content(url):
    response = requests.post(
        "http://localhost:8000/fetch",
        json={"url": url}
    )
    job_id = response.json()["job_id"]
    
    # Poll for completion
    max_retries = 60
    for _ in range(max_retries):
        result = requests.get(f"http://localhost:8000/job/{job_id}").json()
        if result["status"] == "completed":
            return result["result"]["markdown"]
        elif result["status"] == "failed":
            return f"Error: {result['error']}"
        time.sleep(1)
    
    return "Timeout waiting for result"

Configuration

GhostFetch is configured via environment variables (see src/utils/config.py) or the proxies.txt file.

  • Proxies: Add one proxy per line to proxies.txt in the format http://user:pass@host:port.
  • Strategy: Set PROXY_STRATEGY to round_robin or random.
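
To illustrate the difference between the two strategy names, here is a sketch of how round-robin and random selection over a proxy list typically behave. It is not GhostFetch's implementation (which, per this README, also tracks proxy health); `load_proxies` and `proxy_picker` are hypothetical helpers.

```python
import itertools
import random


def load_proxies(lines):
    """Keep non-empty lines, one proxy URL per line."""
    return [line.strip() for line in lines if line.strip()]


def proxy_picker(proxies, strategy="round_robin"):
    """Return a zero-argument function that gives the next proxy to use."""
    if not proxies:
        raise ValueError("no proxies configured")
    if strategy == "round_robin":
        # Walk the list in order and wrap around at the end
        cycle = itertools.cycle(proxies)
        return lambda: next(cycle)
    if strategy == "random":
        # Pick independently each time; repeats are possible
        return lambda: random.choice(proxies)
    raise ValueError(f"unknown strategy: {strategy}")
```

Round-robin spreads load evenly across the list, while random selection makes the order of use harder to predict.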

Environment Variables

# API Server
GHOSTFETCH_HOST=0.0.0.0
GHOSTFETCH_PORT=8000

# Scraper Settings
MAX_CONCURRENT_BROWSERS=2          # Number of concurrent browser contexts
MIN_DOMAIN_DELAY=10                # Minimum seconds between requests to same domain
MAX_REQUESTS_PER_BROWSER=50        # Restart browser after N requests
MAX_RETRIES=3                      # Retry attempts for failed requests

# Sync Endpoint Settings
SYNC_TIMEOUT_DEFAULT=120           # Default timeout for /fetch/sync (seconds)
MAX_SYNC_TIMEOUT=300               # Maximum allowed timeout (5 minutes)

# GitHub Integration
GITHUB_REPO=iArsalanshah/GhostFetch  # Owner/repo for issue comments

# Persistence
DATABASE_URL=sqlite:///./storage/jobs.db
STORAGE_DIR=storage

# Job Lifecycle
JOB_TTL_SECONDS=86400              # Delete completed jobs after 24 hours
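
The two sync timeout settings interact: a request may ask for its own timeout, but not more than the maximum. The sketch below shows one plausible way to combine them. It assumes an over-limit request is capped rather than rejected, which this README does not confirm; `env_int` and `effective_sync_timeout` are hypothetical helpers.

```python
import os


def env_int(name, default):
    """Read an integer setting from the environment, falling back to default."""
    return int(os.environ.get(name, default))


def effective_sync_timeout(requested=None):
    """Pick the timeout for a sync request, capped at MAX_SYNC_TIMEOUT."""
    default = env_int("SYNC_TIMEOUT_DEFAULT", 120)
    maximum = env_int("MAX_SYNC_TIMEOUT", 300)
    chosen = default if requested is None else requested
    return min(chosen, maximum)
```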

Docker Environment

Create a .env file for docker-compose:

MAX_CONCURRENT_BROWSERS=2
MIN_DOMAIN_DELAY=10
GITHUB_REPO=your-org/your-repo
JOB_TTL_SECONDS=86400

Then run:

docker-compose --env-file .env up

Specific Handling

  • X/Twitter: The scraper waits for [data-testid="tweetText"] to ensure the tweet content is loaded before capturing.

⚠️ Important: Rate Limiting & Ethics

This tool bypasses anti-bot protections. Use responsibly:

  • Respect robots.txt - Check site policies before scraping
  • Implement delays - Use MIN_DOMAIN_DELAY (default: 10 seconds) to avoid overloading servers
  • Throttle requests - Reduce MAX_CONCURRENT_BROWSERS for high-volume scraping
  • Terms of Service - Ensure your use complies with target site's ToS
  • Authentication - When possible, use authorized access instead of bypassing protections
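
A per-domain delay such as MIN_DOMAIN_DELAY can be pictured as a small bookkeeping structure that remembers when each domain was last requested. The sketch below illustrates the idea only and is not GhostFetch's internal code; `DomainThrottle` is a hypothetical class, and the clock is injectable so the behavior can be checked without sleeping.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Track the last request time per domain and report how long to wait."""

    def __init__(self, min_delay=10.0, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock
        self.last_request = {}

    def wait_time(self, url):
        """Seconds to wait before requesting url; 0.0 if it can go now."""
        last = self.last_request.get(urlparse(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record(self, url):
        """Mark that a request to url's domain was just made."""
        self.last_request[urlparse(url).netloc] = self.clock()
```

A caller would `time.sleep(throttle.wait_time(url))` before each fetch and call `throttle.record(url)` right after sending it.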

Recommended Settings for Production

# Conservative (respectful scraping)
MIN_DOMAIN_DELAY=30
MAX_CONCURRENT_BROWSERS=1

# Moderate
MIN_DOMAIN_DELAY=15
MAX_CONCURRENT_BROWSERS=2

# Aggressive (only for your own content)
MIN_DOMAIN_DELAY=5
MAX_CONCURRENT_BROWSERS=4

Production Deployment Guide

1. Proxy Support (Recommended for High-Volume)

For serious stealth, rotate through residential proxies:

# Configure proxies.txt with your proxy list
# GhostFetch will automatically rotate and track health.

Recommended proxy providers:

  • BrightData (datacenter/residential)
  • ScrapingBee (cloud-based)
  • Oxylabs (residential networks)
  • Local proxy rotation with tools like scrapy-proxy-pool

2. Caching Layer (Reduce Redundant Requests)

For repeated fetches, implement Redis caching:

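import json

import redis

cache = redis.Redis(host='localhost', port=6379)

# `scraper` stands for your GhostFetch scraper instance; it is not defined here.
async def fetch_with_cache(url, ttl=3600):
    # Return the cached result if this URL was stored within the TTL
    cached = cache.get(url)
    if cached:
        return json.loads(cached)
    
    # Cache miss: fetch, then store the JSON-encoded result with an expiry
    result = await scraper.fetch(url)
    cache.setex(url, ttl, json.dumps(result))
    return result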

Docker Compose with Redis:

services:
  ghostfetch:
    build: .
    ports:
      - "8000:8000"
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

3. Security & Authentication

Add API key authentication before exposing publicly:

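import os

from fastapi import Header, HTTPException

# Comma-separated keys from the environment; blank entries are dropped
VALID_API_KEYS = {key.strip() for key in os.getenv("API_KEYS", "").split(",") if key.strip()}

# `app` and `FetchRequest` are the service's existing FastAPI app and request model
@app.post("/fetch")
async def fetch_endpoint(request: FetchRequest, x_api_key: str = Header(None)):
    if not x_api_key or x_api_key not in VALID_API_KEYS:
        raise HTTPException(status_code=401, detail="Invalid API key")
    # ... rest of endpoint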

Usage:

curl -X POST "http://localhost:8000/fetch" \
     -H "x-api-key: your-secret-key" \
     -H "Content-Type: application/json" \
     -d '{"url": "https://example.com"}'
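
If timing side channels are a concern, the key check can use a constant-time comparison instead of set membership. This is an optional hardening sketch using the standard library's `hmac.compare_digest`; `parse_api_keys` and `is_valid_key` are hypothetical helpers, not part of GhostFetch.

```python
import hmac


def parse_api_keys(raw):
    """Split a comma-separated API_KEYS value, dropping blank entries."""
    return {key.strip() for key in raw.split(",") if key.strip()}


def is_valid_key(candidate, valid_keys):
    """Check the candidate against every key using constant-time comparisons."""
    if not candidate:
        return False
    # Compare against all keys (no early exit) before combining the results
    matches = [hmac.compare_digest(candidate.encode(), key.encode()) for key in valid_keys]
    return any(matches)
```

Inside the endpoint, `is_valid_key(x_api_key, VALID_API_KEYS)` would replace the `in` test.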

4. Monitoring & Observability

Log rotation (automatically configured):

  • Logs stored in storage/scraper.log
  • Max 5MB per file, keeps 5 backups
  • Check for errors: tail -f storage/scraper.log | grep ERROR

Database queries for analytics:

sqlite3 storage/jobs.db "SELECT status, COUNT(*) FROM jobs GROUP BY status;"
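
The same query can be run from Python with the standard `sqlite3` module. The sketch assumes only what the command above shows, namely a `jobs` table with a `status` column; `status_counts` is a hypothetical helper.

```python
import sqlite3


def status_counts(conn):
    """Return {status: count} for the jobs table."""
    rows = conn.execute("SELECT status, COUNT(*) FROM jobs GROUP BY status")
    return dict(rows.fetchall())
```

Typical use would be `status_counts(sqlite3.connect("storage/jobs.db"))`.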

Health check monitoring:

while true; do
  curl -s http://localhost:8000/health | jq .
  sleep 30
done
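
A monitoring script usually needs a yes/no answer rather than raw JSON. The sketch below reduces the `/health` payload to a boolean, treating the service as healthy when `status` is `"ok"` and `browser_connected` is true. That rule and the helper names are assumptions; adjust them to your own alerting needs.

```python
import json
from urllib.request import urlopen


def is_healthy(health):
    """True when the service reports ok and the browser is connected."""
    return health.get("status") == "ok" and bool(health.get("browser_connected"))


def check_health(base_url="http://localhost:8000", timeout=5):
    """Fetch /health and return (healthy, payload)."""
    with urlopen(f"{base_url}/health", timeout=timeout) as resp:
        payload = json.load(resp)
    return is_healthy(payload), payload
```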

5. Model Context Protocol (MCP)

GhostFetch includes an MCP server for integration with Claude Desktop and other MCP-aware agents.

Configuration (claude_desktop_config.json):

{
  "mcpServers": {
    "ghostfetch": {
      "command": "python",
      "args": ["-m", "ghostfetch.mcp_server"],
      "env": {
        "SYNC_TIMEOUT_DEFAULT": "120"
      }
    }
  }
}

This exposes a ghostfetch tool to the agent:

  • url: The URL to fetch
  • context_id: Optional session ID
  • timeout: Optional timeout (seconds)

Performance & Monitoring

Logging

Logs are written to storage/scraper.log with rotation (5MB max per file) and go to two destinations:

  • Stream output to console (INFO level)
  • File output with detailed format
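
For reference, an equivalent setup with Python's standard `logging` module might look like the sketch below. The 5 MB size and 5 backups follow the figures given in this README; the format string, the DEBUG level on the logger, and the name `build_logger` are assumptions, not GhostFetch's actual configuration.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_logger(log_path, name="ghostfetch-example"):
    """Console output at INFO plus a rotating file (5 MB per file, 5 backups)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)

    # Console handler: INFO and above
    console = logging.StreamHandler()
    console.setLevel(logging.INFO)

    # File handler: rotates at 5 MB and keeps 5 old files, with a detailed format
    file_handler = RotatingFileHandler(log_path, maxBytes=5 * 1024 * 1024, backupCount=5)
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )

    logger.addHandler(console)
    logger.addHandler(file_handler)
    return logger
```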

Load Testing

Run included load tests:

# Python async load test
python scripts/load_test.py

Database

Job history is stored in storage/jobs.db (SQLite):

  • Persistent across restarts
  • Automatic cleanup of old jobs (configurable TTL)
  • Query jobs directly for analytics/debugging

Troubleshooting

Playwright Error: Executable doesn't exist

If you see an error about the browser executable not being found, run:

playwright install chromium

Timeout Errors

If fetching times out, it might be due to slow network or heavy anti-bot protections. You can try:

  • Increasing timeout in src/core/scraper.py (default is 60000ms)
  • Increasing MIN_DOMAIN_DELAY to avoid rate-limiting

Job Stuck in "Processing"

Check logs in storage/scraper.log for errors. If stuck, restart the service.

GitHub Comments Not Posting

Ensure:

  • gh CLI is installed: brew install gh (macOS) or apt install gh (Linux)
  • You're authenticated: gh auth login
  • GITHUB_REPO is set correctly
  • GITHUB_TOKEN is in your environment

High Memory Usage

Reduce MAX_CONCURRENT_BROWSERS or MAX_REQUESTS_PER_BROWSER in configuration.

Publishing Setup

Docker Hub

To enable automated Docker image publishing:

  1. Create a Docker Hub account and repository (your-username/ghostfetch)
  2. Generate an access token at https://hub.docker.com/settings/security
  3. Add these secrets to your GitHub repository:
    • DOCKERHUB_USERNAME: Your Docker Hub username
    • DOCKERHUB_TOKEN: Your access token

Images will be published automatically on pushes to main and version tags.

PyPI (Trusted Publishing)

To enable automated PyPI publishing:

  1. Go to https://pypi.org/manage/account/publishing/
  2. Add a new pending publisher:
    • PyPI Project Name: ghostfetch
    • Owner: iArsalanshah
    • Repository: GhostFetch
    • Workflow name: pypi-publish.yml
    • Environment: pypi
  3. Create a GitHub Release to trigger publishing

No API tokens needed - uses OIDC trusted publishing.

Legal Disclaimer

GhostFetch is provided for educational and research purposes only. Users are solely responsible for ensuring their use complies with:

  1. The Terms of Service of target websites
  2. Applicable laws regarding data access and automation (including CFAA in the US)
  3. The robots.txt and scraping policies of target domains

This tool should not be used to:

  • Scrape private or authenticated content without authorization
  • Circumvent security measures on sites where such circumvention violates applicable law
  • Violate the Terms of Service of social media platforms (including X/Twitter)

The authors assume no liability for misuse of this software.

License

This project is licensed under the MIT License - see the LICENSE file for details.
