A stealthy headless browser service for AI agents. Bypasses anti-bot protections to fetch content and convert to clean Markdown.

Maintainers

iarsalanshah

GhostFetch

A stealthy headless browser service for AI agents. Bypasses anti-bot protections to fetch content from sites like X.com and converts it to clean Markdown.

Features

Zero Setup: Install with pip, browsers auto-install on first run
Synchronous API: Single request returns content directly (no polling needed)
Ghost Protocol: Advanced proxy rotation and cohesive browser fingerprinting
Stealth Browsing: Uses Playwright with custom flags and canvas noise injection
Markdown Output: Automatically converts HTML to Markdown for easy LLM consumption
Metadata Extraction: Automatically extracts title, author, publish date, and images
X.com Support: Logic to wait for dynamic content on Twitter/X
Async Job Queue: Process multiple requests concurrently with intelligent retry
Persistent Sessions: Cookie/localStorage persistence per domain
Webhook Callbacks: Get notified via HTTP when jobs complete
GitHub Integration: Post results directly to GitHub issues
Dual Mode: CLI tool or REST API service
Docker Ready: Pre-configured Docker setup with docker-compose

Quick Start

For AI Agents (Simplest)

# Install from PyPI
pip install ghostfetch

# Fetch any URL (auto-installs browsers on first run)
ghostfetch "https://x.com/user/status/123"

# Or use the Python SDK
python -c "from ghostfetch import fetch; print(fetch('https://example.com')['markdown'])"

For API Usage

# Start the server
ghostfetch serve

# Fetch synchronously (blocks until done)
curl "http://localhost:8000/fetch/sync?url=https://example.com"
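The GET form passes the target as a `url` query parameter, so a target that carries its own query string needs percent-encoding or its `&` and `?` characters will be read as parameters of the GhostFetch request. A minimal sketch using only the standard library; `build_sync_url` is a hypothetical helper, not part of the package:

```python
from urllib.parse import urlencode


def build_sync_url(target_url: str, base: str = "http://localhost:8000") -> str:
    """Return a GET /fetch/sync URL with the target percent-encoded."""
    return f"{base}/fetch/sync?{urlencode({'url': target_url})}"
```

The resulting string can be passed to curl or any HTTP client as-is.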

Installation

Option 1: Docker Hub (Fastest)

# Pull and run
docker run -p 8000:8000 iarsalanshah/ghostfetch

# Or with docker-compose
docker-compose up

Option 2: pip install

# From PyPI
pip install ghostfetch

# Or from source
git clone https://github.com/iArsalanshah/GhostFetch.git
cd GhostFetch
pip install -e .

# Browsers install automatically on first use, or run:
ghostfetch setup

Option 3: Manual Setup

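# Clone the repository (skip if you already have it)
git clone https://github.com/iArsalanshah/GhostFetch.git
cd GhostFetch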

# Create virtual environment (optional)
python3 -m venv venv
source venv/bin/activate

# Install packages & browser
pip install -r requirements.txt
playwright install chromium

Usage

1. CLI Mode (Zero Setup)

Using the ghostfetch CLI (after pip install):

# Basic fetch
ghostfetch "https://x.com/user/status/123"

# JSON output (for parsing)
ghostfetch "https://example.com" --json

# Metadata only
ghostfetch "https://example.com" --metadata-only

# Quiet mode (no progress messages)
ghostfetch "https://example.com" --quiet

Using the legacy module directly:

python -m src.core.scraper "https://x.com/user/status/123"

Output:

--- Metadata ---
{
  "title": "...",
  "author": "...",
  "publish_date": "...",
  "images": [...]
}

--- Markdown ---
[converted markdown content]
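When the plain-text output has to be consumed by a script and `--json` is not an option, the two sections can be split on their marker lines. This is a sketch that assumes the output has exactly the layout shown above; `split_cli_output` is a hypothetical helper, not something the package provides:

```python
import json


def split_cli_output(text: str) -> tuple[dict, str]:
    """Split CLI output into (metadata dict, markdown string).

    Assumes the layout shown above: a '--- Metadata ---' marker, a JSON
    object, a '--- Markdown ---' marker, then the Markdown body.
    """
    meta_marker = "--- Metadata ---"
    md_marker = "--- Markdown ---"
    before_md, sep, markdown = text.partition(md_marker)
    if not sep or meta_marker not in before_md:
        raise ValueError("output does not contain both section markers")
    meta_json = before_md.split(meta_marker, 1)[1]
    return json.loads(meta_json), markdown.strip()
```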

2. API Mode (Service for Agents)

Start the server:

# Using CLI
ghostfetch serve

# Or directly
python main.py

The server will start at http://localhost:8000.

API Endpoints

Synchronous Fetch (Recommended for AI Agents)

POST /fetch/sync — blocks until content is ready
GET /fetch/sync?url=... — same, but via query parameter

Example (POST):

curl -X POST "http://localhost:8000/fetch/sync" \
     -H "Content-Type: application/json" \
     -d '{"url": "https://example.com", "timeout": 60}'

Example (GET):

curl "http://localhost:8000/fetch/sync?url=https://example.com"

Response:

{
  "metadata": {
    "title": "Example Domain",
    "author": "",
    "publish_date": "",
    "images": []
  },
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples..."
}
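For agents written in Python, the response above can be flattened into a small typed object. The sketch below uses only the standard library; `FetchResult`, `parse_fetch_result`, and `fetch_sync` are illustrative names, and the five seconds of client-side slack on the timeout is an arbitrary choice:

```python
import json
import urllib.request
from dataclasses import dataclass, field


@dataclass
class FetchResult:
    markdown: str
    title: str = ""
    author: str = ""
    publish_date: str = ""
    images: list = field(default_factory=list)


def parse_fetch_result(payload: dict) -> FetchResult:
    """Flatten a /fetch/sync response; missing metadata fields become empty."""
    meta = payload.get("metadata") or {}
    return FetchResult(
        markdown=payload.get("markdown", ""),
        title=meta.get("title", ""),
        author=meta.get("author", ""),
        publish_date=meta.get("publish_date", ""),
        images=list(meta.get("images") or []),
    )


def fetch_sync(url: str, timeout: int = 60, base: str = "http://localhost:8000") -> FetchResult:
    """POST to /fetch/sync and parse the reply (needs a running server)."""
    body = json.dumps({"url": url, "timeout": timeout}).encode("utf-8")
    request = urllib.request.Request(
        f"{base}/fetch/sync",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Allow a little slack over the server-side timeout before giving up locally.
    with urllib.request.urlopen(request, timeout=timeout + 5) as response:
        return parse_fetch_result(json.load(response))
```

With the server running, `fetch_sync("https://example.com").markdown` would hold the converted page.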

Async Fetch (For Background Processing)

POST /fetch (returns 202 Accepted)

Body:

{
  "url": "https://example.com",
  "callback_url": "https://your-server.com/webhook",
  "github_issue": 123
}

Example:

curl -X POST "http://localhost:8000/fetch" \
     -H "Content-Type: application/json" \
     -d '{"url": "https://x.com/user/status/123"}'

Response:

{
  "job_id": "a1b2c3d4-e5f6-7890",
  "url": "https://x.com/user/status/123",
  "status": "queued"
}

Check Job Status

GET /job/{job_id}

Example:

curl "http://localhost:8000/job/a1b2c3d4-e5f6-7890"

Response (Completed):

{
  "id": "a1b2c3d4-e5f6-7890",
  "url": "https://x.com/mrnacknack/status/2016134416897360212",
  "status": "completed",
  "result": {
    "metadata": {
      "title": "...",
      "author": "...",
      "publish_date": "...",
      "images": [...]
    },
    "markdown": "..."
  },
  "created_at": 1706000000,
  "started_at": 1706000001,
  "completed_at": 1706000010
}
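The three timestamps in a completed job make it easy to see how long a request waited in the queue versus how long the fetch itself took. A small sketch, assuming the values are Unix timestamps in seconds as they appear to be in the sample; `job_timings` is a hypothetical helper:

```python
def job_timings(job: dict) -> dict:
    """Return queue wait and run time in seconds for a completed job.

    Assumes created_at, started_at and completed_at are Unix timestamps in
    seconds, as they appear to be in the sample response.
    """
    return {
        "queued_seconds": job["started_at"] - job["created_at"],
        "run_seconds": job["completed_at"] - job["started_at"],
        "total_seconds": job["completed_at"] - job["created_at"],
    }
```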

Health Check

GET /health

Response:

{
  "status": "ok",
  "browser_connected": true,
  "active_jobs_queue": 2,
  "active_browser_contexts": 1,
  "concurrency_limit": 2
}
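An agent can use the health payload to decide whether to submit more work. The sketch below derives a few flags from the fields shown above. How `active_browser_contexts` relates to `concurrency_limit` is an assumption here, not documented behaviour, and `summarize_health` is a hypothetical helper:

```python
def summarize_health(health: dict) -> dict:
    """Derive simple readiness flags from a /health payload."""
    ready = health.get("status") == "ok" and bool(health.get("browser_connected"))
    limit = health.get("concurrency_limit", 0)
    active = health.get("active_browser_contexts", 0)
    return {
        "ready": ready,
        # Assumption: contexts at or above the limit means new jobs will queue.
        "saturated": limit > 0 and active >= limit,
        "queued_jobs": health.get("active_jobs_queue", 0),
    }
```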

Integration Examples

Python Agent with Job Polling

import requests
import time

def fetch_content_async(url):
    # Submit job
    response = requests.post(
        "http://localhost:8000/fetch",
        json={"url": url}
    )
    response.raise_for_status()  # Surface HTTP errors instead of a KeyError below
    job_id = response.json()["job_id"]

    # Poll until the job reaches a terminal state
    while True:
        job_response = requests.get(f"http://localhost:8000/job/{job_id}")
        job_response.raise_for_status()
        job = job_response.json()

        if job["status"] == "completed":
            return job["result"]["markdown"]
        elif job["status"] == "failed":
            raise RuntimeError(f"Job failed: {job.get('error')}")

        time.sleep(1)  # Poll every second

Using Webhook Callbacks

import requests

# Your webhook endpoint receives:
# POST to callback_url with:
# {
#   "job_id": "...",
#   "url": "...",
#   "status": "completed",
#   "data": {"metadata": {...}, "markdown": "..."},
#   "error": null,
#   "error_details": null
# }

requests.post(
    "http://localhost:8000/fetch",
    json={
        "url": "https://example.com",
        "callback_url": "https://your-server.com/webhooks/ghostfetch"
    }
)
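On the receiving side, any HTTP server that accepts a JSON POST will do. The following is a minimal standard-library sketch of a receiver for the payload described in the comment above; the port, the `204` reply, and the names `summarize_callback`, `CallbackHandler`, and `serve` are all illustrative choices, and GhostFetch may or may not care about the response code:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_callback(raw_body: bytes) -> str:
    """Turn a callback payload into a one-line summary."""
    payload = json.loads(raw_body)
    if payload.get("status") == "completed":
        markdown = (payload.get("data") or {}).get("markdown", "")
        return f"{payload['job_id']}: completed, {len(markdown)} chars of markdown"
    return f"{payload['job_id']}: {payload.get('status')} ({payload.get('error')})"


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            print(summarize_callback(self.rfile.read(length)))
            self.send_response(204)  # acknowledged, nothing to return
        except (ValueError, KeyError):
            self.send_response(400)  # body was not the expected JSON
        self.end_headers()


def serve(port: int = 9000) -> None:
    """Block and handle callbacks; the port is an arbitrary choice."""
    HTTPServer(("0.0.0.0", port), CallbackHandler).serve_forever()
```

Calling `serve()` starts the receiver; the `callback_url` in the request would then point at that host and port.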

GitHub Integration

When you include a github_issue parameter, GhostFetch will post results as comments:

import requests

requests.post(
    "http://localhost:8000/fetch",
    json={
        "url": "https://example.com",
        "github_issue": 42  # Post result as comment on issue #42
    }
)

Requires:

GitHub CLI (gh command) installed
GITHUB_TOKEN environment variable set
GITHUB_REPO configured

Integration with AI Agents

Your agent can submit a fetch job and poll for results:

import requests
import time

def fetch_blocked_content(url):
    response = requests.post(
        "http://localhost:8000/fetch",
        json={"url": url}
    )
    job_id = response.json()["job_id"]
    
    # Poll for completion
    max_retries = 60
    for _ in range(max_retries):
        result = requests.get(f"http://localhost:8000/job/{job_id}").json()
        if result["status"] == "completed":
            return result["result"]["markdown"]
        elif result["status"] == "failed":
            return f"Error: {result['error']}"
        time.sleep(1)
    
    return "Timeout waiting for result"

Configuration

GhostFetch is configured via environment variables (see src/utils/config.py) or the proxies.txt file.

Proxies: Add one proxy per line to proxies.txt in the format http://user:pass@host:port.
Strategy: Set PROXY_STRATEGY to round_robin or random.
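To make the two strategies concrete, here is how round-robin and random selection over a proxies.txt list typically behave. This is an illustrative sketch only, not GhostFetch's internal code (which also tracks proxy health); `load_proxies` and `proxy_picker` are hypothetical names:

```python
import itertools
import random


def load_proxies(text: str) -> list[str]:
    """Parse proxies.txt content: one proxy URL per line, blanks ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def proxy_picker(proxies: list[str], strategy: str = "round_robin"):
    """Return a zero-argument function that yields the next proxy to use."""
    if not proxies:
        raise ValueError("no proxies configured")
    if strategy == "round_robin":
        cycle = itertools.cycle(proxies)  # walks the list in order, then wraps
        return lambda: next(cycle)
    if strategy == "random":
        return lambda: random.choice(proxies)  # independent pick each time
    raise ValueError(f"unknown strategy: {strategy}")
```

Round-robin spreads load evenly across the list, while random selection makes the sequence of exit addresses harder to predict.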

Environment Variables

# API Server
GHOSTFETCH_HOST=0.0.0.0
GHOSTFETCH_PORT=8000

# Scraper Settings
MAX_CONCURRENT_BROWSERS=2          # Number of concurrent browser contexts
MIN_DOMAIN_DELAY=10                # Minimum seconds between requests to same domain
MAX_REQUESTS_PER_BROWSER=50        # Restart browser after N requests
MAX_RETRIES=3                      # Retry attempts for failed requests

# Sync Endpoint Settings
SYNC_TIMEOUT_DEFAULT=120           # Default timeout for /fetch/sync (seconds)
MAX_SYNC_TIMEOUT=300               # Maximum allowed timeout (5 minutes)

# GitHub Integration
GITHUB_REPO=iArsalanshah/GhostFetch  # Owner/repo for issue comments

# Persistence
DATABASE_URL=sqlite:///./storage/jobs.db
STORAGE_DIR=storage

# Job Lifecycle
JOB_TTL_SECONDS=86400              # Delete completed jobs after 24 hours
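
The interaction between SYNC_TIMEOUT_DEFAULT and MAX_SYNC_TIMEOUT can be pictured as follows. This sketch reads the variables with the defaults listed above and clamps an oversized request to the maximum. The clamping is an assumption (the server might reject such a request instead), and `int_env` and `effective_sync_timeout` are hypothetical helpers:

```python
import os


def int_env(name: str, default: int, env=os.environ) -> int:
    """Read an integer setting, falling back to the documented default."""
    raw = env.get(name, "")
    return int(raw) if raw.strip() else default


def effective_sync_timeout(requested=None, env=os.environ) -> int:
    """Pick the timeout for a sync request.

    Assumption: a request above MAX_SYNC_TIMEOUT is clamped; the server
    might reject it instead.
    """
    default = int_env("SYNC_TIMEOUT_DEFAULT", 120, env)
    maximum = int_env("MAX_SYNC_TIMEOUT", 300, env)
    return min(requested if requested is not None else default, maximum)
```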

Docker Environment

Create a .env file for docker-compose:

MAX_CONCURRENT_BROWSERS=2
MIN_DOMAIN_DELAY=10
GITHUB_REPO=your-org/your-repo
JOB_TTL_SECONDS=86400

Then run:

docker-compose --env-file .env up

Specific Handling

X/Twitter: The scraper waits for [data-testid="tweetText"] to ensure the tweet content is loaded before capturing.
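
The same wait can be reproduced with plain Playwright, which is useful when debugging why a tweet comes back empty. This is a sketch rather than the scraper's actual code: it omits all stealth measures, the host list is an assumption, and `wait_selector_for` and `fetch_html` are hypothetical names. The 60000 ms default mirrors the scraper timeout mentioned under Troubleshooting.

```python
from urllib.parse import urlparse

# Selector taken from the text above; the host list is an assumption.
TWEET_SELECTOR = '[data-testid="tweetText"]'
X_HOSTS = {"x.com", "twitter.com"}


def wait_selector_for(url: str):
    """Return the selector to wait for on this URL, or None."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return TWEET_SELECTOR if host in X_HOSTS else None


def fetch_html(url: str, timeout_ms: int = 60000) -> str:
    """Load a page with Playwright and wait for dynamic content if needed."""
    # Imported lazily so the helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, timeout=timeout_ms)
            selector = wait_selector_for(url)
            if selector:
                page.wait_for_selector(selector, timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()
```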

⚠️ Important: Rate Limiting & Ethics

This tool bypasses anti-bot protections. Use responsibly:

Respect robots.txt - Check site policies before scraping
Implement delays - Use MIN_DOMAIN_DELAY (default: 10 seconds) to avoid overloading servers
Throttle requests - Reduce MAX_CONCURRENT_BROWSERS for high-volume scraping
Terms of Service - Ensure your use complies with target site's ToS
Authentication - When possible, use authorized access instead of bypassing protections
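
If you call GhostFetch from your own batch script, the same per-domain spacing can be enforced on the client side as well. The sketch below shows one way to compute the wait; it is not the server's implementation, and `DomainThrottle` is a hypothetical class. The clock is injectable so the logic can be tested without sleeping.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Track the last request time per domain and report how long to wait."""

    def __init__(self, min_delay: float = 10.0, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock
        self._last = {}

    def wait_needed(self, url: str) -> float:
        """Seconds to sleep before fetching url; records the planned fetch time."""
        domain = urlparse(url).hostname or ""
        now = self.clock()
        last = self._last.get(domain)
        wait = 0.0 if last is None else max(0.0, last + self.min_delay - now)
        self._last[domain] = now + wait
        return wait
```

A caller would run `time.sleep(throttle.wait_needed(url))` before each request.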

Recommended Settings for Production

# Conservative (respectful scraping)
MIN_DOMAIN_DELAY=30
MAX_CONCURRENT_BROWSERS=1

# Moderate
MIN_DOMAIN_DELAY=15
MAX_CONCURRENT_BROWSERS=2

# Aggressive (only for your own content)
MIN_DOMAIN_DELAY=5
MAX_CONCURRENT_BROWSERS=4

Production Deployment Guide

1. Proxy Support (Recommended for High-Volume)

For serious stealth, rotate through residential proxies:

# Configure proxies.txt with your proxy list
# GhostFetch will automatically rotate and track health.

Recommended proxy providers:

BrightData (datacenter/residential)
ScrapingBee (cloud-based)
Oxylabs (residential networks)
Local proxy rotation with tools like scrapy-proxy-pool

2. Caching Layer (Reduce Redundant Requests)

For repeated fetches, implement Redis caching:

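import json

import redis.asyncio as redis  # async client (redis-py 4.2+), so cache calls don't block the event loop

cache = redis.Redis(host='localhost', port=6379)

async def fetch_with_cache(url, ttl=3600):
    # `scraper` is assumed to be an already-initialised GhostFetch scraper object
    cached = await cache.get(url)
    if cached:
        return json.loads(cached)

    result = await scraper.fetch(url)
    await cache.setex(url, ttl, json.dumps(result))  # expire after `ttl` seconds
    return result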

Docker Compose with Redis:

services:
  ghostfetch:
    build: .
    ports:
      - "8000:8000"
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

3. Security & Authentication

Add API key authentication before exposing publicly:

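import os

from fastapi import Header, HTTPException

# Comma-separated keys from the environment; empty entries are dropped
VALID_API_KEYS = {k.strip() for k in os.getenv("API_KEYS", "").split(",") if k.strip()}

# `app` and `FetchRequest` are the existing FastAPI app and request model
@app.post("/fetch")
async def fetch_endpoint(request: FetchRequest, x_api_key: str = Header(None)):
    if not x_api_key or x_api_key not in VALID_API_KEYS:
        raise HTTPException(status_code=401, detail="Invalid API key")
    # ... rest of endpoint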

Usage:

curl -X POST "http://localhost:8000/fetch" \
     -H "x-api-key: your-secret-key" \
     -H "Content-Type: application/json" \
     -d '{"url": "https://example.com"}'
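
If timing side channels are a concern, the key check can use a constant-time comparison instead of plain set membership. This is an optional hardening sketch, not part of GhostFetch; `is_valid_key` is a hypothetical helper that relies on `hmac.compare_digest` from the standard library:

```python
import hmac


def is_valid_key(presented, valid_keys) -> bool:
    """Check a presented key against the allowed keys without early exit."""
    if not presented:
        return False
    candidate = presented.encode("utf-8")
    # Compare against every key so timing does not reveal which one matched.
    matches = [hmac.compare_digest(candidate, key.encode("utf-8")) for key in valid_keys]
    return any(matches)
```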

4. Monitoring & Observability

Log rotation (automatically configured):

Logs stored in storage/scraper.log
Max 5MB per file, keeps 5 backups
Check for errors: tail -f storage/scraper.log | grep ERROR

Database queries for analytics:

sqlite3 storage/jobs.db "SELECT status, COUNT(*) FROM jobs GROUP BY status;"
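
The same query can be run from Python when the numbers feed a dashboard or alert. A minimal sketch that assumes only what the command above implies, namely a `jobs` table with a `status` column; `status_counts` is a hypothetical helper:

```python
import sqlite3


def status_counts(conn: sqlite3.Connection) -> dict:
    """Return {status: count} for the jobs table."""
    rows = conn.execute("SELECT status, COUNT(*) FROM jobs GROUP BY status").fetchall()
    return {status: count for status, count in rows}
```

Against a live install this would be called as `status_counts(sqlite3.connect("storage/jobs.db"))`.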

Health check monitoring:

while true; do
  curl http://localhost:8000/health | jq .
  sleep 30
done

Model Context Protocol (MCP)

GhostFetch includes an MCP server for integration with Claude Desktop and other MCP-aware agents.

Configuration (claude_desktop_config.json):

{
  "mcpServers": {
    "ghostfetch": {
      "command": "python",
      "args": ["-m", "ghostfetch.mcp_server"],
      "env": {
        "SYNC_TIMEOUT_DEFAULT": "120"
      }
    }
  }
}

This exposes a ghostfetch tool to the agent:

url: The URL to fetch
context_id: Optional session ID
timeout: Optional timeout (seconds)

Performance & Monitoring

Logging

Logs are written to storage/scraper.log with rotation (5MB max):

Stream output to console (INFO level)
File output with detailed format
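
For reference, an equivalent setup with Python's standard `logging` module looks roughly like this. It is a sketch of the described behaviour (console at INFO, rotating file at 5 MB with 5 backups), not a copy of the project's logging code; the format strings, the logger name, and `build_logger` itself are assumptions:

```python
import logging
from logging.handlers import RotatingFileHandler


def build_logger(log_path: str, name: str = "ghostfetch.example") -> logging.Logger:
    """Create a logger with an INFO console stream and a rotating file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    console.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

    # 5 MB per file and 5 rotated backups, matching the figures quoted in this guide.
    rotating = RotatingFileHandler(log_path, maxBytes=5 * 1024 * 1024, backupCount=5)
    rotating.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(filename)s:%(lineno)d %(message)s")
    )

    logger.addHandler(console)
    logger.addHandler(rotating)
    return logger
```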

Load Testing

Run included load tests:

# Python async load test
python scripts/load_test.py

Database

Job history is stored in storage/jobs.db (SQLite):

Persistent across restarts
Automatic cleanup of old jobs (configurable TTL)
Query jobs directly for analytics/debugging

Troubleshooting

Playwright Error: Executable doesn't exist

If you see an error about the browser executable not being found, run:

playwright install chromium

Timeout Errors

If fetching times out, the cause may be a slow network or heavy anti-bot protection. You can try:

Increasing timeout in src/core/scraper.py (default is 60000ms)
Increasing MIN_DOMAIN_DELAY to avoid rate-limiting

Job Stuck in "Processing"

Check logs in storage/scraper.log for errors. If the job remains stuck, restart the service.

GitHub Comments Not Posting

Ensure:

gh CLI is installed: brew install gh (macOS) or apt install gh (Linux)
You're authenticated: gh auth login
GITHUB_REPO is set correctly
GITHUB_TOKEN is in your environment

High Memory Usage

Reduce MAX_CONCURRENT_BROWSERS or MAX_REQUESTS_PER_BROWSER in configuration.

Publishing Setup

Docker Hub

To enable automated Docker image publishing:

1. Create a Docker Hub account and repository (your-username/ghostfetch)
2. Generate an access token at https://hub.docker.com/settings/security
3. Add these secrets to your GitHub repository:
   - DOCKERHUB_USERNAME: Your Docker Hub username
   - DOCKERHUB_TOKEN: Your access token

Images will be published automatically on pushes to main and version tags.

PyPI (Trusted Publishing)

To enable automated PyPI publishing:

1. Go to https://pypi.org/manage/account/publishing/
2. Add a new pending publisher:
   - PyPI Project Name: ghostfetch
   - Owner: iArsalanshah
   - Repository: GhostFetch
   - Workflow name: pypi-publish.yml
   - Environment: pypi
3. Create a GitHub Release to trigger publishing

No API tokens needed - uses OIDC trusted publishing.

Legal Disclaimer

GhostFetch is provided for educational and research purposes only. Users are solely responsible for ensuring their use complies with:

The Terms of Service of target websites
Applicable laws regarding data access and automation (including CFAA in the US)
The robots.txt and scraping policies of target domains

This tool should not be used to:

Scrape private or authenticated content without authorization
Circumvent security measures on sites where such circumvention violates applicable law
Violate the Terms of Service of social media platforms (including X/Twitter)

The authors assume no liability for misuse of this software.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Release history

2026.5.25.1: May 25, 2026
2026.5.25: May 25, 2026
2026.5.16: May 16, 2026
2026.3.25: Mar 25, 2026
2026.2.10: Feb 10, 2026
2026.2.9.1: Feb 9, 2026
2026.2.9: Feb 8, 2026
2026.2.6: Feb 6, 2026
2026.2.5 (this version): Feb 5, 2026
1.0.0: Feb 4, 2026

Download files

Source Distribution

ghostfetch-2026.2.5.tar.gz (31.8 kB)

Uploaded Feb 5, 2026 Source

Built Distribution

ghostfetch-2026.2.5-py3-none-any.whl (29.9 kB)

Uploaded Feb 5, 2026 Python 3

File details

Details for the file ghostfetch-2026.2.5.tar.gz.

File metadata

Download URL: ghostfetch-2026.2.5.tar.gz
Upload date: Feb 5, 2026
Size: 31.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ghostfetch-2026.2.5.tar.gz
Algorithm	Hash digest
SHA256	`4289cd5396afeeca213ee791c2065057980a3b857d9e831c2843f51e238bb84e`
MD5	`dfb2295c356a4251d64eda4e75a82851`
BLAKE2b-256	`c6070480bf9d766c34a25a83b91d1ffb2b46cf7b9f8aa8281958b12ea6d65db9`

Provenance

The following attestation bundles were made for ghostfetch-2026.2.5.tar.gz:

Publisher: pypi-publish.yml on iArsalanshah/GhostFetch

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ghostfetch-2026.2.5.tar.gz
- Subject digest: 4289cd5396afeeca213ee791c2065057980a3b857d9e831c2843f51e238bb84e
- Sigstore transparency entry: 919935629
- Sigstore integration time: Feb 5, 2026
Source repository:
- Permalink: iArsalanshah/GhostFetch@d8eb6b29400ef63117edecfb965ff86b033e5763
- Branch / Tag: refs/tags/2026.2.5
- Owner: https://github.com/iArsalanshah
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@d8eb6b29400ef63117edecfb965ff86b033e5763
- Trigger Event: release

File details

Details for the file ghostfetch-2026.2.5-py3-none-any.whl.

File metadata

Download URL: ghostfetch-2026.2.5-py3-none-any.whl
Upload date: Feb 5, 2026
Size: 29.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ghostfetch-2026.2.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0a3b1762c35f2f349bd770c5e22f3210d10b4c77b44f2ba0b6c358430cf352ee`
MD5	`f6463af814701650ef4818b246392213`
BLAKE2b-256	`b94023db4aabbb38647ad58488288de1fc45be75eeb8b9355b75494147bae943`

Provenance

The following attestation bundles were made for ghostfetch-2026.2.5-py3-none-any.whl:

Publisher: pypi-publish.yml on iArsalanshah/GhostFetch

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ghostfetch-2026.2.5-py3-none-any.whl
- Subject digest: 0a3b1762c35f2f349bd770c5e22f3210d10b4c77b44f2ba0b6c358430cf352ee
- Sigstore transparency entry: 919935634
- Sigstore integration time: Feb 5, 2026
Source repository:
- Permalink: iArsalanshah/GhostFetch@d8eb6b29400ef63117edecfb965ff86b033e5763
- Branch / Tag: refs/tags/2026.2.5
- Owner: https://github.com/iArsalanshah
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@d8eb6b29400ef63117edecfb965ff86b033e5763
- Trigger Event: release
