SERP scraper with stealth browsing (nodriver), proxy rotation, and intelligent caching

⚠️ WARNING: This project is in early development stages.

SERP Scraper

A powerful, async Python library for scraping Google and Bing Search Engine Results Pages (SERPs) with proxy rotation, intelligent caching, and stealth browsing.

Python 3.10+ License: MIT

Features

  • Dual Search Methods: Browser-based (nodriver) and HTTP-based (httpx) scraping
  • Google News RSS: Scrape news articles via Google News RSS feeds
  • Proxy Rotation: DataImpulse and custom proxy support with automatic rotation
  • Intelligent Caching: Disk-based caching with configurable TTL
  • CAPTCHA Handling: Automatic detection with retry logic and exponential backoff
  • Type Safety: Full type annotations with Pydantic validation
  • Async/Await: Modern asynchronous API design
  • Environment Config: .env file support for configuration
  • CLI Tool: Interactive command-line interface for testing

Installation

Basic Installation

pip install serp-scraper

From Source

git clone https://github.com/neuronaline/serp-scraper.git
cd serp-scraper
pip install -e .

With Dependencies

pip install serp-scraper[dev]  # With dev tools
pip install serp-scraper[test] # With test dependencies

Requirements

  • Python 3.10 or higher
  • Google Chrome browser installed

Quick Start

Recommended: Using SerpClient

import asyncio
from serp import SerpClient

async def main():
    async with SerpClient() as client:
        results = await client.search("python programming")

        for r in results:
            print(f"{r.rank}. {r.title}")
            print(f"   {r.url}")
            print(f"   {r.description[:100]}...")

asyncio.run(main())

Quick Functions

For simple use cases without creating a client:

import asyncio
from serp import quick_search, quick_fetch

async def main():
    # Search
    results = await quick_search("web scraping")
    print(f"Found {len(results)} results")

    # Fetch URL
    content = await quick_fetch("https://example.com")
    print(content[:500])

asyncio.run(main())

Google News RSS

Scrape news articles using Google News RSS feeds:

import asyncio
from serp import GoogleNewsClient

async def main():
    async with GoogleNewsClient(language="en", country="US") as client:
        news = await client.get_news("Tesla", max_results=20)

        for r in news:
            print(f"{r.title}")
            print(f"  Source: {r.source}")
            print(f"  URL: {r.url}")
            print(f"  Date: {r.published}")

asyncio.run(main())

Or use the quick function:

import asyncio
from serp import quick_news

async def main():
    news = await quick_news("Tesla", language="en", country="US", max_results=20)
    print(f"Found {len(news)} news articles")

asyncio.run(main())

URL Fetching

import asyncio
from serp import SerpClient

async def main():
    async with SerpClient() as client:
        # Fetch page content as Markdown
        content = await client.fetch("https://example.com")
        print(content)

asyncio.run(main())

Using with Configuration

import asyncio
from serp import SerpClient, SerpConfig

# Create a configuration object
config = SerpConfig(
    log_level="DEBUG",
    max_retries=5,
    cache_ttl=3600,  # 1 hour
    cache_enabled=True,
)

async def main():
    async with SerpClient(config) as client:
        results = await client.search("python tutorial")

asyncio.run(main())

Environment Variables (.env file)

Create a .env file in your project. Copy from .env.example for all options:

# DataImpulse Proxy (recommended)
SERP_DATAIMPULSE_GATEWAY=http://gw.dataimpulse.com:10001
SERP_DATAIMPULSE_USER=your_username
SERP_DATAIMPULSE_PASS=your_password

# Custom Proxies (comma-separated)
SERP_CUSTOM_PROXIES=http://user:pass@proxy1.com:8080,socks5://proxy2.com:1080

# Proxy Strategy: "random" or "dataimpulse_first"
SERP_PROXY_STRATEGY=dataimpulse_first

# Logging
SERP_LOG_LEVEL=WARNING
SERP_DEBUG=false

# Cache
SERP_CACHE_ENABLED=true
SERP_CACHE_DIR=.cache/serp
SERP_CACHE_TTL=86400

# Search
SERP_DEFAULT_SOURCE=auto  # "google", "bing", or "auto"
SERP_HEADLESS=false
SERP_TIMEOUT=30

# Retry
SERP_MAX_RETRIES=3
SERP_RETRY_DELAY_MIN=0.5
SERP_RETRY_DELAY_MAX=2.0
SERP_EXPONENTIAL_BACKOFF=false

# Custom User Agent (optional)
SERP_USER_AGENT=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36

Configuration

SerpConfig Options

| Parameter | Type | Default | Description |
|---|---|---|---|
| custom_proxies | str | "" | Comma-separated proxy URLs from env |
| proxy_strategy | str | "dataimpulse_first" | Proxy selection: "random" or "dataimpulse_first" |
| dataimpulse_gateway | str | None | DataImpulse gateway URL |
| dataimpulse_user | str | None | DataImpulse username |
| dataimpulse_pass | str | None | DataImpulse password |
| log_level | str | "WARNING" | Logging level (DEBUG, INFO, WARNING, ERROR) |
| max_retries | int | 3 | Maximum retry attempts (1-10) |
| retry_delay_min | float | 0.5 | Minimum retry delay in seconds |
| retry_delay_max | float | 2.0 | Maximum retry delay in seconds |
| exponential_backoff | bool | false | Use exponential backoff |
| timeout | int | 30 | Request timeout in seconds (5-120) |
| cache_enabled | bool | true | Enable/disable caching |
| cache_dir | str | ".cache/serp" | Cache directory path |
| cache_ttl | int | 86400 | Cache TTL in seconds (min 60) |
| default_source | str | "auto" | Default search source: "google", "bing", or "auto" |
| headless | bool | false | Run browser in headless mode |
| user_agent | str | None | Custom user agent string |

Environment Variables

| Variable | Description | Default |
|---|---|---|
| SERP_DATAIMPULSE_GATEWAY | DataImpulse gateway URL | - |
| SERP_DATAIMPULSE_USER | DataImpulse username | - |
| SERP_DATAIMPULSE_PASS | DataImpulse password | - |
| SERP_CUSTOM_PROXIES | Comma-separated proxy URLs | - |
| SERP_PROXY_STRATEGY | Proxy selection strategy | dataimpulse_first |
| SERP_LOG_LEVEL | Logging level | WARNING |
| SERP_CACHE_DIR | Cache directory path | .cache/serp |
| SERP_CACHE_TTL | Default cache TTL in seconds | 86400 |
| SERP_CACHE_ENABLED | Enable/disable caching | true |
| SERP_MAX_RETRIES | Maximum retry attempts | 3 |
| SERP_RETRY_DELAY_MIN | Minimum retry delay (seconds) | 0.5 |
| SERP_RETRY_DELAY_MAX | Maximum retry delay (seconds) | 2.0 |
| SERP_EXPONENTIAL_BACKOFF | Use exponential backoff | false |
| SERP_TIMEOUT | Request timeout in seconds | 30 |
| SERP_DEBUG | Enable debug logging | false |
| SERP_DOTENV_FILE | Path to .env file | Auto-detect |
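
The SERP_ prefix implies a straightforward environment-to-field mapping. The library validates configuration with Pydantic; the sketch below only approximates the idea with a plain dictionary, and the type-coercion rules shown are assumptions for illustration.

```python
def read_serp_env(environ: dict[str, str]) -> dict[str, object]:
    """Illustrative SERP_* parsing; the library itself validates via Pydantic."""
    defaults: dict[str, object] = {"log_level": "WARNING", "max_retries": 3, "cache_enabled": True}
    out = dict(defaults)
    for key, raw in environ.items():
        if not key.startswith("SERP_"):
            continue
        field = key[len("SERP_"):].lower()
        if field in ("max_retries", "timeout", "cache_ttl"):
            out[field] = int(raw)            # integer settings
        elif field in ("cache_enabled", "debug", "headless", "exponential_backoff"):
            out[field] = raw.strip().lower() == "true"  # boolean settings
        else:
            out[field] = raw                 # plain strings (paths, URLs, levels)
    return out

print(read_serp_env({"SERP_LOG_LEVEL": "DEBUG", "SERP_MAX_RETRIES": "5"}))
```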

API Reference

SerpClient

The recommended high-level interface for using the library.

from serp import SerpClient

client = SerpClient(
    headless=False,              # Optional
    use_cache=True,             # Optional
    cache_ttl=86400,            # Optional
    source=None,                # Optional: "google", "bing", or None (auto)
    max_retries=3,              # Optional
    timeout=30,                 # Optional
    log_level="WARNING",        # Optional
)

Methods

client.search(query, page_num=1, method=None, source=None, use_cache=None)

Search for a query and return results.

Parameters:

  • query (str): Search query string
  • page_num (int): Page number (1-based), defaults to 1
  • method (str): Search method - "browser" (nodriver), "http" (httpx), or None (auto)
  • source (str): Search engine - "google", "bing", or None (auto: google first, bing fallback)
  • use_cache (bool): Whether to use cache. None uses client default.

Returns:

  • list[SearchResult]: List of SearchResult objects

Raises:

  • ProxyError: All proxies failed
  • CaptchaError: CAPTCHA detected after all retries
  • PageTimeoutError: Page load timeout
  • ParseError: Failed to parse results

client.fetch(url, use_cache=None, prefer_browser=True)

Fetch a URL and return content as Markdown.

Parameters:

  • url (str): Target URL
  • use_cache (bool): Whether to use cache. None uses client default.
  • prefer_browser (bool): If True, use browser directly. If False, try HTTP first then fallback to browser.

Returns:

  • str: Page content converted to Markdown

SearchResult

Typed result object returned by search operations.

from serp import SearchResult

result = SearchResult(
    rank=1,
    title="Example Title",
    url="https://example.com",
    description="Example description...",
    source="google"  # or "bing"
)

Attributes:

  • rank (int): Position in search results (1-based)
  • title (str): Result title
  • url (str): Target URL
  • description (str): Result snippet/description
  • source (str): Search engine source ("google" or "bing")

Methods:

  • to_dict(): Convert to dictionary for backward compatibility
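
to_dict() is handy when results need to be serialized. The stand-in dataclass below mirrors the documented attributes to show the shape of the output; it is not the real class (SearchResult is a Pydantic model), only a sketch of what the dictionary contains.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class SearchResultStub:
    """Stand-in mirroring SearchResult's documented fields (not the real class)."""
    rank: int
    title: str
    url: str
    description: str
    source: str

    def to_dict(self) -> dict:
        # Flatten the record into a plain dict, ready for json.dumps().
        return asdict(self)

row = SearchResultStub(1, "Example Title", "https://example.com", "Example description...", "google")
print(json.dumps(row.to_dict(), indent=2))
```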

Quick Functions

Module-level convenience functions using default client:

from serp import quick_search, quick_fetch, quick_search_http

# Quick search (auto method)
results = await quick_search("query")

# HTTP-based search only
results = await quick_search_http("query")

# Fetch URL
content = await quick_fetch("https://example.com")

GoogleNewsClient

Client for scraping Google News via RSS feeds.

from serp import GoogleNewsClient

client = GoogleNewsClient(
    language="tr",        # Language code (tr, en, etc.)
    country="TR",        # Country code (TR, US, etc.)
    time_range="d",      # Time range: "h" (hour), "d" (day), "w" (week), "m" (month)
)

Methods

client.get_news(query, max_results=50, queries=None)

Get news articles for a search term.

Parameters:

  • query (str): Search term to find news for
  • max_results (int): Maximum number of results (default: 50)
  • queries (list[str]): Custom list of queries to use

Returns:

  • list[NewsResult]: List of NewsResult objects

quick_news(query, max_results=50, language="tr", country="TR")

Convenience function for quick news retrieval.

from serp import quick_news

news = await quick_news("Tesla", max_results=20, language="en", country="US")

NewsResult

Typed result object for Google News articles.

from datetime import datetime
from serp import NewsResult

result = NewsResult(
    title="Tesla announces new model",
    url="https://news.google.com/rss/articles/...",
    original_url="https://example.com/article",
    published=datetime(2026, 5, 11, 8, 0, 0),
    source="BBC",
    description="Tesla unveiled...",
    query="Tesla"
)

Attributes:

  • title (str): News headline
  • url (str): Google News RSS URL
  • original_url (str): Original article URL (extracted from description)
  • published (datetime): Publication date
  • source (str): News source name (e.g., "BBC", "NTV")
  • description (str): News summary/snippet
  • query (str): Search query that returned this result

Methods:

  • to_dict(): Convert to dictionary

NewsSettings

Configuration for Google News scraping.

from serp import NewsSettings

settings = NewsSettings(
    language="tr",    # Language code
    country="TR",     # Country code
    time_range="d",   # Time range: "h", "d", "w", "m"
)

Attributes:

  • language (str): Language code (default: "tr")
  • country (str): Country code (default: "TR")
  • time_range (str): Time range filter (default: "d")

Utility Functions

set_log_level(level)

Set the log level for all serp loggers.

from serp import set_log_level

set_log_level("DEBUG")  # Enable debug logging
set_log_level("WARNING")  # Only show warnings and errors

Exceptions

| Exception | Description |
|---|---|
| ProxyError | All proxies failed |
| CaptchaError | CAPTCHA could not be solved after retries |
| PageTimeoutError | Page load timeout |
| ParseError | Failed to parse results |

Constants

| Constant | Description |
|---|---|
| MAX_RETRIES | Maximum retry attempts (default: 3) |
| TIMEOUT_MS | Page timeout in milliseconds (default: 30000) |
| USER_AGENTS | List of user agent strings for rotation |

Interactive CLI

The package includes an interactive CLI tool for testing:

python main.py

Features:

  • SERP Search testing
  • URL Fetch testing
  • Google News RSS testing
  • Proxy status checking

Project Structure

serp-scraper/
├── serp/                    # Main package
│   ├── __init__.py          # Exports and API
│   ├── client.py            # SerpClient and quick functions
│   ├── config.py            # Configuration constants
│   ├── config_pydantic.py   # Pydantic-based configuration
│   ├── types.py             # Type definitions (SearchResult, etc.)
│   ├── google_news.py       # Google News RSS client
│   ├── search.py            # Browser-based search
│   ├── fetch.py             # URL fetch functionality
│   ├── simple.py            # HTTP-based search
│   ├── parsers.py           # Result parsing logic
│   ├── cache.py             # Disk-based caching
│   └── utils.py             # Utilities and helpers
├── tests/                   # Test suite
│   ├── conftest.py          # Test fixtures
│   ├── test_serp.py
│   ├── test_google_news.py
│   └── test_cache.py
├── main.py                  # Interactive CLI tool
├── .env.example            # Environment variables template
├── pyproject.toml          # Project metadata
└── README.md               # This file

Testing

Run the test suite:

pytest

Run with coverage:

pytest --cov=serp --cov-report=html

Run specific test file:

pytest tests/test_serp.py

Architecture

Search Flow

  1. Check cache for existing results (if use_cache=True)
  2. Load proxy configuration from file or environment
  3. Select random proxy (or DataImpulse if configured)
  4. Create browser with stealth settings (browser method) or use HTTP client (http method)
  5. Navigate to search URL
  6. Wait for results to load
  7. Check for CAPTCHA
  8. Parse organic results
  9. Cache results before returning
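
The retry behavior behind steps 3-7 is governed by the documented settings (max_retries, retry_delay_min/max, exponential_backoff). A minimal sketch of the resulting delay schedule, not the library's actual implementation:

```python
import random

def retry_delays(max_retries: int = 3, delay_min: float = 0.5,
                 delay_max: float = 2.0, exponential: bool = False) -> list[float]:
    """Illustrative delay schedule: jittered base delay, optionally doubling."""
    delays = []
    for attempt in range(max_retries):
        base = random.uniform(delay_min, delay_max)
        # With exponential backoff, each attempt doubles the jittered base delay.
        delays.append(base * (2 ** attempt) if exponential else base)
    return delays

print(retry_delays(exponential=True))
```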

Search Methods

Browser Method (method="browser"):

  • Uses nodriver for stealth Chrome automation
  • More reliable, harder to detect
  • Slower due to browser overhead

HTTP Method (method="http"):

  • Uses httpx for direct HTTP requests
  • Faster, less resource intensive
  • May be blocked more easily

Caching

The caching system uses a disk-based approach:

  • Cache entries stored as JSON files in .cache/serp/
  • Keys are SHA256 hashes of query parameters
  • Automatic expiration based on TTL
  • Can be disabled via cache_enabled=False or SERP_CACHE_ENABLED=false
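
Because keys are SHA256 hashes of the query parameters, a cache key can be derived deterministically from a request. The exact serialization below is an assumption for illustration; the library's own canonical form may differ.

```python
import hashlib
import json

def cache_key(query: str, page_num: int = 1, source: str = "google") -> str:
    """Illustrative cache key: SHA256 over a canonical JSON of the parameters."""
    payload = json.dumps(
        {"query": query, "page_num": page_num, "source": source},
        sort_keys=True,  # canonical ordering makes the hash deterministic
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

print(cache_key("python programming"))  # 64-hex-char, filename-safe key
```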

Error Handling

The library provides specific exceptions for different failure modes:

  • ProxyError: All configured proxies failed or returned errors
  • CaptchaError: Search engine detected automation and presented CAPTCHA
  • PageTimeoutError: Page did not load within the timeout period
  • ParseError: Page loaded but results could not be parsed
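
A typical handler catches each failure mode separately. Note the import path for the exceptions from the top-level serp package is an assumption (check serp/__init__.py); the import is kept inside the function so the sketch reads standalone.

```python
import asyncio

async def robust_search(query: str) -> list:
    # Assumption: exceptions are exported from the top-level `serp` package.
    from serp import SerpClient, ProxyError, CaptchaError, PageTimeoutError, ParseError

    async with SerpClient() as client:
        try:
            return await client.search(query)
        except CaptchaError:
            print("CAPTCHA persisted through all retries; consider another proxy pool")
        except ProxyError:
            print("every configured proxy failed")
        except PageTimeoutError:
            print("page did not load within the timeout")
        except ParseError:
            print("results page loaded but could not be parsed")
        return []

# In an application: results = asyncio.run(robust_search("python programming"))
```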

Dependencies

| Package | Version | Purpose |
|---|---|---|
| nodriver | >=4.0.0 | Stealth Chrome automation |
| markdownify | >=0.12.0 | HTML to Markdown conversion |
| httpx | >=0.25.0 | Async HTTP client |
| beautifulsoup4 | >=4.12.0 | HTML parsing |
| pydantic | >=2.0.0 | Configuration validation |
| python-dotenv | >=1.0.0 | .env file support |

Development Dependencies

| Package | Version | Purpose |
|---|---|---|
| pytest | >=7.0.0 | Testing framework |
| pytest-asyncio | >=0.21.0 | Async test support |
| pytest-cov | >=4.0.0 | Coverage reporting |
| pytest-mock | >=3.10.0 | Mocking utilities |
| pytest-httpserver | >=1.0.0 | HTTP server for testing |
| ruff | >=0.1.0 | Linting |
| mypy | >=1.0.0 | Type checking |
| build | >=1.0.0 | Package building |
| twine | >=4.0.0 | Package publishing |

License

MIT License - see LICENSE file for details.

Author

neuronaline - flashneuron@proton.me

Disclaimer

This software is provided for educational and legitimate purposes only. Users are responsible for ensuring their use complies with search engine Terms of Service and applicable laws. The authors assume no liability for misuse of this software.
