SERP scraper with nodriver, proxy rotation, and stealth browsing

⚠️ WARNING: This project is in early development stages.

SERP Scraper

A powerful, async Python library for scraping Google and Bing Search Engine Results Pages (SERPs) with proxy rotation, intelligent caching, and stealth browsing.

Features

Dual Search Methods: Browser-based (nodriver) and HTTP-based (httpx) scraping
Google News RSS: Scrape news articles via Google News RSS feeds
Google Scholar: Search academic papers from Google Scholar
Proxy Rotation: DataImpulse and custom proxy support with automatic rotation
Intelligent Caching: Disk-based caching with configurable TTL
CAPTCHA Handling: Automatic detection with retry logic and exponential backoff
Type Safety: Full type annotations with Pydantic validation
Async/Await: Modern asynchronous API design
Environment Config: .env file support for configuration
CLI Tool: Interactive command-line interface for testing
Dual Output Format: Text (human-readable) and JSON (LLM-friendly) output
REST API: Optional FastAPI-based REST API with rate limiting and authentication
Content Compression: Built-in compress_content() utility for truncating long content into head, middle, and tail portions — available both in the core library and the REST API

Installation

Basic Installation

pip install serp-scraper

From Source

git clone https://github.com/neuronaline/serp-scraper.git
cd serp-scraper
pip install -e .

With Dependencies

pip install serp-scraper[dev]  # With dev tools
pip install serp-scraper[test] # With test dependencies
pip install serp-scraper[api]  # With REST API (FastAPI)

Requirements

Python 3.10 or higher
Google Chrome browser installed
Virtual display (for non-headless mode, the default):
- Linux/headless servers: Install Xvfb (sudo apt install xvfb) and use DISPLAY=:99 or run with xvfb-run
- macOS: No additional setup needed (has built-in display)
- Windows: No additional setup needed (has built-in display)

Quick Start

Recommended: Using SerpClient

import asyncio
from serp import SerpClient

async def main():
    async with SerpClient() as client:
        results = await client.search("python programming")

        for r in results:
            print(f"{r.rank}. {r.title}")
            print(f"   {r.url}")
            print(f"   {r.description[:100]}...")

asyncio.run(main())

Running on Headless Servers

By default, the browser runs in non-headless mode (a visible window), which requires a display. On headless servers or in CI/CD environments, use one of these approaches:

Option 1: Use xvfb-run (recommended for Linux):

xvfb-run -a python your_script.py

Option 2: Point the DISPLAY environment variable at a virtual display that is already running (for example, one started with Xvfb :99 &):

DISPLAY=:99 python your_script.py

Option 3: Run in headless mode:

async with SerpClient(headless=True) as client:
    results = await client.search("python programming")

Or via environment variable:

SERP_HEADLESS=true python your_script.py

Note: The VirtualScreenRequiredError exception is raised when running non-headless without a display.
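
A script can also check for a display itself and switch to headless mode when none is found. The sketch below is only an illustration, not the library's own detection logic: `needs_headless` is a hypothetical helper, and it assumes that on Linux an unset DISPLAY and WAYLAND_DISPLAY means no display is available.

```python
import os
import sys


def needs_headless(env=None, platform=None):
    """Return True when no display seems available (hypothetical helper).

    Assumption: on Linux, an unset DISPLAY and WAYLAND_DISPLAY means there is
    no X11/Wayland display; macOS and Windows are treated as always having one.
    """
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform
    if not platform.startswith("linux"):
        return False
    return not (env.get("DISPLAY") or env.get("WAYLAND_DISPLAY"))


# The result could then be passed on, e.g. SerpClient(headless=headless).
headless = needs_headless()
```

Passing `env` and `platform` explicitly keeps the check easy to test without touching the real environment.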

Google News RSS

Scrape news articles using Google News RSS feeds:

import asyncio
from serp import GoogleNewsClient

async def main():
    async with GoogleNewsClient(language="en", country="US") as client:
        news = await client.get_news("Tesla", max_results=20)

        for r in news:
            print(f"{r.title}")
            print(f"  Source: {r.source}")
            print(f"  URL: {r.url}")
            print(f"  Date: {r.published}")

asyncio.run(main())

Or use the quick function:

import asyncio
from serp import quick_news

async def main():
    news = await quick_news("Tesla", language="en", country="US", max_results=20)
    print(f"Found {len(news)} news articles")

asyncio.run(main())

URL Fetching

import asyncio
from serp import SerpClient

async def main():
    async with SerpClient() as client:
        # Fetch page content as Markdown
        # Automatically detects JavaScript and uses browser if needed
        content = await client.fetch("https://example.com")
        print(content)

asyncio.run(main())

Fetch Strategy:

Static pages (no JavaScript): Uses fast HTTP + BeautifulSoup4
JavaScript-detected pages: Automatically uses browser (nodriver) for execution
Failed/incomplete fetch: Falls back to browser
Full page load guarantee: Browser waits for load event + 1s for JS rendering
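
How the library decides that a page "needs JavaScript" is not documented here. As a rough illustration of what such a check could look like, the sketch below uses two assumed heuristics (an empty single-page-app mount point, or scripts with very little static text). The function name, the markers, and the 200-character cutoff are all hypothetical.

```python
import re

# Hypothetical markers of client-side rendering; the library's real
# detection rules are not documented in this README.
_SPA_ROOT = re.compile(
    r'<div[^>]+id=["\'](?:root|app|__next)["\'][^>]*>\s*</div>', re.I
)
_TAGS = re.compile(r"<script\b.*?</script>|<style\b.*?</style>|<[^>]+>", re.I | re.S)


def looks_js_rendered(html, min_text_chars=200):
    """Guess whether a page needs a browser to render its content."""
    if _SPA_ROOT.search(html):
        return True  # empty mount point: content is injected by JavaScript
    has_script = "<script" in html.lower()
    # Strip script/style blocks and tags, then collapse whitespace.
    visible_text = " ".join(_TAGS.sub(" ", html).split())
    # Scripts plus very little static text suggests client-side rendering.
    return has_script and len(visible_text) < min_text_chars
```

A heuristic like this will misclassify some pages, which is why a browser fallback on failed or incomplete fetches is still useful.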

Content Compression

Compress long content into head, middle, and tail portions — useful for LLM contexts where you want to retain key information while reducing token count:

from serp import compress_content, CompressionMeta

long_text = "..."  # e.g. 20,000+ characters
compressed, meta = compress_content(long_text)

if meta.was_truncated:
    print(f"Reduced from {meta.original_length:,} to {meta.compressed_length:,} chars")
    print(f"Truncated {meta.truncated_chars:,} characters")

Compression can also be enabled directly during fetch:

content = await client.fetch("https://example.com", compress=True)

The standalone compress_content() function gives you full control over thresholds and metadata, while the compress=True parameter on fetch() is a convenience shortcut.

Quick Functions

For simple use cases without creating a client:

import asyncio
from serp import quick_search, quick_fetch

async def main():
    # Search
    results = await quick_search("web scraping")
    print(f"Found {len(results)} results")

    # Fetch URL (with optional compression)
    content = await quick_fetch("https://example.com", compress=True)
    print(content[:500])

asyncio.run(main())

Using with Configuration

import asyncio
from serp import SerpClient, SerpConfig

# Create a client configuration
config = SerpConfig(
    log_level="DEBUG",
    max_retries=5,
    cache_ttl=3600,  # 1 hour
    cache_enabled=True,
)

async def main():
    # "async with" is only valid inside a coroutine
    async with SerpClient(config) as client:
        results = await client.search("python tutorial")
        print(f"Found {len(results)} results")

asyncio.run(main())

Environment Variables (.env file)

Create a .env file in your project. Copy from .env.example for all options:

# DataImpulse Proxy (recommended)
SERP_DATAIMPULSE_GATEWAY=http://gw.dataimpulse.com:10001
SERP_DATAIMPULSE_USER=your_username
SERP_DATAIMPULSE_PASS=your_password

# Custom Proxies (comma-separated)
SERP_CUSTOM_PROXIES=http://user:pass@proxy1.com:8080,socks5://proxy2.com:1080

# Proxy Strategy: "random" or "dataimpulse_first"
SERP_PROXY_STRATEGY=dataimpulse_first

# Logging
SERP_LOG_LEVEL=WARNING
SERP_DEBUG=false

# Cache
SERP_CACHE_ENABLED=true
SERP_CACHE_DIR=.cache/serp
SERP_CACHE_TTL=86400

# Search
SERP_DEFAULT_SOURCE=auto  # "google", "bing", or "auto"
SERP_HEADLESS=false
SERP_TIMEOUT=30

# Retry
SERP_MAX_RETRIES=3
SERP_RETRY_DELAY_MIN=0.5
SERP_RETRY_DELAY_MAX=2.0
SERP_EXPONENTIAL_BACKOFF=false

# Custom User Agent (optional)
SERP_USER_AGENT=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
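
To make the value formats concrete, here is a minimal stdlib-only sketch of how a few of these variables could be turned into typed values. It is not the library's loader (the dependency list suggests that is handled by pydantic-settings); `read_serp_env` and `_as_bool` are hypothetical names, and the set of accepted truthy strings is an assumption.

```python
import os


def _as_bool(value, default=False):
    """Interpret common truthy strings; empty or missing uses the default."""
    if value is None or value.strip() == "":
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def read_serp_env(env=None):
    """Collect a few SERP_* variables into a plain dict (illustrative only)."""
    env = os.environ if env is None else env
    raw_proxies = env.get("SERP_CUSTOM_PROXIES", "")
    proxies = [p.strip() for p in raw_proxies.split(",") if p.strip()]
    return {
        "custom_proxies": proxies,
        "proxy_strategy": env.get("SERP_PROXY_STRATEGY", "dataimpulse_first"),
        "cache_enabled": _as_bool(env.get("SERP_CACHE_ENABLED"), default=True),
        "cache_ttl": int(env.get("SERP_CACHE_TTL", "86400")),
        "headless": _as_bool(env.get("SERP_HEADLESS"), default=False),
        "timeout": int(env.get("SERP_TIMEOUT", "30")),
    }
```

The defaults used here are the ones listed in the configuration tables of this README.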

Configuration

SerpConfig Options

Parameter	Type	Default	Description
Proxy Settings
`dataimpulse_gateway`	str	`None`	DataImpulse gateway URL
`dataimpulse_user`	str	`None`	DataImpulse username
`dataimpulse_pass`	str	`None`	DataImpulse password
`dataimpulse_protocol`	str	`"http"`	DataImpulse proxy protocol ("http" or "socks5")
`dataimpulse_country`	str	`None`	DataImpulse country code (optional)
`dataimpulse_sessid`	str	`None`	DataImpulse session ID for sticky proxy (optional)
`dataimpulse_sessttl`	int	`None`	DataImpulse session TTL in minutes (optional)
`custom_proxies`	str	`""`	Comma-separated proxy URLs
`proxy_strategy`	str	`"dataimpulse_first"`	Proxy selection: "random" or "dataimpulse_first"
Cache Settings
`cache_enabled`	bool	`true`	Enable/disable caching
`cache_dir`	str	`".cache/serp"`	Cache directory path
`cache_ttl`	int	`86400`	Cache TTL in seconds (min 60)
Retry Settings
`max_retries`	int	`3`	Maximum retry attempts (1-10)
`retry_delay_min`	float	`0.5`	Minimum retry delay in seconds
`retry_delay_max`	float	`2.0`	Maximum retry delay in seconds
`exponential_backoff`	bool	`false`	Use exponential backoff
Search Settings
`default_source`	str	`"auto"`	Default search source: "google", "bing", or "auto"
`headless`	bool	`false`	Run browser in headless mode (requires virtual display when false)
`timeout`	int	`30`	Request timeout in seconds (5-120)
`user_agent`	str	`None`	Custom user agent string
Logging
`log_level`	str	`"WARNING"`	Logging level (DEBUG, INFO, WARNING, ERROR)
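
The ranges in the table (cache TTL of at least 60 seconds, 1-10 retries, 5-120 second timeout) can be pictured as simple validation rules. The sketch below expresses them with a stdlib dataclass rather than Pydantic so that it runs without extra packages. `ConfigSketch` is a hypothetical stand-in, not `SerpConfig` itself, and the cross-check between the two retry delays is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigSketch:
    """Stand-in for a few SerpConfig fields, using a stdlib dataclass."""

    cache_ttl: int = 86400
    max_retries: int = 3
    timeout: int = 30
    retry_delay_min: float = 0.5
    retry_delay_max: float = 2.0

    def __post_init__(self):
        if self.cache_ttl < 60:
            raise ValueError("cache_ttl must be at least 60 seconds")
        if not 1 <= self.max_retries <= 10:
            raise ValueError("max_retries must be between 1 and 10")
        if not 5 <= self.timeout <= 120:
            raise ValueError("timeout must be between 5 and 120 seconds")
        # Assumed rule: the table does not say the delays are cross-checked.
        if self.retry_delay_min > self.retry_delay_max:
            raise ValueError("retry_delay_min must not exceed retry_delay_max")
```

With rules like these, a bad value such as `timeout=500` fails when the configuration is created rather than in the middle of a search.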

Environment Variables

Variable	Description	Default
`SERP_DATAIMPULSE_GATEWAY`	DataImpulse gateway URL	-
`SERP_DATAIMPULSE_USER`	DataImpulse username	-
`SERP_DATAIMPULSE_PASS`	DataImpulse password	-
`SERP_DATAIMPULSE_PROTOCOL`	DataImpulse proxy protocol	`http`
`SERP_DATAIMPULSE_COUNTRY`	DataImpulse country code (optional)	-
`SERP_DATAIMPULSE_SESSID`	DataImpulse session ID for sticky proxy (optional)	-
`SERP_DATAIMPULSE_SESSTTL`	DataImpulse session TTL in minutes (optional)	-
`SERP_CUSTOM_PROXIES`	Comma-separated proxy URLs	-
`SERP_PROXY_STRATEGY`	Proxy selection strategy	`dataimpulse_first`
`SERP_LOG_LEVEL`	Logging level	`WARNING`
`SERP_CACHE_DIR`	Cache directory path	`.cache/serp`
`SERP_CACHE_TTL`	Default cache TTL in seconds	`86400`
`SERP_CACHE_ENABLED`	Enable/disable caching	`true`
`SERP_MAX_RETRIES`	Maximum retry attempts	`3`
`SERP_RETRY_DELAY_MIN`	Minimum retry delay (seconds)	`0.5`
`SERP_RETRY_DELAY_MAX`	Maximum retry delay (seconds)	`2.0`
`SERP_EXPONENTIAL_BACKOFF`	Use exponential backoff	`false`
`SERP_TIMEOUT`	Request timeout in seconds	`30`
`SERP_DEFAULT_SOURCE`	Default search source ("google", "bing", or "auto")	`auto`
`SERP_HEADLESS`	Run browser in headless mode	`false`
`SERP_USER_AGENT`	Custom user agent string	-
`SERP_DEBUG`	Enable debug logging	`false`
`SERP_DOTENV_FILE`	Path to .env file	Auto-detect
API Settings
`API_HOST`	API server host	`0.0.0.0`
`API_PORT`	API server port	`8000`
`API_DEBUG`	API debug mode	`false`
`API_MAX_CONCURRENT_REQUESTS`	Max concurrent requests (pool size)	`15`
`API_REQUEST_TIMEOUT`	Request timeout in seconds	`60`
`API_RATE_LIMIT_SEARCH`	Rate limit for /search (req/min)	`30`
`API_RATE_LIMIT_FETCH`	Rate limit for /fetch (req/min)	`60`
`API_RATE_LIMIT_NEWS`	Rate limit for /news (req/min)	`30`
`API_RATE_LIMIT_SCHOLAR`	Rate limit for /scholar (req/min)	`30`
`API_RATE_LIMIT_DEFAULT`	Default rate limit (req/min)	`100`
`API_KEYS_HASHED`	Comma-separated hashed API keys	-
`API_ALLOW_NO_AUTH`	Allow unauthenticated access	`false`
`API_LOG_LEVEL`	API logging level	`INFO`
`API_LOG_DIR`	API log directory	`logs`
`API_LOG_RETENTION_DAYS`	Log retention days	`7`
`API_CORS_ORIGINS`	CORS allowed origins (comma-separated)	-

API Reference

SerpClient

The recommended high-level interface for using the library.

from serp import SerpClient

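# All arguments are optional.
client = SerpClient(
    headless=False,       # Show the browser window (needs a display)
    use_cache=True,       # Read and write the disk cache
    cache_ttl=86400,      # Cache lifetime in seconds (24 hours)
    source=None,          # "google", "bing", or None (auto)
    max_retries=3,        # Retry attempts per request
    timeout=30,           # Timeout in seconds
    log_level="WARNING",  # DEBUG, INFO, WARNING, or ERROR
)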

Methods

`client.search(query, page_num=1, source=None, use_cache=None)`

Search for a query and return results.

Note: Google and Bing SERP searches always use the browser (nodriver) because these search engines require JavaScript to render results. The method parameter is deprecated and ignored.

Parameters:

query (str): Search query string
page_num (int): Page number (1-based), defaults to 1
source (str): Search engine - "google", "bing", or None (auto: google first, bing fallback)
use_cache (bool): Whether to use cache. None uses client default.

Returns:

list[SearchResult]: List of SearchResult objects

Raises:

ProxyError: All proxies failed
CaptchaError: CAPTCHA detected after all retries
PageTimeoutError: Page load timeout
ParseError: Failed to parse results
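
The "auto" behaviour for `source=None` (Google first, Bing as fallback) can be sketched as a loop over engines. This is an illustration under assumptions, not the client's actual control flow: `search_auto`, `EngineError`, and the two fake engines are hypothetical stand-ins so the example runs without a browser.

```python
import asyncio


class EngineError(Exception):
    """Stand-in for errors such as CaptchaError or PageTimeoutError."""


async def search_auto(query, engines):
    """Try each (name, coroutine function) pair in order; return the first success."""
    errors = []
    for name, search in engines:
        try:
            return name, await search(query)
        except EngineError as exc:
            errors.append(f"{name}: {exc}")
    raise EngineError("all engines failed: " + "; ".join(errors))


# Fake engines so the sketch runs without a browser.
async def fake_google(query):
    raise EngineError("CAPTCHA detected")


async def fake_bing(query):
    return [f"result for {query}"]
```

Calling `asyncio.run(search_auto("python", [("google", fake_google), ("bing", fake_bing)]))` returns the Bing results here because the fake Google engine always fails.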

`client.fetch(url, use_cache=None, prefer_browser=False, compress=False)`

Fetch a URL and return content as Markdown.

Fetch Strategy:

Static pages (no JavaScript): Uses fast HTTP + BeautifulSoup4 - low resource usage, fast execution
JavaScript detected in HTML: Immediately falls back to browser (nodriver) for execution
Failed/incomplete fetch: Falls back to browser (handles CAPTCHA, anti-bot measures)
Full page load guarantee: Browser waits for load event + 1s additional wait for JS rendering

Parameters:

url (str): Target URL
use_cache (bool): Whether to use cache. None uses client default.
prefer_browser (bool): If True, use browser directly (legacy mode). If False (default), try BS4 first then fallback to browser.
compress (bool): If True and content exceeds ~10K chars, compress by taking head, middle, and tail portions. (Default: False)

Returns:

str: Page content converted to Markdown (optionally compressed)

Note: When you need metadata about the truncation (original length, etc.), use the standalone compress_content() function from the serp package instead of the compress parameter on fetch().

SearchResult

Typed result object returned by search operations.

from serp import SearchResult

result = SearchResult(
    rank=1,
    title="Example Title",
    url="https://example.com",
    description="Example description...",
    source="google"  # or "bing"
)

Attributes:

rank (int): Position in search results (1-based)
title (str): Result title
url (str): Target URL
description (str): Result snippet/description
source (str): Search engine source ("google" or "bing")

Methods:

to_dict(): Convert to dictionary for backward compatibility

Quick Functions

Module-level convenience functions using default client:

from serp import quick_search, quick_fetch, quick_search_http

# Quick search (auto method)
results = await quick_search("query")

# HTTP-based search only
results = await quick_search_http("query")

# Fetch URL (with optional compression)
content = await quick_fetch("https://example.com", compress=True)

GoogleNewsClient

Client for scraping Google News via RSS feeds.

from serp import GoogleNewsClient

client = GoogleNewsClient(
    language="tr",    # Language code (tr, en, etc.)
    country="TR",     # Country code (TR, US, etc.)
    time_range="d",   # Time range: "h" (hour), "d" (day), "w" (week), "m" (month)
)

Methods

`client.get_news(query, max_results=50, queries=None)`

Get news articles for a search term.

Parameters:

query (str): Search term to find news for
max_results (int): Maximum number of results (default: 50)
queries (list[str]): Custom list of queries to use

Returns:

list[NewsResult]: List of NewsResult objects

`quick_news(query, max_results=50, language="tr", country="TR")`

Convenience function for quick news retrieval.

from serp import quick_news

news = await quick_news("Tesla", max_results=20, language="en", country="US")

NewsResult

Typed result object for Google News articles.

from datetime import datetime
from serp import NewsResult

result = NewsResult(
    title="Tesla announces new model",
    url="https://news.google.com/rss/articles/...",
    original_url="https://example.com/article",
    published=datetime(2026, 5, 11, 8, 0, 0),
    source="BBC",
    description="Tesla unveiled...",
    query="Tesla"
)

Attributes:

title (str): News headline
url (str): Google News RSS URL
original_url (str): Original article URL (extracted from description)
published (datetime): Publication date
source (str): News source name (e.g., "BBC", "NTV")
description (str): News summary/snippet
query (str): Search query that returned this result

Methods:

to_dict(): Convert to dictionary

NewsSettings

Configuration for Google News scraping.

from serp import NewsSettings

settings = NewsSettings(
    language="tr",    # Language code
    country="TR",     # Country code
    time_range="d",   # Time range: "h", "d", "w", "m"
)

Attributes:

language (str): Language code (default: "tr")
country (str): Country code (default: "TR")
time_range (str): Time range filter (default: "d")
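
For orientation, these three settings map naturally onto a Google News RSS search URL. The sketch below shows one plausible way to build it. It is not taken from the library: `news_rss_url` is a hypothetical helper, and the mapping from `time_range` codes to Google's `when:` search operator (in particular "m" as 30 days) is an assumption.

```python
from urllib.parse import urlencode

# Assumed mapping from time_range codes to Google's "when:" search operator.
_WHEN = {"h": "1h", "d": "1d", "w": "7d", "m": "30d"}


def news_rss_url(query, language="tr", country="TR", time_range="d"):
    """Build a Google News RSS search URL (illustrative, not the library's code)."""
    params = {
        "q": f"{query} when:{_WHEN[time_range]}",
        "hl": language,                   # interface language
        "gl": country,                    # country edition
        "ceid": f"{country}:{language}",  # edition identifier
    }
    return "https://news.google.com/rss/search?" + urlencode(params)
```

The resulting feed is plain RSS, so it can be parsed with any XML or feed parser.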

OutputFormatter

Structured output formatting for CLI tools and LLM integration.

from serp import OutputFormatter, OUTPUT_TEXT, OUTPUT_JSON

# Format search results
output = OutputFormatter.format_search_results(
    results=search_results,
    mode=OUTPUT_JSON,
    query="python tutorial",
    source="google",
)

# Format news results
output = OutputFormatter.format_news(
    news_list=news_results,
    mode=OUTPUT_JSON,
    search_term="Tesla",
    language="en",
    country="US",
)

# Format fetch results
output = OutputFormatter.format_fetch(
    content=page_content,
    url="https://example.com",
    char_count=len(page_content),
    mode=OUTPUT_JSON,
)

Constants:

OUTPUT_TEXT: Text output mode (human-readable)
OUTPUT_JSON: JSON output mode (machine-parseable)

Classes:

OutputFormatter: Main interface with static format_* methods
TextFormatter: Human-readable text formatting
JSONFormatter: JSON output formatting
OutputError: Structured error representation
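
The difference between the two modes can be illustrated with a small stand-alone function. This is a sketch only: `format_results` is hypothetical, and the JSON envelope keys (`query`, `count`, `results`) are assumptions rather than the actual `OutputFormatter` schema. The per-result fields mirror `SearchResult`.

```python
import json


def format_results(results, mode="text", query=""):
    """Render result dicts as text or JSON (field names mirror SearchResult)."""
    if mode == "json":
        # Assumed envelope; the real JSON schema may differ.
        payload = {"query": query, "count": len(results), "results": results}
        return json.dumps(payload, ensure_ascii=False, indent=2)
    lines = [f"Results for: {query}"]
    for r in results:
        lines.append(f"{r['rank']}. {r['title']}")
        lines.append(f"   {r['url']}")
    return "\n".join(lines)
```

Keeping both renderers behind one function means callers only switch the `mode` argument, which is the same idea as passing `OUTPUT_TEXT` or `OUTPUT_JSON` above.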

Utility Functions

`set_log_level(level)`

Set the log level for all serp loggers.

from serp import set_log_level

set_log_level("DEBUG")  # Enable debug logging
set_log_level("WARNING")  # Only show warnings and errors

`compress_content(content, threshold=10000, head_pct=0.35, middle_pct=0.15, tail_pct=0.50)`

Compress long content by extracting head, middle, and tail portions, joined with a truncation marker.

from serp import compress_content, CompressionMeta

compressed, meta = compress_content(
    content,
    threshold=10000,      # Content longer than this gets compressed
    head_pct=0.35,        # 35% of target for the head
    middle_pct=0.15,      # 15% of target for the middle
    tail_pct=0.50,        # 50% of target for the tail
)

if meta.was_truncated:
    print(f"Original: {meta.original_length:,} chars")
    print(f"Compressed: {meta.compressed_length:,} chars")
    print(f"Truncated: {meta.truncated_chars:,} chars")
else:
    print("Content was within threshold, returned as-is")

Parameters:

Param	Type	Default	Description
`content`	str	-	The content string to compress
`threshold`	int	`10000`	Character-length threshold. Content at or below this length is returned unchanged.
`head_pct`	float	`0.35`	Fraction of target length for the head portion
`middle_pct`	float	`0.15`	Fraction of target length for the middle portion
`tail_pct`	float	`0.50`	Fraction of target length for the tail portion

Returns:

tuple[str, CompressionMeta]: (compressed_content, metadata)

CompressionMeta attributes:

Attribute	Type	Description
`original_length`	int	Character count before compression
`compressed_length`	int	Character count after compression
`truncated_chars`	int	Number of characters removed
`was_truncated`	bool	Whether content exceeded threshold and was truncated

Exceptions

Exception	Description
`ProxyError`	All proxies failed
`CaptchaError`	CAPTCHA still detected after all retries
`PageTimeoutError`	Page load timeout
`ParseError`	Failed to parse results
`VirtualScreenRequiredError`	No virtual display available for non-headless mode

Constants

Constant	Description
`MAX_RETRIES`	Maximum retry attempts (default: 3)
`TIMEOUT_MS`	Page timeout in milliseconds (default: 30000)
`USER_AGENTS`	List of user agent strings for rotation

Interactive CLI

The package includes an interactive CLI tool for testing:

python main.py
python main.py --format json   # JSON output for LLM integration
python main.py -f text         # Human-readable text output (default)
python main.py --compress      # Enable content compression for URL fetch
python main.py --compress -f json  # Compress + JSON output

Features:

SERP Search testing
URL Fetch testing (with optional --compress flag for long content)
Google News RSS testing
Dual output format (text/JSON)
Proxy status checking

REST API (Optional)

The package includes an optional FastAPI-based REST API for programmatic access.

Installation

pip install serp-scraper[api]

Running the API

python -m api.main
# Or with uvicorn directly:
uvicorn api.main:app --host 0.0.0.0 --port 8000

API Endpoints

GET /health - Health check
POST /api/v1/search - Search SERP results
POST /api/v1/fetch - Fetch URL content (supports optional compress field)
POST /api/v1/news - Get Google News articles
POST /api/v1/scholar - Search Google Scholar articles

Authentication

The API uses API key authentication. Set the X-API-Key header in your requests.
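
As an example of an authenticated call, the sketch below prepares a request to the fetch endpoint using only the standard library. The `X-API-Key` header and the `compress` field come from this README. The `url` field name, the `build_fetch_request` helper, and the localhost address are assumptions for illustration.

```python
import json
import urllib.request


def build_fetch_request(base_url, api_key, target_url, compress=False):
    """Prepare (but do not send) a POST to /api/v1/fetch."""
    # "url" is an assumed field name; "compress" is documented for /fetch.
    body = json.dumps({"url": target_url, "compress": compress}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/fetch",
        data=body,
        method="POST",
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
    )


request = build_fetch_request(
    "http://localhost:8000", "your-api-key", "https://example.com", compress=True
)
# urllib.request.urlopen(request) would send it to a running API server.
```

The request is only constructed here, so the snippet runs without a server.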

Rate Limiting

The API includes rate limiting middleware to prevent abuse.
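
The project structure describes the middleware as a sliding window rate limiter. A minimal in-memory version of that technique is sketched below. It is not the middleware's code: the class name, the per-key bookkeeping, and the injectable clock are all assumptions made for illustration.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each key."""

    def __init__(self, limit, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self._hits = defaultdict(deque)  # key -> timestamps of recent requests

    def allow(self, key):
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

With `limit=30` and the default 60-second window, this would correspond to the 30 requests per minute listed for /search in the environment variable table.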

Project Structure

serp-scraper/
├── serp/                    # Core library
│   ├── __init__.py          # Public API exports
│   ├── client.py            # SerpClient and quick functions
│   ├── cache.py             # Disk-based caching (DiskCache, NullCache)
│   ├── compression.py       # Content compression (compress_content, CompressionMeta)
│   ├── config.py            # Configuration constants (BING_URL_TEMPLATE, USER_AGENTS)
│   ├── config_pydantic.py   # Pydantic-based SerpConfig
│   ├── google_news.py       # Google News RSS client
│   ├── google_scholar.py    # Google Scholar client
│   ├── http_search.py       # HTTP-based search (httpx)
│   ├── output_formatter.py  # Text and JSON output formatting
│   ├── parsers.py           # Browser-based parsing (nodriver)
│   ├── types.py             # Type definitions (SearchResult, RetryPolicy, etc.)
│   └── utils.py             # Exceptions, helpers, constants
├── api/                     # REST API (optional)
│   ├── __init__.py
│   ├── main.py              # FastAPI application
│   ├── config.py            # API configuration (APISettings, RateLimitConfig)
│   ├── deps.py              # DI: auth, rate limit, semaphore
│   ├── exceptions.py        # Custom exception classes
│   ├── models/              # Pydantic models
│   │   ├── __init__.py
│   │   ├── requests.py      # Request schemas
│   │   └── responses.py     # Response schemas
│   ├── routers/             # API routes
│   │   ├── __init__.py
│   │   ├── health.py        # GET /health
│   │   ├── search.py        # POST /api/v1/search
│   │   ├── fetch.py         # POST /api/v1/fetch
│   │   ├── news.py          # POST /api/v1/news
│   │   └── scholar.py       # POST /api/v1/scholar
│   ├── middleware/          # Middleware
│   │   ├── __init__.py
│   │   ├── rate_limit.py    # Sliding window rate limiter
│   │   └── logging_middleware.py
│   ├── utils/               # Backward-compat re-exports
│   │   ├── __init__.py
│   │   └── compression.py   # Re-exports from serp.compression
│   └── cli/                 # API CLI tools
│       ├── __init__.py
│       └── keys.py          # API key generation
├── tests/                   # Test suite
│   ├── __init__.py
│   ├── conftest.py          # Fixtures, global state reset
│   ├── test_serp.py         # Core library tests
│   ├── test_cache.py        # Cache tests
│   ├── test_google_news.py  # Google News tests
│   └── api/                 # API tests
│       └── __init__.py
├── main.py                  # Interactive CLI tool
├── .env.example             # Environment variables template
├── pyproject.toml           # Project metadata
└── README.md                # This file

Testing

Run the test suite:

pytest

Run with coverage:

pytest --cov=serp --cov-report=html

Run specific test file:

pytest tests/test_serp.py

Architecture

Search Flow

Check cache for existing results (if use_cache=True)
Load proxy configuration from file or environment
Select random proxy (or DataImpulse if configured)
Create browser with stealth settings
Navigate to search URL
Wait for results to load
Check for CAPTCHA
Parse organic results
Cache results before returning
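
The proxy selection step can be pictured as follows. The exact semantics of the two strategies are not spelled out in this README, so the sketch assumes that "dataimpulse_first" prefers the DataImpulse gateway and falls back to custom proxies, while "random" draws from everything configured. `pick_proxy` is a hypothetical helper.

```python
import random


def pick_proxy(strategy, dataimpulse=None, custom=(), rng=random):
    """Choose a proxy URL, or None when nothing is configured (hypothetical helper)."""
    custom = list(custom)
    if strategy == "dataimpulse_first":
        # Prefer the DataImpulse gateway; use custom proxies only as a fallback.
        if dataimpulse:
            return dataimpulse
        return rng.choice(custom) if custom else None
    if strategy == "random":
        pool = custom + ([dataimpulse] if dataimpulse else [])
        return rng.choice(pool) if pool else None
    raise ValueError(f"unknown proxy strategy: {strategy!r}")
```

Returning `None` when no proxy is configured lets the caller decide whether to connect directly or raise an error.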

Fetch Strategy (per docs/Scraper Strategy.md)

For Google Scholar, Google News, and URL Fetch:

Primary: BS4 (BeautifulSoup4) + HTTP - fast, lightweight
Fallback: Browser (nodriver) - triggered on BS4 failure

For Google and Bing SERP:

Always uses browser (nodriver) because JavaScript is required to render results

Search Methods

SERP Search (Google/Bing):

Always uses nodriver for stealth Chrome automation
More reliable, harder to detect
Slower due to browser overhead

URL Fetch / Google News / Google Scholar:

Default: BS4-first, browser fallback on failure
BS4 method: Uses httpx for direct HTTP requests with BeautifulSoup4 parsing
Browser fallback: Uses nodriver when BS4 fails (empty content, CAPTCHA, parse errors, timeouts)
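
The BS4-first pattern amounts to "try the cheap path, then fall back to the browser". A minimal sketch of that control flow is shown below, with stand-in coroutines in place of the real httpx and nodriver code. `fetch_with_fallback`, `http_fetch`, and `browser_fetch` are hypothetical names, and treating empty content as a failure is an assumption based on the list above.

```python
import asyncio


async def fetch_with_fallback(url, primary, fallback, min_chars=1):
    """Run `primary`; use `fallback` when it raises or returns too little text."""
    try:
        content = await primary(url)
        if content and len(content.strip()) >= min_chars:
            return content, "primary"
    except Exception:
        pass  # a real implementation would catch narrower errors and log them
    return await fallback(url), "fallback"


# Stand-ins for the HTTP + BeautifulSoup4 path and the nodriver path.
async def http_fetch(url):
    return "" if "js-app" in url else "# Static page"


async def browser_fetch(url):
    return "# Rendered page"
```

Returning which path produced the content makes it easy to log how often the slower browser path is actually needed.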

Caching

The caching system uses a disk-based approach:

Cache entries stored as JSON files in .cache/serp/
Keys are SHA256 hashes of query parameters
Automatic expiration based on TTL
Can be disabled via cache_enabled=False or SERP_CACHE_ENABLED=false
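
The points above can be sketched with a few stdlib functions: a SHA256 key derived from the query parameters, one JSON file per entry, and a TTL check on read. This is an illustration, not the library's `DiskCache`. The function names, the `stored_at` field, and the file layout are assumptions.

```python
import hashlib
import json
import time
from pathlib import Path


def cache_key(**params):
    """SHA256 over the sorted, JSON-encoded parameters."""
    raw = json.dumps(params, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def cache_set(cache_dir, key, value, now=None):
    """Write one entry as <key>.json together with its storage time."""
    path = Path(cache_dir)
    path.mkdir(parents=True, exist_ok=True)
    entry = {"stored_at": time.time() if now is None else now, "value": value}
    (path / f"{key}.json").write_text(json.dumps(entry), encoding="utf-8")


def cache_get(cache_dir, key, ttl, now=None):
    """Return the cached value, or None if it is missing or older than ttl seconds."""
    file = Path(cache_dir) / f"{key}.json"
    if not file.exists():
        return None
    entry = json.loads(file.read_text(encoding="utf-8"))
    now = time.time() if now is None else now
    if now - entry["stored_at"] > ttl:
        file.unlink()  # expired: remove the stale entry
        return None
    return entry["value"]
```

Sorting the keys before hashing means the same parameters always produce the same cache key, regardless of argument order.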

Content Compression

The library includes a built-in content compression utility for truncating long content while preserving key information:

Algorithm:

If content length ≤ threshold (default: 10,000 chars), return unchanged
Target compressed length = 45% of threshold (~4,500 chars)
Extract three portions: head (35%), middle (15%), tail (50%)
Join with a truncation marker: \n\n-- X,XXX chars truncated --\n\n
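
The steps above can be sketched in a few lines. This is a reading of the described algorithm, not the library's `compress_content()`: it returns only the string (no `CompressionMeta`), and it assumes that the middle slice is centred in the document and that a marker is placed between each pair of kept portions.

```python
def compress_sketch(content, threshold=10000, head_pct=0.35, middle_pct=0.15, tail_pct=0.50):
    """Head/middle/tail truncation following the steps above (not the library code)."""
    n = len(content)
    if n <= threshold:
        return content
    target = round(threshold * 0.45)  # about 4,500 chars with the default threshold
    head_len = round(target * head_pct)
    mid_len = round(target * middle_pct)
    tail_len = round(target * tail_pct)

    mid_start = (n - mid_len) // 2  # assumed: the middle slice is centred
    head = content[:head_len]
    middle = content[mid_start:mid_start + mid_len]
    tail = content[n - tail_len:]

    def marker(removed):
        return f"\n\n-- {removed:,} chars truncated --\n\n"

    gap1 = mid_start - head_len                     # dropped between head and middle
    gap2 = (n - tail_len) - (mid_start + mid_len)   # dropped between middle and tail
    return head + marker(gap1) + middle + marker(gap2) + tail
```

For a 20,000-character input with the defaults, this keeps 1,575 + 675 + 2,250 = 4,500 characters and drops the remaining 15,500.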

Usage scenarios:

LLM context optimization: Reduce token count while preserving document structure
API responses: The REST API /fetch endpoint supports "compress": true
Direct library use: from serp import compress_content or client.fetch(url, compress=True)

Error Handling

The library provides specific exceptions for different failure modes:

ProxyError: All configured proxies failed or returned errors
CaptchaError: Search engine detected automation and presented CAPTCHA
PageTimeoutError: Page did not load within the timeout period
ParseError: Page loaded but results could not be parsed
VirtualScreenRequiredError: Non-headless mode selected but no virtual display available (use xvfb-run, set DISPLAY, or use headless=True)
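
Transient failures such as a CAPTCHA or a page timeout are the ones the retry settings (`max_retries`, `retry_delay_min`, `retry_delay_max`, `exponential_backoff`) are meant for. The sketch below shows one way those settings could interact. It is not the library's retry loop: `RetryableError`, `retry_delay`, and `with_retries` are hypothetical, and it assumes `max_retries` counts total attempts.

```python
import asyncio
import random


class RetryableError(Exception):
    """Stand-in for CaptchaError or PageTimeoutError."""


def retry_delay(attempt, delay_min=0.5, delay_max=2.0, exponential=False, rng=random):
    """Delay in seconds before retry number `attempt` (1-based)."""
    if exponential:
        # Double the minimum delay on each attempt, capped at delay_max.
        return min(delay_max, delay_min * 2 ** (attempt - 1))
    return rng.uniform(delay_min, delay_max)


async def with_retries(operation, max_retries=3, sleep=asyncio.sleep, **delay_kwargs):
    """Await `operation()` up to max_retries times, sleeping between failures."""
    for attempt in range(1, max_retries + 1):
        try:
            return await operation()
        except RetryableError:
            if attempt == max_retries:
                raise  # out of attempts: let the caller handle the error
            await sleep(retry_delay(attempt, **delay_kwargs))
```

Errors that are not transient, such as a missing display, are better raised immediately than retried.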

Dependencies

Package	Version	Purpose
nodriver	>=0.50.0	Stealth Chrome automation
markdownify	>=0.12.0	HTML to Markdown conversion
httpx	>=0.25.0	Async HTTP client
beautifulsoup4	>=4.12.0	HTML parsing
pydantic	>=2.0.0	Configuration validation
pydantic-settings	>=2.0.0	Pydantic settings integration
python-dotenv	>=1.0.0	.env file support

API Dependencies (Optional)

Install with pip install serp-scraper[api] for REST API support:

Package	Version	Purpose
fastapi	>=0.100.0	REST API framework
uvicorn[standard]	>=0.23.0	ASGI server
passlib[bcrypt]	>=1.7.4	Password hashing
bcrypt	<4.1.0	Password hashing

Development Dependencies

Package	Version	Purpose
pytest	>=7.0.0	Testing framework
pytest-asyncio	>=0.21.0	Async test support
pytest-cov	>=4.0.0	Coverage reporting
pytest-mock	>=3.10.0	Mocking utilities
pytest-httpserver	>=1.0.0	HTTP server for testing
ruff	>=0.1.0	Linting
mypy	>=1.0.0	Type checking
build	>=1.0.0	Package building
twine	>=4.0.0	Package publishing

License

MIT License - see LICENSE file for details.

Author

neuronaline - flashneuron@proton.me

Disclaimer

This software is provided for educational and legitimate purposes only. Users are responsible for ensuring their use complies with search engine Terms of Service and applicable laws. The authors assume no liability for misuse of this software.

Release history

Version	Release date
2.0.13 (this version)	Jun 14, 2026
2.0.12	May 28, 2026
2.0.11	May 25, 2026
2.0.10	May 21, 2026
2.0.9	May 21, 2026
2.0.8	May 21, 2026
2.0.7	May 20, 2026
2.0.6	May 17, 2026
2.0.5	May 15, 2026
2.0.3	May 12, 2026
2.0.2	May 12, 2026
2.0.1	May 11, 2026
2.0.0	May 10, 2026

Download files

Source Distribution

serp_scraper-2.0.13.tar.gz (99.4 kB)

Uploaded Jun 14, 2026 Source

Built Distribution

serp_scraper-2.0.13-py3-none-any.whl (92.4 kB)

Uploaded Jun 14, 2026 Python 3

File details

Details for the file serp_scraper-2.0.13.tar.gz.

File metadata

Download URL: serp_scraper-2.0.13.tar.gz
Upload date: Jun 14, 2026
Size: 99.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for serp_scraper-2.0.13.tar.gz
Algorithm	Hash digest
SHA256	`dfca897af996cda74d393974ba5922297add29a2e1b97e9afbeb7acaeaad3366`
MD5	`c84f9d9e6ac65b58fcd7f47c2e79c519`
BLAKE2b-256	`4098fcbc1b4a07afea2393c9746071d86fa1016e0694403d0c9f87127dfd7127`

File details

Details for the file serp_scraper-2.0.13-py3-none-any.whl.

File metadata

Download URL: serp_scraper-2.0.13-py3-none-any.whl
Upload date: Jun 14, 2026
Size: 92.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for serp_scraper-2.0.13-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3a6e9c16cabab5bd10bce1a2efac5688ef4811c7489d57a30c598fc2e3749913`
MD5	`a073410228873fc9a9e3ff8edbd62d54`
BLAKE2b-256	`cce33658e404c5474e62259e41540465f04c25e1c25545d554f3258ffad90bc8`
