SERP scraper with nodriver, proxy rotation, and stealth browsing
⚠️ WARNING: This project is in an early stage of development.
SERP Scraper
A powerful, async Python library for scraping Google and Bing Search Engine Results Pages (SERPs) with proxy rotation, intelligent caching, and stealth browsing.
Features
- Dual Search Methods: Browser-based (`nodriver`) and HTTP-based (`httpx`) scraping
- Google News RSS: Scrape news articles via Google News RSS feeds
- Proxy Rotation: DataImpulse and custom proxy support with automatic rotation
- Intelligent Caching: Disk-based caching with configurable TTL
- CAPTCHA Handling: Automatic detection with retry logic and exponential backoff
- Type Safety: Full type annotations with Pydantic validation
- Async/Await: Modern asynchronous API design
- Environment Config: `.env` file support for configuration
- CLI Tool: Interactive command-line interface for testing
Installation
Basic Installation
```bash
pip install serp-scraper
```
From Source
```bash
git clone https://github.com/neuronaline/serp-scraper.git
cd serp-scraper
pip install -e .
```
With Dependencies
```bash
pip install "serp-scraper[dev]"   # With dev tools
pip install "serp-scraper[test]"  # With test dependencies
```
Requirements
- Python 3.10 or higher
- Google Chrome browser installed
Quick Start
Recommended: Using SerpClient
```python
import asyncio

from serp import SerpClient

async def main():
    async with SerpClient() as client:
        results = await client.search("python programming")
        for r in results:
            print(f"{r.rank}. {r.title}")
            print(f"   {r.url}")
            print(f"   {r.description[:100]}...")

asyncio.run(main())
```
Quick Functions
For simple use cases without creating a client:
```python
import asyncio

from serp import quick_search, quick_fetch

async def main():
    # Search
    results = await quick_search("web scraping")
    print(f"Found {len(results)} results")

    # Fetch URL
    content = await quick_fetch("https://example.com")
    print(content[:500])

asyncio.run(main())
```
Google News RSS
Scrape news articles using Google News RSS feeds:
```python
import asyncio

from serp import GoogleNewsClient

async def main():
    async with GoogleNewsClient(language="en", country="US") as client:
        news = await client.get_news("Tesla", max_results=20)
        for r in news:
            print(f"{r.title}")
            print(f"  Source: {r.source}")
            print(f"  URL: {r.url}")
            print(f"  Date: {r.published}")

asyncio.run(main())
```
Or use the quick function:
```python
import asyncio

from serp import quick_news

async def main():
    news = await quick_news("Tesla", language="en", country="US", max_results=20)
    print(f"Found {len(news)} news articles")

asyncio.run(main())
```
URL Fetching
```python
import asyncio

from serp import SerpClient

async def main():
    async with SerpClient() as client:
        # Fetch page content as Markdown
        content = await client.fetch("https://example.com")
        print(content)

asyncio.run(main())
```
Using with Configuration
```python
import asyncio

from serp import SerpClient, SerpConfig

# Create configured client
config = SerpConfig(
    log_level="DEBUG",
    max_retries=5,
    cache_ttl=3600,  # 1 hour
    cache_enabled=True,
)

async def main():
    async with SerpClient(config) as client:
        results = await client.search("python tutorial")

asyncio.run(main())
```
Environment Variables (.env file)
Create a `.env` file in your project; copy from `.env.example` to see all available options:
```bash
# DataImpulse Proxy (recommended)
SERP_DATAIMPULSE_GATEWAY=http://gw.dataimpulse.com:10001
SERP_DATAIMPULSE_USER=your_username
SERP_DATAIMPULSE_PASS=your_password

# Custom Proxies (comma-separated)
SERP_CUSTOM_PROXIES=http://user:pass@proxy1.com:8080,socks5://proxy2.com:1080

# Proxy Strategy: "random" or "dataimpulse_first"
SERP_PROXY_STRATEGY=dataimpulse_first

# Logging
SERP_LOG_LEVEL=WARNING
SERP_DEBUG=false

# Cache
SERP_CACHE_ENABLED=true
SERP_CACHE_DIR=.cache/serp
SERP_CACHE_TTL=86400

# Search
SERP_DEFAULT_SOURCE=auto  # "google", "bing", or "auto"
SERP_HEADLESS=false
SERP_TIMEOUT=30

# Retry
SERP_MAX_RETRIES=3
SERP_RETRY_DELAY_MIN=0.5
SERP_RETRY_DELAY_MAX=2.0
SERP_EXPONENTIAL_BACKOFF=false

# Custom User Agent (optional)
SERP_USER_AGENT=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
```
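The same settings can also be injected programmatically; a minimal sketch, assuming the `SERP_*` variables are read when the client is constructed:

```python
import asyncio
import os

from serp import SerpClient

async def main():
    # Assumption: SERP_* variables are read at client/config construction,
    # so setting them here mirrors a .env file.
    os.environ["SERP_PROXY_STRATEGY"] = "random"
    os.environ["SERP_CACHE_TTL"] = "3600"

    async with SerpClient() as client:
        results = await client.search("python tutorial")
        print(len(results))

asyncio.run(main())
```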
Configuration
SerpConfig Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| `custom_proxies` | str | `""` | Comma-separated proxy URLs from env |
| `proxy_strategy` | str | `"dataimpulse_first"` | Proxy selection: `"random"` or `"dataimpulse_first"` |
| `dataimpulse_gateway` | str | `None` | DataImpulse gateway URL |
| `dataimpulse_user` | str | `None` | DataImpulse username |
| `dataimpulse_pass` | str | `None` | DataImpulse password |
| `log_level` | str | `"WARNING"` | Logging level (DEBUG, INFO, WARNING, ERROR) |
| `max_retries` | int | `3` | Maximum retry attempts (1-10) |
| `retry_delay_min` | float | `0.5` | Minimum retry delay in seconds |
| `retry_delay_max` | float | `2.0` | Maximum retry delay in seconds |
| `exponential_backoff` | bool | `False` | Use exponential backoff |
| `timeout` | int | `30` | Request timeout in seconds (5-120) |
| `cache_enabled` | bool | `True` | Enable/disable caching |
| `cache_dir` | str | `".cache/serp"` | Cache directory path |
| `cache_ttl` | int | `86400` | Cache TTL in seconds (min 60) |
| `default_source` | str | `"auto"` | Default search source: `"google"`, `"bing"`, or `"auto"` |
| `headless` | bool | `False` | Run browser in headless mode |
| `user_agent` | str | `None` | Custom user agent string |
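To make the retry knobs concrete, here is a hypothetical sketch of how `retry_delay_min`, `retry_delay_max`, and `exponential_backoff` could combine; the library's exact formula is not documented here:

```python
import random

def retry_delay(attempt: int, delay_min: float = 0.5, delay_max: float = 2.0,
                exponential: bool = False) -> float:
    """Hypothetical delay schedule: a jittered base delay, optionally
    doubled per attempt when exponential backoff is enabled."""
    delay = random.uniform(delay_min, delay_max)
    if exponential:
        delay *= 2 ** (attempt - 1)  # attempt 1 -> 1x, attempt 2 -> 2x, ...
    return delay

# With the defaults, each of three attempts waits ~0.5-2.0s;
# with exponential=True the third attempt waits ~2.0-8.0s.
```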
Environment Variables
| Variable | Description | Default |
|---|---|---|
| `SERP_DATAIMPULSE_GATEWAY` | DataImpulse gateway URL | - |
| `SERP_DATAIMPULSE_USER` | DataImpulse username | - |
| `SERP_DATAIMPULSE_PASS` | DataImpulse password | - |
| `SERP_CUSTOM_PROXIES` | Comma-separated proxy URLs | - |
| `SERP_PROXY_STRATEGY` | Proxy selection strategy | `dataimpulse_first` |
| `SERP_LOG_LEVEL` | Logging level | `WARNING` |
| `SERP_CACHE_DIR` | Cache directory path | `.cache/serp` |
| `SERP_CACHE_TTL` | Default cache TTL in seconds | `86400` |
| `SERP_CACHE_ENABLED` | Enable/disable caching | `true` |
| `SERP_MAX_RETRIES` | Maximum retry attempts | `3` |
| `SERP_RETRY_DELAY_MIN` | Minimum retry delay (seconds) | `0.5` |
| `SERP_RETRY_DELAY_MAX` | Maximum retry delay (seconds) | `2.0` |
| `SERP_EXPONENTIAL_BACKOFF` | Use exponential backoff | `false` |
| `SERP_TIMEOUT` | Request timeout in seconds | `30` |
| `SERP_DEBUG` | Enable debug logging | `false` |
| `SERP_DOTENV_FILE` | Path to `.env` file | Auto-detect |
API Reference
SerpClient
The recommended high-level interface for using the library.
```python
from serp import SerpClient

client = SerpClient(
    headless=False,        # Optional
    use_cache=True,        # Optional
    cache_ttl=86400,       # Optional
    source=None,           # Optional: "google", "bing", or None (auto)
    max_retries=3,         # Optional
    timeout=30,            # Optional
    log_level="WARNING",   # Optional
)
```
Methods
`client.search(query, page_num=1, method=None, source=None, use_cache=None)`
Search for a query and return results.
Parameters:
- `query` (str): Search query string
- `page_num` (int): Page number (1-based), defaults to 1
- `method` (str): Search method - `"browser"` (nodriver), `"http"` (httpx), or `None` (auto)
- `source` (str): Search engine - `"google"`, `"bing"`, or `None` (auto: Google first, Bing fallback)
- `use_cache` (bool): Whether to use the cache. `None` uses the client default.
Returns:
- `list[SearchResult]`: List of `SearchResult` objects
Raises:
- `ProxyError`: All proxies failed
- `CaptchaError`: CAPTCHA detected after all retries
- `PageTimeoutError`: Page load timed out
- `ParseError`: Failed to parse results
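For example, combining these parameters (inside an async context with a `client`, as in the Quick Start) to pull the second page of Bing results over HTTP, bypassing the cache:

```python
results = await client.search(
    "python programming",
    page_num=2,       # second results page
    method="http",    # force the httpx backend
    source="bing",    # skip the Google-first fallback
    use_cache=False,  # always hit the network
)
```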
`client.fetch(url, use_cache=None, prefer_browser=True)`
Fetch a URL and return content as Markdown.
Parameters:
- `url` (str): Target URL
- `use_cache` (bool): Whether to use the cache. `None` uses the client default.
- `prefer_browser` (bool): If True, use the browser directly. If False, try HTTP first, then fall back to the browser.
Returns:
- `str`: Page content converted to Markdown
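For example, preferring the cheaper HTTP path and falling back to the browser only when needed (inside an async context with a `client`):

```python
markdown = await client.fetch(
    "https://example.com",
    prefer_browser=False,  # try HTTP first, fall back to the browser
    use_cache=True,
)
print(markdown[:200])
```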
SearchResult
Typed result object returned by search operations.
```python
from serp import SearchResult

result = SearchResult(
    rank=1,
    title="Example Title",
    url="https://example.com",
    description="Example description...",
    source="google",  # or "bing"
)
```
Attributes:
- `rank` (int): Position in search results (1-based)
- `title` (str): Result title
- `url` (str): Target URL
- `description` (str): Result snippet/description
- `source` (str): Search engine source (`"google"` or `"bing"`)
Methods:
- `to_dict()`: Convert to a dictionary for backward compatibility
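Because `to_dict()` returns a plain dictionary, a result list serializes straight to JSON; for example:

```python
import asyncio
import json

from serp import SerpClient

async def main():
    async with SerpClient() as client:
        results = await client.search("python programming")
        # Dump the plain-dict form of each result to disk
        with open("results.json", "w") as f:
            json.dump([r.to_dict() for r in results], f, indent=2)

asyncio.run(main())
```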
Quick Functions
Module-level convenience functions using the default client:
```python
from serp import quick_search, quick_fetch, quick_search_http

# Quick search (auto method)
results = await quick_search("query")

# HTTP-based search only
results = await quick_search_http("query")

# Fetch URL
content = await quick_fetch("https://example.com")
```
GoogleNewsClient
Client for scraping Google News via RSS feeds.
```python
from serp import GoogleNewsClient

client = GoogleNewsClient(
    language="tr",   # Language code (tr, en, etc.)
    country="TR",    # Country code (TR, US, etc.)
    time_range="d",  # Time range: "h" (hour), "d" (day), "w" (week), "m" (month)
)
```
Methods
`client.get_news(query, max_results=50, queries=None)`
Get news articles for a search term.
Parameters:
- `query` (str): Search term to find news for
- `max_results` (int): Maximum number of results (default: 50)
- `queries` (list[str]): Custom list of queries to use
Returns:
- `list[NewsResult]`: List of `NewsResult` objects
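For example, aggregating several related searches into one result list via the `queries` parameter (the query strings here are illustrative, inside an async context):

```python
async with GoogleNewsClient(language="en", country="US") as client:
    news = await client.get_news(
        "Tesla",
        max_results=30,
        queries=["Tesla earnings", "Tesla Model Y", "Tesla charging"],
    )
    for r in news[:5]:
        print(r.published, r.source, r.title)
```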
`quick_news(query, max_results=50, language="tr", country="TR")`
Convenience function for quick news retrieval.
```python
from serp import quick_news

news = await quick_news("Tesla", max_results=20, language="en", country="US")
```
NewsResult
Typed result object for Google News articles.
```python
from datetime import datetime

from serp import NewsResult

result = NewsResult(
    title="Tesla announces new model",
    url="https://news.google.com/rss/articles/...",
    original_url="https://example.com/article",
    published=datetime(2026, 5, 11, 8, 0, 0),
    source="BBC",
    description="Tesla unveiled...",
    query="Tesla",
)
```
Attributes:
- `title` (str): News headline
- `url` (str): Google News RSS URL
- `original_url` (str): Original article URL (extracted from the description)
- `published` (datetime): Publication date
- `source` (str): News source name (e.g., "BBC", "NTV")
- `description` (str): News summary/snippet
- `query` (str): Search query that returned this result
Methods:
- `to_dict()`: Convert to a dictionary
NewsSettings
Configuration for Google News scraping.
```python
from serp import NewsSettings

settings = NewsSettings(
    language="tr",   # Language code
    country="TR",    # Country code
    time_range="d",  # Time range: "h", "d", "w", "m"
)
```
Attributes:
- `language` (str): Language code (default: `"tr"`)
- `country` (str): Country code (default: `"TR"`)
- `time_range` (str): Time range filter (default: `"d"`)
Utility Functions
`set_log_level(level)`
Set the log level for all serp loggers.
```python
from serp import set_log_level

set_log_level("DEBUG")    # Enable debug logging
set_log_level("WARNING")  # Only show warnings and errors
```
Exceptions
| Exception | Description |
|---|---|
| `ProxyError` | All proxies failed |
| `CaptchaError` | CAPTCHA could not be solved after retries |
| `PageTimeoutError` | Page load timeout |
| `ParseError` | Failed to parse results |
Constants
| Constant | Description |
|---|---|
| `MAX_RETRIES` | Maximum retry attempts (default: 3) |
| `TIMEOUT_MS` | Page timeout in milliseconds (default: 30000) |
| `USER_AGENTS` | List of user agent strings for rotation |
Interactive CLI
The package includes an interactive CLI tool for testing:
```bash
python main.py
```
Features:
- SERP Search testing
- URL Fetch testing
- Google News RSS testing
- Proxy status checking
Project Structure
```
serp-scraper/
├── serp/                   # Main package
│   ├── __init__.py         # Exports and API
│   ├── client.py           # SerpClient and quick functions
│   ├── config.py           # Configuration constants
│   ├── config_pydantic.py  # Pydantic-based configuration
│   ├── types.py            # Type definitions (SearchResult, etc.)
│   ├── google_news.py      # Google News RSS client
│   ├── search.py           # Browser-based search
│   ├── fetch.py            # URL fetch functionality
│   ├── simple.py           # HTTP-based search
│   ├── parsers.py          # Result parsing logic
│   ├── cache.py            # Disk-based caching
│   └── utils.py            # Utilities and helpers
├── tests/                  # Test suite
│   ├── conftest.py         # Test fixtures
│   ├── test_serp.py
│   ├── test_google_news.py
│   └── test_cache.py
├── main.py                 # Interactive CLI tool
├── .env.example            # Environment variables template
├── pyproject.toml          # Project metadata
└── README.md               # This file
```
Testing
Run the test suite:
```bash
pytest
```
Run with coverage:
```bash
pytest --cov=serp --cov-report=html
```
Run a specific test file:
```bash
pytest tests/test_serp.py
```
Architecture
Search Flow
1. Check the cache for existing results (if `use_cache=True`)
2. Load proxy configuration from file or environment
3. Select a random proxy (or DataImpulse if configured)
4. Create a browser with stealth settings (browser method) or use the HTTP client (http method)
5. Navigate to the search URL
6. Wait for results to load
7. Check for CAPTCHA
8. Parse organic results
9. Cache results before returning (see the sketch below)
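A condensed, runnable skeleton of this flow; the helper functions are hypothetical stubs, not the library's actual internals:

```python
import asyncio

cache: dict[str, list] = {}  # stand-in for the disk cache

async def fetch_serp_html(query: str, proxy: str | None) -> str:
    # Stub for steps 4-7: stealth browser, navigation, CAPTCHA check
    return "<html>...</html>"

def parse_organic_results(html: str) -> list:
    # Stub for step 8: extract rank/title/url/description
    return []

async def search_flow(query: str, use_cache: bool = True) -> list:
    if use_cache and query in cache:            # step 1: cache check
        return cache[query]
    proxy = None                                # steps 2-3: proxy selection (stubbed)
    html = await fetch_serp_html(query, proxy)  # steps 4-7
    results = parse_organic_results(html)       # step 8
    cache[query] = results                      # step 9: cache before returning
    return results

print(asyncio.run(search_flow("python tutorial")))
```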
Search Methods
Browser Method (`method="browser"`):
- Uses `nodriver` for stealth Chrome automation
- More reliable, harder to detect
- Slower due to browser overhead

HTTP Method (`method="http"`):
- Uses `httpx` for direct HTTP requests
- Faster, less resource intensive
- May be blocked more easily
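Both can be forced per call through the `method` parameter documented above (inside an async context with a `client`):

```python
# Reliability first: stealth Chrome via nodriver
results = await client.search("python tutorial", method="browser")

# Speed first: direct HTTP via httpx (blocked more easily)
results = await client.search("python tutorial", method="http")
```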
Caching
The caching system uses a disk-based approach:
- Cache entries are stored as JSON files in `.cache/serp/`
- Keys are SHA256 hashes of the query parameters
- Automatic expiration based on TTL
- Caching can be disabled via `cache_enabled=False` or `SERP_CACHE_ENABLED=false`
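A hypothetical illustration of the key scheme described above; the exact fields the library hashes are not documented here:

```python
import hashlib
import json

# Normalize the query parameters, then hash them to get a stable filename
params = {"query": "python tutorial", "page_num": 1, "source": "google"}
key = hashlib.sha256(
    json.dumps(params, sort_keys=True).encode("utf-8")
).hexdigest()
cache_path = f".cache/serp/{key}.json"  # one JSON file per cache entry
print(cache_path)
```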
Error Handling
The library provides specific exceptions for different failure modes:
- `ProxyError`: All configured proxies failed or returned errors
- `CaptchaError`: The search engine detected automation and presented a CAPTCHA
- `PageTimeoutError`: The page did not load within the timeout period
- `ParseError`: The page loaded but results could not be parsed
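Catching these per failure mode keeps retries and fallbacks explicit; a minimal sketch, assuming the exceptions are importable from the top-level `serp` package:

```python
import asyncio

from serp import (
    SerpClient,
    ProxyError,
    CaptchaError,
    PageTimeoutError,
    ParseError,
)

async def main():
    async with SerpClient() as client:
        try:
            results = await client.search("python programming")
        except CaptchaError:
            print("CAPTCHA persisted after all retries; try another proxy pool")
        except ProxyError:
            print("All configured proxies failed")
        except (PageTimeoutError, ParseError) as exc:
            print(f"Transient failure: {exc}")
        else:
            print(f"Got {len(results)} results")

asyncio.run(main())
```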
Dependencies
| Package | Version | Purpose |
|---|---|---|
| nodriver | >=4.0.0 | Stealth Chrome automation |
| markdownify | >=0.12.0 | HTML to Markdown conversion |
| httpx | >=0.25.0 | Async HTTP client |
| beautifulsoup4 | >=4.12.0 | HTML parsing |
| pydantic | >=2.0.0 | Configuration validation |
| python-dotenv | >=1.0.0 | .env file support |
Development Dependencies
| Package | Version | Purpose |
|---|---|---|
| pytest | >=7.0.0 | Testing framework |
| pytest-asyncio | >=0.21.0 | Async test support |
| pytest-cov | >=4.0.0 | Coverage reporting |
| pytest-mock | >=3.10.0 | Mocking utilities |
| pytest-httpserver | >=1.0.0 | HTTP server for testing |
| ruff | >=0.1.0 | Linting |
| mypy | >=1.0.0 | Type checking |
| build | >=1.0.0 | Package building |
| twine | >=4.0.0 | Package publishing |
License
MIT License - see LICENSE file for details.
Author
neuronaline - flashneuron@proton.me
Disclaimer
This software is provided for educational and legitimate purposes only. Users are responsible for ensuring their use complies with search engine Terms of Service and applicable laws. The authors assume no liability for misuse of this software.