SERP scraper with nodriver, proxy rotation, and stealth browsing
⚠️ WARNING: This project is in early development stages.
SERP Scraper
A powerful, async Python library for scraping Google and Bing Search Engine Results Pages (SERPs) with proxy rotation, intelligent caching, and stealth browsing.
Features
- Dual Search Methods: Browser-based (nodriver) and HTTP-based (httpx) scraping
- Google News RSS: Scrape news articles via Google News RSS feeds
- Google Scholar: Search academic papers from Google Scholar
- Proxy Rotation: DataImpulse and custom proxy support with automatic rotation
- Intelligent Caching: Disk-based caching with configurable TTL
- CAPTCHA Handling: Automatic detection with retry logic and exponential backoff
- Type Safety: Full type annotations with Pydantic validation
- Async/Await: Modern asynchronous API design
- Environment Config: .env file support for configuration
- CLI Tool: Interactive command-line interface for testing
- Dual Output Format: Text (human-readable) and JSON (LLM-friendly) output
- REST API: Optional FastAPI-based REST API with rate limiting and authentication
- Content Compression: Built-in compress_content() utility for truncating long content into head, middle, and tail portions; available both in the core library and the REST API
Installation
Basic Installation
pip install serp-scraper
From Source
git clone https://github.com/neuronaline/serp-scraper.git
cd serp-scraper
pip install -e .
With Dependencies
pip install serp-scraper[dev] # With dev tools
pip install serp-scraper[test] # With test dependencies
pip install serp-scraper[api] # With REST API (FastAPI)
Requirements
- Python 3.10 or higher
- Google Chrome browser installed
- Virtual display (for non-headless mode, the default):
  - Linux/headless servers: Install Xvfb (sudo apt install xvfb) and use DISPLAY=:99 or run with xvfb-run
  - macOS: No additional setup needed (has built-in display)
  - Windows: No additional setup needed (has built-in display)
Quick Start
Recommended: Using SerpClient
import asyncio
from serp import SerpClient

async def main():
    async with SerpClient() as client:
        results = await client.search("python programming")
        for r in results:
            print(f"{r.rank}. {r.title}")
            print(f" {r.url}")
            print(f" {r.description[:100]}...")

asyncio.run(main())
Running on Headless Servers
By default, the browser runs in non-headless mode (visible window) which requires a display. For headless servers or CI/CD environments, use one of these approaches:
Option 1: Use xvfb-run (recommended for Linux):
xvfb-run -a python your_script.py
Option 2: Set DISPLAY environment variable:
DISPLAY=:99 python your_script.py
Option 3: Run in headless mode:
async with SerpClient(headless=True) as client:
    results = await client.search("python programming")
Or via environment variable:
SERP_HEADLESS=true python your_script.py
Note: The VirtualScreenRequiredError exception is raised when running non-headless without a display.
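The display requirement can be checked before a browser is launched. The sketch below is only an illustration of that idea, not the library's implementation: `DisplayRequiredError` and `check_display` are hypothetical names, and the rule "Linux without `DISPLAY` needs a virtual screen" is an assumption based on the options listed above.

```python
import sys


class DisplayRequiredError(RuntimeError):
    """Hypothetical stand-in for the library's VirtualScreenRequiredError."""


def check_display(headless: bool, env: dict, platform: str = sys.platform) -> None:
    """Raise if a visible browser is requested on Linux without a display.

    Assumption: macOS and Windows always have a display, as the
    Requirements section states.
    """
    if headless:
        return  # headless mode needs no display at all
    if platform.startswith("linux") and not env.get("DISPLAY"):
        raise DisplayRequiredError(
            "No display found: run under xvfb-run, set DISPLAY, or use headless=True"
        )
```

A caller would typically pass `dict(os.environ)` as `env`; taking the environment as an argument keeps the check easy to test.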
Google News RSS
Scrape news articles using Google News RSS feeds:
import asyncio
from serp import GoogleNewsClient

async def main():
    async with GoogleNewsClient(language="en", country="US") as client:
        news = await client.get_news("Tesla", max_results=20)
        for r in news:
            print(f"{r.title}")
            print(f" Source: {r.source}")
            print(f" URL: {r.url}")
            print(f" Date: {r.published}")

asyncio.run(main())
Or use the quick function:
import asyncio
from serp import quick_news

async def main():
    news = await quick_news("Tesla", language="en", country="US", max_results=20)
    print(f"Found {len(news)} news articles")

asyncio.run(main())
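For readers curious what RSS-based scraping involves, here is a minimal standard-library sketch that turns an RSS document into title/URL/source/date records. It is not the library's parser: `parse_rss_items` is a hypothetical helper, and `SAMPLE_RSS` is made-up data shaped like a typical RSS 2.0 feed.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Made-up sample feed; a real feed would be downloaded over HTTP.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>Example headline</title>
      <link>https://news.example.com/articles/1</link>
      <pubDate>Mon, 11 May 2026 08:00:00 GMT</pubDate>
      <source url="https://example.com">Example Source</source>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text: str, max_results: int = 50) -> list:
    """Extract one dict per <item>, stopping at max_results."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        if len(items) >= max_results:
            break
        published = item.findtext("pubDate")
        items.append({
            "title": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            "source": item.findtext("source", default=""),
            # RSS dates use the RFC 822 format, which email.utils can parse
            "published": parsedate_to_datetime(published) if published else None,
        })
    return items
```

Calling `parse_rss_items(SAMPLE_RSS)` yields a single record whose fields mirror the `title`, `url`, `source`, and `published` attributes printed in the examples above.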
URL Fetching
import asyncio
from serp import SerpClient

async def main():
    async with SerpClient() as client:
        # Fetch page content as Markdown
        # Automatically detects JavaScript and uses browser if needed
        content = await client.fetch("https://example.com")
        print(content)

asyncio.run(main())
Fetch Strategy:
- Static pages (no JavaScript): Uses fast HTTP + BeautifulSoup4
- JavaScript-detected pages: Automatically uses browser (nodriver) for execution
- Failed/incomplete fetch: Falls back to browser
- Full page load guarantee: Browser waits for the load event + 1s for JS rendering
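The README does not say how JavaScript-dependent pages are detected, so the following is only one plausible heuristic, written with the standard library rather than BeautifulSoup4. `needs_browser` and the `min_text_chars` cutoff are assumptions for illustration: a page is sent to the browser when it contains scripts but very little visible text.

```python
from html.parser import HTMLParser


class _PageStats(HTMLParser):
    """Counts <script> tags and measures visible text length."""

    def __init__(self) -> None:
        super().__init__()
        self.script_count = 0
        self.text_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_chars += len(data.strip())


def needs_browser(html: str, min_text_chars: int = 200) -> bool:
    """Guess whether a page must be rendered in a browser.

    Assumed rule: scripts are present and the static HTML carries
    less visible text than min_text_chars.
    """
    stats = _PageStats()
    stats.feed(html)
    return stats.script_count > 0 and stats.text_chars < min_text_chars
```

A single-page-app shell such as `<div id="root"></div><script src="app.js"></script>` would be routed to the browser, while a text-heavy static page would stay on the fast HTTP path.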
Content Compression
Compress long content into head, middle, and tail portions — useful for LLM contexts where you want to retain key information while reducing token count:
from serp import compress_content, CompressionMeta

long_text = "..."  # e.g. 20,000+ characters
compressed, meta = compress_content(long_text)
if meta.was_truncated:
    print(f"Reduced from {meta.original_length:,} to {meta.compressed_length:,} chars")
    print(f"Truncated {meta.truncated_chars:,} characters")
Compression can also be enabled directly during fetch:
content = await client.fetch("https://example.com", compress=True)
The standalone compress_content() function gives you full control over thresholds and metadata, while the compress=True parameter on fetch() is a convenience shortcut.
Quick Functions
For simple use cases without creating a client:
import asyncio
from serp import quick_search, quick_fetch

async def main():
    # Search
    results = await quick_search("web scraping")
    print(f"Found {len(results)} results")
    # Fetch URL (with optional compression)
    content = await quick_fetch("https://example.com", compress=True)
    print(content[:500])

asyncio.run(main())
Using with Configuration
import asyncio
from serp import SerpClient, SerpConfig

# Create configured client
config = SerpConfig(
    log_level="DEBUG",
    max_retries=5,
    cache_ttl=3600,  # 1 hour
    cache_enabled=True,
)

async def main():
    # "async with" is only valid inside a coroutine
    async with SerpClient(config) as client:
        results = await client.search("python tutorial")
        print(f"Found {len(results)} results")

asyncio.run(main())
Environment Variables (.env file)
Create a .env file in your project. Copy from .env.example for all options:
# DataImpulse Proxy (recommended)
SERP_DATAIMPULSE_GATEWAY=http://gw.dataimpulse.com:10001
SERP_DATAIMPULSE_USER=your_username
SERP_DATAIMPULSE_PASS=your_password
# Custom Proxies (comma-separated)
SERP_CUSTOM_PROXIES=http://user:pass@proxy1.com:8080,socks5://proxy2.com:1080
# Proxy Strategy: "random" or "dataimpulse_first"
SERP_PROXY_STRATEGY=dataimpulse_first
# Logging
SERP_LOG_LEVEL=WARNING
SERP_DEBUG=false
# Cache
SERP_CACHE_ENABLED=true
SERP_CACHE_DIR=.cache/serp
SERP_CACHE_TTL=86400
# Search
SERP_DEFAULT_SOURCE=auto # "google", "bing", or "auto"
SERP_HEADLESS=false
SERP_TIMEOUT=30
# Retry
SERP_MAX_RETRIES=3
SERP_RETRY_DELAY_MIN=0.5
SERP_RETRY_DELAY_MAX=2.0
SERP_EXPONENTIAL_BACKOFF=false
# Custom User Agent (optional)
SERP_USER_AGENT=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
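To make the two proxy strategies concrete, here is a small sketch of how a comma-separated `SERP_CUSTOM_PROXIES` value might be parsed and how "random" and "dataimpulse_first" could differ. `parse_proxies` and `pick_proxy` are hypothetical helpers, and the selection rules are an interpretation of the option names rather than the library's documented behavior. The DataImpulse proxy is passed in as a ready-made URL because the README does not describe how the gateway, user, and password are combined.

```python
import random
from typing import Optional


def parse_proxies(raw: str) -> list:
    """Split a comma-separated proxy list, ignoring blanks and whitespace."""
    return [p.strip() for p in raw.split(",") if p.strip()]


def pick_proxy(
    strategy: str,
    dataimpulse_url: Optional[str],
    custom: list,
    rng=random,
) -> Optional[str]:
    """Choose a proxy URL, or None when nothing is configured.

    Assumed semantics: "dataimpulse_first" prefers the DataImpulse proxy
    when one is configured; otherwise any configured proxy is picked at random.
    """
    if strategy == "dataimpulse_first" and dataimpulse_url:
        return dataimpulse_url
    pool = list(custom) + ([dataimpulse_url] if dataimpulse_url else [])
    return rng.choice(pool) if pool else None
```

With the example values above, `parse_proxies` would return two URLs, and `pick_proxy("dataimpulse_first", ...)` would only fall back to them when no DataImpulse proxy is set.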
Configuration
SerpConfig Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| Proxy Settings | | | |
| dataimpulse_gateway | str | None | DataImpulse gateway URL |
| dataimpulse_user | str | None | DataImpulse username |
| dataimpulse_pass | str | None | DataImpulse password |
| dataimpulse_protocol | str | "http" | DataImpulse proxy protocol ("http" or "socks5") |
| dataimpulse_country | str | None | DataImpulse country code (optional) |
| dataimpulse_sessid | str | None | DataImpulse session ID for sticky proxy (optional) |
| dataimpulse_sessttl | int | None | DataImpulse session TTL in minutes (optional) |
| custom_proxies | str | "" | Comma-separated proxy URLs |
| proxy_strategy | str | "dataimpulse_first" | Proxy selection: "random" or "dataimpulse_first" |
| Cache Settings | | | |
| cache_enabled | bool | true | Enable/disable caching |
| cache_dir | str | ".cache/serp" | Cache directory path |
| cache_ttl | int | 86400 | Cache TTL in seconds (min 60) |
| Retry Settings | | | |
| max_retries | int | 3 | Maximum retry attempts (1-10) |
| retry_delay_min | float | 0.5 | Minimum retry delay in seconds |
| retry_delay_max | float | 2.0 | Maximum retry delay in seconds |
| exponential_backoff | bool | false | Use exponential backoff |
| Search Settings | | | |
| default_source | str | "auto" | Default search source: "google", "bing", or "auto" |
| headless | bool | false | Run browser in headless mode (requires virtual display when false) |
| timeout | int | 30 | Request timeout in seconds (5-120) |
| user_agent | str | None | Custom user agent string |
| Logging | | | |
| log_level | str | "WARNING" | Logging level (DEBUG, INFO, WARNING, ERROR) |
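The retry settings interact in a way a short sketch can make clearer. The function below is an assumption about how `retry_delay_min`, `retry_delay_max`, and `exponential_backoff` could combine (a random delay in the configured range, doubled per attempt when backoff is on); `retry_delay` is a hypothetical name and the library's exact formula is not documented here.

```python
import random


def retry_delay(
    attempt: int,
    delay_min: float = 0.5,
    delay_max: float = 2.0,
    exponential_backoff: bool = False,
    rng=random,
) -> float:
    """Seconds to wait before retry number `attempt` (1-based)."""
    delay = rng.uniform(delay_min, delay_max)  # jitter within the configured range
    if exponential_backoff:
        delay *= 2 ** (attempt - 1)  # 1x, 2x, 4x, ... per attempt
    return delay
```

Under these assumptions, the third retry with backoff enabled waits between 2.0 and 8.0 seconds, while without backoff every retry waits between 0.5 and 2.0 seconds.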
Environment Variables
| Variable | Description | Default |
|---|---|---|
| SERP_DATAIMPULSE_GATEWAY | DataImpulse gateway URL | - |
| SERP_DATAIMPULSE_USER | DataImpulse username | - |
| SERP_DATAIMPULSE_PASS | DataImpulse password | - |
| SERP_DATAIMPULSE_PROTOCOL | DataImpulse proxy protocol | http |
| SERP_DATAIMPULSE_COUNTRY | DataImpulse country code (optional) | - |
| SERP_DATAIMPULSE_SESSID | DataImpulse session ID for sticky proxy (optional) | - |
| SERP_DATAIMPULSE_SESSTTL | DataImpulse session TTL in minutes (optional) | - |
| SERP_CUSTOM_PROXIES | Comma-separated proxy URLs | - |
| SERP_PROXY_STRATEGY | Proxy selection strategy | dataimpulse_first |
| SERP_LOG_LEVEL | Logging level | WARNING |
| SERP_CACHE_DIR | Cache directory path | .cache/serp |
| SERP_CACHE_TTL | Default cache TTL in seconds | 86400 |
| SERP_CACHE_ENABLED | Enable/disable caching | true |
| SERP_MAX_RETRIES | Maximum retry attempts | 3 |
| SERP_RETRY_DELAY_MIN | Minimum retry delay (seconds) | 0.5 |
| SERP_RETRY_DELAY_MAX | Maximum retry delay (seconds) | 2.0 |
| SERP_EXPONENTIAL_BACKOFF | Use exponential backoff | false |
| SERP_TIMEOUT | Request timeout in seconds | 30 |
| SERP_DEBUG | Enable debug logging | false |
| SERP_DOTENV_FILE | Path to .env file | Auto-detect |
| API Settings | | |
| API_HOST | API server host | 0.0.0.0 |
| API_PORT | API server port | 8000 |
| API_DEBUG | API debug mode | false |
| API_MAX_CONCURRENT_REQUESTS | Max concurrent requests (pool size) | 15 |
| API_REQUEST_TIMEOUT | Request timeout in seconds | 60 |
| API_RATE_LIMIT_SEARCH | Rate limit for /search (req/min) | 30 |
| API_RATE_LIMIT_FETCH | Rate limit for /fetch (req/min) | 60 |
| API_RATE_LIMIT_NEWS | Rate limit for /news (req/min) | 30 |
| API_RATE_LIMIT_SCHOLAR | Rate limit for /scholar (req/min) | 30 |
| API_RATE_LIMIT_DEFAULT | Default rate limit (req/min) | 100 |
| API_KEYS_HASHED | Comma-separated hashed API keys | - |
| API_ALLOW_NO_AUTH | Allow unauthenticated access | false |
| API_LOG_LEVEL | API logging level | INFO |
| API_LOG_DIR | API log directory | logs |
| API_LOG_RETENTION_DAYS | Log retention days | 7 |
| API_CORS_ORIGINS | CORS allowed origins (comma-separated) | - |
API Reference
SerpClient
The recommended high-level interface for using the library.
from serp import SerpClient
client = SerpClient(
    headless=False,       # Optional
    use_cache=True,       # Optional
    cache_ttl=86400,      # Optional
    source=None,          # Optional: "google", "bing", or None (auto)
    max_retries=3,        # Optional
    timeout=30,           # Optional
    log_level="WARNING",  # Optional
)
Methods
client.search(query, page_num=1, source=None, use_cache=None)
Search for a query and return results.
Note: Google and Bing SERP searches always use browser (nodriver) because these search engines require JavaScript to render results. The method parameter is deprecated and ignored.
Parameters:
- query (str): Search query string
- page_num (int): Page number (1-based), defaults to 1
- source (str): Search engine: "google", "bing", or None (auto: google first, bing fallback)
- use_cache (bool): Whether to use cache. None uses client default.
Returns:
list[SearchResult]: List of SearchResult objects
Raises:
- ProxyError: All proxies failed
- CaptchaError: CAPTCHA detected after all retries
- PageTimeoutError: Page load timeout
- ParseError: Failed to parse results
client.fetch(url, use_cache=None, prefer_browser=False, compress=False)
Fetch a URL and return content as Markdown.
Fetch Strategy:
- Static pages (no JavaScript): Uses fast HTTP + BeautifulSoup4 - low resource usage, fast execution
- JavaScript detected in HTML: Immediately falls back to browser (nodriver) for execution
- Failed/incomplete fetch: Falls back to browser (handles CAPTCHA, anti-bot measures)
- Full page load guarantee: Browser waits for the load event + 1s additional wait for JS rendering
Parameters:
- url (str): Target URL
- use_cache (bool): Whether to use cache. None uses client default.
- prefer_browser (bool): If True, use browser directly (legacy mode). If False (default), try BS4 first, then fall back to browser.
- compress (bool): If True and content exceeds ~10K chars, compress by taking head, middle, and tail portions. (Default: False)
Returns:
str: Page content converted to Markdown (optionally compressed)
Note: When you need metadata about the truncation (original length, etc.), use the standalone compress_content() function from the serp package instead of the compress parameter on fetch().
SearchResult
Typed result object returned by search operations.
from serp import SearchResult
result = SearchResult(
    rank=1,
    title="Example Title",
    url="https://example.com",
    description="Example description...",
    source="google",  # or "bing"
)
Attributes:
- rank (int): Position in search results (1-based)
- title (str): Result title
- url (str): Target URL
- description (str): Result snippet/description
- source (str): Search engine source ("google" or "bing")
Methods:
to_dict(): Convert to dictionary for backward compatibility
Quick Functions
Module-level convenience functions using default client:
from serp import quick_search, quick_fetch, quick_search_http
# Quick search (auto method)
results = await quick_search("query")
# HTTP-based search only
results = await quick_search_http("query")
# Fetch URL (with optional compression)
content = await quick_fetch("https://example.com", compress=True)
GoogleNewsClient
Client for scraping Google News via RSS feeds.
from serp import GoogleNewsClient
client = GoogleNewsClient(
    language="tr",   # Language code (tr, en, etc.)
    country="TR",    # Country code (TR, US, etc.)
    time_range="d",  # Time range: "h" (hour), "d" (day), "w" (week), "m" (month)
)
Methods
client.get_news(query, max_results=50, queries=None)
Get news articles for a search term.
Parameters:
- query (str): Search term to find news for
- max_results (int): Maximum number of results (default: 50)
- queries (list[str]): Custom list of queries to use
Returns:
list[NewsResult]: List of NewsResult objects
quick_news(query, max_results=50, language="tr", country="TR")
Convenience function for quick news retrieval.
from serp import quick_news
news = await quick_news("Tesla", max_results=20, language="en", country="US")
NewsResult
Typed result object for Google News articles.
from datetime import datetime
from serp import NewsResult

result = NewsResult(
    title="Tesla announces new model",
    url="https://news.google.com/rss/articles/...",
    original_url="https://example.com/article",
    published=datetime(2026, 5, 11, 8, 0, 0),
    source="BBC",
    description="Tesla unveiled...",
    query="Tesla",
)
Attributes:
- title (str): News headline
- url (str): Google News RSS URL
- original_url (str): Original article URL (extracted from description)
- published (datetime): Publication date
- source (str): News source name (e.g., "BBC", "NTV")
- description (str): News summary/snippet
- query (str): Search query that returned this result
Methods:
to_dict(): Convert to dictionary
NewsSettings
Configuration for Google News scraping.
from serp import NewsSettings
settings = NewsSettings(
    language="tr",   # Language code
    country="TR",    # Country code
    time_range="d",  # Time range: "h", "d", "w", "m"
)
Attributes:
- language (str): Language code (default: "tr")
- country (str): Country code (default: "TR")
- time_range (str): Time range filter (default: "d")
OutputFormatter
Structured output formatting for CLI tools and LLM integration.
from serp import OutputFormatter, OUTPUT_TEXT, OUTPUT_JSON
# Format search results
output = OutputFormatter.format_search_results(
    results=search_results,
    mode=OUTPUT_JSON,
    query="python tutorial",
    source="google",
)

# Format news results
output = OutputFormatter.format_news(
    news_list=news_results,
    mode=OUTPUT_JSON,
    search_term="Tesla",
    language="en",
    country="US",
)

# Format fetch results
output = OutputFormatter.format_fetch(
    content=page_content,
    url="https://example.com",
    char_count=len(page_content),
    mode=OUTPUT_JSON,
)
Constants:
- OUTPUT_TEXT: Text output mode (human-readable)
- OUTPUT_JSON: JSON output mode (machine-parseable)
Classes:
- OutputFormatter: Main interface with static format_* methods
- TextFormatter: Human-readable text formatting
- JSONFormatter: JSON output formatting
- OutputError: Structured error representation
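The idea behind the two output modes can be shown with a small standalone sketch. This does not reproduce the library's actual output: `Result` is a plain dataclass standing in for SearchResult, `format_results` is a hypothetical helper, and the JSON field names (`query`, `count`, `results`) are assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Result:
    """Stand-in for SearchResult with the same five attributes."""
    rank: int
    title: str
    url: str
    description: str
    source: str


def format_results(results: list, mode: str, query: str) -> str:
    """Render results as JSON (machine-parseable) or text (human-readable)."""
    if mode == "json":
        payload = {
            "query": query,
            "count": len(results),
            "results": [asdict(r) for r in results],
        }
        return json.dumps(payload, indent=2)
    lines = [f'Results for "{query}":']
    for r in results:
        lines.append(f"{r.rank}. {r.title}")
        lines.append(f"   {r.url}")
    return "\n".join(lines)
```

Keeping both modes behind one function lets a CLI switch between them with a single flag, which is the role the `--format` option plays in the interactive tool.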
Utility Functions
set_log_level(level)
Set the log level for all serp loggers.
from serp import set_log_level
set_log_level("DEBUG") # Enable debug logging
set_log_level("WARNING") # Only show warnings and errors
compress_content(content, threshold=10000, head_pct=0.35, middle_pct=0.15, tail_pct=0.50)
Compress long content by extracting head, middle, and tail portions, joined with a truncation marker.
from serp import compress_content, CompressionMeta
compressed, meta = compress_content(
    content,
    threshold=10000,  # Content longer than this gets compressed
    head_pct=0.35,    # 35% of target for the head
    middle_pct=0.15,  # 15% of target for the middle
    tail_pct=0.50,    # 50% of target for the tail
)
if meta.was_truncated:
    print(f"Original: {meta.original_length:,} chars")
    print(f"Compressed: {meta.compressed_length:,} chars")
    print(f"Truncated: {meta.truncated_chars:,} chars")
else:
    print("Content was within threshold, returned as-is")
Parameters:
| Param | Type | Default | Description |
|---|---|---|---|
| content | str | - | The content string to compress |
| threshold | int | 10000 | Char length threshold. Content at or below this length is returned unchanged. |
| head_pct | float | 0.35 | Fraction of target length for the head portion |
| middle_pct | float | 0.15 | Fraction of target length for the middle portion |
| tail_pct | float | 0.50 | Fraction of target length for the tail portion |
Returns:
tuple[str, CompressionMeta]: (compressed_content, metadata)
CompressionMeta attributes:
| Attribute | Type | Description |
|---|---|---|
| original_length | int | Character count before compression |
| compressed_length | int | Character count after compression |
| truncated_chars | int | Number of characters removed |
| was_truncated | bool | Whether content exceeded threshold and was truncated |
Exceptions
| Exception | Description |
|---|---|
| ProxyError | All proxies failed |
| CaptchaError | CAPTCHA could not be solved after retries |
| PageTimeoutError | Page load timeout |
| ParseError | Failed to parse results |
| VirtualScreenRequiredError | No virtual display available for non-headless mode |
Constants
| Constant | Description |
|---|---|
| MAX_RETRIES | Maximum retry attempts (default: 3) |
| TIMEOUT_MS | Page timeout in milliseconds (default: 30000) |
| USER_AGENTS | List of user agent strings for rotation |
Interactive CLI
The package includes an interactive CLI tool for testing:
python main.py
python main.py --format json # JSON output for LLM integration
python main.py -f text # Human-readable text output (default)
python main.py --compress # Enable content compression for URL fetch
python main.py --compress -f json # Compress + JSON output
Features:
- SERP Search testing
- URL Fetch testing (with optional --compress flag for long content)
- Google News RSS testing
- Dual output format (text/JSON)
- Proxy status checking
REST API (Optional)
The package includes an optional FastAPI-based REST API for programmatic access.
Installation
pip install serp-scraper[api]
Running the API
python -m api.main
# Or with uvicorn directly:
uvicorn api.main:app --host 0.0.0.0 --port 8000
API Endpoints
- GET /health - Health check
- POST /api/v1/search - Search SERP results
- POST /api/v1/fetch - Fetch URL content (supports optional compress field)
- POST /api/v1/news - Get Google News articles
- POST /api/v1/scholar - Search Google Scholar articles
Authentication
The API uses API key authentication. Set the X-API-Key header in your requests.
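As an illustration of an authenticated call, the sketch below builds (but does not send) a request to the fetch endpoint using only the standard library. `build_fetch_request` is a hypothetical helper, and the `url` field name in the JSON body is an assumption; the `compress` field and the `X-API-Key` header come from this README.

```python
import json
import urllib.request


def build_fetch_request(
    base_url: str,
    api_key: str,
    target_url: str,
    compress: bool = False,
) -> urllib.request.Request:
    """Prepare a POST /api/v1/fetch request carrying the API key header."""
    # Assumed body schema: {"url": ..., "compress": ...}
    body = json.dumps({"url": target_url, "compress": compress}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/fetch",
        data=body,
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it to a running API server, for example one started on `http://localhost:8000`.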
Rate Limiting
The API includes rate limiting middleware to prevent abuse.
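The project tree describes the middleware as a sliding window rate limiter. A minimal in-memory version of that technique might look like the sketch below; `SlidingWindowLimiter` is a hypothetical class, not the code in `api/middleware/rate_limit.py`, and keying by API key or client IP is an assumption.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key within the last `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable for testing
        self._hits = {}     # key -> deque of request timestamps

    def allow(self, key: str) -> bool:
        now = self.clock()
        hits = self._hits.setdefault(key, deque())
        # Drop timestamps that have slid out of the window
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

With `limit=30` and a 60-second window this would correspond to the 30 req/min default listed for `/search`; a request is rejected only while 30 earlier requests from the same key are still inside the window.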
Project Structure
serp-scraper/
├── serp/ # Core library
│ ├── __init__.py # Public API exports
│ ├── client.py # SerpClient and quick functions
│ ├── cache.py # Disk-based caching (DiskCache, NullCache)
│ ├── compression.py # Content compression (compress_content, CompressionMeta)
│ ├── config.py # Configuration constants (BING_URL_TEMPLATE, USER_AGENTS)
│ ├── config_pydantic.py # Pydantic-based SerpConfig
│ ├── google_news.py # Google News RSS client
│ ├── google_scholar.py # Google Scholar client
│ ├── http_search.py # HTTP-based search (httpx)
│ ├── output_formatter.py # Text and JSON output formatting
│ ├── parsers.py # Browser-based parsing (nodriver)
│ ├── types.py # Type definitions (SearchResult, RetryPolicy, etc.)
│ └── utils.py # Exceptions, helpers, constants
├── api/ # REST API (optional)
│ ├── __init__.py
│ ├── main.py # FastAPI application
│ ├── config.py # API configuration (APISettings, RateLimitConfig)
│ ├── deps.py # DI: auth, rate limit, semaphore
│ ├── exceptions.py # Custom exception classes
│ ├── models/ # Pydantic models
│ │ ├── __init__.py
│ │ ├── requests.py # Request schemas
│ │ └── responses.py # Response schemas
│ ├── routers/ # API routes
│ │ ├── __init__.py
│ │ ├── health.py # GET /health
│ │ ├── search.py # POST /api/v1/search
│ │ ├── fetch.py # POST /api/v1/fetch
│ │ ├── news.py # POST /api/v1/news
│ │ └── scholar.py # POST /api/v1/scholar
│ ├── middleware/ # Middleware
│ │ ├── __init__.py
│ │ ├── rate_limit.py # Sliding window rate limiter
│ │ └── logging_middleware.py
│ ├── utils/ # Backward-compat re-exports
│ │ ├── __init__.py
│ │ └── compression.py # Re-exports from serp.compression
│ └── cli/ # API CLI tools
│ ├── __init__.py
│ └── keys.py # API key generation
├── tests/ # Test suite
│ ├── __init__.py
│ ├── conftest.py # Fixtures, global state reset
│ ├── test_serp.py # Core library tests
│ ├── test_cache.py # Cache tests
│ ├── test_google_news.py # Google News tests
│ └── api/ # API tests
│ └── __init__.py
├── main.py # Interactive CLI tool
├── .env.example # Environment variables template
├── pyproject.toml # Project metadata
└── README.md # This file
Testing
Run the test suite:
pytest
Run with coverage:
pytest --cov=serp --cov-report=html
Run specific test file:
pytest tests/test_serp.py
Architecture
Search Flow
- Check cache for existing results (if use_cache=True)
- Load proxy configuration from file or environment
- Select random proxy (or DataImpulse if configured)
- Create browser with stealth settings
- Navigate to search URL
- Wait for results to load
- Check for CAPTCHA
- Parse organic results
- Cache results before returning
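The flow above can be sketched as a small orchestration function. Everything here is illustrative: `search_flow`, `CaptchaDetected`, and the injected `pick_proxy`, `load_page`, and `parse` callables are hypothetical stand-ins for the browser and parser, the cache is a plain dict, and the CAPTCHA check is a naive substring test.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class CaptchaDetected(Exception):
    """Hypothetical stand-in for the library's CaptchaError."""


async def search_flow(
    query: str,
    cache: dict,
    pick_proxy: Callable[[], Optional[str]],
    load_page: Callable[[str, Optional[str]], Awaitable[str]],
    parse: Callable[[str], list],
    max_retries: int = 3,
) -> list:
    if query in cache:                        # cache lookup
        return cache[query]
    for _ in range(max_retries):
        proxy = pick_proxy()                  # proxy selection
        html = await load_page(query, proxy)  # browser navigation and wait
        if "captcha" in html.lower():         # naive CAPTCHA check
            continue                          # retry, possibly via another proxy
        results = parse(html)                 # parse organic results
        cache[query] = results                # cache before returning
        return results
    raise CaptchaDetected(f"CAPTCHA persisted after {max_retries} attempts")


def run_search_flow(*args, **kwargs) -> list:
    """Synchronous convenience wrapper around search_flow."""
    return asyncio.run(search_flow(*args, **kwargs))
```

Injecting the page loader and parser keeps the control flow testable without a real browser; a second call with the same query returns the cached list without touching the loader.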
Fetch Strategy (per docs/Scraper Strategy.md)
For Google Scholar, Google News, and URL Fetch:
- Primary: BS4 (BeautifulSoup4) + HTTP - fast, lightweight
- Fallback: Browser (nodriver) - triggered on BS4 failure
For Google and Bing SERP:
- Always uses browser (nodriver) because JavaScript is required to render results
Search Methods
SERP Search (Google/Bing):
- Always uses nodriver for stealth Chrome automation
- More reliable, harder to detect
- Slower due to browser overhead
URL Fetch / Google News / Google Scholar:
- Default: BS4-first, browser fallback on failure
- BS4 method: Uses httpx for direct HTTP requests with BeautifulSoup4 parsing
- Browser fallback: Uses nodriver when BS4 fails (empty content, CAPTCHA, parse errors, timeouts)
Caching
The caching system uses a disk-based approach:
- Cache entries stored as JSON files in .cache/serp/
- Keys are SHA256 hashes of query parameters
- Automatic expiration based on TTL
- Can be disabled via cache_enabled=False or SERP_CACHE_ENABLED=false
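A disk cache along these lines can be sketched in a few functions. The helpers `cache_key`, `cache_put`, and `cache_get` are hypothetical, and the on-disk layout (one `<sha256>.json` file holding a timestamp and a value) is an assumption; only the use of JSON files, SHA256 keys, and TTL expiry comes from the description above.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Any, Optional


def cache_key(**params: Any) -> str:
    """SHA256 over the query parameters, independent of argument order."""
    payload = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cache_put(cache_dir: Path, key: str, value: Any, now: Optional[float] = None) -> None:
    """Write the value and its storage time to <cache_dir>/<key>.json."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = {"stored_at": time.time() if now is None else now, "value": value}
    (cache_dir / f"{key}.json").write_text(json.dumps(entry), encoding="utf-8")


def cache_get(cache_dir: Path, key: str, ttl: int = 86400, now: Optional[float] = None) -> Optional[Any]:
    """Return the cached value, or None if missing or older than ttl seconds."""
    path = cache_dir / f"{key}.json"
    if not path.exists():
        return None
    entry = json.loads(path.read_text(encoding="utf-8"))
    current = time.time() if now is None else now
    if current - entry["stored_at"] > ttl:
        path.unlink()  # expired: remove the stale file
        return None
    return entry["value"]
```

Sorting the keys before hashing means `cache_key(query="python", page_num=1)` and `cache_key(page_num=1, query="python")` map to the same file.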
Content Compression
The library includes a built-in content compression utility for truncating long content while preserving key information:
Algorithm:
- If content length ≤ threshold (default: 10,000 chars), return unchanged
- Target compressed length = 45% of threshold (~4,500 chars)
- Extract three portions: head (35%), middle (15%), tail (50%)
- Join with a truncation marker: \n\n-- X,XXX chars truncated --\n\n
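The steps above can be sketched in plain Python. This is an approximation, not the library's `compress_content`: `sketch_compress` is a hypothetical name, it returns the removed character count instead of a CompressionMeta object, and placing one marker in each gap (head to middle, middle to tail) with the middle slice centered in the document is an assumption.

```python
def sketch_compress(
    content: str,
    threshold: int = 10000,
    head_pct: float = 0.35,
    middle_pct: float = 0.15,
    tail_pct: float = 0.50,
):
    """Return (compressed_text, removed_char_count)."""
    if len(content) <= threshold:
        return content, 0

    target = round(threshold * 0.45)  # ~4,500 chars for the default threshold
    head_len = round(target * head_pct)
    middle_len = round(target * middle_pct)
    tail_len = round(target * tail_pct)

    mid_start = (len(content) - middle_len) // 2  # center the middle slice
    tail_start = len(content) - tail_len

    head = content[:head_len]
    middle = content[mid_start:mid_start + middle_len]
    tail = content[tail_start:]

    gap1 = mid_start - head_len                   # chars dropped before the middle
    gap2 = tail_start - (mid_start + middle_len)  # chars dropped after the middle

    def marker(n: int) -> str:
        return f"\n\n-- {n:,} chars truncated --\n\n"

    return head + marker(gap1) + middle + marker(gap2) + tail, gap1 + gap2
```

For a 20,000-character input with the defaults, the slices are 1,575, 675, and 2,250 characters, so 15,500 characters are dropped in total.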
Usage scenarios:
- LLM context optimization: Reduce token count while preserving document structure
- API responses: The REST API /fetch endpoint supports "compress": true
- Direct library use: from serp import compress_content or client.fetch(url, compress=True)
Error Handling
The library provides specific exceptions for different failure modes:
- ProxyError: All configured proxies failed or returned errors
- CaptchaError: Search engine detected automation and presented CAPTCHA
- PageTimeoutError: Page did not load within the timeout period
- ParseError: Page loaded but results could not be parsed
- VirtualScreenRequiredError: Non-headless mode selected but no virtual display available (use xvfb-run, set DISPLAY, or use headless=True)
Dependencies
| Package | Version | Purpose |
|---|---|---|
| nodriver | >=0.50.0 | Stealth Chrome automation |
| markdownify | >=0.12.0 | HTML to Markdown conversion |
| httpx | >=0.25.0 | Async HTTP client |
| beautifulsoup4 | >=4.12.0 | HTML parsing |
| pydantic | >=2.0.0 | Configuration validation |
| pydantic-settings | >=2.0.0 | Pydantic settings integration |
| python-dotenv | >=1.0.0 | .env file support |
API Dependencies (Optional)
Install with pip install serp-scraper[api] for REST API support:
| Package | Version | Purpose |
|---|---|---|
| fastapi | >=0.100.0 | REST API framework |
| uvicorn[standard] | >=0.23.0 | ASGI server |
| passlib[bcrypt] | >=1.7.4 | Password hashing |
| bcrypt | <4.1.0 | Password hashing |
Development Dependencies
| Package | Version | Purpose |
|---|---|---|
| pytest | >=7.0.0 | Testing framework |
| pytest-asyncio | >=0.21.0 | Async test support |
| pytest-cov | >=4.0.0 | Coverage reporting |
| pytest-mock | >=3.10.0 | Mocking utilities |
| pytest-httpserver | >=1.0.0 | HTTP server for testing |
| ruff | >=0.1.0 | Linting |
| mypy | >=1.0.0 | Type checking |
| build | >=1.0.0 | Package building |
| twine | >=4.0.0 | Package publishing |
License
MIT License - see LICENSE file for details.
Author
neuronaline - flashneuron@proton.me
Disclaimer
This software is provided for educational and legitimate purposes only. Users are responsible for ensuring their use complies with search engine Terms of Service and applicable laws. The authors assume no liability for misuse of this software.