
Fast, lightweight Google search scraper with stealth mode

Project description

Google Search Scraper

PyPI version · License: MIT · Python 3.8+

A fast, lightweight, and easy-to-use Python package for scraping Google search results with built-in stealth mode to avoid detection.

✨ Features

  • 🚀 Fast: Optimized for speed with minimal overhead
  • 🥷 Stealth Mode: Built-in anti-detection features
  • 🎯 Simple API: Easy to use for both beginners and experts
  • 📦 Zero Config: Playwright browser auto-installs on package installation
  • 🔧 Flexible: Highly configurable with sensible defaults
  • 💻 CLI Support: Use from command line or as a Python module
  • 🎨 Multiple Output Formats: JSON and text output supported

📦 Installation

pip install google-search-scraper

The package will automatically install Playwright and download the Chromium browser during installation.

If automatic installation fails, manually install Playwright browsers:

playwright install chromium

🚀 Quick Start

Python API

from google_search_scraper import search

# Simple search
results = search("python tutorial")
print(results.urls)
# ['https://docs.python.org/3/tutorial/', 'https://www.w3schools.com/python/', ...]

# Access the direct answer (if available)
print(results.answer)
# 'Python is a high-level, general-purpose programming language...'

# Get more details
print(f"Found {results.total_results} results in {results.search_time:.2f} seconds")

Command Line

# Simple search
google-search "python tutorial"

# Limit results
google-search "best restaurants" --max-results 20

# Save to file
google-search "machine learning" --output results.txt

# JSON output
google-search "data science" --format json

# Run with visible browser (debugging)
google-search "web scraping" --visible

📖 Usage Examples

Basic Usage

from google_search_scraper import search

# Default: 10 results with answer extraction
results = search("artificial intelligence")

for i, url in enumerate(results.urls, 1):
    print(f"{i}. {url}")

Advanced Usage

from google_search_scraper import GoogleSearchScraper

# Create a scraper with custom settings
scraper = GoogleSearchScraper(
    max_results=20,          # Get more results
    timeout=60000,           # Increase timeout
    headless=False,          # Show browser (for debugging)
    stealth_mode=True        # Enable anti-detection (default)
)

# Perform search
results = scraper.search("python web scraping", extract_answer=True)

# Access results
print(f"Query: {results.query}")
print(f"Time: {results.search_time:.2f}s")
print(f"Answer: {results.answer}")
print(f"URLs: {len(results.urls)}")

# Convert to dictionary
data = results.to_dict()

Multiple Searches

from google_search_scraper import search

queries = ["python", "javascript", "rust"]

for query in queries:
    results = search(query, max_results=5)
    print(f"\n{query}: {len(results.urls)} results")
    print(results.urls[0] if results.urls else "No results")

Error Handling

from google_search_scraper import search
from google_search_scraper.exceptions import (
    GoogleSearchError,
    RateLimitError,
    BrowserError,
    SearchTimeoutError
)

try:
    results = search("test query", timeout=10000)
except RateLimitError:
    print("Being rate limited by Google. Try again later.")
except SearchTimeoutError:
    print("Search timed out. Try increasing the timeout.")
except BrowserError:
    print("Browser failed to launch.")
except GoogleSearchError as e:
    print(f"Search failed: {e}")

Batch Processing

from google_search_scraper import search
import time

queries = [
    "machine learning",
    "deep learning",
    "neural networks"
]

all_results = []

for query in queries:
    print(f"Searching: {query}")
    results = search(query, max_results=15)
    all_results.append(results)
    
    # Be respectful - add delay between searches
    time.sleep(5)

# Process results
for result in all_results:
    print(f"\n{result.query}:")
    print(f"  - Answer: {result.answer[:100] if result.answer else 'N/A'}")
    print(f"  - URLs: {len(result.urls)}")

🎯 API Reference

Main Functions

search(query, max_results=10, extract_answer=True, headless=True, timeout=30000)

Convenience function for quick searches.

Parameters:

  • query (str): Search query string
  • max_results (int): Maximum URLs to return (default: 10)
  • extract_answer (bool): Extract Google's direct answer (default: True)
  • headless (bool): Run browser in headless mode (default: True)
  • timeout (int): Page load timeout in milliseconds (default: 30000)

Returns: SearchResult object
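
For reference, a call that spells out every parameter listed above with its default:

from google_search_scraper import search

# Every documented parameter of search(), shown with its default value
results = search(
    "open source licenses",
    max_results=10,          # maximum URLs to return
    extract_answer=True,     # extract Google's direct answer if present
    headless=True,           # run the browser without a visible window
    timeout=30000,           # page load timeout in milliseconds
)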

Classes

GoogleSearchScraper

Main scraper class with configurable options.

scraper = GoogleSearchScraper(
    max_results=10,
    timeout=30000,
    headless=True,
    stealth_mode=True,
    user_agent=None
)

Methods:

  • search(query, extract_answer=True): Perform a search

SearchResult

Container for search results.

Attributes:

  • query (str): The search query
  • answer (str | None): Google's direct answer if available
  • urls (List[str]): List of result URLs
  • total_results (int): Number of URLs returned
  • search_time (float): Time taken for search in seconds
  • timestamp (float): Unix timestamp of search

Methods:

  • to_dict(): Convert to dictionary
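
A quick sketch that saves a result to disk, assuming to_dict() returns the attributes listed above as plain JSON-serializable values:

import json

from google_search_scraper import search

results = search("static site generators", max_results=5)

# Persist the result; to_dict() is assumed here to mirror the documented
# attributes (query, answer, urls, total_results, search_time, timestamp).
with open("results.json", "w", encoding="utf-8") as f:
    json.dump(results.to_dict(), f, indent=2)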

Exceptions

  • GoogleSearchError: Base exception for all errors
  • RateLimitError: Raised when rate limited by Google
  • BrowserError: Raised when browser fails
  • SearchTimeoutError: Raised when search times out
  • NoResultsError: Raised when no results found

🎛️ CLI Reference

usage: google-search [-h] [-n N] [--no-answer] [--visible] [--timeout MS]
                     [-o FILE] [-f {text,json}] [-v] [-q]
                     [query]

positional arguments:
  query                 Search query (if not provided, enters interactive mode)

optional arguments:
  -h, --help            show this help message and exit
  -n N, --max-results N
                        Maximum number of results to return (default: 10)
  --no-answer           Skip extracting Google's direct answer
  --visible             Run browser in visible mode (for debugging)
  --timeout MS          Page load timeout in milliseconds (default: 30000)
  -o FILE, --output FILE
                        Save results to file
  -f {text,json}, --format {text,json}
                        Output format: text or json (default: text)
  -v, --version         show program's version number and exit
  -q, --quiet           Suppress all output except results

⚠️ Important Notes

Rate Limiting

Google may rate limit or block requests if you make too many searches too quickly. To avoid this:

  1. Add delays between searches (5-10 seconds recommended; see the sketch after this list)
  2. Use residential proxies for large-scale scraping
  3. Respect robots.txt and Google's Terms of Service
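
A minimal sketch of the first point: a randomized 5-10 second pause between queries, plus a single back-off and retry if RateLimitError is raised (the 60-second wait is an illustrative value, not a recommendation):

import random
import time

from google_search_scraper import search
from google_search_scraper.exceptions import RateLimitError

queries = ["flask tutorial", "django tutorial", "fastapi tutorial"]

for query in queries:
    try:
        results = search(query, max_results=10)
    except RateLimitError:
        # Back off once, then retry the same query
        print("Rate limited; waiting before retrying...")
        time.sleep(60)
        results = search(query, max_results=10)
    print(f"{query}: {len(results.urls)} results")
    # Randomized 5-10 second pause between searches, per the advice above
    time.sleep(random.uniform(5, 10))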

Legal Considerations

Web scraping may have legal implications. This tool is for educational and research purposes. Users are responsible for:

  • Complying with Google's Terms of Service
  • Respecting robots.txt
  • Following applicable laws and regulations
  • Not using for commercial purposes without permission

Detection

While this package includes stealth features, Google continuously updates its detection methods. If you're being blocked:

  1. Increase delays between requests
  2. Use headless=False to run in visible mode
  3. Consider using residential proxies
  4. Rotate user agents (see the sketch after this list)
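
A sketch covering points 2 and 4: run the browser visibly and supply a custom user agent via the documented user_agent option (the UA string below is only an example, not a recommended value):

from google_search_scraper import GoogleSearchScraper

# Example UA string only; rotate through your own list in practice
custom_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

scraper = GoogleSearchScraper(
    headless=False,        # visible browser (point 2 above)
    user_agent=custom_ua,  # custom user agent (point 4 above)
    stealth_mode=True,
)

results = scraper.search("web scraping best practices")
print(results.urls)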

🛠️ Development

Setup Development Environment

# Clone the repository
git clone https://github.com/yourusername/google-search-scraper.git
cd google-search-scraper

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev]"

# Install Playwright browsers
playwright install chromium

Running Tests

pytest tests/

Code Formatting

black google_search_scraper/
flake8 google_search_scraper/

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Built with Playwright
  • Inspired by the need for a simple, reliable Google search scraper

📧 Support

If you encounter any issues or have questions:

  • Open an issue on GitHub
  • Check existing issues for solutions

🔄 Changelog

v1.0.0 (2024-11-03)

  • Initial release
  • Fast Google search scraping
  • Stealth mode with anti-detection
  • CLI and Python API
  • Auto-installation of Playwright
  • JSON and text output formats

Made with ❤️ by developers, for developers



Download files

Download the file for your platform.

Source Distribution

google_search_aj-1.0.0.tar.gz (12.6 kB)

Built Distribution

google_search_aj-1.0.0-py3-none-any.whl (11.9 kB)

File details

Details for the file google_search_aj-1.0.0.tar.gz.

File metadata

  • Download URL: google_search_aj-1.0.0.tar.gz
  • Upload date:
  • Size: 12.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.0

File hashes

Hashes for google_search_aj-1.0.0.tar.gz
  • SHA256: 3e99bfc0413c73f50b97a71d30bab96ec4c42cd81245ed2cfd54325ec7c8f2ee
  • MD5: 3feeeac8edd17a8ebb814f8900483809
  • BLAKE2b-256: d9ee890648d8a7abb57d74bc1c7c7aa8e9d205eae781ff0fff549f78a806ec30


File details

Details for the file google_search_aj-1.0.0-py3-none-any.whl.

File metadata

File hashes

Hashes for google_search_aj-1.0.0-py3-none-any.whl
  • SHA256: febd3c71223e222f6652596e2e8ddd152de16d5339a5f7e04a93b970b15013b4
  • MD5: cb28ba3e3dda72fd7a513d19cbf845e2
  • BLAKE2b-256: aa6d9e27bd7d7a57bd473f81a9a475a4f6da52c2f977a59df8b86dec9eebdac4

