
Fast, lightweight Google search scraper with stealth mode


Google Search Scraper

License: MIT · Python 3.8+

A fast, lightweight, and easy-to-use Python package for scraping Google search results with built-in stealth mode to avoid detection.

✨ Features

  • 🚀 Fast: Optimized for speed with minimal overhead
  • 🥷 Stealth Mode: Built-in anti-detection features
  • 🎯 Simple API: Easy to use for both beginners and experts
  • 📦 Zero Config: Playwright browser auto-installs on package installation
  • 🔧 Flexible: Highly configurable with sensible defaults
  • 💻 CLI Support: Use from command line or as a Python module
  • 🎨 Multiple Output Formats: JSON and text output supported
  • 📄 Content Extraction: Extract and analyze full page content from search results
  • 💾 Auto-Save: Automatically save results to file with full content
  • 🧹 Clean Text: Intelligent HTML parsing and text extraction

📦 Installation

pip install google-search-aj

The package will automatically install Playwright and download the Chromium browser during installation.

If automatic installation fails, manually install Playwright browsers:

playwright install chromium

🚀 Quick Start

Python API

from google_search_scraper import search

# Simple search
results = search("python tutorial")
print(results.urls)
# ['https://docs.python.org/3/tutorial/', 'https://www.w3schools.com/python/', ...]

# Access the direct answer (if available)
print(results.answer)
# 'Python is a high-level, general-purpose programming language...'

# Get more details
print(f"Found {results.total_results} results in {results.search_time:.2f} seconds")

# Extract full page content
results = search("machine learning", max_results=3, extract_content=True)
for content in results.contents:
    print(f"{content.title}: {content.word_count} words")
    print(f"Preview: {content.content[:200]}...")

# Auto-save to file with content extraction
results = search(
    "python tutorial", 
    extract_content=True, 
    save_to_file=True,  # Auto-save enabled
    output_file="search_results.txt"  # Custom filename
)
# Results automatically saved with full content!

Command Line

# Simple search
google-search "python tutorial"

# Limit results
google-search "best restaurants" --max-results 20

# Save to file
google-search "machine learning" --output results.txt

# JSON output
google-search "data science" --format json

# Run with visible browser (debugging)
google-search "web scraping" --visible

📖 Usage Examples

Basic Usage

from google_search_scraper import search

# Default: 10 results with answer extraction
results = search("artificial intelligence")

for i, url in enumerate(results.urls, 1):
    print(f"{i}. {url}")

Advanced Usage

from google_search_scraper import GoogleSearchScraper

# Create a scraper with custom settings
scraper = GoogleSearchScraper(
    max_results=20,          # Get more results
    timeout=60000,           # Increase timeout
    headless=False,          # Show browser (for debugging)
    stealth_mode=True,       # Enable anti-detection (default)
    extract_content=True     # Extract page content
)

# Perform search
results = scraper.search("python web scraping", extract_answer=True)

# Access results
print(f"Query: {results.query}")
print(f"Time: {results.search_time:.2f}s")
print(f"Answer: {results.answer}")
print(f"URLs: {len(results.urls)}")
print(f"Content extracted: {len(results.contents)} pages")

# Convert to dictionary
data = results.to_dict()

Content Extraction

from google_search_scraper import search

# Extract content from search results
results = search("machine learning tutorial", max_results=5, extract_content=True)

# Access extracted content
for content in results.contents:
    if not content.error:
        print(f"\nTitle: {content.title}")
        print(f"URL: {content.url}")
        print(f"Word Count: {content.word_count:,}")
        print(f"Content Preview: {content.content[:200]}...")
    else:
        print(f"Failed to extract: {content.url} - {content.error}")

# Auto-save to file
results = search(
    "python web scraping",
    extract_content=True,
    save_to_file=True,  # Auto-save enabled
    output_file="my_search.txt"
)
print("✓ Results saved with full content!")

# Or save manually
results = search("AI tutorial", extract_content=True)
results.save_to_file("ai_results.txt")

Multiple Searches

from google_search_scraper import search

queries = ["python", "javascript", "rust"]

for query in queries:
    results = search(query, max_results=5)
    print(f"\n{query}: {len(results.urls)} results")
    print(results.urls[0] if results.urls else "No results")

Error Handling

from google_search_scraper import search
from google_search_scraper.exceptions import (
    GoogleSearchError,
    RateLimitError,
    BrowserError,
    SearchTimeoutError,
    NoResultsError
)

try:
    results = search("test query", timeout=10000)
except RateLimitError:
    print("Being rate limited by Google. Try again later.")
except SearchTimeoutError:
    print("Search timed out. Try increasing the timeout.")
except BrowserError:
    print("Browser failed to launch.")
except NoResultsError:
    print("No results found for this query.")
except GoogleSearchError as e:
    print(f"Search failed: {e}")

Batch Processing

from google_search_scraper import search
import time

queries = [
    "machine learning",
    "deep learning",
    "neural networks"
]

all_results = []

for query in queries:
    print(f"Searching: {query}")
    results = search(query, max_results=15)
    all_results.append(results)
    
    # Be respectful - add delay between searches
    time.sleep(5)

# Process results
for result in all_results:
    print(f"\n{result.query}:")
    print(f"  - Answer: {result.answer[:100] if result.answer else 'N/A'}")
    print(f"  - URLs: {len(result.urls)}")

🎯 API Reference

Main Functions

search(query, max_results=10, extract_answer=True, extract_content=False, headless=True, timeout=30000, save_to_file=False, output_file="search_results.txt")

Convenience function for quick searches.

Parameters:

  • query (str): Search query string
  • max_results (int): Maximum URLs to return (default: 10)
  • extract_answer (bool): Extract Google's direct answer (default: True)
  • extract_content (bool): Extract page content from URLs (default: False)
  • headless (bool): Run browser in headless mode (default: True)
  • timeout (int): Page load timeout in milliseconds (default: 30000)
  • save_to_file (bool): Automatically save results to file (default: False)
  • output_file (str): Name of the output file (default: search_results.txt)

Returns: SearchResult object

Classes

GoogleSearchScraper

Main scraper class with configurable options.

scraper = GoogleSearchScraper(
    max_results=10,
    timeout=30000,
    headless=True,
    stealth_mode=True,
    user_agent=None,
    extract_content=False
)

Methods:

  • search(query, extract_answer=True): Perform a search

SearchResult

Container for search results.

Attributes:

  • query (str): The search query
  • answer (str | None): Google's direct answer if available
  • urls (List[str]): List of result URLs
  • total_results (int): Number of URLs returned
  • search_time (float): Time taken for search in seconds
  • timestamp (float): Unix timestamp of search
  • contents (List[PageContent]): Extracted page content (if extract_content=True)

Methods:

  • to_dict(): Convert to dictionary
  • save_to_file(filename="search_results.txt"): Save results to text file with full content
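Because to_dict() returns plain Python data, results serialize directly to JSON. A minimal sketch, using a hand-built dictionary whose keys are assumed to mirror the attributes listed above (in real use, data = results.to_dict()):

```python
import json

# Hand-built stand-in for results.to_dict(); the keys are assumed to
# mirror the SearchResult attributes documented above.
data = {
    "query": "python tutorial",
    "answer": None,
    "urls": ["https://docs.python.org/3/tutorial/"],
    "total_results": 1,
    "search_time": 1.42,
    "timestamp": 1730592000.0,
}

payload = json.dumps(data, indent=2)
print(payload)
```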

PageContent

Container for extracted page content.

Attributes:

  • url (str): The page URL
  • title (str | None): Page title
  • content (str): Extracted clean text content
  • word_count (int): Number of words in content
  • error (str | None): Error message if extraction failed
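word_count can be cross-checked against content with a simple whitespace split; this is only an assumption about how the package counts words, not a documented guarantee:

```python
def count_words(text: str) -> int:
    # Naive whitespace tokenization; assumed (not guaranteed) to
    # approximate how PageContent.word_count is computed.
    return len(text.split())

sample = "Machine learning is a field of artificial intelligence."
print(count_words(sample))  # 8
```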

Exceptions

  • GoogleSearchError: Base exception for all errors
  • RateLimitError: Raised when rate limited by Google
  • BrowserError: Raised when browser fails
  • SearchTimeoutError: Raised when search times out
  • NoResultsError: Raised when no results found

🎛️ CLI Reference

usage: google-search [-h] [-n N] [--no-answer] [--visible] [--timeout MS]
                     [-o FILE] [-f {text,json}] [-v] [-q]
                     [query]

positional arguments:
  query                 Search query (if not provided, enters interactive mode)

optional arguments:
  -h, --help            show this help message and exit
  -n N, --max-results N
                        Maximum number of results to return (default: 10)
  --no-answer           Skip extracting Google's direct answer
  --visible             Run browser in visible mode (for debugging)
  --timeout MS          Page load timeout in milliseconds (default: 30000)
  -o FILE, --output FILE
                        Save results to file
  -f {text,json}, --format {text,json}
                        Output format: text or json (default: text)
  -v, --version         show program's version number and exit
  -q, --quiet           Suppress all output except results

⚠️ Important Notes

Rate Limiting

Google may rate limit or block requests if you make too many searches too quickly. To avoid this:

  1. Add delays between searches (5-10 seconds recommended)
  2. Use residential proxies for large-scale scraping
  3. Respect robots.txt and Google's Terms of Service
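The recommended 5-10 second spacing works better with a little randomness, so requests don't arrive at a fixed interval. A small sketch (polite_delay and its jitter range are suggestions, not part of the package):

```python
import random
import time

def polite_delay(low: float = 5.0, high: float = 10.0) -> float:
    """Sleep for a random interval in the recommended 5-10 s window."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay

# Usage between searches:
# for query in queries:
#     results = search(query)
#     polite_delay()
```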

Legal Considerations

Web scraping may have legal implications. This tool is for educational and research purposes. Users are responsible for:

  • Complying with Google's Terms of Service
  • Respecting robots.txt
  • Following applicable laws and regulations
  • Not using for commercial purposes without permission

Detection

While this package includes stealth features, Google continuously updates its detection methods. If you're being blocked:

  1. Increase delays between requests
  2. Use headless=False to run in visible mode
  3. Consider using residential proxies
  4. Rotate user agents
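Tip 4 pairs naturally with the scraper's user_agent parameter. A sketch with a hypothetical pool of agent strings (the strings and the pick_user_agent helper are illustrative, not shipped with the package):

```python
import random

# Illustrative pool; in practice, use current real-browser UA strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]

def pick_user_agent() -> str:
    return random.choice(USER_AGENTS)

# scraper = GoogleSearchScraper(user_agent=pick_user_agent())
```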

🛠️ Development

Setup Development Environment

# Clone the repository
git clone https://github.com/Aaditya17032002/google-search.git
cd google-search

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev]"

# Install Playwright browsers
playwright install chromium

Running Tests

pytest tests/

Code Formatting

black google_search_scraper/
flake8 google_search_scraper/

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Built with Playwright
  • Inspired by the need for a simple, reliable Google search scraper

📧 Support

If you encounter any issues or have questions:

  • Open an issue on GitHub
  • Check existing issues for solutions

🔄 Changelog

v1.0.4 (2024-11-03)

  • Auto-save to file: Save results with full content automatically
  • Manual save with save_to_file() method
  • Comprehensive file format with all data

v1.0.3 (2024-11-03)

  • Content extraction: Extract full page content from URLs
  • BeautifulSoup4 integration for HTML parsing
  • Async content extraction for better performance
  • PageContent dataclass with title, content, word count

v1.0.0 (2024-11-03)

  • Initial release
  • Fast Google search scraping
  • Stealth mode with anti-detection
  • CLI and Python API
  • Auto-installation of Playwright
  • JSON and text output formats

See CHANGELOG.md for full version history.


Made with ❤️ by developers, for developers


