Bright Data integration for Haystack - web scraping, SERP API, and data extraction from 43+ websites

Haystack x Bright Data Integration

Integrate Bright Data's powerful web scraping and data extraction capabilities into your Haystack pipelines. This package provides three Haystack components for:

  • 🔍 SERP API - Search engine results from Google, Bing, Yahoo, and more
  • 🌐 Web Unlocker - Access geo-restricted and bot-protected websites
  • 📊 Web Scraper - Extract structured data from 43+ supported websites

Features

  • Seamless Haystack Integration - Works natively with Haystack 2.0+ pipelines
  • 43+ Supported Datasets - Extract data from Amazon, LinkedIn, Instagram, Facebook, TikTok, YouTube, and more
  • Geo-Targeting - Access content from specific countries
  • Anti-Bot Bypass - Automatically handle CAPTCHAs and bot detection
  • Structured Data - Get clean, structured JSON data ready for RAG pipelines
  • Async Support - Built-in async support for high-performance applications

Installation

pip install haystack-brightdata

Quick Start

Prerequisites

  1. Get your Bright Data API key from https://brightdata.com/cp/api_access
  2. Set the environment variable:
export BRIGHT_DATA_API_KEY="your-api-key-here"
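
Because the components read the key from BRIGHT_DATA_API_KEY by default, it can help to fail fast when the variable is missing. The following is a minimal sketch using only the standard library; `require_api_key` is a hypothetical helper written for illustration, not part of haystack-brightdata.

```python
import os


def require_api_key(env=None, name="BRIGHT_DATA_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; not part of haystack-brightdata.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before creating components")
    return key
```

Calling `require_api_key()` at application start-up surfaces a missing key before any request is attempted.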

Example 1: SERP Search

from haystack_brightdata import BrightDataSERP

# Initialize the component
serp = BrightDataSERP()

# Execute a search
result = serp.run(
    query="Haystack AI framework tutorials",
    num_results=10,
    country="us"
)

print(result["results"])  # JSON string of parsed results

Example 2: Web Unlocker

from haystack_brightdata import BrightDataUnlocker

# Initialize the component
unlocker = BrightDataUnlocker()

# Access a restricted website
result = unlocker.run(
    url="https://example.com",
    country="gb",
    output_format="markdown"
)

print(result["content"])  # Clean markdown content

Example 3: Web Scraper

from haystack_brightdata import BrightDataWebScraper

# Initialize the component
scraper = BrightDataWebScraper()

# Extract Amazon product data
result = scraper.run(
    dataset="amazon_product",
    url="https://www.amazon.com/dp/B08N5WRWNW"
)

print(result["data"])  # Structured JSON data

Example 4: In a Haystack Pipeline

from haystack import Pipeline
from haystack_brightdata import BrightDataSERP

# Create a pipeline
pipeline = Pipeline()
pipeline.add_component("search", BrightDataSERP())

# Run the pipeline
result = pipeline.run({
    "search": {
        "query": "Python web scraping",
        "num_results": 20
    }
})

print(result["search"]["results"])

Components

BrightDataSERP

Execute search queries across multiple search engines with geo-targeting and result parsing.

Parameters:

  • bright_data_api_key (Optional[str]): API key (defaults to BRIGHT_DATA_API_KEY env var)
  • zone (str): Bright Data zone name (default: "serp")
  • default_search_engine (str): Default search engine (default: "google")
  • default_country (str): Default country code (default: "us")
  • default_language (str): Default language code (default: "en")
  • default_num_results (int): Default number of results (default: 10)

Outputs:

  • results (str): Search results as JSON string (when parse_results=True, default) or raw HTML
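
Since `results` is a string that holds either JSON or raw HTML, downstream code usually has to decode it first. The sketch below shows one way to do that with the standard library. `decode_results` is a hypothetical helper, and the `"organic"` key in the usage line is a made-up field name, not a documented part of the response.

```python
import json


def decode_results(raw):
    """Return parsed JSON when `raw` is valid JSON, otherwise return `raw` unchanged.

    Hypothetical helper for illustration; raw HTML falls through untouched.
    """
    try:
        return json.loads(raw)
    except (TypeError, ValueError):
        # json.JSONDecodeError subclasses ValueError, so invalid JSON lands here.
        return raw


# Usage with a made-up payload (the "organic" key is an assumption):
parsed = decode_results('{"organic": []}')
```

With a real component the call would be `decode_results(result["results"])`.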

BrightDataUnlocker

Access geo-restricted and bot-protected websites with automatic CAPTCHA solving.

Parameters:

  • bright_data_api_key (Optional[str]): API key (defaults to BRIGHT_DATA_API_KEY env var)
  • zone (str): Bright Data zone name (default: "unlocker")
  • default_country (str): Default country code (default: "us")
  • default_output_format (str): Default output format - html, markdown, or screenshot (default: "html")

Outputs:

  • content (str): Web page content in the specified format
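
The three output formats listed above can be checked before a request is sent. This is a small sketch under the assumption that validating on the caller's side is useful; whether the component validates the value itself is not stated here. `check_output_format` is a hypothetical helper.

```python
# Formats taken from the parameter list above.
ALLOWED_FORMATS = ("html", "markdown", "screenshot")


def check_output_format(value):
    """Normalize an output format name and reject unsupported values.

    Hypothetical helper for illustration; not part of haystack-brightdata.
    """
    normalized = value.strip().lower()
    if normalized not in ALLOWED_FORMATS:
        raise ValueError(
            f"unsupported output_format {value!r}; expected one of {ALLOWED_FORMATS}"
        )
    return normalized
```

The returned value can then be passed as `output_format` to `unlocker.run(...)`.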

BrightDataWebScraper

Extract structured data from 43+ supported websites.

Parameters:

  • bright_data_api_key (Optional[str]): API key (defaults to BRIGHT_DATA_API_KEY env var)
  • default_include_errors (bool): Include errors in output (default: False)

Outputs:

  • data (str): Structured data as JSON string
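
Because `data` arrives as a JSON string, a typical first step is to turn it into a list of records. The sketch below assumes the payload may be either a single object or an array of objects; that shape, and the `"title"` field in the usage line, are assumptions rather than documented behavior. `to_records` is a hypothetical helper.

```python
import json


def to_records(data):
    """Parse a JSON string and always return a list of dict records.

    Hypothetical helper for illustration; the payload shape is an assumption.
    """
    parsed = json.loads(data)
    if isinstance(parsed, dict):
        return [parsed]
    if isinstance(parsed, list):
        # Keep only object entries; skip scalars or nested lists.
        return [item for item in parsed if isinstance(item, dict)]
    return []


# Usage with a made-up payload (the "title" field is an assumption):
records = to_records('[{"title": "Example product"}]')
```

With a real component the call would be `to_records(result["data"])`.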

Helper Methods:

# Get all supported datasets
datasets = BrightDataWebScraper.get_supported_datasets()

# Get info about a specific dataset
info = BrightDataWebScraper.get_dataset_info("amazon_product")

Supported Datasets (43+)

E-commerce (10)

  • Amazon: Products, Reviews, Search, Bestsellers
  • Walmart: Products, Seller
  • eBay, Home Depot, Zara, Etsy, Best Buy

LinkedIn (5)

  • Person Profile, Company Profile, Job Listings, Posts, People Search

Social Media (16)

  • Instagram: Profiles, Posts, Reels, Comments
  • Facebook: Posts, Marketplace, Company Reviews, Events
  • TikTok: Profiles, Posts, Shop, Comments
  • YouTube: Profiles, Videos, Comments
  • X/Twitter: Posts
  • Reddit: Posts

Business Intelligence (2)

  • Crunchbase, ZoomInfo

Search & Commerce (6)

  • Google Maps Reviews, Google Shopping, Google Play Store
  • Apple App Store, Zillow, Booking.com

Other (5)

  • GitHub, Yahoo Finance, Reuters

See full dataset list

Advanced Usage

Custom Zone Configuration

from haystack_brightdata import BrightDataSERP
serp = BrightDataSERP(zone="my_custom_serp_zone")

Geo-Targeted Search

result = serp.run(
    query="local restaurants",
    country="fr",  # France
    language="fr",
    num_results=20
)

Multi-Format Web Unlocker

from haystack_brightdata import BrightDataUnlocker

unlocker = BrightDataUnlocker()

# Get as markdown
markdown = unlocker.run(url="https://example.com", output_format="markdown")

# Get as screenshot
screenshot = unlocker.run(url="https://example.com", output_format="screenshot")

Dataset-Specific Parameters

from haystack_brightdata import BrightDataWebScraper

scraper = BrightDataWebScraper()

# LinkedIn people search
result = scraper.run(
    dataset="linkedin_people_search",
    url="https://www.linkedin.com",
    first_name="John",
    last_name="Doe"
)

# Google Maps reviews (last 7 days)
result = scraper.run(
    dataset="google_maps_reviews",
    url="https://www.google.com/maps/place/...",
    days_limit="7"
)
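
When dataset-specific parameters come from user input or configuration, some of them may be unset. One possible pattern, sketched below, is to assemble the keyword arguments first and drop the missing ones before calling `run`. `build_scraper_args` is a hypothetical helper, and the assumption that unset parameters should be omitted rather than passed as `None` is mine, not the package's.

```python
def build_scraper_args(dataset, url, **extra):
    """Collect keyword arguments for scraper.run(), skipping values that are None.

    Hypothetical helper for illustration; not part of haystack-brightdata.
    """
    args = {"dataset": dataset, "url": url}
    args.update({key: value for key, value in extra.items() if value is not None})
    return args


args = build_scraper_args(
    "linkedin_people_search",
    "https://www.linkedin.com",
    first_name="John",
    last_name="Doe",
    days_limit=None,  # unset, so it is left out of the call
)
# result = scraper.run(**args)
```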

Environment Variables

  • BRIGHT_DATA_API_KEY - Your Bright Data API key (required)
  • REQUESTS_CA_BUNDLE - Custom CA bundle for corporate proxies (optional)
  • SSL_CERT_FILE - Alternative SSL certificate file (optional)

Requirements

  • Python >= 3.8
  • haystack-ai >= 2.0.0
  • pydantic >= 2.0.0
  • requests >= 2.28.0
  • aiohttp >= 3.8.0
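
To see which of these requirements are present in the current environment, the standard library's `importlib.metadata` can report installed versions. The sketch below only lists versions and checks the Python floor; it does not compare package versions against the minimums above. The helper names are hypothetical.

```python
import sys
from importlib import metadata

# Distribution names taken from the requirements list above.
REQUIRED = ("haystack-ai", "pydantic", "requests", "aiohttp")


def python_ok(minimum=(3, 8)):
    """Return True when the running interpreter meets the minimum version."""
    return sys.version_info >= minimum


def installed_versions(packages=REQUIRED):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found
```

Printing `installed_versions()` gives a quick overview before debugging import errors.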

Examples

Check out the examples directory for more detailed examples:

  • example_serp.py - SERP API examples
  • example_unlocker.py - Web Unlocker examples
  • example_scraper.py - Web Scraper examples
  • example_pipeline.py - Pipeline integration examples

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Note: You need a valid Bright Data subscription to use this package. Get started at brightdata.com.
