Bright Data integration for Haystack - web scraping, SERP API, and data extraction from 43+ websites

Haystack x Bright Data Integration

Integrate Bright Data's powerful web scraping and data extraction capabilities into your Haystack pipelines. This package provides three Haystack components for:

🔍 SERP API - Search engine results from Google, Bing, Yahoo, and more
🌐 Web Unlocker - Access geo-restricted and bot-protected websites
📊 Web Scraper - Extract structured data from 43+ supported websites

Features

Seamless Haystack Integration - Works natively with Haystack 2.0+ pipelines
43+ Supported Datasets - Extract data from Amazon, LinkedIn, Instagram, Facebook, TikTok, YouTube, and more
Geo-Targeting - Access content from specific countries
Anti-Bot Bypass - Automatically handle CAPTCHAs and bot detection
Structured Data - Get clean, structured JSON data ready for RAG pipelines
Async Support - Built-in async support for high-performance applications
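The package's own async interface is not shown in this README, so the following sketch does not use it. It only illustrates a generic way to fan out several blocking `run(...)` calls concurrently with the standard library. `run_many` and `fake_run` are hypothetical names introduced here; `fake_run` stands in for a component's `run` method, and `asyncio.to_thread` needs Python 3.9 or newer.

```python
import asyncio


async def run_many(run, requests, limit=5):
    """Call a blocking run(**kwargs) for each request, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def one(kwargs):
        async with semaphore:
            # to_thread keeps the event loop free while the blocking call runs
            return await asyncio.to_thread(run, **kwargs)

    # gather returns results in the same order as the requests
    return await asyncio.gather(*(one(kwargs) for kwargs in requests))


def fake_run(query, country="us"):
    """Hypothetical stand-in for a component's run method (no network access)."""
    return {"results": f"{query}:{country}"}


outputs = asyncio.run(
    run_many(fake_run, [{"query": "a"}, {"query": "b", "country": "fr"}])
)
print(outputs)
```

In real use, `fake_run` would be replaced by something like `serp.run`, and `limit` would be chosen to respect whatever rate limits apply to your account.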

Installation

pip install haystack-brightdata

Quick Start

Prerequisites

Get your Bright Data API key from https://brightdata.com/cp/api_access
Set the environment variable:

export BRIGHT_DATA_API_KEY="your-api-key-here"
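Because the components read the key from `BRIGHT_DATA_API_KEY` by default, it can help to fail early with a clear message when the variable is missing. The helper below is a minimal sketch using only the standard library; `require_api_key` is a hypothetical name and not part of the package.

```python
import os


def require_api_key(env=None, name="BRIGHT_DATA_API_KEY"):
    """Return the API key from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before creating components")
    return key
```

The returned value could then be passed explicitly through the `bright_data_api_key` parameter listed under Components, for example `BrightDataSERP(bright_data_api_key=require_api_key())`.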

Example 1: SERP Search

from haystack_brightdata import BrightDataSERP

# Initialize the component
serp = BrightDataSERP()

# Execute a search
result = serp.run(
    query="Haystack AI framework tutorials",
    num_results=10,
    country="us"
)

print(result["results"])  # JSON string of parsed results

Example 2: Web Unlocker

from haystack_brightdata import BrightDataUnlocker

# Initialize the component
unlocker = BrightDataUnlocker()

# Access a restricted website
result = unlocker.run(
    url="https://example.com",
    country="gb",
    output_format="markdown"
)

print(result["content"])  # Clean markdown content

Example 3: Web Scraper

from haystack_brightdata import BrightDataWebScraper

# Initialize the component
scraper = BrightDataWebScraper()

# Extract Amazon product data
result = scraper.run(
    dataset="amazon_product",
    url="https://www.amazon.com/dp/B08N5WRWNW"
)

print(result["data"])  # Structured data as a JSON string

Example 4: In a Haystack Pipeline

from haystack import Pipeline
from haystack_brightdata import BrightDataSERP

# Create a pipeline
pipeline = Pipeline()
pipeline.add_component("search", BrightDataSERP())

# Run the pipeline
result = pipeline.run({
    "search": {
        "query": "Python web scraping",
        "num_results": 20
    }
})

print(result["search"]["results"])

Components

BrightDataSERP

Execute search queries across multiple search engines with geo-targeting and result parsing.

Parameters:

bright_data_api_key (Optional[str]): API key (defaults to BRIGHT_DATA_API_KEY env var)
zone (str): Bright Data zone name (default: "serp")
default_search_engine (str): Default search engine (default: "google")
default_country (str): Default country code (default: "us")
default_language (str): Default language code (default: "en")
default_num_results (int): Default number of results (default: 10)

Outputs:

results (str): Search results as JSON string (when parse_results=True, default) or raw HTML
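Since `results` is a string that holds either JSON or raw HTML, downstream code usually needs to decode it first. The sketch below shows one way to do that with the standard library. `load_results` is a hypothetical helper, and the sample payload (including its `organic` key) is made up for illustration; the real response schema is not documented here.

```python
import json


def load_results(raw):
    """Decode the `results` string; return None if it is not JSON (e.g. raw HTML)."""
    try:
        return json.loads(raw)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError
        return None


# Made-up payload for illustration only; the real schema may differ.
sample = '{"organic": [{"title": "Example", "link": "https://example.com"}]}'
parsed = load_results(sample)
print(parsed["organic"][0]["title"])
```

With a real component, the call would be `load_results(result["results"])`.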

BrightDataUnlocker

Access geo-restricted and bot-protected websites with automatic CAPTCHA solving.

Parameters:

bright_data_api_key (Optional[str]): API key (defaults to BRIGHT_DATA_API_KEY env var)
zone (str): Bright Data zone name (default: "unlocker")
default_country (str): Default country code (default: "us")
default_output_format (str): Default output format - html, markdown, or screenshot (default: "html")

Outputs:

content (str): Web page content in the specified format

BrightDataWebScraper

Extract structured data from 43+ supported websites.

Parameters:

bright_data_api_key (Optional[str]): API key (defaults to BRIGHT_DATA_API_KEY env var)
default_include_errors (bool): Include errors in output (default: False)

Outputs:

data (str): Structured data as JSON string

Helper Methods:

# Get all supported datasets
datasets = BrightDataWebScraper.get_supported_datasets()

# Get info about a specific dataset
info = BrightDataWebScraper.get_dataset_info("amazon_product")

Supported Datasets (43+)

E-commerce (10)

Amazon: Products, Reviews, Search, Bestsellers
Walmart: Products, Seller
eBay, Home Depot, Zara, Etsy, Best Buy

LinkedIn (5)

Person Profile, Company Profile, Job Listings, Posts, People Search

Social Media (16)

Instagram: Profiles, Posts, Reels, Comments
Facebook: Posts, Marketplace, Company Reviews, Events
TikTok: Profiles, Posts, Shop, Comments
YouTube: Profiles, Videos, Comments
X/Twitter: Posts
Reddit: Posts

Business Intelligence (2)

Crunchbase, ZoomInfo

Search & Commerce (6)

Google Maps Reviews, Google Shopping, Google Play Store
Apple App Store, Zillow, Booking.com

Other (5)

GitHub, Yahoo Finance, Reuters

See full dataset list

Advanced Usage

Custom Zone Configuration

serp = BrightDataSERP(zone="my_custom_serp_zone")

Geo-Targeted Search

result = serp.run(
    query="local restaurants",
    country="fr",  # France
    language="fr",
    num_results=20
)

Multi-Format Web Unlocker

# Get as markdown
markdown = unlocker.run(url="https://example.com", output_format="markdown")

# Get as screenshot
screenshot = unlocker.run(url="https://example.com", output_format="screenshot")
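When the output format comes from user input or configuration, a small guard can catch typos before a request is sent. This is a minimal sketch based on the three formats listed for `default_output_format`; `normalize_output_format` is a hypothetical helper, and whether the component itself validates this value is not stated here.

```python
# The three formats documented for BrightDataUnlocker
ALLOWED_OUTPUT_FORMATS = ("html", "markdown", "screenshot")


def normalize_output_format(value="html"):
    """Lowercase and validate an output format name."""
    fmt = value.strip().lower()
    if fmt not in ALLOWED_OUTPUT_FORMATS:
        allowed = ", ".join(ALLOWED_OUTPUT_FORMATS)
        raise ValueError(f"unsupported output format {value!r}; expected one of: {allowed}")
    return fmt
```

It could be used as `unlocker.run(url=..., output_format=normalize_output_format(user_choice))`.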

Dataset-Specific Parameters

# LinkedIn people search
result = scraper.run(
    dataset="linkedin_people_search",
    url="https://www.linkedin.com",
    first_name="John",
    last_name="Doe"
)

# Google Maps reviews (last 7 days)
result = scraper.run(
    dataset="google_maps_reviews",
    url="https://www.google.com/maps/place/...",
    days_limit="7"
)
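Dataset-specific parameters vary from one dataset to another, so it can be convenient to assemble the keyword arguments in one place and drop any that were not provided. The helper below is a sketch under that assumption; `build_scraper_kwargs` is a hypothetical name, and it does not check which extra parameters a given dataset actually accepts.

```python
def build_scraper_kwargs(dataset, url, **extra):
    """Assemble keyword arguments for scraper.run(), skipping unset extras."""
    if not dataset or not url:
        raise ValueError("dataset and url are both required")
    kwargs = {"dataset": dataset, "url": url}
    # Keep only the dataset-specific extras that were actually given
    kwargs.update({key: value for key, value in extra.items() if value is not None})
    return kwargs


kwargs = build_scraper_kwargs(
    "linkedin_people_search",
    "https://www.linkedin.com",
    first_name="John",
    last_name="Doe",
    days_limit=None,  # unset, so it is left out
)
print(kwargs)
```

The result would then be passed along as `scraper.run(**kwargs)`.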

Environment Variables

BRIGHT_DATA_API_KEY - Your Bright Data API key (required)
REQUESTS_CA_BUNDLE - Custom CA bundle for corporate proxies (optional)
SSL_CERT_FILE - Alternative SSL certificate file (optional)

Requirements

Python >= 3.8
haystack-ai >= 2.0.0
pydantic >= 2.0.0
requests >= 2.28.0
aiohttp >= 3.8.0

Examples

Check out the examples directory for more detailed examples:

example_serp.py - SERP API examples
example_unlocker.py - Web Unlocker examples
example_scraper.py - Web Scraper examples
example_pipeline.py - Pipeline integration examples

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Support

Issues: GitHub Issues
Bright Data Support: support@brightdata.com
Haystack Community: Haystack Discord

Acknowledgments

Built for Haystack by deepset
Powered by Bright Data

Note: You need a valid Bright Data subscription to use this package. Get started at brightdata.com.

Project details

Maintainers

Meirk

Release history

This version

0.1.0

Jan 4, 2026

Download files

Source Distribution

haystack_brightdata-0.1.0.tar.gz (24.6 kB)

Uploaded Jan 4, 2026 Source

Built Distribution

haystack_brightdata-0.1.0-py3-none-any.whl (22.4 kB)

Uploaded Jan 4, 2026 Python 3

File details

Details for the file haystack_brightdata-0.1.0.tar.gz.

File metadata

Download URL: haystack_brightdata-0.1.0.tar.gz
Upload date: Jan 4, 2026
Size: 24.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for haystack_brightdata-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`089dc129fb25252ea72d5605f626663a950732c4617ff917ac6d76a29f5eeeb9`
MD5	`a533a6966ce1d0f726378686ea3196f7`
BLAKE2b-256	`180b1d18b088c476c050790665f13ccc07d584b3ba8c75cff34ed310b7d0bd3c`
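A downloaded file can be checked against the published SHA256 digest before installing it. The sketch below uses only `hashlib`; `sha256_of` and `matches_published_hash` are hypothetical helper names, and the expected value is the SHA256 digest listed above for the source distribution.

```python
import hashlib

# SHA256 digest published above for haystack_brightdata-0.1.0.tar.gz
EXPECTED_SDIST_SHA256 = "089dc129fb25252ea72d5605f626663a950732c4617ff917ac6d76a29f5eeeb9"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Example call, assuming the file was downloaded to the current directory:
# matches_published_hash("haystack_brightdata-0.1.0.tar.gz", EXPECTED_SDIST_SHA256)
```

As an alternative, pip's hash-checking mode can enforce the same digest at install time when it is pinned in a requirements file.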

See more details on using hashes here.

Provenance

The following attestation bundles were made for haystack_brightdata-0.1.0.tar.gz:

Publisher: publish.yml on brightdata/haystack-brightdata

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: haystack_brightdata-0.1.0.tar.gz
- Subject digest: 089dc129fb25252ea72d5605f626663a950732c4617ff917ac6d76a29f5eeeb9
- Sigstore transparency entry: 790568235
- Sigstore integration time: Jan 4, 2026
Source repository:
- Permalink: brightdata/haystack-brightdata@f2149a26f88fe1ac83a8bdb6423c4ae53c0d3550
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/brightdata
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@f2149a26f88fe1ac83a8bdb6423c4ae53c0d3550
- Trigger Event: release

File details

Details for the file haystack_brightdata-0.1.0-py3-none-any.whl.

File metadata

Download URL: haystack_brightdata-0.1.0-py3-none-any.whl
Upload date: Jan 4, 2026
Size: 22.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for haystack_brightdata-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a27fbf04295a598668e2ad3bc9258e59280c3ceead725326814795fc2def879d`
MD5	`f688d12d963b180faa7aad59bda4594b`
BLAKE2b-256	`c798d5d97f2bbcf3a0d78c707fd88b7228e21d9169194ba9121597cc83ff05e6`

Provenance

The following attestation bundles were made for haystack_brightdata-0.1.0-py3-none-any.whl:

Publisher: publish.yml on brightdata/haystack-brightdata

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: haystack_brightdata-0.1.0-py3-none-any.whl
- Subject digest: a27fbf04295a598668e2ad3bc9258e59280c3ceead725326814795fc2def879d
- Sigstore transparency entry: 790568237
- Sigstore integration time: Jan 4, 2026
Source repository:
- Permalink: brightdata/haystack-brightdata@f2149a26f88fe1ac83a8bdb6423c4ae53c0d3550
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/brightdata
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@f2149a26f88fe1ac83a8bdb6423c4ae53c0d3550
- Trigger Event: release
