Bright Data integration for Haystack - web scraping, SERP API, and data extraction from 43+ websites
Haystack x Bright Data Integration
Integrate Bright Data's powerful web scraping and data extraction capabilities into your Haystack pipelines. This package provides three Haystack components for:
- 🔍 SERP API - Search engine results from Google, Bing, Yahoo, and more
- 🌐 Web Unlocker - Access geo-restricted and bot-protected websites
- 📊 Web Scraper - Extract structured data from 43+ supported websites
Features
- Seamless Haystack Integration - Works natively with Haystack 2.0+ pipelines
- 43+ Supported Datasets - Extract data from Amazon, LinkedIn, Instagram, Facebook, TikTok, YouTube, and more
- Geo-Targeting - Access content from specific countries
- Anti-Bot Bypass - Automatically handle CAPTCHAs and bot detection
- Structured Data - Get clean, structured JSON data ready for RAG pipelines
- Async Support - Built-in async support for high-performance applications
Installation
pip install haystack-brightdata
Quick Start
Prerequisites
- Get your Bright Data API key from https://brightdata.com/cp/api_access
- Set the environment variable:
export BRIGHT_DATA_API_KEY="your-api-key-here"
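A missing key is easier to diagnose if it is checked before any component is created. The following is a minimal sketch using only the standard library; `require_api_key` is a hypothetical helper and not part of this package.

```python
import os


def require_api_key(env_var: str = "BRIGHT_DATA_API_KEY") -> str:
    """Return the API key from the environment or raise a clear error."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it before creating any component."
        )
    return key
```

Calling `require_api_key()` at application start-up fails fast with a readable message instead of an error from a later request.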
Example 1: SERP Search
from haystack_brightdata import BrightDataSERP
# Initialize the component
serp = BrightDataSERP()
# Execute a search
result = serp.run(
    query="Haystack AI framework tutorials",
    num_results=10,
    country="us"
)
print(result["results"]) # Parsed JSON results
Example 2: Web Unlocker
from haystack_brightdata import BrightDataUnlocker
# Initialize the component
unlocker = BrightDataUnlocker()
# Access a restricted website
result = unlocker.run(
    url="https://example.com",
    country="gb",
    output_format="markdown"
)
print(result["content"]) # Clean markdown content
Example 3: Web Scraper
from haystack_brightdata import BrightDataWebScraper
# Initialize the component
scraper = BrightDataWebScraper()
# Extract Amazon product data
result = scraper.run(
    dataset="amazon_product",
    url="https://www.amazon.com/dp/B08N5WRWNW"
)
print(result["data"]) # Structured JSON data
Example 4: In a Haystack Pipeline
from haystack import Pipeline
from haystack_brightdata import BrightDataSERP
# Create a pipeline
pipeline = Pipeline()
pipeline.add_component("search", BrightDataSERP())
# Run the pipeline
result = pipeline.run({
    "search": {
        "query": "Python web scraping",
        "num_results": 20
    }
})
print(result["search"]["results"])
Components
BrightDataSERP
Execute search queries across multiple search engines with geo-targeting and result parsing.
Parameters:
- bright_data_api_key (Optional[str]): API key (defaults to the BRIGHT_DATA_API_KEY env var)
- zone (str): Bright Data zone name (default: "serp")
- default_search_engine (str): Default search engine (default: "google")
- default_country (str): Default country code (default: "us")
- default_language (str): Default language code (default: "en")
- default_num_results (int): Default number of results (default: 10)
Outputs:
- results (str): Search results as a JSON string (when parse_results=True, the default) or raw HTML
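Because `results` is a string that may hold either JSON or raw HTML, downstream code has to decode it before use. A minimal sketch of that step follows; `parse_serp_results` is a hypothetical helper, and the sample payloads (including the `organic` key) are illustrative assumptions rather than the documented response shape.

```python
import json
from typing import Any, Optional


def parse_serp_results(raw: str) -> Optional[Any]:
    """Decode the `results` string; return None when it is not JSON (e.g. raw HTML)."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


# Hypothetical payloads for illustration only; the real response shape may differ.
sample_json = '{"organic": [{"title": "Haystack", "link": "https://haystack.deepset.ai"}]}'
sample_html = "<html><body>results</body></html>"

parsed = parse_serp_results(sample_json)
print(parsed["organic"][0]["title"])
print(parse_serp_results(sample_html))
```

Returning `None` for non-JSON input lets the caller decide whether to fall back to an HTML parser.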
BrightDataUnlocker
Access geo-restricted and bot-protected websites with automatic CAPTCHA solving.
Parameters:
- bright_data_api_key (Optional[str]): API key (defaults to the BRIGHT_DATA_API_KEY env var)
- zone (str): Bright Data zone name (default: "unlocker")
- default_country (str): Default country code (default: "us")
- default_output_format (str): Default output format: html, markdown, or screenshot (default: "html")
Outputs:
- content (str): Web page content in the specified format
BrightDataWebScraper
Extract structured data from 43+ supported websites.
Parameters:
- bright_data_api_key (Optional[str]): API key (defaults to the BRIGHT_DATA_API_KEY env var)
- default_include_errors (bool): Include errors in output (default: False)
Outputs:
- data (str): Structured data as a JSON string
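Since `data` arrives as a JSON string, a RAG pipeline typically needs it decoded and turned into text first. The sketch below shows one way to do that with the standard library; `to_records` and `record_to_text` are hypothetical helpers, and the sample payload is an assumption because real field names depend on the dataset.

```python
import json
from typing import Any, Dict, List


def to_records(data: str) -> List[Dict[str, Any]]:
    """Decode the `data` JSON string into a list of record dicts."""
    parsed = json.loads(data)
    if isinstance(parsed, dict):
        return [parsed]
    if isinstance(parsed, list):
        return [item for item in parsed if isinstance(item, dict)]
    raise ValueError("expected a JSON object or an array of objects")


def record_to_text(record: Dict[str, Any]) -> str:
    """Render one record as 'key: value' lines, skipping empty fields."""
    return "\n".join(
        f"{key}: {value}"
        for key, value in record.items()
        if value not in (None, "", [], {})
    )


# Hypothetical payload for illustration; real field names depend on the dataset.
sample = '[{"title": "Echo Dot", "price": 49.99, "brand": ""}]'
for rec in to_records(sample):
    print(record_to_text(rec))
```

Each rendered string could then be used as the content of a Haystack document, with the original record kept as metadata.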
Helper Methods:
# Get all supported datasets
datasets = BrightDataWebScraper.get_supported_datasets()
# Get info about a specific dataset
info = BrightDataWebScraper.get_dataset_info("amazon_product")
Supported Datasets (43+)
E-commerce (10)
- Amazon: Products, Reviews, Search, Bestsellers
- Walmart: Products, Seller
- eBay, Home Depot, Zara, Etsy, Best Buy
LinkedIn (5)
- Person Profile, Company Profile, Job Listings, Posts, People Search
Social Media (16)
- Instagram: Profiles, Posts, Reels, Comments
- Facebook: Posts, Marketplace, Company Reviews, Events
- TikTok: Profiles, Posts, Shop, Comments
- YouTube: Profiles, Videos, Comments
- X/Twitter: Posts
- Reddit: Posts
Business Intelligence (2)
- Crunchbase, ZoomInfo
Search & Commerce (6)
- Google Maps Reviews, Google Shopping, Google Play Store
- Apple App Store, Zillow, Booking.com
Other (5)
- GitHub, Yahoo Finance, Reuters
Advanced Usage
Custom Zone Configuration
serp = BrightDataSERP(zone="my_custom_serp_zone")
Geo-Targeted Search
result = serp.run(
    query="local restaurants",
    country="fr",  # France
    language="fr",
    num_results=20
)
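To compare results across several countries, the same query can be issued once per target. The sketch below runs those calls in a small thread pool. It uses a stand-in class with the same `run()` keywords as the example above so that it executes without network access; whether the real component is safe to share across threads is an assumption that should be verified first.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Tuple


class StandInSERP:
    """Stand-in with the same run() keywords as the README example; no network calls."""

    def run(self, query: str, country: str = "us", language: str = "en",
            num_results: int = 10) -> Dict[str, str]:
        payload = {"query": query, "country": country,
                   "language": language, "num_results": num_results}
        return {"results": json.dumps(payload)}


def search_by_country(serp, query: str, targets: Dict[str, str],
                      num_results: int = 20) -> Dict[str, str]:
    """Run one search per (country, language) pair in a small thread pool."""

    def one(item: Tuple[str, str]) -> Tuple[str, str]:
        country, language = item
        out = serp.run(query=query, country=country,
                       language=language, num_results=num_results)
        return country, out["results"]

    with ThreadPoolExecutor(max_workers=4) as pool:
        return dict(pool.map(one, targets.items()))


results = search_by_country(StandInSERP(), "local restaurants",
                            {"fr": "fr", "de": "de", "gb": "en"})
print(sorted(results))
```

Swapping `StandInSERP()` for a real `BrightDataSERP()` instance should work as long as its `run()` accepts the keywords shown in this README.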
Multi-Format Web Unlocker
# Get as markdown
markdown = unlocker.run(url="https://example.com", output_format="markdown")
# Get as screenshot
screenshot = unlocker.run(url="https://example.com", output_format="screenshot")
Dataset-Specific Parameters
# LinkedIn people search
result = scraper.run(
    dataset="linkedin_people_search",
    url="https://www.linkedin.com",
    first_name="John",
    last_name="Doe"
)
# Google Maps reviews (last 7 days)
result = scraper.run(
    dataset="google_maps_reviews",
    url="https://www.google.com/maps/place/...",
    days_limit="7"
)
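When dataset names come from user input or configuration, it can help to check the extra keywords before calling `scraper.run`. The sketch below is illustrative only: `EXPECTED_EXTRAS` and `build_scraper_kwargs` are hypothetical, the table covers just the two datasets shown above, and treating those keywords as required is an assumption rather than the package's own schema.

```python
from typing import Any, Dict

# Extra keywords per dataset, taken only from the two examples above.
# Illustrative table, not the package's own schema.
EXPECTED_EXTRAS = {
    "linkedin_people_search": ("first_name", "last_name"),
    "google_maps_reviews": ("days_limit",),
}


def build_scraper_kwargs(dataset: str, url: str, **extras: Any) -> Dict[str, Any]:
    """Return keyword arguments for scraper.run, failing early on missing extras."""
    missing = [name for name in EXPECTED_EXTRAS.get(dataset, ()) if name not in extras]
    if missing:
        raise ValueError(f"{dataset} is missing: {', '.join(missing)}")
    return {"dataset": dataset, "url": url, **extras}


kwargs = build_scraper_kwargs(
    "linkedin_people_search",
    "https://www.linkedin.com",
    first_name="John",
    last_name="Doe",
)
print(kwargs)
```

The resulting dictionary can be passed on as `scraper.run(**kwargs)`.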
Environment Variables
- BRIGHT_DATA_API_KEY - Your Bright Data API key (required)
- REQUESTS_CA_BUNDLE - Custom CA bundle for corporate proxies (optional)
- SSL_CERT_FILE - Alternative SSL certificate file (optional)
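When certificate errors appear behind a corporate proxy, a quick check of which CA bundle variable is set can narrow the problem down. The sketch below is a hypothetical diagnostic helper; the order in which it checks the two variables is an assumption and may not match the order the package or its HTTP libraries use.

```python
import os
from typing import Optional


def resolve_ca_bundle() -> Optional[str]:
    """Return the first CA bundle path set via REQUESTS_CA_BUNDLE or SSL_CERT_FILE."""
    for name in ("REQUESTS_CA_BUNDLE", "SSL_CERT_FILE"):
        path = os.environ.get(name)
        if path:
            return path
    return None


bundle = resolve_ca_bundle()
print(bundle if bundle else "no custom CA bundle configured")
```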
Requirements
- Python >= 3.8
- haystack-ai >= 2.0.0
- pydantic >= 2.0.0
- requests >= 2.28.0
- aiohttp >= 3.8.0
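To see which of these requirements are present in the current environment, the standard library's `importlib.metadata` can report installed versions. This is a minimal sketch; `installed_versions` is a hypothetical helper that only reports versions and does not compare them against the minimums listed above.

```python
import sys
from importlib import metadata
from typing import Dict, Optional

REQUIRED_PACKAGES = ("haystack-ai", "pydantic", "requests", "aiohttp")


def installed_versions() -> Dict[str, Optional[str]]:
    """Map each required distribution to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in REQUIRED_PACKAGES:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print("Python >= 3.8:", sys.version_info >= (3, 8))
print(installed_versions())
```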
Examples
Check out the examples directory for more detailed examples:
- example_serp.py - SERP API examples
- example_unlocker.py - Web Unlocker examples
- example_scraper.py - Web Scraper examples
- example_pipeline.py - Pipeline integration examples
Documentation
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Support
- Issues: GitHub Issues
- Bright Data Support: support@brightdata.com
- Haystack Community: Haystack Discord
Acknowledgments
- Built for Haystack by deepset
- Powered by Bright Data
Note: You need a valid Bright Data subscription to use this package. Get started at brightdata.com.