Integrating Bright Data with Haystack enhances your RAG pipelines and AI applications with reliable, scalable web data extraction for real-world use cases. The haystack-brightdata Python package is the official Haystack integration for Bright Data and includes support for:
  • Bright Data Web Scraper - Extract structured data from 45+ supported websites, including Amazon, LinkedIn, Instagram, Facebook, TikTok, YouTube, and more, using Bright Data’s Dataset API.
  • Bright Data SERP - Query search engines (Google, Bing, Yahoo) with geo-targeting and language customization for real-time search results.
  • Bright Data Unlocker - Access geo-restricted and bot-protected websites, bypassing CAPTCHAs and anti-bot measures to extract content in multiple formats.

How to Integrate Bright Data With Haystack

1. Obtain Your Bright Data API Key

Create a Bright Data account and generate an API key from your account settings.

2. Install the Bright Data Integration

Install the Bright Data integration package for Haystack by running the following command:
pip install haystack-brightdata
3. Set the Environment Variable

Set your Bright Data API key as an environment variable:
import os
os.environ["BRIGHT_DATA_API_KEY"] = "your-api-key"
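
To avoid hard-coding the key in scripts, you can also prompt for it at run time; a minimal sketch using only the Python standard library:
import os
import getpass

# Prompt for the key only if it is not already set in the environment
if "BRIGHT_DATA_API_KEY" not in os.environ:
    os.environ["BRIGHT_DATA_API_KEY"] = getpass.getpass("Bright Data API key: ")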
4. Select Your Preferred Bright Data Component

The Bright Data + Haystack integration currently provides three components. The BrightDataWebScraper, shown first, extracts structured data from 45+ supported websites, including e-commerce, social media, and business intelligence platforms; minimal SERP and Unlocker sketches follow the snippet.
from haystack_brightdata import BrightDataWebScraper
import os

# Set your API key
os.environ["BRIGHT_DATA_API_KEY"] = "your-api-key"

# Initialize the scraper
scraper = BrightDataWebScraper()

# Extract Amazon product data
result = scraper.run(
    dataset="amazon_product",
    url="https://www.amazon.com/dp/B08N5WRWNW"
)
print(result["data"])
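
The SERP and Unlocker components follow the same run() pattern. A minimal sketch, reusing the exact calls from the RAG examples further down (the zone name "unblocker" is a placeholder for your own Web Unlocker zone, and the example query is illustrative):
from haystack_brightdata import BrightDataSERP, BrightDataUnlocker

# Query a search engine for real-time results (returned as a JSON string)
serp = BrightDataSERP()
search_result = serp.run(query="haystack rag pipelines", num_results=5)
print(search_result["results"])

# Fetch a geo-restricted or bot-protected page as markdown
unlocker = BrightDataUnlocker(default_output_format="markdown", zone="unblocker")
page = unlocker.run(url="https://example.com", output_format="markdown")
print(page["content"])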

RAG Pipeline Examples

Product Data RAG Pipeline

Build a Retrieval-Augmented Generation (RAG) pipeline using Bright Data to extract product data from Amazon and answer questions about products:
import os
from haystack import Pipeline, Document
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.dataclasses import ChatMessage
from haystack_brightdata import BrightDataWebScraper
import json

# Set API keys
os.environ["BRIGHT_DATA_API_KEY"] = "your-brightdata-api-key"
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

# Initialize components
scraper = BrightDataWebScraper()
document_store = InMemoryDocumentStore()
docs_embedder = OpenAIDocumentEmbedder()
text_embedder = OpenAITextEmbedder()
retriever = InMemoryEmbeddingRetriever(document_store)
generator = OpenAIChatGenerator()

# Scrape product data from multiple Amazon products
product_urls = [
    "https://www.amazon.com/dp/B0DRWBJDLJ",
    "https://www.amazon.com/dp/B08B8M5JGN",
    "https://www.amazon.com/dp/B09WTTWH1R",
]

documents = []
for url in product_urls:
    result = scraper.run(dataset="amazon_product", url=url)

    # Parse the response
    if isinstance(result["data"], str):
        product_data = json.loads(result["data"])
    else:
        product_data = result["data"]

    if not isinstance(product_data, list):
        product_data = [product_data]

    for product in product_data:
        content_parts = [
            f"Product: {product.get('title', 'N/A')}",
            f"Brand: {product.get('brand', 'N/A')}",
            f"Price: ${product.get('final_price', 'N/A')} {product.get('currency', '')}",
            f"Rating: {product.get('rating', 0)}/5",
            f"Reviews Count: {product.get('reviews_count', 0)}",
        ]

        if product.get('description'):
            content_parts.append(f"Description: {product.get('description')}")

        if product.get('features'):
            features_text = '\n  - '.join(product.get('features', []))
            content_parts.append(f"Features:\n  - {features_text}")

        content = '\n'.join(content_parts)

        documents.append(Document(
            content=content,
            meta={
                "url": product.get('url', url),
                "title": product.get('title', ''),
                "price": product.get('final_price', 0),
                "rating": product.get('rating', 0),
            }
        ))

# Embed and store documents
embeddings = docs_embedder.run(documents)
document_store.write_documents(embeddings["documents"])

# Create RAG pipeline with ChatPromptBuilder
messages = [
    ChatMessage.from_system("You are a helpful shopping assistant."),
    ChatMessage.from_user("""
Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
""")
]

prompt_builder = ChatPromptBuilder(template=messages)

# Build and connect pipeline
pipe = Pipeline()
pipe.add_component("embedder", text_embedder)
pipe.add_component("retriever", retriever)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", generator)

pipe.connect("embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder", "llm")

# Ask questions about the products
question = "Which product has the best rating?"
response = pipe.run({
    "embedder": {"text": question},
    "prompt_builder": {"question": question}
})

print(f"Answer: {response['llm']['replies'][0].text}")

SERP + Web Content RAG Pipeline

Use the SERP API to find relevant web pages, then use the Web Unlocker to extract their content for a RAG pipeline:
import os
from haystack import Pipeline, Document
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.dataclasses import ChatMessage
from haystack_brightdata import BrightDataSERP, BrightDataUnlocker
import json

# Set API keys
os.environ["BRIGHT_DATA_API_KEY"] = "your-brightdata-api-key"
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

# Initialize components
serp = BrightDataSERP()
unlocker = BrightDataUnlocker(default_output_format="markdown", zone="unblocker")
document_store = InMemoryDocumentStore()
docs_embedder = OpenAIDocumentEmbedder()
text_embedder = OpenAITextEmbedder()
retriever = InMemoryEmbeddingRetriever(document_store)
generator = OpenAIChatGenerator(model="gpt-4")

# Search for information
search_query = "best practices for machine learning in production"
search_result = serp.run(query=search_query, num_results=5)
search_data = json.loads(search_result["results"])

# Extract URLs from search results
urls = []
for result in search_data.get("organic", [])[:5]:
    url = result.get("url") or result.get("link")
    if url:
        urls.append(url)

# Fetch content from each URL
documents = []
for url in urls:
    try:
        result = unlocker.run(url=url, output_format="markdown")
        content = result["content"]
        documents.append(Document(
            content=content,
            meta={"url": url}
        ))
    except Exception as e:
        print(f"Failed to fetch {url}: {e}")

# Embed and store documents
embeddings = docs_embedder.run(documents)
document_store.write_documents(embeddings["documents"])

# Create RAG pipeline
messages = [
    ChatMessage.from_system("You are a knowledgeable AI assistant."),
    ChatMessage.from_user("""
Context from web sources:
{% for document in documents %}
    Source: {{ document.meta.url }}
    {{ document.content }}
{% endfor %}

Question: {{question}}
""")
]

prompt_builder = ChatPromptBuilder(template=messages)

pipe = Pipeline()
pipe.add_component("embedder", text_embedder)
pipe.add_component("retriever", retriever)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", generator)

pipe.connect("embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder", "llm")

# Ask questions
question = "What are the main challenges of deploying ML models in production?"
response = pipe.run({
    "embedder": {"text": question},
    "prompt_builder": {"question": question}
})

print(f"Answer: {response['llm']['replies'][0].text}")

Supported Datasets

The BrightDataWebScraper component supports 45+ datasets across multiple categories:
  • E-commerce: amazon_product, amazon_product_reviews, amazon_product_search, walmart_product, walmart_seller, ebay_product, homedepot_products, zara_products, etsy_products, bestbuy_products
  • LinkedIn: linkedin_person_profile, linkedin_company_profile, linkedin_job_listings, linkedin_posts, linkedin_people_search
  • Instagram: instagram_profiles, instagram_posts, instagram_reels, instagram_comments
  • Facebook: facebook_posts, facebook_marketplace_listings, facebook_company_reviews, facebook_events
  • TikTok: tiktok_profiles, tiktok_posts, tiktok_shop, tiktok_comments
  • YouTube: youtube_profiles, youtube_videos, youtube_comments
  • Search & Commerce: google_maps_reviews, google_shopping, google_play_store, apple_app_store, zillow_properties_listing, booking_hotel_listings
  • Business Intelligence: crunchbase_company, zoominfo_company_profile
  • Other: reuter_news, github_repository_file, yahoo_finance_business, x_posts, reddit_posts
For detailed information about each dataset and its required parameters:
from haystack_brightdata import BrightDataWebScraper

# List all datasets
datasets = BrightDataWebScraper.get_supported_datasets()
for dataset in datasets:
    print(f"{dataset['id']}: {dataset['description']}")

Use Cases

Bright Data’s Haystack integration enables powerful use cases:
  • E-commerce Intelligence: Price monitoring (sketched after this list), product data extraction, and competitive analysis
  • Social Media Analytics: Content monitoring and engagement analysis across platforms
  • Business Intelligence: Company research and competitive landscape analysis
  • Search Analysis: SEO/SEM research with geo-targeted search results
  • Content Aggregation: Building RAG pipelines with real-time web data
  • Market Research: Accessing geo-restricted content for global research
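
As an illustration of the first use case, here is a minimal price-monitoring sketch built on the scraper calls shown above. The watchlist URL and threshold are placeholders, and the response parsing assumes the same data shape as in the product RAG example:
import json
import os

from haystack_brightdata import BrightDataWebScraper

os.environ["BRIGHT_DATA_API_KEY"] = "your-api-key"
scraper = BrightDataWebScraper()

# URL -> alert threshold in USD; re-run this on a schedule to track price changes
watchlist = {"https://www.amazon.com/dp/B08N5WRWNW": 250.0}

for url, threshold in watchlist.items():
    result = scraper.run(dataset="amazon_product", url=url)

    # Normalize the response: it may arrive as a JSON string or as a list
    data = result["data"]
    if isinstance(data, str):
        data = json.loads(data)
    product = data[0] if isinstance(data, list) else data

    price = float(product.get("final_price") or 0)
    if 0 < price < threshold:
        print(f"Price alert: {product.get('title')} is now ${price}")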