RAG: Extract and use website content for question answering with Apify-Haystack integration


Author: Jiri Spilka (Apify)

In this tutorial, we’ll use the apify-haystack integration to call the Website Content Crawler Actor, which crawls and scrapes text content from the Haystack website. We’ll then compute text embeddings with the OpenAIDocumentEmbedder and store the documents in a temporary in-memory database using the InMemoryDocumentStore. The last step is a retrieval-augmented generation pipeline that answers users’ questions from the scraped data.

Install dependencies

!pip install apify-haystack haystack-ai

Set up the API keys

You need an Apify account and your APIFY_API_TOKEN.

You also need an OpenAI account and your OPENAI_API_KEY.

import os
from getpass import getpass

os.environ["APIFY_API_TOKEN"] = getpass("Enter YOUR APIFY_API_TOKEN")
os.environ["OPENAI_API_KEY"] = getpass("Enter YOUR OPENAI_API_KEY")
Enter YOUR APIFY_API_TOKEN··········
Enter YOUR OPENAI_API_KEY··········

Use the Website Content Crawler to scrape data from the Haystack documentation

Now, let’s call the Website Content Crawler using the ApifyDatasetFromActorCall Haystack component. First, we define the input parameters for the Website Content Crawler, and then we specify which fields of its output to save into the vector database.

The actor_id and a detailed description of the input parameters (the run_input variable) can be found on the Website Content Crawler input page.

For this example, we will define startUrls and limit the number of crawled pages to five.

actor_id = "apify/website-content-crawler"
run_input = {
    "maxCrawlPages": 5,  # limit the number of pages to crawl
    "startUrls": [{"url": "https://haystack.deepset.ai/"}],
}
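
The Website Content Crawler accepts many more options (crawler type, crawl depth, content and element filters, and so on). The snippet below is only an illustrative sketch; maxCrawlDepth and crawlerType are taken from the Actor's input page, so verify the exact parameter names and allowed values there before relying on them. In this tutorial we stick to the minimal run_input defined above.

# Illustrative only - check the Website Content Crawler input page for the
# authoritative list of parameters and their allowed values.
run_input_extended = {
    "startUrls": [{"url": "https://haystack.deepset.ai/"}],
    "maxCrawlPages": 5,                    # stop after five pages
    "maxCrawlDepth": 2,                    # assumed: how many links deep to follow
    "crawlerType": "playwright:firefox",   # assumed: use a headless browser for JS-heavy pages
}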

Next, we need to define a dataset mapping function, which requires knowing what the Website Content Crawler outputs. Typically, each dataset item is a JSON object that looks like this (truncated for brevity):

[
  {
    "url": "https://haystack.deepset.ai/",
    "text": "Haystack | Haystack - Multimodal - AI - Architect a next generation AI app around all modalities, not just text ..."
  },
  {
    "url": "https://haystack.deepset.ai/tutorials/24_building_chat_app",
    "text": "Building a Conversational Chat App ... "
  },
]

We will convert each dataset item into a Haystack Document using the dataset_mapping_function as follows:

from haystack import Document

def dataset_mapping_function(dataset_item: dict) -> Document:
    return Document(content=dataset_item.get("text"), meta={"url": dataset_item.get("url")})
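
A quick sanity check on one of the sample items shown above confirms what the mapping produces (purely illustrative):

sample_item = {
    "url": "https://haystack.deepset.ai/",
    "text": "Haystack | Haystack - Multimodal - AI - Architect a next generation AI app ...",
}
doc = dataset_mapping_function(sample_item)
print(doc.meta["url"])   # https://haystack.deepset.ai/
print(doc.content[:40])  # first 40 characters of the scraped text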

And the definition of the ApifyDatasetFromActorCall:

from apify_haystack import ApifyDatasetFromActorCall

apify_dataset_loader = ApifyDatasetFromActorCall(
    actor_id=actor_id,
    run_input=run_input,
    dataset_mapping_function=dataset_mapping_function,
)
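
If you have already run the Actor and only want to reuse its results, the apify-haystack package also provides an ApifyDatasetLoader that reads an existing dataset instead of starting a new run. A minimal sketch, assuming the dataset ID comes from a previous run (check the package documentation for the exact signature):

from apify_haystack import ApifyDatasetLoader

# Load documents from a dataset produced by an earlier Actor run.
# "YOUR-DATASET-ID" is a placeholder - copy the real ID from the Apify Console.
existing_dataset_loader = ApifyDatasetLoader(
    dataset_id="YOUR-DATASET-ID",
    dataset_mapping_function=dataset_mapping_function,
)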

Before actually running the Website Content Crawler, we need to define the document embedder and the document store:

from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
docs_embedder = OpenAIDocumentEmbedder()
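
By default, OpenAIDocumentEmbedder uses OpenAI's default embedding model. If you want a specific one, you can pass it explicitly; the model name below is just an example, and we keep the default in this tutorial:

# Optional: pin a specific embedding model instead of the default one
docs_embedder = OpenAIDocumentEmbedder(model="text-embedding-3-small")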

After that, we can call the Website Content Crawler and print the scraped data:

# Crawl the website and store documents in the document_store
# Crawling will take some time (1-2 minutes); you can monitor progress at https://console.apify.com/actors/runs

docs = apify_dataset_loader.run()
print(docs)
{'documents': [Document(id=6c4d570874ff59ed4e06017694bee8a72d766d2ed55c6453fbc9ea91fd2e6bde, content: 'Haystack | Haystack Luma · Delightful Events Start HereAWS Summit Berlin 2023: Building Generative A...', meta: {'url': 'https://haystack.deepset.ai/'}), Document(id=d420692bf66efaa56ebea200a4a63597667bdc254841b99654239edf67737bcb, content: 'Tutorials & Walkthroughs | Haystack
Tutorials & Walkthroughs2.0
Whether you’re a beginner or an expe...', meta: {'url': 'https://haystack.deepset.ai/tutorials'}), Document(id=5a529a308d271ba76f66a060c0b706b73103406ac8a853c19f20e1594823efe8, content: 'Get Started | Haystack
Haystack is an open-source Python framework that helps developers build LLM-p...', meta: {'url': 'https://haystack.deepset.ai/overview/quick-start'}), Document(id=1d126a03ae50586729846d492e9e8aca802d7f281a72a8869ded08ebc5585a36, content: 'What is Haystack? | Haystack
Haystack is an open source framework for building production-ready LLM ...', meta: {'url': 'https://haystack.deepset.ai/overview/intro'}), Document(id=4324a62242590d4ecf9b080319607fa1251aa0822bbe2ce6b21047e783999703, content: 'Integrations | Haystack
The Haystack ecosystem integrates with many other technologies, such as vect...', meta: {'url': 'https://haystack.deepset.ai/integrations'})]}
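
Before computing embeddings, it can be useful to quickly inspect what was scraped, for example the URL and length of each document:

# Inspect the scraped documents: URL and number of characters
for doc in docs["documents"]:
    print(doc.meta["url"], len(doc.content or ""))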

Compute the embeddings and store them in the database:

embeddings = docs_embedder.run(docs.get("documents"))
document_store.write_documents(embeddings["documents"])
Calculating embeddings: 100%|██████████| 1/1 [00:00<00:00,  3.29it/s]
5
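
The trailing 5 is the return value of write_documents, i.e., the number of documents written. You can double-check the store directly:

print(document_store.count_documents())  # 5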

Retrieval-augmented generation pipeline

Once we have the crawled data in the database, we can set up the classic retrieval-augmented generation pipeline. Refer to the Haystack RAG tutorial for details.

from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.embedders import OpenAITextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

text_embedder = OpenAITextEmbedder()
retriever = InMemoryEmbeddingRetriever(document_store)
generator = OpenAIGenerator(model="gpt-4o-mini")

template = """
Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""

prompt_builder = PromptBuilder(template=template)

# Add components to your pipeline
print("Initializing pipeline...")
pipe = Pipeline()
pipe.add_component("embedder", text_embedder)
pipe.add_component("retriever", retriever)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", generator)

# Now, connect the components to each other
pipe.connect("embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder", "llm")
Initializing pipeline...
<haystack.core.pipeline.pipeline.Pipeline object at 0x7c02095efdc0>
🚅 Components
  - embedder: OpenAITextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: PromptBuilder
  - llm: OpenAIGenerator
🛤️ Connections
  - embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.prompt (str)
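
Haystack pipelines can also be rendered as a graph. In a notebook you can call pipe.show(), which relies on an external Mermaid rendering service and therefore needs internet access; this step is optional:

# Optional: render the pipeline graph in the notebook
# (uses an external Mermaid rendering service, requires internet access)
pipe.show()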

Now you can ask questions about Haystack and get answers grounded in the scraped content:

question = "What is haystack?"

response = pipe.run({"embedder": {"text": question}, "prompt_builder": {"question": question}})

print(f"question: {question}")
print(f"answer: {response['llm']['replies'][0]}")
question: What is haystack?
answer: Haystack is an open-source Python framework designed to help developers build LLM-powered custom applications. It is used for creating production-ready LLM applications, retrieval-augmented generative pipelines, and state-of-the-art search systems that work effectively over large document collections. Haystack offers comprehensive tooling for developing AI systems that use LLMs from platforms like Hugging Face, OpenAI, Cohere, Mistral, and more. It provides a modular and intuitive framework that allows users to quickly integrate the latest AI models, offering flexibility and ease of use. The framework includes components and pipelines that enable developers to build end-to-end AI projects without the need to understand the underlying models deeply. Haystack caters to LLM enthusiasts and beginners alike, providing a vibrant open-source community for collaboration and learning.
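
To ask further questions, you can wrap the two pipeline inputs into a small helper; this is just a convenience sketch built on the pipeline defined above:

def ask(question: str) -> str:
    """Run the RAG pipeline for a single question and return the answer."""
    result = pipe.run({"embedder": {"text": question}, "prompt_builder": {"question": question}})
    return result["llm"]["replies"][0]

print(ask("What LLM providers does Haystack integrate with?"))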