Advanced RAG: Automated Structured Metadata Enrichment

_{Last Updated:
October 10, 2024}

by Tuana Celik ( LI, Twitter)

In this example, you’ll see how you can make use of structured outputs which is an option for some LLMs, and a custom Haystack component, to automate the enrichment of metadata from documents.

You will see how you can define your own metadata fields as a Pydantic Model, as well as the data types each field should have. Finally, you will get a custom MetadataEnricher to extract the required fields and add them to the document meta information.

In this example, we will be enriching metadata with information relating the funding announements.

Once we populate the metadata of a document with our own fields, we are able to use Metadata Filtering during the retrieval step of RAG pipelines. We can even combine this with Metadata Extraction from Queries to Improve Retrieval to be very precise about what documents we are providing as context to an LLM.

Install requirements

!pip install haystack-ai
!pip install trafilatura

from haystack import Document, component, Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.converters import HTMLToDocument
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.generators import OpenAIGenerator
from haystack.components.generators.openai_utils import _convert_message_to_openai_format
from haystack.components.preprocessors import DocumentSplitter
from haystack.dataclasses import ChatMessage, StreamingChunk

from openai import Stream
from openai.types.chat import ChatCompletion, ChatCompletionChunk
from typing import List, Any, Dict, Optional, Callable, Union
from pydantic import BaseModel

🧪 Experimental Addition to the OpenAIGenerator for Structured Output Support

🚀 This is the same extension to the OpenAIGenerator that was used in the Advanced RAG: Query Decomposition and Reasoning example

Let’s extend the OpenAIGeneraotor to be able to make use of the strctured output option by OpenAI. Below, we extend the class to call self.client.beta.chat.completions.parse if the user has provides a respose_format in generation_kwargs. This will allow us to provifde a Pydantic Model to the gnerator and request our generator to respond with structured outputs that adhere to this Pydantic schema.

class OpenAIGenerator(OpenAIGenerator):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    @component.output_types(replies=List[str], meta=List[Dict[str, Any]], structured_reply=BaseModel)
    def run(self, prompt: str, streaming_callback: Optional[Callable[[StreamingChunk], None]] = None, generation_kwargs: Optional[Dict[str, Any]] = None,):
      generation_kwargs = {**self.generation_kwargs, **(generation_kwargs or {})}
      if "response_format" in generation_kwargs.keys():
        message = ChatMessage.from_user(prompt)
        if self.system_prompt:
            messages = [ChatMessage.from_system(self.system_prompt), message]
        else:
            messages = [message]

        streaming_callback = streaming_callback or self.streaming_callback
        openai_formatted_messages = [_convert_message_to_openai_format(message) for message in messages]
        completion: Union[Stream[ChatCompletionChunk], ChatCompletion] = self.client.beta.chat.completions.parse(
            model=self.model,
            messages=openai_formatted_messages,
            **generation_kwargs)
        completions = [self._build_structured_message(completion, choice) for choice in completion.choices]
        for response in completions:
            self._check_finish_reason(response)

        return {
            "replies": [message.content for message in completions],
            "meta": [message.meta for message in completions],
            "structured_reply": completions[0].content
        }
      else:
          return super().run(prompt, streaming_callback, generation_kwargs)

    def _build_structured_message(self, completion: Any, choice: Any) -> ChatMessage:
        chat_message = ChatMessage.from_assistant(choice.message.parsed or "")
        chat_message.meta.update(
            {
                "model": completion.model,
                "index": choice.index,
                "finish_reason": choice.finish_reason,
                "usage": dict(completion.usage),
            }
        )
        return chat_message

import os
from getpass import getpass

if "OPENAI_API_KEY" not in os.environ:
  os.environ["OPENAI_API_KEY"] = getpass("OpenAI API Key:")

OpenAI API Key:··········

Custom `MetadataEnricher`

We create a custom Haystack component that is able ti accept metadata_model and prompt. If no prompt is provided, it usees the DEFAULT_PROMPT.

This component returns documents enriched with the requested metadata fileds.

DEFAULT_PROMPT = """
Given the contents of the documents, extract the requested metadata.
The requested metadata is {{ metadata_model }}
Document:
{{document}}
Metadata:
"""
@component
class MetadataEnricher:

    def __init__(self, metadata_model: BaseModel, prompt:str = DEFAULT_PROMPT):
        self.metadata_model = metadata_model
        self.metadata_prompt = prompt

        builder = PromptBuilder(self.metadata_prompt)
        llm = OpenAIGenerator(generation_kwargs={"response_format": metadata_model})
        self.pipeline = Pipeline()
        self.pipeline.add_component(name="builder", instance=builder)
        self.pipeline.add_component(name="llm", instance=llm)
        self.pipeline.connect("builder", "llm")

    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document]):
        documents_with_meta = []
        for document in documents:
          result = self.pipeline.run({'builder': {'document': document.content, 'metadata_model': self.metadata_model}})
          metadata = result['llm']['structured_reply']
          document.meta.update(metadata.dict())
          documents_with_meta.append(document)
        return {"documents": documents_with_meta}

Define Metadata Fields as a Pydantic Model

For automatic metadata enrichment, we want to be able to provide a structure describing what fields we want to extract, as well as what types they should be.

Below, I have defined a Metadata model, with 4 fields.

💡 Note: In some cases, it might make sense to make each field optional, or provide default values.

class Metadata(BaseModel):
    company: str
    year: int
    funding_value: int
    funding_currency: str

Next, we initialize a MetadataEnricher and provide Metadata as the metadata_model we want to abide by.

enricher = MetadataEnricher(metadata_model=Metadata)

Build an Automated Metadata Enrichment Pipeline

Now that we have our enricher, we can use it in a pipeline. Below is an example of a pipeline that fetches the contents of some URLs (in this case, urls that contain information about funding announcements). The pipeline then adds the requested metadata fields to each Document’s meta field 👇

pipeline = Pipeline()
pipeline.add_component("fetcher", LinkContentFetcher())
pipeline.add_component("converter", HTMLToDocument())
pipeline.add_component("enricher", enricher)


pipeline.connect("fetcher", "converter")
pipeline.connect("converter.documents", "enricher.documents")

pipeline.run({"fetcher": {"urls": ['https://techcrunch.com/2023/08/09/deepset-secures-30m-to-expand-its-llm-focused-mlops-offerings/',
                                   'https://www.prnewswire.com/news-releases/arize-ai-raises-38-million-series-b-to-scale-machine-learning-observability-platform-301620603.html']}})

{'enricher': {'documents': [Document(id=5844517120556b13f92430ea8af9837714ede1b351580c43c2ddce9b646cb6cb, content: 'Deepset, a platform for building enterprise apps powered by large language models akin to ChatGPT, t...', meta: {'content_type': 'text/html', 'url': 'https://techcrunch.com/2023/08/09/deepset-secures-30m-to-expand-its-llm-focused-mlops-offerings/', 'company': 'Deepset', 'year': 2023, 'funding_value': 30000000, 'funding_currency': 'USD'}),
   Document(id=8cdcb63a4e006b1cac902ebc2e012cd95156d188777e0d0c8bd407a92f4491c7, content: 'Arize AI Raises $38 Million Series B To Scale Machine Learning Observability Platform
   As companies t...', meta: {'content_type': 'text/html', 'url': 'https://www.prnewswire.com/news-releases/arize-ai-raises-38-million-series-b-to-scale-machine-learning-observability-platform-301620603.html', 'company': 'Arize AI', 'year': 2022, 'funding_value': 38000000, 'funding_currency': 'USD'})]}}

pipeline.show()

Extra: Metadata Inheritance

This is just an extra step to show how metadata that belongs to a document is inherited by the document chunks if you use a component such as the DocumentSplitter.

pipeline.add_component("splitter", DocumentSplitter())

pipeline.connect("enricher", "splitter")

<haystack.core.pipeline.pipeline.Pipeline object at 0x77ff80aee8c0>
🚅 Components
  - fetcher: LinkContentFetcher
  - converter: HTMLToDocument
  - enricher: MetadataEnricher
  - splitter: DocumentSplitter
🛤️ Connections
  - fetcher.streams -> converter.sources (List[ByteStream])
  - converter.documents -> enricher.documents (List[Document])
  - enricher.documents -> splitter.documents (List[Document])

pipeline.run({"fetcher": {"urls": ['https://techcrunch.com/2023/08/09/deepset-secures-30m-to-expand-its-llm-focused-mlops-offerings/',
                                   'https://www.prnewswire.com/news-releases/arize-ai-raises-38-million-series-b-to-scale-machine-learning-observability-platform-301620603.html']}})

{'splitter': {'documents': [Document(id=9611aa2bdb658163d8f6964220052065936fcd036dd24743d1b34ce79d25bc5a, content: 'Deepset, a platform for building enterprise apps powered by large language models akin to ChatGPT, t...', meta: {'content_type': 'text/html', 'url': 'https://techcrunch.com/2023/08/09/deepset-secures-30m-to-expand-its-llm-focused-mlops-offerings/', 'company': 'Deepset', 'year': 2023, 'funding_value': 30000000, 'funding_currency': 'USD', 'source_id': '5844517120556b13f92430ea8af9837714ede1b351580c43c2ddce9b646cb6cb', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
   Document(id=6bffbcf9f1cd1a3940628d1450c9ba9a8c9a092136896d295b35af3175caffbf, content: 'unfortunate state of affairs is likely contributing to challenges around AI development within the e...', meta: {'content_type': 'text/html', 'url': 'https://techcrunch.com/2023/08/09/deepset-secures-30m-to-expand-its-llm-focused-mlops-offerings/', 'company': 'Deepset', 'year': 2023, 'funding_value': 30000000, 'funding_currency': 'USD', 'source_id': '5844517120556b13f92430ea8af9837714ede1b351580c43c2ddce9b646cb6cb', 'page_number': 1, 'split_id': 1, 'split_idx_start': 1256}),
   Document(id=da372f9bc2292f487f0aad372053a531744d9549a46f8ac197c216abaa4d99d0, content: 'to end users, and perform analyses of the LLMs’ accuracy while continuously monitoring their perform...', meta: {'content_type': 'text/html', 'url': 'https://techcrunch.com/2023/08/09/deepset-secures-30m-to-expand-its-llm-focused-mlops-offerings/', 'company': 'Deepset', 'year': 2023, 'funding_value': 30000000, 'funding_currency': 'USD', 'source_id': '5844517120556b13f92430ea8af9837714ede1b351580c43c2ddce9b646cb6cb', 'page_number': 1, 'split_id': 2, 'split_idx_start': 2609}),
   Document(id=f316cd275e8bc763de41d128dbdbd81e1baad2693b0102f6951c4f46aa8f6048, content: 'predicts that the sector for MLOps will reach $23.1 billion by 2031, up from around $1 billion in 20...', meta: {'content_type': 'text/html', 'url': 'https://techcrunch.com/2023/08/09/deepset-secures-30m-to-expand-its-llm-focused-mlops-offerings/', 'company': 'Deepset', 'year': 2023, 'funding_value': 30000000, 'funding_currency': 'USD', 'source_id': '5844517120556b13f92430ea8af9837714ede1b351580c43c2ddce9b646cb6cb', 'page_number': 1, 'split_id': 3, 'split_idx_start': 3997}),
   Document(id=c7ff4e0d7af8aaa16f3195cb1f9096bb1cf8e7d985190fa6746c278b1d8457e8, content: 'Arize AI Raises $38 Million Series B To Scale Machine Learning Observability Platform
   As companies t...', meta: {'content_type': 'text/html', 'url': 'https://www.prnewswire.com/news-releases/arize-ai-raises-38-million-series-b-to-scale-machine-learning-observability-platform-301620603.html', 'company': 'Arize AI', 'year': 2022, 'funding_value': 38000000, 'funding_currency': 'USD', 'source_id': '8cdcb63a4e006b1cac902ebc2e012cd95156d188777e0d0c8bd407a92f4491c7', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
   Document(id=646f4d43ee20fcbe35c97bd04e8fd6edd4ad5c9af63fe4a63267f5df1807254f, content: 'by humans.
   Launched in 2020, Arize's ML observability platform is already counted on by a growing li...', meta: {'content_type': 'text/html', 'url': 'https://www.prnewswire.com/news-releases/arize-ai-raises-38-million-series-b-to-scale-machine-learning-observability-platform-301620603.html', 'company': 'Arize AI', 'year': 2022, 'funding_value': 38000000, 'funding_currency': 'USD', 'source_id': '8cdcb63a4e006b1cac902ebc2e012cd95156d188777e0d0c8bd407a92f4491c7', 'page_number': 1, 'split_id': 1, 'split_idx_start': 1360}),
   Document(id=09382ae0ab9adbd7860199b0d86e8ca044787eda860dfb181e3b243c6584a427, content: 'what happened, and improve overall model performance," says Morgan Gerlak, Partner at TCV. "Like oth...', meta: {'content_type': 'text/html', 'url': 'https://www.prnewswire.com/news-releases/arize-ai-raises-38-million-series-b-to-scale-machine-learning-observability-platform-301620603.html', 'company': 'Arize AI', 'year': 2022, 'funding_value': 38000000, 'funding_currency': 'USD', 'source_id': '8cdcb63a4e006b1cac902ebc2e012cd95156d188777e0d0c8bd407a92f4491c7', 'page_number': 1, 'split_id': 2, 'split_idx_start': 2697}),
   Document(id=94497826791e38f15016a2360d7e3eea5e242770f51dd11be69fb21210e89c9a, content: 'you are going to be left behind," notes Brett Wilson, Co-Founder and General Partner at Swift Ventur...', meta: {'content_type': 'text/html', 'url': 'https://www.prnewswire.com/news-releases/arize-ai-raises-38-million-series-b-to-scale-machine-learning-observability-platform-301620603.html', 'company': 'Arize AI', 'year': 2022, 'funding_value': 38000000, 'funding_currency': 'USD', 'source_id': '8cdcb63a4e006b1cac902ebc2e012cd95156d188777e0d0c8bd407a92f4491c7', 'page_number': 1, 'split_id': 3, 'split_idx_start': 3967})]}}