Integration: Elasticsearch
Use an Elasticsearch database with Haystack
Haystack 2.0
The ElasticsearchDocumentStore is maintained in the haystack-core-integrations repo. It allows you to use Elasticsearch as data storage for your Haystack pipelines.
For details on available methods, visit the API Reference.
Installation
To run an Elasticsearch instance locally, first follow the installation and start up guides.
pip install elasticsearch-haystack
Usage
Once installed, you can start using your Elasticsearch database with Haystack by initializing it:
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore(hosts = "http://localhost:9200")
Writing Documents to ElasticsearchDocumentStore
To write documents to your ElasticsearchDocumentStore, create an indexing pipeline with a DocumentWriter, or use the write_documents() method.
For this step, you can use the available TextFileToDocument and DocumentSplitter, as well as other Integrations that might help you fetch data from other resources.
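For a handful of texts, calling write_documents() directly can be simpler than building a pipeline. The following is a minimal sketch, assuming a local Elasticsearch instance at http://localhost:9200. The helper name write_texts and the "source" metadata field are illustrative and not part of the integration. The imports sit inside the function only so that the snippet can be loaded without the packages installed or a cluster running.

```python
from typing import Iterable


def write_texts(texts: Iterable[str], hosts: str = "http://localhost:9200") -> int:
    """Store plain strings as Haystack documents and return how many were written."""
    # Imported here so this snippet can be loaded without the packages installed.
    from haystack import Document
    from haystack.document_stores.types import DuplicatePolicy
    from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore

    document_store = ElasticsearchDocumentStore(hosts=hosts)
    # "source" is an illustrative metadata field, not something the store requires.
    documents = [Document(content=text, meta={"source": "example"}) for text in texts]
    # SKIP leaves a document untouched if one with the same id is already stored.
    return document_store.write_documents(documents, policy=DuplicatePolicy.SKIP)
```

Calling write_texts(["First text", "Second text"]) against a running instance returns the number of documents written. Documents stored this way carry no embeddings, so they need to pass through a document embedder before an embedding retriever can find them.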
Indexing Pipeline
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
document_store = ElasticsearchDocumentStore(hosts = "http://localhost:9200")
converter = TextFileToDocument()
splitter = DocumentSplitter()
doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/multi-qa-mpnet-base-dot-v1")
writer = DocumentWriter(document_store)
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("converter", converter)
indexing_pipeline.add_component("splitter", splitter)
indexing_pipeline.add_component("doc_embedder", doc_embedder)
indexing_pipeline.add_component("writer", writer)
indexing_pipeline.connect("converter", "splitter")
indexing_pipeline.connect("splitter", "doc_embedder")
indexing_pipeline.connect("doc_embedder", "writer")
indexing_pipeline.run({
"converter":{"sources":["filename.txt"]}
})
Using Elasticsearch in a Query Pipeline
Once you have documents in your ElasticsearchDocumentStore, it’s ready to be used with the ElasticsearchEmbeddingRetriever in the retrieval step of any Haystack pipeline, such as a Retrieval Augmented Generation (RAG) pipeline. Learn more about Retrievers to make use of vector search within your LLM pipelines.
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack_integrations.components.retrievers.elasticsearch import ElasticsearchEmbeddingRetriever
model = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
document_store = ElasticsearchDocumentStore(hosts = "http://localhost:9200")
retriever = ElasticsearchEmbeddingRetriever(document_store=document_store)
text_embedder = SentenceTransformersTextEmbedder(model=model)
query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", text_embedder)
query_pipeline.add_component("retriever", retriever)
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
result = query_pipeline.run({"text_embedder": {"text": "historical places in Istanbul"}})
print(result)
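The printed result is a dictionary keyed by component name, with the retrieved documents under the retriever's "documents" output. As a small sketch of how it might be read, the hypothetical helper top_hits below pulls out the score and a short content preview for each document. The stand-in objects only mimic the score and content attributes of Haystack documents so that the snippet runs without a cluster.

```python
from types import SimpleNamespace
from typing import Any, Dict, List, Optional, Tuple


def top_hits(
    result: Dict[str, Any], component: str = "retriever", limit: int = 3
) -> List[Tuple[Optional[float], str]]:
    """Return (score, content preview) pairs in the order the retriever returned them."""
    documents = result[component]["documents"]
    # Content may be None, so fall back to an empty string before slicing.
    return [(doc.score, (doc.content or "")[:80]) for doc in documents[:limit]]


# Stand-in for the output of query_pipeline.run(...); real entries are Haystack documents.
fake_result = {
    "retriever": {
        "documents": [
            SimpleNamespace(score=0.92, content="Hagia Sophia is a historic site in Istanbul."),
            SimpleNamespace(score=0.47, content=None),
        ]
    }
}

preview = top_hits(fake_result)
for score, text in preview:
    print(f"{score:.2f}  {text}")
```

With a real pipeline, the same call would be top_hits(result), where "retriever" matches the name passed to add_component().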
Haystack 1.x
The ElasticsearchDocumentStore is maintained within the core Haystack project. It allows you to use Elasticsearch as data storage for your Haystack pipelines.
For details on available methods, visit the API Reference.
Installation (1.x)
To run an Elasticsearch instance locally, first follow the installation and start up guides.
pip install farm-haystack[elasticsearch]
To use Elasticsearch 7, run pip install farm-haystack[elasticsearch7] instead.
Usage (1.x)
Once installed, you can start using your Elasticsearch database with Haystack by initializing it:
from haystack.document_stores import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore(
    host="localhost",
    port=9200,
    embedding_dim=768,
)
Writing Documents to ElasticsearchDocumentStore
To write documents to your ElasticsearchDocumentStore, create an indexing pipeline, or use the write_documents() method.
For this step, you can use the available FileConverters and PreProcessors, as well as other Integrations that might help you fetch data from other resources.
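In Haystack 1.x, write_documents() accepts plain dictionaries with a "content" key and an optional "meta" key, so a direct write needs very little code. The sketch below assumes a local instance on localhost:9200. The helper names to_document_dicts and write_texts_v1 and the "source" metadata field are illustrative only. The Haystack import sits inside the function so that the snippet can be loaded without a cluster.

```python
from typing import Dict, Iterable, List


def to_document_dicts(texts: Iterable[str], source: str = "example") -> List[Dict]:
    """Turn raw strings into the dictionary format accepted by write_documents()."""
    # Blank strings are dropped so that no empty documents reach the index.
    return [{"content": text, "meta": {"source": source}} for text in texts if text.strip()]


def write_texts_v1(texts: Iterable[str], host: str = "localhost", port: int = 9200) -> None:
    """Write the given texts to Elasticsearch through the Haystack 1.x document store."""
    # Haystack 1.x import path; imported here so the module loads without farm-haystack.
    from haystack.document_stores import ElasticsearchDocumentStore

    document_store = ElasticsearchDocumentStore(host=host, port=port)
    document_store.write_documents(to_document_dicts(texts))
```

For example, to_document_dicts(["Istanbul is a city."]) produces a single dictionary that write_texts_v1 would pass on to the store unchanged.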
Indexing Pipeline
from haystack import Pipeline
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import TextConverter, PreProcessor
document_store = ElasticsearchDocumentStore(host = "localhost", port = 9200)
converter = TextConverter()
preprocessor = PreProcessor()
indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=converter, name="TextConverter", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["TextConverter"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["PreProcessor"])
indexing_pipeline.run(file_paths=["filename.txt"])
Using Elasticsearch in a Query Pipeline
Once you have documents in your ElasticsearchDocumentStore, it’s ready to be used in any Haystack pipeline, such as a Retrieval Augmented Generation (RAG) pipeline. Learn more about Retrievers to make use of vector search within your LLM pipelines.
from haystack import Pipeline
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import EmbeddingRetriever, PromptNode
document_store = ElasticsearchDocumentStore()
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
)
prompt_node = PromptNode(model_name_or_path = "google/flan-t5-xl", default_prompt_template = "deepset/question-answering")
query_pipeline = Pipeline()
query_pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
query_pipeline.add_node(component=prompt_node, name="PromptNode", inputs=["Retriever"])
query_pipeline.run(query = "Where is Istanbul?")
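The EmbeddingRetriever can only match documents that already have embeddings stored, and the 1.x indexing pipeline shown earlier writes text only. One way to fill that gap is the document store's update_embeddings() method. The following is a minimal sketch, assuming the same local instance and model as above. The wrapper name embed_stored_documents is hypothetical, and the imports sit inside the function so that the snippet can be loaded without a cluster.

```python
def embed_stored_documents(
    host: str = "localhost",
    port: int = 9200,
    embedding_dim: int = 768,
    model: str = "sentence-transformers/multi-qa-mpnet-base-dot-v1",
) -> None:
    """Compute and store embeddings for the documents already in the index."""
    # Haystack 1.x import paths; imported here so the module loads without farm-haystack.
    from haystack.document_stores import ElasticsearchDocumentStore
    from haystack.nodes import EmbeddingRetriever

    # embedding_dim has to match the vector size produced by the model (768 for this one).
    document_store = ElasticsearchDocumentStore(host=host, port=port, embedding_dim=embedding_dim)
    retriever = EmbeddingRetriever(document_store=document_store, embedding_model=model)
    # Embeds the stored documents with the retriever's model and saves the vectors in the index.
    document_store.update_embeddings(retriever)
```

Running embed_stored_documents() once after indexing, and again whenever new documents are added, keeps the stored vectors in step with the text that the query pipeline searches.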