Building a RAG Pipeline with OpenSearch as the Vector Store
Retrieval-Augmented Generation (RAG) augments a language model's response by first retrieving relevant context from a database, then passing that context into the prompt. OpenSearch is a natural fit for the retrieval step: it runs the embedding model internally, stores the vectors, and returns ranked results in a single query. This post shows the retrieval step with real scores from a live OpenSearch 2.19.1 cluster managed by FoundryDB, and explains how to wire the retrieved chunks into a prompt and call an LLM.
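The retrieval examples in Steps 2 and 3 query an index named articles-knn. The prompt assembly in Step 4 uses a dedicated knowledge base index with 6 database documentation chunks, embedded using all-MiniLM-L6-v2 (384 dimensions). Retrieval and prompt assembly were both tested on a live FoundryDB cluster, and the complete prompt shown in Step 4 comes from that test.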
All commands use YOUR_OPENSEARCH_HOST and YOUR_PASSWORD as placeholders.
Prerequisites
- A running FoundryDB OpenSearch cluster.
- A k-NN index populated with documents, and a deployed ML embedding model whose model ID you know.
Step 1: Understand the RAG Architecture
A RAG pipeline has three stages:
- Retrieve: Embed the user query and run a k-NN search to find the most semantically similar document chunks.
- Augment: Inject the retrieved chunks into a prompt template alongside the user's question.
- Generate: Call an LLM with the augmented prompt and return the response.
OpenSearch handles stage 1. Stages 2 and 3 run in your application code or in a workflow orchestrator. OpenSearch 2.19.1 ships with the flow-framework plugin (version 2.19.1.0), which can orchestrate all three stages natively, but this post focuses on the retrieval layer and the integration pattern.
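The three stages can be sketched as a single function. This is a minimal illustration, not the post's tested code: `rag_answer`, `fake_retrieve`, `fake_generate`, and the chunk shape (`score`, `source`, `body`) are hypothetical names chosen for the example, and the two stand-in functions replace the real OpenSearch query and LLM call so the sketch runs without a cluster or an API key.

```python
from typing import Callable

Chunk = dict  # assumed keys in this sketch: "score", "source", "body"


def rag_answer(
    question: str,
    retrieve: Callable[[str], list[Chunk]],
    generate: Callable[[str], str],
) -> str:
    """Run the three RAG stages in order: retrieve, augment, generate."""
    # Stage 1 (Retrieve): in this post, a k-NN search against OpenSearch.
    chunks = retrieve(question)

    # Stage 2 (Augment): inject the retrieved text into a prompt template.
    context = "\n".join(f"[{c['source']}] {c['body']}" for c in chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

    # Stage 3 (Generate): any LLM client can sit behind this callable.
    return generate(prompt)


# Stand-ins so the sketch runs without a cluster or an API key.
def fake_retrieve(question: str) -> list[Chunk]:
    return [{
        "score": 0.78,
        "source": "postgresql-docs",
        "body": "VACUUM reclaims storage by marking dead tuples as reusable.",
    }]


def fake_generate(prompt: str) -> str:
    return f"(model output for a {len(prompt)}-character prompt)"


answer = rag_answer("how to handle dead tuples in postgres",
                    fake_retrieve, fake_generate)
print(answer)
```

Passing the retriever and generator in as callables keeps the OpenSearch query and the LLM provider swappable, which matches the split described above: OpenSearch for stage 1, application code for stages 2 and 3.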
Step 2: The Retrieval Query
The retrieval query is a standard k-NN search. OpenSearch embeds the user's query text with the same model used at index time and returns a list of document chunks ranked by cosine similarity.
curl -u app_user:YOUR_PASSWORD -k \
  -X POST "https://YOUR_OPENSEARCH_HOST:9200/articles-knn/_search" \
  -H "Content-Type: application/json" \
  -d '{
    "size": 4,
    "_source": ["title", "body", "category"],
    "query": {
      "neural": {
        "body_embedding": {
          "query_text": "speeding up slow database queries",
          "model_id": "YOUR_MODEL_ID",
          "k": 4
        }
      }
    }
  }'
The neural query type handles embedding the query text internally. You do not need to call the model separately and pass the vector manually.
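If the query is issued from application code rather than curl, the same request body can be built as a dictionary. The sketch below only constructs and prints the body; `build_neural_query` is a hypothetical helper name, and the field name `body_embedding` and the `_source` list are copied from the curl example above.

```python
import json


def build_neural_query(query_text: str, model_id: str, k: int = 4,
                       field: str = "body_embedding") -> dict:
    """Return the _search request body for a neural (k-NN) query."""
    return {
        "size": k,
        "_source": ["title", "body", "category"],
        "query": {
            "neural": {
                field: {
                    "query_text": query_text,
                    "model_id": model_id,
                    "k": k,
                }
            }
        },
    }


body = build_neural_query("speeding up slow database queries", "YOUR_MODEL_ID")
print(json.dumps(body, indent=2))
```

The resulting JSON can be sent with any HTTP client as a POST to the index's _search endpoint, using the same credentials as the curl command.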
Results for "speeding up slow database queries":
{"results": [
{"score": 0.8420365, "title": "SQL Query Optimization", "category": "databases"},
{"score": 0.78991985, "title": "Database Indexing Strategies", "category": "databases"},
{"score": 0.666918, "title": "PostgreSQL vs MySQL", "category": "databases"},
{"score": 0.6576359, "title": "Time Series Databases", "category": "databases"}
]}
Results for "how do machines learn from data":
{"results": [
{"score": 0.7610047, "title": "Introduction to Neural Networks", "category": "machine-learning"},
{"score": 0.7246155, "title": "Gradient Descent and Backpropagation", "category": "machine-learning"},
{"score": 0.7074836, "title": "Deep Learning for Image Recognition", "category": "machine-learning"},
{"score": 0.6563286, "title": "Transformer Architecture Explained", "category": "machine-learning"}
]}
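The result lists above are condensed for readability. A raw _search response nests each hit under `hits.hits`, with the score in `_score` and the requested fields in `_source`. A minimal sketch of flattening that shape into the rows shown above follows; `summarize_hits` is a hypothetical helper, and `sample_response` is a hand-written fragment that reuses two of the scores above rather than a captured response.

```python
def summarize_hits(response: dict) -> list[dict]:
    """Flatten a raw _search response into score/title/category rows."""
    rows = []
    for hit in response.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        rows.append({
            "score": hit["_score"],
            "title": source.get("title"),
            "category": source.get("category"),
        })
    return rows


# Hand-written sample in the standard response shape (not a captured response).
sample_response = {
    "hits": {
        "hits": [
            {"_score": 0.8420365,
             "_source": {"title": "SQL Query Optimization",
                         "category": "databases"}},
            {"_score": 0.78991985,
             "_source": {"title": "Database Indexing Strategies",
                         "category": "databases"}},
        ]
    }
}

for row in summarize_hits(sample_response):
    print(row)
```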
Step 3: Apply a Score Threshold
Not all retrieved documents are relevant. Apply a minimum score threshold to exclude low-quality matches by setting the top-level min_score parameter on the search request; hits that score below it are dropped from the response. Based on the test results, a threshold of 0.65 is appropriate for this dataset: it keeps all four results for both queries (minimum score seen: 0.6563) while filtering out noise from off-topic documents.
curl -u app_user:YOUR_PASSWORD -k \
  -X POST "https://YOUR_OPENSEARCH_HOST:9200/articles-knn/_search" \
  -H "Content-Type: application/json" \
  -d '{
    "size": 4,
    "min_score": 0.65,
    "_source": ["title", "body"],
    "query": {
      "neural": {
        "body_embedding": {
          "query_text": "speeding up slow database queries",
          "model_id": "YOUR_MODEL_ID",
          "k": 4
        }
      }
    }
  }'
In practice, the appropriate threshold depends on your data. Run retrieval queries against a representative set of user questions and inspect where relevant results drop off in the score distribution.
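One way to inspect the score distribution is to apply candidate thresholds client-side and look for the largest drop between adjacent scores. The sketch below uses the four scores returned for "speeding up slow database queries" in Step 2; `filter_by_score` and `largest_score_gap` are hypothetical helpers, and the largest-gap heuristic is only one possible starting point for choosing a cutoff.

```python
def filter_by_score(results: list[dict], threshold: float) -> list[dict]:
    """Keep only results whose score is at or above the threshold."""
    return [r for r in results if r["score"] >= threshold]


def largest_score_gap(results: list[dict]) -> tuple[float, float]:
    """Return the adjacent (higher, lower) score pair with the biggest drop."""
    scores = sorted((r["score"] for r in results), reverse=True)
    pairs = list(zip(scores, scores[1:]))
    return max(pairs, key=lambda p: p[0] - p[1])


db_results = [
    {"score": 0.8420365, "title": "SQL Query Optimization"},
    {"score": 0.78991985, "title": "Database Indexing Strategies"},
    {"score": 0.666918, "title": "PostgreSQL vs MySQL"},
    {"score": 0.6576359, "title": "Time Series Databases"},
]

kept = filter_by_score(db_results, 0.65)
print(len(kept), "of", len(db_results), "results kept at 0.65")
print("largest drop is between", largest_score_gap(db_results))
```

For this query the biggest drop sits between the second and third results, so a stricter threshold around 0.7 would keep only the two most relevant titles.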
Step 4: Assemble the Prompt
Once you have the retrieved chunks, build the prompt in your application code. We tested this step with a knowledge base index containing 6 database documentation chunks (PostgreSQL VACUUM, indexes, replication, MySQL InnoDB, MongoDB sharding, connection pooling).
Query: "how to handle dead tuples in postgres"
Retrieved chunks:
[
{"score": 0.78043604, "title": "PostgreSQL VACUUM", "source": "postgresql-docs"},
{"score": 0.6208841, "title": "Connection Pooling", "source": "postgresql-docs"},
{"score": 0.5944809, "title": "PostgreSQL Indexes", "source": "postgresql-docs"}
]
The assembled prompt from the real test (keeping only chunks with a score greater than 0.5, which retains all three):
Answer the following question using ONLY the provided context.
If the context does not contain the answer, say so.
Context:
[postgresql-docs] PostgreSQL uses MVCC for concurrency control. Dead tuples
accumulate from updates and deletes. VACUUM reclaims storage by marking dead
tuples as reusable. AUTOVACUUM runs in the background based on configurable
thresholds.
[postgresql-docs] Connection pooling reduces database overhead by reusing
established connections. PgBouncer is the standard PostgreSQL connection pooler.
It supports transaction and session pooling modes. Transaction mode is
recommended for most workloads.
[postgresql-docs] B-tree indexes are the default index type in PostgreSQL. They
support equality and range queries. GIN indexes are used for full-text search
and JSONB. GiST indexes support geometric and range types.
Question: how to handle dead tuples in postgres
Answer:
The top chunk (score 0.78) is directly relevant: it explains VACUUM and AUTOVACUUM for dead tuple management. The other two chunks are PostgreSQL-related but not about dead tuples. A capable LLM should base its answer on the first chunk and largely ignore the other two, though tangential context still costs tokens, and a higher threshold would exclude it.
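The assembly step can be sketched in a few lines of application code. This is an illustration under assumptions, not the code used in the test: `build_prompt` and `PROMPT_TEMPLATE` are hypothetical names, the chunk bodies are shortened to their first sentences, and the exact whitespace between sections is a guess at the layout shown above.

```python
PROMPT_TEMPLATE = (
    "Answer the following question using ONLY the provided context.\n"
    "If the context does not contain the answer, say so.\n"
    "Context:\n"
    "{context}\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(question: str, chunks: list[dict],
                 min_score: float = 0.5) -> str:
    """Keep chunks scoring above min_score and format them into the template."""
    kept = [c for c in chunks if c["score"] > min_score]
    context = "\n".join(f"[{c['source']}] {c['body']}" for c in kept)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Scores and sources from the retrieval above; bodies shortened for brevity.
chunks = [
    {"score": 0.78043604, "source": "postgresql-docs",
     "body": "PostgreSQL uses MVCC for concurrency control."},
    {"score": 0.6208841, "source": "postgresql-docs",
     "body": "Connection pooling reduces database overhead by reusing "
             "established connections."},
    {"score": 0.5944809, "source": "postgresql-docs",
     "body": "B-tree indexes are the default index type in PostgreSQL."},
]

prompt = build_prompt("how to handle dead tuples in postgres", chunks)
print(prompt)
```

Raising `min_score` to 0.6 in this sketch would drop the index chunk and keep the other two, which shows how sensitive the assembled context is to the chosen threshold.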
Step 5: Call the LLM
Pass the assembled prompt to any LLM API. The structure is the same regardless of provider:
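import anthropic  # or openai, or any other client

# `prompt` is the assembled prompt string from Step 4.
client = anthropic.Anthropic(api_key="YOUR_LLM_API_KEY")
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": prompt}
    ],
)
# The reply arrives as a list of content blocks; print the text of the first.
print(response.content[0].text)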
The retrieved context sits in the user message alongside the question. The prompt instructs the LLM to answer only from that context, which reduces hallucination compared to asking the model to answer from training data alone.
Step 6: Orchestrate with the Flow Framework
OpenSearch 2.19.1 includes the flow-framework plugin (version 2.19.1.0), which can register, deploy, and chain ML operations in a single workflow definition. This allows you to define the entire RAG pipeline (model deployment, ingest pipeline creation, k-NN index creation, search pipeline with neural query) as a JSON template and apply it with a single API call.
# Check that flow-framework is available
curl -u app_user:YOUR_PASSWORD -k \
"https://YOUR_OPENSEARCH_HOST:9200/_cat/plugins?v" | grep flow
Expected output: one line per node that includes flow-framework and the version 2.19.1.0.
A flow-framework template for RAG defines steps in order: register model, deploy model, create ingest pipeline, create index, create search pipeline. Each step references the outputs of previous steps (for example, the model_id from the deploy step is passed to the ingest pipeline step). This removes the manual sequencing shown in the earlier steps of the vector search setup.
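The dependency idea can be illustrated without a cluster. The sketch below is not a validated flow-framework template: the node IDs, the `type` values, and the `previous_node_inputs` key reflect my understanding of the plugin's template format and should be checked against the flow-framework documentation before use. It only shows how declaring which earlier outputs each step consumes is enough to derive a valid execution order.

```python
from graphlib import TopologicalSorter

# Each node names the earlier nodes whose outputs it consumes.
# Keys and type names are illustrative assumptions, not a tested template.
nodes = [
    {"id": "register_model", "type": "register_local_pretrained_model",
     "previous_node_inputs": {}},
    {"id": "deploy_model", "type": "deploy_model",
     "previous_node_inputs": {"register_model": "model_id"}},
    {"id": "create_ingest_pipeline", "type": "create_ingest_pipeline",
     "previous_node_inputs": {"deploy_model": "model_id"}},
    {"id": "create_index", "type": "create_index",
     "previous_node_inputs": {"create_ingest_pipeline": "pipeline_id"}},
    {"id": "create_search_pipeline", "type": "create_search_pipeline",
     "previous_node_inputs": {"deploy_model": "model_id"}},
]


def execution_order(nodes: list[dict]) -> list[str]:
    """Order node IDs so every node runs after the nodes it depends on."""
    graph = {n["id"]: set(n["previous_node_inputs"]) for n in nodes}
    return list(TopologicalSorter(graph).static_order())


# Input order does not matter; the declared dependencies determine the order.
order = execution_order(list(reversed(nodes)))
print(order)
```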
Chunk Strategy
The articles in this test are short (100 to 200 words each). In a production RAG system with longer documents, you need a chunking strategy:
- Split documents into overlapping chunks of 256 to 512 tokens.
- Store each chunk as a separate OpenSearch document with a reference to the parent document ID.
- Retrieve chunks, then optionally re-fetch the parent documents for broader context.
The ingest pipeline approach applies directly to chunks. Index each chunk with the text_embedding processor and the same k-NN field configuration.
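The splitting step can be sketched as follows. This is a simplified illustration: `chunk_document` and the output keys `parent_id`, `chunk_index`, and `body` are hypothetical names, and whitespace-separated words stand in for model tokens, so the real token counts produced by the embedding model's tokenizer will differ.

```python
def chunk_document(doc_id: str, text: str, chunk_size: int = 256,
                   overlap: int = 32) -> list[dict]:
    """Split text into overlapping chunks that each point back to doc_id."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and below chunk_size")
    tokens = text.split()  # assumption: whitespace words approximate tokens
    step = chunk_size - overlap
    chunks = []
    for n, start in enumerate(range(0, len(tokens), step)):
        window = tokens[start:start + chunk_size]
        chunks.append({
            "parent_id": doc_id,   # reference to the parent document
            "chunk_index": n,
            "body": " ".join(window),
        })
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


sample = " ".join(f"w{i}" for i in range(600))
for c in chunk_document("doc-1", sample):
    print(c["chunk_index"], len(c["body"].split()))
```

Each returned dictionary can be indexed as its own OpenSearch document, with `parent_id` available for re-fetching the full parent after retrieval.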
RAG vs Keyword Search for Question Answering
| Property | Keyword search | RAG |
|---|---|---|
| Query "how to speed up queries" | Matches "queries" literally | Retrieves "SQL Query Optimization", "Database Indexing Strategies" by meaning |
| Answer generation | Returns raw documents | LLM synthesises an answer from retrieved context |
| Hallucination risk | Not applicable (no generation) | Reduced by grounding in retrieved context, but not zero |
| Latency | Single OpenSearch query | OpenSearch query plus LLM call |
| Cost | OpenSearch query cost only | OpenSearch query cost plus LLM token cost |
What's Next
- Improve retrieval quality with hybrid scoring by combining BM25 and vector search using Reciprocal Rank Fusion (RRF).
- Explore neural sparse search as an alternative retrieval approach with different latency characteristics.
- Tune indexing throughput for large embedding pipelines via OpenSearch performance settings.
Provision a FoundryDB OpenSearch cluster for your RAG pipeline at foundrydb.com. Documentation at docs.foundrydb.com.