Neural Sparse Search in OpenSearch: Semantic Matching Without a GPU
Dense vector search (k-NN) is powerful, but documents must be embedded with a neural model at index time and every query must be embedded again before the nearest-neighbor lookup can run. Neural sparse search takes a different approach: at index time a model expands each document into tokens with learned weights and stores them in a rank_features field, so matching at query time is a fast weighted term lookup rather than a vector computation. The result is semantic search with no GPU requirement at query time. This post shows the full setup on a live OpenSearch 2.19.1 cluster managed by FoundryDB.
All commands use YOUR_OPENSEARCH_HOST and YOUR_PASSWORD as placeholders.
Prerequisites
- A running FoundryDB OpenSearch cluster.
- ML Commons settings configured for a single-node cluster (see the vector search k-NN guide, Step 1).
Step 1: Register and Deploy the Sparse Encoding Model
The model used is amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v2-distill, version 1.0.0. This is Amazon's distilled sparse encoder, designed specifically for doc-only mode.
curl -u app_user:YOUR_PASSWORD -k \
-X POST "https://YOUR_OPENSEARCH_HOST:9200/_plugins/_ml/models/_register" \
-H "Content-Type: application/json" \
-d '{
"name": "amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v2-distill",
"version": "1.0.0",
"model_format": "TORCH_SCRIPT"
}'
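The register call returns a task ID; once that task completes, it reports the model ID to use as YOUR_SPARSE_MODEL_ID. On this cluster, registration completed in approximately 15 seconds and the deployment below in approximately 10 seconds.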
# Deploy after registration completes
curl -u app_user:YOUR_PASSWORD -k \
-X POST "https://YOUR_OPENSEARCH_HOST:9200/_plugins/_ml/models/YOUR_SPARSE_MODEL_ID/_deploy"
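The model ID needed for the deploy call can be read from the task that _register creates. Here is a minimal Python sketch of that lookup, assuming the ML Commons task endpoint (GET /_plugins/_ml/tasks/<task_id>) and a task body that carries state and model_id fields. The helper names are illustrative and not part of the original setup.

```python
import base64
import json
import ssl
import urllib.request

HOST = "https://YOUR_OPENSEARCH_HOST:9200"  # placeholder, as in the curl commands


def task_url(task_id: str) -> str:
    """URL of the ML Commons task created by the _register call."""
    return f"{HOST}/_plugins/_ml/tasks/{task_id}"


def extract_model_id(task: dict):
    """Return the model ID once the task has completed, else None.

    Assumption: the task body exposes "state" and "model_id" fields.
    """
    state = task.get("state")
    if state == "FAILED":
        raise RuntimeError(f"model registration failed: {task.get('error')}")
    return task.get("model_id") if state == "COMPLETED" else None


def fetch_task(task_id: str, user: str, password: str) -> dict:
    """GET the task document (illustrative helper; nothing calls it on import)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        task_url(task_id), headers={"Authorization": f"Basic {token}"}
    )
    # Mirrors curl -k: certificate verification is switched off.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context) as response:
        return json.load(response)
```

Calling fetch_task in a loop with a short sleep until extract_model_id returns a value is one way to wait for registration before deploying.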
Step 2: Create the Ingest Pipeline
The processor type is sparse_encoding, not text_embedding. The field_map maps the source text field (body) to the output field (body_sparse), and that output field must be mapped as rank_features.
curl -u app_user:YOUR_PASSWORD -k \
-X PUT "https://YOUR_OPENSEARCH_HOST:9200/_ingest/pipeline/sparse-encoding-pipeline" \
-H "Content-Type: application/json" \
-d '{
"description": "Sparse encoding for neural sparse search",
"processors": [
{
"sparse_encoding": {
"model_id": "YOUR_SPARSE_MODEL_ID",
"field_map": {
"body": "body_sparse"
}
}
}
]
}'
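Before creating the index, the pipeline can be tried against a sample document with the ingest simulate endpoint (POST /_ingest/pipeline/<name>/_simulate). The sketch below only builds the request path and JSON body. The function name and the sample sentence are illustrative assumptions.

```python
import json


def simulate_request(pipeline: str, body_text: str) -> tuple:
    """Build the path and JSON payload for an ingest pipeline simulation."""
    path = f"/_ingest/pipeline/{pipeline}/_simulate"
    # One sample document whose "body" field feeds the sparse_encoding processor.
    payload = {"docs": [{"_source": {"body": body_text}}]}
    return path, json.dumps(payload)


path, payload = simulate_request(
    "sparse-encoding-pipeline",
    "Neural networks learn patterns from data.",  # made-up sample text
)
```

Sending payload to path with the same credentials as the curl commands should return the sample document with a body_sparse map added, provided the model is deployed.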
Step 3: Create the Index with rank_features Field
The critical difference from k-NN: the sparse field uses rank_features, not knn_vector. There is no dimension to specify. OpenSearch stores a map of token strings to float weights.
curl -u app_user:YOUR_PASSWORD -k \
-X PUT "https://YOUR_OPENSEARCH_HOST:9200/articles-sparse" \
-H "Content-Type: application/json" \
-d '{
"settings": {
"default_pipeline": "sparse-encoding-pipeline",
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"properties": {
"title": {"type": "text"},
"body": {"type": "text"},
"category": {"type": "keyword"},
"body_sparse": {"type": "rank_features"}
}
}
}'
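With the index in place, documents can be loaded through the _bulk endpoint, and the default pipeline adds body_sparse to each one. The sketch below builds the newline-delimited bulk body. The title and category come from this post, while the body text is a placeholder and the function name is an assumption.

```python
import json


def build_bulk_body(index: str, docs: list) -> str:
    """Serialize documents as NDJSON for POST /_bulk.

    Each document is preceded by an action line, and the body must end
    with a trailing newline.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


sample_docs = [
    {
        "title": "Introduction to Neural Networks",
        "body": "Placeholder text; replace with the real article body.",
        "category": "machine-learning",
    }
]
bulk_body = build_bulk_body("articles-sparse", sample_docs)
```

The resulting string is sent to /_bulk with the header Content-Type: application/x-ndjson.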
Step 4: Inspect the Stored Token Weights
After ingesting a document, you can retrieve the raw sparse representation to understand what the model learned. Here are the top 10 tokens by weight for the "Introduction to Neural Networks" article:
[
{"token": "network", "weight": 1.3393087},
{"token": "neural", "weight": 1.3120507},
{"token": "brain", "weight": 1.2210519},
{"token": "networks", "weight": 1.1138264},
{"token": "model", "weight": 0.89889926},
{"token": "learning", "weight": 0.8906413},
{"token": "learn", "weight": 0.8564229},
{"token": "patterns", "weight": 0.8360702},
{"token": "models", "weight": 0.82590413},
{"token": "neurons", "weight": 0.8258773}
]
The model expanded the vocabulary automatically: related forms such as "neural" and "neurons", "network" and "networks", and "learn" and "learning" all receive weight. This is term expansion through learned associations, not stemming. Each weight represents how strongly the token characterises the document in the model's vocabulary space.
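A ranked list like the one above can be derived from a stored document. The sketch assumes that fetching the document (for example with GET /articles-sparse/_doc/<id>) returns body_sparse in _source as a map of token to weight. The function name is illustrative, and the three sample weights are copied from the output above.

```python
def top_tokens(sparse: dict, n: int = 10) -> list:
    """Return the n highest-weighted tokens from a token-to-weight map."""
    ranked = sorted(sparse.items(), key=lambda item: item[1], reverse=True)
    return [{"token": token, "weight": weight} for token, weight in ranked[:n]]


# Subset of the stored weights shown above, deliberately out of order.
body_sparse = {"brain": 1.2210519, "network": 1.3393087, "neural": 1.3120507}
top_two = top_tokens(body_sparse, n=2)
```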
Step 5: Run Neural Sparse Search Queries
The query type is neural_sparse. Pass the raw text and the model ID; OpenSearch turns the text into sparse token weights before matching, which is the only model-side work performed at query time.
curl -u app_user:YOUR_PASSWORD -k \
-X POST "https://YOUR_OPENSEARCH_HOST:9200/articles-sparse/_search" \
-H "Content-Type: application/json" \
-d '{
"size": 5,
"_source": ["title", "category"],
"query": {
"neural_sparse": {
"body_sparse": {
"query_text": "how machines learn",
"model_id": "YOUR_SPARSE_MODEL_ID"
}
}
}
}'
Results for "how machines learn":
{"results": [
{"score": 4.638024, "title": "Deep Learning for Image Recognition", "category": "machine-learning"},
{"score": 3.7028239, "title": "Introduction to Neural Networks", "category": "machine-learning"},
{"score": 3.3865764, "title": "Gradient Descent and Backpropagation", "category": "machine-learning"},
{"score": 0.5032142, "title": "Transformer Architecture Explained", "category": "machine-learning"}
]}
Results for "slow database query fix":
{"results": [
{"score": 19.170017, "title": "SQL Query Optimization", "category": "databases"},
{"score": 9.8290825, "title": "Database Indexing Strategies", "category": "databases"},
{"score": 6.0629683, "title": "Time Series Databases", "category": "databases"},
{"score": 5.0210533, "title": "PostgreSQL vs MySQL", "category": "databases"}
]}
The scores are not cosine similarities (0 to 1) as in dense vector search. They are dot products between the query sparse vector and the document sparse vector, so they can be larger than 1. What matters is the relative ranking and the gap between results.
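To make the scoring concrete, the sketch below computes a sparse dot product by hand. The document weights are taken from Step 4, while the query weights are invented for illustration, since the post does not show the real query expansion. It ignores any precision loss in how rank_features stores weights.

```python
def sparse_dot(query_weights: dict, doc_weights: dict) -> float:
    """Sum the products of weights for tokens present on both sides."""
    return sum(
        weight * doc_weights[token]
        for token, weight in query_weights.items()
        if token in doc_weights
    )


# Stored document weights (from Step 4).
doc = {"learning": 0.8906413, "learn": 0.8564229, "model": 0.89889926}
# Hypothetical query weights; "machines" has no match in the document.
query = {"learn": 1.0, "machines": 0.5}

score = sparse_dot(query, doc)  # only "learn" contributes: 1.0 * 0.8564229
```

Only tokens that appear in both maps contribute, which is why the match behaves like a weighted term lookup and why scores grow with the number and weight of shared tokens.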
How Doc-Only Mode Works
In doc-only mode, the sparse encoding model runs once per document at index time and stores the expanded token weights in rank_features. At query time, the raw query text is converted into token weights as well, and OpenSearch computes the dot product between the query weights and each document's stored weights. This is equivalent to a weighted term match, which is fast and is served by standard inverted index infrastructure.
The alternative is bi-encoder mode, where the full sparse encoding model expands every query in the same way it expands documents. Doc-only mode keeps the query side lightweight, trading some relevance quality for much lower query latency and no GPU requirement at query time.
Comparing Sparse vs Dense Vector Search
| Property | Neural sparse (doc-only) | Dense k-NN |
|---|---|---|
| GPU at query time | Not required | Not required (Lucene engine) |
| GPU at index time | Not required | Not required |
| Score range | Unbounded dot product | 0 to 1 (cosine) |
| Field type | rank_features | knn_vector |
| Representation | Map of expanded tokens to weights | 384-dim float vector |
| Explainability | Token weights visible | Black box |
| Index size | Typically larger | Fixed per vector (dimension × 4 bytes) |
What's Next
- Hybrid search combining BM25 and vector search with RRF
- Vector search in OpenSearch: embeddings, k-NN indexes, and HNSW
- Building a RAG pipeline with OpenSearch as the vector store
Provision a FoundryDB OpenSearch cluster at foundrydb.com. Documentation at docs.foundrydb.com.