Text Analyzers and Custom Mappings in OpenSearch
The analyzer you assign to a field determines how text is broken into tokens at index time and at query time. Choosing the wrong one produces surprising results: searches that miss obvious matches, or matches that should not have ranked at all. This post compares the built-in English and standard analyzers, then builds a custom analyzer with synonym support.
All examples were tested against OpenSearch 2.19.1 on FoundryDB staging.
How Analyzers Work
When you index a document, OpenSearch runs the field value through the analyzer pipeline and stores the resulting tokens. When you search, the query string goes through the same pipeline by default. A match happens when a query token is identical to an indexed token.
The pipeline has three stages:
- Character filters (optional): transform the raw text before tokenization (e.g., strip HTML)
- Tokenizer: split text into tokens (e.g., split on whitespace and punctuation)
- Token filters: transform tokens (e.g., lowercase, remove stop words, apply stemming, expand synonyms)
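The three stages can be modeled in a few lines of Python. This is a toy sketch of the idea, not how OpenSearch implements analysis: the function names, the regex tokenizer, and the three-word stop list are all assumptions made for illustration.

```python
import re

STOP_WORDS = {"the", "a", "is"}  # tiny illustrative stop list (assumption)

def strip_html(text):
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    """Tokenizer: keep runs of letters and digits (a crude stand-in for a real tokenizer)."""
    return re.findall(r"[A-Za-z0-9]+", text)

def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]

def remove_stop_words(tokens):
    """Token filter: drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]

def analyze(text):
    """Run the stages in order: character filter, tokenizer, token filters."""
    tokens = tokenize(strip_html(text))
    for token_filter in (lowercase, remove_stop_words):
        tokens = token_filter(tokens)
    return tokens

def matches(doc_text, query_text):
    """A document matches when at least one query token equals an indexed token."""
    return bool(set(analyze(doc_text)) & set(analyze(query_text)))

doc = "<p>The Database is FAST</p>"
print(analyze(doc))                   # ['database', 'fast']
print(matches(doc, "fast database"))  # True
print(matches(doc, "databases"))      # False: no stemming in this toy pipeline
```

The last line shows the kind of miss that a stemming analyzer is meant to prevent: without a stemmer, "databases" and "database" are different tokens.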
Comparing English vs Standard Analyzer
Use the _analyze API to inspect what tokens an analyzer produces without indexing anything:
# English analyzer
curl -u app_user:YOUR_DB_PASSWORD -k \
-X POST https://YOUR_HOST:9200/_analyze \
-H "Content-Type: application/json" \
-d '{
"analyzer": "english",
"text": "databases designing distributed systems"
}' | jq '[.tokens[].token]'
["databas", "design", "distribut", "system"]
# Standard analyzer
curl -u app_user:YOUR_DB_PASSWORD -k \
-X POST https://YOUR_HOST:9200/_analyze \
-H "Content-Type: application/json" \
-d '{
"analyzer": "standard",
"text": "databases designing distributed systems"
}' | jq '[.tokens[].token]'
["databases", "designing", "distributed", "systems"]
The English analyzer applies the Porter stemmer, which reduces words to a common stem that need not be a dictionary word ("databases" to "databas", "designing" to "design"). It also removes stop words ("the", "a", "is") before they reach the index.
The practical effect: with the English analyzer, a search for "database" matches documents containing "databases", "database", and "databases'". With the standard analyzer, "database" only matches documents that contain exactly that token after lowercasing.
When to Use Which
| Analyzer | Use case |
|---|---|
| english | Prose fields: descriptions, summaries, article bodies |
| standard | General text when stemming would be too aggressive |
| keyword | IDs, tags, genre labels, exact-match fields |
| Custom | Domain terminology, abbreviations, synonyms |
Building a Custom Analyzer with Synonyms
Custom analyzers are defined in the index settings under analysis. Here is an analyzer designed for technical content where common abbreviations should expand to their full forms.
curl -u app_user:YOUR_DB_PASSWORD -k \
  -X PUT https://YOUR_HOST:9200/tech-docs \
  -H "Content-Type: application/json" \
  -d '{
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "analysis": {
        "filter": {
          "tech_synonyms": {
            "type": "synonym",
            "synonyms": [
              "db => database",
              "k8s => kubernetes",
              "js => javascript",
              "ml => machine_learning"
            ]
          },
          "tech_stop": {
            "type": "stop",
            "stopwords": "_english_"
          }
        },
        "analyzer": {
          "tech_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "tech_synonyms", "tech_stop"]
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "title": { "type": "text", "analyzer": "tech_analyzer" },
        "content": { "type": "text", "analyzer": "tech_analyzer" },
        "tags": { "type": "keyword" }
      }
    }
  }'
The filter order matters. Lowercasing happens before synonym expansion, so the synonym rules work regardless of input case. Stop words are removed after synonyms so that stop words introduced by synonym expansion are cleaned up correctly.
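To see why the order matters, here is a toy Python model of the lowercase and synonym filters. It is a sketch under simplifying assumptions (single-token rules, plain dictionary lookup), and the function names are made up for illustration; the real synonym filter is considerably more capable.

```python
# Two of the left => right rules from tech_synonyms, as a lookup table.
SYNONYMS = {"db": "database", "k8s": "kubernetes"}

def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]

def apply_synonyms(tokens):
    """Token filter: replace a token with its mapped form when a rule matches exactly."""
    return [SYNONYMS.get(t, t) for t in tokens]

def run_filters(tokens, filters):
    """Apply the filters left to right, feeding each one the previous output."""
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

tokens = ["K8s", "and", "DB", "setup"]

# Order used by tech_analyzer: lowercase first, so the rules see "k8s" and "db".
print(run_filters(tokens, [lowercase, apply_synonyms]))
# ['kubernetes', 'and', 'database', 'setup']

# Reversed order: the rules see "K8s" and "DB", which match nothing in this model.
print(run_filters(tokens, [apply_synonyms, lowercase]))
# ['k8s', 'and', 'db', 'setup']
```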
Verifying the Analyzer Behavior
Test what the custom analyzer produces at index time:
curl -u app_user:YOUR_DB_PASSWORD -k \
-X POST https://YOUR_HOST:9200/tech-docs/_analyze \
-H "Content-Type: application/json" \
-d '{
"analyzer": "tech_analyzer",
"text": "k8s cluster setup"
}' | jq '[.tokens[].token]'
["kubernetes", "cluster", "setup"]
The explicit rule "k8s => kubernetes" replaces the abbreviation, so only kubernetes is stored as a token for this document. Now test what happens at query time when you search for "kubernetes":
curl -u app_user:YOUR_DB_PASSWORD -k \
-X POST https://YOUR_HOST:9200/tech-docs/_analyze \
-H "Content-Type: application/json" \
-d '{
"analyzer": "tech_analyzer",
"text": "kubernetes"
}' | jq '[.tokens[].token]'
["kubernetes"]
No rule has "kubernetes" on its left-hand side, so the query term passes through unchanged. Because the same analyzer rewrote "k8s" to "kubernetes" at index time, the query still matches documents whose source text contains either "k8s" or "kubernetes". To keep both forms as tokens instead, write the rule as an equivalence ("k8s, kubernetes").
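The synonym filter accepts two rule formats that behave differently: an explicit mapping ("a => b") replaces the left-hand term, while a comma-separated group ("a, b") treats the terms as equivalent and, with default settings, emits all of them. The toy Python parser below models only single-word rules and is an illustration, not the filter's actual implementation; parse_rules and expand are hypothetical names, and the order of tokens in real _analyze output may differ.

```python
def parse_rules(rules):
    """Build a token -> output-list table from Solr-style synonym rules."""
    table = {}
    for rule in rules:
        if "=>" in rule:
            # Explicit mapping: left-hand terms are replaced by the right-hand terms.
            left, right = rule.split("=>")
            outputs = [t.strip() for t in right.split(",")]
            for term in left.split(","):
                table[term.strip()] = outputs
        else:
            # Equivalence: every term in the group maps to the whole group.
            group = [t.strip() for t in rule.split(",")]
            for term in group:
                table[term] = group
    return table

def expand(tokens, table):
    """Rewrite each token through the table; unknown tokens pass through unchanged."""
    out = []
    for token in tokens:
        out.extend(table.get(token, [token]))
    return out

explicit = parse_rules(["k8s => kubernetes"])
equivalent = parse_rules(["k8s, kubernetes"])

print(expand(["k8s", "cluster"], explicit))    # ['kubernetes', 'cluster']
print(expand(["kubernetes"], explicit))        # ['kubernetes']
print(expand(["k8s", "cluster"], equivalent))  # ['k8s', 'kubernetes', 'cluster']
print(expand(["kubernetes"], equivalent))      # ['k8s', 'kubernetes']
```

With the explicit form, both spellings converge on one stored token. With the equivalence form, both tokens are stored and both are searched.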
Index Time vs Search Time Synonyms
Using the same analyzer at both index time and search time (as above) is the simplest approach and works well for most cases. The alternative is to define separate analyzers: one for indexing (no synonym filter) and one for searching (synonyms applied only at query time), assigned to the field through the analyzer and search_analyzer mapping parameters. The search-time-only approach avoids index bloat from synonym expansion but requires maintaining two analyzer definitions.
For most developer-facing search use cases, using a single analyzer for both is the pragmatic choice.
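For comparison, the search-time-only variant might look roughly like the index body below, built here as a Python dictionary so it can be printed as JSON and sent with the same PUT request used earlier. The names tech_index, tech_search, and tech_synonyms_search are hypothetical. The rules are written as equivalences on purpose: when nothing is rewritten at index time, a query for "kubernetes" has to expand to "k8s" as well, or it would miss documents that only use the abbreviation.

```python
import json

index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Equivalence rules, applied only by the search analyzer.
                "tech_synonyms_search": {
                    "type": "synonym",
                    "synonyms": ["db, database", "k8s, kubernetes"],
                },
                "tech_stop": {"type": "stop", "stopwords": "_english_"},
            },
            "analyzer": {
                # Index side: no synonym filter, so the index stores the text as written.
                "tech_index": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "tech_stop"],
                },
                # Search side: the same chain plus synonym expansion.
                "tech_search": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "tech_synonyms_search", "tech_stop"],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "tech_index",
                "search_analyzer": "tech_search",
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```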
Synonym Search in Practice
Index a document that uses the abbreviation:
curl -u app_user:YOUR_DB_PASSWORD -k \
-X POST https://YOUR_HOST:9200/tech-docs/_doc/1 \
-H "Content-Type: application/json" \
-d '{
"title": "k8s cluster setup",
"content": "How to configure a k8s cluster with db connections"
}'
curl -u app_user:YOUR_DB_PASSWORD -k \
-X POST https://YOUR_HOST:9200/tech-docs/_refresh
Now search using the expanded form:
curl -u app_user:YOUR_DB_PASSWORD -k \
-X POST https://YOUR_HOST:9200/tech-docs/_search \
-H "Content-Type: application/json" \
-d '{
"query": {
"match": { "content": "kubernetes" }
}
}' | jq '.hits.total.value'
Returns 1. On staging, synonym matching worked in both directions:
- Searching "kubernetes" finds docs containing "k8s"
- Searching "database" finds docs containing "db"
- Searching "db" finds docs containing "database"
Multi-Field Mappings
Sometimes you want to index the same field with multiple analyzers, for example to support both full-text search and exact-match filtering on the same value. Use the fields parameter:
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "keyword": {
            "type": "keyword"
          },
          "tech": {
            "type": "text",
            "analyzer": "tech_analyzer"
          }
        }
      }
    }
  }
}
This creates three representations of title:
- title (text, English analyzer): for full-text search with stemming
- title.keyword (keyword): for exact-match filtering, sorting, and aggregations
- title.tech (text, custom analyzer): for synonym-aware search
Reference the sub-field in queries with dot notation: { "match": { "title.tech": "kubernetes" } }.
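One way to use all three representations in a single request is sketched below as a Python function that builds the search body. The function name title_search is hypothetical, and combining the fields with multi_match is one reasonable choice among several, not something the mapping requires.

```python
import json

def title_search(text):
    """Build a search body that queries both analyzed views of title and sorts on the exact value."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                # Stemmed view and synonym-aware view of the same source value.
                "fields": ["title", "title.tech"],
            }
        },
        # Sorting needs the unanalyzed keyword sub-field.
        "sort": [{"title.keyword": "asc"}],
    }

print(json.dumps(title_search("k8s clusters"), indent=2))
```

The printed JSON can be passed as the -d payload of a _search request like the one shown earlier.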
Refresh Before Searching
OpenSearch writes documents to an in-memory buffer and periodically refreshes the index, which turns the buffered documents into a searchable segment. The default refresh interval is 1 second. In scripts and tests, call _refresh explicitly after bulk ingestion to avoid stale results:
curl -u app_user:YOUR_DB_PASSWORD -k \
-X POST https://YOUR_HOST:9200/INDEX_NAME/_refresh
Do not call _refresh on every individual document insert in production. It is an expensive operation. Batch your writes and refresh once at the end, or rely on the automatic 1-second interval.
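Batching usually means the _bulk API, which takes newline-delimited JSON: an action line followed by a source line for each document. The Python helper below is a minimal sketch that only builds that body; build_bulk_body and the sample documents are made up for illustration, and sending the request is left to whichever HTTP client you already use.

```python
import json

def build_bulk_body(index, docs):
    """Return an NDJSON body for _bulk: one action line, then one source line, per document."""
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"

docs = {
    "1": {"title": "k8s cluster setup"},
    "2": {"title": "db connection pooling"},
}

body = build_bulk_body("tech-docs", docs)
print(body, end="")
```

POST the body to the _bulk endpoint with Content-Type: application/x-ndjson, then call _refresh on the index once, after the last batch.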
Summary: Mapping Strategy
| Field type | Recommended mapping | Rationale |
|---|---|---|
| Prose descriptions | text + english analyzer | Stemming improves recall |
| Titles | text + english, with .keyword sub-field | Search + sort/aggregate |
| IDs, tags, genres | keyword | Exact match, no analysis |
| Domain abbreviations | text + custom synonym analyzer | Controlled vocabulary |
| Numeric filters | integer, float | Range queries, aggregations |
| Dates | date | Range queries, date math |
What's Next
This post completes the three-part OpenSearch series on FoundryDB. From here:
- Provisioning an OpenSearch Cluster on FoundryDB and Connecting via TLS
- Creating Indexes, Ingesting Data, and Running Full-Text Searches
- Browse the FoundryDB docs for index lifecycle management, backups, and multi-node scaling
Ready to try it? Sign up for FoundryDB or provision an OpenSearch cluster from your existing dashboard.