Text Analyzers and Custom Mappings in OpenSearch

April 14, 2026 · 7 min read

Engineering @ FoundryDB

The analyzer you assign to a field determines how text is broken into tokens at index time and at query time. Choosing the wrong one produces surprising results: searches that miss obvious matches, or matches that should not have ranked at all. This post compares the built-in English and standard analyzers, then builds a custom analyzer with synonym support.

All examples were tested against OpenSearch 2.19.1 on FoundryDB staging.

Figure: OpenSearch cluster, query fan-out and gather. With the cluster green, the coordinator fans a search out to one copy of each shard on the data nodes (primaries P0, P1, P2 or replicas R0, R1, R2), then gathers the hits. The cluster-manager handles shard allocation.

How Analyzers Work

When you index a document, OpenSearch runs the field value through the analyzer pipeline and stores the resulting tokens. When you search with a full-text query such as match, the query string goes through the same pipeline by default. A match happens when a query token equals an indexed token.

The pipeline has three stages:

Character filters (optional): transform the raw text before tokenization (e.g., strip HTML)
Tokenizer: split text into tokens (e.g., split on whitespace and punctuation)
Token filters: transform tokens (e.g., lowercase, remove stop words, apply stemming, expand synonyms)
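
The three stages can also be exercised directly, because the _analyze API accepts an ad-hoc char_filter, tokenizer, and filter list in place of a named analyzer. The following is a minimal sketch rather than one of the tested examples: the sample text is made up, and OPENSEARCH_URL is a hypothetical environment variable standing in for https://YOUR_HOST:9200. When it is unset, the script only prints the request body.

```shell
# Ad-hoc pipeline: one built-in component per stage.
BODY='{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "<p>The QUICK Foxes</p>"
}'

# OPENSEARCH_URL is an assumed variable, e.g. https://YOUR_HOST:9200.
if [ -n "${OPENSEARCH_URL:-}" ]; then
  curl -u app_user:YOUR_DB_PASSWORD -k \
    -X POST "$OPENSEARCH_URL/_analyze" \
    -H "Content-Type: application/json" \
    -d "$BODY" | jq '[.tokens[].token]'
else
  # Dry run: show what would be sent.
  printf '%s\n' "$BODY"
fi
```

Removing one entry at a time from the request (for example, dropping stop from the filter list) shows which stage is responsible for each change in the token list.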

Comparing English vs Standard Analyzer

Use the _analyze API to inspect what tokens an analyzer produces without indexing anything:

# English analyzer
curl -u app_user:YOUR_DB_PASSWORD -k \
  -X POST https://YOUR_HOST:9200/_analyze \
  -H "Content-Type: application/json" \
  -d '{
    "analyzer": "english",
    "text": "databases designing distributed systems"
  }' | jq '[.tokens[].token]'

["databas", "design", "distribut", "system"]

# Standard analyzer
curl -u app_user:YOUR_DB_PASSWORD -k \
  -X POST https://YOUR_HOST:9200/_analyze \
  -H "Content-Type: application/json" \
  -d '{
    "analyzer": "standard",
    "text": "databases designing distributed systems"
  }' | jq '[.tokens[].token]'

["databases", "designing", "distributed", "systems"]

The English analyzer applies the Porter stemmer, which reduces words to their root form ("databases" to "databas", "designing" to "design"). It also removes stop words ("the", "a", "is") before they reach the index.

The practical effect: with the English analyzer, a search for "database" matches documents containing "databases", "database", and "databases'". With the standard analyzer, "database" only matches documents that contain exactly that token after lowercasing.

When to Use Which

Analyzer	Use case
`english`	Prose fields: descriptions, summaries, article bodies
`standard`	General text when stemming would be too aggressive
`keyword`	IDs, tags, genre labels, exact-match fields
Custom	Domain terminology, abbreviations, synonyms

Building a Custom Analyzer with Synonyms

Custom analyzers are defined in the index settings under analysis. Here is an analyzer designed for technical content where common abbreviations should expand to their full forms. The synonym rules use the comma-separated equivalent form, so each term expands to both itself and its counterpart.

curl -u app_user:YOUR_DB_PASSWORD -k \
  -X PUT https://YOUR_HOST:9200/tech-docs \
  -H "Content-Type: application/json" \
  -d '{
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "analysis": {
        "filter": {
          "tech_synonyms": {
            "type": "synonym",
            "synonyms": [
              "db, database",
              "k8s, kubernetes",
              "js, javascript",
              "ml, machine_learning"
            ]
          },
          "tech_stop": {
            "type": "stop",
            "stopwords": "_english_"
          }
        },
        "analyzer": {
          "tech_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "tech_synonyms", "tech_stop"]
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "title":   { "type": "text", "analyzer": "tech_analyzer" },
        "content": { "type": "text", "analyzer": "tech_analyzer" },
        "tags":    { "type": "keyword" }
      }
    }
  }'

The filter order matters. Lowercasing happens before synonym expansion, so the synonym rules work regardless of input case. Stop words are removed after synonyms so that stop words introduced by synonym expansion are cleaned up correctly.
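
One way to check the ordering claim is to send mixed-case input through the analyzer. This is a sketch that assumes the tech-docs index above already exists; the input text is illustrative, and OPENSEARCH_URL is a hypothetical environment variable standing in for https://YOUR_HOST:9200 (unset means dry run).

```shell
# Upper-case abbreviation: only matches the synonym rule if lowercase runs first.
BODY='{
  "analyzer": "tech_analyzer",
  "text": "K8S Cluster Setup"
}'

# OPENSEARCH_URL is an assumed variable, e.g. https://YOUR_HOST:9200.
if [ -n "${OPENSEARCH_URL:-}" ]; then
  curl -u app_user:YOUR_DB_PASSWORD -k \
    -X POST "$OPENSEARCH_URL/tech-docs/_analyze" \
    -H "Content-Type: application/json" \
    -d "$BODY" | jq '[.tokens[].token]'
else
  # Dry run: show what would be sent.
  printf '%s\n' "$BODY"
fi
```

If the ordering works as described, kubernetes should appear in the token list even though the input was upper case. With lowercase placed after tech_synonyms, the K8S token would most likely pass through the synonym filter unmatched, because the rules are written in lower case.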

Verifying the Analyzer Behavior

Test what the custom analyzer produces at index time:

curl -u app_user:YOUR_DB_PASSWORD -k \
  -X POST https://YOUR_HOST:9200/tech-docs/_analyze \
  -H "Content-Type: application/json" \
  -d '{
    "analyzer": "tech_analyzer",
    "text": "k8s cluster setup"
  }' | jq '[.tokens[].token]'

["k8s", "kubernetes", "cluster", "setup"]

Both k8s and kubernetes are stored as tokens for this document. Now test what happens at query time when you search for "kubernetes":

curl -u app_user:YOUR_DB_PASSWORD -k \
  -X POST https://YOUR_HOST:9200/tech-docs/_analyze \
  -H "Content-Type: application/json" \
  -d '{
    "analyzer": "tech_analyzer",
    "text": "kubernetes"
  }' | jq '[.tokens[].token]'

["kubernetes", "k8s"]

Because the same analyzer runs at query time, "kubernetes" expands to both tokens. This means the query matches documents that contain either "k8s" or "kubernetes" in the indexed field.

Index Time vs Search Time Synonyms

Using the same analyzer at both index time and search time (as above) is the simplest approach and works well for most cases. The alternative is to define separate analyzers: one for indexing (no synonym filter) and one for searching (synonyms applied only at query time), with the second assigned through the field's search_analyzer mapping parameter. The search-time-only approach avoids index bloat from synonym expansion and lets you change the synonym list without reindexing existing documents, but it requires maintaining two analyzer definitions.

For most developer-facing search use cases, using a single analyzer for both is the pragmatic choice.
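
If the search-time-only variant is needed, it might look like the sketch below. The index name tech-docs-search-syn and the analyzer names tech_index_analyzer and tech_search_analyzer are illustrative and were not part of the tested setup; the stop filter is left out for brevity. The rules use the comma-separated equivalent form so that a query for either term expands to both. OPENSEARCH_URL is a hypothetical environment variable standing in for https://YOUR_HOST:9200 (unset means dry run).

```shell
# The index-time analyzer has no synonym filter; only the search-time one expands.
BODY='{
  "settings": {
    "analysis": {
      "filter": {
        "tech_synonyms": {
          "type": "synonym",
          "synonyms": ["db, database", "k8s, kubernetes"]
        }
      },
      "analyzer": {
        "tech_index_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        },
        "tech_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "tech_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "tech_index_analyzer",
        "search_analyzer": "tech_search_analyzer"
      }
    }
  }
}'

# OPENSEARCH_URL is an assumed variable, e.g. https://YOUR_HOST:9200.
if [ -n "${OPENSEARCH_URL:-}" ]; then
  curl -u app_user:YOUR_DB_PASSWORD -k \
    -X PUT "$OPENSEARCH_URL/tech-docs-search-syn" \
    -H "Content-Type: application/json" \
    -d "$BODY"
else
  # Dry run: show what would be sent.
  printf '%s\n' "$BODY"
fi
```

With this mapping, a document containing "k8s" is indexed with that single token, and a match query for "kubernetes" is expanded at query time to cover both forms.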

Synonym Search in Practice

Index a document that uses the abbreviation:

curl -u app_user:YOUR_DB_PASSWORD -k \
  -X POST https://YOUR_HOST:9200/tech-docs/_doc/1 \
  -H "Content-Type: application/json" \
  -d '{
    "title": "k8s cluster setup",
    "content": "How to configure a k8s cluster with db connections"
  }'

curl -u app_user:YOUR_DB_PASSWORD -k \
  -X POST https://YOUR_HOST:9200/tech-docs/_refresh

Now search using the expanded form:

curl -u app_user:YOUR_DB_PASSWORD -k \
  -X POST https://YOUR_HOST:9200/tech-docs/_search \
  -H "Content-Type: application/json" \
  -d '{
    "query": {
      "match": { "content": "kubernetes" }
    }
  }' | jq '.hits.total.value'

Returns 1. Staging confirmed bidirectional synonym resolution:

Searching "kubernetes" finds docs containing "k8s"
Searching "database" finds docs containing "db"
Searching "db" finds docs containing "database"

Multi-Field Mappings

Sometimes you want to index the same field with multiple analyzers, for example to support both full-text search and exact-match filtering on the same value. Use the fields parameter:

{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "keyword": {
            "type": "keyword"
          },
          "tech": {
            "type": "text",
            "analyzer": "tech_analyzer"
          }
        }
      }
    }
  }
}

This creates three representations of title:

title (text, English analyzer): for full-text search with stemming
title.keyword (keyword): for exact-match filtering, sorting, and aggregations
title.tech (text, custom analyzer): for synonym-aware search

Reference the sub-field in queries with dot notation: { "match": { "title.tech": "kubernetes" } }.
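
As a sketch of how the sub-fields combine in one request, the query below searches title.tech, then sorts and aggregates on title.keyword. It assumes an index created with the mapping above and a tech_analyzer defined in its settings; INDEX_NAME is a placeholder, the aggregation name titles is arbitrary, and OPENSEARCH_URL is a hypothetical environment variable standing in for https://YOUR_HOST:9200 (unset means dry run).

```shell
# Full-text match on the synonym-aware sub-field,
# exact-value sort and terms aggregation on the keyword sub-field.
BODY='{
  "query": { "match": { "title.tech": "kubernetes" } },
  "sort": [ { "title.keyword": "asc" } ],
  "aggs": {
    "titles": { "terms": { "field": "title.keyword" } }
  }
}'

# OPENSEARCH_URL is an assumed variable, e.g. https://YOUR_HOST:9200.
if [ -n "${OPENSEARCH_URL:-}" ]; then
  curl -u app_user:YOUR_DB_PASSWORD -k \
    -X POST "$OPENSEARCH_URL/INDEX_NAME/_search" \
    -H "Content-Type: application/json" \
    -d "$BODY" | jq '[.hits.hits[]._source.title]'
else
  # Dry run: show what would be sent.
  printf '%s\n' "$BODY"
fi
```

Sorting or aggregating on title itself would not work the same way, since analyzed text fields are not meant for those operations; that is the reason for keeping the keyword sub-field.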

Refresh Before Searching

OpenSearch writes documents to an in-memory buffer and periodically refreshes the index, which makes the buffered documents searchable as a new segment. The default refresh interval is 1 second. In scripts and tests, call _refresh explicitly after bulk ingestion to avoid stale results:

curl -u app_user:YOUR_DB_PASSWORD -k \
  -X POST https://YOUR_HOST:9200/INDEX_NAME/_refresh

Do not call _refresh on every individual document insert in production. It is an expensive operation. Batch your writes and refresh once at the end, or rely on the automatic 1-second interval.
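
A batched write followed by a single refresh could be sketched as follows. The two documents and their IDs are made up for illustration, the tech-docs index is assumed to exist, and OPENSEARCH_URL is a hypothetical environment variable standing in for https://YOUR_HOST:9200 (unset means dry run).

```shell
# NDJSON bulk body: action line, then document line; must end with a newline.
BULK_BODY='{ "index": { "_index": "tech-docs", "_id": "2" } }
{ "title": "db tuning", "content": "connection pool sizing for a db" }
{ "index": { "_index": "tech-docs", "_id": "3" } }
{ "title": "ml pipelines", "content": "serving ml models on k8s" }
'

# OPENSEARCH_URL is an assumed variable, e.g. https://YOUR_HOST:9200.
if [ -n "${OPENSEARCH_URL:-}" ]; then
  # One bulk request for all documents.
  curl -u app_user:YOUR_DB_PASSWORD -k \
    -X POST "$OPENSEARCH_URL/_bulk" \
    -H "Content-Type: application/x-ndjson" \
    --data-binary "$BULK_BODY"
  # A single refresh after the batch makes everything searchable.
  curl -u app_user:YOUR_DB_PASSWORD -k \
    -X POST "$OPENSEARCH_URL/tech-docs/_refresh"
else
  # Dry run: show what would be sent.
  printf '%s' "$BULK_BODY"
fi
```

For a single write in a test, appending ?refresh=wait_for to the index request is another option: the call returns once a refresh has made the document visible, without forcing an extra refresh.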

Summary: Mapping Strategy

Field type	Recommended mapping	Rationale
Prose descriptions	`text` + `english` analyzer	Stemming improves recall
Titles	`text` + `english`, with `.keyword` sub-field	Search + sort/aggregate
IDs, tags, genres	`keyword`	Exact match, no analysis
Domain abbreviations	`text` + custom synonym analyzer	Controlled vocabulary
Numeric filters	`integer`, `float`	Range queries, aggregations
Dates	`date`	Range queries, date math

What's Next

This post completes the three-part OpenSearch series on FoundryDB. From here:

Provisioning an OpenSearch Cluster on FoundryDB and Connecting via TLS
Creating Indexes, Ingesting Data, and Running Full-Text Searches
Browse the FoundryDB docs for index lifecycle management, backups, and multi-node scaling

Ready to try it? Sign up for FoundryDB or provision an OpenSearch cluster from your existing dashboard.
