Creating Indexes, Ingesting Data, and Running Full-Text Searches in OpenSearch

April 14, 2026 · 6 min read

Engineering @ FoundryDB

With your OpenSearch cluster running on FoundryDB, the next step is putting data into it and querying it. This post covers explicit index mappings, the bulk ingestion API, full-text search with relevance scoring, fuzzy matching, bool queries, and aggregations.

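All results are from real queries run against OpenSearch 2.19.1 on FoundryDB staging. Replace YOUR_HOST and YOUR_DB_PASSWORD with your cluster values throughout. The examples pass -k to curl, which skips TLS certificate verification; omit it once your client trusts the cluster certificate.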

Figure: OpenSearch cluster, query fan-out and gather. With the cluster green, the coordinator fans a search out to one copy of each shard (primary P0-P2 or replica R0-R2) on the data nodes, then gathers the hits. The cluster-manager handles shard allocation.

Prerequisites

A running FoundryDB OpenSearch cluster
curl and jq installed locally

Step 1: Create an Index with Explicit Mappings

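Always define mappings explicitly rather than relying on dynamic mapping. Dynamic mapping guesses each field's type from the first value it sees: by default a string becomes a text field analyzed with the standard analyzer plus a keyword sub-field, and a whole number becomes a long. A wrong guess changes search behavior in non-obvious ways, for example no stemming on title, or aggregations on author that fail unless you target the author.keyword sub-field.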

curl -u app_user:YOUR_DB_PASSWORD -k \
  -X PUT https://YOUR_HOST:9200/books \
  -H "Content-Type: application/json" \
  -d '{
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "mappings": {
      "properties": {
        "title":       { "type": "text", "analyzer": "english" },
        "author":      { "type": "keyword" },
        "description": { "type": "text", "analyzer": "english" },
        "year":        { "type": "integer" },
        "genre":       { "type": "keyword" },
        "rating":      { "type": "float" }
      }
    }
  }'

Key decisions here:

text with "analyzer": "english" applies stemming and stop-word removal, so "databases" and "database" both match the same indexed tokens.
keyword fields are not analyzed. Use them for exact-match filtering, sorting, and aggregations (author, genre). Numeric fields such as year and rating support range filters, sorting, and numeric aggregations.
number_of_replicas: 0 keeps the cluster status green on a single-node setup.
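To make the difference from dynamic mapping concrete, here is a rough Python sketch of the types OpenSearch would likely infer for a sample document, compared with the explicit mapping above. It is a simplified model: it ignores date detection and dynamic templates, and infer_dynamic_type and mapping_differences are hypothetical helper names, not part of any OpenSearch client.

```python
# Simplified model of the default dynamic mapping rules.
# Assumption: date detection and dynamic templates are ignored.

EXPLICIT = {  # field types taken from the mapping above
    "title": "text", "author": "keyword", "description": "text",
    "year": "integer", "genre": "keyword", "rating": "float",
}

def infer_dynamic_type(value):
    """Return the type dynamic mapping would likely pick for a JSON value."""
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "text + keyword sub-field"
    return "object"

def mapping_differences(doc):
    """Map each field to (inferred, explicit) where the two types differ."""
    return {
        field: (infer_dynamic_type(value), EXPLICIT[field])
        for field, value in doc.items()
        if infer_dynamic_type(value) != EXPLICIT[field]
    }

sample = {"title": "Clean Code", "author": "Robert Martin", "year": 2008,
          "genre": "technology", "rating": 4.6}
diffs = mapping_differences(sample)
for field, (inferred, explicit) in diffs.items():
    print(f"{field}: dynamic={inferred!r}, explicit={explicit!r}")
```

In this model only rating ends up with the same type either way; every string field and the year field would be mapped differently from what the explicit mapping declares.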

Step 2: Bulk Ingest Documents

The bulk API uses NDJSON format: each document takes two lines, an action line and a source line. Pairs follow one another with no separator other than the newline, and the body must end with a newline.

curl -u app_user:YOUR_DB_PASSWORD -k \
  -X POST https://YOUR_HOST:9200/books/_bulk \
  -H "Content-Type: application/x-ndjson" \
  -d '
{"index": {"_id": "1"}}
{"title": "Designing Data-Intensive Applications", "author": "Martin Kleppmann", "description": "Deep dive into data systems covering databases replication and consistency", "year": 2017, "genre": "technology", "rating": 4.9}
{"index": {"_id": "2"}}
{"title": "The Pragmatic Programmer", "author": "David Thomas", "description": "Practical advice for software developers on improving code quality", "year": 1999, "genre": "technology", "rating": 4.8}
{"index": {"_id": "3"}}
{"title": "Database Internals", "author": "Alex Petrov", "description": "A deep dive into how distributed databases and storage engines work", "year": 2019, "genre": "technology", "rating": 4.8}
{"index": {"_id": "4"}}
{"title": "Clean Code", "author": "Robert Martin", "description": "Principles patterns and practices of writing clean maintainable code", "year": 2008, "genre": "technology", "rating": 4.6}
{"index": {"_id": "5"}}
{"title": "The Linux Command Line", "author": "William Shotts", "description": "Complete introduction to Linux shell commands and scripting", "year": 2012, "genre": "technology", "rating": 4.7}
{"index": {"_id": "6"}}
{"title": "Sapiens", "author": "Yuval Noah Harari", "description": "A brief history of humankind from prehistoric times to modernity", "year": 2011, "genre": "history", "rating": 4.5}
{"index": {"_id": "7"}}
{"title": "Thinking Fast and Slow", "author": "Daniel Kahneman", "description": "Explores the two systems of thinking that drive human judgment and decision making", "year": 2011, "genre": "psychology", "rating": 4.5}
'
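If the documents already live in application code, the action and source pairs can be generated rather than typed by hand. A minimal Python sketch, assuming sequential string IDs like the ones above; build_bulk_body is a hypothetical helper, not part of any OpenSearch client:

```python
import json

def build_bulk_body(docs):
    """Serialize documents as NDJSON action/source pairs for the _bulk API."""
    lines = []
    for doc_id, doc in enumerate(docs, start=1):
        # Action line: index this document under an explicit _id.
        lines.append(json.dumps({"index": {"_id": str(doc_id)}}))
        # Source line: json.dumps emits a single line, as NDJSON requires.
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"

docs = [
    {"title": "Sapiens", "author": "Yuval Noah Harari", "year": 2011},
    {"title": "Clean Code", "author": "Robert Martin", "year": 2008},
]
body = build_bulk_body(docs)
print(body, end="")
```

To send the result with curl, write it to a file and use --data-binary @file. The -d @file form strips newlines, which breaks NDJSON.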

All 7 documents indexed successfully in one request on staging. After bulk ingestion, force a refresh before querying if you need results immediately (the default refresh interval is 1 second in production, but in test scripts you want to be explicit):

curl -u app_user:YOUR_DB_PASSWORD -k \
  -X POST https://YOUR_HOST:9200/books/_refresh
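A bulk request can come back with HTTP 200 even when individual items were rejected, so it is worth checking the response body rather than the status code alone. The Python sketch below walks a parsed _bulk response and collects the failures. The sample response is hand-written for illustration (it was not captured from staging), and failed_items is a hypothetical helper name.

```python
def failed_items(bulk_response):
    """Return (_id, status, error type) for every rejected bulk item."""
    if not bulk_response.get("errors"):
        return []  # top-level flag is false: every item succeeded
    failures = []
    for item in bulk_response["items"]:
        # Each item is keyed by its action name ("index", "create", ...).
        for result in item.values():
            if "error" in result:
                failures.append(
                    (result.get("_id"), result["status"], result["error"]["type"])
                )
    return failures

# Hand-written example of a partially failed response (illustrative only).
sample_response = {
    "took": 12,
    "errors": True,
    "items": [
        {"index": {"_id": "1", "status": 201}},
        {"index": {"_id": "2", "status": 400,
                   "error": {"type": "mapper_parsing_exception",
                             "reason": "failed to parse field [year]"}}},
    ],
}
failures = failed_items(sample_response)
print(failures)
```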

Step 3: Full-Text Search with Field Boosting

multi_match searches across multiple fields simultaneously. Boosting a field (title^2) increases the relevance score for matches in that field relative to other fields.

curl -u app_user:YOUR_DB_PASSWORD -k \
  -X POST https://YOUR_HOST:9200/books/_search \
  -H "Content-Type: application/json" \
  -d '{
    "query": {
      "multi_match": {
        "query": "databases",
        "fields": ["title^2", "description"]
      }
    },
    "highlight": {
      "fields": {
        "description": {}
      }
    }
  }' | jq '.hits.hits[] | {score: ._score, title: ._source.title, author: ._source.author, highlight: .highlight.description[0]}'

Results from staging:

{
  "score": 3.61,
  "title": "Database Internals",
  "author": "Alex Petrov",
  "highlight": "A deep dive into how distributed <em>databases</em> and storage engines work..."
}
{
  "score": 1.23,
  "title": "Designing Data-Intensive Applications",
  "author": "Martin Kleppmann",
  "highlight": "Deep dive into data systems covering <em>databases</em> replication..."
}

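Two documents matched. "Database Internals" scored higher because it matches in the boosted title field as well as the description: the english analyzer stems "Database" in the title and "databases" in the query to the same token. With the default best_fields type, multi_match scores each document by its best-matching field, which here is the boosted title. The highlighter wraps each matched term in <em> tags for display in search UIs.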
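The way a field boost feeds into the final score can be sketched in a few lines of Python. This models the default best_fields behaviour as "take the highest boosted per-field score, plus tie_breaker times the rest"; the per-field scores are invented placeholders, not the BM25 values OpenSearch computed on staging, and best_fields_score is a hypothetical helper.

```python
def best_fields_score(field_scores, boosts, tie_breaker=0.0):
    """Combine per-field scores the way a best_fields multi_match would."""
    boosted = [score * boosts.get(field, 1.0)
               for field, score in field_scores.items()]
    if not boosted:
        return 0.0
    best = max(boosted)
    # With the default tie_breaker of 0.0, only the best field counts.
    return best + tie_breaker * (sum(boosted) - best)

boosts = {"title": 2.0}  # equivalent of "title^2"

# Placeholder per-field scores (assumed values, for illustration only).
title_and_description = best_fields_score({"title": 1.5, "description": 1.0}, boosts)
description_only = best_fields_score({"description": 1.0}, boosts)
print(title_and_description, description_only)
```

A document that matches only in the description keeps its raw score, while a title match is doubled before the comparison, which is why title hits tend to rank first.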

Step 4: Fuzzy Search

Fuzzy search handles typos and misspellings by allowing a configurable edit distance between the query term and indexed tokens.

curl -u app_user:YOUR_DB_PASSWORD -k \
  -X POST https://YOUR_HOST:9200/books/_search \
  -H "Content-Type: application/json" \
  -d '{
    "query": {
      "match": {
        "title": {
          "query": "pragmatik",
          "fuzziness": "AUTO"
        }
      }
    }
  }' | jq '.hits.hits[] | {score: ._score, title: ._source.title}'

Result from staging (misspelling "pragmatik" found "The Pragmatic Programmer"):

{
  "score": 1.29,
  "title": "The Pragmatic Programmer"
}

fuzziness: "AUTO" sets the allowed edit distance based on term length: 0 edits for 1-2 character terms, 1 edit for 3-5 characters, 2 edits for longer terms. This is the right default for most search boxes.
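The length thresholds are easy to model. The Python sketch below pairs them with a plain Levenshtein distance to decide whether a query term would be close enough to a candidate token. Two simplifications to keep in mind: OpenSearch compares against analyzed tokens (which may be stemmed) and by default counts a transposition as a single edit, whereas this sketch compares raw words and counts a transposition as two. The function names are hypothetical.

```python
def auto_fuzziness(term):
    """Allowed edits under fuzziness AUTO, based on term length."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2

def edit_distance(a, b):
    """Classic Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def fuzzy_matches(query_term, token):
    """True if token is within the AUTO edit budget of query_term."""
    return edit_distance(query_term, token) <= auto_fuzziness(query_term)

print(auto_fuzziness("pragmatik"), edit_distance("pragmatik", "pragmatic"))
print(fuzzy_matches("pragmatik", "pragmatic"))
```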

Step 5: Bool Query (Filter + Must)

Bool queries combine clauses. The critical distinction: must clauses contribute to relevance scoring, filter clauses do not. Use filter for structured conditions (ranges, terms, dates) where scoring is irrelevant and caching is a benefit.

curl -u app_user:YOUR_DB_PASSWORD -k \
  -X POST https://YOUR_HOST:9200/books/_search \
  -H "Content-Type: application/json" \
  -d '{
    "query": {
      "bool": {
        "must": [
          { "match": { "genre": "technology" } }
        ],
        "filter": [
          { "range": { "rating": { "gte": 4.7 } } }
        ]
      }
    },
    "sort": [{ "rating": { "order": "desc" } }],
    "_source": ["title", "rating"]
  }' | jq '.hits.hits[]._source'

Results from staging (technology books rated 4.7 or higher):

{"title": "Designing Data-Intensive Applications", "rating": 4.9}
{"title": "The Pragmatic Programmer", "rating": 4.8}
{"title": "Database Internals", "rating": 4.8}
{"title": "The Linux Command Line", "rating": 4.7}

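Four documents matched. The filter clause on rating is applied without scoring overhead, and OpenSearch can cache it for repeated queries. Because the results are sorted by rating, the score from the must clause does not affect the order here, so the genre condition could equally be written as a term filter.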
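The must versus filter distinction can be modelled in memory with a few lines of Python. In this sketch a filter only includes or excludes a document, while a must clause includes or excludes it and also adds to its score. The score of 1.0 is a placeholder rather than a real relevance value, and bool_query is a hypothetical helper.

```python
BOOKS = [
    {"title": "Designing Data-Intensive Applications", "genre": "technology", "rating": 4.9},
    {"title": "The Pragmatic Programmer", "genre": "technology", "rating": 4.8},
    {"title": "Database Internals", "genre": "technology", "rating": 4.8},
    {"title": "Clean Code", "genre": "technology", "rating": 4.6},
    {"title": "The Linux Command Line", "genre": "technology", "rating": 4.7},
    {"title": "Sapiens", "genre": "history", "rating": 4.5},
    {"title": "Thinking Fast and Slow", "genre": "psychology", "rating": 4.5},
]

def bool_query(docs, must, filters):
    """Keep docs passing every filter and every must clause; sum must scores."""
    hits = []
    for doc in docs:
        if not all(accept(doc) for accept in filters):
            continue  # filter clauses: yes/no only, no score
        scores = [clause(doc) for clause in must]
        if any(score is None for score in scores):
            continue  # a must clause did not match
        hits.append({"score": sum(scores), **doc})
    return hits

def genre_is_technology(doc):
    # Must clause: placeholder score of 1.0 on a match, None otherwise.
    return 1.0 if doc["genre"] == "technology" else None

def rating_at_least_4_7(doc):
    # Filter clause: plain boolean.
    return doc["rating"] >= 4.7

hits = bool_query(BOOKS, must=[genre_is_technology], filters=[rating_at_least_4_7])
hits.sort(key=lambda hit: hit["rating"], reverse=True)  # mirrors the "sort" clause
for hit in hits:
    print(hit["title"], hit["rating"], hit["score"])
```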

Step 6: Aggregations

Aggregations run analytics over matched documents. They can be combined with a query (to aggregate over a filtered subset) or run across the full index.

curl -u app_user:YOUR_DB_PASSWORD -k \
  -X POST https://YOUR_HOST:9200/books/_search \
  -H "Content-Type: application/json" \
  -d '{
    "size": 0,
    "aggs": {
      "genres": {
        "terms": { "field": "genre" }
      },
      "avg_rating": {
        "avg": { "field": "rating" }
      },
      "rating_distribution": {
        "histogram": {
          "field": "rating",
          "interval": 0.1
        }
      }
    }
  }' | jq '{
    genres: [.aggregations.genres.buckets[] | {genre: .key, count: .doc_count}],
    avg_rating: .aggregations.avg_rating.value,
    rating_distribution: [.aggregations.rating_distribution.buckets[] | select(.doc_count > 0) | {rating: .key, count: .doc_count}]
  }'

Results from staging:

{
  "genres": [
    {"genre": "technology", "count": 5},
    {"genre": "history", "count": 1},
    {"genre": "psychology", "count": 1}
  ],
  "avg_rating": 4.69,
  "rating_distribution": [
    {"rating": 4.5, "count": 3},
    {"rating": 4.6, "count": 1},
    {"rating": 4.8, "count": 2},
    {"rating": 4.9, "count": 1}
  ]
}

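size: 0 suppresses the document hits so you only get the aggregation results. This is standard practice when you only need analytics. Note that the histogram counts three books at 4.5 and none at 4.7, although the ingested ratings include one 4.6 and one 4.7: rating is a single-precision float, so those two values are stored just below the decimal you typed and land one bucket lower.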
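To see where each rating lands, the Python sketch below replays the three aggregations over the seven ingested documents. It rests on two assumptions: that float fields hold single-precision values, and that a histogram bucket key is floor(value / interval) * interval. The helper names are hypothetical.

```python
import math
import struct
from collections import Counter

# (genre, rating) for documents 1-7, in ingestion order.
BOOKS = [("technology", 4.9), ("technology", 4.8), ("technology", 4.8),
         ("technology", 4.6), ("technology", 4.7),
         ("history", 4.5), ("psychology", 4.5)]

def as_float32(value):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]

def histogram(values, interval):
    """Count values per bucket, keyed by the bucket's lower bound."""
    buckets = Counter(math.floor(value / interval) for value in values)
    # Round the key only for display; bucket assignment used the raw value.
    return {round(bucket * interval, 6): count
            for bucket, count in sorted(buckets.items())}

stored = [as_float32(rating) for _, rating in BOOKS]

genres = Counter(genre for genre, _ in BOOKS)     # terms aggregation
avg_rating = sum(stored) / len(stored)            # avg aggregation
rating_distribution = histogram(stored, 0.1)      # histogram aggregation

print(dict(genres))
print(avg_rating)
print(rating_distribution)
```

Under these assumptions 4.6 is stored as roughly 4.5999999 and 4.7 as roughly 4.6999998, so they fall into the 4.5 and 4.6 buckets, which matches the distribution returned on staging.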

What's Next

The queries above work well with the default English analyzer on text fields. The next post goes deeper into analyzer configuration, custom synonym filters, and multi-field mappings for domain-specific search.

Ready to try it? Sign up for FoundryDB or provision an OpenSearch cluster from your existing dashboard.
