Log Analytics in OpenSearch with PPL: Filter, Aggregate, and Correlate

April 14, 2026 · 5 min read

Engineering @ FoundryDB

OpenSearch SQL is familiar, but Piped Processing Language (PPL) is often better for log analysis: its pipeline syntax maps naturally to how you think about filtering and aggregating event streams. Each | stage narrows or transforms the result set from the previous one. This post runs five PPL queries against a 200-document structured log dataset on a live OpenSearch 2.19.1 cluster managed by FoundryDB. Every number below is from a real query.

All commands use YOUR_OPENSEARCH_HOST and YOUR_PASSWORD as placeholders.

OpenSearch cluster, query fan-out & gather

Cluster green · search fans out to one copy per shard, then gathers

Coordinatorfan-out / gatherquery →Data nodesP0 P1 P2 · R0 R1 R2⇠ hitsCluster-managershard allocation

Cluster-managerCoordinatorData nodePrimary shardReplica shardcluster state / gather (dashed)

Prerequisites

A running FoundryDB OpenSearch cluster.
curl and jq installed locally.

The Dataset

200 log documents were ingested into app-logs-2026.04.14. Each document has these fields:

Field	Type	Example values
`timestamp`	date	2026-04-14T18:04:48Z
`level`	keyword	INFO, WARN, ERROR
`service`	keyword	api, auth, worker, scheduler
`path`	keyword	/v1/users, /v1/orders, /v1/health, /v1/auth/login, /v1/products
`status_code`	integer	200, 201, 400, 401, 403, 500, 503
`latency_ms`	integer	varies
`message`	text	free-form log message

All PPL queries go to the same endpoint:

curl -u app_user:YOUR_PASSWORD -k \
  -X POST "https://YOUR_OPENSEARCH_HOST:9200/_plugins/_ppl" \
  -H "Content-Type: application/json" \
  -d '{"query": "YOUR_PPL_QUERY"}'

Query 1: Error Count by Service

Find which services are producing the most errors:

source=app-logs-2026.04.14 | where level = 'ERROR' | stats count() as error_count by service | sort -error_count

curl -u app_user:YOUR_PASSWORD -k \
  -X POST "https://YOUR_OPENSEARCH_HOST:9200/_plugins/_ppl" \
  -H "Content-Type: application/json" \
  -d '{"query": "source=app-logs-2026.04.14 | where level = '\''ERROR'\'' | stats count() as error_count by service | sort -error_count"}'

Results:

service    | error_count
-----------|------------
worker     | 16
api        | 11
scheduler  |  8
auth       |  7

The worker service generates more than twice as many errors as auth. This is the first place to look when diagnosing reliability issues in this dataset.

Query 2: Average and Maximum Latency by Path

Identify which endpoints are slowest on average:

source=app-logs-2026.04.14 | stats avg(latency_ms) as avg_latency, max(latency_ms) as max_latency by path | sort -avg_latency

Results:

path              | avg_latency | max_latency
------------------|-------------|------------
/v1/orders        |     495.4ms |      839ms
/v1/users         |     446.6ms |      844ms
/v1/health        |     443.8ms |      850ms
/v1/auth/login    |     435.8ms |      827ms
/v1/products      |     410.3ms |      819ms

/v1/orders has the highest average latency at 495.4 ms and a max of 839 ms. The health check endpoint (/v1/health) at 443.8 ms average is worth investigating: health checks typically complete in single-digit milliseconds and should not be this slow.

Query 3: Top 5 Slowest 5xx Errors

Combine a status code filter with a latency sort to find the worst-case errors:

source=app-logs-2026.04.14 | where status_code >= 500 | fields timestamp, service, path, status_code, latency_ms | sort -latency_ms | head 5

Results:

timestamp            | service   | path             | status_code | latency_ms
---------------------|-----------|------------------|-------------|----------
2026-04-14 18:36:36  | api       | /v1/orders       | 500         | 818ms
2026-04-14 18:51:00  | auth      | /v1/users        | 500         | 803ms
2026-04-14 18:31:30  | api       | /v1/health       | 500         | 791ms
2026-04-14 18:04:48  | scheduler | /v1/auth/login   | 500         | 768ms
2026-04-14 18:09:54  | worker    | /v1/orders       | 503         | 763ms

These are the five requests that combined a server-side error with the worst response times. They are the highest-priority incidents to investigate. Note that /v1/orders appears twice, confirming it is under stress.

Query 4: Status Code Distribution

Get a count of every status code across all requests:

source=app-logs-2026.04.14 | stats count() as count by status_code | sort status_code

Results:

status_code | count
------------|------
       |    66
       |    29
       |    21
       |    19
       |    23
       |    23
       |    19

Out of 200 requests: 95 were successful (200 + 201), 63 were client errors (400 + 401 + 403), and 42 were server errors (500 + 503). A server error rate of 21% (42 / 200) in a production system would require immediate attention.

Query 5: High-Latency Requests by Service

Find services where a large proportion of requests are taking over 500 ms:

source=app-logs-2026.04.14 | where latency_ms > 500 | stats count() as slow_requests, avg(latency_ms) as avg_slow_latency by service | sort -slow_requests

Results:

service   | slow_requests | avg_slow_latency
----------|---------------|----------------
scheduler |            25 |         680.8ms
auth      |            24 |         673.0ms
api       |            23 |         641.0ms
worker    |            22 |         655.6ms

All four services have a similar count of slow requests (22 to 25), which suggests a systemic issue rather than a problem with a specific service. When slow requests are distributed evenly, look at shared infrastructure: database connection pool exhaustion, network latency, or a shared upstream dependency.

PPL Syntax Reference

PPL uses a pipeline model. Each stage receives the output of the previous stage.

Command	Purpose	Example
`source=`	Select the index	`source=app-logs-*`
`where`	Filter rows	`where level = 'ERROR'`
`fields`	Select columns	`fields timestamp, service, path`
`stats`	Aggregate	`stats count() as n, avg(latency_ms) as avg by service`
`sort`	Sort results	`sort -latency_ms` (descending)
`head`	Limit rows	`head 10`
`dedup`	Remove duplicates	`dedup service`

There is no GROUP BY syntax in PPL. Grouping is expressed as the by clause in stats. This mirrors the mental model of "I want statistics, broken down by field", which is more natural for log analysis than SQL's aggregation syntax.

What's Next

Automate cleanup of old log indices with Index State Management (ISM) in OpenSearch.
Restrict PPL access to specific indices per team using fine-grained access control.

Provision a FoundryDB OpenSearch cluster for your log analytics pipeline at foundrydb.com. Documentation at docs.foundrydb.com.

Prerequisites​

The Dataset​

Query 1: Error Count by Service​

Query 2: Average and Maximum Latency by Path​

Query 3: Top 5 Slowest 5xx Errors​

Query 4: Status Code Distribution​

Query 5: High-Latency Requests by Service​

PPL Syntax Reference​

What's Next​

Prerequisites

The Dataset

Query 1: Error Count by Service

Query 2: Average and Maximum Latency by Path

Query 3: Top 5 Slowest 5xx Errors

Query 4: Status Code Distribution

Query 5: High-Latency Requests by Service

PPL Syntax Reference

What's Next