Skip to main content

Log Analytics in OpenSearch with PPL: Filter, Aggregate, and Correlate

· 5 min read
FoundryDB Team
Engineering @ FoundryDB

OpenSearch SQL is familiar, but Piped Processing Language (PPL) is often better for log analysis: its pipeline syntax maps naturally to how you think about filtering and aggregating event streams. Each | stage narrows or transforms the result set from the previous one. This post runs five PPL queries against a 200-document structured log dataset on a live OpenSearch 2.19.1 cluster managed by FoundryDB. Every number below is from a real query.

All commands use YOUR_OPENSEARCH_HOST and YOUR_PASSWORD as placeholders.

Prerequisites

  • A running FoundryDB OpenSearch cluster.
  • curl and jq installed locally.

The Dataset

200 log documents were ingested into app-logs-2026.04.14. Each document has these fields:

FieldTypeExample values
timestampdate2026-04-14T18:04:48Z
levelkeywordINFO, WARN, ERROR
servicekeywordapi, auth, worker, scheduler
pathkeyword/v1/users, /v1/orders, /v1/health, /v1/auth/login, /v1/products
status_codeinteger200, 201, 400, 401, 403, 500, 503
latency_msintegervaries
messagetextfree-form log message

All PPL queries go to the same endpoint:

curl -u app_user:YOUR_PASSWORD -k \
-X POST "https://YOUR_OPENSEARCH_HOST:9200/_plugins/_ppl" \
-H "Content-Type: application/json" \
-d '{"query": "YOUR_PPL_QUERY"}'

Query 1: Error Count by Service

Find which services are producing the most errors:

source=app-logs-2026.04.14 | where level = 'ERROR' | stats count() as error_count by service | sort -error_count
curl -u app_user:YOUR_PASSWORD -k \
-X POST "https://YOUR_OPENSEARCH_HOST:9200/_plugins/_ppl" \
-H "Content-Type: application/json" \
-d '{"query": "source=app-logs-2026.04.14 | where level = '\''ERROR'\'' | stats count() as error_count by service | sort -error_count"}'

Results:

service    | error_count
-----------|------------
worker | 16
api | 11
scheduler | 8
auth | 7

The worker service generates more than twice as many errors as auth. This is the first place to look when diagnosing reliability issues in this dataset.

Query 2: Average and Maximum Latency by Path

Identify which endpoints are slowest on average:

source=app-logs-2026.04.14 | stats avg(latency_ms) as avg_latency, max(latency_ms) as max_latency by path | sort -avg_latency

Results:

path              | avg_latency | max_latency
------------------|-------------|------------
/v1/orders | 495.4ms | 839ms
/v1/users | 446.6ms | 844ms
/v1/health | 443.8ms | 850ms
/v1/auth/login | 435.8ms | 827ms
/v1/products | 410.3ms | 819ms

/v1/orders has the highest average latency at 495.4 ms and a max of 839 ms. The health check endpoint (/v1/health) at 443.8 ms average is worth investigating: health checks typically complete in single-digit milliseconds and should not be this slow.

Query 3: Top 5 Slowest 5xx Errors

Combine a status code filter with a latency sort to find the worst-case errors:

source=app-logs-2026.04.14 | where status_code >= 500 | fields timestamp, service, path, status_code, latency_ms | sort -latency_ms | head 5

Results:

timestamp            | service   | path             | status_code | latency_ms
---------------------|-----------|------------------|-------------|----------
2026-04-14 18:36:36 | api | /v1/orders | 500 | 818ms
2026-04-14 18:51:00 | auth | /v1/users | 500 | 803ms
2026-04-14 18:31:30 | api | /v1/health | 500 | 791ms
2026-04-14 18:04:48 | scheduler | /v1/auth/login | 500 | 768ms
2026-04-14 18:09:54 | worker | /v1/orders | 503 | 763ms

These are the five requests that combined a server-side error with the worst response times. They are the highest-priority incidents to investigate. Note that /v1/orders appears twice, confirming it is under stress.

Query 4: Status Code Distribution

Get a count of every status code across all requests:

source=app-logs-2026.04.14 | stats count() as count by status_code | sort status_code

Results:

status_code | count
------------|------
200 | 66
201 | 29
400 | 21
401 | 19
403 | 23
500 | 23
503 | 19

Out of 200 requests: 95 were successful (200 + 201), 63 were client errors (400 + 401 + 403), and 42 were server errors (500 + 503). A server error rate of 21% (42 / 200) in a production system would require immediate attention.

Query 5: High-Latency Requests by Service

Find services where a large proportion of requests are taking over 500 ms:

source=app-logs-2026.04.14 | where latency_ms > 500 | stats count() as slow_requests, avg(latency_ms) as avg_slow_latency by service | sort -slow_requests

Results:

service   | slow_requests | avg_slow_latency
----------|---------------|----------------
scheduler | 25 | 680.8ms
auth | 24 | 673.0ms
api | 23 | 641.0ms
worker | 22 | 655.6ms

All four services have a similar count of slow requests (22 to 25), which suggests a systemic issue rather than a problem with a specific service. When slow requests are distributed evenly, look at shared infrastructure: database connection pool exhaustion, network latency, or a shared upstream dependency.

PPL Syntax Reference

PPL uses a pipeline model. Each stage receives the output of the previous stage.

CommandPurposeExample
source=Select the indexsource=app-logs-*
whereFilter rowswhere level = 'ERROR'
fieldsSelect columnsfields timestamp, service, path
statsAggregatestats count() as n, avg(latency_ms) as avg by service
sortSort resultssort -latency_ms (descending)
headLimit rowshead 10
dedupRemove duplicatesdedup service

There is no GROUP BY syntax in PPL. Grouping is expressed as the by clause in stats. This mirrors the mental model of "I want statistics, broken down by field", which is more natural for log analysis than SQL's aggregation syntax.

What's Next

  • Automate cleanup of old log indices with Index State Management (ISM) in OpenSearch.
  • Restrict PPL access to specific indices per team using fine-grained access control.

Provision a FoundryDB OpenSearch cluster for your log analytics pipeline at foundrydb.com. Documentation at docs.foundrydb.com.