Log Analytics in OpenSearch with PPL: Filter, Aggregate, and Correlate
OpenSearch SQL is familiar, but Piped Processing Language (PPL) is often better for log analysis: its pipeline syntax maps naturally to how you think about filtering and aggregating event streams. Each | stage narrows or transforms the result set from the previous one. This post runs five PPL queries against a 200-document structured log dataset on a live OpenSearch 2.19.1 cluster managed by FoundryDB. Every number below is from a real query.
All commands use YOUR_OPENSEARCH_HOST and YOUR_PASSWORD as placeholders.
Prerequisites
- A running FoundryDB OpenSearch cluster.
curlandjqinstalled locally.
The Dataset
200 log documents were ingested into app-logs-2026.04.14. Each document has these fields:
| Field | Type | Example values |
|---|---|---|
timestamp | date | 2026-04-14T18:04:48Z |
level | keyword | INFO, WARN, ERROR |
service | keyword | api, auth, worker, scheduler |
path | keyword | /v1/users, /v1/orders, /v1/health, /v1/auth/login, /v1/products |
status_code | integer | 200, 201, 400, 401, 403, 500, 503 |
latency_ms | integer | varies |
message | text | free-form log message |
All PPL queries go to the same endpoint:
curl -u app_user:YOUR_PASSWORD -k \
-X POST "https://YOUR_OPENSEARCH_HOST:9200/_plugins/_ppl" \
-H "Content-Type: application/json" \
-d '{"query": "YOUR_PPL_QUERY"}'
Query 1: Error Count by Service
Find which services are producing the most errors:
source=app-logs-2026.04.14 | where level = 'ERROR' | stats count() as error_count by service | sort -error_count
curl -u app_user:YOUR_PASSWORD -k \
-X POST "https://YOUR_OPENSEARCH_HOST:9200/_plugins/_ppl" \
-H "Content-Type: application/json" \
-d '{"query": "source=app-logs-2026.04.14 | where level = '\''ERROR'\'' | stats count() as error_count by service | sort -error_count"}'
Results:
service | error_count
-----------|------------
worker | 16
api | 11
scheduler | 8
auth | 7
The worker service generates more than twice as many errors as auth. This is the first place to look when diagnosing reliability issues in this dataset.
Query 2: Average and Maximum Latency by Path
Identify which endpoints are slowest on average:
source=app-logs-2026.04.14 | stats avg(latency_ms) as avg_latency, max(latency_ms) as max_latency by path | sort -avg_latency
Results:
path | avg_latency | max_latency
------------------|-------------|------------
/v1/orders | 495.4ms | 839ms
/v1/users | 446.6ms | 844ms
/v1/health | 443.8ms | 850ms
/v1/auth/login | 435.8ms | 827ms
/v1/products | 410.3ms | 819ms
/v1/orders has the highest average latency at 495.4 ms and a max of 839 ms. The health check endpoint (/v1/health) at 443.8 ms average is worth investigating: health checks typically complete in single-digit milliseconds and should not be this slow.
Query 3: Top 5 Slowest 5xx Errors
Combine a status code filter with a latency sort to find the worst-case errors:
source=app-logs-2026.04.14 | where status_code >= 500 | fields timestamp, service, path, status_code, latency_ms | sort -latency_ms | head 5
Results:
timestamp | service | path | status_code | latency_ms
---------------------|-----------|------------------|-------------|----------
2026-04-14 18:36:36 | api | /v1/orders | 500 | 818ms
2026-04-14 18:51:00 | auth | /v1/users | 500 | 803ms
2026-04-14 18:31:30 | api | /v1/health | 500 | 791ms
2026-04-14 18:04:48 | scheduler | /v1/auth/login | 500 | 768ms
2026-04-14 18:09:54 | worker | /v1/orders | 503 | 763ms
These are the five requests that combined a server-side error with the worst response times. They are the highest-priority incidents to investigate. Note that /v1/orders appears twice, confirming it is under stress.
Query 4: Status Code Distribution
Get a count of every status code across all requests:
source=app-logs-2026.04.14 | stats count() as count by status_code | sort status_code
Results:
status_code | count
------------|------
200 | 66
201 | 29
400 | 21
401 | 19
403 | 23
500 | 23
503 | 19
Out of 200 requests: 95 were successful (200 + 201), 63 were client errors (400 + 401 + 403), and 42 were server errors (500 + 503). A server error rate of 21% (42 / 200) in a production system would require immediate attention.
Query 5: High-Latency Requests by Service
Find services where a large proportion of requests are taking over 500 ms:
source=app-logs-2026.04.14 | where latency_ms > 500 | stats count() as slow_requests, avg(latency_ms) as avg_slow_latency by service | sort -slow_requests
Results:
service | slow_requests | avg_slow_latency
----------|---------------|----------------
scheduler | 25 | 680.8ms
auth | 24 | 673.0ms
api | 23 | 641.0ms
worker | 22 | 655.6ms
All four services have a similar count of slow requests (22 to 25), which suggests a systemic issue rather than a problem with a specific service. When slow requests are distributed evenly, look at shared infrastructure: database connection pool exhaustion, network latency, or a shared upstream dependency.
PPL Syntax Reference
PPL uses a pipeline model. Each stage receives the output of the previous stage.
| Command | Purpose | Example |
|---|---|---|
source= | Select the index | source=app-logs-* |
where | Filter rows | where level = 'ERROR' |
fields | Select columns | fields timestamp, service, path |
stats | Aggregate | stats count() as n, avg(latency_ms) as avg by service |
sort | Sort results | sort -latency_ms (descending) |
head | Limit rows | head 10 |
dedup | Remove duplicates | dedup service |
There is no GROUP BY syntax in PPL. Grouping is expressed as the by clause in stats. This mirrors the mental model of "I want statistics, broken down by field", which is more natural for log analysis than SQL's aggregation syntax.
What's Next
- Automate cleanup of old log indices with Index State Management (ISM) in OpenSearch.
- Restrict PPL access to specific indices per team using fine-grained access control.
Provision a FoundryDB OpenSearch cluster for your log analytics pipeline at foundrydb.com. Documentation at docs.foundrydb.com.