
Monitoring & Alerts

Metrics API

Query metrics for any service:

curl -u admin:password \
  "https://api.foundrydb.com/managed-services/{id}/metrics?metric=cpu&period=1h"

Parameters:

Parameter  | Description             | Example
-----------|-------------------------|--------------------------
metric     | Metric name (see below) | cpu, memory, connections
period     | Time range              | 15m, 1h, 6h, 24h, 7d
resolution | Data point interval     | 1m, 5m, 1h
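As a sketch, the query above can be wrapped in a small Python helper. The `metrics_url` function is hypothetical (not part of any official client); the service ID and credentials are placeholders:

```python
from urllib.parse import urlencode

BASE = "https://api.foundrydb.com"

def metrics_url(service_id, metric, period="1h", resolution=None):
    """Build a metrics URL from the parameters in the table above."""
    params = {"metric": metric, "period": period}
    if resolution:
        params["resolution"] = resolution
    return f"{BASE}/managed-services/{service_id}/metrics?{urlencode(params)}"

# Fetching (requires the `requests` package); credentials are placeholders:
# import requests
# resp = requests.get(metrics_url("svc_123", "cpu", "6h", "5m"),
#                     auth=("admin", "password"))
# resp.raise_for_status()
# print(resp.json())
```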

Common Metrics

All engines

Metric      | Description
------------|-------------------------
cpu         | CPU utilisation (%)
memory      | Memory used (%)
disk        | Disk used (%)
disk_iops   | Disk IOPS
connections | Active connections
network_in  | Network bytes received
network_out | Network bytes sent

PostgreSQL

Metric                     | Description
---------------------------|-----------------------------------------------
pg_connections             | Active / idle / waiting connections
pg_transactions_per_second | Commits + rollbacks per second
pg_cache_hit_rate          | Buffer cache hit ratio (target >99%)
pg_replication_lag_seconds | Replica lag in seconds
pg_locks                   | Active lock count
pg_deadlocks               | Deadlocks per minute
pg_slow_queries            | Queries exceeding log_min_duration_statement

MySQL

Metric                            | Description
----------------------------------|---------------------------------------
mysql_queries_per_second          | Total QPS
mysql_innodb_buffer_pool_hit_rate | Buffer pool efficiency (target >99%)
mysql_replication_lag_seconds     | Replica lag
mysql_open_files                  | Open file handles

MongoDB

Metric                          | Description
--------------------------------|--------------------------------
mongodb_ops_per_second          | Operations per second by type
mongodb_replication_lag_seconds | Replica set lag
mongodb_wiredtiger_cache_used   | WiredTiger cache utilisation
mongodb_connections             | Active connections

Valkey

Metric                   | Description
-------------------------|--------------------------------
valkey_used_memory       | Memory used (bytes)
valkey_keyspace_hits     | Successful key lookups
valkey_keyspace_misses   | Cache misses
valkey_evicted_keys      | Keys evicted due to maxmemory
valkey_connected_clients | Connected clients
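The hits and misses counters are most useful combined into a hit ratio. A minimal Python sketch (the helper name is ours, not part of the API):

```python
def keyspace_hit_ratio(hits, misses):
    """Cache hit ratio from valkey_keyspace_hits and valkey_keyspace_misses.

    Returns None when there has been no traffic yet, to avoid
    reporting a misleading 0% or 100%.
    """
    total = hits + misses
    if total == 0:
        return None
    return hits / total

# Example: 9,800 hits and 200 misses -> 0.98 (a 98% hit ratio)
```

A ratio that drops while valkey_evicted_keys climbs usually means the working set no longer fits in maxmemory.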

Kafka

Metric                            | Description
----------------------------------|------------------------------------------------
kafka_messages_in_per_sec         | Inbound message rate
kafka_bytes_in_per_sec            | Inbound throughput
kafka_bytes_out_per_sec           | Outbound throughput
kafka_under_replicated_partitions | Partitions not fully replicated (should be 0)
kafka_consumer_lag                | Messages behind for a consumer group
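For consumer lag, the trend matters more than the absolute value: a steady non-zero lag is normal, while monotonic growth means consumers are falling behind. A hypothetical Python heuristic over consecutive kafka_consumer_lag samples:

```python
def lag_is_growing(samples, min_points=3):
    """True when consumer lag grows strictly across every consecutive
    pair of samples -- i.e. consumers are not keeping up.

    With fewer than min_points samples there is not enough evidence,
    so the function stays quiet and returns False.
    """
    if len(samples) < min_points:
        return False
    return all(b > a for a, b in zip(samples, samples[1:]))
```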

Alerts

Create an alert rule

curl -u admin:password -X POST \
  https://api.foundrydb.com/managed-services/{id}/alerts/rules \
  -H "Content-Type: application/json" \
  -d '{
    "metric": "cpu",
    "condition": "gt",
    "threshold": 80,
    "duration_minutes": 5,
    "severity": "warning",
    "notification_channel_id": "channel_abc"
  }'

Field            | Values
-----------------|---------------------------------------------------
condition        | gt (above), lt (below)
severity         | info, warning, critical
duration_minutes | How long the condition must persist before firing
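The duration_minutes window means a rule does not fire on a single spike. A minimal Python sketch of this kind of evaluation, under our own assumptions (the actual server-side logic is not documented here; sample spacing is assumed to match the rule's resolution):

```python
def rule_fires(values, condition, threshold, duration_minutes,
               resolution_minutes=1):
    """Hypothetical evaluation of an alert rule: the condition must hold
    for every sample in the trailing duration_minutes window."""
    needed = max(1, duration_minutes // resolution_minutes)
    window = values[-needed:]
    if len(window) < needed:
        return False               # not enough history yet
    if condition == "gt":
        return all(v > threshold for v in window)
    if condition == "lt":
        return all(v < threshold for v in window)
    raise ValueError(f"unknown condition: {condition}")

# A 5-minute "cpu > 80" rule: one sample at 79 keeps it from firing.
```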

List rules

curl -u admin:password \
  https://api.foundrydb.com/managed-services/{id}/alerts/rules

Delete a rule

curl -u admin:password -X DELETE \
  https://api.foundrydb.com/managed-services/{id}/alerts/rules/{rule_id}

Notification Channels

Alerts can be sent to multiple channels.

Create a webhook channel

curl -u admin:password -X POST \
  https://api.foundrydb.com/alerts/channels \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Slack Production",
    "type": "webhook",
    "config": {"url": "https://hooks.slack.com/services/..."}
  }'

Create an email channel

curl -u admin:password -X POST \
  https://api.foundrydb.com/alerts/channels \
  -H "Content-Type: application/json" \
  -d '{
    "name": "On-call",
    "type": "email",
    "config": {"address": "oncall@example.com"}
  }'

Supported channel types

Type    | Description
--------|-----------------------------------------------
email   | Email notification
webhook | HTTP POST to any URL (Slack, PagerDuty, etc.)

Query Statistics

For PostgreSQL, real-time query stats are available:

curl -u admin:password \
  "https://api.foundrydb.com/managed-services/{id}/metrics/query-stats?limit=20&order=total_time"

Returns the top queries by total execution time, including calls, mean time, rows, and cache hit rate.

Use this to identify slow queries before they become a problem.

Query Statistics (Full Guide)

Query statistics are available for PostgreSQL and MySQL services. For PostgreSQL the data comes from the pg_stat_statements extension. For MySQL it is collected from the slow query log and the performance_schema digest tables on the primary node.

How it works

Collection is asynchronous: first, POST to request a collection task; then poll the GET endpoint with the returned task_id until the task completes.

Step 1: Request collection

# Collect the top 20 queries sorted by total execution time (the default)
curl -u admin:password -X POST \
  "https://api.foundrydb.com/managed-services/{id}/query-stats?limit=20&sort_by=total_time"
# Returns: {"task_id": "b2c3d4e5-..."}

Step 2: Poll for results

curl -u admin:password \
  "https://api.foundrydb.com/managed-services/{id}/query-stats?task_id=b2c3d4e5-..."
# Returns 202 while in progress, 200 when complete
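The two steps above amount to a poll loop. A Python sketch, assuming a caller-supplied `get` function (e.g. a thin wrapper around requests.get returning the status code and decoded body); `poll_query_stats` is a hypothetical helper, not part of any official client:

```python
import time

def poll_query_stats(get, task_id, interval=2.0, timeout=60.0):
    """Poll the query-stats endpoint until collection completes.

    `get` is any callable taking a task_id and returning
    (status_code, body). Raises on unexpected statuses or timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status, body = get(task_id)
        if status == 200:
            return body            # collection finished
        if status != 202:
            raise RuntimeError(f"unexpected status {status}")
        time.sleep(interval)       # 202: still collecting
    raise TimeoutError("query-stats collection did not finish in time")
```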

Fields returned

Each entry in the queries array contains:

Field           | Type        | Description
----------------|-------------|----------------------------------------------------------------
query           | string      | Normalised query text (parameters replaced with $1, ?, etc.)
calls           | integer     | Total number of executions since the last reset
total_time      | float (ms)  | Total cumulative execution time across all calls
mean_time       | float (ms)  | Average execution time per call
rows            | integer     | Total rows returned or affected across all calls
cache_hit_ratio | float (0-1) | Shared block cache hit ratio (PostgreSQL only; null for MySQL)

The response envelope also includes total_count (number of queries returned), collected_at (UTC timestamp of collection), and database_type.
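Two ratios worth deriving from these fields, sketched as hypothetical Python helpers over a single entry from the queries array:

```python
def rows_per_call(entry):
    """Average rows returned or affected per execution."""
    return entry["rows"] / entry["calls"] if entry["calls"] else 0.0

def time_share(entry, total_time_all):
    """Fraction of all measured database time spent in this one query --
    useful for deciding which query to optimise first."""
    return entry["total_time"] / total_time_all if total_time_all else 0.0
```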

Sorting options

Pass sort_by as a query parameter when requesting collection:

Value      | Use case
-----------|------------------------------------------------------------------------
total_time | Queries consuming the most cumulative database time (default)
calls      | Most frequently executed queries, regardless of speed
mean_time  | Slowest queries on average (catches infrequent but expensive queries)

Resetting statistics

There is no dedicated API endpoint to reset query statistics. To reset pg_stat_statements on a PostgreSQL service, connect as a superuser and run:

SELECT pg_stat_statements_reset();

On MySQL, the performance_schema digest tables reset automatically at server restart. You can also reset them manually:

TRUNCATE TABLE performance_schema.events_statements_summary_by_digest;

After a reset, all counters start from zero. This is useful after a schema change or deployment so that you are measuring only the new workload.

Identifying N+1 queries

N+1 patterns show up as a query with a very high calls count relative to the expected request volume, a low or moderate mean_time, but a very large total_time. Look for queries of the form SELECT ... WHERE id = $1 that are executed thousands of times per minute. The fix is usually to add a batch-loading step (e.g. WHERE id = ANY($1)) or an ORM eager-load option.
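That shape (many calls, cheap individually) can be screened for mechanically. A hypothetical Python filter over the queries array; the thresholds are illustrative and should be tuned to your request volume:

```python
def n_plus_one_candidates(queries, min_calls=10_000, max_mean_ms=5.0):
    """Flag entries matching the N+1 shape: executed very often, fast
    per call, so their cost hides in total_time rather than mean_time."""
    return [
        q for q in queries
        if q["calls"] >= min_calls and q["mean_time"] <= max_mean_ms
    ]
```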

Identifying missing indexes

Sort by mean_time and look for queries with high mean execution time but low row counts. A sequential scan on a large table with a low selectivity predicate will appear here. Confirm with EXPLAIN ANALYZE and add an appropriate index. On PostgreSQL you can also query pg_stat_user_tables for tables with high seq_scan counts alongside your query stats to correlate the two.
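The missing-index shape (slow on average, few rows per call) can likewise be screened for before reaching for EXPLAIN ANALYZE. A hypothetical Python filter with illustrative thresholds:

```python
def missing_index_candidates(queries, min_mean_ms=100.0, max_rows_per_call=10):
    """Flag slow-on-average queries that return few rows per call --
    the shape a sequential scan with a selective predicate produces."""
    out = []
    for q in queries:
        rpc = q["rows"] / q["calls"] if q["calls"] else 0
        if q["mean_time"] >= min_mean_ms and rpc <= max_rows_per_call:
            out.append(q)
    return out
```

Anything flagged here is only a candidate: confirm the scan with EXPLAIN ANALYZE before adding an index.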

Exporting Metrics and Logs

Metrics and logs collected by FoundryDB can be pushed continuously to external observability platforms. This lets you consolidate database telemetry alongside your application infrastructure in the tools your team already uses.

Supported destinations are: Datadog, Prometheus Remote Write (Grafana Cloud, Thanos, Cortex, VictoriaMetrics), Generic OTLP (Grafana Cloud, Honeycomb, any OpenTelemetry collector), AWS CloudWatch, Elasticsearch / OpenSearch, BetterStack, and Grafana Loki. Each integration can export metrics, logs, or both, and runs on a configurable interval (default 60 seconds).

To set up an export, go to the Integrations page in the dashboard or use the API. You can create one integration per destination per service, or a single global integration that covers all services. The example below creates a Datadog export via the API:

curl -u admin:password -X POST \
  https://api.foundrydb.com/api/v1/metrics-exports \
  -H "Content-Type: application/json" \
  -d '{
    "service_id": "{service-id}",
    "name": "Datadog Production",
    "destination_type": "datadog",
    "data_type": "both",
    "export_interval_seconds": 60,
    "configuration": {
      "api_key": "YOUR_DATADOG_API_KEY",
      "site": "datadoghq.com"
    }
  }'

For Grafana Loki, Prometheus Remote Write, and OTLP destinations, see the full configuration reference on the Integrations page.