Skip to main content

Data Pipelines

A data pipeline is a managed, first-class data flow between two services you already own. You point a source at a sink, the platform builds the network path, installs any required connectors, prepares the source database, and drives the pipeline to Running. You observe and delete it via the API.

What a data pipeline is

Data pipelines connect existing managed services. The connection itself — the private network peering, the connector plugin, the replication configuration, the connector process — is a resource the platform owns and reconciles. If the source primary fails over, the pipeline re-points itself and restarts. If the controller restarts mid-provisioning, it resumes from where it left off.

This is distinct from pipeline templates, which deploy pre-configured groups of new services (such as a RAG pipeline with Kafka + PostgreSQL + Valkey). Data pipelines connect services that already exist in your organization.

Available pipeline types

TypeSourceSinkMechanism
cdc_pg_to_kafkaPostgreSQLKafka + Kafka Connect addonDebezium PostgreSQL connector via logical replication

Additional pipeline types (Kafka to OpenSearch, PostgreSQL to pgvector) are planned. Today's surface is PostgreSQL CDC into Kafka.

How it works: PostgreSQL CDC to Kafka

When you create a cdc_pg_to_kafka pipeline, a reconciler walks it through five provisioning steps:

Pending
└── validate source (PostgreSQL, Running) and sink (Kafka + Connect, Running)
Provisioning
├── ensure_network — bidirectional SDN peering between source and sink subnets
├── install_plugin — Debezium PostgreSQL connector installed on the Kafka Connect worker
├── prepare_source — REPLICATION granted to the connector user; sink subnet admitted in pg_hba.conf
├── create_publication — platform-owned logical replication publication on the source
└── create_connector — Debezium connector created on the Connect cluster; waits for RUNNING
Running
└── health checked every 60 seconds; connector re-pointed after source failover

Every step is idempotent. A controller restart or transient network failure resumes from the last completed step without creating duplicate resources.

Network path

Pipeline data traffic travels over the services' private SDN subnets, never the shared utility network. The platform creates a bidirectional peering between the source and sink SDN networks. Nothing is opened on the public internet to make the pipeline work.

Source and sink must be in the same UpCloud peering region. Cross-region pipelines are not supported in Phase 1.

Lifecycle states

StatusMeaning
PendingPipeline accepted; reconciler validating source and sink
ProvisioningReconciler working through provisioning steps; see provision_step
RunningConnector is up and streaming
PausedConnector paused (e.g., manual pause via Kafka Connect API)
FailedA provisioning step or connector reached a terminal error; see error_message
DeletingDeletion in progress

The provision_step field narrows Provisioning to one of: ensure_network, install_plugin, prepare_source, create_publication, create_connector.

Teardown

Deleting a pipeline reverses provisioning in order: the connector is deleted first (releasing the replication slot), then the replication slot is dropped, then the publication, then the pg_hba entry added by the pipeline. The source returns to exactly the state it was in before.

The cross-service SDN peering is intentionally kept. Other pipelines between the same two services may share it, and an idle peering carries no traffic. The peering is removed when the underlying service is deleted.

Delivery guarantee

CDC pipelines use logical replication. The delivery guarantee across normal operation is at-least-once per Kafka offset: Debezium commits offsets only after Kafka acknowledges receipt.

Across a source failover, the guarantee is at-least-once relative to what the new primary retained. Logical replication slots are not transferred to promoted replicas (a PostgreSQL limitation before version 17 failover slots). After failover, the reconciler re-points the connector at the new primary's private IP, refreshes credentials, and restarts it. Debezium recreates the slot and resumes from its committed Kafka offsets. Events already written to Kafka are not lost. Events committed on the old primary after the slot position but before the failover may be re-delivered or, in rare lossy failovers, missed.

Design downstream consumers to be idempotent.

Difference from pipeline templates

Data pipelinesPipeline templates
PurposeWire existing services togetherDeploy a new multi-service topology
InputSource service ID + sink service IDTemplate name + zone + plan
OutputA managed connector between your servicesSeveral new managed services
APIPOST /organizations/{orgId}/pipelinesPOST /pipelines
DocsThis sectionPipeline Templates

Next steps