Change Data Capture (CDC) pipelines involve multiple components: a source database, Kafka, the Streamkap platform, and a destination. Each of these can experience outages independently. This page covers what happens during each type of failure and how to recover.

Streamkap’s Built-in Resilience

Streamkap includes several mechanisms that protect against data loss and enable recovery without manual intervention in many scenarios:
  • Kafka buffering: Data captured from sources is persisted in Kafka topics. If a destination goes offline, messages accumulate in Kafka and are delivered once the destination recovers.
  • Automatic retry: Destination connectors automatically retry failed writes with backoff, handling transient errors without operator intervention.
  • Dead letter queue (DLQ): Messages that cannot be processed after retries are routed to a dead letter queue rather than blocking the pipeline. Healthy data continues to flow.
  • Offset tracking: Each pipeline tracks its position (offset) in the Kafka topic. When a pipeline restarts, it resumes from the last successfully committed offset, preventing data loss.
  • Incremental snapshots: If a pipeline falls too far behind or a source is rebuilt, snapshots can re-backfill historical data without interrupting real-time streaming.
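The retry-with-backoff behavior can be sketched as follows. This is a minimal illustration of the pattern, not Streamkap's actual connector code; `write_fn` and the parameter names are hypothetical.

```python
import time

def write_with_retry(write_fn, record, max_retries=5, base_delay=0.5):
    """Attempt a destination write, retrying with exponential backoff.

    If every retry fails, return False so the caller can route the
    record to a dead letter queue instead of blocking the pipeline.
    """
    for attempt in range(max_retries):
        try:
            write_fn(record)
            return True
        except Exception:
            # Exponential backoff: 0.5s, 1s, 2s, ...
            time.sleep(base_delay * (2 ** attempt))
    return False
```

A record for which this returns False would be routed to the DLQ, so one bad message never stops healthy data from flowing.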
Streamkap provides at-least-once delivery semantics. Every change event is delivered at least once, but in failure or recovery scenarios, some events may be delivered more than once. Upsert mode handles duplicates automatically via primary key deduplication. Insert/append mode may result in duplicate rows — use metadata columns to deduplicate at query time.
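The effect of upsert-mode deduplication under at-least-once delivery can be illustrated with a small sketch. The event fields `id` and `_offset` are hypothetical names standing in for the primary key and a monotonically increasing metadata column; this is not Streamkap's actual schema.

```python
def apply_upsert(table, events):
    """Apply change events with primary-key deduplication (upsert mode).

    `table` maps primary key -> latest row. Replaying the same event
    twice is harmless: the duplicate simply overwrites the row with
    identical data, so at-least-once delivery yields no extra rows.
    """
    for ev in events:
        current = table.get(ev["id"])
        # Keep only the newest version of each row, by metadata offset.
        if current is None or ev["_offset"] >= current["_offset"]:
            table[ev["id"]] = ev
    return table
```

In insert/append mode there is no key to collapse on, which is why duplicates must instead be filtered at query time using metadata columns.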

Failure Scenarios

Source Database Outage

What happens:
  • The source connector detects the connection loss and enters an error or retry state.
  • The pipeline status changes to Broken in the Streamkap UI.
  • No new change events are captured while the source is unavailable.
  • Data already in Kafka continues to flow to the destination.
Recovery:
  1. Restore the source database and verify it is accepting connections.
  2. The source connector automatically attempts to reconnect. Once the source is reachable, the connector resumes reading from its last known position in the change log (WAL, binlog, oplog, or change stream).
  3. If the source was rebuilt from a backup or replica and the change log position is no longer available, you may need to trigger a snapshot to backfill affected tables.
PostgreSQL: Ensure the replication slot is preserved during any source database failover or maintenance. If the replication slot is dropped, the connector cannot resume from its previous position and a full snapshot is required. See the PostgreSQL source documentation for replication slot configuration.
Destination Outage

What happens:
  • Data continues to be captured from the source and buffered in Kafka topics.
  • The destination connector retries writes automatically.
  • Messages that fail after all retries are routed to the DLQ.
  • No data loss is expected as long as the outage duration is within the Kafka retention period.
Recovery:
  1. Restore the destination and verify it is accepting connections and writes.
  2. The destination connector automatically resumes writing buffered data from Kafka.
  3. Monitor consumer group lag after recovery to confirm the pipeline is catching up.
  4. Check the DLQ for any messages that failed during the outage window and resolve them.
During recovery, you may observe temporarily elevated lag as the pipeline processes the backlog of buffered messages. This is expected and will decrease as the pipeline catches up.
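Whether the backlog actually drains depends on the destination sustaining a higher write rate than the source's change rate. A rough back-of-the-envelope estimate (illustrative only; real throughput varies with batching and load):

```python
def catchup_time_s(backlog_msgs, sink_rate, source_rate):
    """Estimate seconds to drain a Kafka backlog after a destination outage.

    The pipeline only catches up if the destination drains messages
    faster than the source produces them; otherwise lag grows without
    bound. Rates are in messages per second.
    """
    if sink_rate <= source_rate:
        return float("inf")  # lag keeps increasing
    return backlog_msgs / (sink_rate - source_rate)
```

For example, a backlog of 600,000 messages with the sink writing 2,000 msg/s against a source producing 1,000 msg/s drains in about 10 minutes.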
Streamkap Platform Outage

What happens:
  • Pipeline processing pauses entirely. No data is captured from sources or written to destinations.
  • Source databases continue to accumulate changes in their change logs (WAL, binlog, oplog).
Recovery:
  • For Streamkap Cloud deployments, recovery is managed by the Streamkap team with 24/7 monitoring.
  • Once the platform recovers, pipelines automatically resume from their last committed offsets.
  • Source connectors reconnect and read accumulated change log entries.
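The resume-from-last-committed-offset behavior can be sketched as a simple loop. This is an illustration of the pattern, with the offset modeled as a list index; real Kafka consumers commit offsets to the broker.

```python
def resume_consume(log, committed_offset, process):
    """Resume processing a topic from the last committed offset.

    The offset is committed only after an event is successfully
    processed, so a crash between processing and committing replays
    that event on restart: at-least-once delivery, never data loss.
    """
    offset = committed_offset
    for event in log[offset:]:
        process(event)
        offset += 1  # commit after successful processing
    return offset
```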
Source Database Failover

When a source database undergoes a planned or unplanned failover (e.g., primary-to-replica promotion), the behavior depends on the database type:
PostgreSQL:
  • Replication slots are tied to the primary instance. After failover to a standby, the slot typically needs to be recreated on the new primary.
  • If the slot is lost, the connector cannot resume from its previous position. A snapshot may be required.
  • Consider using logical replication slot failover capabilities if your PostgreSQL version and HA solution support them.
MySQL:
  • If GTID-based replication is enabled and GTIDs are consistent across the primary and replica, the connector can resume seamlessly after failover.
  • Without GTIDs, the connector relies on binlog file and position, which may not transfer across instances. A snapshot may be required.
MongoDB:
  • Change stream resume tokens allow seamless failover within a replica set. When the primary changes, the connector resumes from its last resume token automatically.
  • For sharded clusters, each shard’s change stream resumes independently.
Network Connectivity Loss

What happens:
  • The pipeline detects connectivity loss and enters a retry loop.
  • Transient interruptions (seconds to minutes) are typically handled automatically.
Recovery:
  • Transient interruptions: The pipeline reconnects and resumes automatically once connectivity is restored. No action needed.
  • Extended interruptions: If connectivity is not restored within the retry window, the pipeline may transition to a Broken state.
    1. Verify network connectivity between Streamkap and your source/destination.
    2. Check VPN or PrivateLink status if applicable.
    3. Once connectivity is restored, restart the pipeline from the Streamkap UI if it does not resume automatically.
For Bring Your Own Cloud (BYOC) deployments, verify that the VPN tunnel between the control plane and data plane is active, and that the data plane can reach your source and destination endpoints.
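A quick way to verify basic network reachability from a host in the relevant network path is a plain TCP connection test. This is a generic diagnostic sketch, not a Streamkap tool:

```python
import socket

def endpoint_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds.

    A useful first check when a pipeline is stuck in a Broken state:
    it rules network connectivity in or out before you dig into
    connector logs or credentials.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Note that a successful TCP connection only proves reachability; authentication or TLS issues can still prevent the connector from resuming.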

Post-Outage Verification Checklist

After recovering from any outage, work through this checklist to confirm your pipelines are healthy:
1. Verify pipeline status: Navigate to the Pipelines page and confirm all affected pipelines show an Active (green) status.
2. Check consumer group lag: Open the Consumer Groups page and verify that lag for affected consumer groups is decreasing. Sustained or increasing lag after recovery indicates a problem.
3. Verify data flow to the destination: Query your destination to confirm new data is arriving. Check the latest timestamps on recently updated tables to verify freshness.
4. Review the dead letter queue: Check the DLQ for messages that failed during the outage or recovery window. Resolve or replay any failed messages.
5. Compare source and destination counts: For critical tables, compare row counts between the source database and the destination to confirm they are in sync. If discrepancies exist, consider triggering a snapshot for the affected tables.
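The count comparison can be automated with a small script. This sketch assumes you have already collected per-table counts from each side (e.g., via `SELECT COUNT(*)` queries); the function and variable names are illustrative.

```python
def count_discrepancies(source_counts, dest_counts):
    """Compare per-table row counts between source and destination.

    Returns the tables whose counts differ, or that exist on only one
    side, mapped to (source_count, destination_count). These tables
    are candidates for a targeted snapshot re-backfill.
    """
    mismatched = {}
    for table in set(source_counts) | set(dest_counts):
        src = source_counts.get(table)
        dst = dest_counts.get(table)
        if src != dst:
            mismatched[table] = (src, dst)
    return mismatched
```

Keep in mind that in insert/append mode the destination may legitimately hold more rows than the source until duplicates are filtered, so interpret discrepancies in light of your ingestion mode.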

Configuration Backup & Recovery

Maintaining recoverable pipeline configurations reduces the time and effort needed to rebuild after a major incident.

API Export

Pipeline and connector configurations can be retrieved programmatically via the Streamkap API; use it to export configuration state for backup purposes or to recreate resources in a new environment. Alternatively, the Streamkap Terraform Provider enables you to define all sources, destinations, pipelines, and transforms as infrastructure-as-code. This provides:
  • Version-controlled configuration tracked in Git alongside your application code
  • Reproducible deployments that can recreate your entire pipeline topology from code
  • Disaster recovery by re-applying Terraform configurations to a new environment
  • Change auditing through standard Git history and pull request reviews
Maintain your Streamkap Terraform configurations in version control and treat them as the source of truth for your pipeline infrastructure. This enables rapid recovery by re-applying configurations if resources need to be recreated.
For setup instructions, see: