Streamkap’s Built-in Resilience
Streamkap includes several mechanisms that protect against data loss and enable recovery without manual intervention in many scenarios:

| Mechanism | Description |
|---|---|
| Kafka buffering | Data captured from sources is persisted in Kafka topics. If a destination goes offline, messages accumulate in Kafka and are delivered once the destination recovers. |
| Automatic retry | Destination connectors automatically retry failed writes with backoff, handling transient errors without operator intervention. |
| Dead letter queue (DLQ) | Messages that cannot be processed after retries are routed to a dead letter queue rather than blocking the pipeline. Healthy data continues to flow. |
| Offset tracking | Each pipeline tracks its position (offset) in the Kafka topic. When a pipeline restarts, it resumes from the last successfully committed offset, preventing data loss. |
| Incremental snapshots | If a pipeline falls too far behind or a source is rebuilt, snapshots can re-backfill historical data without interrupting real-time streaming. |
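The retry and DLQ mechanisms in the table can be illustrated with a short sketch. This is not Streamkap's actual implementation — the `write`, `dlq`, and parameter names are stand-ins for connector internals that the platform manages for you:

```python
import time

def deliver_with_retry(write, message, dlq, max_attempts=5, base_delay=0.01):
    """Attempt a destination write, backing off exponentially between retries.

    Illustrative sketch only: `write` stands in for the destination write call,
    `dlq` for the dead letter queue. Streamkap handles this internally.
    """
    for attempt in range(max_attempts):
        try:
            write(message)
            return True
        except IOError:
            # Transient failure: wait longer after each attempt (backoff).
            time.sleep(base_delay * (2 ** attempt))
    # Retries exhausted: route to the DLQ so healthy data keeps flowing.
    dlq.append(message)
    return False
```

A message that fails transiently is retried and eventually delivered; one that keeps failing lands in the DLQ instead of blocking the pipeline.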
Streamkap provides at-least-once delivery semantics. Every change event is delivered at least once, but in failure or recovery scenarios, some events may be delivered more than once. Upsert mode handles duplicates automatically via primary key deduplication. Insert/append mode may result in duplicate rows — use metadata columns to deduplicate at query time.
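Query-time deduplication for insert/append mode typically keeps, for each primary key, the row with the highest offset metadata value. A minimal sketch of that logic in Python (in practice you would do this in SQL at the destination; `_streamkap_offset` is a hypothetical metadata column name — use whichever offset or timestamp metadata column your destination tables actually carry):

```python
def deduplicate(rows, key="id", offset_col="_streamkap_offset"):
    """Keep only the latest copy of each row, identified by primary key.

    Assumes each row dict carries an offset-style metadata column whose
    value increases with newer deliveries (hypothetical column name).
    """
    latest = {}
    for row in rows:
        k = row[key]
        # Keep the row with the highest offset seen for this key.
        if k not in latest or row[offset_col] > latest[k][offset_col]:
            latest[k] = row
    return list(latest.values())
```

Redelivered duplicates carry the same key and offset, so only one copy survives; a later update with a higher offset replaces the older row.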
Failure Scenarios
Source Database Outage
What happens:
- The source connector detects the connection loss and enters an error or retry state.
- The pipeline status changes to Broken in the Streamkap UI.
- No new change events are captured while the source is unavailable.
- Data already in Kafka continues to flow to the destination.
How to recover:
- Restore the source database and verify it is accepting connections.
- The source connector automatically attempts to reconnect. Once the source is reachable, the connector resumes reading from its last known position in the change log (WAL, binlog, oplog, or change stream).
- If the source was rebuilt from a backup or replica and the change log position is no longer available, you may need to trigger a snapshot to backfill affected tables.
Destination Outage
What happens:
- Data continues to be captured from the source and buffered in Kafka topics.
- The destination connector retries writes automatically.
- Messages that fail after all retries are routed to the DLQ.
- No data loss is expected as long as the outage duration is within the Kafka retention period.
How to recover:
- Restore the destination and verify it is accepting connections and writes.
- The destination connector automatically resumes writing buffered data from Kafka.
- Monitor consumer group lag after recovery to confirm the pipeline is catching up.
- Check the DLQ for any messages that failed during the outage window and resolve them.
During recovery, you may observe temporarily elevated lag as the pipeline processes the backlog of buffered messages. This is expected and will decrease as the pipeline catches up.
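Consumer group lag, as used above, is the per-partition gap between the log end offset and the last committed offset. A small sketch of the arithmetic (the offset dictionaries here are illustrative; in practice a Kafka client library reports these values):

```python
def consumer_lag(end_offsets, committed_offsets):
    """Total lag across partitions.

    Both arguments map partition identifiers to offsets; a partition with
    no committed offset is treated as fully behind (committed = 0).
    """
    return sum(
        end_offsets[p] - committed_offsets.get(p, 0)
        for p in end_offsets
    )
```

Sampling this value twice after recovery tells you whether the pipeline is catching up: lag should decrease between samples.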
Kafka / Platform Outage (Streamkap Cloud)
What happens:
- Pipeline processing pauses entirely. No data is captured from sources or written to destinations.
- Source databases continue to accumulate changes in their change logs (WAL, binlog, oplog).
How to recover:
- For Streamkap Cloud deployments, recovery is managed by the Streamkap team with 24/7 monitoring.
- Once the platform recovers, pipelines automatically resume from their last committed offsets.
- Source connectors reconnect and read accumulated change log entries.
Source Database Failover (HA / Replica Promotion)
When a source database undergoes a planned or unplanned failover (e.g., primary to replica promotion), the behavior depends on the database type:

PostgreSQL:
- Replication slots are tied to the primary instance. After failover to a standby, the slot typically needs to be recreated on the new primary.
- If the slot is lost, the connector cannot resume from its previous position. A snapshot may be required.
- Consider using logical replication slot failover capabilities if your PostgreSQL version and HA solution support them.
MySQL:
- If GTID-based replication is enabled and GTIDs are consistent across the primary and replica, the connector can resume seamlessly after failover.
- Without GTIDs, the connector relies on binlog file and position, which may not transfer across instances. A snapshot may be required.
MongoDB:
- Change stream resume tokens allow seamless failover within a replica set. When the primary changes, the connector resumes from its last resume token automatically.
- For sharded clusters, each shard’s change stream resumes independently.
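The per-database failover rules above (PostgreSQL replication slots, MySQL GTIDs, MongoDB resume tokens) can be summarized as a small decision helper. This is a sketch of the documented rules, not a Streamkap API — the function and parameter names are illustrative:

```python
def can_resume_after_failover(db, *, gtid_consistent=False, slot_preserved=False):
    """Return True if the connector can resume from its stored position
    after failover; False means a snapshot may be required.

    Illustrative only; encodes the rules described in the text above.
    """
    if db == "postgresql":
        # The replication slot must exist on the new primary.
        return slot_preserved
    if db == "mysql":
        # Binlog file/position does not transfer; GTIDs must be consistent.
        return gtid_consistent
    if db == "mongodb":
        # Resume tokens survive replica set failover.
        return True
    raise ValueError(f"unknown database type: {db}")
```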
Network Interruption
What happens:
- The pipeline detects connectivity loss and enters a retry loop.
- Transient interruptions (seconds to minutes): the pipeline reconnects and resumes automatically once connectivity is restored. No action is needed.
- Extended interruptions: if connectivity is not restored within the retry window, the pipeline may transition to a Broken state.
How to recover:
- Verify network connectivity between Streamkap and your source/destination.
- Check VPN or PrivateLink status if applicable.
- Once connectivity is restored, restart the pipeline from the Streamkap UI if it does not resume automatically.
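The transient-versus-extended distinction above amounts to a simple state rule. A sketch (the 600-second window here is an illustrative value, not a documented Streamkap default):

```python
def pipeline_state(outage_seconds, retry_window_seconds=600):
    """Within the retry window the pipeline keeps retrying and recovers on
    its own; beyond it, the pipeline is marked Broken and may need a
    manual restart. Window length is an assumed example value.
    """
    return "Retrying" if outage_seconds <= retry_window_seconds else "Broken"
```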
Post-Outage Verification Checklist
After recovering from any outage, work through this checklist to confirm your pipelines are healthy:

Verify pipeline status
Navigate to the Pipelines page and confirm all affected pipelines show an Active (green) status.
Check consumer group lag
Open the Consumer Groups page and verify that lag for affected consumer groups is decreasing. Sustained or increasing lag after recovery indicates a problem.
Verify data flow to destination
Query your destination to confirm new data is arriving. Check the latest timestamps on recently updated tables to verify freshness.
Review the dead letter queue
Check the DLQ for messages that failed during the outage or recovery window. Resolve or replay any failed messages.
Compare source and destination counts
For critical tables, compare row counts between the source database and the destination to confirm they are in sync. If discrepancies exist, consider triggering a snapshot for the affected tables.
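The count comparison can be scripted per table. A minimal sketch using `sqlite3` as a stand-in for your real source and destination drivers (the table name and connections are illustrative; substitute the appropriate database clients):

```python
import sqlite3

def row_count_diff(source_conn, dest_conn, table):
    """Compare row counts for one table across source and destination.

    Returns source_count - dest_count; a nonzero result means the sides
    diverge and the table may need a snapshot. Connections here are any
    DB-API-style objects with an execute() method.
    """
    src = source_conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    dst = dest_conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return src - dst
```

Note that equal counts do not guarantee identical contents; they are a cheap first-pass consistency signal.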
Configuration Backup & Recovery
Maintaining recoverable pipeline configurations reduces the time and effort needed to rebuild after a major incident.

API Export
Pipeline and connector configurations can be retrieved programmatically via the Streamkap API. Use the API to export configuration state for backup purposes or to recreate resources in a new environment.

Terraform (Recommended)
The Streamkap Terraform Provider enables you to define all sources, destinations, pipelines, and transforms as infrastructure-as-code. This provides:

- Version-controlled configuration tracked in Git alongside your application code
- Reproducible deployments that can recreate your entire pipeline topology from code
- Disaster recovery by re-applying Terraform configurations to a new environment
- Change auditing through standard Git history and pull request reviews
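An API-export backup can be as simple as writing each resource's configuration to a versioned directory. A sketch of that workflow — `fetch_resources` is a stub standing in for real Streamkap API calls, whose endpoints are not shown here:

```python
import json
import pathlib

def backup_configs(fetch_resources, backup_dir):
    """Write each resource's configuration to <backup_dir>/<type>/<id>.json.

    `fetch_resources` stands in for Streamkap API calls and should yield
    (resource_type, resource_id, config_dict) tuples. The resulting tree
    can be committed to Git for change auditing.
    """
    root = pathlib.Path(backup_dir)
    written = []
    for rtype, rid, config in fetch_resources():
        path = root / rtype / f"{rid}.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        # sort_keys keeps output stable, so Git diffs show real changes only.
        path.write_text(json.dumps(config, indent=2, sort_keys=True))
        written.append(path)
    return written
```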
Related Documentation
- Pipelines - Monitor pipeline status and manage data flow
- Snapshots & Backfilling - Re-backfill data after outages or schema changes
- Deployment Options - Streamkap Cloud and BYOC deployment models