# Disaster Recovery & Failover

> Handling outages and recovery procedures for Streamkap CDC pipelines across sources, destinations, and platform components.

Change Data Capture (CDC) pipelines involve multiple components: a source database, Kafka, the Streamkap platform, and a destination. Each of these can experience outages independently. This page covers what happens during each type of failure and how to recover.

## Streamkap's Built-in Resilience

Streamkap includes several mechanisms that protect against data loss and enable recovery without manual intervention in many scenarios:

| Mechanism                   | Description                                                                                                                                                             |
| :-------------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Kafka buffering**         | Data captured from sources is persisted in Kafka topics. If a destination goes offline, messages accumulate in Kafka and are delivered once the destination recovers.   |
| **Automatic retry**         | Destination connectors automatically retry failed writes with backoff, handling transient errors without operator intervention.                                         |
| **Dead letter queue (DLQ)** | Messages that cannot be processed after retries are routed to a [dead letter queue](/dlq-operations) rather than blocking the pipeline. Healthy data continues to flow. |
| **Offset tracking**         | Each pipeline tracks its position (offset) in the Kafka topic. When a pipeline restarts, it resumes from the last successfully committed offset, so no events are skipped. Events processed after that commit may be delivered again. |
| **Incremental snapshots**   | If a pipeline falls too far behind or a source is rebuilt, [snapshots](/snapshots) can backfill historical data again without interrupting real-time streaming.          |
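The automatic retry and dead letter queue rows describe a common delivery pattern. The following is a minimal Python sketch of that pattern, not Streamkap's actual implementation: `deliver_with_retry`, `max_attempts`, `base_delay`, and the in-memory `dlq` list are hypothetical names chosen for illustration, and the real retry limits and delays are not stated on this page.

```python
import time
from typing import Any, Callable, List


def deliver_with_retry(
    message: Any,
    write: Callable[[Any], None],
    dlq: List[Any],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to write one message, backing off exponentially between attempts.

    Returns True if the write succeeded, False if the message went to the DLQ.
    All names and default values are illustrative assumptions.
    """
    for attempt in range(max_attempts):
        try:
            write(message)
            return True
        except Exception:
            if attempt < max_attempts - 1:
                # Delay doubles each time: base, 2*base, 4*base, ...
                sleep(base_delay * (2 ** attempt))
    # Retries exhausted: park the message so later messages are not blocked.
    dlq.append(message)
    return False
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. A production connector would usually also cap the maximum delay and retry only on errors known to be transient.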

<Note>
  Streamkap provides **at-least-once** delivery semantics. Every change event is delivered at least once, but in failure or recovery scenarios, some events may be delivered more than once. **Upsert mode** handles duplicates automatically via primary key deduplication. **Insert/append mode** may result in duplicate rows; use [metadata columns](/metadata) to deduplicate at query time.
</Note>
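To make the difference between the two modes concrete, here is a small Python sketch that replays the same redelivered event under upsert semantics and under append semantics with query-time deduplication. It is an illustration only: the `id` key and the `offset` ordering field are hypothetical stand-ins for your primary key and for whichever metadata column your destination exposes for ordering.

```python
from typing import Dict, Iterable, List

Event = Dict[str, object]


def apply_upsert(events: Iterable[Event], key: str = "id") -> Dict[object, Event]:
    """Upsert semantics: a redelivered event overwrites the row with the same key."""
    table: Dict[object, Event] = {}
    for event in events:
        table[event[key]] = event
    return table


def dedupe_appended(
    rows: Iterable[Event], key: str = "id", order_by: str = "offset"
) -> List[Event]:
    """Query-time dedup for append mode: keep the newest row per key."""
    latest: Dict[object, Event] = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[order_by] >= current[order_by]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])


events = [
    {"id": 1, "status": "new", "offset": 10},
    {"id": 2, "status": "new", "offset": 11},
    {"id": 1, "status": "paid", "offset": 12},
    {"id": 1, "status": "paid", "offset": 12},  # redelivered after a restart
]

upserted = apply_upsert(events)       # two rows, one per primary key
appended = list(events)               # four rows, including the duplicate
current_state = dedupe_appended(appended)
```

In a warehouse, the equivalent of `dedupe_appended` is typically a window function that keeps the latest row per primary key.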

## Failure Scenarios

<AccordionGroup>
  <Accordion title="Source Database Outage">
    **What happens:**

    * The source connector detects the connection loss and enters an error or retry state.
    * The pipeline status changes to **Broken** in the Streamkap UI.
    * No new change events are captured while the source is unavailable.
    * Data already in Kafka continues to flow to the destination.

    **Recovery:**

    1. Restore the source database and verify it is accepting connections.
    2. The source connector automatically attempts to reconnect. Once the source is reachable, the connector resumes reading from its last known position in the change log (WAL, binlog, oplog, or change stream).
    3. If the source was rebuilt from a backup or replica and the change log position is no longer available, you may need to trigger a [snapshot](/snapshots) to backfill affected tables.

    <Warning>
      **PostgreSQL:** Ensure the replication slot is preserved during any source database failover or maintenance. If the replication slot is dropped, the connector cannot resume from its previous position and a full snapshot is required. See the [PostgreSQL source documentation](/postgresql) for replication slot configuration.
    </Warning>
  </Accordion>

  <Accordion title="Destination Outage">
    **What happens:**

    * Data continues to be captured from the source and buffered in Kafka topics.
    * The destination connector retries writes automatically.
    * Messages that fail after all retries are routed to the [DLQ](/dlq-operations).
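    * No data loss is expected as long as the outage is shorter than the Kafka retention period. If the outage lasts longer, the oldest buffered messages may expire before they are delivered, and a [snapshot](/snapshots) of the affected tables may be needed.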

    **Recovery:**

    1. Restore the destination and verify it is accepting connections and writes.
    2. The destination connector automatically resumes writing buffered data from Kafka.
    3. Monitor [consumer group lag](/consumer-groups) after recovery to confirm the pipeline is catching up.
    4. Check the [DLQ](/dlq-operations) for any messages that failed during the outage window and resolve them.

    <Note>
      During recovery, you may observe temporarily elevated lag as the pipeline processes the backlog of buffered messages. This is expected and will decrease as the pipeline catches up.
    </Note>
  </Accordion>

  <Accordion title="Kafka / Platform Outage (Streamkap Cloud)">
    **What happens:**

    * Pipeline processing pauses entirely. No data is captured from sources or written to destinations.
    * Source databases continue to accumulate changes in their change logs (WAL, binlog, oplog).

    **Recovery:**

    * For [Streamkap Cloud](/streamkap-cloud) deployments, recovery is managed by the Streamkap team with 24/7 monitoring.
    * Once the platform recovers, pipelines automatically resume from their last committed offsets.
    * Source connectors reconnect and read accumulated change log entries, provided the source has retained them for the duration of the outage.
  </Accordion>

  <Accordion title="Source Database Failover (HA / Replica Promotion)">
    When a source database undergoes a planned or unplanned failover (e.g., primary to replica promotion), the behavior depends on the database type:

    **PostgreSQL:**

    * Replication slots are tied to the primary instance. After failover to a standby, the slot typically needs to be recreated on the new primary.
    * If the slot is lost, the connector cannot resume from its previous position. A snapshot may be required.
    * Consider using logical replication slot failover capabilities (for example, the slot synchronization to standbys available from PostgreSQL 17) if your PostgreSQL version and HA solution support them.

    **MySQL:**

    * If GTID-based replication is enabled and GTIDs are consistent across the primary and replica, the connector can resume seamlessly after failover.
    * Without GTIDs, the connector relies on binlog file and position, which may not transfer across instances. A snapshot may be required.

    **MongoDB:**

    * Change stream resume tokens allow seamless failover within a replica set. When the primary changes, the connector resumes from its last resume token automatically.
    * For sharded clusters, each shard's change stream resumes independently.
  </Accordion>

  <Accordion title="Network Interruption">
    **What happens:**

    * The pipeline detects connectivity loss and enters a retry loop.
    * Transient interruptions (seconds to minutes) are typically handled automatically.

    **Recovery:**

    * **Transient interruptions:** The pipeline reconnects and resumes automatically once connectivity is restored. No action needed.
    * **Extended interruptions:** If connectivity is not restored within the retry window, the pipeline may transition to a **Broken** state.
      1. Verify network connectivity between Streamkap and your source/destination.
      2. Check VPN or [PrivateLink](/connection-options) status if applicable.
      3. Once connectivity is restored, restart the pipeline from the Streamkap UI if it does not resume automatically.

    <Tip>
      For [Bring Your Own Cloud (BYOC) deployments](/bring-your-own-cloud-byoc), verify that the VPN tunnel between the control plane and data plane is active, and that the data plane can reach your source and destination endpoints.
    </Tip>
  </Accordion>
</AccordionGroup>
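The per-database guidance in the Source Database Failover scenario can be condensed into a small decision helper. The Python sketch below is a hedged summary for runbooks, not part of any Streamkap tooling: `FailoverState`, its fields, and the returned strings are hypothetical, and the MongoDB `resume_token_valid` flag is an added assumption that the resume token still falls within the oplog window.

```python
from dataclasses import dataclass

RESUME = "resume"
SNAPSHOT = "plan a snapshot"


@dataclass
class FailoverState:
    """Facts an operator gathers after a failover (all fields are illustrative)."""
    database: str                           # "postgresql", "mysql", or "mongodb"
    replication_slot_present: bool = False  # PostgreSQL: slot exists on the new primary
    gtid_consistent: bool = False           # MySQL: GTIDs match across old and new primary
    resume_token_valid: bool = False        # MongoDB: token still usable (assumption)


def next_step(state: FailoverState) -> str:
    """Return RESUME if the connector can continue, otherwise SNAPSHOT."""
    db = state.database.lower()
    if db == "postgresql":
        can_resume = state.replication_slot_present
    elif db == "mysql":
        can_resume = state.gtid_consistent
    elif db == "mongodb":
        can_resume = state.resume_token_valid
    else:
        raise ValueError(f"unsupported database type: {state.database}")
    return RESUME if can_resume else SNAPSHOT
```

For example, `next_step(FailoverState("postgresql"))` returns `"plan a snapshot"`, reflecting that a lost replication slot prevents the connector from resuming from its previous position.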

## Post-Outage Verification Checklist

After recovering from any outage, work through this checklist to confirm your pipelines are healthy:

<Steps>
  <Step title="Verify pipeline status">
    Navigate to the [Pipelines](/pipelines) page and confirm all affected pipelines show an **Active** (green) status.
  </Step>

  <Step title="Check consumer group lag">
    Open the [Consumer Groups](/consumer-groups) page and verify that lag for affected consumer groups is **decreasing**. Sustained or increasing lag after recovery indicates a problem.
  </Step>

  <Step title="Verify data flow to destination">
    Query your destination to confirm new data is arriving. Check the latest timestamps on recently updated tables to verify freshness.
  </Step>

  <Step title="Review the dead letter queue">
    Check the [DLQ](/dlq-operations) for messages that failed during the outage or recovery window. Resolve or replay any failed messages.
  </Step>

  <Step title="Compare source and destination counts">
    For critical tables, compare row counts between the source database and the destination to confirm they are in sync. If discrepancies exist, consider triggering a [snapshot](/snapshots) for the affected tables.
  </Step>
</Steps>
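Two of the checks above, the lag trend and the row-count comparison, are easy to script once you have the numbers. The following Python sketch assumes you have already collected lag samples and row counts by your own means; `lag_is_recovering` and `count_mismatches` are hypothetical helper names, and comparing only the first and last sample is a deliberately simple rule that tolerates short-lived jitter.

```python
from typing import Dict, List, Sequence


def lag_is_recovering(samples: Sequence[int]) -> bool:
    """True if the newest lag sample is lower than the oldest (samples are oldest first)."""
    if len(samples) < 2:
        return False  # not enough data to judge a trend
    return samples[-1] < samples[0]


def count_mismatches(source: Dict[str, int], destination: Dict[str, int]) -> List[str]:
    """Tables whose row counts differ, or that are missing from the destination."""
    return sorted(
        table for table, rows in source.items() if destination.get(table) != rows
    )


# Example inputs; the numbers are made up for illustration.
lag_samples = [9000, 6500, 7000, 1200]
source_counts = {"orders": 10, "users": 5, "items": 3}
destination_counts = {"orders": 10, "users": 4}

recovering = lag_is_recovering(lag_samples)
out_of_sync = count_mismatches(source_counts, destination_counts)
```

Tables returned by `count_mismatches` are candidates for a snapshot. In insert/append mode, a higher destination count may simply reflect duplicate deliveries, so deduplicate before comparing.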

## Configuration Backup & Recovery

Maintaining recoverable pipeline configurations reduces the time and effort needed to rebuild after a major incident.

### API Export

Pipeline and connector configurations can be retrieved programmatically via the [Streamkap API](/api). Use the API to export configuration state for backup purposes or to recreate resources in a new environment.
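As a rough illustration of a backup routine, the Python sketch below writes one JSON file per resource type in a stable format suited to version control. It deliberately does not assume any Streamkap endpoint names: `fetch` is a caller-supplied function that performs the actual API call, and `export_configs` and the resource type labels are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Iterable


def export_configs(
    fetch: Callable[[str], Any],
    resource_types: Iterable[str],
    out_dir: Path,
) -> Dict[str, Path]:
    """Write one JSON file per resource type and return the paths written.

    `fetch` is supplied by the caller and is expected to call the Streamkap API;
    no endpoint names or response shapes are assumed here.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written: Dict[str, Path] = {}
    for resource_type in resource_types:
        payload = fetch(resource_type)
        path = out_dir / f"{resource_type}.json"
        # sort_keys and indent keep the output stable so Git diffs stay readable.
        text = json.dumps(payload, indent=2, sort_keys=True) + "\n"
        path.write_text(text, encoding="utf-8")
        written[resource_type] = path
    return written
```

Committing the output directory after each run gives a dated history of configuration changes. Review exported files for credentials before committing them.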

### Terraform (Recommended)

The [Streamkap Terraform Provider](/streamkap-provider-for-terraform) enables you to define all sources, destinations, pipelines, and transforms as infrastructure-as-code. This provides:

* **Version-controlled configuration** tracked in Git alongside your application code
* **Reproducible deployments** that can recreate your entire pipeline topology from code
* **Disaster recovery** by re-applying Terraform configurations to a new environment
* **Change auditing** through standard Git history and pull request reviews

<Tip>
  Maintain your Streamkap Terraform configurations in version control and treat them as the source of truth for your pipeline infrastructure. This enables rapid recovery by re-applying configurations if resources need to be recreated.
</Tip>

For setup instructions, see:

* [Terraform Getting Started](/streamkap-provider-for-terraform)
* [Terraform Configuration](/terraform-configuration)
* [Terraform Resources](/terraform-resources)

## Related Documentation

* [Pipelines](/pipelines) - Monitor pipeline status and manage data flow
* [Snapshots & Backfilling](/snapshots) - Re-backfill data after outages or schema changes
* [Deployment Options](/deployment-options) - Streamkap Cloud and BYOC deployment models
