# Pipeline Recovery Procedures

> Troubleshoot and recover Streamkap pipelines using structured decision trees, step-by-step procedures, and scenario-based guidance.

This guide helps you diagnose and resolve pipeline issues in Streamkap. Use it when a pipeline is in a Broken state, data has stopped flowing, or you observe data discrepancies between source and destination.

For routine pipeline management, see [Pipelines](/pipelines). For proactive monitoring, see [Alerts](/alerts).

## Recovery Decision Tree

Use the following flow to determine the correct recovery action for your situation.

### Pipeline is in Broken status

1. Go to the [Logs](/logs) page and filter by the affected connector at the **ERROR** level
2. Identify the error type:
   * **Connection error** (network timeout, authentication failure) — fix connectivity, then restart the connector via the pipeline actions menu
   * **Schema error** (type mismatch, missing column) — fix the schema at source or destination, then restart the connector and monitor for recovery
   * **Permission error** (access denied, insufficient privileges) — grant required permissions, then restart the connector
   * **Resource error** (disk full, memory exhaustion) — free resources or scale infrastructure, then restart
3. If the error persists after restarting, stop the connector, wait until the underlying issue is resolved, and then resume it. If it still fails, contact [Streamkap support](mailto:support@streamkap.com)
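The triage in step 2 amounts to a lookup from error category to recovery action. The sketch below is illustrative only: the keyword patterns in `ERROR_RULES` and the `triageError` function are hypothetical, chosen for this example, and do not reflect Streamkap's actual log format or classification logic.

```javascript
// Hypothetical keyword rules for triaging an ERROR-level log message.
// These patterns are illustrative assumptions, not Streamkap's log format.
const ERROR_RULES = [
  { category: "connection", pattern: /timeout|timed out|connection refused|authentication failed/i,
    action: "Fix connectivity, then restart the connector" },
  { category: "schema", pattern: /type mismatch|missing column|unknown column|schema/i,
    action: "Fix the schema at source or destination, then restart and monitor" },
  { category: "permission", pattern: /access denied|permission denied|insufficient privileges/i,
    action: "Grant the required permissions, then restart the connector" },
  { category: "resource", pattern: /disk full|no space left|out of memory|outofmemory/i,
    action: "Free resources or scale infrastructure, then restart" },
];

// Return the first matching rule, or an escalation hint when nothing matches.
function triageError(message) {
  const rule = ERROR_RULES.find((r) => r.pattern.test(message));
  return rule
    ? { category: rule.category, action: rule.action }
    : { category: "unknown", action: "Stop and resume the connector, or contact Streamkap support" };
}

console.log(triageError("ERROR: Access denied for user 'streamkap'@'%'").category); // permission
```

Rule order matters: the first match wins, so place narrower patterns before broad ones if you adapt this to real log messages.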

### Pipeline is running but data is not flowing

1. Check the pipeline's **Lag** metric: if lag is increasing, the source is producing changes faster than the destination is consuming them
2. Check the [Consumer Groups](/consumer-groups) page for the destination's consumer group:
   * Is the consumer group in `STABLE` state with active members?
   * Is consumer lag growing or static?
3. Check the source connector status — is it Active or Broken?
4. Check the [Logs](/logs) page for WARN or ERROR messages from either connector
5. If the source connector is healthy but the destination is not consuming, stop and resume the destination connector
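Steps 1 and 2 hinge on whether lag is growing, static, or shrinking across successive observations, and on whether the consumer group has active members. A minimal sketch, assuming you record total lag samples yourself (for example, by noting the **Total Lag** value on the Consumer Groups page a few minutes apart); `classifyLag` and `diagnoseFlow` are hypothetical helpers, and the tolerance default is an arbitrary choice.

```javascript
// Classify total-lag samples (oldest first) as growing, decreasing, or static.
// `tolerance` absorbs small fluctuations; the default of 0 is an arbitrary choice.
function classifyLag(samples, tolerance = 0) {
  if (samples.length < 2) throw new Error("Need at least two lag samples");
  const delta = samples[samples.length - 1] - samples[0];
  if (delta > tolerance) return "growing";
  if (delta < -tolerance) return "decreasing";
  return "static";
}

// Suggest a next step from consumer group state, source status, and lag trend.
function diagnoseFlow({ lagSamples, groupState, memberCount, sourceStatus }) {
  if (sourceStatus === "Broken") return "Source is Broken: follow the Broken-status steps first";
  if (groupState === "EMPTY" || memberCount === 0) {
    return "No active consumers: stop and resume the destination connector";
  }
  const trend = classifyLag(lagSamples);
  const latest = lagSamples[lagSamples.length - 1];
  if (trend === "decreasing") return "Destination is consuming: keep monitoring";
  if (trend === "static" && latest === 0) return "Caught up: the source may have no new changes";
  if (trend === "static") return "Lag is stuck: stop and resume the destination connector";
  return "Destination is falling behind: check destination logs, then stop and resume it";
}

console.log(diagnoseFlow({ lagSamples: [1200, 1200, 1200], groupState: "STABLE", memberCount: 2, sourceStatus: "Active" }));
// Lag is stuck: stop and resume the destination connector
```

Static lag of zero and static non-zero lag look identical on a trend chart but mean opposite things, which is why the sketch checks the latest value separately.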

### Data at destination does not match source

1. Check the [DLQ (Dead Letter Queue)](/dlq-operations) for failed messages. Schema mismatches or constraint violations can cause records to be diverted to the DLQ instead of being written to the destination
2. Review the [Logs](/logs) for any processing errors or warnings
3. If specific records are missing, consider a [filtered (partial) snapshot](/snapshots#snapshot-options) for the affected time range
4. If broad data discrepancy is detected, consider a [full snapshot](/snapshots) of the affected tables
5. For Snowflake destinations in append mode, verify that both consumer group offsets and Snowpipe Streaming channel offsets are aligned — see [Snowflake Offset Management](/snowflake#offset-management-append-mode)

## Recovery Actions

The following table summarizes the available recovery actions, when to use each, and their impact on data.

| Action                           | What It Does                                                                                            | When to Use                                                                                | Data Impact                                                                                                              |
| -------------------------------- | ------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------ |
| **Restart connector**            | Stops and immediately restarts the source or destination connector                                      | Transient errors, connection timeouts, minor configuration changes                         | None — resumes from last committed offset                                                                                |
| **Stop + Resume**                | Manually stops a connector, then resumes it after a pause                                               | Persistent errors that need time to resolve (e.g., waiting for infrastructure fixes)       | None — resumes from last committed offset                                                                                |
| **Reset consumer group offsets** | Changes the consumer group's offset position (earliest, latest, specific timestamp, or specific offset) | Need to replay messages, skip problematic records, or recover from a known-good position   | May cause data duplication (if reset to an earlier offset) or data loss (if reset to a later offset)                     |
| **Snapshot**                     | Triggers a full or filtered snapshot to backfill historical data from source                            | Missing data at destination, post-schema-change backfill, initial data load for new tables | Destination receives snapshot data in addition to ongoing CDC; no data loss but may temporarily increase latency and lag |
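The data impact of resetting offsets follows directly from offset arithmetic: moving a partition's committed offset backward re-delivers the messages in between (potential duplicates), and moving it forward skips them (potential loss). A sketch of that calculation for one partition; `resetImpact` is a hypothetical helper, not a Streamkap API.

```javascript
// Estimate the effect of moving one partition's committed offset.
// Kafka offsets are per partition, so evaluate each partition you reset.
function resetImpact(committedOffset, targetOffset) {
  const diff = targetOffset - committedOffset;
  if (diff < 0) return { effect: "replay", messages: -diff }; // possible duplicates
  if (diff > 0) return { effect: "skip", messages: diff };    // possible data loss
  return { effect: "none", messages: 0 };
}

console.log(resetImpact(1500, 1000)); // { effect: 'replay', messages: 500 }
console.log(resetImpact(1500, 1800)); // { effect: 'skip', messages: 300 }
```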

<Warning>
  Resetting consumer group offsets and snapshotting are powerful operations. Always verify the impact on your destination before proceeding, especially in production environments.
</Warning>

## Recovery Scenarios

<AccordionGroup>
  <Accordion title="Pipeline shows FAILED / Broken status">
    **Symptoms:**

    * Pipeline status shows **Broken** (red badge) in the Pipelines list
    * An info icon appears next to the status with error details

    **Diagnosis:**

    1. Click the info icon next to the Broken status to see the error summary
    2. Navigate to the [Logs](/logs) page, filter by the affected connector, and set the log level to **ERROR**
    3. Expand error messages to view stack traces and identify the root cause

    **Common causes and resolutions:**

    | Cause                                        | Resolution                                                                                                                          |
    | -------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
    | Source database connection lost              | Verify network connectivity, firewall rules, and SSH tunnel/VPN status. Restart the source connector once connectivity is restored. |
    | Destination authentication failure           | Check that destination credentials are valid and have not been rotated. Update connector settings if needed.                        |
    | Schema incompatibility                       | Compare source and destination schemas. Fix mismatches and restart the connector. See [DLQ](/dlq-operations) for failed messages.   |
    | Resource exhaustion on source or destination | Free disk space, increase memory, or scale the database. Restart the connector after resources are available.                       |

    **Recovery:**

    * Fix the underlying issue, then use **Resume source** or **Resume destination** from the pipeline's [row actions menu](/pipelines#row-actions-menu)
    * If the connector does not recover after resuming, stop it, wait 30 seconds, and resume again
  </Accordion>

  <Accordion title="Pipeline is running but no data is flowing">
    **Symptoms:**

    * Pipeline status shows **Active** but lag is not decreasing or latency is not updating
    * Destination tables are not receiving new records
    * Consumer group lag is static or growing

    **Diagnosis:**

    1. Check the pipeline's **Lag** and **Latency** metrics on the [Pipelines](/pipelines) page
    2. Navigate to [Consumer Groups](/consumer-groups) and locate the destination's consumer group
       * Verify the group is in `STABLE` state
       * Check the **Total Lag** metric and per-partition consumer lag
    3. Check the source connector status — is it still streaming? Look for recent INFO-level log messages
    4. Review [Logs](/logs) for WARN or ERROR messages from both source and destination connectors

    **Common causes and resolutions:**

    | Cause                                | Resolution                                                                                                  |
    | ------------------------------------ | ----------------------------------------------------------------------------------------------------------- |
    | Source connector stopped or paused   | Resume the source connector from the pipeline actions menu                                                  |
    | Destination connector stuck          | Stop and resume the destination connector                                                                   |
    | Consumer group has no active members | Verify the destination connector is running; restart if the consumer group shows `EMPTY` state              |
    | Network issue between components     | Check network connectivity; review firewall and security group rules                                        |
    | Source database has no new changes   | Confirm that changes are being made to the source tables; this may be expected behavior during idle periods |
  </Accordion>

  <Accordion title="Data at destination does not match source">
    **Symptoms:**

    * Row counts differ between source and destination
    * Specific records are missing or outdated at the destination
    * Column values differ between source and destination

    **Diagnosis:**

    1. Check the [DLQ](/dlq-operations) for messages that failed delivery — these records were diverted instead of being written to the destination
    2. Review the [Logs](/logs) page for schema errors, type conversion warnings, or constraint violations
    3. Verify that the destination schema matches the source schema (column types, nullable constraints, primary keys)
    4. Check if a snapshot was recently cancelled or failed — incomplete snapshots can leave gaps in historical data

    **Common causes and resolutions:**

    | Cause                                | Resolution                                                                                                                                                                   |
    | ------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
    | Messages in the DLQ                  | Fix the root cause (schema mismatch, permission error, size limit), then snapshot affected tables to backfill missing records. See [DLQ Recovery](/dlq-operations#recovery). |
    | Incomplete or failed snapshot        | Re-trigger a snapshot for the affected tables. See [Failed Snapshot Recovery](/snapshots#failed-snapshot-recovery).                                                          |
    | Schema change without snapshot       | Trigger a snapshot to backfill historical rows with the new schema. See [Snapshotting After Schema Changes](/snapshots#snapshotting-after-schema-changes).                   |
    | Consumer group offset desynchronized | Reset consumer group offsets to replay missed messages. Follow the [offset reset procedure](#how-to-reset-consumer-group-offsets) below.                                     |
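
    To scope a mismatch before choosing between a filtered and a full snapshot, it can help to compare row counts per table. A minimal sketch, assuming you have already collected counts from each side (for example with `SELECT COUNT(*)` against source and destination); the input shape and the `compareRowCounts` name are hypothetical. Counts taken while CDC is streaming will differ slightly, so a tolerance is applied. A negative `missing` value means the destination holds extra rows, which can happen with append-mode destinations that keep every change event.

    ```javascript
    // Compare per-table row counts collected from source and destination.
    // `tolerance` allows for rows still in flight while CDC is streaming (an assumption).
    function compareRowCounts(sourceCounts, destCounts, tolerance = 0) {
      const report = [];
      for (const [table, srcRows] of Object.entries(sourceCounts)) {
        const destRows = destCounts[table] ?? 0; // table absent at destination
        const missing = srcRows - destRows;
        if (Math.abs(missing) > tolerance) {
          report.push({ table, srcRows, destRows, missing });
        }
      }
      return report;
    }

    const mismatches = compareRowCounts(
      { orders: 10_000, customers: 2_500 },
      { orders: 9_400, customers: 2_500 },
      10,
    );
    console.log(mismatches); // one entry: orders, missing 600
    ```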
  </Accordion>

  <Accordion title="Pipeline performance degraded">
    **Symptoms:**

    * Pipeline latency is significantly higher than normal
    * Consumer lag is growing steadily
    * Data arrives at the destination with increasing delay

    **Diagnosis:**

    1. Check the pipeline's **Lag** and **Latency** on the [Pipelines](/pipelines) page
    2. Navigate to [Consumer Groups](/consumer-groups) and check per-partition lag to identify bottlenecks
    3. Review [Logs](/logs) for slow query warnings or timeout messages
    4. Check if a snapshot is currently running — snapshots increase load and may temporarily degrade performance

    **Common causes and resolutions:**

    | Cause                             | Resolution                                                                                                                                                                   |
    | --------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
    | Active snapshot in progress       | This is expected. Wait for the snapshot to complete. Lag and latency will normalize after backfilling finishes.                                                              |
    | Insufficient consumer parallelism | Increase the **Tasks** setting on the destination connector to add more parallel consumers. See [Consumer Groups - Performance Tuning](/consumer-groups#performance-tuning). |
    | Destination write bottleneck      | Check destination database performance. Consider scaling destination resources or optimizing table indexes.                                                                  |
    | High source database load         | Schedule snapshots during off-peak hours. Review source database query performance.                                                                                          |
    | Partition count too low           | Increase topic partition count to enable higher parallelism. See [Topics](/topics).                                                                                          |
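
    In a Kafka consumer group, each partition is consumed by at most one consumer at a time, so parallelism is capped by the partition count: raising **Tasks** above the number of partitions leaves the extra consumers idle. The sketch below computes that cap and flags partitions whose lag is far above the median, which points at a hot partition rather than a general throughput problem. It assumes each task corresponds to one consumer in the group; the `skewFactor` threshold is arbitrary, and both helpers are hypothetical.

    ```javascript
    // Parallelism is capped by partitions: extra consumers receive no assignment.
    function effectiveParallelism(tasks, partitions) {
      return Math.min(tasks, partitions);
    }

    // Flag partitions whose lag exceeds `skewFactor` times the median lag.
    function findHotPartitions(lagByPartition, skewFactor = 3) {
      const lags = Object.values(lagByPartition).sort((a, b) => a - b);
      const mid = Math.floor(lags.length / 2);
      const median = lags.length % 2 ? lags[mid] : (lags[mid - 1] + lags[mid]) / 2;
      return Object.entries(lagByPartition)
        .filter(([, lag]) => lag > skewFactor * median)
        .map(([partition]) => Number(partition));
    }

    console.log(effectiveParallelism(8, 4)); // 4
    console.log(findHotPartitions({ 0: 120, 1: 95, 2: 4800, 3: 110 })); // [ 2 ]
    ```

    If lag is spread evenly, adding tasks (up to the partition count) or partitions helps; if one partition dominates, the cause is more likely uneven key distribution or a slow write path for that partition's data.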

    <Note>
      Increased latency and lag are expected while snapshots are running. Backfilling produces higher load than normal CDC streaming, but the load is temporary.
    </Note>
  </Accordion>

  <Accordion title="MongoDB ChangeStreamFatalError — resume token expired">
    **Symptoms:**

    * Pipeline shows Broken status with `ChangeStreamFatalError` in logs
    * Connector cannot resume from its last position
    * Error 280 appears in connector logs

    **Diagnosis:**

    1. Check the [Logs](/logs) page for `ChangeStreamFatalError` messages
    2. Verify the MongoDB oplog retention period — if the connector was offline longer than the oplog retention, the resume token has expired

    **Resolution:**

    1. Contact [Streamkap support](mailto:support@streamkap.com) for an offset reset
    2. After the offset reset, trigger a [snapshot](/snapshots) to backfill any data missed during the outage
    3. Increase oplog retention to 48 hours or more to prevent recurrence:
       ```javascript theme={null}
       // Run against each replica set member; minRetentionHours requires MongoDB 4.4+
       db.adminCommand({ replSetResizeOplog: 1, minRetentionHours: 48 })
       ```
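
    A change stream can only resume if the oplog still contains the entry its resume token points to. A minimal sketch of that check, assuming you know the time of the connector's last processed event and the timestamp of the oldest oplog entry (in `mongosh`, `rs.printReplicationInfo()` reports the oplog first and last event times). `resumeTokenStillValid` and `requiredRetentionHours` are hypothetical helpers, the 2x safety factor is an arbitrary assumption, and the dates are example values.

    ```javascript
    // The resume point must not be older than the oldest entry the oplog retains.
    function resumeTokenStillValid(lastProcessedAt, oplogFirstEventAt) {
      return lastProcessedAt.getTime() >= oplogFirstEventAt.getTime();
    }

    // Retention needed to survive an outage of the given length, with a safety
    // factor (assumed 2x) and the 48-hour floor recommended in step 3.
    function requiredRetentionHours(outageHours, safetyFactor = 2) {
      return Math.max(48, Math.ceil(outageHours * safetyFactor));
    }

    const lastEvent = new Date("2024-05-01T08:00:00Z");
    const oplogStart = new Date("2024-05-02T02:00:00Z");
    console.log(resumeTokenStillValid(lastEvent, oplogStart)); // false: token expired
    console.log(requiredRetentionHours(20)); // 48
    ```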

    **Related:** [Error Reference — MongoDB ChangeStreamFatalError](/error-reference#replication) | [MongoDB Source FAQ](/mongodb-source-faq)
  </Accordion>

  <Accordion title="MySQL schema recovery — 'Schema isn't known to this connector'">
    **Symptoms:**

    * Pipeline shows Broken status with schema-related errors
    * Error message: "Schema isn't known to this connector"
    * Typically affects databases with 1000+ tables

    **Diagnosis:**

    1. Check the [Logs](/logs) page for schema-related error messages
    2. Verify the number of tables in the source database — this issue is more common with large database instances

    **Resolution:**

    1. Contact [Streamkap support](mailto:support@streamkap.com) for schema history recovery
    2. After recovery, trigger a [snapshot](/snapshots) of affected tables to ensure data consistency
    3. Consider enabling **Capture Only Captured Tables DDL** in the source's Advanced settings to prevent recurrence

    **Related:** [Error Reference — MySQL Schema Error](/error-reference#schema) | [Schema History Optimization](/schema-history-optimization)
  </Accordion>

  <Accordion title="Pipeline stuck after source database maintenance">
    **Symptoms:**

    * Pipeline was working before a source database restart, failover, or maintenance window
    * Pipeline shows Broken status or is running but no new data is arriving
    * Logs show connection errors or "connection refused" messages

    **Diagnosis:**

    1. Confirm that the source database is back online and accepting connections
    2. Verify that the database user credentials and network configuration have not changed
    3. Check [Logs](/logs) for connection error messages from the source connector
    4. For PostgreSQL sources, verify that the replication slot still exists and has not been dropped during maintenance
    5. For MySQL sources, verify that binary logging is still enabled and the binlog has not been purged past the connector's position

    **Resolution:**

    1. Verify source database connectivity and credentials
    2. Resume the source connector from the pipeline actions menu
    3. If the source connector does not recover:
       * Stop the source connector
       * Wait for the source database to be fully available
       * Resume the source connector
    4. If the connector reports that its position (replication slot, binlog position) is no longer valid:
       * The source connector may need to be reconfigured
       * Trigger a snapshot to re-establish the data baseline
       * Contact [Streamkap support](mailto:support@streamkap.com) if the issue persists

    <Warning>
      For PostgreSQL sources, if the replication slot was dropped during maintenance, the connector cannot resume from its previous position. A snapshot will be required to re-establish data consistency.
    </Warning>
  </Accordion>
</AccordionGroup>

## Step-by-Step Recovery Procedures

### How to restart a pipeline connector

<Steps>
  <Step title="Navigate to the Pipelines page">
    Go to the **Pipelines** page from the sidebar and locate the affected pipeline.
  </Step>

  <Step title="Open the actions menu">
    Click the actions menu (three dots) on the pipeline row.
  </Step>

  <Step title="Stop the connector">
    Select **Stop source** or **Stop destination** depending on which connector needs to be restarted.
  </Step>

  <Step title="Wait briefly">
    Wait approximately 10-30 seconds for the connector to fully stop. You can confirm that it has stopped by checking the connector status on the pipeline detail page.
  </Step>

  <Step title="Resume the connector">
    Open the actions menu again and select **Resume source** or **Resume destination**.
  </Step>

  <Step title="Verify recovery">
    Monitor the pipeline's **Status**, **Lag**, and **Latency** metrics to confirm the connector has recovered. Check the [Logs](/logs) page for any new error messages.
  </Step>
</Steps>
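
If you script this procedure, the important detail is to wait until the connector actually reports a stopped state before resuming, rather than sleeping for a fixed time. The sketch below shows that polling pattern against a hypothetical client object with `stop()`, `resume()`, and `getStatus()` methods; these names and the `"Stopped"` status string are assumptions for illustration, not Streamkap's API.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Poll `check` until it returns true or the timeout elapses.
async function waitFor(check, { timeoutMs = 30_000, intervalMs = 2_000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await check()) return true;
    await sleep(intervalMs);
  }
  return false;
}

// `connector` is a hypothetical client exposing stop(), resume(), and getStatus().
async function restartConnector(connector, opts) {
  await connector.stop();
  const stopped = await waitFor(async () => (await connector.getStatus()) === "Stopped", opts);
  if (!stopped) throw new Error("Connector did not stop within the timeout");
  await connector.resume();
  // Resolves to true once the connector reports Active again, false on timeout.
  return waitFor(async () => (await connector.getStatus()) === "Active", opts);
}
```

The 30-second default timeout mirrors the upper end of the wait suggested in the steps; adjust it to what you observe for your connectors.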

<Info>
  **Is restart different from stop + resume?** In Streamkap, restarting a connector is effectively a stop followed by a resume. Both operations preserve the connector's last committed offset position, so no data is lost. The distinction matters primarily when you need to make changes (fix permissions, update credentials, wait for infrastructure) between stopping and resuming.
</Info>

### How to reset consumer group offsets

<Steps>
  <Step title="Stop the destination connector">
    From the pipeline's [row actions menu](/pipelines#row-actions-menu), select **Stop destination**. All consumers in the group must be stopped before offsets can be reset.
  </Step>

  <Step title="Navigate to the consumer group">
    Go to the [Consumer Groups](/consumer-groups) page and find the consumer group associated with your destination connector.
  </Step>

  <Step title="Select partitions to reset">
    In the **Topic Partitions** table, check the boxes for the partitions you want to reset. You can select individual partitions or all partitions for a topic.
  </Step>

  <Step title="Click Reset Offsets">
    Click the **Reset Offsets** button. A dialog will appear with reset options.
  </Step>

  <Step title="Choose a reset strategy">
    Select the appropriate strategy:

    * **Earliest** — replay all available messages from the beginning of the retention window
    * **Latest** — skip to the end and only process new messages going forward
    * **Specific Timestamp** — reset to the first offset after a given timestamp
    * **Specific Offset** — set a precise offset position
  </Step>

  <Step title="Apply the reset">
    Review your selections and click **Apply**.
  </Step>

  <Step title="Handle Snowflake destinations (if applicable)">
    For Snowflake destinations in append mode, you must also reset the Snowpipe Streaming channel offsets to `-1`. See [Snowflake Offset Management](/snowflake#offset-management-append-mode) for the required SQL commands.
  </Step>

  <Step title="Resume the destination connector">
    Go back to the pipeline's actions menu and select **Resume destination**. The consumers will start processing from the new offset positions.
  </Step>
</Steps>
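
Each strategy ultimately resolves to one offset per selected partition. The sketch below models that resolution with Kafka semantics: a partition has an earliest retained (log-start) offset and a log-end offset, and a timestamp lookup returns the earliest offset whose timestamp is at or after the requested time, which mirrors Kafka's `offsetsForTimes` behavior. That the Streamkap dialog uses the same lookup is an assumption; the partition shape and `resolveResetOffset` are hypothetical.

```javascript
// partition: { logStartOffset, logEndOffset, messages: [{ offset, timestamp }] }
// `messages` (sorted by offset) stands in for the broker's time index here.
function resolveResetOffset(partition, strategy) {
  const { logStartOffset, logEndOffset, messages } = partition;
  switch (strategy.type) {
    case "earliest":
      return logStartOffset;
    case "latest":
      return logEndOffset;
    case "timestamp": {
      // First retained message at or after the timestamp; none means "latest".
      const hit = messages.find((m) => m.timestamp >= strategy.timestamp);
      return hit ? hit.offset : logEndOffset;
    }
    case "offset":
      if (strategy.offset < logStartOffset || strategy.offset > logEndOffset) {
        throw new RangeError(`Offset ${strategy.offset} is outside [${logStartOffset}, ${logEndOffset}]`);
      }
      return strategy.offset;
    default:
      throw new Error(`Unknown strategy: ${strategy.type}`);
  }
}

const examplePartition = {
  logStartOffset: 100,
  logEndOffset: 400,
  messages: [{ offset: 100, timestamp: 1000 }, { offset: 200, timestamp: 2000 }, { offset: 300, timestamp: 3000 }],
};
console.log(resolveResetOffset(examplePartition, { type: "timestamp", timestamp: 1500 })); // 200
```

A timestamp older than the retention window resolves to the earliest retained message, which is why only retained messages can be replayed.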

<Warning>
  Resetting offsets to **Earliest** on large topics will cause re-processing of all retained messages, which may take considerable time and could result in duplicate data at the destination. Streamkap retains topic data based on your project's retention policy. Only messages within the retention window can be replayed — check your project settings for the configured retention period.
</Warning>

### How to trigger a snapshot

<Steps>
  <Step title="Navigate to the source connector">
    Go to the source connector detail page from the pipeline detail view or the Sources page.
  </Step>

  <Step title="Choose snapshot scope">
    Decide whether to snapshot all tables (source-level snapshot) or specific tables (table-level snapshot):

    * **Source level:** Use the source connector's actions menu and select **Snapshot**
    * **Table level:** Find the specific topic in the Topics list and use its actions menu to select **Snapshot**
  </Step>

  <Step title="Select snapshot type">
    Choose between:

    * **Full (Complete)** — captures all rows from the selected tables
    * **Filtered (Partial)** — captures only rows matching a filter condition (useful for backfilling specific time ranges)
  </Step>

  <Step title="Confirm the snapshot">
    Type "snapshot" in the confirmation dialog to begin the operation.
  </Step>

  <Step title="Monitor progress">
    The source connector status will update to reflect the snapshot operation. Monitor progress on the source connector detail page and in the Topics list.
  </Step>

  <Step title="Verify completion">
    Once the snapshot completes, verify that the expected data is present at the destination. Check the pipeline's Lag metric — it should decrease as the snapshot data is processed.
  </Step>
</Steps>
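
For a filtered snapshot that backfills a time range, a half-open range (lower bound inclusive, upper bound exclusive) lets adjacent backfill windows line up without overlaps or gaps. The sketch below builds such a condition as a SQL-style `WHERE` fragment; the exact filter syntax Streamkap accepts is described on the [Snapshots](/snapshots#snapshot-options) page, so this format, the `updated_at` column, and `timeRangeFilter` are assumptions. Adjust the timestamp literal format to what your source database expects.

```javascript
// Build a half-open time-range condition: [from, to).
// Column name and literal format are assumptions; adapt them to your source.
function timeRangeFilter(column, from, to) {
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(column)) throw new Error(`Unsafe column name: ${column}`);
  if (!(from < to)) throw new Error("`from` must be earlier than `to`");
  return `${column} >= '${from.toISOString()}' AND ${column} < '${to.toISOString()}'`;
}

console.log(timeRangeFilter("updated_at", new Date("2024-05-01T00:00:00Z"), new Date("2024-05-02T00:00:00Z")));
// updated_at >= '2024-05-01T00:00:00.000Z' AND updated_at < '2024-05-02T00:00:00.000Z'
```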

<Info>
  Snapshots run in parallel with ongoing CDC streaming. Your real-time data flow continues uninterrupted while historical data is being backfilled. Expect temporarily increased latency and lag during snapshot operations.
</Info>

<Info>
  Recovery operation timing varies based on data volume, source database performance, and destination write capacity. Small table snapshots typically complete in minutes; large tables (100M+ rows) may take several hours. Offset resets take effect immediately, but reprocessing time depends on the volume of messages being replayed.
</Info>
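
A rough duration estimate is volume divided by sustained throughput. The throughput figures below are placeholders, not Streamkap benchmarks; substitute the rates you observe in your own pipeline's metrics.

```javascript
// Rough duration estimate in hours: volume / sustained throughput.
// Throughput values passed in are placeholders; measure your own pipeline's rates.
function estimateHours(records, recordsPerSecond) {
  if (recordsPerSecond <= 0) throw new RangeError("Throughput must be positive");
  return records / recordsPerSecond / 3600;
}

// 100M-row snapshot at an assumed 10,000 rows/s:
console.log(estimateHours(100_000_000, 10_000).toFixed(1)); // 2.8
// Replaying 5M messages after an offset reset at an assumed 20,000 msgs/s:
console.log(estimateHours(5_000_000, 20_000).toFixed(2)); // 0.07
```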

## Related Documentation

* [Pipelines](/pipelines) - Monitor pipeline health, manage connectors, and view performance metrics
* [Consumer Groups](/consumer-groups) - Monitor consumer lag, inspect members, and reset offsets
* [Snapshots & Backfilling](/snapshots) - Trigger and manage snapshots for data backfilling
* [Dead Letter Queue (DLQ)](/dlq-operations) - Inspect and resolve messages that failed processing