SQL Server on Amazon RDS

SQL Server Change Data Capture on Amazon RDS with Streamkap

Prerequisites

  • SQL Server version ≥ 2016 Service Pack 1 Standard or Enterprise Edition
  • Connection details
  • Streamkap user and role

Granting Privileges

It's recommended to create a separate user and role for Streamkap to access your SQL Server database. Below is an example script that does that.

-- Replace {database}, {password} placeholders
USE {database};
GO
CREATE LOGIN streamkap_user WITH PASSWORD = '{password}';
CREATE USER streamkap_user FOR LOGIN streamkap_user;

CREATE ROLE streamkap_role;
ALTER ROLE streamkap_role ADD MEMBER streamkap_user;

GRANT SELECT ON DATABASE::{database} TO streamkap_role;
GRANT ALTER, SELECT ON SCHEMA::{schema} TO streamkap_role;
GRANT CREATE TABLE TO streamkap_role;
GO

Enable Snapshots

You will need to create the table in the source database and give permissions to the streamkap_role. Streamkap will use this collection for managing snapshots.

❗️

Please create the signal table with the name streamkap_signal in a new schema called streamkap. It will not be recognised if given another name

-- Create the schema
CREATE SCHEMA streamkap;

-- Create the table within the schema
CREATE TABLE streamkap.streamkap_signal (
  id VARCHAR(255) PRIMARY KEY, 
  type VARCHAR(32) NOT NULL, 
  data VARCHAR(2000) NULL
);

-- Grant necessary privileges on the table to the role
GRANT SELECT, UPDATE, INSERT ON streamkap.streamkap_signal TO streamkap_role;

Enable Change Data Capture

You must have master user privileges to enable CDC.

Enable Change Data Capture on the Database

EXEC msdb.dbo.rds_cdc_enable_db '[<database>]'

Microsoft recommends keeping CDC data files separate from your primary database files

ALTER DATABASE {database} ADD FILEGROUP Streamkap_ChangeTracking;
ALTER DATABASE {database} ADD FILE (
	NAME      = Streamkap_ChangeTracking_Data,
	FILENAME  = N'{path}\Streamkap_ChangeTracking_Data.ndf'
) TO FILEGROUP Streamkap_ChangeTracking;
GO

Enable Change Data Capture for each Table

You must also include here the streamkap_signal table created above

EXEC sys.sp_cdc_enable_table  
@source_schema        = N'{schema}',
@source_name          = N'{table}',
@role_name            = N'streamkap_role',
@filegroup_name       = N'Streamkap_ChangeTracking',
@supports_net_changes = 0
GO

Streamkap Setup

  • Go to Sources and click Create New
  • Input
    • Name for your Connector
    • Endpoint
    • Signal Table Schema - Streamkap will use a collection in this schema to manage snapshots e.g. public. See Enable Snapshots for more information
    • Port (Default 1433)
    • Username (Username you chose earlier, our scripts use streamkap_user)
    • Password
    • Source Database Name
    • Connect via SSH Tunnel. See SSH Tunnel
    • Advanced Parameters
      • Snapshot Mode (Default When Needed) See SQL Server Snapshot Modesserver for more information
      • Represent Binary Data As (Default bytes)
      • Snapshot Chunk Size (Default 1024) - This is the number of rows read at a time when snapshotting. This is a low safe value. As a guide, if you have 100m + rows of data you may want to move this to 5120. If you have 1bn then a higher number still will allow you to backfill faster.
      • Max Batch Size (Default 2048) - A value that specifies the maximum size of each batch of events that the connector processes. Only increase if experiencing lag
    • Add Schemas/Tables. Can also bulk upload here. The format is a simple list of each schema or table per row saved in csv format without a header.
    • Click Save
      The connector will take approximately 1 minute to start processing data.

SQL Server Schema Evolution

When CDC is enabled for a table and changes occur to it, change events are persisted to a 'change table' on the SQL Server database server.

If you introduce a change in the structure of the source table, for example, by adding a new column, that change is not reflected in the change table.

For as long as the change table continues to use the outdated table structure, the Streamkap SQL Server Source is unable to capture change events for the table correctly.

You must intervene to refresh the change table structure because of the way that CDC is implemented in SQL Server.

Refresh change table structure (Online)

🚧

Online refresh limitation

In the interval between the source table structure changing in the source database, and _before _the change table structure is refreshed, change events continue to be captured with the outdated table structure.

For example, if you added a new column to a source table, change events that are captured before the change table is refreshed will not contain the new column.

If this cannot be tolerated, an Offline refresh has to be performed. However, that means downtime for whatever system made structural changes to your tables and your Streamkap pipelines.

Please contact us for assistance if an Offline refresh is required.

The procedure for completing an Online refresh is simpler than running it Offline, and you can complete it without any downtime to your systems and Streamkap pipelines.

For every source table that has changed, copy paste the script below into SQL Server Management Studio, change placeholders as required, and then run all queries.

👍

If you're not sure what {refresh_table} name to use, use {schema}_{table}_v{N}. For example, if the source table is sales.orders then you'd use sales_orders_v2

-- Replace {database}, {schema} and {table} placeholders
-- {database}: name of the CDC enabled database
-- {schema}: name of the schema with tables to refresh
-- {table}: name of the table to refresh
-- {refresh_table}: a unique name for the refreshed change table
USE {database};
GO
EXEC sys.sp_cdc_enable_table
@source_schema        = N'{schema}',
@source_name          = N'{table}',
@role_name            = N'streamkap_role',
@filegroup_name       = N'Streamkap_ChangeTracking',
@supports_net_changes = 0,
@capture_instance     = N'{refresh_table}'
GO

❗️

Refresh table limitation

There cannot be more than 2 change tables for every source table.

After refreshing a change table using the above script, confirm with Streamkap Support that your SQL Server Source has started streaming from the refreshed change table. Once confirmed, disable CDC on the outdated change table. Use the script below to do that.

-- Replace {database}, {schema}, {table} and {refresh_table} placeholders
-- {database}: name of the CDC enabled database
-- {schema}: name of the schema with the table refreshed earlier
-- {table}: name of the table refreshed earlier
-- {refresh_table}: name of the previous refresh table, usually {schema}_{table}
USE {database};
GO
EXEC sys.sp_cdc_disable_table
@source_schema        = N'{schema}',
@source_name          = N'{table}',
@capture_instance     = N'{refresh_table}'
GO

Troubleshooting

SQL Server Setup scripts failing

There can be many reasons for the Setup scripts to fail, but the scripts below can help you diagnose the issues.

-- Replace {database} placeholder
-- {database}: name of the CDC enabled database
USE {database};
GO

SELECT name, database_id, source_database_id, compatibility_level, is_read_only, state, state_desc, is_in_standby, is_cleanly_shutdown, is_cdc_enabled, is_encrypted, replica_id
FROM sys.databases
WHERE name = '{database}' AND is_cdc_enabled=1;

EXEC sys.sp_cdc_help_change_data_capture
GO

If any of the queries return an error or no results:

  • Check you connected to the SQL Server database with a user that has DB_OWNER role privileges
  • If the SELECT ... FROM sys.databases ... returns no results, the database you want Streamkap to capture may not be CDC enabled
  • If the EXEC sys.sp_cdc_help_change_data_capture stored procedure returns no results, the source tables you want Streamkap to capture may not be CDC enabled

If you are still having issues after following the above steps, please don't hesitate to reach out to us.

Two capture instances already exist for source table

If you're getting this error message when refreshing the change table structure, it's because there cannot be more than 2 change tables for every source table.

To fix the problem, 1 of the 2 change tables for the source table need to be disabled. The scripts below can help you do that.

-- Replace {database}, {schema} and {table} placeholders
-- {database}: name of the CDC enabled database
-- {schema}: name of the schema with the table refreshed earlier
-- {table}: name of the table refreshed earlier
USE {database};
GO

EXEC sys.sp_cdc_help_change_data_capture
@source_schema = N'{schema}',
@source_name   = N'{table}'
GO

The above script should return 2 results, the 2 change tables for the {table} specified. Typically you would disable the oldest change table, so use the create_date column to identify the oldest one.

When you've identified the change table to disable, use itssource_schema, source_table and capture_instance names in the query below and execute. Then, try your refresh table script again.

-- Replace {database}, {schema}, {table} and {refresh_table} placeholders
-- {database}: name of the CDC enabled database
USE {database};
GO
EXEC sys.sp_cdc_disable_table
@source_schema        = N'{source_schema}',
@source_name          = N'{source_table}',
@capture_instance     = N'{capture_instance}'
GO