Rockset

Stream Change Data Capture (CDC) data into Rockset

Prerequisites

  • A Rockset Medium virtual instance or larger
  • A Rockset account granted at least the Member built-in role, or a custom role that includes the following privileges:
    • CREATE_COLLECTION_INTEGRATION
    • CREATE_INTEGRATION_GLOBAL
    • UPDATE_VI

Limitations

  • You will need to switch between Rockset and Streamkap during setup
  • Existing Streamkap MongoDB and DocumentDB Sources are incompatible with Rockset Destinations. You will need to create new MongoDB and DocumentDB Sources with the Include Schema? option set to No
  • Rockset doesn't automatically create Collections from its Integrations. You will need to manually create a Collection for each Source table (MySQL, PostgreSQL, SQL Server, etc.) or collection (MongoDB, DocumentDB)
  • Rockset only captures data from the point in time that the Integration is set up. After the Streamkap Pipeline is created and the Rockset Integration setup is complete, you can trigger a snapshot to backfill the historical data
  • The Rockset Destination's Batch Size must be adjusted manually so that data ingestion stays within Rockset's request limits of 10 MiB and 20,000 documents per request

Setup

Create a Rockset Integration

  1. Add a new Rockset Kafka Integration and click Start
  2. Give the Integration a memorable Integration Name and Description (optional)
  3. For where your Kafka cluster is hosted, choose Apache Kafka

🚧

Before you continue

Make sure that for the Source whose data you want to stream to Rockset, you have the:

  • Streamkap Source ID
  • Streamkap Topic name(s) for the tables (MySQL, PostgreSQL, SQL Server, etc.) or collections (MongoDB, DocumentDB)

Together with the source_ prefix, these make up the full names of the topics to be streamed to Rockset, e.g. source_abcdefg123456.mySchema.myTable

If you're not sure how to get these, please see the Troubleshooting section

  1. Choose Data Format based on the Source type:
    1. JSON: if your Source is MongoDB or DocumentDB
    2. AVRO: for all other Sources
  2. For Kafka Topics, enter the full name - including the Source's Streamkap ID - of each table or collection. For example, if your Source's ID is abcdefg123456 and the Streamkap Topic's name is mySchema.myTable, the full name would be source_abcdefg123456.mySchema.myTable
  3. After you have entered the name(s) of each table or collection, click Save Integration and Continue
  4. For what type of setup your Kafka Connect cluster is, choose Distributed
  5. Under Step 4: Configure the Rockset Sink Connector you will see a JSON configuration; copy it somewhere safe. You will need the values of the following properties when creating the Streamkap Rockset Destination (an illustrative sketch follows this list):
    1. format
    2. rockset.apiserver.url
    3. rockset.integration.key
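
The exact contents are generated by Rockset, but the result is essentially a standard Kafka Connect sink connector definition. The sketch below is illustrative only; the connector class, topic, URL and key are placeholder values reusing this guide's examples, so always copy the real format, rockset.apiserver.url and rockset.integration.key values from your own Integration Setup page.

```json
{
  "name": "streamkap-rockset-sink",
  "config": {
    "connector.class": "rockset.RocksetSinkConnector",
    "tasks.max": "1",
    "topics": "source_abcdefg123456.mySchema.myTable",
    "format": "AVRO",
    "rockset.apiserver.url": "https://api.<your-region>.rockset.com",
    "rockset.integration.key": "<your integration key>"
  }
}
```

Only the format, rockset.apiserver.url and rockset.integration.key values are carried over into the Streamkap Rockset Destination in the next section.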

🚧

Don't close this Rockset Integration Setup page. You will be coming back to it later on.

Create a Streamkap Pipeline

  1. Add a new Streamkap Rockset Destination and enter the following information:
    1. Name - A memorable name for this Connector
    2. Rockset API URL - The rockset.apiserver.url property value from the Rockset Sink Connector configuration you copied earlier
    3. Rockset Integration Key - The rockset.integration.key property value from the Rockset Sink Connector configuration
    4. Format - The format property value from the Rockset Sink Connector configuration
  2. After you have entered the information, click Save
  3. Add a new Streamkap Pipeline, giving it a memorable name and selecting the Source and the newly created Rockset Destination
  4. Click Next
  5. Choose the schema(s) and table(s) you want to stream to Rockset, then click Save

Complete Rockset Integration Setup

  1. Go back to the Rockset Kafka Integration setup page you left open earlier
  2. Under Step 5: Check if the data is coming through, click Refresh and wait a few moments. Then verify that the status of each listed topic is Active
  3. If the statuses are Active, click Complete Integration Setup
  4. In the top right corner, click Create Collection from Integration, then:
    1. Enter the full name of the table or collection from this Rockset Kafka Integration and click Next
    2. (Optional) If you need to transform the data prior to ingestion into a Rockset collection, enter your transformation SQL query; otherwise, click Next
    3. Choose the Workspace to create the Rockset collection in
    4. Enter a Collection Name and Description (optional) for the Rockset collection
    5. (Optional) Change the Ingest Limit, Retention Policy and Data Compression
  5. After you have entered the information, click Create

You will need to repeat steps 4 and 5 for each table or collection from this Rockset Kafka Integration.

Troubleshooting

Get the Streamkap Source ID

  1. Go to the Sources page and click on the Source name to view its details
  2. In the URL, e.g. app.streamkap.com/sources/abcdefg123456, you will see the Source's ID, e.g. abcdefg123456

Get the Streamkap Topic names

  1. Go to the Sources page and click on the Source name to view its details
  2. At the bottom of the page, the topic names are listed, e.g. mySchema.myTable

The source_ prefix, Source ID and Topic name combined make up the full name required by the Rockset Kafka Integration, e.g. source_abcdefg123456.mySchema.myTable

Unable to write document to Rockset

If you see this error message in the logs for your Rockset Destination, it is most likely because a batch exceeded Rockset's request limits of 10 MiB and 20,000 documents per request.

To resolve this, find and edit your Rockset Destination and reduce its Batch Size. If the error persists, continue reducing the Batch Size until the Destination shows as Active.
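
For a rough sense of scale (assuming, purely for illustration, that your change events average around 1 KiB each once serialized), a batch of 10,000 documents already approaches the 10 MiB limit, so the Batch Size would need to sit well below that; larger documents lower this ceiling further, and no batch can exceed 20,000 documents regardless of document size.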

Reducing lag

If you have to reduce the Batch Size of your Rockset Destination due to Rockset's request limits, lag may increase depending on the size and volume of your data.

To work around this, clone your existing Rockset Destination and create new Pipelines for the topics with the most lag. You can then tune the Batch Size for those topics independently, increasing it as far as the data allows in order to reduce lag.