Rockset
Stream Change Data Capture (CDC) data into Rockset
Prerequisites
- A Rockset Medium virtual instance or larger
- A Rockset account granted at least the `member` built-in role, or a custom role that includes the following privileges:
  - `CREATE_COLLECTION_INTEGRATION`
  - `CREATE_INTEGRATION_GLOBAL`
  - `UPDATE_VI`
Limitations
- You will need to switch between Rockset and Streamkap during setup
- Existing Streamkap MongoDB and DocumentDB Sources are incompatible with Rockset Destinations. You will need to create new MongoDB and DocumentDB Sources and set the Include Schema? option to No
- Rockset doesn't automatically create Collections from its Integrations. You will need to manually create Collections for each Source table (MySQL, PostgreSQL, SQL Server, etc.) or collection (MongoDB, DocumentDB)
- Rockset only captures data from the point in time at which the Integration was set up. After the Streamkap Pipeline is created and the Rockset Integration Setup is completed, you can trigger a snapshot to backfill the historic data
- The Rockset Destination's Batch Size has to be manually adjusted to ensure data ingestion does not hit Rockset's request limits, which are capped at 10 MiB and 20,000 documents per request
Setup
Create a Rockset Integration
- Add a new Rockset Kafka Integration and click Start
- Give the Integration a memorable Integration Name and Description (optional)
- For where your Kafka cluster is hosted, choose Apache Kafka
Before you continue
Make sure that for the Source whose data you want to stream to Rockset you have the:
- Streamkap Source ID
- Streamkap Topic name(s) for the tables (MySQL, PostgreSQL, SQL Server, etc) or collections (Mongo, DocumentDB)
Together with the `source_` prefix, they make up the full names of the topics to be streamed to Rockset, e.g. `source_abcdefg123456.mySchema.myTable`
If you're not sure how to get these, please see the Troubleshooting section
- Choose Data Format based on the Source type:
  - JSON: if your Source is MongoDB or DocumentDB
  - AVRO: for all other Sources
- For Kafka Topics, enter the full name - including the Source's Streamkap ID - of each table or collection, e.g. if your Source's ID is `abcdefg123456` and the Streamkap Topic's name is `mySchema.myTable`, the full name would be `source_abcdefg123456.mySchema.myTable`
- After you have entered the name(s) of each table or collection, click Save Integration and Continue
- For what type of setup your Kafka Connect cluster is, choose Distributed
- Under Step 4: Configure the Rockset Sink Connector you will see a JSON configuration. Copy and paste it somewhere safe. You will need the values of these properties for creating a Streamkap Pipeline (an illustrative sketch follows below):
  - `format`
  - `rockset.apiserver.url`
  - `rockset.integration.key`
Don't close this Rockset Integration Setup page. You will be coming back to it later on.
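For reference, the configuration Rockset shows will look something like the sketch below. All values here are placeholders, and any property names beyond the three called out above are illustrative assumptions; always copy the real values from your own Integration Setup page:

```json
{
  "name": "streamkap-rockset-sink",
  "config": {
    "connector.class": "rockset.RocksetSinkConnector",
    "topics": "source_abcdefg123456.mySchema.myTable",
    "format": "AVRO",
    "rockset.apiserver.url": "https://api.usw2a1.rockset.com",
    "rockset.integration.key": "kafka://<your-integration-key>@api.usw2a1.rockset.com"
  }
}
```

The `format`, `rockset.apiserver.url` and `rockset.integration.key` values are what you will enter into the Streamkap Rockset Destination in the next section.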
Create a Streamkap Pipeline
- Add a new Streamkap Rockset Destination and enter the following information:
  - Name - A memorable name for this Connector
  - Rockset API URL - The `rockset.apiserver.url` property value from step 9
  - Rockset Integration Key - The `rockset.integration.key` property value from step 9
  - Format - The `format` property value from step 9
- After you have entered the information, click Save
- Add a new Streamkap Pipeline, giving it a memorable name and selecting the Source and the newly created Rockset Destination
- Click Next
- Choose the schema(s) and table(s) you want to stream to Rockset, then click Save
Complete Rockset Integration Setup
- Go back to the new Rockset Kafka Integration you are creating
- Under Step 5: Check if the data is coming through click Refresh and wait a few moments. Then, verify that for each topic listed, the status is Active
- If the statuses are Active, click Complete Integration Setup
- In the top right corner, click Create Collection from Integration, then:
- Enter the full name of the table or collection from this Rockset Kafka Integration and click Next
- (Optional) If you need to transform the data prior to ingestion into a Rockset collection, enter your transformation SQL query (see the example sketched after these steps); otherwise, click Next
- Choose the Workspace to create the Rockset collection in
- Enter a Collection Name and Description (optional) for the Rockset collection
- (Optional): Change the Ingest Limit, Retention Policy and Data Compression
- After you have entered the information, click Create
You will need to repeat steps 4 and 5 for each table or collection from this Rockset Kafka Integration.
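As a minimal sketch of what such a transformation query can look like: Rockset ingest transformations select from the reserved `_input` relation, which represents the incoming documents. The `status` field below is a hypothetical example, not a field your Source necessarily emits:

```sql
-- Filter out soft-deleted rows before they are ingested into the collection.
-- `_input` is Rockset's reserved name for the incoming documents;
-- `status` is a hypothetical example field.
SELECT
  *
FROM
  _input
WHERE
  _input.status != 'deleted'
```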
Troubleshooting
Get the Streamkap Source ID
- Go to the Sources page and click on the Source name to view its details
- In the URL, e.g. `app.streamkap.com/sources/abcdefg123456`, you will see the Source's ID, e.g. `abcdefg123456`
Get the Streamkap Topic names
- Go to the Sources page and click on the Source name to view its details
- At the bottom of the page, the topic names are listed, e.g. `mySchema.myTable`
The `source_` prefix, Source ID and Topic name combined make up the full name required by the Rockset Kafka Integration, e.g. `source_abcdefg123456.mySchema.myTable`
Unable to write document to Rockset
If you see this error message in the logs for your Rockset Destination, it is most likely due to Rockset's request limits which are capped at 10 MiB and 20,000 documents per request.
To resolve this, find and edit your Rockset Destination and reduce the Batch Size. If the error message persists, continue to reduce the Batch Size until the Destination shows as Active.
Reducing lag
If you have to reduce the Batch Size of your Rockset Destination due to Rockset's request limits, lag may increase depending on the size and volume of your data.
To work around this, clone your existing Rockset Destination and create new Pipelines for the topics with the most lag. You can then control the Batch Size for these topics independently, increasing it as far as the request limits allow to reduce lag.