Prerequisites
- ClickHouse User
- ClickHouse database hostname
Delivery Guarantees
Streamkap offers at-least-once delivery guarantee for ClickHouse and defaults toupsert
mode.
For append
(or insert) mode, duplicate records will be inserted into destination tables, but, ClickHouse materialized views can filter them out.
With upsert
mode (default) we carry out deduplication using the records primary key.
See Inserts/Upserts for more information about ingestion modes.
Data duplication
Only applies to
upsert
configured ClickHouse Destinations and destination tables.FINAL
query modifier has drawbacks , we have seen situations where users SELECT
queries on the destination tables return some duplicate records and the FINAL
query modifier helps avoid that.
Behind-the-scenes, upsert
ingestion mode relies on ClickHouse’s ReplacingMergeTree
table engine to deduplicate data in the background instead of when the data’s inserted or merged (e.g. MERGE INTO
for other data warehouses like Snowflake or DataBricks) into the destination tables.
To save having to specify the FINAL
query modifier in every query, you can consider altering the querying user’s role as shown below:
Data Type Support
All mappings are done on a ‘best effort, nearest equivalent’ basis because many data warehouses and data processing systems have unique data types or share data types but interpret them differently.
Streamkap | ClickHouse |
---|---|
INT8 | Int8 |
INT16 | Int16 |
INT32 | Int32 |
FLOAT32 | Float32 |
FLOAT64 | Float64 |
BOOLEAN | Bool |
BYTES | BLOB (String) |
org.apache.kafka.connect.data.Decimal | DECIMAL(38, 0) |
org.apache.kafka.connect.data.Timestamp io.debezium.time.ZonedTimestamp | DateTime64 |
org.apache.kafka.connect.data.Date | Date |
io.debezium.data.Json | String (JSON) or JSON (if supported by ClickHouse server & JSON Support is enabled) |
STRUCT (e.g. maps, key/value pairs) | Tuple |
ARRAY | Array |
ClickHouse Setup
It’s recommended to create a separate user and role for Streamkap to access your ClickHouse database. Below is an example script that does that.Allowing Streamkap access
IP access lists in ClickHouse determine what services can connect to your ClickHouse database. From your ClickHouse Cloud services list:- Choose the service and then Settings
- Click + Add entry and enter Streamkap IP Addresses
- Enter a description e.g. Streamkap app
- Click Add entry
See Setting IP Filters for more information about ClickHouse IP access lists.
Obtaining connection details
You will need the hostname of your ClickHouse database when setting up the Destination in Streamkap. From your ClickHouse Cloud services list:- Choose the service and then Connect
- At the bottom of the connection window you will see
https://
followed by the hostname like thisabc123.us-west-2.aws.clickhouse.cloud
. The part afterhttps://
(excluding the port:port
) is the hostname
Handling semi-structured data
To support complex data structures such as nested arrays of structs and nested structs containing arrays, the Connector’s user role needs to be configured as shown below:Streamkap Setup
- Go to Destinations and choose ClickHouse
-
Input the following information:
- Name - A unique and memorable name for this Connector
- Hostname - The IP or URL of your ClickHouse instance
- Ingestion Mode (default:
upsert
) - See Inserts/Upserts for information about ingestion modes - Port (default:
8443
) - Username (case sensitive) -
STREAMKAP_USER
or the username you chose - Password - Password for the database user
- Database - The name of the database
- SSL - SSL is enabled by default on ClickHouse instances
- Tasks - If Pipelines for this Destination have lag that’s continuously growing over time and not reducing, increase the number of Tasks, otherwise, leave as default
- JSON Support - Allow JSON data type in ClickHouse, make sure the ClickHouse server supports it.
- Click Save