Databricks Delta Lake

Setup

Get connection details

For the Cluster JDBC URL:

  1. Open the Compute page from the sidebar and choose your cluster
  2. Click Advanced Options
  3. Open the JDBC/ODBC tab
  4. Copy the JDBC Connection URL
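
For reference, the copied URL typically has the shape sketched below. Every value in it is a placeholder, and the exact scheme and parameters depend on the JDBC driver version (newer Databricks drivers use jdbc:databricks://, the legacy Simba Spark driver uses jdbc:spark://):

```python
# Illustrative only: the typical shape of a cluster JDBC connection URL.
# Every value is a placeholder; copy the real URL from the JDBC/ODBC tab.
JDBC_URL = (
    "jdbc:databricks://dbc-a1b2c3d4-e5f6.cloud.databricks.com:443/default;"
    "transportMode=http;ssl=1;"
    "httpPath=sql/protocolv1/o/1234567890123456/0123-456789-abcdefgh;"
    "AuthMech=3;UID=token;PWD=<personal-access-token>"
)
```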

For the SQL Warehouse Endpoint:

  1. Open the SQL Warehouses page from the sidebar and choose your warehouse
  2. Open the Connection Details tab
  3. Copy the SQL Warehouse Endpoint
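
The Connection Details tab breaks the endpoint into a server hostname, port, and HTTP path, which is the form most clients expect. The values below are placeholders showing the usual shape:

```python
# Illustrative only: connection details for a SQL warehouse.
# Placeholders; copy your own values from the Connection Details tab.
SERVER_HOSTNAME = "dbc-a1b2c3d4-e5f6.cloud.databricks.com"
PORT = 443
HTTP_PATH = "/sql/1.0/warehouses/1234567890abcdef"
```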

Generate an access token

  1. Open the Settings page from the sidebar, then select User Settings
  2. Open the Personal Access Tokens tab
  3. Click + Generate New Token
  4. (Optional) Enter a comment and change the token lifetime
  5. Click Generate
  6. Copy the access token
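
Before wiring the token into the connector, it is worth a quick connectivity check. The sketch below assumes the open-source databricks-sql-connector package (pip install databricks-sql-connector) and reuses the placeholder warehouse details from the previous step:

```python
# Sanity check: run a trivial query against the SQL warehouse with the token.
# Hostname, HTTP path, and token are placeholders.
from databricks import sql

with sql.connect(
    server_hostname="dbc-a1b2c3d4-e5f6.cloud.databricks.com",  # from Connection Details
    http_path="/sql/1.0/warehouses/1234567890abcdef",          # from Connection Details
    access_token="dapiXXXXXXXXXXXXXXXXXXXXXXXX",               # the token generated above
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT 1")
        print(cursor.fetchone())  # prints (1,) when the endpoint and token are valid
```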

Create a temporary directory

  1. Create a tmp directory on the Databricks File System (DBFS), for example via the DBFS REST API as sketched below
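
A minimal sketch using the DBFS REST API's mkdirs endpoint; the workspace URL and token are placeholders, and the requests package is assumed:

```python
# Sketch: create the tmp directory on DBFS via POST /api/2.0/dbfs/mkdirs.
# Workspace URL and token are placeholders.
import requests

WORKSPACE_URL = "https://dbc-a1b2c3d4-e5f6.cloud.databricks.com"
TOKEN = "dapiXXXXXXXXXXXXXXXXXXXXXXXX"

response = requests.post(
    f"{WORKSPACE_URL}/api/2.0/dbfs/mkdirs",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"path": "/tmp"},
)
response.raise_for_status()  # the call is idempotent, so an existing /tmp is fine
```

The Databricks CLI exposes the same operation as databricks fs mkdirs dbfs:/tmp.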

How it works

As data is streamed from the source into topics (think of them as partitioned tables), the Databricks Sink connector will:

  • Check whether tables for the topics exist in Databricks, and create them if they do not
  • Detect changes between the source data schema and the target table schema, and:
    • if a new column (by name) is found, append it to the end of the table
    • if an existing column's data type has changed, append a new column named <column_name>_<new_data_type_name> to the end of the table
  • Stream change data into Parquet files, upload them to the tmp directory on the Databricks File System (DBFS), and (as sketched after this list):
    • Load the data into the target table using the SQL bulk-import command COPY INTO
    • Clean up the Parquet files
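
To make that cycle concrete, here is a hand-rolled sketch of one load iteration. It is not the connector's code: the table name, file name, and credentials are invented, the schema-evolution step is skipped, and it assumes pyarrow, requests, and databricks-sql-connector are installed:

```python
# Sketch of one load cycle: write a Parquet batch, upload it to DBFS,
# COPY it into the target table, then delete the temporary file.
# All names and credentials are placeholders; not the connector's actual code.
import base64

import pyarrow as pa
import pyarrow.parquet as pq
import requests
from databricks import sql

WORKSPACE_URL = "https://dbc-a1b2c3d4-e5f6.cloud.databricks.com"
HTTP_PATH = "/sql/1.0/warehouses/1234567890abcdef"
TOKEN = "dapiXXXXXXXXXXXXXXXXXXXXXXXX"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# 1. Write a batch of change records to a local Parquet file.
batch = pa.table({"id": [1, 2], "name": ["alice", "bob"]})
pq.write_table(batch, "batch-0001.parquet")

# 2. Upload the file to the tmp directory on DBFS (POST /api/2.0/dbfs/put).
#    The inline `contents` field suits small files only.
with open("batch-0001.parquet", "rb") as f:
    contents = base64.b64encode(f.read()).decode("ascii")
requests.post(
    f"{WORKSPACE_URL}/api/2.0/dbfs/put",
    headers=HEADERS,
    json={"path": "/tmp/batch-0001.parquet", "contents": contents, "overwrite": True},
).raise_for_status()

# 3. Bulk-load the file into the target table (which the connector would
#    already have created) with COPY INTO.
with sql.connect(
    server_hostname="dbc-a1b2c3d4-e5f6.cloud.databricks.com",
    http_path=HTTP_PATH,
    access_token=TOKEN,
) as connection:
    with connection.cursor() as cursor:
        cursor.execute(
            """
            COPY INTO my_topic_table
            FROM 'dbfs:/tmp/batch-0001.parquet'
            FILEFORMAT = PARQUET
            """
        )

# 4. Clean up the temporary Parquet file (POST /api/2.0/dbfs/delete).
requests.post(
    f"{WORKSPACE_URL}/api/2.0/dbfs/delete",
    headers=HEADERS,
    json={"path": "/tmp/batch-0001.parquet"},
).raise_for_status()
```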