SELECT * queries.
If your tables aren’t divided into segments (partitioned) and organized into blocks (clustered) based on key column(s), and those key column(s) aren’t referenced in your queries, more data gets processed than necessary.
So, to get the best out of Streamkap and BigQuery, here are some recommendations.
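To illustrate the point about pruning, here is a rough sketch of a table that is both partitioned and clustered, followed by a query that references those key columns. All table and column names are hypothetical:

```sql
-- Hypothetical table: partitioned by day, clustered by customer_id.
CREATE TABLE my_dataset.orders (
  order_id STRING,
  customer_id STRING,
  order_ts TIMESTAMP,
  amount NUMERIC
)
PARTITION BY DATE(order_ts)
CLUSTER BY customer_id;

-- Because this query filters on the partitioning column, BigQuery
-- only scans the matching partitions, and the clustering column
-- narrows the blocks read within them.
SELECT order_id, amount
FROM my_dataset.orders
WHERE DATE(order_ts) BETWEEN '2024-06-01' AND '2024-06-07'
  AND customer_id = 'c-123';
```

Without the WHERE clause (or with SELECT * over the whole table), BigQuery would scan every partition.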
Datasets
Create a separate dataset
Creating a separate dataset for Streamkap helps avoid conflicts with your existing data. For the dataset location, multi-region datasets offer better redundancy than single-region datasets, but can come at the cost of latency and query performance. You can use multi-region datasets with Streamkap if redundancy is more important than performance; otherwise, use single-region datasets. You can set up table snapshots and regular exports on single-region datasets to improve redundancy.

Tables
Set a partition key
Although you can partition by an integer range, partitioning by a time unit (hourly, daily, monthly or yearly) is better. BigQuery is built for analysing data over time, so time-unit partitioning is the ideal approach. It also means you can set an expiration time on the partitions if you don’t need to retain all data indefinitely. If you’re not sure of a suitable date or timestamp in your data to partition by, consider using the Streamkap change event timestamp. We support all time units for partitioning. To decide which is best, please read Google’s article Choose daily, hourly, monthly or yearly partitioning.

Partitioning by change event timestamp and database sources
Our Database Source connectors operate in two modes: snapshot and stream. They rely mainly on the source database’s binary logs to replicate the data. Usually, snapshotting happens only once, when the data is first replicated from source to destination to capture the complete data set; that’s because database binary logs usually don’t contain the complete data set. After snapshotting, the connector enters stream mode, replicating changes to your data as they happen. This behaviour affects the change event timestamp: for records from a snapshot, the change event timestamp is the date and time the snapshot was taken; thereafter, it is the date and time the change to the data was made.

Most recent record state and database sources
Database binary logs usually represent change events (change data capture). Events are Actions, not the Entities the Actions are taken on. For example, a row in your database could describe an Entity and its most recent state: a user’s account, a sales order, an advertising campaign and so on. You might insert, update or delete those records; these are Actions. A change event represents the state of the Entity before and after an Action is taken on it. By default, Streamkap doesn’t replay those Actions from the binary logs to produce the most recent state of the Entity. If the most recent state is important for your data analysis, you can create a View, replacing the { ... } placeholders as required, assuming you’ve partitioned on an appropriate timestamp:
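As a rough sketch of the idea, such a view can keep only the latest change event per primary key using ROW_NUMBER(). The dataset, table and column names below are hypothetical stand-ins for the { ... } placeholders; substitute your own primary key and change event timestamp columns:

```sql
-- Sketch only: keeps the most recent change event per Entity.
-- my_dataset.orders_changes, order_id and change_event_ts are
-- placeholder names, not Streamkap-defined identifiers.
CREATE OR REPLACE VIEW my_dataset.orders_latest AS
SELECT * EXCEPT (row_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY order_id          -- primary key column(s)
      ORDER BY change_event_ts DESC  -- change event timestamp
    ) AS row_num
  FROM my_dataset.orders_changes
)
WHERE row_num = 1;
```

If your change events carry a delete marker, you may also want to filter out rows whose latest Action was a delete.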