Best Practices
How to keep costs low and queries running fast
Keeping costs low and queries running fast in BigQuery depends mostly on how much data your queries process.
For example, the more columns you reference in your queries, the more data gets processed, so it's best to avoid SELECT * queries.
Similarly, if your tables aren't divided into segments (partitioned) and organized into blocks (clustered) based on key column(s), or those key column(s) aren't referenced in your queries, more data gets processed.
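For example, assuming a table partitioned on a timestamp column, a query that references only the columns it needs and filters on the partition column keeps the amount of data processed down. The {dataset}, {table} and column names below are placeholders:
-- Scans only two columns and only the partitions covering the last 7 days
SELECT {primary_key_column}, {timestamp_column}
FROM {dataset}.{table}
WHERE {timestamp_column} >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY);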
So to get the best out of Streamkap and BigQuery here are some recommendations.
Datasets
Create a separate dataset
Creating a separate dataset for Streamkap helps avoid conflicts with your existing data.
For the dataset location, multi-region datasets offer better redundancy than single-region datasets but can come at the cost of latency and query performance.
You can use multi-region datasets with Streamkap if redundancy is more important than performance; otherwise, use single-region datasets. You can set up table snapshots and regular exports on single-region datasets to improve redundancy.
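As a sketch, a dedicated single-region dataset for Streamkap could be created like this, with {dataset} and the location as placeholders to replace:
CREATE SCHEMA {dataset}
OPTIONS (
  location = 'us-east1', -- pick the single region closest to you
  description = 'Tables replicated by Streamkap'
);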
Tables
Set a partition key
Although you can partition by an integer range, partitioning by a time unit (hourly, daily, monthly or yearly) is better. BigQuery is built for analysing data over time, so time-unit partitioning is the ideal approach. It also means you can set an expiration time on the partitions if you don't need to retain all data indefinitely.
If you're not sure of a suitable date or timestamp from your data to partition by, consider using the Streamkap change event timestamp.
We support all time units for partitioning. To decide which is best, please read Google's Choose Daily, Hourly, Monthly or Yearly Partitioning article.
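As an illustration, a table with daily partitions on its timestamp column and an optional 90-day partition expiry could be declared like this - the column names, types and expiry below are placeholders to adjust:
-- Daily partitions on the timestamp column, dropped automatically after 90 days
CREATE TABLE {dataset}.{table} (
  {primary_key_column} INT64,
  {timestamp_column} TIMESTAMP,
  __deleted STRING
)
PARTITION BY DATE({timestamp_column})
OPTIONS (partition_expiration_days = 90);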
Partitioning by change event timestamp and database sources
Our Database Source connectors operate in two modes: snapshot and stream. They also rely mainly on the source database's binary logs to replicate the data.
Usually snapshotting only happens once: when the data is first replicated from source to destination to get the complete data set. That's because database binary logs usually don't contain the complete data set. After snapshotting, the connector enters stream mode, replicating changes to your data as they happen.
This behaviour affects the change event timestamp. A snapshot's change event timestamp will be the date and time the snapshot was taken. Thereafter, it is the date and time the change was made.
Most recent record state and database sources
Database binary logs usually represent change events (change data capture). Events are Actions, not the Entities the Actions are taken on. For example, a row in your database could describe an Entity and its most recent state, like a user's account, a sales order, an advertising campaign and so on. You might insert, update or delete those records. These are Actions.
The change event represents the state of the Entity before and after an Action is taken on it.
By default, Streamkap doesn't replay those Actions from the binary logs to produce the most recent state of the Entity. If the most recent state is important for your data analysis, you can create a View, replacing the { ... }
placeholders as required - assuming you've partitioned on an appropriate timestamp:
CREATE VIEW {dataset}.{view}
OPTIONS(description="Latest record version by id excluding deleted records")
AS SELECT * EXCEPT(dedupe_id)
FROM (
SELECT *,
ROW_NUMBER() OVER
(PARTITION BY {primary_key_column, ...}
ORDER BY {timestamp_column} DESC) AS dedupe_id
FROM {dataset}.{table})
WHERE dedupe_id = 1 -- Latest record
AND __deleted = 'false'; -- Excluding deleted records
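With that View in place, querying the most recent state of an Entity is straightforward - {value} below is a placeholder for the key you're looking up:
-- Current state of one Entity, ignoring superseded change events and deletes
SELECT *
FROM {dataset}.{view}
WHERE {primary_key_column} = {value};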
There is a caveat though: the table still contains all the change events that are not the most recent.
Remember that BigQuery cost and performance mostly depend on how much data your queries process. Yes, partitions help reduce cost (assuming your queries use the partition key), but you may want to delete the older unused records to ensure your queries aren't processing them.
Clean Up Old Records
To delete older records - assuming there's an appropriate timestamp such as the change event timestamp - you can schedule a query to remove them.
Here's an example deletion query:
DELETE
FROM {dataset}.{table}
WHERE
STRUCT({primary_key_column, ...}, {timestamp_column})
NOT IN (SELECT AS STRUCT {primary_key_column, ...},
MAX({timestamp_column})
FROM {dataset}.{table}
GROUP BY {primary_key_column, ...})
AND TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), {timestamp_column}, MINUTE) > 90; -- records in the streaming buffer can't be modified for up to 90 minutes
Set a cluster key
This can be the same as your partition key, depending on the data. It's best to cluster by one or more columns that you commonly use in aggregations or as filters and that have a high cardinality (many distinct values).
If you're not sure of suitable column(s) to use as a cluster key, consider using your table's primary key column(s).
To decide if your partition key is a suitable cluster key, please read Google's Partitioning versus Clustering article.
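As a sketch, combining the earlier partitioning example with clustering on the primary key column(s) could look like this:
-- Daily partitions plus clustering on the primary key column(s)
CREATE TABLE {dataset}.{table} (
  {primary_key_column} INT64,
  {timestamp_column} TIMESTAMP,
  __deleted STRING
)
PARTITION BY DATE({timestamp_column})
CLUSTER BY {primary_key_column};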